* [PATCH 0/9] [RFC] pageout work and dirty reclaim throttling
@ 2012-02-28 14:00 ` Fengguang Wu
  0 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linux Memory Management List, Fengguang Wu, LKML

Andrew,

This series aims to improve two major page reclaim problems:

a) pageout I/O efficiency, by sending pageout work to the flusher threads
b) interactive performance, by selectively throttling the writing tasks

when under heavy pressure from dirty/writeback pages. The test results for (a)
and (b) look promising and are included in patches 6 and 9.

However there are still two open problems.

1) ext4 "hung task" problem, as put by Jan Kara:

: We enter memcg reclaim from grab_cache_page_write_begin() and are
: waiting in reclaim_wait(). Because grab_cache_page_write_begin() is
: called with transaction started, this blocks transaction from
: committing and subsequently blocks all other activity on the
: filesystem. The fact is this isn't new with your patches, just your
: changes or the fact that we are running in a memory constrained cgroup
: make this more visible.

2) the pageout work may be deferred by sync work

As with (1), there is no obvious good way out. The closest fix may be to
service some pageout works each time the other work finishes with one inode.
The problem is that the sync work does not limit its chunk size at all, so it's
possible for sync to work on one inode for a minute before giving the pageout
works a chance...

Due to problems (1) and (2), this is still not a complete solution. For ease of
debugging, several trace_printk() and debugfs interfaces are included for now.

 [PATCH 1/9] memcg: add page_cgroup flags for dirty page tracking
 [PATCH 2/9] memcg: add dirty page accounting infrastructure
 [PATCH 3/9] memcg: add kernel calls for memcg dirty page stats
 [PATCH 4/9] memcg: dirty page accounting support routines
 [PATCH 5/9] writeback: introduce the pageout work
 [PATCH 6/9] vmscan: dirty reclaim throttling
 [PATCH 7/9] mm: pass __GFP_WRITE to memcg charge and reclaim routines
 [PATCH 8/9] mm: dont set __GFP_WRITE on ramfs/sysfs writes                                                  
 [PATCH 9/9] mm: debug vmscan waits

 fs/fs-writeback.c                |  230 +++++++++++++++++++++-
 fs/nfs/write.c                   |    4 
 fs/super.c                       |    1 
 include/linux/backing-dev.h      |    2 
 include/linux/gfp.h              |    2 
 include/linux/memcontrol.h       |   13 +
 include/linux/mmzone.h           |    1 
 include/linux/page_cgroup.h      |   23 ++
 include/linux/sched.h            |    1 
 include/linux/writeback.h        |   18 +
 include/trace/events/vmscan.h    |   68 ++++++
 include/trace/events/writeback.h |   12 -
 mm/backing-dev.c                 |   10 
 mm/filemap.c                     |   20 +
 mm/internal.h                    |    7 
 mm/memcontrol.c                  |  199 ++++++++++++++++++-
 mm/migrate.c                     |    3 
 mm/page-writeback.c              |    6 
 mm/page_alloc.c                  |    1 
 mm/swap.c                        |    4 
 mm/truncate.c                    |    1 
 mm/vmscan.c                      |  298 ++++++++++++++++++++++++++---
 22 files changed, 864 insertions(+), 60 deletions(-)

Thanks,
Fengguang


* [PATCH 1/9] memcg: add page_cgroup flags for dirty page tracking
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Righi, Minchan Kim, Fengguang Wu,
	Linux Memory Management List, LKML

[-- Attachment #1: memcg-add-page_cgroup-flags-for-dirty-page-tracking.patch --]
[-- Type: text/plain, Size: 2301 bytes --]

From: Greg Thelen <gthelen@google.com>

Add additional flags to page_cgroup to track dirty pages
within a mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Greg Thelen <gthelen@google.com>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

--- linux.orig/include/linux/page_cgroup.h	2012-02-19 10:53:14.000000000 +0800
+++ linux/include/linux/page_cgroup.h	2012-02-19 10:53:16.000000000 +0800
@@ -10,6 +10,9 @@ enum {
 	/* flags for mem_cgroup and file and I/O status */
 	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
 	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is under writeback */
+	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
 	__NR_PCG_FLAGS,
 };
 
@@ -64,6 +67,10 @@ static inline void ClearPageCgroup##unam
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)			\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
+	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
+
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 CLEARPCGFLAG(Cache, CACHE)
@@ -77,6 +84,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 TESTPCGFLAG(FileMapped, FILE_MAPPED)
 
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
+TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
+
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 SETPCGFLAG(Migration, MIGRATION)
 CLEARPCGFLAG(Migration, MIGRATION)
 TESTPCGFLAG(Migration, MIGRATION)
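
For readers less familiar with the PCGFLAG macro family, the sketch below (an
illustrative expansion, not part of the patch) shows roughly what the new
TESTSETPCGFLAG(FileDirty, FILE_DIRTY) and TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
invocations generate; the test-and-set/test-and-clear variants are what allow
the accounting code to charge or uncharge a memcg only once per page state
change:

	/* Approximate expansion of TESTSETPCGFLAG(FileDirty, FILE_DIRTY) */
	static inline int TestSetPageCgroupFileDirty(struct page_cgroup *pc)
	{
		/* atomically set PCG_FILE_DIRTY, return its previous value */
		return test_and_set_bit(PCG_FILE_DIRTY, &pc->flags);
	}

	/* Approximate expansion of TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY) */
	static inline int TestClearPageCgroupFileDirty(struct page_cgroup *pc)
	{
		/* atomically clear PCG_FILE_DIRTY, return its previous value */
		return test_and_clear_bit(PCG_FILE_DIRTY, &pc->flags);
	}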



* [PATCH 2/9] memcg: add dirty page accounting infrastructure
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Righi, Fengguang Wu,
	Linux Memory Management List, LKML

[-- Attachment #1: memcg-add-dirty-page-accounting-infrastructure.patch --]
[-- Type: text/plain, Size: 6537 bytes --]

From: Greg Thelen <gthelen@google.com>

Add memcg routines to count dirty, writeback, and unstable_NFS pages.
These routines are not yet used by the kernel to count such pages.  A
later change adds kernel calls to these new routines.

As inode pages are marked dirty, if the dirtied page's cgroup differs
from the inode's cgroup, then the inode is marked as shared across several
cgroups.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
Changelog since v8:
- In v8 this patch was applied after 'memcg: add mem_cgroup_mark_inode_dirty()'.
  In this version (v9), this patch comes first.  The result is that this patch
  does not contain code to mark inode with I_MEMCG_SHARED.  That logic is
  deferred until the later 'memcg: add mem_cgroup_mark_inode_dirty()' patch.

Fengguang: "unstable_nfs" seems a more consistent name?

 include/linux/memcontrol.h |    8 ++-
 mm/memcontrol.c            |   87 +++++++++++++++++++++++++++++++----
 2 files changed, 86 insertions(+), 9 deletions(-)

--- linux.orig/include/linux/memcontrol.h	2012-02-19 11:29:59.000000000 +0800
+++ linux/include/linux/memcontrol.h	2012-02-19 11:30:05.000000000 +0800
@@ -27,9 +27,15 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
-/* Stats that can be updated by kernel. */
+/*
+ * Per mem_cgroup page counts tracked by kernel.  As pages enter and leave these
+ * states, the kernel notifies memcg using mem_cgroup_{inc,dec}_page_stat().
+ */
 enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_MAPPED, /* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
+	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
 };
 
 struct mem_cgroup_reclaim_cookie {
--- linux.orig/mm/memcontrol.c	2012-02-19 11:29:59.000000000 +0800
+++ linux/mm/memcontrol.c	2012-02-19 11:30:25.000000000 +0800
@@ -86,8 +86,11 @@ enum mem_cgroup_stat_index {
 	 */
 	MEM_CGROUP_STAT_CACHE, 	   /* # of pages charged as cache */
 	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
-	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
+	MEM_CGROUP_STAT_FILE_DIRTY,	/* # of dirty pages in page cache */
+	MEM_CGROUP_STAT_FILE_WRITEBACK,		/* # of pages under writeback */
+	MEM_CGROUP_STAT_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
 	MEM_CGROUP_STAT_DATA, /* end of data requires synchronization */
 	MEM_CGROUP_ON_MOVE,	/* someone is moving account between groups */
 	MEM_CGROUP_STAT_NSTATS,
@@ -1885,6 +1888,44 @@ void mem_cgroup_update_page_stat(struct 
 			ClearPageCgroupFileMapped(pc);
 		idx = MEM_CGROUP_STAT_FILE_MAPPED;
 		break;
+
+	case MEMCG_NR_FILE_DIRTY:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileDirty(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileDirty(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_DIRTY;
+		break;
+
+	case MEMCG_NR_FILE_WRITEBACK:
+		/*
+		 * This counter is adjusted while holding the mapping's
+		 * tree_lock.  Therefore there is no race between settings and
+		 * clearing of this flag.
+		 */
+		if (val > 0)
+			SetPageCgroupFileWriteback(pc);
+		else
+			ClearPageCgroupFileWriteback(pc);
+		idx = MEM_CGROUP_STAT_FILE_WRITEBACK;
+		break;
+
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		/* Use Test{Set,Clear} to only un/charge the memcg once. */
+		if (val > 0) {
+			if (TestSetPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		} else {
+			if (!TestClearPageCgroupFileUnstableNFS(pc))
+				val = 0;
+		}
+		idx = MEM_CGROUP_STAT_FILE_UNSTABLE_NFS;
+		break;
+
 	default:
 		BUG();
 	}
@@ -2481,6 +2522,17 @@ void mem_cgroup_split_huge_fixup(struct 
 }
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+				       struct mem_cgroup *to,
+				       enum mem_cgroup_stat_index idx)
+{
+	preempt_disable();
+	__this_cpu_dec(from->stat->count[idx]);
+	__this_cpu_inc(to->stat->count[idx]);
+	preempt_enable();
+}
+
 /**
  * mem_cgroup_move_account - move account of the page
  * @page: the page
@@ -2529,13 +2581,18 @@ static int mem_cgroup_move_account(struc
 
 	move_lock_page_cgroup(pc, &flags);
 
-	if (PageCgroupFileMapped(pc)) {
-		/* Update mapped_file data for mem_cgroup */
-		preempt_disable();
-		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
-		preempt_enable();
-	}
+	if (PageCgroupFileMapped(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_MAPPED);
+	if (PageCgroupFileDirty(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+						  MEM_CGROUP_STAT_FILE_DIRTY);
+	if (PageCgroupFileWriteback(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_WRITEBACK);
+	if (PageCgroupFileUnstableNFS(pc))
+		mem_cgroup_move_account_page_stat(from, to,
+					MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
 	mem_cgroup_charge_statistics(from, PageCgroupCache(pc), -nr_pages);
 	if (uncharge)
 		/* This is not "cancel", but cancel_charge does all we need. */
@@ -3994,6 +4051,9 @@ enum {
 	MCS_SWAP,
 	MCS_PGFAULT,
 	MCS_PGMAJFAULT,
+	MCS_FILE_DIRTY,
+	MCS_WRITEBACK,
+	MCS_UNSTABLE_NFS,
 	MCS_INACTIVE_ANON,
 	MCS_ACTIVE_ANON,
 	MCS_INACTIVE_FILE,
@@ -4018,6 +4078,9 @@ struct {
 	{"swap", "total_swap"},
 	{"pgfault", "total_pgfault"},
 	{"pgmajfault", "total_pgmajfault"},
+	{"dirty", "total_dirty"},
+	{"writeback", "total_writeback"},
+	{"nfs_unstable", "total_nfs_unstable"},
 	{"inactive_anon", "total_inactive_anon"},
 	{"active_anon", "total_active_anon"},
 	{"inactive_file", "total_inactive_file"},
@@ -4051,6 +4114,14 @@ mem_cgroup_get_local_stat(struct mem_cgr
 	val = mem_cgroup_read_events(memcg, MEM_CGROUP_EVENTS_PGMAJFAULT);
 	s->stat[MCS_PGMAJFAULT] += val;
 
+	/* dirty stat */
+	val = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
+	s->stat[MCS_FILE_DIRTY] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_WRITEBACK);
+	s->stat[MCS_WRITEBACK] += val * PAGE_SIZE;
+	val = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+	s->stat[MCS_UNSTABLE_NFS] += val * PAGE_SIZE;
+
 	/* per zone stat */
 	val = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
 	s->stat[MCS_INACTIVE_ANON] += val * PAGE_SIZE;
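
A later patch in this series wires these routines into the writeback paths. As
a minimal sketch of the intended caller pattern (the helper name here is
hypothetical, for illustration only), the memcg counter is bumped alongside the
existing zone and bdi counters when a page becomes dirty:

	/* Hypothetical sketch of the caller pattern added by a later patch */
	static void sketch_account_page_dirtied(struct page *page,
						struct address_space *mapping)
	{
		if (mapping_cap_account_dirty(mapping)) {
			mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
			__inc_zone_page_state(page, NR_FILE_DIRTY);
			__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
		}
	}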



* [PATCH 3/9] memcg: add kernel calls for memcg dirty page stats
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Righi, Daisuke Nishimura, Minchan Kim,
	Fengguang Wu, Linux Memory Management List, LKML

[-- Attachment #1: memcg-add-kernel-calls-for-memcg-dirty-page-stats.patch --]
[-- Type: text/plain, Size: 4355 bytes --]

From: Greg Thelen <gthelen@google.com>

Add calls into memcg dirty page accounting.  Notify memcg when pages
transition between clean, file dirty, writeback, and unstable nfs.  This
allows the memory controller to maintain an accurate view of the amount
of its memory that is dirty.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <andrea@betterlinux.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/nfs/write.c      |    4 ++++
 mm/filemap.c        |    1 +
 mm/page-writeback.c |    4 ++++
 mm/truncate.c       |    1 +
 4 files changed, 10 insertions(+)

--- linux.orig/fs/nfs/write.c	2012-02-19 10:53:14.000000000 +0800
+++ linux/fs/nfs/write.c	2012-02-19 10:53:21.000000000 +0800
@@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page 
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
 	pnfs_mark_request_commit(req, lseg);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -460,6 +461,7 @@ nfs_clear_request_commit(struct nfs_page
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1408,6 +1410,8 @@ void nfs_retry_commit(struct list_head *
 		req = nfs_list_entry(page_list->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req, lseg);
+		mem_cgroup_dec_page_stat(req->wb_page,
+					 MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 			     BDI_RECLAIMABLE);
--- linux.orig/mm/filemap.c	2012-02-19 10:53:14.000000000 +0800
+++ linux/mm/filemap.c	2012-02-19 10:53:21.000000000 +0800
@@ -142,6 +142,7 @@ void __delete_from_page_cache(struct pag
 	 * having removed the page entirely.
 	 */
 	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		dec_zone_page_state(page, NR_FILE_DIRTY);
 		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
 	}
--- linux.orig/mm/page-writeback.c	2012-02-19 10:53:14.000000000 +0800
+++ linux/mm/page-writeback.c	2012-02-19 10:53:21.000000000 +0800
@@ -1933,6 +1933,7 @@ int __set_page_dirty_no_writeback(struct
 void account_page_dirtied(struct page *page, struct address_space *mapping)
 {
 	if (mapping_cap_account_dirty(mapping)) {
+		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_FILE_DIRTY);
 		__inc_zone_page_state(page, NR_DIRTIED);
 		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -1951,6 +1952,7 @@ EXPORT_SYMBOL(account_page_dirtied);
  */
 void account_page_writeback(struct page *page)
 {
+	mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 	inc_zone_page_state(page, NR_WRITEBACK);
 }
 EXPORT_SYMBOL(account_page_writeback);
@@ -2152,6 +2154,7 @@ int clear_page_dirty_for_io(struct page 
 		 * for more comments.
 		 */
 		if (TestClearPageDirty(page)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
@@ -2188,6 +2191,7 @@ int test_clear_page_writeback(struct pag
 		ret = TestClearPageWriteback(page);
 	}
 	if (ret) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_WRITEBACK);
 		dec_zone_page_state(page, NR_WRITEBACK);
 		inc_zone_page_state(page, NR_WRITTEN);
 	}
--- linux.orig/mm/truncate.c	2012-02-19 10:53:14.000000000 +0800
+++ linux/mm/truncate.c	2012-02-19 10:53:21.000000000 +0800
@@ -76,6 +76,7 @@ void cancel_dirty_page(struct page *page
 	if (TestClearPageDirty(page)) {
 		struct address_space *mapping = page->mapping;
 		if (mapping && mapping_cap_account_dirty(mapping)) {
+			mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);
 			dec_zone_page_state(page, NR_FILE_DIRTY);
 			dec_bdi_stat(mapping->backing_dev_info,
 					BDI_RECLAIMABLE);
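
As a quick reference (a summary sketch, not part of the patch), the hooks added
above pair up with page state transitions roughly as follows:

	/*
	 * account_page_dirtied()        page becomes dirty        inc MEMCG_NR_FILE_DIRTY
	 * clear_page_dirty_for_io()     dirty bit cleared for IO  dec MEMCG_NR_FILE_DIRTY
	 * cancel_dirty_page(),
	 * __delete_from_page_cache()    dirty page dropped        dec MEMCG_NR_FILE_DIRTY
	 * account_page_writeback()      writeback starts          inc MEMCG_NR_FILE_WRITEBACK
	 * test_clear_page_writeback()   writeback completes       dec MEMCG_NR_FILE_WRITEBACK
	 * nfs_mark_request_commit()     page becomes unstable     inc MEMCG_NR_FILE_UNSTABLE_NFS
	 * nfs_clear_request_commit()    unstable state cleared    dec MEMCG_NR_FILE_UNSTABLE_NFS
	 */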



* [PATCH 4/9] memcg: dirty page accounting support routines
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Fengguang Wu, Linux Memory Management List, LKML

[-- Attachment #1: memcg-dirty-page-accounting-support-routines.patch --]
[-- Type: text/plain, Size: 4744 bytes --]

From: Greg Thelen <gthelen@google.com>

Added memcg dirty page accounting support routines.  These routines are
used by later changes to provide memcg-aware writeback and dirty page
limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
allow for easier understanding of memcg writeback operation.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
Changelog since v8:
- Use 'memcg' rather than 'mem' for local variables and parameters.
  This is consistent with other memory controller code.

 include/linux/memcontrol.h |    5 +
 mm/memcontrol.c            |  112 +++++++++++++++++++++++++++++++++++
 2 files changed, 117 insertions(+)

--- linux.orig/include/linux/memcontrol.h	2012-02-25 20:48:34.337580646 +0800
+++ linux/include/linux/memcontrol.h	2012-02-25 20:48:34.361580646 +0800
@@ -36,8 +36,13 @@ enum mem_cgroup_page_stat_item {
 	MEMCG_NR_FILE_DIRTY, /* # of dirty pages in page cache */
 	MEMCG_NR_FILE_WRITEBACK, /* # of pages under writeback */
 	MEMCG_NR_FILE_UNSTABLE_NFS, /* # of NFS unstable pages */
+	MEMCG_NR_DIRTYABLE_PAGES, /* # of pages that could be dirty */
 };
 
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *memcg,
+				   enum mem_cgroup_page_stat_item item);
+unsigned long mem_cgroup_dirty_pages(struct mem_cgroup *memcg);
+
 struct mem_cgroup_reclaim_cookie {
 	struct zone *zone;
 	int priority;
--- linux.orig/mm/memcontrol.c	2012-02-25 20:48:34.337580646 +0800
+++ linux/mm/memcontrol.c	2012-02-25 21:09:54.073554384 +0800
@@ -1255,6 +1255,118 @@ int mem_cgroup_swappiness(struct mem_cgr
 	return memcg->swappiness;
 }
 
+static inline bool mem_cgroup_can_swap(struct mem_cgroup *memcg)
+{
+	if (nr_swap_pages == 0)
+		return false;
+	if (!do_swap_account)
+		return true;
+	if (memcg->memsw_is_minimum)
+		return false;
+	if (res_counter_margin(&memcg->memsw) == 0)
+		return false;
+	return true;
+}
+
+static s64 mem_cgroup_local_page_stat(struct mem_cgroup *memcg,
+				      enum mem_cgroup_page_stat_item item)
+{
+	s64 ret;
+
+	switch (item) {
+	case MEMCG_NR_FILE_DIRTY:
+		ret = mem_cgroup_read_stat(memcg, MEM_CGROUP_STAT_FILE_DIRTY);
+		break;
+	case MEMCG_NR_FILE_WRITEBACK:
+		ret = mem_cgroup_read_stat(memcg,
+					   MEM_CGROUP_STAT_FILE_WRITEBACK);
+		break;
+	case MEMCG_NR_FILE_UNSTABLE_NFS:
+		ret = mem_cgroup_read_stat(memcg,
+					   MEM_CGROUP_STAT_FILE_UNSTABLE_NFS);
+		break;
+	case MEMCG_NR_DIRTYABLE_PAGES:
+		ret = mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_FILE)) +
+			mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_FILE));
+		if (mem_cgroup_can_swap(memcg))
+			ret += mem_cgroup_nr_lru_pages(memcg, BIT(LRU_ACTIVE_ANON)) +
+				mem_cgroup_nr_lru_pages(memcg, BIT(LRU_INACTIVE_ANON));
+		break;
+	default:
+		BUG();
+		break;
+	}
+	return ret;
+}
+
+/*
+ * Return the number of additional pages that the @memcg cgroup could allocate.
+ * If use_hierarchy is set, then this involves checking parent mem cgroups to
+ * find the cgroup with the smallest free space.
+ */
+static unsigned long
+mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
+{
+	u64 free;
+	unsigned long min_free;
+
+	min_free = global_page_state(NR_FREE_PAGES);
+
+	while (memcg) {
+		free = mem_cgroup_margin(memcg);
+		min_free = min_t(u64, min_free, free);
+		memcg = parent_mem_cgroup(memcg);
+	}
+
+	return min_free;
+}
+
+/*
+ * mem_cgroup_page_stat() - get memory cgroup file cache statistics
+ * @memcg:     memory cgroup to query
+ * @item:      memory statistic item exported to the kernel
+ *
+ * Return the accounted statistic value.
+ */
+unsigned long mem_cgroup_page_stat(struct mem_cgroup *memcg,
+				   enum mem_cgroup_page_stat_item item)
+{
+	struct mem_cgroup *iter;
+	s64 value;
+
+	/*
+	 * If we're looking for dirtyable pages we need to evaluate free pages
+	 * depending on the limit and usage of the parents first of all.
+	 */
+	if (item == MEMCG_NR_DIRTYABLE_PAGES)
+		value = mem_cgroup_hierarchical_free_pages(memcg);
+	else
+		value = 0;
+
+	/*
+	 * Recursively evaluate page statistics against all cgroup under
+	 * hierarchy tree
+	 */
+	for_each_mem_cgroup_tree(iter, memcg)
+		value += mem_cgroup_local_page_stat(iter, item);
+
+	/*
+	 * Summing of unlocked per-cpu counters is racy and may yield a slightly
+	 * negative value.  Zero is the only sensible value in such cases.
+	 */
+	if (unlikely(value < 0))
+		value = 0;
+
+	return value;
+}
+
+unsigned long mem_cgroup_dirty_pages(struct mem_cgroup *memcg)
+{
+	return mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_DIRTY) +
+		mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_WRITEBACK) +
+		mem_cgroup_page_stat(memcg, MEMCG_NR_FILE_UNSTABLE_NFS);
+}
+
 static void mem_cgroup_start_move(struct mem_cgroup *memcg)
 {
 	int cpu;
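
These helpers are consumed by later patches in the series. As a rough,
hypothetical illustration only (the helper name and the 20% ratio below are
made up for this sketch, not taken from the series), a memcg dirty limit check
could be built on top of them like this:

	/* Hypothetical sketch: is this memcg over a 20% dirty ratio? */
	static bool sketch_memcg_over_dirty_limit(struct mem_cgroup *memcg)
	{
		unsigned long dirty = mem_cgroup_dirty_pages(memcg);
		unsigned long dirtyable =
			mem_cgroup_page_stat(memcg, MEMCG_NR_DIRTYABLE_PAGES);

		/* throttle writers once 20% of dirtyable memory is dirty */
		return dirty > dirtyable / 5;
	}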



* [PATCH 5/9] writeback: introduce the pageout work
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim, Fengguang Wu,
	Linux Memory Management List, LKML

[-- Attachment #1: writeback-bdi_start_inode_writeback.patch --]
[-- Type: text/plain, Size: 15931 bytes --]

This relays file pageout IOs to the flusher threads.

It's much more important now that page reclaim generally does not
write out filesystem-backed pages.

The ultimate target is to gracefully handle the LRU lists pressured by
dirty/writeback pages. In particular, problems (1-2) are addressed here.

1) I/O efficiency

The flusher will piggyback the nearby ~10ms worth of dirty pages for I/O.

This takes advantage of the temporal/spatial locality in most workloads: the
nearby pages of one file are typically populated into the LRU at the same
time, hence will likely be close to each other in the LRU list. Writing
them in one shot helps clean more pages effectively for page reclaim.

For the common dd-style sequential writes that have excellent locality,
up to ~80ms worth of data will be written around by the pageout work (a worked
example of the sizing appears after this description), which helps make I/O
performance very close to that of the background writeback.

2) writeback work coordinations

To avoid memory allocations at page reclaim, a mempool for struct
wb_writeback_work is created.

wakeup_flusher_threads() is removed because it can easily delay the more
targeted pageout works and even exhaust the mempool reservations. It has
also been found to be I/O inefficient, as it frequently submits writeback
works with small ->nr_pages.

Background/periodic works will quit automatically, so as to clean the
pages under reclaim ASAP. However, for now the sync work can still block
us for a long time.

Jan Kara: limit the search scope; remove works and unpin inodes on umount.

TODO: the pageout works may be starved by the sync work and maybe others.
Need a proper way to guarantee fairness.
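
To make the ~10ms and ~80ms figures above concrete, here is a worked example of
the write-around sizing done in queue_pageout_work() (a sketch assuming 4KB
pages, an avg_write_bandwidth of ~100MB/s i.e. ~25600 pages/s, and
MIN_WRITEBACK_PAGES taken as 1024 pages):

	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
				/* ~25600 + 1024 = 26624 pages */
	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
				/* 16384 >> 6 = 256 pages = 1MB, ~10ms at 100MB/s */
	/*
	 * extend_writeback_range() lets a queued work grow to at most 8 such
	 * units, i.e. ~2048 pages = 8MB, or ~80ms worth of data for
	 * sequential writers.
	 */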

CC: Jan Kara <jack@suse.cz>
CC: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
CC: Greg Thelen <gthelen@google.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  230 +++++++++++++++++++++++++++--
 fs/super.c                       |    1 
 include/linux/backing-dev.h      |    2 
 include/linux/writeback.h        |   16 +-
 include/trace/events/writeback.h |   12 +
 mm/vmscan.c                      |   36 ++--
 6 files changed, 268 insertions(+), 29 deletions(-)

--- linux.orig/fs/fs-writeback.c	2012-02-28 19:07:06.109064465 +0800
+++ linux/fs/fs-writeback.c	2012-02-28 19:07:07.277064493 +0800
@@ -41,6 +41,8 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	unsigned long *older_than_this;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -57,6 +59,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * alloc_queue_pageout_work() will be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -129,7 +152,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -138,6 +161,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -187,6 +211,181 @@ void bdi_start_background_writeback(stru
 }
 
 /*
+ * Check if @work already covers @offset, or try to extend it to cover @offset.
+ * Returns true if the wb_writeback_work now encompasses the requested offset.
+ */
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset,
+				   unsigned long unit)
+{
+	pgoff_t end = work->offset + work->nr_pages;
+
+	if (offset >= work->offset && offset < end)
+		return true;
+
+	/*
+	 * for sequential workloads with good locality, include up to 8 times
+	 * more data in one chunk
+	 */
+	if (work->nr_pages >= 8 * unit)
+		return false;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->offset - offset < unit) {
+		work->nr_pages += unit;
+		work->offset -= unit;
+		return true;
+	}
+
+	if (offset - end < unit) {
+		work->nr_pages += unit;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+alloc_queue_pageout_work(struct backing_dev_info *bdi,
+			 struct inode *inode,
+			 pgoff_t offset,
+			 pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	/*
+	 * Grab the inode until the work is executed. We are calling this from
+	 * page reclaim context and the only thing pinning the address_space
+	 * for the moment is the page lock.
+	 */
+	if (!igrab(inode))
+		return ERR_PTR(-ENOENT);
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work) {
+		trace_printk("wb_work_mempool alloc fail\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->nr_pages		= len;
+	work->reason		= WB_REASON_PAGEOUT;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, allocated/queued a new pageout work;
+ *	    there are at least @ret writeback works queued now
+ * ret = 0: success, reused/extended a previous pageout work
+ * ret < 0: failed
+ */
+int queue_pageout_work(struct address_space *mapping, struct page *page)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	struct wb_writeback_work *work;
+	unsigned long write_around_pages;
+	pgoff_t offset = page->index;
+	int i = 0;
+	int ret = -ENOENT;
+
+	if (unlikely(!inode))
+		return ret;
+
+	/*
+	 * piggy back 8-15ms worth of data
+	 */
+	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
+	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
+
+	i = 1;
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset, write_around_pages)) {
+			ret = 0;
+			break;
+		}
+		/*
+		 * vmscan will slow down page reclaim when there are more than
+		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
+		 * times larger.
+		 */
+		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
+			break;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	if (ret) {
+		ret = i;
+		offset = round_down(offset, write_around_pages);
+		work = alloc_queue_pageout_work(bdi, inode,
+						offset, write_around_pages);
+		if (IS_ERR(work))
+			ret = PTR_ERR(work);
+	}
+	return ret;
+}
+
+static void wb_free_work(struct wb_writeback_work *work)
+{
+	if (work->inode)
+		iput(work->inode);
+	/*
+	 * Notify the caller of completion if this is a synchronous
+	 * work item, otherwise just free it.
+	 */
+	if (work->done)
+		complete(work->done);
+	else
+		mempool_free(work, wb_work_mempool);
+}
+
+/*
+ * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
+ */
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb)
+{
+	struct wb_writeback_work *work, *tmp;
+	LIST_HEAD(dispose);
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
+		if (sb) {
+			if (work->sb && work->sb != sb)
+				continue;
+			if (work->inode && work->inode->i_sb != sb)
+				continue;
+		}
+		list_move(&work->list, &dispose);
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	while (!list_empty(&dispose)) {
+		work = list_entry(dispose.next,
+				  struct wb_writeback_work, list);
+		list_del_init(&work->list);
+		wb_free_work(work);
+	}
+}
+
+/*
  * Remove the inode from the writeback list it is on.
  */
 void inode_wb_list_del(struct inode *inode)
@@ -833,6 +1032,21 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = work->offset << PAGE_CACHE_SHIFT,
+		.range_end = (work->offset + work->nr_pages - 1)
+						<< PAGE_CACHE_SHIFT,
+	};
+
+	do_writepages(work->inode->i_mapping, &wbc);
+
+	return LONG_MAX - wbc.nr_to_write;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh(wb->bdi)) {
@@ -905,16 +1119,12 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
-
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
+		if (!work->inode)
+			wrote += wb_writeback(wb, work);
 		else
-			kfree(work);
+			wrote += wb_pageout(wb, work);
+
+		wb_free_work(work);
 	}
 
 	/*
--- linux.orig/include/trace/events/writeback.h	2012-02-28 19:07:06.077064464 +0800
+++ linux/include/trace/events/writeback.h	2012-02-28 20:25:46.415730762 +0800
@@ -23,7 +23,7 @@
 
 #define WB_WORK_REASON							\
 		{WB_REASON_BACKGROUND,		"background"},		\
-		{WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages"},	\
+		{WB_REASON_PAGEOUT,		"pageout"},		\
 		{WB_REASON_SYNC,		"sync"},		\
 		{WB_REASON_PERIODIC,		"periodic"},		\
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
@@ -45,6 +45,8 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, range_cyclic)
 		__field(int, for_background)
 		__field(int, reason)
+		__field(unsigned long, ino)
+		__field(unsigned long, offset)
 	),
 	TP_fast_assign(
 		struct device *dev = bdi->dev;
@@ -58,9 +60,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
+		__entry->ino = work->inode ? work->inode->i_ino : 0;
+		__entry->offset = work->offset;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s ino=%lu offset=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
@@ -68,7 +72,9 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
 		  __entry->for_background,
-		  __print_symbolic(__entry->reason, WB_WORK_REASON)
+		  __print_symbolic(__entry->reason, WB_WORK_REASON),
+		  __entry->ino,
+		  __entry->offset
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
--- linux.orig/include/linux/writeback.h	2012-02-28 19:07:06.093064464 +0800
+++ linux/include/linux/writeback.h	2012-02-28 20:25:47.323730784 +0800
@@ -40,7 +40,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_PAGEOUT,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
@@ -94,6 +94,20 @@ long writeback_inodes_wb(struct bdi_writ
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+int queue_pageout_work(struct address_space *mapping, struct page *page);
+
+/*
+ * Tailored for vmscan which may submit lots of pageout works. The page reclaim
+ * will try to slow down the pageout work submission rate when the queue size
+ * grows to LOTS_OF_WRITEBACK_WORKS. queue_pageout_work() will accordingly limit
+ * its search depth to (2 * LOTS_OF_WRITEBACK_WORKS).
+ *
+ * Note that the limited search depth and work pool size are not a big problem:
+ * 1024 IOs in flight are typically more than enough to saturate the disk. And
+ * the overhead of searching the work list did not even show up in perf reports.
+ */
+#define WB_WORK_MEMPOOL_SIZE		1024
+#define LOTS_OF_WRITEBACK_WORKS		(WB_WORK_MEMPOOL_SIZE / 8)
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
--- linux.orig/fs/super.c	2012-02-28 19:07:06.101064465 +0800
+++ linux/fs/super.c	2012-02-28 19:07:07.277064493 +0800
@@ -389,6 +389,7 @@ void generic_shutdown_super(struct super
 
 		fsnotify_unmount_inodes(&sb->s_inodes);
 
+		bdi_remove_writeback_works(sb->s_bdi, sb);
 		evict_inodes(sb);
 
 		if (sop->put_super)
--- linux.orig/include/linux/backing-dev.h	2012-02-28 19:07:06.081064464 +0800
+++ linux/include/linux/backing-dev.h	2012-02-28 19:07:07.281064493 +0800
@@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
+++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
@@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * Pages may be dirtied anywhere inside the LRU. This
+			 * ensures they undergo a full period of LRU iteration
+			 * before considering pageout. The intention is to
+			 * delay writeout to the flusher thread, unless we run
+			 * into a long segment of dirty pages.
+			 */
+			if (references == PAGEREF_RECLAIM_CLEAN &&
+			    priority == DEF_PRIORITY)
+				goto keep_locked;
+
+			/*
+			 * Try relaying the pageout I/O to the flusher threads
+			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && mapping &&
+			    queue_pageout_work(mapping, page) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			}
 
-			if (references == PAGEREF_RECLAIM_CLEAN)
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow.
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd())
 				goto keep_locked;
+
 			if (!may_enter_fs)
 				goto keep_locked;
 			if (!sc->may_writepage)
@@ -2373,17 +2388,8 @@ static unsigned long do_try_to_free_page
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
 
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 5/9] writeback: introduce the pageout work
@ 2012-02-28 14:00   ` Fengguang Wu
  0 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim, Fengguang Wu,
	Linux Memory Management List, LKML

[-- Attachment #1: writeback-bdi_start_inode_writeback.patch --]
[-- Type: text/plain, Size: 16234 bytes --]

This relays file pageout IOs to the flusher threads.

This becomes much more important now that page reclaim generally does not
write out filesystem-backed pages itself.

The ultimate target is to gracefully handle the LRU lists pressured by
dirty/writeback pages. In particular, problems (1-2) are addressed here.

1) I/O efficiency

The flusher will piggy back the nearby ~10ms worth of dirty pages for I/O.

This takes advantage of the temporal/spatial locality in most workloads: the
nearby pages of one file are typically populated into the LRU at the same
time, and hence are likely to sit close to each other in the LRU list. Writing
them out in one shot helps clean more pages effectively for page reclaim.

For the common dd-style sequential writes that have excellent locality, up
to ~80ms worth of data will be written around by the pageout work, which
brings I/O performance very close to that of the background writeback.
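
As a rough illustration of the sizing math (a user-space sketch only; the
bandwidth figure and the MIN_WRITEBACK_PAGES stand-in below are assumed
example values, not measured ones), queue_pageout_work() picks roughly 1/64
of a second's worth of pages as the write-around unit, and
extend_writeback_range() lets a work grow to at most 8 such units:

/* toy model of the write-around sizing used by queue_pageout_work() */
#include <stdio.h>

static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	unsigned long avg_write_bandwidth = 25600;	/* assumed: ~100MB/s as 4KB pages/s */
	unsigned long min_writeback_pages = 1024;	/* assumed stand-in for MIN_WRITEBACK_PAGES */
	unsigned long unit, max_chunk;

	unit = rounddown_pow_of_two(avg_write_bandwidth + min_writeback_pages) >> 6;
	max_chunk = 8 * unit;	/* extend_writeback_range() caps growth at 8 units */

	printf("write-around unit: %lu pages (~%lu ms of I/O)\n",
	       unit, unit * 1000 / avg_write_bandwidth);
	printf("max extended work: %lu pages (~%lu ms of I/O)\n",
	       max_chunk, max_chunk * 1000 / avg_write_bandwidth);
	return 0;
}

With these example numbers the unit comes out to 256 pages (~10ms) and a
fully extended work to 2048 pages (~80ms), matching the figures above.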

2) writeback work coordinations

To avoid memory allocations at page reclaim, a mempool for struct
wb_writeback_work is created.
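
The idea, in user-space terms, is a small preallocated reserve that reclaim
context can dip into without ever calling back into the allocator. The sketch
below is only an analogue of that pattern (the names and pool size are made
up; the kernel side uses mempool_create()/mempool_alloc() with GFP_NOWAIT and
refuses fresh allocation under PF_MEMALLOC):

/* user-space analogue of a reserve-backed, never-blocking work allocator */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>

#define POOL_SIZE 8

struct work { long nr_pages; };

static struct work *reserve[POOL_SIZE];
static int reserved;

static void pool_init(void)
{
	while (reserved < POOL_SIZE)
		reserve[reserved++] = malloc(sizeof(struct work));
}

/* in_reclaim stands in for PF_MEMALLOC: no fresh allocation from reclaim */
static struct work *pool_alloc(bool in_reclaim)
{
	struct work *w = in_reclaim ? NULL : malloc(sizeof(struct work));

	if (!w && reserved)
		w = reserve[--reserved];	/* fall back to the reserve */
	return w;				/* may be NULL: caller must cope */
}

static void pool_free(struct work *w)
{
	if (reserved < POOL_SIZE)
		reserve[reserved++] = w;	/* refill the reserve first */
	else
		free(w);
}

int main(void)
{
	pool_init();
	struct work *w = pool_alloc(true);

	printf("got work from reserve: %s, %d element(s) left\n",
	       w ? "yes" : "no", reserved);
	pool_free(w);
	return 0;
}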

wakeup_flusher_threads() is removed because it can easily delay the more
targeted pageout works and even exhaust the mempool reservations. It was also
found to be I/O inefficient, as it frequently submits writeback works with
small ->nr_pages.

Background/periodic works will quit automatically, so that the pages under
reclaim get cleaned ASAP. However, for now the sync work can still block us
for a long time.

Jan Kara: limit the search scope; remove works and unpin inodes on umount.

TODO: the pageout works may be starved by the sync work and possibly others.
A proper way to guarantee fairness is still needed.

CC: Jan Kara <jack@suse.cz>
CC: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
CC: Greg Thelen <gthelen@google.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  230 +++++++++++++++++++++++++++--
 fs/super.c                       |    1 
 include/linux/backing-dev.h      |    2 
 include/linux/writeback.h        |   16 +-
 include/trace/events/writeback.h |   12 +
 mm/vmscan.c                      |   36 ++--
 6 files changed, 268 insertions(+), 29 deletions(-)

--- linux.orig/fs/fs-writeback.c	2012-02-28 19:07:06.109064465 +0800
+++ linux/fs/fs-writeback.c	2012-02-28 19:07:07.277064493 +0800
@@ -41,6 +41,8 @@ struct wb_writeback_work {
 	long nr_pages;
 	struct super_block *sb;
 	unsigned long *older_than_this;
+	struct inode *inode;
+	pgoff_t offset;
 	enum writeback_sync_modes sync_mode;
 	unsigned int tagged_writepages:1;
 	unsigned int for_kupdate:1;
@@ -57,6 +59,27 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * alloc_queue_pageout_work() will be called on page reclaim
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -129,7 +152,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -138,6 +161,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -187,6 +211,181 @@ void bdi_start_background_writeback(stru
 }
 
 /*
+ * Check if @work already covers @offset, or try to extend it to cover @offset.
+ * Returns true if the wb_writeback_work now encompasses the requested offset.
+ */
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset,
+				   unsigned long unit)
+{
+	pgoff_t end = work->offset + work->nr_pages;
+
+	if (offset >= work->offset && offset < end)
+		return true;
+
+	/*
+	 * for sequential workloads with good locality, include up to 8 times
+	 * more data in one chunk
+	 */
+	if (work->nr_pages >= 8 * unit)
+		return false;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->offset - offset < unit) {
+		work->nr_pages += unit;
+		work->offset -= unit;
+		return true;
+	}
+
+	if (offset - end < unit) {
+		work->nr_pages += unit;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+alloc_queue_pageout_work(struct backing_dev_info *bdi,
+			 struct inode *inode,
+			 pgoff_t offset,
+			 pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	/*
+	 * Grab the inode until the work is executed. We are calling this from
+	 * page reclaim context and the only thing pinning the address_space
+	 * for the moment is the page lock.
+	 */
+	if (!igrab(inode))
+		return ERR_PTR(-ENOENT);
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work) {
+		trace_printk("wb_work_mempool alloc fail\n");
+		return ERR_PTR(-ENOMEM);
+	}
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->inode		= inode;
+	work->offset		= offset;
+	work->nr_pages		= len;
+	work->reason		= WB_REASON_PAGEOUT;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have a good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, allocated/queued a new pageout work;
+ *	    there are at least @ret writeback works queued now
+ * ret = 0: success, reused/extended a previous pageout work
+ * ret < 0: failed
+ */
+int queue_pageout_work(struct address_space *mapping, struct page *page)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	struct wb_writeback_work *work;
+	unsigned long write_around_pages;
+	pgoff_t offset = page->index;
+	int i = 0;
+	int ret = -ENOENT;
+
+	if (unlikely(!inode))
+		return ret;
+
+	/*
+	 * piggy back 8-15ms worth of data
+	 */
+	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
+	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
+
+	i = 1;
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset, write_around_pages)) {
+			ret = 0;
+			break;
+		}
+		/*
+		 * vmscan will slow down page reclaim when there are more than
+		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
+		 * times larger.
+		 */
+		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
+			break;
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	if (ret) {
+		ret = i;
+		offset = round_down(offset, write_around_pages);
+		work = alloc_queue_pageout_work(bdi, inode,
+						offset, write_around_pages);
+		if (IS_ERR(work))
+			ret = PTR_ERR(work);
+	}
+	return ret;
+}
+
+static void wb_free_work(struct wb_writeback_work *work)
+{
+	if (work->inode)
+		iput(work->inode);
+	/*
+	 * Notify the caller of completion if this is a synchronous
+	 * work item, otherwise just free it.
+	 */
+	if (work->done)
+		complete(work->done);
+	else
+		mempool_free(work, wb_work_mempool);
+}
+
+/*
+ * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
+ */
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb)
+{
+	struct wb_writeback_work *work, *tmp;
+	LIST_HEAD(dispose);
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
+		if (sb) {
+			if (work->sb && work->sb != sb)
+				continue;
+			if (work->inode && work->inode->i_sb != sb)
+				continue;
+		}
+		list_move(&work->list, &dispose);
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	while (!list_empty(&dispose)) {
+		work = list_entry(dispose.next,
+				  struct wb_writeback_work, list);
+		list_del_init(&work->list);
+		wb_free_work(work);
+	}
+}
+
+/*
  * Remove the inode from the writeback list it is on.
  */
 void inode_wb_list_del(struct inode *inode)
@@ -833,6 +1032,21 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = work->offset << PAGE_CACHE_SHIFT,
+		.range_end = (work->offset + work->nr_pages - 1)
+						<< PAGE_CACHE_SHIFT,
+	};
+
+	do_writepages(work->inode->i_mapping, &wbc);
+
+	return LONG_MAX - wbc.nr_to_write;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh(wb->bdi)) {
@@ -905,16 +1119,12 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
-
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
+		if (!work->inode)
+			wrote += wb_writeback(wb, work);
 		else
-			kfree(work);
+			wrote += wb_pageout(wb, work);
+
+		wb_free_work(work);
 	}
 
 	/*
--- linux.orig/include/trace/events/writeback.h	2012-02-28 19:07:06.077064464 +0800
+++ linux/include/trace/events/writeback.h	2012-02-28 20:25:46.415730762 +0800
@@ -23,7 +23,7 @@
 
 #define WB_WORK_REASON							\
 		{WB_REASON_BACKGROUND,		"background"},		\
-		{WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages"},	\
+		{WB_REASON_PAGEOUT,		"pageout"},		\
 		{WB_REASON_SYNC,		"sync"},		\
 		{WB_REASON_PERIODIC,		"periodic"},		\
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
@@ -45,6 +45,8 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, range_cyclic)
 		__field(int, for_background)
 		__field(int, reason)
+		__field(unsigned long, ino)
+		__field(unsigned long, offset)
 	),
 	TP_fast_assign(
 		struct device *dev = bdi->dev;
@@ -58,9 +60,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
+		__entry->ino = work->inode ? work->inode->i_ino : 0;
+		__entry->offset = work->offset;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s ino=%lu offset=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
@@ -68,7 +72,9 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
 		  __entry->for_background,
-		  __print_symbolic(__entry->reason, WB_WORK_REASON)
+		  __print_symbolic(__entry->reason, WB_WORK_REASON),
+		  __entry->ino,
+		  __entry->offset
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
--- linux.orig/include/linux/writeback.h	2012-02-28 19:07:06.093064464 +0800
+++ linux/include/linux/writeback.h	2012-02-28 20:25:47.323730784 +0800
@@ -40,7 +40,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_PAGEOUT,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
@@ -94,6 +94,20 @@ long writeback_inodes_wb(struct bdi_writ
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+int queue_pageout_work(struct address_space *mapping, struct page *page);
+
+/*
+ * Tailored for vmscan which may submit lots of pageout works. The page reclaim
+ * will try to slow down the pageout work submission rate when the queue size
+ * grows to LOTS_OF_WRITEBACK_WORKS. queue_pageout_work() will accordingly limit
+ * its search depth to (2 * LOTS_OF_WRITEBACK_WORKS).
+ *
+ * Note that the limited search depth and work pool size are not a big problem:
+ * 1024 IOs in flight are typically more than enough to saturate the disk. And
+ * the overhead of searching the work list did not even show up in perf reports.
+ */
+#define WB_WORK_MEMPOOL_SIZE		1024
+#define LOTS_OF_WRITEBACK_WORKS		(WB_WORK_MEMPOOL_SIZE / 8)
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
--- linux.orig/fs/super.c	2012-02-28 19:07:06.101064465 +0800
+++ linux/fs/super.c	2012-02-28 19:07:07.277064493 +0800
@@ -389,6 +389,7 @@ void generic_shutdown_super(struct super
 
 		fsnotify_unmount_inodes(&sb->s_inodes);
 
+		bdi_remove_writeback_works(sb->s_bdi, sb);
 		evict_inodes(sb);
 
 		if (sop->put_super)
--- linux.orig/include/linux/backing-dev.h	2012-02-28 19:07:06.081064464 +0800
+++ linux/include/linux/backing-dev.h	2012-02-28 19:07:07.281064493 +0800
@@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
+++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
@@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * Pages may be dirtied anywhere inside the LRU. This
+			 * ensures they undergo a full period of LRU iteration
+			 * before considering pageout. The intention is to
+			 * delay writeout to the flusher thread, unless we run
+			 * into a long segment of dirty pages.
+			 */
+			if (references == PAGEREF_RECLAIM_CLEAN &&
+			    priority == DEF_PRIORITY)
+				goto keep_locked;
+
+			/*
+			 * Try relaying the pageout I/O to the flusher threads
+			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && mapping &&
+			    queue_pageout_work(mapping, page) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			}
 
-			if (references == PAGEREF_RECLAIM_CLEAN)
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow.
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd())
 				goto keep_locked;
+
 			if (!may_enter_fs)
 				goto keep_locked;
 			if (!sc->may_writepage)
@@ -2373,17 +2388,8 @@ static unsigned long do_try_to_free_page
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
 
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 6/9] vmscan: dirty reclaim throttling
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Fengguang Wu, Linux Memory Management List, LKML

[-- Attachment #1: vmscan-pgreclaim-throttle.patch --]
[-- Type: text/plain, Size: 23474 bytes --]

1) OOM avoidance and scan rate control

Typically we do LRU scans without rate control and quickly get enough clean
pages as long as the LRU lists are not full of dirty pages.

Or we can still get a number of freshly cleaned pages (moved to the LRU tail
by end_page_writeback()) when the queued pageout I/O completes within tens of
milliseconds.

However, if the LRU list is small and full of dirty pages, it can be fully
scanned in no time and we go OOM before the flusher manages to clean enough
pages. Generally this does not happen for global reclaim, which does dirty
throttling, but it happens easily with memcg LRUs.

A simple yet reliable scheme is employed to avoid OOM and keep scan rate
in sync with the I/O rate:

	if (encountered PG_reclaim pages)
		do some throttle wait

PG_reclaim plays the key role. When a dirty page is encountered, we
queue pageout writeback work for it, set PG_reclaim and put it back to
the LRU head. So if PG_reclaim pages are encountered again, it means
they have not yet been cleaned by the flusher after a full scan of the
inactive list. It indicates we are scanning faster than the I/O and should
take a nap.

The runtime behavior on a fully dirtied small LRU list would be:
It will start with a quick scan of the list, queuing all pages for I/O.
Then the scan will be slowed down by the PG_reclaim pages *adaptively*
to match the I/O bandwidth.
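
A toy user-space simulation of that behavior (all numbers are made-up
assumptions, and the model keeps nothing but the queue-then-wait cycle) shows
how the number of scan passes ends up bounded by the I/O rate rather than by
CPU speed:

/* toy model: scan passes gated by PG_reclaim-style waiting on I/O completion */
#include <stdio.h>

int main(void)
{
	long lru_pages = 25600;	/* assumed: 100MB memcg LRU of dirty 4KB pages */
	long io_rate = 2560;	/* assumed: flusher cleans 2560 pages per 100ms tick */
	long dirty = lru_pages, under_io = 0, clean = 0, ticks = 0;

	while (clean < lru_pages) {
		/* one pass over the inactive list: queue pageout works, set PG_reclaim */
		under_io += dirty;
		dirty = 0;

		/* the flusher makes progress; cleaned pages rotate to the LRU tail */
		long done = under_io < io_rate ? under_io : io_rate;
		under_io -= done;
		clean += done;

		/* pages met again with PG_reclaim set => take a ~100ms nap */
		ticks++;
	}
	printf("%ld pages cleaned in %ld x 100ms ticks (scan rate matched to I/O)\n",
	       clean, ticks);
	return 0;
}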

2) selective dirty reclaim throttling for interactive performance

For desktops, it's not just the USB writer, but also unrelated processes
that are allocating memory at the same time the writing happens. What we
want to avoid is a situation where something like firefox or evolution
or even gnome-terminal is performing a small read and gets either

a) starved for IO bandwidth and stalls (not the focus here obviously)
b) enters page reclaim, finds PG_reclaim pages from the USB write and stalls

It's (b) we need to watch out for. So we try to only throttle the write
tasks by means of

- distinguish dirtier tasks and unrelated clean tasks by testing
   - whether __GFP_WRITE is set
   - whether current->nr_dirtied changed recently

- put dirtier tasks to wait at lower dirty fill levels (~50%) and
  clean tasks at much higher threshold (80%).

- slightly decrease wait threshold on decreased scan priority
  (which indicates long run of hard-to-reclaim pages)
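
To make the fill levels concrete, here is a user-space rendering of the
reclaim_dirty_level() math from this patch, assuming a 20% global dirty limit
over a 100000-page dirtyable zone (the percentages are example inputs; the
recent-write level, 5, is omitted from the printout for brevity):

/* worked example of the 0..10 dirty level scale and its throttle thresholds */
#include <stdio.h>

static int reclaim_dirty_level(unsigned long dirty, unsigned long total,
			       unsigned long limit)
{
	if (dirty <= limit)
		return 0;
	dirty -= limit;
	total -= limit;
	return 8 * dirty / (total | 1) + 2;	/* 2 == DIRTY_LEVEL_BALANCED */
}

int main(void)
{
	unsigned long total = 100000, limit = total / 5, pct;

	for (pct = 20; pct <= 100; pct += 20) {
		int level = reclaim_dirty_level(total * pct / 100, total, limit);

		printf("%3lu%% dirty -> level %d:%s%s%s\n", pct, level,
		       level >= 4 ? " __GFP_WRITE tasks wait" : "",
		       level >= 6 ? ", kswapd waits" : "",
		       level >= 8 ? ", everyone waits" : "");
	}
	return 0;
}

With these example inputs the write tasks start waiting at ~40% fill, kswapd
at ~60% and clean tasks only at ~80%, which lines up with the threshold
diagram added to mm/vmscan.c by this patch.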

3) test case

Run 2 dd tasks in a 100MB memcg (a very handy test case from Greg Thelen):

	mkdir /cgroup/x
	echo 100M > /cgroup/x/memory.limit_in_bytes
	echo $$ > /cgroup/x/tasks

	for i in `seq 2`
	do
		dd if=/dev/zero of=/fs/f$i bs=1k count=1M &
	done

Before patch, the dd tasks are quickly OOM killed.
After patch, they run well with reasonably good performance and overheads:

1073741824 bytes (1.1 GB) copied, 22.2196 s, 48.3 MB/s
1073741824 bytes (1.1 GB) copied, 22.4675 s, 47.8 MB/s

iostat -kx 1

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00  178.00     0.00 89568.00  1006.38    74.35  417.71   4.80  85.40
sda               0.00     2.00    0.00  191.00     0.00 94428.00   988.77    53.34  219.03   4.34  82.90
sda               0.00    20.00    0.00  196.00     0.00 97712.00   997.06    71.11  337.45   4.77  93.50
sda               0.00     5.00    0.00  175.00     0.00 84648.00   967.41    54.03  316.44   5.06  88.60
sda               0.00     0.00    0.00  186.00     0.00 92432.00   993.89    56.22  267.54   5.38 100.00
sda               0.00     1.00    0.00  183.00     0.00 90156.00   985.31    37.99  325.55   4.33  79.20
sda               0.00     0.00    0.00  175.00     0.00 88692.00  1013.62    48.70  218.43   4.69  82.10
sda               0.00     0.00    0.00  196.00     0.00 97528.00   995.18    43.38  236.87   5.10 100.00
sda               0.00     0.00    0.00  179.00     0.00 88648.00   990.48    45.83  285.43   5.59 100.00
sda               0.00     0.00    0.00  178.00     0.00 88500.00   994.38    28.28  158.89   4.99  88.80
sda               0.00     0.00    0.00  194.00     0.00 95852.00   988.16    32.58  167.39   5.15 100.00
sda               0.00     2.00    0.00  215.00     0.00 105996.00   986.01    41.72  201.43   4.65 100.00
sda               0.00     4.00    0.00  173.00     0.00 84332.00   974.94    50.48  260.23   5.76  99.60
sda               0.00     0.00    0.00  182.00     0.00 90312.00   992.44    36.83  212.07   5.49 100.00
sda               0.00     8.00    0.00  195.00     0.00 95940.50   984.01    50.18  221.06   5.13 100.00
sda               0.00     1.00    0.00  220.00     0.00 108852.00   989.56    40.99  202.68   4.55 100.00
sda               0.00     2.00    0.00  161.00     0.00 80384.00   998.56    37.19  268.49   6.21 100.00
sda               0.00     4.00    0.00  182.00     0.00 90830.00   998.13    50.58  239.77   5.49 100.00
sda               0.00     0.00    0.00  197.00     0.00 94877.00   963.22    36.68  196.79   5.08 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00   15.08   33.92    0.00   50.75
           0.25    0.00   14.54   35.09    0.00   50.13
           0.50    0.00   13.57   32.41    0.00   53.52
           0.50    0.00   11.28   36.84    0.00   51.38
           0.50    0.00   15.75   32.00    0.00   51.75
           0.50    0.00   10.50   34.00    0.00   55.00
           0.50    0.00   17.63   27.46    0.00   54.41
           0.50    0.00   15.08   30.90    0.00   53.52
           0.50    0.00   11.28   32.83    0.00   55.39
           0.75    0.00   16.79   26.82    0.00   55.64
           0.50    0.00   16.08   29.15    0.00   54.27
           0.50    0.00   13.50   30.50    0.00   55.50
           0.50    0.00   14.32   35.18    0.00   50.00
           0.50    0.00   12.06   33.92    0.00   53.52
           0.50    0.00   17.29   30.58    0.00   51.63
           0.50    0.00   15.08   29.65    0.00   54.77
           0.50    0.00   12.53   29.32    0.00   57.64
           0.50    0.00   15.29   31.83    0.00   52.38

The global dd numbers for comparison:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00  189.00     0.00 95752.00  1013.25   143.09  684.48   5.29 100.00
sda               0.00     0.00    0.00  208.00     0.00 105480.00  1014.23   143.06  733.29   4.81 100.00
sda               0.00     0.00    0.00  161.00     0.00 81924.00  1017.69   141.71  757.79   6.21 100.00
sda               0.00     0.00    0.00  217.00     0.00 109580.00  1009.95   143.09  749.55   4.61 100.10
sda               0.00     0.00    0.00  187.00     0.00 94728.00  1013.13   144.31  773.67   5.35 100.00
sda               0.00     0.00    0.00  189.00     0.00 95752.00  1013.25   144.14  742.00   5.29 100.00
sda               0.00     0.00    0.00  177.00     0.00 90032.00  1017.31   143.32  656.59   5.65 100.00
sda               0.00     0.00    0.00  215.00     0.00 108640.00  1010.60   142.90  817.54   4.65 100.00
sda               0.00     2.00    0.00  166.00     0.00 83858.00  1010.34   143.64  808.61   6.02 100.00
sda               0.00     0.00    0.00  186.00     0.00 92813.00   997.99   141.18  736.95   5.38 100.00
sda               0.00     0.00    0.00  206.00     0.00 104456.00  1014.14   146.27  729.33   4.85 100.00
sda               0.00     0.00    0.00  213.00     0.00 107024.00  1004.92   143.25  705.70   4.69 100.00
sda               0.00     0.00    0.00  188.00     0.00 95748.00  1018.60   141.82  764.78   5.32 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00   11.22   52.30    0.00   35.97
           0.25    0.00   10.15   52.54    0.00   37.06
           0.25    0.00    5.01   56.64    0.00   38.10
           0.51    0.00   15.15   43.94    0.00   40.40
           0.25    0.00   12.12   48.23    0.00   39.39
           0.51    0.00   11.20   53.94    0.00   34.35
           0.26    0.00    9.72   51.41    0.00   38.62
           0.76    0.00    9.62   50.63    0.00   38.99
           0.51    0.00   10.46   53.32    0.00   35.71
           0.51    0.00    9.41   51.91    0.00   38.17
           0.25    0.00   10.69   49.62    0.00   39.44
           0.51    0.00   12.21   52.67    0.00   34.61
           0.51    0.00   11.45   53.18    0.00   34.86

XXX: commit NFS unstable pages via write_inode(). Unfortunately it's currently
not possible to specify a range of pages to commit.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/mmzone.h        |    1 
 include/linux/sched.h         |    1 
 include/linux/writeback.h     |    2 
 include/trace/events/vmscan.h |   68 ++++++++++
 mm/internal.h                 |    2 
 mm/page-writeback.c           |    2 
 mm/page_alloc.c               |    1 
 mm/swap.c                     |    4 
 mm/vmscan.c                   |  211 ++++++++++++++++++++++++++++++--
 9 files changed, 278 insertions(+), 14 deletions(-)

--- linux.orig/include/linux/writeback.h	2012-02-28 20:50:01.855765353 +0800
+++ linux/include/linux/writeback.h	2012-02-28 20:50:06.411765461 +0800
@@ -136,6 +136,8 @@ static inline void laptop_sync_completio
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirtyable_memory(struct zone *zone);
+unsigned long global_dirtyable_memory(void);
 
 extern unsigned long global_dirty_limit;
 
--- linux.orig/mm/page-writeback.c	2012-02-28 20:50:01.799765351 +0800
+++ linux/mm/page-writeback.c	2012-02-28 20:50:06.411765461 +0800
@@ -263,7 +263,7 @@ void global_dirty_limits(unsigned long *
  * Returns the zone's number of pages potentially available for dirty
  * page cache.  This is the base value for the per-zone dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+unsigned long zone_dirtyable_memory(struct zone *zone)
 {
 	/*
 	 * The effective global number of dirtyable pages may exclude
--- linux.orig/mm/swap.c	2012-02-28 20:50:01.791765351 +0800
+++ linux/mm/swap.c	2012-02-28 20:50:06.411765461 +0800
@@ -270,8 +270,10 @@ void rotate_reclaimable_page(struct page
 		page_cache_get(page);
 		local_irq_save(flags);
 		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
+		if (!pagevec_add(pvec, page)) {
 			pagevec_move_tail(pvec);
+			reclaim_rotated(page);
+		}
 		local_irq_restore(flags);
 	}
 }
--- linux.orig/mm/vmscan.c	2012-02-28 20:50:01.815765352 +0800
+++ linux/mm/vmscan.c	2012-02-28 20:52:56.227769496 +0800
@@ -50,9 +50,6 @@
 
 #include "internal.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/vmscan.h>
-
 /*
  * reclaim_mode determines how the inactive list is shrunk
  * RECLAIM_MODE_SINGLE: Reclaim only order-0 pages
@@ -120,6 +117,40 @@ struct mem_cgroup_zone {
 	struct zone *zone;
 };
 
+/*
+ * page reclaim dirty throttle thresholds:
+ *
+ *              20%           40%    50%    60%           80%
+ * |------+------+------+------+------+------+------+------+------+------|
+ * 0             ^BALANCED     ^THROTTLE_WRITE	           ^THROTTLE_ALL
+ *
+ * The LRU dirty pages should normally be under the 20% global balance ratio.
+ * When exceeding 40%, allocations for writes will be throttled; if that fails
+ * to keep the dirty pages under control, we'll have to throttle all tasks when
+ * above 80%, to avoid spinning the CPU.
+ *
+ * We start throttling KSWAPD before ALL, hoping to trigger more direct reclaims
+ * to throttle the write tasks and keep dirty pages from growing further.
+ */
+enum reclaim_dirty_level {
+	DIRTY_LEVEL_BALANCED			= 2,
+	DIRTY_LEVEL_THROTTLE_WRITE		= 4,
+	DIRTY_LEVEL_THROTTLE_RECENT_WRITE	= 5,
+	DIRTY_LEVEL_THROTTLE_KSWAPD		= 6,
+	DIRTY_LEVEL_THROTTLE_ALL		= 8,
+	DIRTY_LEVEL_MAX				= 10
+};
+enum reclaim_throttle_type {
+	RTT_WRITE,
+	RTT_RECENT_WRITE,
+	RTT_KSWAPD,
+	RTT_CLEAN,
+	RTT_MAX
+};
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -767,7 +798,8 @@ static unsigned long shrink_page_list(st
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
-				      unsigned long *ret_nr_writeback)
+				      unsigned long *ret_nr_writeback,
+				      unsigned long *ret_nr_pgreclaim)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -776,6 +808,7 @@ static unsigned long shrink_page_list(st
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 
 	cond_resched();
 
@@ -814,6 +847,14 @@ static unsigned long shrink_page_list(st
 		if (PageWriteback(page)) {
 			nr_writeback++;
 			/*
+			 * The pageout works do write-around, which may put
+			 * close-to-LRU-tail pages under writeback a bit earlier.
+			 */
+			if (PageReclaim(page))
+				nr_pgreclaim++;
+			else
+				SetPageReclaim(page);
+			/*
 			 * Synchronous reclaim cannot queue pages for
 			 * writeback due to the possibility of stack overflow
 			 * but if it encounters a page under writeback, wait
@@ -885,21 +926,45 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 
 			/*
+			 * The PG_reclaim page was put to I/O and moved to the
+			 * LRU head some time ago. If we hit it again at the
+			 * LRU tail, we may be scanning faster than the flusher
+			 * can write out dirty pages. So suggest some
+			 * reclaim_wait throttling to match the I/O rate.
+			 */
+			if (page_is_file_cache(page) && PageReclaim(page)) {
+				nr_pgreclaim++;
+				goto keep_locked;
+			}
+
+			/*
 			 * Try relaying the pageout I/O to the flusher threads
 			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) && mapping &&
-			    queue_pageout_work(mapping, page) >= 0) {
+			if (page_is_file_cache(page) && mapping) {
+				int res = queue_pageout_work(mapping, page);
+
+				/*
+				 * It's not really PG_reclaim, but here we need
+				 * I/O when there are too many works queued or
+				 * when we cannot queue new work at all.
+				 * cannot queue new work at all.
+				 */
+				if (res < 0 || res > LOTS_OF_WRITEBACK_WORKS)
+					nr_pgreclaim++;
+
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
 				 * except we already have the page isolated
 				 * and know it's dirty
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
-				SetPageReclaim(page);
-
-				goto keep_locked;
+				if (res >= 0) {
+					inc_zone_page_state(page,
+							NR_VMSCAN_IMMEDIATE);
+					SetPageReclaim(page);
+					goto keep_locked;
+				}
 			}
 
 			/*
@@ -1043,6 +1108,7 @@ keep_lumpy:
 	count_vm_events(PGACTIVATE, pgactivate);
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_writeback += nr_writeback;
+	*ret_nr_pgreclaim += nr_pgreclaim;
 	return nr_reclaimed;
 }
 
@@ -1508,6 +1574,117 @@ static inline bool should_reclaim_stall(
 	return priority <= lumpy_stall_priority;
 }
 
+static int reclaim_dirty_level(unsigned long dirty,
+			       unsigned long total)
+{
+	unsigned long limit = total * global_dirty_limit /
+					global_dirtyable_memory();
+	if (dirty <= limit)
+		return 0;
+
+	dirty -= limit;
+	total -= limit;
+
+	return 8 * dirty / (total | 1) + DIRTY_LEVEL_BALANCED;
+}
+
+static bool should_throttle_dirty(struct mem_cgroup_zone *mz,
+				  struct scan_control *sc,
+				  int priority)
+{
+	unsigned long nr_dirty;
+	unsigned long nr_dirtyable;
+	int dirty_level = -1;
+	int level;
+	int type;
+	bool wait;
+
+	if (global_reclaim(sc)) {
+		struct zone *zone = mz->zone;
+		nr_dirty = zone_page_state(zone, NR_FILE_DIRTY) +
+				zone_page_state(zone, NR_UNSTABLE_NFS) +
+				zone_page_state(zone, NR_WRITEBACK);
+		nr_dirtyable = zone_dirtyable_memory(zone);
+	} else {
+		struct mem_cgroup *memcg = mz->mem_cgroup;
+		nr_dirty = mem_cgroup_dirty_pages(memcg);
+		nr_dirtyable = mem_cgroup_page_stat(memcg,
+						    MEMCG_NR_DIRTYABLE_PAGES);
+		trace_printk("memcg nr_dirtyable=%lu nr_dirty=%lu\n",
+			     nr_dirtyable, nr_dirty);
+	}
+
+	dirty_level = reclaim_dirty_level(nr_dirty, nr_dirtyable);
+	/*
+	 * Take a snap when encountered a long contiguous run of dirty pages.
+	 * When under global dirty limit, kswapd will only wait on priority==0,
+	 * and the clean tasks will never wait.
+	 */
+	level = dirty_level + (DEF_PRIORITY - priority) / 2;
+
+	if (current_is_kswapd()) {
+		type = RTT_KSWAPD;
+		wait = level >= DIRTY_LEVEL_THROTTLE_KSWAPD;
+		goto out;
+	}
+
+	if (sc->gfp_mask & __GFP_WRITE) {
+		type = RTT_WRITE;
+		wait = level >= DIRTY_LEVEL_THROTTLE_WRITE;
+		goto out;
+	}
+
+	if (current->nr_dirtied != current->nr_dirtied_snapshot) {
+		type = RTT_RECENT_WRITE;
+		wait = level >= DIRTY_LEVEL_THROTTLE_RECENT_WRITE;
+		current->nr_dirtied_snapshot = current->nr_dirtied;
+		goto out;
+	}
+
+	type = RTT_CLEAN;
+	wait = level >= DIRTY_LEVEL_THROTTLE_ALL;
+out:
+	if (wait) {
+		trace_mm_vmscan_should_throttle_dirty(type, priority,
+						      dirty_level, wait);
+	}
+	return wait;
+}
+
+
+/*
+ * reclaim_wait - wait for some pages being rotated to the LRU tail
+ * @zone: the zone under page reclaim
+ * @timeout: timeout in jiffies
+ *
+ * Wait until @timeout, or when some (typically PG_reclaim under writeback)
+ * pages are rotated to the LRU tail so that page reclaim can make progress.
+ */
+static long reclaim_wait(struct mem_cgroup_zone *mz, long timeout)
+{
+	unsigned long start = jiffies;
+	wait_queue_head_t *wqh;
+	DEFINE_WAIT(wait);
+	long ret;
+
+	wqh = &mz->zone->zone_pgdat->reclaim_wait;
+	prepare_to_wait(wqh, &wait, TASK_KILLABLE);
+	ret = io_schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+
+	trace_mm_vmscan_reclaim_wait(timeout, jiffies - start, mz);
+
+	return ret;
+}
+
+void reclaim_rotated(struct page *page)
+{
+	wait_queue_head_t *wqh = &NODE_DATA(page_to_nid(page))->reclaim_wait;
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1524,6 +1701,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_file;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 	struct zone *zone = mz->zone;
 
@@ -1574,13 +1752,13 @@ shrink_inactive_list(unsigned long nr_to
 	spin_unlock_irq(&zone->lru_lock);
 
 	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
-						&nr_dirty, &nr_writeback);
+				&nr_dirty, &nr_writeback, &nr_pgreclaim);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
-					priority, &nr_dirty, &nr_writeback);
+			priority, &nr_dirty, &nr_writeback, &nr_pgreclaim);
 	}
 
 	spin_lock_irq(&zone->lru_lock);
@@ -1623,6 +1801,15 @@ shrink_inactive_list(unsigned long nr_to
 	 */
 	if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+	/*
+	 * If we reclaimed any pages, we are safe from busy scanning. Otherwise,
+	 * when we encountered PG_reclaim pages or the writeback work queue is
+	 * congested, consider I/O throttling. Try to throttle only the dirtier
+	 * tasks by granting higher throttle thresholds to kswapd and clean tasks.
+	 */
+	if (!nr_reclaimed && nr_pgreclaim &&
+	    should_throttle_dirty(mz, sc, priority))
+		reclaim_wait(mz, HZ/10);
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
--- linux.orig/include/trace/events/vmscan.h	2012-02-28 20:50:01.831765352 +0800
+++ linux/include/trace/events/vmscan.h	2012-02-28 20:50:06.415765461 +0800
@@ -25,6 +25,12 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define RECLAIM_THROTTLE_TYPE					\
+		{RTT_WRITE,		"write"},		\
+		{RTT_RECENT_WRITE,	"recent_write"},	\
+		{RTT_KSWAPD,		"kswapd"},		\
+		{RTT_CLEAN,		"clean"}		\
+
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(sync & RECLAIM_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
@@ -477,6 +483,68 @@ TRACE_EVENT_CONDITION(update_swap_token_
 		  __entry->swap_token_mm, __entry->swap_token_prio)
 );
 
+TRACE_EVENT(mm_vmscan_should_throttle_dirty,
+
+	TP_PROTO(int type, int priority, int dirty_level, bool wait),
+
+	TP_ARGS(type, priority, dirty_level, wait),
+
+	TP_STRUCT__entry(
+		__field(int, type)
+		__field(int, priority)
+		__field(int, dirty_level)
+		__field(bool, wait)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+		__entry->priority = priority;
+		__entry->dirty_level = dirty_level;
+		__entry->wait = wait;
+	),
+
+	TP_printk("type=%s priority=%d dirty_level=%d wait=%d",
+		__print_symbolic(__entry->type, RECLAIM_THROTTLE_TYPE),
+		__entry->priority,
+		__entry->dirty_level,
+		__entry->wait)
+);
+
+struct mem_cgroup_zone;
+
+TRACE_EVENT(mm_vmscan_reclaim_wait,
+
+	TP_PROTO(unsigned long timeout,
+		 unsigned long delayed,
+		 struct mem_cgroup_zone *mz),
+
+	TP_ARGS(timeout, delayed, mz),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_timeout	)
+		__field(	unsigned int,	usec_delayed	)
+		__field(	unsigned int,	memcg		)
+		__field(	unsigned int,	node		)
+		__field(	unsigned int,	zone		)
+	),
+
+	TP_fast_assign(
+		__entry->usec_timeout	= jiffies_to_usecs(timeout);
+		__entry->usec_delayed	= jiffies_to_usecs(delayed);
+		__entry->memcg	= !mz->mem_cgroup ? 0 :
+					css_id(mem_cgroup_css(mz->mem_cgroup));
+		__entry->node	= zone_to_nid(mz->zone);
+		__entry->zone	= zone_idx(mz->zone);
+	),
+
+	TP_printk("usec_timeout=%u usec_delayed=%u memcg=%u node=%u zone=%u",
+			__entry->usec_timeout,
+			__entry->usec_delayed,
+			__entry->memcg,
+			__entry->node,
+			__entry->zone)
+);
+
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
--- linux.orig/include/linux/sched.h	2012-02-28 20:50:01.839765352 +0800
+++ linux/include/linux/sched.h	2012-02-28 20:50:06.415765461 +0800
@@ -1544,6 +1544,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	int nr_dirtied_snapshot; /* for detecting recent dirty activities */
 	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
--- linux.orig/include/linux/mmzone.h	2012-02-28 20:50:01.843765352 +0800
+++ linux/include/linux/mmzone.h	2012-02-28 20:50:06.415765461 +0800
@@ -662,6 +662,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t reclaim_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
--- linux.orig/mm/page_alloc.c	2012-02-28 20:50:01.823765351 +0800
+++ linux/mm/page_alloc.c	2012-02-28 20:50:06.419765461 +0800
@@ -4256,6 +4256,7 @@ static void __paginginit free_area_init_
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
--- linux.orig/mm/internal.h	2012-02-28 20:50:01.807765351 +0800
+++ linux/mm/internal.h	2012-02-28 20:50:06.419765461 +0800
@@ -91,6 +91,8 @@ extern unsigned long highest_memmap_pfn;
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 
+void reclaim_rotated(struct page *page);
+
 /*
  * in mm/page_alloc.c
  */



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 6/9] vmscan: dirty reclaim throttling
@ 2012-02-28 14:00   ` Fengguang Wu
  0 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Fengguang Wu, Linux Memory Management List, LKML

[-- Attachment #1: vmscan-pgreclaim-throttle.patch --]
[-- Type: text/plain, Size: 23777 bytes --]

1) OOM avoidance and scan rate control

Typically we do LRU scans without rate control and quickly get enough clean
pages as long as the LRU lists are not full of dirty pages.

Or we can still get a number of freshly cleaned pages (moved to the LRU tail
by end_page_writeback()) when the queued pageout I/O completes within tens of
milliseconds.

However, if the LRU list is small and full of dirty pages, it can be fully
scanned in no time and we go OOM before the flusher manages to clean enough
pages. Generally this does not happen for global reclaim, which does dirty
throttling, but it happens easily with memcg LRUs.

A simple yet reliable scheme is employed to avoid OOM and keep scan rate
in sync with the I/O rate:

	if (encountered PG_reclaim pages)
		do some throttle wait

PG_reclaim plays the key role. When a dirty page is encountered, we
queue pageout writeback work for it, set PG_reclaim and put it back to
the LRU head. So if PG_reclaim pages are encountered again, it means
they have not yet been cleaned by the flusher after a full scan of the
inactive list. It indicates we are scanning faster than the I/O and should
take a nap.

The runtime behavior on a fully dirtied small LRU list would be:
It will start with a quick scan of the list, queuing all pages for I/O.
Then the scan will be slowed down by the PG_reclaim pages *adaptively*
to match the I/O bandwidth.

2) selective dirty reclaim throttling for interactive performance

For desktops, it's not just the USB writer, but also unrelated processes
that are allocating memory at the same time the writing happens. What we
want to avoid is a situation where something like firefox or evolution
or even gnome-terminal is performing a small read and gets either

a) starved for IO bandwidth and stalls (not the focus here obviously)
b) enters page reclaim, finds PG_reclaim pages from the USB write and stalls

It's (b) we need to watch out for. So we try to only throttle the write
tasks by means of

- distinguish dirtier tasks and unrelated clean tasks by testing
   - whether __GFP_WRITE is set
   - whether current->nr_dirtied changed recently

- put dirtier tasks to wait at lower dirty fill levels (~50%) and
  clean tasks at much higher threshold (80%).

- slightly decrease wait threshold on decreased scan priority
  (which indicates long run of hard-to-reclaim pages)

3) test case

Run 2 dd tasks in a 100MB memcg (a very handy test case from Greg Thelen):

	mkdir /cgroup/x
	echo 100M > /cgroup/x/memory.limit_in_bytes
	echo $$ > /cgroup/x/tasks

	for i in `seq 2`
	do
		dd if=/dev/zero of=/fs/f$i bs=1k count=1M &
	done

Before patch, the dd tasks are quickly OOM killed.
After patch, they run well with reasonably good performance and overheads:

1073741824 bytes (1.1 GB) copied, 22.2196 s, 48.3 MB/s
1073741824 bytes (1.1 GB) copied, 22.4675 s, 47.8 MB/s

iostat -kx 1

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00  178.00     0.00 89568.00  1006.38    74.35  417.71   4.80  85.40
sda               0.00     2.00    0.00  191.00     0.00 94428.00   988.77    53.34  219.03   4.34  82.90
sda               0.00    20.00    0.00  196.00     0.00 97712.00   997.06    71.11  337.45   4.77  93.50
sda               0.00     5.00    0.00  175.00     0.00 84648.00   967.41    54.03  316.44   5.06  88.60
sda               0.00     0.00    0.00  186.00     0.00 92432.00   993.89    56.22  267.54   5.38 100.00
sda               0.00     1.00    0.00  183.00     0.00 90156.00   985.31    37.99  325.55   4.33  79.20
sda               0.00     0.00    0.00  175.00     0.00 88692.00  1013.62    48.70  218.43   4.69  82.10
sda               0.00     0.00    0.00  196.00     0.00 97528.00   995.18    43.38  236.87   5.10 100.00
sda               0.00     0.00    0.00  179.00     0.00 88648.00   990.48    45.83  285.43   5.59 100.00
sda               0.00     0.00    0.00  178.00     0.00 88500.00   994.38    28.28  158.89   4.99  88.80
sda               0.00     0.00    0.00  194.00     0.00 95852.00   988.16    32.58  167.39   5.15 100.00
sda               0.00     2.00    0.00  215.00     0.00 105996.00   986.01    41.72  201.43   4.65 100.00
sda               0.00     4.00    0.00  173.00     0.00 84332.00   974.94    50.48  260.23   5.76  99.60
sda               0.00     0.00    0.00  182.00     0.00 90312.00   992.44    36.83  212.07   5.49 100.00
sda               0.00     8.00    0.00  195.00     0.00 95940.50   984.01    50.18  221.06   5.13 100.00
sda               0.00     1.00    0.00  220.00     0.00 108852.00   989.56    40.99  202.68   4.55 100.00
sda               0.00     2.00    0.00  161.00     0.00 80384.00   998.56    37.19  268.49   6.21 100.00
sda               0.00     4.00    0.00  182.00     0.00 90830.00   998.13    50.58  239.77   5.49 100.00
sda               0.00     0.00    0.00  197.00     0.00 94877.00   963.22    36.68  196.79   5.08 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00   15.08   33.92    0.00   50.75
           0.25    0.00   14.54   35.09    0.00   50.13
           0.50    0.00   13.57   32.41    0.00   53.52
           0.50    0.00   11.28   36.84    0.00   51.38
           0.50    0.00   15.75   32.00    0.00   51.75
           0.50    0.00   10.50   34.00    0.00   55.00
           0.50    0.00   17.63   27.46    0.00   54.41
           0.50    0.00   15.08   30.90    0.00   53.52
           0.50    0.00   11.28   32.83    0.00   55.39
           0.75    0.00   16.79   26.82    0.00   55.64
           0.50    0.00   16.08   29.15    0.00   54.27
           0.50    0.00   13.50   30.50    0.00   55.50
           0.50    0.00   14.32   35.18    0.00   50.00
           0.50    0.00   12.06   33.92    0.00   53.52
           0.50    0.00   17.29   30.58    0.00   51.63
           0.50    0.00   15.08   29.65    0.00   54.77
           0.50    0.00   12.53   29.32    0.00   57.64
           0.50    0.00   15.29   31.83    0.00   52.38

The global dd numbers for comparison:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sda               0.00     0.00    0.00  189.00     0.00 95752.00  1013.25   143.09  684.48   5.29 100.00
sda               0.00     0.00    0.00  208.00     0.00 105480.00  1014.23   143.06  733.29   4.81 100.00
sda               0.00     0.00    0.00  161.00     0.00 81924.00  1017.69   141.71  757.79   6.21 100.00
sda               0.00     0.00    0.00  217.00     0.00 109580.00  1009.95   143.09  749.55   4.61 100.10
sda               0.00     0.00    0.00  187.00     0.00 94728.00  1013.13   144.31  773.67   5.35 100.00
sda               0.00     0.00    0.00  189.00     0.00 95752.00  1013.25   144.14  742.00   5.29 100.00
sda               0.00     0.00    0.00  177.00     0.00 90032.00  1017.31   143.32  656.59   5.65 100.00
sda               0.00     0.00    0.00  215.00     0.00 108640.00  1010.60   142.90  817.54   4.65 100.00
sda               0.00     2.00    0.00  166.00     0.00 83858.00  1010.34   143.64  808.61   6.02 100.00
sda               0.00     0.00    0.00  186.00     0.00 92813.00   997.99   141.18  736.95   5.38 100.00
sda               0.00     0.00    0.00  206.00     0.00 104456.00  1014.14   146.27  729.33   4.85 100.00
sda               0.00     0.00    0.00  213.00     0.00 107024.00  1004.92   143.25  705.70   4.69 100.00
sda               0.00     0.00    0.00  188.00     0.00 95748.00  1018.60   141.82  764.78   5.32 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.51    0.00   11.22   52.30    0.00   35.97
           0.25    0.00   10.15   52.54    0.00   37.06
           0.25    0.00    5.01   56.64    0.00   38.10
           0.51    0.00   15.15   43.94    0.00   40.40
           0.25    0.00   12.12   48.23    0.00   39.39
           0.51    0.00   11.20   53.94    0.00   34.35
           0.26    0.00    9.72   51.41    0.00   38.62
           0.76    0.00    9.62   50.63    0.00   38.99
           0.51    0.00   10.46   53.32    0.00   35.71
           0.51    0.00    9.41   51.91    0.00   38.17
           0.25    0.00   10.69   49.62    0.00   39.44
           0.51    0.00   12.21   52.67    0.00   34.61
           0.51    0.00   11.45   53.18    0.00   34.86

XXX: commit NFS unstable pages via write_inode(). However, it's currently
not possible to specify a range of pages to commit.

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/mmzone.h        |    1 
 include/linux/sched.h         |    1 
 include/linux/writeback.h     |    2 
 include/trace/events/vmscan.h |   68 ++++++++++
 mm/internal.h                 |    2 
 mm/page-writeback.c           |    2 
 mm/page_alloc.c               |    1 
 mm/swap.c                     |    4 
 mm/vmscan.c                   |  211 ++++++++++++++++++++++++++++++--
 9 files changed, 278 insertions(+), 14 deletions(-)

--- linux.orig/include/linux/writeback.h	2012-02-28 20:50:01.855765353 +0800
+++ linux/include/linux/writeback.h	2012-02-28 20:50:06.411765461 +0800
@@ -136,6 +136,8 @@ static inline void laptop_sync_completio
 #endif
 void throttle_vm_writeout(gfp_t gfp_mask);
 bool zone_dirty_ok(struct zone *zone);
+unsigned long zone_dirtyable_memory(struct zone *zone);
+unsigned long global_dirtyable_memory(void);
 
 extern unsigned long global_dirty_limit;
 
--- linux.orig/mm/page-writeback.c	2012-02-28 20:50:01.799765351 +0800
+++ linux/mm/page-writeback.c	2012-02-28 20:50:06.411765461 +0800
@@ -263,7 +263,7 @@ void global_dirty_limits(unsigned long *
  * Returns the zone's number of pages potentially available for dirty
  * page cache.  This is the base value for the per-zone dirty limits.
  */
-static unsigned long zone_dirtyable_memory(struct zone *zone)
+unsigned long zone_dirtyable_memory(struct zone *zone)
 {
 	/*
 	 * The effective global number of dirtyable pages may exclude
--- linux.orig/mm/swap.c	2012-02-28 20:50:01.791765351 +0800
+++ linux/mm/swap.c	2012-02-28 20:50:06.411765461 +0800
@@ -270,8 +270,10 @@ void rotate_reclaimable_page(struct page
 		page_cache_get(page);
 		local_irq_save(flags);
 		pvec = &__get_cpu_var(lru_rotate_pvecs);
-		if (!pagevec_add(pvec, page))
+		if (!pagevec_add(pvec, page)) {
 			pagevec_move_tail(pvec);
+			reclaim_rotated(page);
+		}
 		local_irq_restore(flags);
 	}
 }
--- linux.orig/mm/vmscan.c	2012-02-28 20:50:01.815765352 +0800
+++ linux/mm/vmscan.c	2012-02-28 20:52:56.227769496 +0800
@@ -50,9 +50,6 @@
 
 #include "internal.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/vmscan.h>
-
 /*
  * reclaim_mode determines how the inactive list is shrunk
  * RECLAIM_MODE_SINGLE: Reclaim only order-0 pages
@@ -120,6 +117,40 @@ struct mem_cgroup_zone {
 	struct zone *zone;
 };
 
+/*
+ * page reclaim dirty throttle thresholds:
+ *
+ *              20%           40%    50%    60%           80%
+ * |------+------+------+------+------+------+------+------+------+------|
+ * 0             ^BALANCED     ^THROTTLE_WRITE	           ^THROTTLE_ALL
+ *
+ * The LRU dirty pages should normally be under the 20% global balance ratio.
+ * When exceeding 40%, the allocations for writes will be throttled; if that
+ * fails to keep the dirty pages under control, we'll have to throttle all
+ * tasks when above 80%, to avoid spinning the CPU.
+ *
+ * We start throttling KSWAPD before ALL, hoping to trigger more direct reclaims
+ * to throttle the write tasks and keep dirty pages from growing further.
+ */
+enum reclaim_dirty_level {
+	DIRTY_LEVEL_BALANCED			= 2,
+	DIRTY_LEVEL_THROTTLE_WRITE		= 4,
+	DIRTY_LEVEL_THROTTLE_RECENT_WRITE	= 5,
+	DIRTY_LEVEL_THROTTLE_KSWAPD		= 6,
+	DIRTY_LEVEL_THROTTLE_ALL		= 8,
+	DIRTY_LEVEL_MAX				= 10
+};
+enum reclaim_throttle_type {
+	RTT_WRITE,
+	RTT_RECENT_WRITE,
+	RTT_KSWAPD,
+	RTT_CLEAN,
+	RTT_MAX
+};
+
+#define CREATE_TRACE_POINTS
+#include <trace/events/vmscan.h>
+
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
 
 #ifdef ARCH_HAS_PREFETCH
@@ -767,7 +798,8 @@ static unsigned long shrink_page_list(st
 				      struct scan_control *sc,
 				      int priority,
 				      unsigned long *ret_nr_dirty,
-				      unsigned long *ret_nr_writeback)
+				      unsigned long *ret_nr_writeback,
+				      unsigned long *ret_nr_pgreclaim)
 {
 	LIST_HEAD(ret_pages);
 	LIST_HEAD(free_pages);
@@ -776,6 +808,7 @@ static unsigned long shrink_page_list(st
 	unsigned long nr_congested = 0;
 	unsigned long nr_reclaimed = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 
 	cond_resched();
 
@@ -814,6 +847,14 @@ static unsigned long shrink_page_list(st
 		if (PageWriteback(page)) {
 			nr_writeback++;
 			/*
+			 * The pageout works do write-around, which may put
+			 * close-to-LRU-tail pages under writeback a bit earlier.
+			 */
+			if (PageReclaim(page))
+				nr_pgreclaim++;
+			else
+				SetPageReclaim(page);
+			/*
 			 * Synchronous reclaim cannot queue pages for
 			 * writeback due to the possibility of stack overflow
 			 * but if it encounters a page under writeback, wait
@@ -885,21 +926,45 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 
 			/*
+			 * The PG_reclaim page was put under I/O and moved to
+			 * the LRU head some time ago. If we hit it again at the
+			 * LRU tail, we may be scanning faster than the flusher
+			 * can write out dirty pages, so suggest some
+			 * reclaim_wait throttling to match the I/O rate.
+			 */
+			if (page_is_file_cache(page) && PageReclaim(page)) {
+				nr_pgreclaim++;
+				goto keep_locked;
+			}
+
+			/*
 			 * Try relaying the pageout I/O to the flusher threads
 			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) && mapping &&
-			    queue_pageout_work(mapping, page) >= 0) {
+			if (page_is_file_cache(page) && mapping) {
+				int res = queue_pageout_work(mapping, page);
+
+				/*
+				 * It's not really PG_reclaim, but here we need
+				 * to trigger reclaim_wait to avoid overrunning
+				 * the I/O when there are too many works queued,
+				 * or when no new work can be queued at all.
+				 */
+				if (res < 0 || res > LOTS_OF_WRITEBACK_WORKS)
+					nr_pgreclaim++;
+
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
 				 * except we already have the page isolated
 				 * and know it's dirty
 				 */
-				inc_zone_page_state(page, NR_VMSCAN_IMMEDIATE);
-				SetPageReclaim(page);
-
-				goto keep_locked;
+				if (res >= 0) {
+					inc_zone_page_state(page,
+							NR_VMSCAN_IMMEDIATE);
+					SetPageReclaim(page);
+					goto keep_locked;
+				}
 			}
 
 			/*
@@ -1043,6 +1108,7 @@ keep_lumpy:
 	count_vm_events(PGACTIVATE, pgactivate);
 	*ret_nr_dirty += nr_dirty;
 	*ret_nr_writeback += nr_writeback;
+	*ret_nr_pgreclaim += nr_pgreclaim;
 	return nr_reclaimed;
 }
 
@@ -1508,6 +1574,117 @@ static inline bool should_reclaim_stall(
 	return priority <= lumpy_stall_priority;
 }
 
+static int reclaim_dirty_level(unsigned long dirty,
+			       unsigned long total)
+{
+	unsigned long limit = total * global_dirty_limit /
+					global_dirtyable_memory();
+	if (dirty <= limit)
+		return 0;
+
+	dirty -= limit;
+	total -= limit;
+
+	return 8 * dirty / (total | 1) + DIRTY_LEVEL_BALANCED;
+}
+
+static bool should_throttle_dirty(struct mem_cgroup_zone *mz,
+				  struct scan_control *sc,
+				  int priority)
+{
+	unsigned long nr_dirty;
+	unsigned long nr_dirtyable;
+	int dirty_level = -1;
+	int level;
+	int type;
+	bool wait;
+
+	if (global_reclaim(sc)) {
+		struct zone *zone = mz->zone;
+		nr_dirty = zone_page_state(zone, NR_FILE_DIRTY) +
+				zone_page_state(zone, NR_UNSTABLE_NFS) +
+				zone_page_state(zone, NR_WRITEBACK);
+		nr_dirtyable = zone_dirtyable_memory(zone);
+	} else {
+		struct mem_cgroup *memcg = mz->mem_cgroup;
+		nr_dirty = mem_cgroup_dirty_pages(memcg);
+		nr_dirtyable = mem_cgroup_page_stat(memcg,
+						    MEMCG_NR_DIRTYABLE_PAGES);
+		trace_printk("memcg nr_dirtyable=%lu nr_dirty=%lu\n",
+			     nr_dirtyable, nr_dirty);
+	}
+
+	dirty_level = reclaim_dirty_level(nr_dirty, nr_dirtyable);
+	/*
+	 * Take a snap when a long contiguous run of dirty pages is encountered.
+	 * When under the global dirty limit, kswapd will only wait at priority==0,
+	 * and the clean tasks will never wait.
+	 */
+	level = dirty_level + (DEF_PRIORITY - priority) / 2;
+
+	if (current_is_kswapd()) {
+		type = RTT_KSWAPD;
+		wait = level >= DIRTY_LEVEL_THROTTLE_KSWAPD;
+		goto out;
+	}
+
+	if (sc->gfp_mask & __GFP_WRITE) {
+		type = RTT_WRITE;
+		wait = level >= DIRTY_LEVEL_THROTTLE_WRITE;
+		goto out;
+	}
+
+	if (current->nr_dirtied != current->nr_dirtied_snapshot) {
+		type = RTT_RECENT_WRITE;
+		wait = level >= DIRTY_LEVEL_THROTTLE_RECENT_WRITE;
+		current->nr_dirtied_snapshot = current->nr_dirtied;
+		goto out;
+	}
+
+	type = RTT_CLEAN;
+	wait = level >= DIRTY_LEVEL_THROTTLE_ALL;
+out:
+	if (wait) {
+		trace_mm_vmscan_should_throttle_dirty(type, priority,
+						      dirty_level, wait);
+	}
+	return wait;
+}
+
+
+/*
+ * reclaim_wait - wait for some pages to be rotated to the LRU tail
+ * @mz: the memcg/zone pair under page reclaim
+ * @timeout: timeout in jiffies
+ *
+ * Wait until @timeout, or until some (typically PG_reclaim, under writeback)
+ * pages are rotated to the LRU tail so that page reclaim can make progress.
+ */
+static long reclaim_wait(struct mem_cgroup_zone *mz, long timeout)
+{
+	unsigned long start = jiffies;
+	wait_queue_head_t *wqh;
+	DEFINE_WAIT(wait);
+	long ret;
+
+	wqh = &mz->zone->zone_pgdat->reclaim_wait;
+	prepare_to_wait(wqh, &wait, TASK_KILLABLE);
+	ret = io_schedule_timeout(timeout);
+	finish_wait(wqh, &wait);
+
+	trace_mm_vmscan_reclaim_wait(timeout, jiffies - start, mz);
+
+	return ret;
+}
+
+void reclaim_rotated(struct page *page)
+{
+	wait_queue_head_t *wqh = &NODE_DATA(page_to_nid(page))->reclaim_wait;
+
+	if (waitqueue_active(wqh))
+		wake_up(wqh);
+}
+
 /*
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
@@ -1524,6 +1701,7 @@ shrink_inactive_list(unsigned long nr_to
 	unsigned long nr_file;
 	unsigned long nr_dirty = 0;
 	unsigned long nr_writeback = 0;
+	unsigned long nr_pgreclaim = 0;
 	isolate_mode_t reclaim_mode = ISOLATE_INACTIVE;
 	struct zone *zone = mz->zone;
 
@@ -1574,13 +1752,13 @@ shrink_inactive_list(unsigned long nr_to
 	spin_unlock_irq(&zone->lru_lock);
 
 	nr_reclaimed = shrink_page_list(&page_list, mz, sc, priority,
-						&nr_dirty, &nr_writeback);
+				&nr_dirty, &nr_writeback, &nr_pgreclaim);
 
 	/* Check if we should syncronously wait for writeback */
 	if (should_reclaim_stall(nr_taken, nr_reclaimed, priority, sc)) {
 		set_reclaim_mode(priority, sc, true);
 		nr_reclaimed += shrink_page_list(&page_list, mz, sc,
-					priority, &nr_dirty, &nr_writeback);
+			priority, &nr_dirty, &nr_writeback, &nr_pgreclaim);
 	}
 
 	spin_lock_irq(&zone->lru_lock);
@@ -1623,6 +1801,15 @@ shrink_inactive_list(unsigned long nr_to
 	 */
 	if (nr_writeback && nr_writeback >= (nr_taken >> (DEF_PRIORITY-priority)))
 		wait_iff_congested(zone, BLK_RW_ASYNC, HZ/10);
+	/*
+	 * If any pages were reclaimed, we are safe from busy scanning. Otherwise,
+	 * when PG_reclaim pages were encountered or the writeback work queue is
+	 * congested, consider I/O throttling. Try to throttle only the dirtier
+	 * tasks, by granting higher throttle thresholds to kswapd and the other
+	 * clean tasks.
+	 */
+	if (!nr_reclaimed && nr_pgreclaim &&
+	    should_throttle_dirty(mz, sc, priority))
+		reclaim_wait(mz, HZ/10);
 
 	trace_mm_vmscan_lru_shrink_inactive(zone->zone_pgdat->node_id,
 		zone_idx(zone),
--- linux.orig/include/trace/events/vmscan.h	2012-02-28 20:50:01.831765352 +0800
+++ linux/include/trace/events/vmscan.h	2012-02-28 20:50:06.415765461 +0800
@@ -25,6 +25,12 @@
 		{RECLAIM_WB_ASYNC,	"RECLAIM_WB_ASYNC"}	\
 		) : "RECLAIM_WB_NONE"
 
+#define RECLAIM_THROTTLE_TYPE					\
+		{RTT_WRITE,		"write"},		\
+		{RTT_RECENT_WRITE,	"recent_write"},	\
+		{RTT_KSWAPD,		"kswapd"},		\
+		{RTT_CLEAN,		"clean"}		\
+
 #define trace_reclaim_flags(page, sync) ( \
 	(page_is_file_cache(page) ? RECLAIM_WB_FILE : RECLAIM_WB_ANON) | \
 	(sync & RECLAIM_MODE_SYNC ? RECLAIM_WB_SYNC : RECLAIM_WB_ASYNC)   \
@@ -477,6 +483,68 @@ TRACE_EVENT_CONDITION(update_swap_token_
 		  __entry->swap_token_mm, __entry->swap_token_prio)
 );
 
+TRACE_EVENT(mm_vmscan_should_throttle_dirty,
+
+	TP_PROTO(int type, int priority, int dirty_level, bool wait),
+
+	TP_ARGS(type, priority, dirty_level, wait),
+
+	TP_STRUCT__entry(
+		__field(int, type)
+		__field(int, priority)
+		__field(int, dirty_level)
+		__field(bool, wait)
+	),
+
+	TP_fast_assign(
+		__entry->type = type;
+		__entry->priority = priority;
+		__entry->dirty_level = dirty_level;
+		__entry->wait = wait;
+	),
+
+	TP_printk("type=%s priority=%d dirty_level=%d wait=%d",
+		__print_symbolic(__entry->type, RECLAIM_THROTTLE_TYPE),
+		__entry->priority,
+		__entry->dirty_level,
+		__entry->wait)
+);
+
+struct mem_cgroup_zone;
+
+TRACE_EVENT(mm_vmscan_reclaim_wait,
+
+	TP_PROTO(unsigned long timeout,
+		 unsigned long delayed,
+		 struct mem_cgroup_zone *mz),
+
+	TP_ARGS(timeout, delayed, mz),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	usec_timeout	)
+		__field(	unsigned int,	usec_delayed	)
+		__field(	unsigned int,	memcg		)
+		__field(	unsigned int,	node		)
+		__field(	unsigned int,	zone		)
+	),
+
+	TP_fast_assign(
+		__entry->usec_timeout	= jiffies_to_usecs(timeout);
+		__entry->usec_delayed	= jiffies_to_usecs(delayed);
+		__entry->memcg	= !mz->mem_cgroup ? 0 :
+					css_id(mem_cgroup_css(mz->mem_cgroup));
+		__entry->node	= zone_to_nid(mz->zone);
+		__entry->zone	= zone_idx(mz->zone);
+	),
+
+	TP_printk("usec_timeout=%u usec_delayed=%u memcg=%u node=%u zone=%u",
+			__entry->usec_timeout,
+			__entry->usec_delayed,
+			__entry->memcg,
+			__entry->node,
+			__entry->zone)
+);
+
 #endif /* _TRACE_VMSCAN_H */
 
 /* This part must be outside protection */
--- linux.orig/include/linux/sched.h	2012-02-28 20:50:01.839765352 +0800
+++ linux/include/linux/sched.h	2012-02-28 20:50:06.415765461 +0800
@@ -1544,6 +1544,7 @@ struct task_struct {
 	 */
 	int nr_dirtied;
 	int nr_dirtied_pause;
+	int nr_dirtied_snapshot; /* for detecting recent dirty activities */
 	unsigned long dirty_paused_when; /* start of a write-and-pause period */
 
 #ifdef CONFIG_LATENCYTOP
--- linux.orig/include/linux/mmzone.h	2012-02-28 20:50:01.843765352 +0800
+++ linux/include/linux/mmzone.h	2012-02-28 20:50:06.415765461 +0800
@@ -662,6 +662,7 @@ typedef struct pglist_data {
 					     range, including holes */
 	int node_id;
 	wait_queue_head_t kswapd_wait;
+	wait_queue_head_t reclaim_wait;
 	struct task_struct *kswapd;
 	int kswapd_max_order;
 	enum zone_type classzone_idx;
--- linux.orig/mm/page_alloc.c	2012-02-28 20:50:01.823765351 +0800
+++ linux/mm/page_alloc.c	2012-02-28 20:50:06.419765461 +0800
@@ -4256,6 +4256,7 @@ static void __paginginit free_area_init_
 	pgdat_resize_init(pgdat);
 	pgdat->nr_zones = 0;
 	init_waitqueue_head(&pgdat->kswapd_wait);
+	init_waitqueue_head(&pgdat->reclaim_wait);
 	pgdat->kswapd_max_order = 0;
 	pgdat_page_cgroup_init(pgdat);
 	
--- linux.orig/mm/internal.h	2012-02-28 20:50:01.807765351 +0800
+++ linux/mm/internal.h	2012-02-28 20:50:06.419765461 +0800
@@ -91,6 +91,8 @@ extern unsigned long highest_memmap_pfn;
 extern int isolate_lru_page(struct page *page);
 extern void putback_lru_page(struct page *page);
 
+void reclaim_rotated(struct page *page);
+
 /*
  * in mm/page_alloc.c
  */
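
To make the threshold arithmetic above concrete, here is a small stand-alone
sketch that re-runs the reclaim_dirty_level() formula with made-up numbers.
The 20% balance ratio (the default vm.dirty_ratio) and the page counts are
illustrative assumptions only, and the limit is passed in directly instead of
being derived from global_dirty_limit / global_dirtyable_memory() as the real
function does.

#include <stdio.h>

enum { DIRTY_LEVEL_BALANCED = 2 };

/* Same arithmetic as the patch, with the limit precomputed by the caller. */
static int reclaim_dirty_level(unsigned long dirty, unsigned long total,
			       unsigned long limit)
{
	if (dirty <= limit)
		return 0;
	dirty -= limit;
	total -= limit;
	return 8 * dirty / (total | 1) + DIRTY_LEVEL_BALANCED;
}

int main(void)
{
	unsigned long total = 100000;		/* dirtyable pages in the zone */
	unsigned long limit = total / 5;	/* assumed 20% balance ratio */

	/*
	 * Prints "4 6 8": just past 40%, 60% and 80% dirty maps to
	 * THROTTLE_WRITE, THROTTLE_KSWAPD and THROTTLE_ALL in the diagram
	 * above (exact multiples land one level lower because of the
	 * integer division against (total | 1)).
	 */
	printf("%d %d %d\n",
	       reclaim_dirty_level(40001, total, limit),
	       reclaim_dirty_level(60001, total, limit),
	       reclaim_dirty_level(80001, total, limit));
	return 0;
}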



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 7/9] mm: pass __GFP_WRITE to memcg charge and reclaim routines
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Fengguang Wu, Linux Memory Management List, LKML

[-- Attachment #1: memcg-pass-__GFP_WRITE-to-reclaim.patch --]
[-- Type: text/plain, Size: 1874 bytes --]

__GFP_WRITE will be tested in vmscan to identify the write tasks.

For good interactive performance, we try to focus dirty reclaim waits on
them and avoid blocking unrelated tasks.
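
For illustration only, here is a tiny user-space model of the consumer side
(the real check lives in should_throttle_dirty() added by patch 6, which also
special-cases kswapd and recently-dirtying tasks). The flag value below is a
stand-in, not the kernel's __GFP_WRITE bit:

#include <stdbool.h>
#include <stdio.h>

#define FAKE_GFP_WRITE			(1u << 12)	/* stand-in bit */
#define DIRTY_LEVEL_THROTTLE_WRITE	4
#define DIRTY_LEVEL_THROTTLE_ALL	8

static bool should_wait(unsigned int gfp_mask, int dirty_level)
{
	if (gfp_mask & FAKE_GFP_WRITE)	/* the caller is dirtying pages */
		return dirty_level >= DIRTY_LEVEL_THROTTLE_WRITE;
	/* clean tasks are only throttled as a last resort */
	return dirty_level >= DIRTY_LEVEL_THROTTLE_ALL;
}

int main(void)
{
	/* at dirty level 5, only the writer is made to wait */
	printf("writer waits: %d, clean task waits: %d\n",
	       should_wait(FAKE_GFP_WRITE, 5), should_wait(0, 5));
	return 0;
}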

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/gfp.h |    2 +-
 mm/filemap.c        |   13 +++++++------
 2 files changed, 8 insertions(+), 7 deletions(-)

--- linux.orig/include/linux/gfp.h	2012-02-28 10:22:24.000000000 +0800
+++ linux/include/linux/gfp.h	2012-02-28 10:22:42.936316697 +0800
@@ -129,7 +129,7 @@ struct vm_area_struct;
 /* Control page allocator reclaim behavior */
 #define GFP_RECLAIM_MASK (__GFP_WAIT|__GFP_HIGH|__GFP_IO|__GFP_FS|\
 			__GFP_NOWARN|__GFP_REPEAT|__GFP_NOFAIL|\
-			__GFP_NORETRY|__GFP_NOMEMALLOC)
+			__GFP_NORETRY|__GFP_NOMEMALLOC|__GFP_WRITE)
 
 /* Control slab gfp mask during early boot */
 #define GFP_BOOT_MASK (__GFP_BITS_MASK & ~(__GFP_WAIT|__GFP_IO|__GFP_FS))
--- linux.orig/mm/filemap.c	2012-02-28 10:22:25.000000000 +0800
+++ linux/mm/filemap.c	2012-02-28 10:24:12.320318821 +0800
@@ -2340,21 +2340,22 @@ struct page *grab_cache_page_write_begin
 	int status;
 	gfp_t gfp_mask;
 	struct page *page;
-	gfp_t gfp_notmask = 0;
+	gfp_t lru_gfp_mask = GFP_KERNEL | __GFP_WRITE;
 
 	gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
-	if (flags & AOP_FLAG_NOFS)
-		gfp_notmask = __GFP_FS;
+	if (flags & AOP_FLAG_NOFS) {
+		gfp_mask &= ~__GFP_FS;
+		lru_gfp_mask &= ~__GFP_FS;
+	}
 repeat:
 	page = find_lock_page(mapping, index);
 	if (page)
 		goto found;
 
-	page = __page_cache_alloc(gfp_mask & ~gfp_notmask);
+	page = __page_cache_alloc(gfp_mask);
 	if (!page)
 		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index,
-						GFP_KERNEL & ~gfp_notmask);
+	status = add_to_page_cache_lru(page, mapping, index, lru_gfp_mask);
 	if (unlikely(status)) {
 		page_cache_release(page);
 		if (status == -EEXIST)



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 8/9] mm: dont set __GFP_WRITE on ramfs/sysfs writes
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Johannes Weiner, Fengguang Wu,
	Linux Memory Management List, LKML

[-- Attachment #1: mm-__GFP_WRITE-cap_account_dirty.patch --]
[-- Type: text/plain, Size: 886 bytes --]

Try to avoid page reclaim waits when writing to ramfs/sysfs etc.

Maybe not a big deal...
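
For context, the mapping_cap_account_dirty() test below keys off the backing
device's capabilities: filesystems like ramfs opt out of dirty accounting, so
their writers never get the __GFP_WRITE hint and are never dirty-throttled by
reclaim. A rough user-space model (stand-in flag value and struct, not kernel
code):

#include <stdbool.h>
#include <stdio.h>

#define FAKE_BDI_CAP_NO_ACCT_DIRTY	0x02u	/* stand-in for the real bit */

struct fake_bdi { unsigned int capabilities; };

/* Mirrors the spirit of mapping_cap_account_dirty() */
static bool cap_account_dirty(const struct fake_bdi *bdi)
{
	return !(bdi->capabilities & FAKE_BDI_CAP_NO_ACCT_DIRTY);
}

int main(void)
{
	struct fake_bdi disk_fs  = { .capabilities = 0 };
	struct fake_bdi ramfs_fs = { .capabilities = FAKE_BDI_CAP_NO_ACCT_DIRTY };

	printf("disk fs gets __GFP_WRITE: %d, ramfs gets it: %d\n",
	       cap_account_dirty(&disk_fs), cap_account_dirty(&ramfs_fs));
	return 0;
}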

CC: Johannes Weiner <jweiner@redhat.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 mm/filemap.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

--- linux.orig/mm/filemap.c	2012-02-28 10:24:12.000000000 +0800
+++ linux/mm/filemap.c	2012-02-28 10:25:55.568321275 +0800
@@ -2340,9 +2340,13 @@ struct page *grab_cache_page_write_begin
 	int status;
 	gfp_t gfp_mask;
 	struct page *page;
-	gfp_t lru_gfp_mask = GFP_KERNEL | __GFP_WRITE;
+	gfp_t lru_gfp_mask = GFP_KERNEL;
 
-	gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;
+	gfp_mask = mapping_gfp_mask(mapping);
+	if (mapping_cap_account_dirty(mapping)) {
+		gfp_mask |= __GFP_WRITE;
+		lru_gfp_mask |= __GFP_WRITE;
+	}
 	if (flags & AOP_FLAG_NOFS) {
 		gfp_mask &= ~__GFP_FS;
 		lru_gfp_mask &= ~__GFP_FS;



^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH 9/9] mm: debug vmscan waits
  2012-02-28 14:00 ` Fengguang Wu
@ 2012-02-28 14:00   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 14:00 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Wu Fengguang, Linux Memory Management List, LKML

[-- Attachment #1: mm-debugfs-vmscan-stalls.patch --]
[-- Type: text/plain, Size: 8789 bytes --]

Create /debug/vm/ and export some page reclaim wait counters.

nr_migrate_wait_writeback	wait_on_page_writeback() on migration
nr_reclaim_wait_congested	wait_iff_congested() sleeps
nr_reclaim_wait_writeback	wait_on_page_writeback() on vmscan
nr_congestion_wait		congestion_wait() sleeps
nr_reclaim_throttle_*		reclaim_wait() sleeps

/debug/vm/ might be a convenient place for kernel hackers to play with VM
variables; however, this whole thing is mainly a convenient debugging hack...

It shows that it's now pretty hard to trigger reclaim waits.

1) zero waits on

	truncate -s 100T /fs/100T    
	dd if=/fs/100T of=/dev/null bs=4k &
	dd if=/dev/zero of=/fs/zero bs=4k &

# grep -r . /debug/vm; grep -E '(nr_vmscan_write|allocstall)' /proc/vmstat
/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:0
/debug/vm/nr_reclaim_throttle_recent_write:0
/debug/vm/nr_reclaim_throttle_write:0
/debug/vm/nr_congestion_wait:0
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0
nr_vmscan_write 0
allocstall 0

2) some waits on (together with 1)

	usemem 5G --sleep 1000& # mem=8GB

/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:39
/debug/vm/nr_reclaim_throttle_recent_write:1
/debug/vm/nr_reclaim_throttle_write:288
/debug/vm/nr_congestion_wait:13
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0
nr_vmscan_write 690
allocstall 267675

echo 0 > /debug/vm/* # before doing 3) 

3) some waits on (together with 1,2)

	startx
	start lots of X app and switch among them in a loop

/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:0
/debug/vm/nr_reclaim_throttle_recent_write:0
/debug/vm/nr_reclaim_throttle_write:0
/debug/vm/nr_congestion_wait:19
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0
nr_vmscan_write 694
allocstall 270880

4) some waits on (together with 1,2,3)

	swapon -a

/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:0
/debug/vm/nr_reclaim_throttle_recent_write:0
/debug/vm/nr_reclaim_throttle_write:2145
/debug/vm/nr_congestion_wait:47
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0
nr_vmscan_write 42768
allocstall 416735

5) reset counters and stress it more.

	# usemem 1G --sleep 1000&
	# free
		     total       used       free     shared    buffers     cached
	Mem:          6801       6758         42          0          0        994
	-/+ buffers/cache:       5764       1036
	Swap:        51106        235      50870

It's now obviously slow: it takes seconds, or even 10+ seconds, to switch to
the other windows:

  765.30    A System Monitor
  769.72    A Dictionary
  772.01    A Home
  790.79    A Desktop Help
  795.47    A *Unsaved Document 1 - gedit
  813.01    A ALC888.svg  (1/11)
  819.24    A Restore Session - Iceweasel
  827.23    A Klondike
  853.57    A urxvt
  862.49    A xeyes
  868.67    A Xpdf: /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
  869.47    A snb:/home/wfg - ZSH

And it seems that the slowness is caused by the huge number of pageout()s:

/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:0
/debug/vm/nr_reclaim_throttle_recent_write:0
/debug/vm/nr_reclaim_throttle_write:307
/debug/vm/nr_congestion_wait:0
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0
nr_vmscan_write 175085
allocstall 669671

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/backing-dev.c |   10 ++++++++
 mm/internal.h    |    5 ++++
 mm/migrate.c     |    3 ++
 mm/vmscan.c      |   55 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 71 insertions(+), 2 deletions(-)

--- linux.orig/mm/vmscan.c	2012-02-28 18:54:57.000000000 +0800
+++ linux/mm/vmscan.c	2012-02-28 18:55:15.657047580 +0800
@@ -790,6 +790,8 @@ static enum page_references page_check_r
 	return PAGEREF_RECLAIM;
 }
 
+u32 nr_reclaim_wait_writeback;
+
 /*
  * shrink_page_list() returns the number of reclaimed pages
  */
@@ -861,9 +863,10 @@ static unsigned long shrink_page_list(st
 			 * for the IO to complete.
 			 */
 			if ((sc->reclaim_mode & RECLAIM_MODE_SYNC) &&
-			    may_enter_fs)
+			    may_enter_fs) {
 				wait_on_page_writeback(page);
-			else {
+				nr_reclaim_wait_writeback++;
+			} else {
 				unlock_page(page);
 				goto keep_lumpy;
 			}
@@ -1573,6 +1576,7 @@ static inline bool should_reclaim_stall(
 
 	return priority <= lumpy_stall_priority;
 }
+u32 nr_reclaim_throttle[RTT_MAX];
 
 static int reclaim_dirty_level(unsigned long dirty,
 			       unsigned long total)
@@ -1643,6 +1647,7 @@ static bool should_throttle_dirty(struct
 	wait = level >= DIRTY_LEVEL_THROTTLE_ALL;
 out:
 	if (wait) {
+		nr_reclaim_throttle[type]++;
 		trace_mm_vmscan_should_throttle_dirty(type, priority,
 						      dirty_level, wait);
 	}
@@ -3811,3 +3816,49 @@ void scan_unevictable_unregister_node(st
 	device_remove_file(&node->dev, &dev_attr_scan_unevictable_pages);
 }
 #endif
+
+#if defined(CONFIG_DEBUG_FS)
+#include <linux/debugfs.h>
+
+static struct dentry *vm_debug_root;
+
+static int __init vm_debug_init(void)
+{
+	struct dentry *dentry;
+
+	vm_debug_root = debugfs_create_dir("vm", NULL);
+	if (!vm_debug_root)
+		goto fail;
+
+#ifdef CONFIG_MIGRATION
+	dentry = debugfs_create_u32("nr_migrate_wait_writeback", 0644,
+				    vm_debug_root, &nr_migrate_wait_writeback);
+#endif
+
+	dentry = debugfs_create_u32("nr_reclaim_wait_writeback", 0644,
+				    vm_debug_root, &nr_reclaim_wait_writeback);
+
+	dentry = debugfs_create_u32("nr_reclaim_wait_congested", 0644,
+				    vm_debug_root, &nr_reclaim_wait_congested);
+
+	dentry = debugfs_create_u32("nr_congestion_wait", 0644,
+				    vm_debug_root, &nr_congestion_wait);
+
+	dentry = debugfs_create_u32("nr_reclaim_throttle_write", 0644,
+			vm_debug_root, nr_reclaim_throttle + RTT_WRITE);
+	dentry = debugfs_create_u32("nr_reclaim_throttle_recent_write", 0644,
+			vm_debug_root, nr_reclaim_throttle + RTT_RECENT_WRITE);
+	dentry = debugfs_create_u32("nr_reclaim_throttle_kswapd", 0644,
+			vm_debug_root, nr_reclaim_throttle + RTT_KSWAPD);
+	dentry = debugfs_create_u32("nr_reclaim_throttle_clean", 0644,
+			vm_debug_root, nr_reclaim_throttle + RTT_CLEAN);
+	if (!dentry)
+		goto fail;
+
+	return 0;
+fail:
+	return -ENOMEM;
+}
+
+module_init(vm_debug_init);
+#endif /* CONFIG_DEBUG_FS */
--- linux.orig/mm/migrate.c	2012-02-28 18:54:46.000000000 +0800
+++ linux/mm/migrate.c	2012-02-28 18:55:03.085047281 +0800
@@ -674,6 +674,8 @@ static int move_to_new_page(struct page
 	return rc;
 }
 
+u32 nr_migrate_wait_writeback;
+
 static int __unmap_and_move(struct page *page, struct page *newpage,
 			int force, bool offlining, enum migrate_mode mode)
 {
@@ -742,6 +744,7 @@ static int __unmap_and_move(struct page
 		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
+		nr_migrate_wait_writeback++;
 	}
 	/*
 	 * By try_to_unmap(), page->mapcount goes down to 0 here. In this case,
--- linux.orig/mm/internal.h	2012-02-28 18:54:46.000000000 +0800
+++ linux/mm/internal.h	2012-02-28 18:55:03.085047281 +0800
@@ -311,3 +311,8 @@ extern u64 hwpoison_filter_flags_mask;
 extern u64 hwpoison_filter_flags_value;
 extern u64 hwpoison_filter_memcg;
 extern u32 hwpoison_filter_enable;
+
+extern u32 nr_migrate_wait_writeback;
+extern u32 nr_reclaim_wait_congested;
+extern u32 nr_congestion_wait;
+
--- linux.orig/mm/backing-dev.c	2012-02-28 18:54:46.000000000 +0800
+++ linux/mm/backing-dev.c	2012-02-28 18:55:03.085047281 +0800
@@ -12,6 +12,8 @@
 #include <linux/device.h>
 #include <trace/events/writeback.h>
 
+#include "internal.h"
+
 static atomic_long_t bdi_seq = ATOMIC_LONG_INIT(0);
 
 struct backing_dev_info default_backing_dev_info = {
@@ -805,6 +807,9 @@ void set_bdi_congested(struct backing_de
 }
 EXPORT_SYMBOL(set_bdi_congested);
 
+u32 nr_reclaim_wait_congested;
+u32 nr_congestion_wait;
+
 /**
  * congestion_wait - wait for a backing_dev to become uncongested
  * @sync: SYNC or ASYNC IO
@@ -825,6 +830,10 @@ long congestion_wait(int sync, long time
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
+	nr_congestion_wait++;
+	trace_printk("%pS %pS\n",
+		     __builtin_return_address(0),
+		     __builtin_return_address(1));
 	trace_writeback_congestion_wait(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));
 
@@ -879,6 +888,7 @@ long wait_iff_congested(struct zone *zon
 	ret = io_schedule_timeout(timeout);
 	finish_wait(wqh, &wait);
 
+	nr_reclaim_wait_congested++;
 out:
 	trace_writeback_wait_iff_congested(jiffies_to_usecs(timeout),
 					jiffies_to_usecs(jiffies - start));



^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 4/9] memcg: dirty page accounting support routines
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-28 15:15     ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-28 15:15 UTC (permalink / raw)
  To: Greg Thelen
  Cc: Andrew Morton, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 10:00:26PM +0800, Fengguang Wu wrote:
> From: Greg Thelen <gthelen@google.com>
> 
> Added memcg dirty page accounting support routines.  These routines are
> used by later changes to provide memcg aware writeback and dirty page
> limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
> allow for easier understanding of memcg writeback operation.

Greg, sorry that the mem_cgroup_dirty_info() interfaces and
tracepoints are abridged since they are not used here. Obviously this
patch series is not enough to keep the number of dirty pages under
control. It only tries to improve page reclaim behavior for whatever number
of dirty pages there happens to be. We'll need further schemes to keep dirty
pages at sane levels, so that unrelated tasks do not suffer from reclaim
waits when there are heavy writers in the same memcg.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/9] memcg: add dirty page accounting infrastructure
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-28 22:37     ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-02-28 22:37 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Righi, Linux Memory Management List, LKML

On Tue, 28 Feb 2012 22:00:24 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> From: Greg Thelen <gthelen@google.com>
> 
> Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> These routines are not yet used by the kernel to count such pages.  A
> later change adds kernel calls to these new routines.
> 
> As inode pages are marked dirty, if the dirtied page's cgroup differs
> from the inode's cgroup, then mark the inode shared across several
> cgroup.
> 
> ...
>
> @@ -1885,6 +1888,44 @@ void mem_cgroup_update_page_stat(struct 
>  			ClearPageCgroupFileMapped(pc);
>  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
>  		break;
> +
> +	case MEMCG_NR_FILE_DIRTY:
> +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> +		if (val > 0) {
> +			if (TestSetPageCgroupFileDirty(pc))
> +				val = 0;
> +		} else {
> +			if (!TestClearPageCgroupFileDirty(pc))
> +				val = 0;
> +		}

Made me scratch my head for a while, but I see now that the `val' arg
to (the undocumented) mem_cgroup_update_page_stat() can only ever have
the values 1 or -1.  I hope.

>
> ...
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 4/9] memcg: dirty page accounting support routines
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-28 22:45     ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-02-28 22:45 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linux Memory Management List, LKML

On Tue, 28 Feb 2012 22:00:26 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> From: Greg Thelen <gthelen@google.com>
> 
> Added memcg dirty page accounting support routines.  These routines are
> used by later changes to provide memcg aware writeback and dirty page
> limiting.  A mem_cgroup_dirty_info() tracepoint is also included to
> allow for easier understanding of memcg writeback operation.
> 
> ...
>
> +/*
> + * Return the number of additional pages that the @memcg cgroup could allocate.
> + * If use_hierarchy is set, then this involves checking parent mem cgroups to
> + * find the cgroup with the smallest free space.
> + */

Comment needs revisiting - use_hierarchy does not exist.

> +static unsigned long
> +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
> +{
> +	u64 free;
> +	unsigned long min_free;
> +
> +	min_free = global_page_state(NR_FREE_PAGES);
> +
> +	while (memcg) {
> +		free = mem_cgroup_margin(memcg);
> +		min_free = min_t(u64, min_free, free);
> +		memcg = parent_mem_cgroup(memcg);
> +	}
> +
> +	return min_free;
> +}
> +
> +/*
> + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> + * @memcg:     memory cgroup to query
> + * @item:      memory statistic item exported to the kernel
> + *
> + * Return the accounted statistic value.
> + */
> +unsigned long mem_cgroup_page_stat(struct mem_cgroup *memcg,
> +				   enum mem_cgroup_page_stat_item item)
> +{
> +	struct mem_cgroup *iter;
> +	s64 value;
> +
> +	/*
> +	 * If we're looking for dirtyable pages we need to evaluate free pages
> +	 * depending on the limit and usage of the parents first of all.
> +	 */
> +	if (item == MEMCG_NR_DIRTYABLE_PAGES)
> +		value = mem_cgroup_hierarchical_free_pages(memcg);
> +	else
> +		value = 0;
> +
> +	/*
> +	 * Recursively evaluate page statistics against all cgroup under
> +	 * hierarchy tree
> +	 */
> +	for_each_mem_cgroup_tree(iter, memcg)
> +		value += mem_cgroup_local_page_stat(iter, item);

What's the locking rule for for_each_mem_cgroup_tree()?  It's unobvious
from the code and isn't documented?

> +	/*
> +	 * Summing of unlocked per-cpu counters is racy and may yield a slightly
> +	 * negative value.  Zero is the only sensible value in such cases.
> +	 */
> +	if (unlikely(value < 0))
> +		value = 0;
> +
> +	return value;
> +}
> +
>
> ...
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-29  0:04     ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-02-29  0:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, 28 Feb 2012 22:00:27 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> This relays file pageout IOs to the flusher threads.
> 
> It's much more important now that page reclaim generally does not
> writeout filesystem-backed pages.

It doesn't?  We still do writeback in direct reclaim.  This claim
should be fleshed out rather a lot, please.

> The ultimate target is to gracefully handle the LRU lists pressured by
> dirty/writeback pages. In particular, problems (1-2) are addressed here.
> 
> 1) I/O efficiency
> 
> The flusher will piggy back the nearby ~10ms worth of dirty pages for I/O.
> 
> This takes advantage of the time/spacial locality in most workloads: the
> nearby pages of one file are typically populated into the LRU at the same
> time, hence will likely be close to each other in the LRU list. Writing
> them in one shot helps clean more pages effectively for page reclaim.

Yes, this is often true.  But when adjacent pages from the same file
are clustered together on the LRU, direct reclaim's LRU-based walk will
also provide good I/O patterns.

> For the common dd style sequential writes that have excellent locality,
> up to ~80ms data will be wrote around by the pageout work, which helps
> make I/O performance very close to that of the background writeback.
> 
> 2) writeback work coordinations
> 
> To avoid memory allocations at page reclaim, a mempool for struct
> wb_writeback_work is created.
> 
> wakeup_flusher_threads() is removed because it can easily delay the
> more oriented pageout works and even exhaust the mempool reservations.
> It's also found to not I/O efficient by frequently submitting writeback
> works with small ->nr_pages.

The last sentence here needs help.

> Background/periodic works will quit automatically, so as to clean the
> pages under reclaim ASAP.

I don't know what this means.  How does a work "quit automatically" and
why does that initiate I/O?

> However for now the sync work can still block
> us for long time.

Please define the term "sync work".

> Jan Kara: limit the search scope; remove works and unpin inodes on umount.
> 
> TODO: the pageout works may be starved by the sync work and maybe others.
> Need a proper way to guarantee fairness.
> 
> CC: Jan Kara <jack@suse.cz>
> CC: Mel Gorman <mgorman@suse.de>
> Acked-by: Rik van Riel <riel@redhat.com>
> CC: Greg Thelen <gthelen@google.com>
> CC: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c                |  230 +++++++++++++++++++++++++++--
>  fs/super.c                       |    1 
>  include/linux/backing-dev.h      |    2 
>  include/linux/writeback.h        |   16 +-
>  include/trace/events/writeback.h |   12 +
>  mm/vmscan.c                      |   36 ++--
>  6 files changed, 268 insertions(+), 29 deletions(-)
> 
> --- linux.orig/fs/fs-writeback.c	2012-02-28 19:07:06.109064465 +0800
> +++ linux/fs/fs-writeback.c	2012-02-28 19:07:07.277064493 +0800
> @@ -41,6 +41,8 @@ struct wb_writeback_work {
>  	long nr_pages;
>  	struct super_block *sb;
>  	unsigned long *older_than_this;
> +	struct inode *inode;
> +	pgoff_t offset;

Please document `offset' here.  What is it used for?

>  	enum writeback_sync_modes sync_mode;
>  	unsigned int tagged_writepages:1;
>  	unsigned int for_kupdate:1;
> @@ -57,6 +59,27 @@ struct wb_writeback_work {
>   */
>  int nr_pdflush_threads;
>  
> +static mempool_t *wb_work_mempool;
> +
> +static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)

The gfp_mask is traditionally the last function argument.

> +{
> +	/*
> +	 * alloc_queue_pageout_work() will be called on page reclaim
> +	 */
> +	if (current->flags & PF_MEMALLOC)
> +		return NULL;

Do we need to test current->flags here?  Could we have checked
!(gfp_mask & __GFP_IO) and/or __GFP_FILE?

I'm not really suggesting such a change - just trying to get my head
around how this stuff works..


> +	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
> +}
> +
> +static __init int wb_work_init(void)
> +{
> +	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
> +					 wb_work_alloc, mempool_kfree, NULL);
> +	return wb_work_mempool ? 0 : -ENOMEM;
> +}
> +fs_initcall(wb_work_init);

Please provide a description of the wb_writeback_work lifecycle: when
they are allocated, when they are freed, how we ensure that a finite
number are in flight.

Also, when a mempool_alloc() caller is waiting for wb_writeback_works
to be freed, how are we dead certain that some *will* be freed?  It
means that the mempool_alloc() caller cannot be holding any explicit or
implicit locks which would prevent wb_writeback_works from being freed.
Where "implicit lock" means things like being inside ext3/4
journal_start.

This stuff is tricky and is hard to get right.  Reviewing its
correctness by staring at a patch is difficult.

>  /**
>   * writeback_in_progress - determine whether there is writeback in progress
>   * @bdi: the device's backing_dev_info structure.
> @@ -129,7 +152,7 @@ __bdi_start_writeback(struct backing_dev
>  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
>  	 * wakeup the thread for old dirty data writeback
>  	 */
> -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);

Sneaky change from GFP_ATOMIC to GFP_NOWAIT is significant, but
undescribed?

>  	if (!work) {
>  		if (bdi->wb.task) {
>  			trace_writeback_nowork(bdi);
> @@ -138,6 +161,7 @@ __bdi_start_writeback(struct backing_dev
>  		return;
>  	}
>  
> +	memset(work, 0, sizeof(*work));
>  	work->sync_mode	= WB_SYNC_NONE;
>  	work->nr_pages	= nr_pages;
>  	work->range_cyclic = range_cyclic;
> @@ -187,6 +211,181 @@ void bdi_start_background_writeback(stru
>  }
>  
>  /*
> + * Check if @work already covers @offset, or try to extend it to cover @offset.
> + * Returns true if the wb_writeback_work now encompasses the requested offset.
> + */
> +static bool extend_writeback_range(struct wb_writeback_work *work,
> +				   pgoff_t offset,
> +				   unsigned long unit)
> +{
> +	pgoff_t end = work->offset + work->nr_pages;
> +
> +	if (offset >= work->offset && offset < end)
> +		return true;
> +
> +	/*
> +	 * for sequential workloads with good locality, include up to 8 times
> +	 * more data in one chunk
> +	 */
> +	if (work->nr_pages >= 8 * unit)
> +		return false;

argh, gack, puke.  I thought I revoked your magic number license months ago!

Please, it's a HUGE red flag that bad things are happening.  Would the
kernel be better or worse if we were to use 9.5 instead?  How do we
know that "8" is optimum for all memory sizes, device bandwidths, etc?

It's a hack - it's *always* a hack.  Find a better way.

> +	/* the unsigned comparison helps eliminate one compare */
> +	if (work->offset - offset < unit) {
> +		work->nr_pages += unit;
> +		work->offset -= unit;
> +		return true;
> +	}
> +
> +	if (offset - end < unit) {
> +		work->nr_pages += unit;
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
> +/*
> + * schedule writeback on a range of inode pages.
> + */
> +static struct wb_writeback_work *
> +alloc_queue_pageout_work(struct backing_dev_info *bdi,
> +			 struct inode *inode,
> +			 pgoff_t offset,
> +			 pgoff_t len)
> +{
> +	struct wb_writeback_work *work;
> +
> +	/*
> +	 * Grab the inode until the work is executed. We are calling this from
> +	 * page reclaim context and the only thing pinning the address_space
> +	 * for the moment is the page lock.
> +	 */
> +	if (!igrab(inode))
> +		return ERR_PTR(-ENOENT);

uh-oh.  igrab() means iput().

ENOENT means "no such file or directory" and makes no sense in this
context.

> +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
> +	if (!work) {
> +		trace_printk("wb_work_mempool alloc fail\n");
> +		return ERR_PTR(-ENOMEM);
> +	}
> +
> +	memset(work, 0, sizeof(*work));
> +	work->sync_mode		= WB_SYNC_NONE;
> +	work->inode		= inode;
> +	work->offset		= offset;
> +	work->nr_pages		= len;
> +	work->reason		= WB_REASON_PAGEOUT;
> +
> +	bdi_queue_work(bdi, work);
> +
> +	return work;
> +}
> +
> +/*
> + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
> + * improve IO throughput. The nearby pages will have good chance to reside in
> + * the same LRU list that vmscan is working on, and even close to each other
> + * inside the LRU list in the common case of sequential read/write.
> + *
> + * ret > 0: success, allocated/queued a new pageout work;
> + *	    there are at least @ret writeback works queued now
> + * ret = 0: success, reused/extended a previous pageout work
> + * ret < 0: failed
> + */
> +int queue_pageout_work(struct address_space *mapping, struct page *page)
> +{
> +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> +	struct inode *inode = mapping->host;
> +	struct wb_writeback_work *work;
> +	unsigned long write_around_pages;
> +	pgoff_t offset = page->index;
> +	int i = 0;
> +	int ret = -ENOENT;

ENOENT means "no such file or directory" and makes no sense in this
context.

> +	if (unlikely(!inode))
> +		return ret;

How does this happen?

> +	/*
> +	 * piggy back 8-15ms worth of data
> +	 */
> +	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
> +	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;

Where did "6" come from?

> +	i = 1;
> +	spin_lock_bh(&bdi->wb_lock);
> +	list_for_each_entry_reverse(work, &bdi->work_list, list) {
> +		if (work->inode != inode)
> +			continue;
> +		if (extend_writeback_range(work, offset, write_around_pages)) {
> +			ret = 0;
> +			break;
> +		}
> +		/*
> +		 * vmscan will slow down page reclaim when there are more than
> +		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
> +		 * times larger.
> +		 */
> +		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
> +			break;

I'm now totally lost.  What are the units of "i"?  (And why the heck was
it called "i" anyway?) Afaict, "i" counts the number of times we
successfully extended the writeback range by write_around_pages?  What
relationship does this have to yet-another-magic-number-times-two?

Please have a think about how to make the code comprehensible to (and
hence maintainable by) others?

> +	}
> +	spin_unlock_bh(&bdi->wb_lock);
> +
> +	if (ret) {
> +		ret = i;
> +		offset = round_down(offset, write_around_pages);
> +		work = alloc_queue_pageout_work(bdi, inode,
> +						offset, write_around_pages);
> +		if (IS_ERR(work))
> +			ret = PTR_ERR(work);
> +	}

Need a comment over this code section.  afaict it would be something
like "if we failed to add pages to an existing wb_writeback_work then
allocate and queue a new one".

> +	return ret;
> +}
> +
> +static void wb_free_work(struct wb_writeback_work *work)
> +{
> +	if (work->inode)
> +		iput(work->inode);

And here is where at least two previous attempts to perform
address_space-based writearound within direct reclaim have come
unstuck.

Occasionally, iput() does a huge amount of stuff: when it is the final
iput() on the inode.  iirc this can include taking tons of VFS locks,
truncating files, starting (and perhaps waiting upon) journal commits,
etc.  I forget, but it's a *lot*.

And quite possibly you haven't tested this at all, because it's pretty
rare for an iput() like this to be the final one.

Let me give you an example to worry about: suppose code which holds fs
locks calls into direct reclaim and then calls
mempool_alloc(wb_work_mempool) and that mempool_alloc() has to wait for
an item to be returned.  But no items will ever be returned, because
all the threads which own wb_writeback_works are stuck in
wb_free_work->iput, trying to take an fs lock which is still held by
the now-blocked direct-reclaim caller.

And a billion similar scenarios :( The really nasty thing about this is
that it is very rare for this iput() to be a final iput(), so it's hard
to get code coverage.

Please have a think about all of this and see if you can demonstrate
how the iput() here is guaranteed safe.

> +	/*
> +	 * Notify the caller of completion if this is a synchronous
> +	 * work item, otherwise just free it.
> +	 */
> +	if (work->done)
> +		complete(work->done);
> +	else
> +		mempool_free(work, wb_work_mempool);
> +}
> +
> +/*
> + * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
> + */
> +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> +				struct super_block *sb)
> +{
> +	struct wb_writeback_work *work, *tmp;
> +	LIST_HEAD(dispose);
> +
> +	spin_lock_bh(&bdi->wb_lock);
> +	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
> +		if (sb) {
> +			if (work->sb && work->sb != sb)
> +				continue;

What does it mean when wb_writeback_work.sb==NULL?  This reader doesn't
know, hence he can't understand (or review) this code.  Perhaps
describe it here, unless it is well described elsewhere?

> +			if (work->inode && work->inode->i_sb != sb)

And what is the meaning of wb_writeback_work.inode==NULL?

Seems odd that wb_writeback_work.sb exists, when it is accessible via
wb_writeback_work.inode->i_sb.

As a person who reviews a huge amount of code, I can tell you this: the
key to understanding code is to understand the data structures and the
relationship between their fields and between different data
structures.  Including lifetime rules, locking rules and hidden
information such as "what does it mean when wb_writeback_work.sb is
NULL".  Once one understands all this about the data structures, the
code becomes pretty obvious and bugs can be spotted and fixed.  But
alas, wb_writeback_work is basically undocumented.

> +				continue;
> +		}
> +		list_move(&work->list, &dispose);

So here we have queued for disposal a) works which refer to an inode on
sb and b) works which have ->sb==NULL and ->inode==NULL.  I don't know
whether the b) type exist.

> +	}
> +	spin_unlock_bh(&bdi->wb_lock);
> +
> +	while (!list_empty(&dispose)) {
> +		work = list_entry(dispose.next,
> +				  struct wb_writeback_work, list);
> +		list_del_init(&work->list);
> +		wb_free_work(work);
> +	}

You should be able to do this operation without writing to all the
list_heads: no list_del(), no list_del_init().
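
For instance, something along these lines (an untested sketch; `dispose' is
a local list head, and wb_free_work() releases each entry anyway):

	list_for_each_entry_safe(work, tmp, &dispose, list)
		wb_free_work(work);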

> +}
> +
> +/*
>   * Remove the inode from the writeback list it is on.
>   */
>  void inode_wb_list_del(struct inode *inode)
> @@ -833,6 +1032,21 @@ static unsigned long get_nr_dirty_pages(
>  		get_nr_dirty_inodes();
>  }
>  
> +static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
> +{
> +	struct writeback_control wbc = {
> +		.sync_mode = WB_SYNC_NONE,
> +		.nr_to_write = LONG_MAX,
> +		.range_start = work->offset << PAGE_CACHE_SHIFT,

I think this will give you a 32->64 bit overflow on 32-bit machines.

> +		.range_end = (work->offset + work->nr_pages - 1)
> +						<< PAGE_CACHE_SHIFT,

Ditto.
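
Something like this (an untested sketch; the point is just that the cast to
loff_t has to happen before the shift) would avoid the truncation:

		.range_start = (loff_t)work->offset << PAGE_CACHE_SHIFT,
		.range_end = (loff_t)(work->offset + work->nr_pages - 1)
						<< PAGE_CACHE_SHIFT,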

Please include this in a patchset sometime ;)

--- a/include/linux/writeback.h~a
+++ a/include/linux/writeback.h
@@ -64,7 +64,7 @@ struct writeback_control {
 	long pages_skipped;		/* Pages which were not written */
 
 	/*
-	 * For a_ops->writepages(): is start or end are non-zero then this is
+	 * For a_ops->writepages(): if start or end are non-zero then this is
 	 * a hint that the filesystem need only write out the pages inside that
 	 * byterange.  The byte at `end' is included in the writeout request.
 	 */


> +	};
> +
> +	do_writepages(work->inode->i_mapping, &wbc);
> +
> +	return LONG_MAX - wbc.nr_to_write;
> +}

<infers the return semantics from the code> It took a while.  Peeking
at the caller helped.

>  static long wb_check_background_flush(struct bdi_writeback *wb)
>  {
>  	if (over_bground_thresh(wb->bdi)) {
> @@ -905,16 +1119,12 @@ long wb_do_writeback(struct bdi_writebac
>  
>  		trace_writeback_exec(bdi, work);
>  
> -		wrote += wb_writeback(wb, work);
> -
> -		/*
> -		 * Notify the caller of completion if this is a synchronous
> -		 * work item, otherwise just free it.
> -		 */
> -		if (work->done)
> -			complete(work->done);
> +		if (!work->inode)
> +			wrote += wb_writeback(wb, work);
>  		else
> -			kfree(work);
> +			wrote += wb_pageout(wb, work);
> +
> +		wb_free_work(work);
>  	}
>  
>  	/*
>
> ...
>
> --- linux.orig/include/linux/backing-dev.h	2012-02-28 19:07:06.081064464 +0800
> +++ linux/include/linux/backing-dev.h	2012-02-28 19:07:07.281064493 +0800
> @@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
>  void bdi_arm_supers_timer(void);
>  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
>  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
> +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> +				struct super_block *sb);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
>  			nr_dirty++;
>  
>  			/*
> -			 * Only kswapd can writeback filesystem pages to
> -			 * avoid risk of stack overflow but do not writeback
> -			 * unless under significant pressure.
> +			 * Pages may be dirtied anywhere inside the LRU. This
> +			 * ensures they undergo a full period of LRU iteration
> +			 * before considering pageout. The intention is to
> +			 * delay writeout to the flusher thread, unless when
> +			 * run into a long segment of dirty pages.
> +			 */
> +			if (references == PAGEREF_RECLAIM_CLEAN &&
> +			    priority == DEF_PRIORITY)
> +				goto keep_locked;
> +
> +			/*
> +			 * Try relaying the pageout I/O to the flusher threads
> +			 * for better I/O efficiency and avoid stack overflow.
>  			 */
> -			if (page_is_file_cache(page) &&
> -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> +			if (page_is_file_cache(page) && mapping &&
> +			    queue_pageout_work(mapping, page) >= 0) {
>  				/*
>  				 * Immediately reclaim when written back.
>  				 * Similar in principal to deactivate_page()
> @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
>  				goto keep_locked;
>  			}
>  
> -			if (references == PAGEREF_RECLAIM_CLEAN)
> +			/*
> +			 * Only kswapd can writeback filesystem pages to
> +			 * avoid risk of stack overflow.
> +			 */
> +			if (page_is_file_cache(page) && !current_is_kswapd())

And here we run into big problems.

When a page-allocator enters direct reclaim, that process is trying to
allocate a page from a particular zone (or set of zones).  For example,
he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
off and write back three gigabytes of ZONE_HIGHMEM is pointless,
inefficient and doesn't fix the caller's problem at all.

This has always been the biggest problem with the
avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
as I've read) doesn't address the problem at all and appears to be
blissfully unaware of its existence.


I've attempted versions of this I think twice, and thrown the patches
away in disgust.  One approach I tried was, within direct reclaim, to
grab the page I wanted (ie: one which is in one of the caller's desired
zones) and to pass that page over to the kernel threads.  The kernel
threads would ensure that this particular page was included in the
writearound preparation.  So that we at least make *some* progress
toward what the caller is asking us to do.

iirc, the way I "grabbed" the page was to actually lock it, with
[try]lock_page().  And unlock it again way over within the writeback
thread.  I forget why I did it this way, rather than get_page() or
whatever.  Locking the page is a good way of preventing anyone else
from futzing with it.  It also pins the inode, which perhaps meant that
with careful management, I could avoid the igrab()/iput() horrors
discussed above.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 2/9] memcg: add dirty page accounting infrastructure
  2012-02-28 22:37     ` Andrew Morton
@ 2012-02-29  0:27       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-29  0:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Andrea Righi, Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 02:37:38PM -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2012 22:00:24 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > From: Greg Thelen <gthelen@google.com>
> > 
> > Add memcg routines to count dirty, writeback, and unstable_NFS pages.
> > These routines are not yet used by the kernel to count such pages.  A
> > later change adds kernel calls to these new routines.
> > 
> > As inode pages are marked dirty, if the dirtied page's cgroup differs
> > from the inode's cgroup, then mark the inode shared across several
> > cgroup.
> > 
> > ...
> >
> > @@ -1885,6 +1888,44 @@ void mem_cgroup_update_page_stat(struct 
> >  			ClearPageCgroupFileMapped(pc);
> >  		idx = MEM_CGROUP_STAT_FILE_MAPPED;
> >  		break;
> > +
> > +	case MEMCG_NR_FILE_DIRTY:
> > +		/* Use Test{Set,Clear} to only un/charge the memcg once. */
> > +		if (val > 0) {
> > +			if (TestSetPageCgroupFileDirty(pc))
> > +				val = 0;
> > +		} else {
> > +			if (!TestClearPageCgroupFileDirty(pc))
> > +				val = 0;
> > +		}
> 
> Made me scratch my head for a while, but I see now that the `val' arg
> to (the undocumented) mem_cgroup_update_page_stat() can only ever have
> the values 1 or -1.  I hope.

Yeah, I see it's called this way:

   3    151  /c/linux/include/linux/memcontrol.h <<mem_cgroup_inc_page_stat>>
             mem_cgroup_update_page_stat(page, idx, 1);

   4    157  /c/linux/include/linux/memcontrol.h <<mem_cgroup_dec_page_stat>>
             mem_cgroup_update_page_stat(page, idx, -1);
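
i.e. the callers are just the two inline wrappers, which look roughly like
this (paraphrased from include/linux/memcontrol.h):

   static inline void mem_cgroup_inc_page_stat(struct page *page,
				enum mem_cgroup_page_stat_item idx)
   {
	mem_cgroup_update_page_stat(page, idx, 1);
   }

   static inline void mem_cgroup_dec_page_stat(struct page *page,
				enum mem_cgroup_page_stat_item idx)
   {
	mem_cgroup_update_page_stat(page, idx, -1);
   }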

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 1/9] memcg: add page_cgroup flags for dirty page tracking
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-29  0:50     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-29  0:50 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han, hannes,
	Rik van Riel, Andrea Righi, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, 28 Feb 2012 22:00:23 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> From: Greg Thelen <gthelen@google.com>
> 
> Add additional flags to page_cgroup to track dirty pages
> within a mem_cgroup.
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>

I'm sorry, but I have changed the design of page_cgroup's flag updates
and don't want to add any new flags (I'd like to remove page_cgroup->flags.)

Please see linux-next.

A good example is PCG_FILE_MAPPED, which I removed.

memcg: use new logic for page stat accounting
memcg: remove PCG_FILE_MAPPED

You can make use of PageDirty() and PageWriteback() instead of new flags.. (I hope.)
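
Something like this (just an untested sketch of the idea; the generic flag
transition itself tells whether to charge or uncharge):

	/* at the point the generic dirty flag flips set... */
	if (!TestSetPageDirty(page))
		mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);

	/* ...and at the point it flips clear */
	if (TestClearPageDirty(page))
		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);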

Thanks,
-Kame

> ---
>  include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
>  1 file changed, 23 insertions(+)
> 
> --- linux.orig/include/linux/page_cgroup.h	2012-02-19 10:53:14.000000000 +0800
> +++ linux/include/linux/page_cgroup.h	2012-02-19 10:53:16.000000000 +0800
> @@ -10,6 +10,9 @@ enum {
>  	/* flags for mem_cgroup and file and I/O status */
>  	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
>  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> +	PCG_FILE_DIRTY, /* page is dirty */
> +	PCG_FILE_WRITEBACK, /* page is under writeback */
> +	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
>  	__NR_PCG_FLAGS,
>  };
>  
> @@ -64,6 +67,10 @@ static inline void ClearPageCgroup##unam
>  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
>  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
>  
> +#define TESTSETPCGFLAG(uname, lname)			\
> +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
> +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> +
>  /* Cache flag is set only once (at allocation) */
>  TESTPCGFLAG(Cache, CACHE)
>  CLEARPCGFLAG(Cache, CACHE)
> @@ -77,6 +84,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
>  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
>  TESTPCGFLAG(FileMapped, FILE_MAPPED)
>  
> +SETPCGFLAG(FileDirty, FILE_DIRTY)
> +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> +
> +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> +
> +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> +
>  SETPCGFLAG(Migration, MIGRATION)
>  CLEARPCGFLAG(Migration, MIGRATION)
>  TESTPCGFLAG(Migration, MIGRATION)
> 
> 
> 


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 3/9] memcg: add kernel calls for memcg dirty page stats
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-29  1:10     ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-29  1:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han, hannes,
	Rik van Riel, Andrea Righi, Daisuke Nishimura, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, 28 Feb 2012 22:00:25 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> From: Greg Thelen <gthelen@google.com>
> 
> Add calls into memcg dirty page accounting.  Notify memcg when pages
> transition between clean, file dirty, writeback, and unstable nfs.  This
> allows the memory controller to maintain an accurate view of the amount
> of its memory that is dirty.
> 
> Signed-off-by: Greg Thelen <gthelen@google.com>
> Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
> Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  fs/nfs/write.c      |    4 ++++
>  mm/filemap.c        |    1 +
>  mm/page-writeback.c |    4 ++++
>  mm/truncate.c       |    1 +
>  4 files changed, 10 insertions(+)
> 
> --- linux.orig/fs/nfs/write.c	2012-02-19 10:53:14.000000000 +0800
> +++ linux/fs/nfs/write.c	2012-02-19 10:53:21.000000000 +0800
> @@ -449,6 +449,7 @@ nfs_mark_request_commit(struct nfs_page 
>  	nfsi->ncommit++;
>  	spin_unlock(&inode->i_lock);
>  	pnfs_mark_request_commit(req, lseg);
> +	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);

Hmm... can't the UNSTABLE_NFS status be obtained from 'struct page'?

One idea to avoid adding a new flag to pc->flags is..

Can't we do it with something like the following, if a 'req' exists per page?

	memcg = mem_cgroup_from_page(page);  # update memcg's refcnt+1
	req->memcg = memcg;		     # record memcg to req.
	mem_cgroup_inc_nfs_unstable(memcg)   # a new call



>  	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
> @@ -460,6 +461,7 @@ nfs_clear_request_commit(struct nfs_page
>  	struct page *page = req->wb_page;
>  
>  	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
>  		dec_zone_page_state(page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
>  		return 1;
> @@ -1408,6 +1410,8 @@ void nfs_retry_commit(struct list_head *
>  		req = nfs_list_entry(page_list->next);
>  		nfs_list_remove_request(req);
>  		nfs_mark_request_commit(req, lseg);
> +		mem_cgroup_dec_page_stat(req->wb_page,
> +					 MEMCG_NR_FILE_UNSTABLE_NFS);
>  		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
>  		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
>  			     BDI_RECLAIMABLE);
> --- linux.orig/mm/filemap.c	2012-02-19 10:53:14.000000000 +0800
> +++ linux/mm/filemap.c	2012-02-19 10:53:21.000000000 +0800
> @@ -142,6 +142,7 @@ void __delete_from_page_cache(struct pag
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_DIRTY);


I think we can make use of PageDirty() as explained.

Thanks,
-Kame


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 4/9] memcg: dirty page accounting support routines
  2012-02-28 22:45     ` Andrew Morton
@ 2012-02-29  1:15       ` KAMEZAWA Hiroyuki
  -1 siblings, 0 replies; 116+ messages in thread
From: KAMEZAWA Hiroyuki @ 2012-02-29  1:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, Greg Thelen, Jan Kara, Ying Han, hannes,
	Rik van Riel, Linux Memory Management List, LKML

On Tue, 28 Feb 2012 14:45:07 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:

> On Tue, 28 Feb 2012 22:00:26 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > From: Greg Thelen <gthelen@google.com>
> > 
> > Added memcg dirty page accounting support routines.  These routines are
> > used by later changes to provide memcg aware writeback and dirty page
> > limiting.  A mem_cgroup_dirty_info() tracepoint is is also included to
> > allow for easier understanding of memcg writeback operation.
> > 
> > ...
> >
> > +/*
> > + * Return the number of additional pages that the @memcg cgroup could allocate.
> > + * If use_hierarchy is set, then this involves checking parent mem cgroups to
> > + * find the cgroup with the smallest free space.
> > + */
> 
> Comment needs revisting - use_hierarchy does not exist.
> 
> > +static unsigned long
> > +mem_cgroup_hierarchical_free_pages(struct mem_cgroup *memcg)
> > +{
> > +	u64 free;
> > +	unsigned long min_free;
> > +
> > +	min_free = global_page_state(NR_FREE_PAGES);
> > +
> > +	while (memcg) {
> > +		free = mem_cgroup_margin(memcg);
> > +		min_free = min_t(u64, min_free, free);
> > +		memcg = parent_mem_cgroup(memcg);
> > +	}
> > +
> > +	return min_free;
> > +}
> > +
> > +/*
> > + * mem_cgroup_page_stat() - get memory cgroup file cache statistics
> > + * @memcg:     memory cgroup to query
> > + * @item:      memory statistic item exported to the kernel
> > + *
> > + * Return the accounted statistic value.
> > + */
> > +unsigned long mem_cgroup_page_stat(struct mem_cgroup *memcg,
> > +				   enum mem_cgroup_page_stat_item item)
> > +{
> > +	struct mem_cgroup *iter;
> > +	s64 value;
> > +
> > +	/*
> > +	 * If we're looking for dirtyable pages we need to evaluate free pages
> > +	 * depending on the limit and usage of the parents first of all.
> > +	 */
> > +	if (item == MEMCG_NR_DIRTYABLE_PAGES)
> > +		value = mem_cgroup_hierarchical_free_pages(memcg);
> > +	else
> > +		value = 0;
> > +
> > +	/*
> > +	 * Recursively evaluate page statistics against all cgroup under
> > +	 * hierarchy tree
> > +	 */
> > +	for_each_mem_cgroup_tree(iter, memcg)
> > +		value += mem_cgroup_local_page_stat(iter, item);
> 
> What's the locking rule for for_each_mem_cgroup_tree()?  It's unobvious
> from the code and isn't documented?
> 

Because for_each_mem_cgroup_tree() uses rcu_read_lock() and reference
counting internally, callers are not required to take any lock.
One rule is that the caller should call mem_cgroup_iter_break() if it
wants to break out of the loop.
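
A minimal sketch of that rule (only the iterator calls are real; the loop body
and should_stop() are hypothetical):

        struct mem_cgroup *iter;

        for_each_mem_cgroup_tree(iter, memcg) {
                if (should_stop(iter)) {
                        /* drop the reference held by the iterator */
                        mem_cgroup_iter_break(memcg, iter);
                        break;
                }
                /* otherwise iter can be used without extra locking */
        }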

Thanks,
-Kame







^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-02-29  0:04     ` Andrew Morton
@ 2012-02-29  2:31       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-29  2:31 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 04:04:03PM -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2012 22:00:27 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > This relays file pageout IOs to the flusher threads.
> > 
> > It's much more important now that page reclaim generally does not
> > writeout filesystem-backed pages.
> 
> It doesn't?  We still do writeback in direct reclaim.  This claim
> should be fleshed out rather a lot, please.

That claim is actually from Mel in his review comments :)

The current upstream kernel avoids writeback in direct reclaim entirely,
as of commit ee72886d8ed5d ("mm: vmscan: do not writeback filesystem
pages in direct reclaim").

Now with this patch, as long as the pageout works are queued
successfully, the pageout() calls from kswapd() will also be
eliminated.

> > The ultimate target is to gracefully handle the LRU lists pressured by
> > dirty/writeback pages. In particular, problems (1-2) are addressed here.
> > 
> > 1) I/O efficiency
> > 
> > The flusher will piggy back the nearby ~10ms worth of dirty pages for I/O.
> > 
> > This takes advantage of the time/spacial locality in most workloads: the
> > nearby pages of one file are typically populated into the LRU at the same
> > time, hence will likely be close to each other in the LRU list. Writing
> > them in one shot helps clean more pages effectively for page reclaim.
> 
> Yes, this is often true.  But when adjacent pages from the same file
> are clustered together on the LRU, direct reclaim's LRU-based walk will
> also provide good I/O patterns.

I'm afraid the I/O elevator is not smart enough (nor technically able)
to merge the pageout() bios. The file pages are typically interleaved
between the DMA32 and NORMAL zones, or even among NUMA nodes. Page
reclaim also walks the nodes/zones in an interleaved fashion, but in a
different order. So pageout() might at best generate I/O for [1, 30],
[150, 168], [90, 99], ...

IOW, the holes and disorder effectively kill large I/O. Not to mention
that it hurts interactive performance to block in get_request_wait() if
we ever submit I/O inside page reclaim.

> > For the common dd style sequential writes that have excellent locality,
> > up to ~80ms data will be wrote around by the pageout work, which helps
> > make I/O performance very close to that of the background writeback.
> > 
> > 2) writeback work coordinations
> > 
> > To avoid memory allocations at page reclaim, a mempool for struct
> > wb_writeback_work is created.
> > 
> > wakeup_flusher_threads() is removed because it can easily delay the
> > more oriented pageout works and even exhaust the mempool reservations.
> > It's also found to not I/O efficient by frequently submitting writeback
> > works with small ->nr_pages.
> 
> The last sentence here needs help.

wakeup_flusher_threads() is called with total_scanned, which could be
(LRU_size / 4096). Given a 1GB LRU_size, the write chunk would be 256KB.
This is much smaller than the old 4MB and the now preferred write
chunk size (write_bandwidth/2).

                writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
==>             if (total_scanned > writeback_threshold) {
                        wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
                                                WB_REASON_TRY_TO_FREE_PAGES);
                        sc->may_writepage = 1;
                }

Actually I see many more wakeup_flusher_threads() calls than expected.
The above test condition may be too permissive.

For direct reclaim, sc->nr_to_reclaim=32 and total_scanned starts with
(LRU_size / 4096), which *always* exceeds writeback_threshold in boxes
with more than 1GB of memory. So the flusher ends up constantly being
fed small writeout requests.
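
To make the numbers concrete (sizes assumed for illustration): with a 1GB file LRU,

                total_scanned       >= 262144 pages / 4096 = 64
                writeback_threshold  = 32 + 32/2           = 48

so the condition is met right away and the flusher is asked to write merely
64 pages = 256KB.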

The test does not really reflect "dirty page pressure". And it's easy
to trigger direct reclaim by starting some concurrent page allocators
or by using memcg, which has nothing to do with dirty pressure.

> > Background/periodic works will quit automatically, so as to clean the
> > pages under reclaim ASAP.
> 
> I don't know what this means.  How does a work "quit automatically" and
> why does that initiate I/O?

Typically the flusher will be working on the background/periodic works
when there are heavy dirtier tasks. And wb_writeback() has this

                /*
                 * Background writeout and kupdate-style writeback may
                 * run forever. Stop them if there is other work to do
                 * so that e.g. sync can proceed. They'll be restarted
                 * after the other works are all done.
                 */
                if ((work->for_background || work->for_kupdate) &&
                    !list_empty(&wb->bdi->work_list))
                        break;

to quit the background/periodic work when pageout or other works are
queued. So the pageout works can typically be picked up and executed
quickly by the flusher: the background/periodic works are the dominant
ones and there are rarely other types of works in the way.

> > However for now the sync work can still block
> > us for long time.
> 
> Please define the term "sync work".

Those are the works submitted by

        __sync_filesystem()
          ==> writeback_inodes_sb() for the WB_SYNC_NONE pass
          ==> sync_inodes_sb()      for the WB_SYNC_ALL pass

with reason WB_REASON_SYNC.

Thanks,
Fengguang

// break time..

> > Jan Kara: limit the search scope; remove works and unpin inodes on umount.
> > 
> > TODO: the pageout works may be starved by the sync work and maybe others.
> > Need a proper way to guarantee fairness.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-02-29  0:04     ` Andrew Morton
@ 2012-02-29 13:28       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-29 13:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 04:04:03PM -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2012 22:00:27 +0800 Fengguang Wu <fengguang.wu@intel.com> wrote:

> > ---
> >  fs/fs-writeback.c                |  230 +++++++++++++++++++++++++++--
> >  fs/super.c                       |    1 
> >  include/linux/backing-dev.h      |    2 
> >  include/linux/writeback.h        |   16 +-
> >  include/trace/events/writeback.h |   12 +
> >  mm/vmscan.c                      |   36 ++--
> >  6 files changed, 268 insertions(+), 29 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c	2012-02-28 19:07:06.109064465 +0800
> > +++ linux/fs/fs-writeback.c	2012-02-28 19:07:07.277064493 +0800
> > @@ -41,6 +41,8 @@ struct wb_writeback_work {
> >  	long nr_pages;
> >  	struct super_block *sb;
> >  	unsigned long *older_than_this;
> > +	struct inode *inode;
> > +	pgoff_t offset;
> 
> Please document `offset' here.  What is it used for?

It's the start offset to writeout. It seems better to name it "start"
and comment as below

        /*
         * When (inode != NULL), it's a pageout work for cleaning the
         * inode pages from start to start+nr_pages.
         */              
        struct inode *inode;
        pgoff_t start;  

> >  	enum writeback_sync_modes sync_mode;
> >  	unsigned int tagged_writepages:1;
> >  	unsigned int for_kupdate:1;
> > @@ -57,6 +59,27 @@ struct wb_writeback_work {
> >   */
> >  int nr_pdflush_threads;
> >  
> > +static mempool_t *wb_work_mempool;
> > +
> > +static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
> 
> The gfp_mask is traditionally the last function argument.

Yup, but it's the prototype defined by "mempool_alloc_t *alloc_fn".

> > +{
> > +	/*
> > +	 * alloc_queue_pageout_work() will be called on page reclaim
> > +	 */
> > +	if (current->flags & PF_MEMALLOC)
> > +		return NULL;
> 
> Do we need to test current->flags here?

Yes, the intention is to avoid page allocation in the page reclaim path.

Also, if the pool of 1024 writeback works is exhausted, just fail new
pageout requests. Normally it does not take many IOs in flight to
saturate a disk. Limiting the queue size helps reduce the in-queue
delays.

> Could we have checked !(gfp_mask & __GFP_IO) and/or __GFP_FILE?

mempool_alloc() will call into this callback with
(__GFP_WAIT|__GFP_IO) removed from gfp_mask. So no need to check them.
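
For reference, the first allocation attempt in mempool_alloc() looks roughly
like this (paraphrased from mm/mempool.c of this era, not copied verbatim):

        gfp_mask |= __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN;
        gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);   /* first try without waiting */
        element = pool->alloc(gfp_temp, pool->pool_data);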

> I'm not really suggsting such a change - just trying to get my head
> around how this stuff works..
> 
> > +	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
> > +}
> > +
> > +static __init int wb_work_init(void)
> > +{
> > +	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
> > +					 wb_work_alloc, mempool_kfree, NULL);
> > +	return wb_work_mempool ? 0 : -ENOMEM;
> > +}
> > +fs_initcall(wb_work_init);
> 
> Please provide a description of the wb_writeback_work lifecycle: when
> they are allocated, when they are freed, how we ensure that a finite
> number are in flight.

A wb_writeback_work is either created (and hence auto destroyed) on the
stack, or dynamically allocated from the mempool. The implicit rule is:
the caller shall allocate the wb_writeback_work on the stack iff it
wants to wait for completion of the work.

The work is then queued into bdi->work_list, where the flusher picks up
one wb_writeback_work at a time, dequeues it, executes it and finally
either frees it (mempool allocated) or wakes up the caller (on stack).
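
The synchronous pattern is roughly what sync_inodes_sb() already does today
(a sketch, not the exact code):

        DECLARE_COMPLETION_ONSTACK(done);
        struct wb_writeback_work work = {
                .sb             = sb,
                .sync_mode      = WB_SYNC_ALL,
                .nr_pages       = LONG_MAX,
                .done           = &done,
                .reason         = WB_REASON_SYNC,
        };

        bdi_queue_work(sb->s_bdi, &work);
        wait_for_completion(&done);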

The queue size has generally not been limited because it was never a
problem. However it's now possible for vmscan to queue lots of pageout
works in a short time, so the limiting rules below are applied:

- when LOTS_OF_WRITEBACK_WORKS = WB_WORK_MEMPOOL_SIZE / 8 = 128 works
  are queued, vmscan should start throttling itself (to the rate the
  flusher can consume pageout works).

- when 2 * LOTS_OF_WRITEBACK_WORKS works are queued, queue_pageout_work()
  will refuse to queue new pageout works

- the remaining mempool reservations are available for other types of works
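
In code terms the thresholds amount to something like the below (a sketch; the
two macro names are from this patchset, while reclaim_wait() comes from the
next patch and its exact form here is assumed):

        #define WB_WORK_MEMPOOL_SIZE    1024
        #define LOTS_OF_WRITEBACK_WORKS (WB_WORK_MEMPOOL_SIZE / 8)      /* 128 */

        ret = queue_pageout_work(mapping, page);
        if (ret > LOTS_OF_WRITEBACK_WORKS)
                reclaim_wait(HZ / 10);          /* vmscan throttles itself */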

> Also, when a mempool_alloc() caller is waiting for wb_writeback_works
> to be freed, how are we dead certain that some *will* be freed?  It
> means that the mempool_alloc() caller cannot be holding any explicit or
> implicit locks which would prevent wb_writeback_works would be freed. 
> Where "implicit lock" means things like being inside ext3/4
> journal_start.

There will be no wait at work allocation time. Instead the next patch
will check the return value of queue_pageout_work() and do some
reclaim_wait() throttling when above LOTS_OF_WRITEBACK_WORKS.

That throttling wait is done in small HZ/10 timeouts. Well we'll have
to throttle it if the LRU list is full of dirty pages, even if we are
inside the implicit lock of journal_start and risk delaying other
tasks trying to take the lock...

> This stuff is tricky and is hard to get right.  Reviewing its
> correctness by staring at a patch is difficult.
> 
> >  /**
> >   * writeback_in_progress - determine whether there is writeback in progress
> >   * @bdi: the device's backing_dev_info structure.
> > @@ -129,7 +152,7 @@ __bdi_start_writeback(struct backing_dev
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
> 
> Sneaky change from GFP_ATOMIC to GFP_NOWAIT is significant, but
> undescribed?

Oh this is the legacy path, I'll change it back to GFP_ATOMIC.

> >  	if (!work) {
> >  		if (bdi->wb.task) {
> >  			trace_writeback_nowork(bdi);
> > @@ -138,6 +161,7 @@ __bdi_start_writeback(struct backing_dev
> >  		return;
> >  	}
> >  
> > +	memset(work, 0, sizeof(*work));
> >  	work->sync_mode	= WB_SYNC_NONE;
> >  	work->nr_pages	= nr_pages;
> >  	work->range_cyclic = range_cyclic;
> > @@ -187,6 +211,181 @@ void bdi_start_background_writeback(stru
> >  }
> >  
> >  /*
> > + * Check if @work already covers @offset, or try to extend it to cover @offset.
> > + * Returns true if the wb_writeback_work now encompasses the requested offset.
> > + */
> > +static bool extend_writeback_range(struct wb_writeback_work *work,
> > +				   pgoff_t offset,
> > +				   unsigned long unit)
> > +{
> > +	pgoff_t end = work->offset + work->nr_pages;
> > +
> > +	if (offset >= work->offset && offset < end)
> > +		return true;
> > +
> > +	/*
> > +	 * for sequential workloads with good locality, include up to 8 times
> > +	 * more data in one chunk
> > +	 */
> > +	if (work->nr_pages >= 8 * unit)
> > +		return false;
> 
> argh, gack, puke.  I thought I revoked your magic number license months ago!

Sorry, I thought I'd added a comment for it. Obviously it's not enough.

> Please, it's a HUGE red flag that bad things are happening.  Would the
> kernel be better or worse if we were to use 9.5 instead?  How do we
> know that "8" is optimum for all memory sizes, device bandwidths, etc?

The unit chunk size is calculated so that it costs ~10ms to write that
many pages. So 8 means we can extend the chunk up to ~80ms worth of
data. It's a good value because 80ms of data transfer time makes the
typical overhead of an 8ms disk seek look small enough.
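
A rough worked example (disk numbers assumed): at ~100MB/s,

        unit     ~ 10ms ~ 1MB ~ 256 pages
        8 * unit ~ 80ms ~ 8MB

so a typical ~8ms seek adds only ~10% on top of the 80ms of transfer, while it
would dominate the cost of writing out a single page.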

> It's a hack - it's *always* a hack.  Find a better way.

Hmm, it's more like black magic w/o explanations :(

> > +	/* the unsigned comparison helps eliminate one compare */
> > +	if (work->offset - offset < unit) {
> > +		work->nr_pages += unit;
> > +		work->offset -= unit;
> > +		return true;
> > +	}
> > +
> > +	if (offset - end < unit) {
> > +		work->nr_pages += unit;
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * schedule writeback on a range of inode pages.
> > + */
> > +static struct wb_writeback_work *
> > +alloc_queue_pageout_work(struct backing_dev_info *bdi,
> > +			 struct inode *inode,
> > +			 pgoff_t offset,
> > +			 pgoff_t len)
> > +{
> > +	struct wb_writeback_work *work;
> > +
> > +	/*
> > +	 * Grab the inode until the work is executed. We are calling this from
> > +	 * page reclaim context and the only thing pinning the address_space
> > +	 * for the moment is the page lock.
> > +	 */
> > +	if (!igrab(inode))
> > +		return ERR_PTR(-ENOENT);
> 
> uh-oh.  igrab() means iput().
> 
> ENOENT means "no such file or directory" and makes no sense in this
> context.

OK, will change to return NULL on failure.

> > +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
> > +	if (!work) {
> > +		trace_printk("wb_work_mempool alloc fail\n");
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	memset(work, 0, sizeof(*work));
> > +	work->sync_mode		= WB_SYNC_NONE;
> > +	work->inode		= inode;
> > +	work->offset		= offset;
> > +	work->nr_pages		= len;
> > +	work->reason		= WB_REASON_PAGEOUT;
> > +
> > +	bdi_queue_work(bdi, work);
> > +
> > +	return work;
> > +}
> > +
> > +/*
> > + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
> > + * improve IO throughput. The nearby pages will have good chance to reside in
> > + * the same LRU list that vmscan is working on, and even close to each other
> > + * inside the LRU list in the common case of sequential read/write.
> > + *
> > + * ret > 0: success, allocated/queued a new pageout work;
> > + *	    there are at least @ret writeback works queued now
> > + * ret = 0: success, reused/extended a previous pageout work
> > + * ret < 0: failed
> > + */
> > +int queue_pageout_work(struct address_space *mapping, struct page *page)
> > +{
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	struct inode *inode = mapping->host;
> > +	struct wb_writeback_work *work;
> > +	unsigned long write_around_pages;
> > +	pgoff_t offset = page->index;
> > +	int i = 0;
> > +	int ret = -ENOENT;
> 
> ENOENT means "no such file or directory" and makes no sense in this
> context.

OK, changed to return plain -1 on failure.

> > +	if (unlikely(!inode))
> > +		return ret;
> 
> How does this happen?

Probably will never happen... Removed. Sorry for the noise.

> > +	/*
> > +	 * piggy back 8-15ms worth of data
> > +	 */
> > +	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
> > +	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
> 
> Where did "6" come from?

It takes 1000ms to write avg_write_bandwidth worth of data; dividing
that by 64 gives ~15ms. Since there is a rounddown_pow_of_two(), the
actual range will be [8ms, 15ms]. So a shift of 6 yields a work
execution time of around 10ms (not including the possible seek time).
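
Worked example with assumed numbers: for a disk writing ~100MB/s,
avg_write_bandwidth ~ 25600 pages/s, so

        rounddown_pow_of_two(25600 + MIN_WRITEBACK_PAGES) = 16384
        16384 >> 6 = 256 pages ~ 1MB ~ 10ms of writeout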

> > +	i = 1;
> > +	spin_lock_bh(&bdi->wb_lock);
> > +	list_for_each_entry_reverse(work, &bdi->work_list, list) {
> > +		if (work->inode != inode)
> > +			continue;
> > +		if (extend_writeback_range(work, offset, write_around_pages)) {
> > +			ret = 0;
> > +			break;
> > +		}
> > +		/*
> > +		 * vmscan will slow down page reclaim when there are more than
> > +		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
> > +		 * times larger.
> > +		 */
> > +		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
> > +			break;
> 
> I'm now totally lost.  What are the units of "i"?  (And why the heck was
> it called "i" anyway?) Afaict, "i" counts the number of times we
> successfully extended the writeback range by write_around_pages?  What
> relationship does this have to yet-another-magic-number-times-two?

Ah, it's a silly mistake!  That test should be moved to the top of the
loop, so that i simply counts the number of works iterated.
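
Something like this (sketch of the corrected loop):

        i = 0;
        spin_lock_bh(&bdi->wb_lock);
        list_for_each_entry_reverse(work, &bdi->work_list, list) {
                /* limit the search depth; i counts the works iterated */
                if (++i > 2 * LOTS_OF_WRITEBACK_WORKS)
                        break;
                if (work->inode != inode)
                        continue;
                if (extend_writeback_range(work, offset, write_around_pages)) {
                        ret = 0;
                        break;
                }
        }
        spin_unlock_bh(&bdi->wb_lock);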

> Please have a think about how to make the code comprehensible to (and
> hence maintainable by) others?

OK!

> > +	}
> > +	spin_unlock_bh(&bdi->wb_lock);
> > +
> > +	if (ret) {
> > +		ret = i;
> > +		offset = round_down(offset, write_around_pages);
> > +		work = alloc_queue_pageout_work(bdi, inode,
> > +						offset, write_around_pages);
> > +		if (IS_ERR(work))
> > +			ret = PTR_ERR(work);
> > +	}
> 
> Need a comment over this code section.  afacit it would be something
> like "if we failed to add pages to an existing wb_writeback_work then
> allocate and queue a new one".

OK, added the comment.

> > +	return ret;
> > +}
> > +
> > +static void wb_free_work(struct wb_writeback_work *work)
> > +{
> > +	if (work->inode)
> > +		iput(work->inode);
> 
> And here is where at least two previous attempts to perform
> address_space-based writearound within direct reclaim have come
> unstuck.
> 
> Occasionally, iput() does a huge amount of stuff: when it is the final
> iput() on the inode.  iirc this can include taking tons of VFS locks,
> truncating files, starting (and perhaps waiting upon) journal commits,
> etc.  I forget, but it's a *lot*.

Yeah, Jan also reminded me of that iput().

> And quite possibly you haven't tested this at all, because it's pretty
> rare for an iput() like this to be the final one.

That's true.

> Let me give you an example to worry about: suppose code which holds fs
> locks calls into direct reclaim and then calls
> mempool_alloc(wb_work_mempool) and that mempool_alloc() has to wait for
> an item to be returned.

Nope, there will be no waits or memory allocations when called
from vmscan. That's one purpose of this test in wb_work_alloc():

        	if (current->flags & PF_MEMALLOC)
        		return NULL;

Actually __alloc_pages_slowpath() also tests PF_MEMALLOC in addition to
__GFP_WAIT, and avoids direct page reclaim when PF_MEMALLOC is set or
__GFP_WAIT is missing. So for the purpose of avoiding waits and
recursions, GFP_NOWAIT should be enough.

> But no items will ever be returned, because
> all the threads which own wb_writeback_works are stuck in
> wb_free_work->iput, trying to take an fs lock which is still held by
> the now-blocked direct-reclaim caller.

Since we choose to fail mempool_alloc() immediately when called from
vmscan, it's free from VM deadlock/recursion problems.

> And a billion similar scenarios :( The really nasty thing about this is
> that it is very rare for this iput() to be a final iput(), so it's hard
> to get code coverage.
> 
> Please have a think about all of this and see if you can demonstrate
> how the iput() here is guaranteed safe.

iput() does bring new possibilities/sources of deadlock.

The iput() may at times block the flusher for a while. But it's called
without taking any other locks and there are no page reclaim recursions.
So the only possibility of an implicit deadlock is

        iput => fs code => queue some writeback work and wait on it

The only two wait_for_completion() calls are:
- sync_inodes_sb
- writeback_inodes_sb_nr

Quick in-kernel code audits show no sign of them being triggered by an
iput() call. But the possibility always exists, now and in the future.
If really necessary, it can be reliably caught and avoided with

static void bdi_queue_work(...)
 {     
        trace_writeback_queue(bdi, work);

+       if (unlikely(current == bdi->wb.task ||
+                    current == default_backing_dev_info.wb.task)) {
+               WARN_ON_ONCE(1); /* recursion; deadlock if ->done is set */
+               if (work->done)
+                       complete(work->done);
+               return;
+       }
+      


> > +	/*
> > +	 * Notify the caller of completion if this is a synchronous
> > +	 * work item, otherwise just free it.
> > +	 */
> > +	if (work->done)
> > +		complete(work->done);
> > +	else
> > +		mempool_free(work, wb_work_mempool);
> > +}
> > +
> > +/*
> > + * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
> > + */
> > +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> > +				struct super_block *sb)
> > +{
> > +	struct wb_writeback_work *work, *tmp;
> > +	LIST_HEAD(dispose);
> > +
> > +	spin_lock_bh(&bdi->wb_lock);
> > +	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
> > +		if (sb) {
> > +			if (work->sb && work->sb != sb)
> > +				continue;
> 
> What does it mean when wb_writeback_work.sb==NULL?  This reader doesn't
> know, hence he can't understand (or review) this code.  Perhaps
> describe it here, unless it is well described elsewhere?

OK.

WB_REASON_LAPTOP_TIMER, WB_REASON_FREE_MORE_MEM, and some WB_REASON_SYNC
callers will queue works with wb_writeback_work.sb==NULL. They only care
about knocking down the bdi dirty pages.

> > +			if (work->inode && work->inode->i_sb != sb)
> 
> And what is the meaning of wb_writeback_work.inode==NULL?

wb_writeback_work.inode != NULL, iff it's a pageout work.

> Seems odd that wb_writeback_work.sb exists, when it is accessible via
> wb_writeback_work.inode->i_sb.

Yeah this is twisted. I should have set pageout work's

        wb_writeback_work.sb = inode->i_sb

and the loop will then look straightforward:

        list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
                if (sb && work->sb && sb != work->sb)
                        continue;
                list_move(&work->list, &dispose);
        }

> As a person who reviews a huge amount of code, I can tell you this: the
> key to understanding code is to understand the data structures and the
> relationship between their fields and between different data
> structures.  Including lifetime rules, locking rules and hidden
> information such as "what does it mean when wb_writeback_work.sb is
> NULL".  Once one understands all this about the data structures, the
> code becomes pretty obvious and bugs can be spotted and fixed.  But
> alas, wb_writeback_work is basically undocumented.

Good to know that! I'll document it with the above comments.
 
> > +				continue;
> > +		}
> > +		list_move(&work->list, &dispose);
> 
> So here we have queued for disposal a) works which refer to an inode on
> sb and b) works which have ->sb==NULL and ->inode==NULL.  I don't know
> whether the b) type exist.

b) exists for the works queued by __bdi_start_writeback(), whose
callers may be laptop mode, free_more_memory() and the sync syscall.

> > +	}
> > +	spin_unlock_bh(&bdi->wb_lock);
> > +
> > +	while (!list_empty(&dispose)) {
> > +		work = list_entry(dispose.next,
> > +				  struct wb_writeback_work, list);
> > +		list_del_init(&work->list);
> > +		wb_free_work(work);
> > +	}
> 
> You should be able to do this operation without writing to all the
> list_heads: no list_del(), no list_del_init().

Good point! Here is the better form (using the _safe iterator, since
wb_free_work() may free or complete the entry we are standing on):

        list_for_each_entry_safe(work, tmp, &dispose, list)
                wb_free_work(work);

> > +}
> > +
> > +/*
> >   * Remove the inode from the writeback list it is on.
> >   */
> >  void inode_wb_list_del(struct inode *inode)
> > @@ -833,6 +1032,21 @@ static unsigned long get_nr_dirty_pages(
> >  		get_nr_dirty_inodes();
> >  }
> >  
> > +static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
> > +{
> > +	struct writeback_control wbc = {
> > +		.sync_mode = WB_SYNC_NONE,
> > +		.nr_to_write = LONG_MAX,
> > +		.range_start = work->offset << PAGE_CACHE_SHIFT,
> 
> I think this will give you a 32->64 bit overflow on 32-bit machines.

Oops, good catch!
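
One possible fix (sketch): force the shift to be done in 64 bits, e.g.

        .range_start = (loff_t)work->offset << PAGE_CACHE_SHIFT,

and the same cast for range_end below.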
 
> > +		.range_end = (work->offset + work->nr_pages - 1)
> > +						<< PAGE_CACHE_SHIFT,
> 
> Ditto.
> 
> Please include this in a patchset sometime ;)
> 
> --- a/include/linux/writeback.h~a
> +++ a/include/linux/writeback.h
> @@ -64,7 +64,7 @@ struct writeback_control {
>  	long pages_skipped;		/* Pages which were not written */
>  
>  	/*
> -	 * For a_ops->writepages(): is start or end are non-zero then this is
> +	 * For a_ops->writepages(): if start or end are non-zero then this is
>  	 * a hint that the filesystem need only write out the pages inside that
>  	 * byterange.  The byte at `end' is included in the writeout request.
>  	 */

Thanks, queued as a trivial patch.
 
> > +	};
> > +
> > +	do_writepages(work->inode->i_mapping, &wbc);
> > +
> > +	return LONG_MAX - wbc.nr_to_write;
> > +}
> 
> <infers the return semantics from the code> It took a while.  Peeking
> at the caller helped.

Hmm I'll add this comment:

/*
 * Clean pages for page reclaim. Returns the number of pages written.
 */
static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
 
> >  static long wb_check_background_flush(struct bdi_writeback *wb)
> >  {
> >  	if (over_bground_thresh(wb->bdi)) {
> > @@ -905,16 +1119,12 @@ long wb_do_writeback(struct bdi_writebac
> >  
> >  		trace_writeback_exec(bdi, work);
> >  
> > -		wrote += wb_writeback(wb, work);
> > -
> > -		/*
> > -		 * Notify the caller of completion if this is a synchronous
> > -		 * work item, otherwise just free it.
> > -		 */
> > -		if (work->done)
> > -			complete(work->done);
> > +		if (!work->inode)
> > +			wrote += wb_writeback(wb, work);
> >  		else
> > -			kfree(work);
> > +			wrote += wb_pageout(wb, work);
> > +
> > +		wb_free_work(work);
> >  	}
> >  
> >  	/*
> >
> > ...
> >
> > --- linux.orig/include/linux/backing-dev.h	2012-02-28 19:07:06.081064464 +0800
> > +++ linux/include/linux/backing-dev.h	2012-02-28 19:07:07.281064493 +0800
> > @@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
> >  void bdi_arm_supers_timer(void);
> >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> >  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
> > +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> > +				struct super_block *sb);
> >  
> >  extern spinlock_t bdi_lock;
> >  extern struct list_head bdi_list;
> > --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> > +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> > @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
> >  			nr_dirty++;
> >  
> >  			/*
> > -			 * Only kswapd can writeback filesystem pages to
> > -			 * avoid risk of stack overflow but do not writeback
> > -			 * unless under significant pressure.
> > +			 * Pages may be dirtied anywhere inside the LRU. This
> > +			 * ensures they undergo a full period of LRU iteration
> > +			 * before considering pageout. The intention is to
> > +			 * delay writeout to the flusher thread, unless when
> > +			 * run into a long segment of dirty pages.
> > +			 */
> > +			if (references == PAGEREF_RECLAIM_CLEAN &&
> > +			    priority == DEF_PRIORITY)
> > +				goto keep_locked;
> > +
> > +			/*
> > +			 * Try relaying the pageout I/O to the flusher threads
> > +			 * for better I/O efficiency and avoid stack overflow.
> >  			 */
> > -			if (page_is_file_cache(page) &&
> > -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > +			if (page_is_file_cache(page) && mapping &&
> > +			    queue_pageout_work(mapping, page) >= 0) {
> >  				/*
> >  				 * Immediately reclaim when written back.
> >  				 * Similar in principal to deactivate_page()
> > @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
> >  				goto keep_locked;
> >  			}
> >  
> > -			if (references == PAGEREF_RECLAIM_CLEAN)
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow.
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd())
> 
> And here we run into big problems.
> 
> When a page-allocator enters direct reclaim, that process is trying to
> allocate a page from a particular zone (or set of zones).  For example,
> he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
> off and write back three gigabytes of ZONE_HIGHMEM is pointless,
> inefficient and doesn't fix the caller's problem at all.
> 
> This has always been the biggest problem with the
> avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
> as I've read) doesn't address the problem at all and appears to be
> blissfully unaware of its existence.

I'm fully aware of that problem, and this patch actually greatly
reduces the chances that the avoid-writeback-from-direct-reclaim code is
executed. Typically we'll be able to queue a pageout work for the exact
target page we want to reclaim.

> I've attempted versions of this I think twice, and thrown the patches
> away in disgust.  One approach I tried was, within direct reclaim, to
> grab the page I wanted (ie: one which is in one of the caller's desired
> zones) and to pass that page over to the kernel threads.  The kernel
> threads would ensure that this particular page was included in the
> writearound preparation.  So that we at least make *some* progress
> toward what the caller is asking us to do.
> 
> iirc, the way I "grabbed" the page was to actually lock it, with
> [try_]_lock_page().  And unlock it again way over within the writeback
> thread.  I forget why I did it this way, rather than get_page() or
> whatever.  Locking the page is a good way of preventing anyone else
> from futzing with it.  It also pins the inode, which perhaps meant that
> with careful management, I could avoid the igrab()/iput() horrors
> discussed above.

The page lock will unfortunately be a blocking point for btrfs
lock_delalloc_pages(). I hit this when an earlier proof-of-concept
patch did simple congestion_wait()s while holding the page lock. Even
though only one or two pages were locked system wide (albeit for a
longer time than before), btrfs was impacted a lot.

So igrab()/iput() may still be the best option despite the complex fs
dependencies it introduces.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
@ 2012-02-29 13:28       ` Fengguang Wu
  0 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-29 13:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 04:04:03PM -0800, Andrew Morton wrote:
> On Tue, 28 Feb 2012 22:00:27 +0800 Fengguang Wu <fengguang.wu@intel.com> wrote:

> > ---
> >  fs/fs-writeback.c                |  230 +++++++++++++++++++++++++++--
> >  fs/super.c                       |    1 
> >  include/linux/backing-dev.h      |    2 
> >  include/linux/writeback.h        |   16 +-
> >  include/trace/events/writeback.h |   12 +
> >  mm/vmscan.c                      |   36 ++--
> >  6 files changed, 268 insertions(+), 29 deletions(-)
> > 
> > --- linux.orig/fs/fs-writeback.c	2012-02-28 19:07:06.109064465 +0800
> > +++ linux/fs/fs-writeback.c	2012-02-28 19:07:07.277064493 +0800
> > @@ -41,6 +41,8 @@ struct wb_writeback_work {
> >  	long nr_pages;
> >  	struct super_block *sb;
> >  	unsigned long *older_than_this;
> > +	struct inode *inode;
> > +	pgoff_t offset;
> 
> Please document `offset' here.  What is it used for?

It's the start offset to writeout. It seems better to name it "start"
and comment as below

        /*
         * When (inode != NULL), it's a pageout work for cleaning the
         * inode pages from start to start+nr_pages.
         */              
        struct inode *inode;
        pgoff_t start;  

> >  	enum writeback_sync_modes sync_mode;
> >  	unsigned int tagged_writepages:1;
> >  	unsigned int for_kupdate:1;
> > @@ -57,6 +59,27 @@ struct wb_writeback_work {
> >   */
> >  int nr_pdflush_threads;
> >  
> > +static mempool_t *wb_work_mempool;
> > +
> > +static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
> 
> The gfp_mask is traditionally the last function argument.

Yup, but it's the prototype defined by "mempool_alloc_t *alloc_fn".

> > +{
> > +	/*
> > +	 * alloc_queue_pageout_work() will be called on page reclaim
> > +	 */
> > +	if (current->flags & PF_MEMALLOC)
> > +		return NULL;
> 
> Do we need to test current->flags here?

Yes, the intention is to avoid page allocation in the page reclaim path.

Also, if the pool of 1024 writeback works are exhausted, just fail new
pageout requests. Normally it does not need too many IOs in flight to
saturate a disk. Limiting the queue size helps reduce the in-queue
delays.

> Could we have checked !(gfp_mask & __GFP_IO) and/or __GFP_FILE?

mempool_alloc() will call into this callback with
(__GFP_WAIT|__GFP_IO) removed from gfp_mask. So no need to check them.

> I'm not really suggsting such a change - just trying to get my head
> around how this stuff works..
> 
> > +	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
> > +}
> > +
> > +static __init int wb_work_init(void)
> > +{
> > +	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
> > +					 wb_work_alloc, mempool_kfree, NULL);
> > +	return wb_work_mempool ? 0 : -ENOMEM;
> > +}
> > +fs_initcall(wb_work_init);
> 
> Please provide a description of the wb_writeback_work lifecycle: when
> they are allocated, when they are freed, how we ensure that a finite
> number are in flight.

The wb_writeback_work is created (and hence auto destroyed) either on
stack, or dynamically allocated from the mempool. The implicit rule
is: the caller shall allocate wb_writeback_work on stack iif it want
to wait for completion of the work.

The work is then queued into bdi->work_list, where the flusher picks
up one wb_writeback_work at a time, dequeue, execute and finally
either free it (mempool allocated) or wake up the caller (on stack).

The queue size is generally not limited because it was not a problem
at all. However it's now possible for vmscan to queue lots of pageout
works in short time. So the below limiting rules are applied:

- when LOTS_OF_WRITEBACK_WORKS = WB_WORK_MEMPOOL_SIZE / 8 = 128 works
  are queued, vmscan should start throttling itself (to the rate the
  flusher can consume pageout works).

- when 2 * LOTS_OF_WRITEBACK_WORKS wb_writeback_work are queued, will
  refuse to queue new pageout works

- the remaining mempool reservations are available for other types of works

> Also, when a mempool_alloc() caller is waiting for wb_writeback_works
> to be freed, how are we dead certain that some *will* be freed?  It
> means that the mempool_alloc() caller cannot be holding any explicit or
> implicit locks which would prevent wb_writeback_works would be freed. 
> Where "implicit lock" means things like being inside ext3/4
> journal_start.

There will be no wait at work allocation time. Instead the next patch
will check the return value of queue_pageout_work() and do some
reclaim_wait() throttling when above LOTS_OF_WRITEBACK_WORKS.

That throttling wait is done in small HZ/10 timeouts. Well we'll have
to throttle it if the LRU list is full of dirty pages, even if we are
inside the implicit lock of journal_start and risk delaying other
tasks trying to take the lock...

> This stuff is tricky and is hard to get right.  Reviewing its
> correctness by staring at a patch is difficult.
> 
> >  /**
> >   * writeback_in_progress - determine whether there is writeback in progress
> >   * @bdi: the device's backing_dev_info structure.
> > @@ -129,7 +152,7 @@ __bdi_start_writeback(struct backing_dev
> >  	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> >  	 * wakeup the thread for old dirty data writeback
> >  	 */
> > -	work = kzalloc(sizeof(*work), GFP_ATOMIC);
> > +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
> 
> Sneaky change from GFP_ATOMIC to GFP_NOWAIT is significant, but
> undescribed?

Oh this is the legacy path, I'll change it back to GFP_ATOMIC.

> >  	if (!work) {
> >  		if (bdi->wb.task) {
> >  			trace_writeback_nowork(bdi);
> > @@ -138,6 +161,7 @@ __bdi_start_writeback(struct backing_dev
> >  		return;
> >  	}
> >  
> > +	memset(work, 0, sizeof(*work));
> >  	work->sync_mode	= WB_SYNC_NONE;
> >  	work->nr_pages	= nr_pages;
> >  	work->range_cyclic = range_cyclic;
> > @@ -187,6 +211,181 @@ void bdi_start_background_writeback(stru
> >  }
> >  
> >  /*
> > + * Check if @work already covers @offset, or try to extend it to cover @offset.
> > + * Returns true if the wb_writeback_work now encompasses the requested offset.
> > + */
> > +static bool extend_writeback_range(struct wb_writeback_work *work,
> > +				   pgoff_t offset,
> > +				   unsigned long unit)
> > +{
> > +	pgoff_t end = work->offset + work->nr_pages;
> > +
> > +	if (offset >= work->offset && offset < end)
> > +		return true;
> > +
> > +	/*
> > +	 * for sequential workloads with good locality, include up to 8 times
> > +	 * more data in one chunk
> > +	 */
> > +	if (work->nr_pages >= 8 * unit)
> > +		return false;
> 
> argh, gack, puke.  I thought I revoked your magic number license months ago!

Sorry, I thought I've added comment for it. Obviously it's not enough.

> Please, it's a HUGE red flag that bad things are happening.  Would the
> kernel be better or worse if we were to use 9.5 instead?  How do we
> know that "8" is optimum for all memory sizes, device bandwidths, etc?

The unit chunk size is calculated so that it costs ~10ms to write so
many pages. So 8 means we can extend it up to 80ms. It's a good value
because 80ms data transfer time makes the typical overheads of 8ms
disk seek time look small enough.

> It's a hack - it's *always* a hack.  Find a better way.

Hmm, it's more like black magic w/o explanations :(

> > +	/* the unsigned comparison helps eliminate one compare */
> > +	if (work->offset - offset < unit) {
> > +		work->nr_pages += unit;
> > +		work->offset -= unit;
> > +		return true;
> > +	}
> > +
> > +	if (offset - end < unit) {
> > +		work->nr_pages += unit;
> > +		return true;
> > +	}
> > +
> > +	return false;
> > +}
> > +
> > +/*
> > + * schedule writeback on a range of inode pages.
> > + */
> > +static struct wb_writeback_work *
> > +alloc_queue_pageout_work(struct backing_dev_info *bdi,
> > +			 struct inode *inode,
> > +			 pgoff_t offset,
> > +			 pgoff_t len)
> > +{
> > +	struct wb_writeback_work *work;
> > +
> > +	/*
> > +	 * Grab the inode until the work is executed. We are calling this from
> > +	 * page reclaim context and the only thing pinning the address_space
> > +	 * for the moment is the page lock.
> > +	 */
> > +	if (!igrab(inode))
> > +		return ERR_PTR(-ENOENT);
> 
> uh-oh.  igrab() means iput().
> 
> ENOENT means "no such file or directory" and makes no sense in this
> context.

OK, will change to return NULL on failure.

> > +	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
> > +	if (!work) {
> > +		trace_printk("wb_work_mempool alloc fail\n");
> > +		return ERR_PTR(-ENOMEM);
> > +	}
> > +
> > +	memset(work, 0, sizeof(*work));
> > +	work->sync_mode		= WB_SYNC_NONE;
> > +	work->inode		= inode;
> > +	work->offset		= offset;
> > +	work->nr_pages		= len;
> > +	work->reason		= WB_REASON_PAGEOUT;
> > +
> > +	bdi_queue_work(bdi, work);
> > +
> > +	return work;
> > +}
> > +
> > +/*
> > + * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
> > + * improve IO throughput. The nearby pages will have good chance to reside in
> > + * the same LRU list that vmscan is working on, and even close to each other
> > + * inside the LRU list in the common case of sequential read/write.
> > + *
> > + * ret > 0: success, allocated/queued a new pageout work;
> > + *	    there are at least @ret writeback works queued now
> > + * ret = 0: success, reused/extended a previous pageout work
> > + * ret < 0: failed
> > + */
> > +int queue_pageout_work(struct address_space *mapping, struct page *page)
> > +{
> > +	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > +	struct inode *inode = mapping->host;
> > +	struct wb_writeback_work *work;
> > +	unsigned long write_around_pages;
> > +	pgoff_t offset = page->index;
> > +	int i = 0;
> > +	int ret = -ENOENT;
> 
> ENOENT means "no such file or directory" and makes no sense in this
> context.

OK, changed to return plain -1 when failed.

> > +	if (unlikely(!inode))
> > +		return ret;
> 
> How does this happen?

Probably will never happen... Removed. Sorry for the noise.

> > +	/*
> > +	 * piggy back 8-15ms worth of data
> > +	 */
> > +	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
> > +	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
> 
> Where did "6" come from?

It takes 1000ms to write avg_write_bandwidth data, dividing that by
64, we get 15ms. Since there is a rounddown_pow_of_two(), the actual
range will be [8ms, 15ms]. So a shift of 6 yields work execution
time of around 10ms (not including the possible seek time).

> > +	i = 1;
> > +	spin_lock_bh(&bdi->wb_lock);
> > +	list_for_each_entry_reverse(work, &bdi->work_list, list) {
> > +		if (work->inode != inode)
> > +			continue;
> > +		if (extend_writeback_range(work, offset, write_around_pages)) {
> > +			ret = 0;
> > +			break;
> > +		}
> > +		/*
> > +		 * vmscan will slow down page reclaim when there are more than
> > +		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
> > +		 * times larger.
> > +		 */
> > +		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
> > +			break;
> 
> I'm now totally lost.  What are the units of "i"?  (And why the heck was
> it called "i" anyway?) Afaict, "i" counts the number of times we
> successfully extended the writeback range by write_around_pages?  What
> relationship does this have to yet-another-magic-number-times-two?

Ah, it's a silly mistake!  That test should be moved to the top of the
loop, so that i simply counts the number of works iterated.
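
I.e. something like the below sketch, which is what the loop becomes in
the v2 patch further down:

        spin_lock_bh(&bdi->wb_lock);
        list_for_each_entry_reverse(work, &bdi->work_list, list) {
                /* limit search depth; i counts the works iterated */
                if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
                        break;
                if (work->inode != inode)
                        continue;
                if (extend_writeback_range(work, offset, write_around_pages)) {
                        ret = 0;
                        break;
                }
        }
        spin_unlock_bh(&bdi->wb_lock);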

> Please have a think about how to make the code comprehensible to (and
> hence maintainable by) others?

OK!

> > +	}
> > +	spin_unlock_bh(&bdi->wb_lock);
> > +
> > +	if (ret) {
> > +		ret = i;
> > +		offset = round_down(offset, write_around_pages);
> > +		work = alloc_queue_pageout_work(bdi, inode,
> > +						offset, write_around_pages);
> > +		if (IS_ERR(work))
> > +			ret = PTR_ERR(work);
> > +	}
> 
> Need a comment over this code section.  afacit it would be something
> like "if we failed to add pages to an existing wb_writeback_work then
> allocate and queue a new one".

OK, added the comment.

> > +	return ret;
> > +}
> > +
> > +static void wb_free_work(struct wb_writeback_work *work)
> > +{
> > +	if (work->inode)
> > +		iput(work->inode);
> 
> And here is where at least two previous attempts to perform
> address_space-based writearound within direct reclaim have come
> unstuck.
> 
> Occasionally, iput() does a huge amount of stuff: when it is the final
> iput() on the inode.  iirc this can include taking tons of VFS locks,
> truncating files, starting (and perhaps waiting upon) journal commits,
> etc.  I forget, but it's a *lot*.

Yeah, Jan also reminded me of that iput().

> And quite possibly you haven't tested this at all, because it's pretty
> rare for an iput() like this to be the final one.

That's true.

> Let me give you an example to worry about: suppose code which holds fs
> locks calls into direct reclaim and then calls
> mempool_alloc(wb_work_mempool) and that mempool_alloc() has to wait for
> an item to be returned.

Nope, there will be no waits or memory allocations when called
from vmscan. That's one purpose of this test in wb_work_alloc():

        	if (current->flags & PF_MEMALLOC)
        		return NULL;

Actually __alloc_pages_slowpath() also tests PF_MEMALLOC in addition to
__GFP_WAIT and avoids direct page reclaim when either is set. So for the
purpose of avoiding waits and recursions, GFP_NOWAIT should be enough.

> But no items will ever be returned, because
> all the threads which own wb_writeback_works are stuck in
> wb_free_work->iput, trying to take an fs lock which is still held by
> the now-blocked direct-reclaim caller.

Since we choose to fail mempool_alloc() immediately when called from
vmscan, it's free from VM deadlock/recursion problems.

> And a billion similar scenarios :( The really nasty thing about this is
> that it is very rare for this iput() to be a final iput(), so it's hard
> to get code coverage.
> 
> Please have a think about all of this and see if you can demonstrate
> how the iput() here is guaranteed safe.

iput() does bring new possibilities/sources of deadlock.

The iput() may at times block the flusher for a while. But it's called
without taking any other locks and there are no page reclaim recursions.
So the only possibility of implicit deadlock is

        iput => fs code => queue some writeback work and wait on it

The only two wait_for_completion() calls are:
- sync_inodes_sb
- writeback_inodes_sb_nr

Quick in-kernel code audits show no sign of them being triggered by
some iput() call. But the possibility always exists, now and in the
future. If really necessary, it can be reliably caught and avoided with

static void bdi_queue_work(...)
 {     
        trace_writeback_queue(bdi, work);

+       if (unlikely(current == bdi->wb.task ||
+                    current == default_backing_dev_info.wb.task)) {
+               WARN_ON_ONCE(1); /* recursion; deadlock if ->done is set */
+               if (work->done)
+                       complete(work->done);
+               return;
+       }
+      


> > +	/*
> > +	 * Notify the caller of completion if this is a synchronous
> > +	 * work item, otherwise just free it.
> > +	 */
> > +	if (work->done)
> > +		complete(work->done);
> > +	else
> > +		mempool_free(work, wb_work_mempool);
> > +}
> > +
> > +/*
> > + * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
> > + */
> > +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> > +				struct super_block *sb)
> > +{
> > +	struct wb_writeback_work *work, *tmp;
> > +	LIST_HEAD(dispose);
> > +
> > +	spin_lock_bh(&bdi->wb_lock);
> > +	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
> > +		if (sb) {
> > +			if (work->sb && work->sb != sb)
> > +				continue;
> 
> What does it mean when wb_writeback_work.sb==NULL?  This reader doesn't
> know, hence he can't understand (or review) this code.  Perhaps
> describe it here, unless it is well described elsewhere?

OK.

WB_REASON_LAPTOP_TIMER, WB_REASON_FREE_MORE_MEM, and some WB_REASON_SYNC
callers will queue works with wb_writeback_work.sb==NULL. They only care
about knocking down the bdi dirty pages.

> > +			if (work->inode && work->inode->i_sb != sb)
> 
> And what is the meaning of wb_writeback_work.inode==NULL?

wb_writeback_work.inode != NULL, iff it's a pageout work.

> Seems odd that wb_writeback_work.sb exists, when it is accessible via
> wb_writeback_work.inode->i_sb.

Yeah this is twisted. I should have set pageout work's

        wb_writeback_work.sb = inode->i_sb

and the loop will then look straightforward:

        list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
                if (sb && work->sb && sb != work->sb)
                        continue;
                list_move(&work->list, &dispose);
        }

> As a person who reviews a huge amount of code, I can tell you this: the
> key to understanding code is to understand the data structures and the
> relationship between their fields and between different data
> structures.  Including lifetime rules, locking rules and hidden
> information such as "what does it mean when wb_writeback_work.sb is
> NULL".  Once one understands all this about the data structures, the
> code becomes pretty obvious and bugs can be spotted and fixed.  But
> alas, wb_writeback_work is basically undocumented.

Good to know that! I'll document it with the above comments.
 
> > +				continue;
> > +		}
> > +		list_move(&work->list, &dispose);
> 
> So here we have queued for disposal a) works which refer to an inode on
> sb and b) works which have ->sb==NULL and ->inode==NULL.  I don't know
> whether the b) type exist.

b) exists for the works queued by __bdi_start_writeback(), whose
callers may be laptop mode, free_more_memory() and the sync syscall.

> > +	}
> > +	spin_unlock_bh(&bdi->wb_lock);
> > +
> > +	while (!list_empty(&dispose)) {
> > +		work = list_entry(dispose.next,
> > +				  struct wb_writeback_work, list);
> > +		list_del_init(&work->list);
> > +		wb_free_work(work);
> > +	}
> 
> You should be able to do this operation without writing to all the
> list_heads: no list_del(), no list_del_init().

Good point! Here is the better form:

        list_for_each_entry_safe(work, tmp, &dispose, list)
                wb_free_work(work);

> > +}
> > +
> > +/*
> >   * Remove the inode from the writeback list it is on.
> >   */
> >  void inode_wb_list_del(struct inode *inode)
> > @@ -833,6 +1032,21 @@ static unsigned long get_nr_dirty_pages(
> >  		get_nr_dirty_inodes();
> >  }
> >  
> > +static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
> > +{
> > +	struct writeback_control wbc = {
> > +		.sync_mode = WB_SYNC_NONE,
> > +		.nr_to_write = LONG_MAX,
> > +		.range_start = work->offset << PAGE_CACHE_SHIFT,
> 
> I think this will give you a 32->64 bit overflow on 32-bit machines.

Oops, good catch!
 
> > +		.range_end = (work->offset + work->nr_pages - 1)
> > +						<< PAGE_CACHE_SHIFT,
> 
> Ditto.
> 
> Please include this in a patchset sometime ;)
> 
> --- a/include/linux/writeback.h~a
> +++ a/include/linux/writeback.h
> @@ -64,7 +64,7 @@ struct writeback_control {
>  	long pages_skipped;		/* Pages which were not written */
>  
>  	/*
> -	 * For a_ops->writepages(): is start or end are non-zero then this is
> +	 * For a_ops->writepages(): if start or end are non-zero then this is
>  	 * a hint that the filesystem need only write out the pages inside that
>  	 * byterange.  The byte at `end' is included in the writeout request.
>  	 */

Thanks, queued as a trivial patch.
 
> > +	};
> > +
> > +	do_writepages(work->inode->i_mapping, &wbc);
> > +
> > +	return LONG_MAX - wbc.nr_to_write;
> > +}
> 
> <infers the return semantics from the code> It took a while.  Peeking
> at the caller helped.

Hmm I'll add this comment:

/*
 * Clean pages for page reclaim. Returns the number of pages written.
 */
static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
 
> >  static long wb_check_background_flush(struct bdi_writeback *wb)
> >  {
> >  	if (over_bground_thresh(wb->bdi)) {
> > @@ -905,16 +1119,12 @@ long wb_do_writeback(struct bdi_writebac
> >  
> >  		trace_writeback_exec(bdi, work);
> >  
> > -		wrote += wb_writeback(wb, work);
> > -
> > -		/*
> > -		 * Notify the caller of completion if this is a synchronous
> > -		 * work item, otherwise just free it.
> > -		 */
> > -		if (work->done)
> > -			complete(work->done);
> > +		if (!work->inode)
> > +			wrote += wb_writeback(wb, work);
> >  		else
> > -			kfree(work);
> > +			wrote += wb_pageout(wb, work);
> > +
> > +		wb_free_work(work);
> >  	}
> >  
> >  	/*
> >
> > ...
> >
> > --- linux.orig/include/linux/backing-dev.h	2012-02-28 19:07:06.081064464 +0800
> > +++ linux/include/linux/backing-dev.h	2012-02-28 19:07:07.281064493 +0800
> > @@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
> >  void bdi_arm_supers_timer(void);
> >  void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
> >  void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
> > +void bdi_remove_writeback_works(struct backing_dev_info *bdi,
> > +				struct super_block *sb);
> >  
> >  extern spinlock_t bdi_lock;
> >  extern struct list_head bdi_list;
> > --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> > +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> > @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
> >  			nr_dirty++;
> >  
> >  			/*
> > -			 * Only kswapd can writeback filesystem pages to
> > -			 * avoid risk of stack overflow but do not writeback
> > -			 * unless under significant pressure.
> > +			 * Pages may be dirtied anywhere inside the LRU. This
> > +			 * ensures they undergo a full period of LRU iteration
> > +			 * before considering pageout. The intention is to
> > +			 * delay writeout to the flusher thread, unless when
> > +			 * run into a long segment of dirty pages.
> > +			 */
> > +			if (references == PAGEREF_RECLAIM_CLEAN &&
> > +			    priority == DEF_PRIORITY)
> > +				goto keep_locked;
> > +
> > +			/*
> > +			 * Try relaying the pageout I/O to the flusher threads
> > +			 * for better I/O efficiency and avoid stack overflow.
> >  			 */
> > -			if (page_is_file_cache(page) &&
> > -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > +			if (page_is_file_cache(page) && mapping &&
> > +			    queue_pageout_work(mapping, page) >= 0) {
> >  				/*
> >  				 * Immediately reclaim when written back.
> >  				 * Similar in principal to deactivate_page()
> > @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
> >  				goto keep_locked;
> >  			}
> >  
> > -			if (references == PAGEREF_RECLAIM_CLEAN)
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow.
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd())
> 
> And here we run into big problems.
> 
> When a page-allocator enters direct reclaim, that process is trying to
> allocate a page from a particular zone (or set of zones).  For example,
> he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
> off and write back three gigabytes of ZONE_HIGHMEM is pointless,
> inefficient and doesn't fix the caller's problem at all.
> 
> This has always been the biggest problem with the
> avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
> as I've read) doesn't address the problem at all and appears to be
> blissfully unaware of its existence.

I'm fully aware of that problem, and this patch actually greatly
reduces the chances that the avoid-writeback-from-direct-reclaim code
is executed. Typically we'll be able to queue pageout work for the
exact target page we want to reclaim.

> I've attempted versions of this I think twice, and thrown the patches
> away in disgust.  One approach I tried was, within direct reclaim, to
> grab the page I wanted (ie: one which is in one of the caller's desired
> zones) and to pass that page over to the kernel threads.  The kernel
> threads would ensure that this particular page was included in the
> writearound preparation.  So that we at least make *some* progress
> toward what the caller is asking us to do.
> 
> iirc, the way I "grabbed" the page was to actually lock it, with
> [try_]_lock_page().  And unlock it again way over within the writeback
> thread.  I forget why I did it this way, rather than get_page() or
> whatever.  Locking the page is a good way of preventing anyone else
> from futzing with it.  It also pins the inode, which perhaps meant that
> with careful management, I could avoid the igrab()/iput() horrors
> discussed above.

The page lock will unfortunately be a blocking point for btrfs
lock_delalloc_pages(). I hit this when an earlier proof-of-concept
patch did simple congestion_wait()s while holding the page lock. Even
though only one or two pages are locked system wide (albeit for a
longer time than before), btrfs is impacted a lot.

So igrab()/iput() may still be the best option, despite the complex fs
dependencies it introduces.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v2 5/9] writeback: introduce the pageout work
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-02-29 13:51     ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-02-29 13:51 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

This relays file pageout IOs to the flusher threads.

It's much more important now that page reclaim generally does not
write out filesystem-backed pages.

The ultimate target is to gracefully handle the LRU lists pressured by
dirty/writeback pages. In particular, problems (1-2) are addressed here.

1) I/O efficiency

The flusher will piggy back the nearby ~10ms worth of dirty pages for I/O.

This takes advantage of the time/spatial locality in most workloads: the
nearby pages of one file are typically populated into the LRU at the same
time, hence will likely be close to each other in the LRU list. Writing
them in one shot helps clean more pages effectively for page reclaim.

For the common dd-style sequential writes that have excellent locality,
up to ~128ms worth of data will be written around by the pageout work,
which helps make I/O performance very close to that of the background
writeback.

2) writeback work coordinations

To avoid memory allocations at page reclaim, a mempool for struct
wb_writeback_work is created.

wakeup_flusher_threads() is removed because it can easily delay the
more targeted pageout works and even exhaust the mempool reservations.
It's also found to be not I/O efficient, because it frequently submits
writeback works with small ->nr_pages. wakeup_flusher_threads() is
called with total_scanned, which could be (LRU_size / 4096). Given a
1GB LRU_size, the write chunk would be 256KB. This is much smaller than
the old 4MB and the now preferred write chunk size (write_bandwidth/2).
For direct reclaim, sc->nr_to_reclaim=32 and total_scanned starts at
(LRU_size / 4096), which *always* exceeds writeback_threshold in boxes
with more than 1GB memory. So the flusher ends up constantly being fed
with small writeout requests.
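
For illustration (my numbers, assuming 4KB pages):

        LRU_size      = 1GB = 262144 pages
        total_scanned ~ 262144 / 4096 = 64 pages
        write chunk   = 64 pages * 4KB = 256KB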

Typically the flusher will be working on the background/periodic works
when there are heavy dirtier tasks. And wb_writeback() will quit the
background/periodic work when pageout or other works are queued. So the
pageout works can typically be picked up and executed quickly by the
flusher: the background/periodic works are the dominant ones and there
are rarely other types of works in the way.

However the other types of works, if they ever come, can still block us
for a long time. We will need a proper way to guarantee fairness.

Jan Kara: limit the search scope; remove works and unpin inodes on umount.

CC: Jan Kara <jack@suse.cz>
CC: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
CC: Greg Thelen <gthelen@google.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  278 +++++++++++++++++++++++++++--
 fs/super.c                       |    1 
 include/linux/backing-dev.h      |    2 
 include/linux/writeback.h        |   16 +
 include/trace/events/writeback.h |   12 -
 mm/vmscan.c                      |   36 ++-
 6 files changed, 316 insertions(+), 29 deletions(-)

--- linux.orig/fs/fs-writeback.c	2012-02-29 08:41:52.057540723 +0800
+++ linux/fs/fs-writeback.c	2012-02-29 21:38:20.215305104 +0800
@@ -36,9 +36,37 @@
 
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
+ *
+ * The wb_writeback_work is created (and hence auto destroyed) either on stack,
+ * or dynamically allocated from the mempool. The implicit rule is: the caller
+ * shall allocate wb_writeback_work on stack iff it wants to wait for completion
+ * of the work (aka. synchronous work).
+ *
+ * The work is then queued into bdi->work_list, where the flusher picks up one
+ * wb_writeback_work at a time, dequeue, execute and finally either free it
+ * (mempool allocated) or wake up the caller (on stack).
+ *
+ * It's possible for vmscan to queue lots of pageout works in short time.
+ * However it does not need too many IOs in flight to saturate a typical disk.
+ * Limiting the queue size helps reduce the queuing delays. So the below rules
+ * are applied:
+ *
+ * - when LOTS_OF_WRITEBACK_WORKS = WB_WORK_MEMPOOL_SIZE / 8 = 128 works are
+ *   queued, vmscan should start throttling itself (to the rate the flusher can
+ *   consume pageout works).
+ *
+ * - when 2 * LOTS_OF_WRITEBACK_WORKS wb_writeback_work are queued, will refuse
+ *   to queue new pageout works
+ *
+ * - the remaining mempool reservations are available for other types of works
  */
 struct wb_writeback_work {
 	long nr_pages;
+	/*
+	 * WB_REASON_LAPTOP_TIMER, WB_REASON_FREE_MORE_MEM and some
+	 * WB_REASON_SYNC callers queue works with ->sb == NULL. They just want
+	 * to knock down the bdi dirty pages and don't care about the exact sb.
+	 */
 	struct super_block *sb;
 	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
@@ -48,6 +76,13 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
+	/*
+	 * When (inode != NULL), it's a pageout work for cleaning the inode
+	 * pages from start to start+nr_pages.
+	 */
+	struct inode *inode;
+	pgoff_t start;
+
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
 };
@@ -57,6 +92,28 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * Avoid page allocation on page reclaim. The mempool reservations are
+	 * typically more than enough for good disk utilization.
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -111,6 +168,22 @@ static void bdi_queue_work(struct backin
 {
 	trace_writeback_queue(bdi, work);
 
+	/*
+	 * The iput() for pageout works may occasionally dive deep into complex
+	 * fs code. This brings new possibilities/sources of deadlock:
+	 *
+	 *   free work => iput => fs code => queue writeback work and wait on it
+	 *
+	 * In the above scheme, the flusher ends up waiting endlessly for itself.
+	 */
+	if (unlikely(current == bdi->wb.task ||
+		     current == default_backing_dev_info.wb.task)) {
+		WARN_ON_ONCE(1); /* recursion; deadlock if ->done is set */
+		if (work->done)
+			complete(work->done);
+		return;
+	}
+
 	spin_lock_bh(&bdi->wb_lock);
 	list_add_tail(&work->list, &bdi->work_list);
 	if (!bdi->wb.task)
@@ -129,7 +202,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_ATOMIC);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -138,6 +211,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -187,6 +261,176 @@ void bdi_start_background_writeback(stru
 }
 
 /*
+ * Check if @work already covers @offset, or try to extend it to cover @offset.
+ * Returns true if the wb_writeback_work now encompasses the requested offset.
+ */
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset,
+				   unsigned long unit)
+{
+	pgoff_t end = work->start + work->nr_pages;
+
+	if (offset >= work->start && offset < end)
+		return true;
+
+	/*
+	 * For sequential workloads with good locality, include up to 8 times
+	 * more data in one chunk. The unit chunk size is calculated so that it
+	 * costs 8-16ms to write so many pages. So 8 times means we can extend
+	 * it up to 128ms. It's a good value because 128ms data transfer time
+	 * makes the typical overheads of 8ms disk seek time look small enough.
+	 */
+	if (work->nr_pages >= 8 * unit)
+		return false;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->start - offset < unit) {
+		work->nr_pages += unit;
+		work->start -= unit;
+		return true;
+	}
+
+	if (offset - end < unit) {
+		work->nr_pages += unit;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+alloc_queue_pageout_work(struct backing_dev_info *bdi,
+			 struct inode *inode,
+			 pgoff_t start,
+			 pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	/*
+	 * Grab the inode until the work is executed. We are calling this from
+	 * page reclaim context and the only thing pinning the address_space
+	 * for the moment is the page lock.
+	 */
+	if (!igrab(inode))
+		return NULL;
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work)
+		return NULL;
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->sb		= inode->i_sb;
+	work->inode		= inode;
+	work->start		= start;
+	work->nr_pages		= len;
+	work->reason		= WB_REASON_PAGEOUT;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret > 0: success, allocated/queued a new pageout work;
+ *	    there are at least @ret writeback works queued now
+ * ret = 0: success, reused/extended a previous pageout work
+ * ret < 0: failed
+ */
+int queue_pageout_work(struct address_space *mapping, struct page *page)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	struct wb_writeback_work *work;
+	unsigned long write_around_pages;
+	pgoff_t offset = page->index;
+	int i = 0;
+	int ret = -1;
+
+	BUG_ON(!inode);
+
+	/*
+	 * piggy back 8-16ms worth of data
+	 */
+	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
+	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		/*
+		 * vmscan will slow down page reclaim when there are more than
+		 * LOTS_OF_WRITEBACK_WORKS queued. Limit search depth to two
+		 * times larger.
+		 */
+		if (i++ > 2 * LOTS_OF_WRITEBACK_WORKS)
+			break;
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset, write_around_pages)) {
+			ret = 0;
+			break;
+		}
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	/*
+	 * if we failed to add the page to an existing wb_writeback_work and
+	 * there are not too many existing ones, allocate and queue a new one
+	 */
+	if (ret && i <= 2 * LOTS_OF_WRITEBACK_WORKS) {
+		offset = round_down(offset, write_around_pages);
+		work = alloc_queue_pageout_work(bdi, inode,
+						offset, write_around_pages);
+		if (work)
+			ret = i;
+	}
+	return ret;
+}
+
+static void wb_free_work(struct wb_writeback_work *work)
+{
+	if (work->inode)
+		iput(work->inode);
+	/*
+	 * Notify the caller of completion if this is a synchronous
+	 * work item, otherwise just free it.
+	 */
+	if (work->done)
+		complete(work->done);
+	else
+		mempool_free(work, wb_work_mempool);
+}
+
+/*
+ * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
+ */
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb)
+{
+	struct wb_writeback_work *work, *tmp;
+	LIST_HEAD(dispose);
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
+		if (sb && work->sb && sb != work->sb)
+			continue;
+		list_move(&work->list, &dispose);
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	list_for_each_entry_safe(work, tmp, &dispose, list)
+		wb_free_work(work);
+}
+
+/*
  * Remove the inode from the writeback list it is on.
  */
 void inode_wb_list_del(struct inode *inode)
@@ -833,6 +1077,24 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+/*
+ * Clean pages for page reclaim. Returns the number of pages written.
+ */
+static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = (loff_t)work->start << PAGE_CACHE_SHIFT,
+		.range_end = (loff_t)(work->start + work->nr_pages - 1)
+						<< PAGE_CACHE_SHIFT,
+	};
+
+	do_writepages(work->inode->i_mapping, &wbc);
+
+	return LONG_MAX - wbc.nr_to_write;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh(wb->bdi)) {
@@ -905,16 +1167,12 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
-
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
+		if (!work->inode)
+			wrote += wb_writeback(wb, work);
 		else
-			kfree(work);
+			wrote += wb_pageout(wb, work);
+
+		wb_free_work(work);
 	}
 
 	/*
--- linux.orig/include/trace/events/writeback.h	2012-02-29 08:41:52.021540723 +0800
+++ linux/include/trace/events/writeback.h	2012-02-29 16:35:48.115056756 +0800
@@ -23,7 +23,7 @@
 
 #define WB_WORK_REASON							\
 		{WB_REASON_BACKGROUND,		"background"},		\
-		{WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages"},	\
+		{WB_REASON_PAGEOUT,		"pageout"},		\
 		{WB_REASON_SYNC,		"sync"},		\
 		{WB_REASON_PERIODIC,		"periodic"},		\
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
@@ -45,6 +45,8 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, range_cyclic)
 		__field(int, for_background)
 		__field(int, reason)
+		__field(unsigned long, ino)
+		__field(unsigned long, start)
 	),
 	TP_fast_assign(
 		struct device *dev = bdi->dev;
@@ -58,9 +60,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
+		__entry->ino = work->inode ? work->inode->i_ino : 0;
+		__entry->start = work->start;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s ino=%lu start=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
@@ -68,7 +72,9 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
 		  __entry->for_background,
-		  __print_symbolic(__entry->reason, WB_WORK_REASON)
+		  __print_symbolic(__entry->reason, WB_WORK_REASON),
+		  __entry->ino,
+		  __entry->start
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
--- linux.orig/include/linux/writeback.h	2012-02-29 08:41:52.037540723 +0800
+++ linux/include/linux/writeback.h	2012-02-29 21:31:32.095295409 +0800
@@ -40,7 +40,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_PAGEOUT,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
@@ -94,6 +94,20 @@ long writeback_inodes_wb(struct bdi_writ
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+int queue_pageout_work(struct address_space *mapping, struct page *page);
+
+/*
+ * Tailored for vmscan which may submit lots of pageout works. The page reclaim
+ * should try to slow down the pageout work submission rate when the queue size
+ * grows to LOTS_OF_WRITEBACK_WORKS. queue_pageout_work() will accordingly limit
+ * its search depth to (2 * LOTS_OF_WRITEBACK_WORKS).
+ *
+ * Note that the limited search and work pool is not a big problem: 1024 IOs
+ * in flight are typically more than enough to saturate the disk. And the
+ * overheads of searching in the work list didn't even show up in perf report.
+ */
+#define WB_WORK_MEMPOOL_SIZE		1024
+#define LOTS_OF_WRITEBACK_WORKS		(WB_WORK_MEMPOOL_SIZE / 8)
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
--- linux.orig/fs/super.c	2012-02-29 08:41:52.045540723 +0800
+++ linux/fs/super.c	2012-02-29 11:19:57.474606367 +0800
@@ -389,6 +389,7 @@ void generic_shutdown_super(struct super
 
 		fsnotify_unmount_inodes(&sb->s_inodes);
 
+		bdi_remove_writeback_works(sb->s_bdi, sb);
 		evict_inodes(sb);
 
 		if (sop->put_super)
--- linux.orig/include/linux/backing-dev.h	2012-02-29 08:41:52.029540722 +0800
+++ linux/include/linux/backing-dev.h	2012-02-29 11:19:57.474606367 +0800
@@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux.orig/mm/vmscan.c	2012-02-29 08:41:52.009540722 +0800
+++ linux/mm/vmscan.c	2012-02-29 21:31:31.463295395 +0800
@@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * Pages may be dirtied anywhere inside the LRU. This
+			 * ensures they undergo a full period of LRU iteration
+			 * before considering pageout. The intention is to
+			 * delay writeout to the flusher thread, unless when
+			 * run into a long segment of dirty pages.
+			 */
+			if (references == PAGEREF_RECLAIM_CLEAN &&
+			    priority == DEF_PRIORITY)
+				goto keep_locked;
+
+			/*
+			 * Try relaying the pageout I/O to the flusher threads
+			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && mapping &&
+			    queue_pageout_work(mapping, page) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			}
 
-			if (references == PAGEREF_RECLAIM_CLEAN)
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow.
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd())
 				goto keep_locked;
+
 			if (!may_enter_fs)
 				goto keep_locked;
 			if (!sc->may_writepage)
@@ -2373,17 +2388,8 @@ static unsigned long do_try_to_free_page
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
 
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 8/9] mm: dont set __GFP_WRITE on ramfs/sysfs writes
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-03-01 10:13     ` Johannes Weiner
  -1 siblings, 0 replies; 116+ messages in thread
From: Johannes Weiner @ 2012-03-01 10:13 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han,
	KAMEZAWA Hiroyuki, Rik van Riel, Johannes Weiner,
	Linux Memory Management List, LKML

On Tue, Feb 28, 2012 at 10:00:30PM +0800, Fengguang Wu wrote:
> Try to avoid page reclaim waits when writing to ramfs/sysfs etc.
> 
> Maybe not a big deal...

This looks like a separate fix that would make sense standalone.  It's
not just the waits, there is not much of a point in skipping zones
during allocation based on the dirty usage which they'll never
contribute to.  Could you maybe pull this up front?

> CC: Johannes Weiner <jweiner@redhat.com>
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 8/9] mm: dont set __GFP_WRITE on ramfs/sysfs writes
  2012-03-01 10:13     ` Johannes Weiner
@ 2012-03-01 10:30       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-01 10:30 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han,
	KAMEZAWA Hiroyuki, Rik van Riel, Johannes Weiner,
	Linux Memory Management List, LKML

On Thu, Mar 01, 2012 at 11:13:54AM +0100, Johannes Weiner wrote:
> On Tue, Feb 28, 2012 at 10:00:30PM +0800, Fengguang Wu wrote:
> > Try to avoid page reclaim waits when writing to ramfs/sysfs etc.
> > 
> > Maybe not a big deal...
> 
> This looks like a separate fix that would make sense standalone.  It's
> not just the waits, there is not much of a point in skipping zones
> during allocation based on the dirty usage which they'll never
> contribute to.  Could you maybe pull this up front?

OK, thanks!

> > CC: Johannes Weiner <jweiner@redhat.com>
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> 
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-02-29  0:04     ` Andrew Morton
@ 2012-03-01 11:04       ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-01 11:04 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, Greg Thelen, Jan Kara, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue 28-02-12 16:04:03, Andrew Morton wrote:
...
> > --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> > +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> > @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
> >  			nr_dirty++;
> >  
> >  			/*
> > -			 * Only kswapd can writeback filesystem pages to
> > -			 * avoid risk of stack overflow but do not writeback
> > -			 * unless under significant pressure.
> > +			 * Pages may be dirtied anywhere inside the LRU. This
> > +			 * ensures they undergo a full period of LRU iteration
> > +			 * before considering pageout. The intention is to
> > +			 * delay writeout to the flusher thread, unless when
> > +			 * run into a long segment of dirty pages.
> > +			 */
> > +			if (references == PAGEREF_RECLAIM_CLEAN &&
> > +			    priority == DEF_PRIORITY)
> > +				goto keep_locked;
> > +
> > +			/*
> > +			 * Try relaying the pageout I/O to the flusher threads
> > +			 * for better I/O efficiency and avoid stack overflow.
> >  			 */
> > -			if (page_is_file_cache(page) &&
> > -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > +			if (page_is_file_cache(page) && mapping &&
> > +			    queue_pageout_work(mapping, page) >= 0) {
> >  				/*
> >  				 * Immediately reclaim when written back.
> >  				 * Similar in principal to deactivate_page()
> > @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
> >  				goto keep_locked;
> >  			}
> >  
> > -			if (references == PAGEREF_RECLAIM_CLEAN)
> > +			/*
> > +			 * Only kswapd can writeback filesystem pages to
> > +			 * avoid risk of stack overflow.
> > +			 */
> > +			if (page_is_file_cache(page) && !current_is_kswapd())
> 
> And here we run into big problems.
> 
> When a page-allocator enters direct reclaim, that process is trying to
> allocate a page from a particular zone (or set of zones).  For example,
> he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
> off and write back three gigabytes of ZONE_HIGHMEM is pointless,
> inefficient and doesn't fix the caller's problem at all.
> 
> This has always been the biggest problem with the
> avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
> as I've read) doesn't address the problem at all and appears to be
> blissfully unaware of its existence.
> 
> 
> I've attempted versions of this I think twice, and thrown the patches
> away in disgust.  One approach I tried was, within direct reclaim, to
> grab the page I wanted (ie: one which is in one of the caller's desired
> zones) and to pass that page over to the kernel threads.  The kernel
> threads would ensure that this particular page was included in the
> writearound preparation.  So that we at least make *some* progress
> toward what the caller is asking us to do.
> 
> iirc, the way I "grabbed" the page was to actually lock it, with
> [try_]_lock_page().  And unlock it again way over within the writeback
> thread.  I forget why I did it this way, rather than get_page() or
> whatever.  Locking the page is a good way of preventing anyone else
> from futzing with it.  It also pins the inode, which perhaps meant that
> with careful management, I could avoid the igrab()/iput() horrors
> discussed above.
  I think using get_page() might be a good way to go. Naive implementation:
If we need to write a page from kswapd, we do get_page(), attach page to
wb_writeback_work and push it to flusher thread to deal with it.
Flusher thread sees the work, takes a page lock, verifies the page is still
attached to some inode & dirty (it could have been truncated / cleaned by
someone else) and if yes, it submits page for IO (possibly with some
writearound). This scheme won't have problems with iput() and won't have
problems with umount. Also we guarantee some progress - either the flusher
thread does it, or someone else must have done the work before the flusher
thread got to it.
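
To make that concrete, here is a rough sketch of the flusher-side handling
(the function name and the work->page field are invented for illustration;
this is not code from the posted patches):

	static void wb_do_pageout_work(struct wb_writeback_work *work)
	{
		struct page *page = work->page;	/* pinned with get_page() by kswapd */
		struct address_space *mapping;

		lock_page(page);
		mapping = page_mapping(page);
		if (mapping && PageDirty(page)) {
			/*
			 * Still attached to an inode and still dirty: this is
			 * where the flusher would submit the page (plus some
			 * writearound window around page->index) for IO.
			 */
		}
		/*
		 * If !mapping the page was truncated, if clean someone else
		 * already wrote it - either way there is nothing left to do.
		 */
		unlock_page(page);
		put_page(page);			/* drop kswapd's reference */
	}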

For better efficiency, we could further refine the scheme - record the mapping
pointer (as an opaque cookie), the starting index, and the writeout length in
the wb_writeback_work together with the page pointer. That way, if we need to
write out another page, we can check whether it is already included in an
existing work item, or whether extending an existing one would be better. The
downside of this scheme is that the progress guarantee isn't as strong
anymore - we guarantee that the page referenced from the work item is cleaned,
but not necessarily all the other pages that were bundled into the same work
item (because once our referenced page is cleaned, we cannot get to the inode
anymore). To mitigate this, we could:
a) keep references to N pages in the work item and pack at most that many
pages into a single work item - this restores the progress guarantee.
b) keep references to N pages in the work item but don't limit how many page
writeout requests are packed - this isn't as strong as a) but lowers the
probability of making too little progress. We could further lower the
probability of not having a usable page reference in the work item by
including a reference to the page whose writeout was most recently bundled
into it. (A rough sketch of such a work item follows.)
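
A rough sketch of such a work item (all names and the value of N are made up
here, only to illustrate the shape of the data structure being discussed):

	#define PAGEOUT_WORK_PAGES	8	/* the "N" above - arbitrary */

	struct pageout_work {			/* hypothetical */
		/* opaque cookie + range; only dereferenced under a page lock */
		struct address_space	*mapping;
		pgoff_t			offset;		/* starting index */
		long			nr_pages;	/* writeout length */
		/*
		 * Pages pinned with get_page(); they anchor the work item so
		 * that at least these pages can be found and written even if
		 * the inode is otherwise unreachable.
		 */
		struct page		*pages[PAGEOUT_WORK_PAGES];
		int			nr_pinned;
	};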

What do you think?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 11:04       ` Jan Kara
@ 2012-03-01 11:41         ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-01 11:41 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, Mar 01, 2012 at 12:04:04PM +0100, Jan Kara wrote:
> On Tue 28-02-12 16:04:03, Andrew Morton wrote:
> ...
> > > --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> > > +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> > > @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
> > >  			nr_dirty++;
> > >  
> > >  			/*
> > > -			 * Only kswapd can writeback filesystem pages to
> > > -			 * avoid risk of stack overflow but do not writeback
> > > -			 * unless under significant pressure.
> > > +			 * Pages may be dirtied anywhere inside the LRU. This
> > > +			 * ensures they undergo a full period of LRU iteration
> > > +			 * before considering pageout. The intention is to
> > > +			 * delay writeout to the flusher thread, unless when
> > > +			 * run into a long segment of dirty pages.
> > > +			 */
> > > +			if (references == PAGEREF_RECLAIM_CLEAN &&
> > > +			    priority == DEF_PRIORITY)
> > > +				goto keep_locked;
> > > +
> > > +			/*
> > > +			 * Try relaying the pageout I/O to the flusher threads
> > > +			 * for better I/O efficiency and avoid stack overflow.
> > >  			 */
> > > -			if (page_is_file_cache(page) &&
> > > -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > > +			if (page_is_file_cache(page) && mapping &&
> > > +			    queue_pageout_work(mapping, page) >= 0) {
> > >  				/*
> > >  				 * Immediately reclaim when written back.
> > >  				 * Similar in principal to deactivate_page()
> > > @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
> > >  				goto keep_locked;
> > >  			}
> > >  
> > > -			if (references == PAGEREF_RECLAIM_CLEAN)
> > > +			/*
> > > +			 * Only kswapd can writeback filesystem pages to
> > > +			 * avoid risk of stack overflow.
> > > +			 */
> > > +			if (page_is_file_cache(page) && !current_is_kswapd())
> > 
> > And here we run into big problems.
> > 
> > When a page-allocator enters direct reclaim, that process is trying to
> > allocate a page from a particular zone (or set of zones).  For example,
> > he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
> > off and write back three gigabytes of ZONE_HIGHMEM is pointless,
> > inefficient and doesn't fix the caller's problem at all.
> > 
> > This has always been the biggest problem with the
> > avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
> > as I've read) doesn't address the problem at all and appears to be
> > blissfully unaware of its existence.
> > 
> > 
> > I've attempted versions of this I think twice, and thrown the patches
> > away in disgust.  One approach I tried was, within direct reclaim, to
> > grab the page I wanted (ie: one which is in one of the caller's desired
> > zones) and to pass that page over to the kernel threads.  The kernel
> > threads would ensure that this particular page was included in the
> > writearound preparation.  So that we at least make *some* progress
> > toward what the caller is asking us to do.
> > 
> > iirc, the way I "grabbed" the page was to actually lock it, with
> > [try_]_lock_page().  And unlock it again way over within the writeback
> > thread.  I forget why I did it this way, rather than get_page() or
> > whatever.  Locking the page is a good way of preventing anyone else
> > from futzing with it.  It also pins the inode, which perhaps meant that
> > with careful management, I could avoid the igrab()/iput() horrors
> > discussed above.
>   I think using get_page() might be a good way to go. Naive implementation:
> If we need to write a page from kswapd, we do get_page(), attach page to
> wb_writeback_work and push it to flusher thread to deal with it.
> Flusher thread sees the work, takes a page lock, verifies the page is still
> attached to some inode & dirty (it could have been truncated / cleaned by
> someone else) and if yes, it submits page for IO (possibly with some
> writearound). This scheme won't have problems with iput() and won't have
> problems with umount. Also we guarantee some progress - either flusher
> thread does it, or some else must have done the work before flusher thread
> got to it.

I like this idea.

get_page() looks like the perfect solution to verify that the struct inode
pointer (w/o igrab) is still live and valid.

[...upon rethinking...] Oh, but we still need to lock some page to pin
the inode during the writeout. Then there is the dilemma: if the page
is locked, we effectively keep it from being written out...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-02-29  0:04     ` Andrew Morton
@ 2012-03-01 12:36       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-01 12:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

> Please have a think about all of this and see if you can demonstrate
> how the iput() here is guaranteed safe.

There are already several __iget()/iput() calls inside fs-writeback.c.
The existing iput() calls already demonstrate its safety?

Basically the flusher works in this way

- the dirty inode list i_wb_list does not reference count the inode at all

- the flusher thread does something analogous to igrab() and sets I_SYNC
  before going off to write out the inode

- evict() will wait for the completion of I_SYNC (see the sketch below)
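
A simplified illustration of that handshake (this is only a sketch in the
spirit of writeback_single_inode()/end_writeback(), not the literal
fs-writeback.c code):

	static void flusher_writeback_one_inode(struct inode *inode,
						struct writeback_control *wbc)
	{
		spin_lock(&inode->i_lock);
		inode->i_state |= I_SYNC;	/* the "analogous to igrab()" part */
		inode->i_state &= ~I_DIRTY_PAGES;
		spin_unlock(&inode->i_lock);

		do_writepages(inode->i_mapping, wbc);	/* no i_count held */

		spin_lock(&inode->i_lock);
		inode->i_state &= ~I_SYNC;
		spin_unlock(&inode->i_lock);
		smp_mb();
		wake_up_bit(&inode->i_state, __I_SYNC);	/* wakes inode_sync_wait() */
	}

The eviction side (evict() -> end_writeback() -> inode_sync_wait()) sleeps
until I_SYNC clears, so the inode cannot be torn down while the flusher is
still writing it back.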

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH v2 5/9] writeback: introduce the pageout work
  2012-02-29 13:51     ` Fengguang Wu
@ 2012-03-01 13:35       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-01 13:35 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

> However the other type of works, if ever they come, can still block us
> for long time. Will need a proper way to guarantee fairness.

The simplistic way around this may be to refuse to queue new pageout
works when other types of work are found in the queue. Then vmscan will
fall back to pageout(). It's a rare condition anyway and hardly deserves
a comprehensive fairness scheme.
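
One possible shape for that check (purely a sketch - the helper and the
"is_pageout" flag are invented here, they are not in the posted patches):

	static bool bdi_has_other_works(struct backing_dev_info *bdi)
	{
		struct wb_writeback_work *work;
		bool ret = false;

		spin_lock_bh(&bdi->wb_lock);
		list_for_each_entry(work, &bdi->work_list, list) {
			if (!work->is_pageout) {	/* hypothetical flag */
				ret = true;
				break;
			}
		}
		spin_unlock_bh(&bdi->wb_lock);
		return ret;
	}

queue_pageout_work() would then bail out early whenever this returns true,
so shrink_page_list() falls back to a direct pageout() for that page.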

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 12:36       ` Fengguang Wu
@ 2012-03-01 16:38         ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-01 16:38 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu 01-03-12 20:36:40, Wu Fengguang wrote:
> > Please have a think about all of this and see if you can demonstrate
> > how the iput() here is guaranteed safe.
> 
> There are already several __iget()/iput() calls inside fs-writeback.c.
> The existing iput() calls already demonstrate its safety?
> 
> Basically the flusher works in this way
> 
> - the dirty inode list i_wb_list does not reference count the inode at all
> 
> - the flusher thread does something analog to igrab() and set I_SYNC
>   before going off to writeout the inode
> 
> - evict() will wait for completion of I_SYNC
  Yes, you are right that the writeback code already holds inode
references and so it can happen that the flusher thread drops the last inode
reference. But currently that could create problems only if someone waits
for the flusher thread to make progress while effectively blocking e.g.
truncate from happening. Currently the flusher thread handles sync(2) and
background writeback, and filesystems take care not to hold any locks
blocking IO / truncate while possibly waiting for these.

But with your addition the situation changes significantly - now anyone doing
an allocation can block on the flusher, and allocations happen from all sorts
of places, including ones where we hold locks blocking other fs activity. The
good news is that we use GFP_NOFS in such places. So if a GFP_NOFS allocation
cannot possibly depend on the completion of some writeback work, then I'd
still be comfortable with dropping inode references from the writeback code.
But Andrew is right that this at least needs some arguing...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 11:41         ` Fengguang Wu
@ 2012-03-01 16:50           ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-01 16:50 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu 01-03-12 19:41:51, Wu Fengguang wrote:
> On Thu, Mar 01, 2012 at 12:04:04PM +0100, Jan Kara wrote:
> > On Tue 28-02-12 16:04:03, Andrew Morton wrote:
> > ...
> > > > --- linux.orig/mm/vmscan.c	2012-02-28 19:07:06.065064464 +0800
> > > > +++ linux/mm/vmscan.c	2012-02-28 20:26:15.559731455 +0800
> > > > @@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
> > > >  			nr_dirty++;
> > > >  
> > > >  			/*
> > > > -			 * Only kswapd can writeback filesystem pages to
> > > > -			 * avoid risk of stack overflow but do not writeback
> > > > -			 * unless under significant pressure.
> > > > +			 * Pages may be dirtied anywhere inside the LRU. This
> > > > +			 * ensures they undergo a full period of LRU iteration
> > > > +			 * before considering pageout. The intention is to
> > > > +			 * delay writeout to the flusher thread, unless when
> > > > +			 * run into a long segment of dirty pages.
> > > > +			 */
> > > > +			if (references == PAGEREF_RECLAIM_CLEAN &&
> > > > +			    priority == DEF_PRIORITY)
> > > > +				goto keep_locked;
> > > > +
> > > > +			/*
> > > > +			 * Try relaying the pageout I/O to the flusher threads
> > > > +			 * for better I/O efficiency and avoid stack overflow.
> > > >  			 */
> > > > -			if (page_is_file_cache(page) &&
> > > > -					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
> > > > +			if (page_is_file_cache(page) && mapping &&
> > > > +			    queue_pageout_work(mapping, page) >= 0) {
> > > >  				/*
> > > >  				 * Immediately reclaim when written back.
> > > >  				 * Similar in principal to deactivate_page()
> > > > @@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
> > > >  				goto keep_locked;
> > > >  			}
> > > >  
> > > > -			if (references == PAGEREF_RECLAIM_CLEAN)
> > > > +			/*
> > > > +			 * Only kswapd can writeback filesystem pages to
> > > > +			 * avoid risk of stack overflow.
> > > > +			 */
> > > > +			if (page_is_file_cache(page) && !current_is_kswapd())
> > > 
> > > And here we run into big problems.
> > > 
> > > When a page-allocator enters direct reclaim, that process is trying to
> > > allocate a page from a particular zone (or set of zones).  For example,
> > > he wants a ZONE_NORMAL or ZONE_DMA page.  Asking flusher threads to go
> > > off and write back three gigabytes of ZONE_HIGHMEM is pointless,
> > > inefficient and doesn't fix the caller's problem at all.
> > > 
> > > This has always been the biggest problem with the
> > > avoid-writeback-from-direct-reclaim patches.  And your patchset (as far
> > > as I've read) doesn't address the problem at all and appears to be
> > > blissfully unaware of its existence.
> > > 
> > > 
> > > I've attempted versions of this I think twice, and thrown the patches
> > > away in disgust.  One approach I tried was, within direct reclaim, to
> > > grab the page I wanted (ie: one which is in one of the caller's desired
> > > zones) and to pass that page over to the kernel threads.  The kernel
> > > threads would ensure that this particular page was included in the
> > > writearound preparation.  So that we at least make *some* progress
> > > toward what the caller is asking us to do.
> > > 
> > > iirc, the way I "grabbed" the page was to actually lock it, with
> > > [try_]_lock_page().  And unlock it again way over within the writeback
> > > thread.  I forget why I did it this way, rather than get_page() or
> > > whatever.  Locking the page is a good way of preventing anyone else
> > > from futzing with it.  It also pins the inode, which perhaps meant that
> > > with careful management, I could avoid the igrab()/iput() horrors
> > > discussed above.
> >   I think using get_page() might be a good way to go. Naive implementation:
> > If we need to write a page from kswapd, we do get_page(), attach page to
> > wb_writeback_work and push it to flusher thread to deal with it.
> > Flusher thread sees the work, takes a page lock, verifies the page is still
> > attached to some inode & dirty (it could have been truncated / cleaned by
> > someone else) and if yes, it submits page for IO (possibly with some
> > writearound). This scheme won't have problems with iput() and won't have
> > problems with umount. Also we guarantee some progress - either flusher
> > thread does it, or some else must have done the work before flusher thread
> > got to it.
> 
> I like this idea.
> 
> get_page() looks the perfect solution to verify if the struct inode
> pointer (w/o igrab) is still live and valid.
> 
> [...upon rethinking...] Oh but still we need to lock some page to pin
> the inode during the writeout. Then there is the dilemma: if the page
> is locked, we effectively keep it from being written out...
  Well, we could just lock the page to grab an inode reference like the
rest of the writeback code does, and then unlock the page. There would still
be the advantage that we won't pin the inode for the time the work is just
waiting to be processed.

Another idea I've got is that inodes could have a special usecount counter
for writeback. It would have the same lifetime rules as the I_SYNC flag and
we'd wait in end_writeback() for the counter to drop to zero. That way the
writeback code could safely get an inode reference for the duration of the
writeback without incurring the risk of having to completely clean up the
inode.
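
As a sketch of that idea (the field, waitqueue and helper names are invented
here; nothing like this exists in the current code):

	/* new members in struct inode:
	 *	atomic_t		i_wb_ref;
	 *	wait_queue_head_t	i_wb_wait;
	 */

	static inline void inode_get_wb_ref(struct inode *inode)
	{
		atomic_inc(&inode->i_wb_ref);
	}

	static inline void inode_put_wb_ref(struct inode *inode)
	{
		if (atomic_dec_and_test(&inode->i_wb_ref))
			wake_up_all(&inode->i_wb_wait);
	}

end_writeback() would then, after its existing inode_sync_wait(), also do

	wait_event(inode->i_wb_wait, atomic_read(&inode->i_wb_ref) == 0);

so the inode cannot be fully torn down while a pageout work still holds a
writeback reference on it.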

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 11:04       ` Jan Kara
@ 2012-03-01 19:42         ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-03-01 19:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, 1 Mar 2012 12:04:04 +0100
Jan Kara <jack@suse.cz> wrote:

> > iirc, the way I "grabbed" the page was to actually lock it, with
> > [try_]_lock_page().  And unlock it again way over within the writeback
> > thread.  I forget why I did it this way, rather than get_page() or
> > whatever.  Locking the page is a good way of preventing anyone else
> > from futzing with it.  It also pins the inode, which perhaps meant that
> > with careful management, I could avoid the igrab()/iput() horrors
> > discussed above.
>
>   I think using get_page() might be a good way to go.

get_page() doesn't pin the inode - truncate() will still detach it
from the address_space().

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 11:41         ` Fengguang Wu
@ 2012-03-01 19:46           ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-03-01 19:46 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, 1 Mar 2012 19:41:51 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> >   I think using get_page() might be a good way to go. Naive implementation:
> > If we need to write a page from kswapd, we do get_page(), attach page to
> > wb_writeback_work and push it to flusher thread to deal with it.
> > Flusher thread sees the work, takes a page lock, verifies the page is still
> > attached to some inode & dirty (it could have been truncated / cleaned by
> > someone else) and if yes, it submits page for IO (possibly with some
> > writearound). This scheme won't have problems with iput() and won't have
> > problems with umount. Also we guarantee some progress - either flusher
> > thread does it, or some else must have done the work before flusher thread
> > got to it.
> 
> I like this idea.
> 
> get_page() looks the perfect solution to verify if the struct inode
> pointer (w/o igrab) is still live and valid.
> 
> [...upon rethinking...] Oh but still we need to lock some page to pin
> the inode during the writeout. Then there is the dilemma: if the page
> is locked, we effectively keep it from being written out...

No, all you need to do is to structure the code so that after the page
gets unlocked, the kernel thread does not touch the address_space.  So
the processing within the kthread is along the lines of

writearound(locked_page)
{
	write some pages preceding locked_page;	/* touches address_space */
	write locked_page;
	write pages following locked_page;	/* touches address_space */
	unlock_page(locked_page);
}


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 19:42         ` Andrew Morton
@ 2012-03-01 21:15           ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-01 21:15 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Fengguang Wu, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu 01-03-12 11:42:01, Andrew Morton wrote:
> On Thu, 1 Mar 2012 12:04:04 +0100
> Jan Kara <jack@suse.cz> wrote:
> 
> > > iirc, the way I "grabbed" the page was to actually lock it, with
> > > [try_]_lock_page().  And unlock it again way over within the writeback
> > > thread.  I forget why I did it this way, rather than get_page() or
> > > whatever.  Locking the page is a good way of preventing anyone else
> > > from futzing with it.  It also pins the inode, which perhaps meant that
> > > with careful management, I could avoid the igrab()/iput() horrors
> > > discussed above.
> >
> >   I think using get_page() might be a good way to go.
> 
> get_page() doesn't pin the inode - truncate() will still detach it
> from the address_space().
  Yes, I know. And exactly because of that I'd like to use it. Flusher
thread would lock the page from the work item, check whether it is still
attached to the inode and if yes, it will proceed. Otherwise it will just
discard the work item because we know the page has already been written out
by someone else or truncated.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 21:15           ` Jan Kara
@ 2012-03-01 21:22             ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-03-01 21:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, 1 Mar 2012 22:15:51 +0100
Jan Kara <jack@suse.cz> wrote:

> On Thu 01-03-12 11:42:01, Andrew Morton wrote:
> > On Thu, 1 Mar 2012 12:04:04 +0100
> > Jan Kara <jack@suse.cz> wrote:
> > 
> > > > iirc, the way I "grabbed" the page was to actually lock it, with
> > > > [try_]_lock_page().  And unlock it again way over within the writeback
> > > > thread.  I forget why I did it this way, rather than get_page() or
> > > > whatever.  Locking the page is a good way of preventing anyone else
> > > > from futzing with it.  It also pins the inode, which perhaps meant that
> > > > with careful management, I could avoid the igrab()/iput() horrors
> > > > discussed above.
> > >
> > >   I think using get_page() might be a good way to go.
> > 
> > get_page() doesn't pin the inode - truncate() will still detach it
> > from the address_space().
>   Yes, I know. And exactly because of that I'd like to use it. Flusher
> thread would lock the page from the work item, check whether it is still
> attached to the inode and if yes, it will proceed. Otherwise it will just
> discard the work item because we know the page has already been written out
> by someone else or truncated.

That would work OK.  The vmscanning process won't know that its
writeback effort failed, but it's hard to see how that could cause a
problem.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 16:38         ` Jan Kara
@ 2012-03-02  4:48           ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-02  4:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, Mar 01, 2012 at 05:38:37PM +0100, Jan Kara wrote:
> On Thu 01-03-12 20:36:40, Wu Fengguang wrote:
> > > Please have a think about all of this and see if you can demonstrate
> > > how the iput() here is guaranteed safe.
> > 
> > There are already several __iget()/iput() calls inside fs-writeback.c.
> > The existing iput() calls already demonstrate its safety?
> > 
> > Basically the flusher works in this way
> > 
> > - the dirty inode list i_wb_list does not reference count the inode at all
> > 
> > - the flusher thread does something analog to igrab() and set I_SYNC
> >   before going off to writeout the inode
> > 
> > - evict() will wait for completion of I_SYNC
>   Yes, you are right that currently writeback code already holds inode
> references and so it can happen that flusher thread drops the last inode
> reference. But currently that could create problems only if someone waits
> for flusher thread to make progress while effectively blocking e.g.
> truncate from happening. Currently flusher thread handles sync(2) and
> background writeback and filesystems take care to not hold any locks
> blocking IO / truncate while possibly waiting for these.
> 
> But with your addition situation changes significantly - now anyone doing
> allocation can block and do allocation from all sorts of places including
> ones where we hold locks blocking other fs activity. The good news is that
> we use GFP_NOFS in such places. So if GFP_NOFS allocation cannot possibly
> depend on a completion of some writeback work, then I'd still be
> comfortable with dropping inode references from writeback code. But Andrew
> is right this at least needs some arguing...

You seem to miss the point that we don't wait or do page allocations
inside queue_pageout_work(). The final iput() will not block the
random tasks, because the latter don't wait for completion of the work.

        random task                     flusher thread

        page allocation
          page reclaim
            queue_pageout_work()
              igrab()

                  ......  after a while  ......

                                        execute pageout work                
                                        iput()
                                        <work completed>

There will be some reclaim_wait()s if the pageout works are not
executed quickly, in which case vmscan will be impacted and slowed
down. However, it's not waiting for any specific work to complete, so
there is no chance of forming a loop of dependencies leading to deadlocks.

The iput() does have the theoretical possibility of deadlocking the
flusher thread itself (but not the other random tasks). Since the
flusher thread has always been doing iput() without running into such
bugs, we can reasonably expect the new iput() to be just as safe in
practice.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [PATCH v3 5/9] writeback: introduce the pageout work
  2012-03-01 13:35       ` Fengguang Wu
@ 2012-03-02  6:22         ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-02  6:22 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, Mar 01, 2012 at 09:35:20PM +0800, Fengguang Wu wrote:
> > However the other type of works, if ever they come, can still block us
> > for long time. Will need a proper way to guarantee fairness.
> 
> The simplistic way around this may be to refuse to queue new pageout
> works when found other type of works in the queue. Then vmscan will
> fall back to pageout(). It's rare condition anyway and hardly deserves
> a comprehensive fairness scheme.

This implements that simple idea.

---
Subject: writeback: introduce the pageout work
Date: Thu Jul 29 14:41:19 CST 2010

[v3: improve queue_pageout_work() and fail it when other works are queued]

This relays file pageout IOs to the flusher threads.

This is much more important now that page reclaim generally does not
write out filesystem-backed pages by itself.

The ultimate target is to gracefully handle the LRU lists pressured by
dirty/writeback pages. In particular, problems (1-2) are addressed here.

1) I/O efficiency

The flusher will piggyback the nearby ~10ms worth of dirty pages for I/O.

This takes advantage of the time/spatial locality in most workloads: the
nearby pages of one file are typically populated into the LRU at the same
time, and hence will likely be close to each other in the LRU list. Writing
them out in one shot helps clean more pages effectively for page reclaim.

For the common dd style sequential writes that have excellent locality,
up to ~128ms worth of data will be written around by the pageout work,
which brings I/O performance very close to that of background writeback.
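
To make these figures concrete, here is a small stand-alone sketch of the
chunk sizing arithmetic (the 60MB/s disk bandwidth, 4KB page size and
MIN_WRITEBACK_PAGES = 1024 pages below are assumptions chosen for the
example, mirroring the formula used by queue_pageout_work() further down):

/* chunk_size_sketch.c: unit and maximum write-around chunk sizes */
#include <stdio.h>

static unsigned long rounddown_pow_of_two(unsigned long n)
{
	unsigned long p = 1;

	while (p * 2 <= n)
		p *= 2;
	return p;
}

int main(void)
{
	unsigned long avg_write_bandwidth = 60 * 1024 / 4;	/* pages/s, assumed 60MB/s disk */
	unsigned long min_writeback_pages = 1024;		/* assumed 4MB floor */
	unsigned long unit, max_chunk;

	/* ~1/64 second worth of data, as in queue_pageout_work() */
	unit = rounddown_pow_of_two(avg_write_bandwidth + min_writeback_pages) >> 6;
	/* extend_writeback_range() extends a work up to 8 units */
	max_chunk = 8 * unit;

	printf("unit chunk: %lu pages = %lu KB (~%lu ms of I/O)\n",
	       unit, unit * 4, unit * 1000 / avg_write_bandwidth);
	printf("max chunk:  %lu pages = %lu KB (~%lu ms of I/O)\n",
	       max_chunk, max_chunk * 4, max_chunk * 1000 / avg_write_bandwidth);
	return 0;
}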

2) writeback work coordinations

To avoid memory allocations at page reclaim, a mempool for struct
wb_writeback_work is created.

wakeup_flusher_threads() is removed because it can easily delay the
more targeted pageout works and even exhaust the mempool reservations.
It's also found to be not I/O efficient, since it frequently submits
writeback works with small ->nr_pages. wakeup_flusher_threads() is
called with total_scanned, which could be (LRU_size / 4096): given a
1GB LRU_size (262144 pages), that is 64 pages, i.e. a 256KB write
chunk.  This is much smaller than the old 4MB and the now preferred
write chunk size (write_bandwidth/2).  For direct reclaim,
sc->nr_to_reclaim=32 and total_scanned starts at (LRU_size / 4096),
which *always* exceeds writeback_threshold in boxes with more than 1GB
memory. So the flusher ends up constantly being fed small writeout
requests.
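
As a quick cross-check of the chunk sizes mentioned above, a tiny
stand-alone sketch (the 1GB LRU size and 100MB/s write bandwidth are
assumed example values, not measurements; the dirty-limit cap on the
preferred chunk size is ignored here):

/* old_chunk_sketch.c: total_scanned based chunks vs. the preferred size */
#include <stdio.h>

int main(void)
{
	unsigned long lru_pages = 1UL << 18;		/* 1GB in 4KB pages */
	unsigned long total_scanned = lru_pages / 4096;	/* 64 pages */
	unsigned long write_bw = 100 * 1024 / 4;	/* 100MB/s in pages/s */

	printf("old wakeup_flusher_threads() chunk: %lu KB\n", total_scanned * 4);
	printf("preferred chunk (write_bandwidth/2): %lu KB\n", write_bw / 2 * 4);
	return 0;
}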

Typically the flusher will be working on the background/periodic works
when there are heavy dirtier tasks, and wb_writeback() will quit the
background/periodic work when pageout or other works are queued. So
the pageout works can typically be picked up and executed quickly by
the flusher: the background/periodic works are the dominant ones and
there are rarely other types of works in the way.

However the other types of works, if they ever come, can still block us
for a long time. So we simply refuse to queue new pageout works when
other types of works are found in the queue; vmscan may then fall back
to pageout(). It's a rare condition anyway.

Jan Kara: limit the search scope; remove works and unpin inodes on umount.

CC: Jan Kara <jack@suse.cz>
CC: Mel Gorman <mgorman@suse.de>
Acked-by: Rik van Riel <riel@redhat.com>
CC: Greg Thelen <gthelen@google.com>
CC: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 fs/fs-writeback.c                |  289 +++++++++++++++++++++++++++--
 fs/super.c                       |    1 
 include/linux/backing-dev.h      |    2 
 include/linux/writeback.h        |   16 +
 include/trace/events/writeback.h |   12 -
 mm/vmscan.c                      |   36 ++-
 6 files changed, 327 insertions(+), 29 deletions(-)

--- linux.orig/fs/fs-writeback.c	2012-03-02 13:32:54.921717396 +0800
+++ linux/fs/fs-writeback.c	2012-03-02 13:37:07.481723398 +0800
@@ -36,9 +36,37 @@
 
 /*
  * Passed into wb_writeback(), essentially a subset of writeback_control
+ *
+ * The wb_writeback_work is created (and hence auto destroyed) either on stack,
+ * or dynamically allocated from the mempool. The implicit rule is: the caller
+ * shall allocate wb_writeback_work on stack iff it wants to wait for completion
+ * of the work (aka. synchronous work).
+ *
+ * The work is then queued into bdi->work_list, where the flusher picks up one
+ * wb_writeback_work at a time, dequeues and executes it, and finally either
+ * frees it (mempool allocated) or wakes up the caller (on stack).
+ *
+ * It's possible for vmscan to queue lots of pageout works in short time.
+ * However it does not need too many IOs in flight to saturate a typical disk.
+ * Limiting the queue size helps reduce the queuing delays. So the below rules
+ * are applied:
+ *
+ * - when LOTS_OF_WRITEBACK_WORKS = WB_WORK_MEMPOOL_SIZE / 8 = 128 works are
+ *   queued, vmscan should start throttling itself (to the rate the flusher can
+ *   consume pageout works).
+ *
+ * - when 2 * LOTS_OF_WRITEBACK_WORKS works are queued, queue_pageout_work()
+ *   refuses to queue new pageout works
+ *
+ * - the remaining mempool reservations are available for other types of works
  */
 struct wb_writeback_work {
 	long nr_pages;
+	/*
+	 * WB_REASON_LAPTOP_TIMER, WB_REASON_FREE_MORE_MEM and some
+	 * WB_REASON_SYNC callers queue works with ->sb == NULL. They just want
+	 * to knock down the bdi dirty pages and don't care about the exact sb.
+	 */
 	struct super_block *sb;
 	unsigned long *older_than_this;
 	enum writeback_sync_modes sync_mode;
@@ -48,6 +76,13 @@ struct wb_writeback_work {
 	unsigned int for_background:1;
 	enum wb_reason reason;		/* why was writeback initiated? */
 
+	/*
+	 * When (inode != NULL), it's a pageout work for cleaning the inode
+	 * pages from start to start+nr_pages.
+	 */
+	struct inode *inode;
+	pgoff_t start;
+
 	struct list_head list;		/* pending work list */
 	struct completion *done;	/* set if the caller waits */
 };
@@ -57,6 +92,28 @@ struct wb_writeback_work {
  */
 int nr_pdflush_threads;
 
+static mempool_t *wb_work_mempool;
+
+static void *wb_work_alloc(gfp_t gfp_mask, void *pool_data)
+{
+	/*
+	 * Avoid page allocation on page reclaim. The mempool reservations are
+	 * typically more than enough for good disk utilization.
+	 */
+	if (current->flags & PF_MEMALLOC)
+		return NULL;
+
+	return kmalloc(sizeof(struct wb_writeback_work), gfp_mask);
+}
+
+static __init int wb_work_init(void)
+{
+	wb_work_mempool = mempool_create(WB_WORK_MEMPOOL_SIZE,
+					 wb_work_alloc, mempool_kfree, NULL);
+	return wb_work_mempool ? 0 : -ENOMEM;
+}
+fs_initcall(wb_work_init);
+
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -111,6 +168,22 @@ static void bdi_queue_work(struct backin
 {
 	trace_writeback_queue(bdi, work);
 
+	/*
+	 * The iput() for pageout works may occasionally dive deep into complex
+	 * fs code. This brings new possibilities/sources of deadlock:
+	 *
+	 *   free work => iput => fs code => queue writeback work and wait on it
+	 *
+	 * In the above scheme, the flusher ends up waiting endlessly for itself.
+	 */
+	if (unlikely(current == bdi->wb.task ||
+		     current == default_backing_dev_info.wb.task)) {
+		WARN_ON_ONCE(1); /* recursion; deadlock if ->done is set */
+		if (work->done)
+			complete(work->done);
+		return;
+	}
+
 	spin_lock_bh(&bdi->wb_lock);
 	list_add_tail(&work->list, &bdi->work_list);
 	if (!bdi->wb.task)
@@ -129,7 +202,7 @@ __bdi_start_writeback(struct backing_dev
 	 * This is WB_SYNC_NONE writeback, so if allocation fails just
 	 * wakeup the thread for old dirty data writeback
 	 */
-	work = kzalloc(sizeof(*work), GFP_ATOMIC);
+	work = mempool_alloc(wb_work_mempool, GFP_ATOMIC);
 	if (!work) {
 		if (bdi->wb.task) {
 			trace_writeback_nowork(bdi);
@@ -138,6 +211,7 @@ __bdi_start_writeback(struct backing_dev
 		return;
 	}
 
+	memset(work, 0, sizeof(*work));
 	work->sync_mode	= WB_SYNC_NONE;
 	work->nr_pages	= nr_pages;
 	work->range_cyclic = range_cyclic;
@@ -187,6 +261,187 @@ void bdi_start_background_writeback(stru
 }
 
 /*
+ * Check if @work already covers @offset, or try to extend it to cover @offset.
+ * Returns true if the wb_writeback_work now encompasses the requested offset.
+ */
+static bool extend_writeback_range(struct wb_writeback_work *work,
+				   pgoff_t offset,
+				   unsigned long unit)
+{
+	pgoff_t end = work->start + work->nr_pages;
+
+	if (offset >= work->start && offset < end)
+		return true;
+
+	/*
+	 * For sequential workloads with good locality, include up to 8 times
+	 * more data in one chunk. The unit chunk size is calculated so that it
+	 * costs 8-16ms to write so many pages. So 8 times means we can extend
+	 * it up to 128ms. It's a good value because 128ms data transfer time
+	 * makes the typical overheads of 8ms disk seek time look small enough.
+	 */
+	if (work->nr_pages >= 8 * unit)
+		return false;
+
+	/* the unsigned comparison helps eliminate one compare */
+	if (work->start - offset < unit) {
+		work->nr_pages += unit;
+		work->start -= unit;
+		return true;
+	}
+
+	if (offset - end < unit) {
+		work->nr_pages += unit;
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * schedule writeback on a range of inode pages.
+ */
+static struct wb_writeback_work *
+alloc_queue_pageout_work(struct backing_dev_info *bdi,
+			 struct inode *inode,
+			 pgoff_t start,
+			 pgoff_t len)
+{
+	struct wb_writeback_work *work;
+
+	/*
+	 * Grab the inode until the work is executed. We are calling this from
+	 * page reclaim context and the only thing pinning the address_space
+	 * for the moment is the page lock.
+	 */
+	if (!igrab(inode))
+		return NULL;
+
+	work = mempool_alloc(wb_work_mempool, GFP_NOWAIT);
+	if (!work)
+		return NULL;
+
+	memset(work, 0, sizeof(*work));
+	work->sync_mode		= WB_SYNC_NONE;
+	work->sb		= inode->i_sb;
+	work->inode		= inode;
+	work->start		= start;
+	work->nr_pages		= len;
+	work->reason		= WB_REASON_PAGEOUT;
+
+	bdi_queue_work(bdi, work);
+
+	return work;
+}
+
+/*
+ * Called by page reclaim code to flush the dirty page ASAP. Do write-around to
+ * improve IO throughput. The nearby pages will have a good chance to reside in
+ * the same LRU list that vmscan is working on, and even close to each other
+ * inside the LRU list in the common case of sequential read/write.
+ *
+ * ret >= 0: success, there are at least @ret writeback works queued now
+ * ret < 0: failed
+ */
+enum {
+	PAGEOUT_WORK_FAILED_ALLOC = -3,
+	PAGEOUT_WORK_FAILED_TOO_MANY = -2,
+	PAGEOUT_WORK_FAILED_OTHER_WORKS = -1,
+};
+int queue_pageout_work(struct address_space *mapping, struct page *page)
+{
+	struct backing_dev_info *bdi = mapping->backing_dev_info;
+	struct inode *inode = mapping->host;
+	struct wb_writeback_work *work;
+	unsigned long write_around_pages;
+	pgoff_t offset = page->index;
+	int nr_works_iterated = 0;
+	int ret = PAGEOUT_WORK_FAILED_ALLOC;
+
+	/*
+	 * piggy back 8-16ms worth of data
+	 */
+	write_around_pages = bdi->avg_write_bandwidth + MIN_WRITEBACK_PAGES;
+	write_around_pages = rounddown_pow_of_two(write_around_pages) >> 6;
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_reverse(work, &bdi->work_list, list) {
+		/*
+		 * vmscan will slow down page reclaim when there are more than
+		 * LOTS_OF_WRITEBACK_WORKS queued. Limit the number of pageout
+		 * works to two times that number.
+		 */
+		if (nr_works_iterated++ > 2 * LOTS_OF_WRITEBACK_WORKS) {
+			ret = PAGEOUT_WORK_FAILED_TOO_MANY;
+			break;
+		}
+		/*
+		 * The other types of works may delay us for a long time.
+		 * Just fail it so that vmscan falls back to pageout().
+		 */
+		if (work->inode == NULL) {
+			ret = PAGEOUT_WORK_FAILED_OTHER_WORKS;
+			break;
+		}
+		if (work->inode != inode)
+			continue;
+		if (extend_writeback_range(work, offset, write_around_pages)) {
+			ret = nr_works_iterated;
+			break;
+		}
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	/*
+	 * If we didn't add the page to an existing wb_writeback_work and did not
+	 * encounter other failure conditions, allocate and queue a new one.
+	 */
+	if (ret == PAGEOUT_WORK_FAILED_ALLOC) {
+		offset = round_down(offset, write_around_pages);
+		work = alloc_queue_pageout_work(bdi, inode,
+						offset, write_around_pages);
+		if (work)
+			ret = nr_works_iterated;
+	}
+	return ret;
+}
+
+static void wb_free_work(struct wb_writeback_work *work)
+{
+	if (work->inode)
+		iput(work->inode);
+	/*
+	 * Notify the caller of completion if this is a synchronous
+	 * work item, otherwise just free it.
+	 */
+	if (work->done)
+		complete(work->done);
+	else
+		mempool_free(work, wb_work_mempool);
+}
+
+/*
+ * Remove works for @sb; or if (@sb == NULL), remove all works on @bdi.
+ */
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb)
+{
+	struct wb_writeback_work *work, *tmp;
+	LIST_HEAD(dispose);
+
+	spin_lock_bh(&bdi->wb_lock);
+	list_for_each_entry_safe(work, tmp, &bdi->work_list, list) {
+		if (sb && work->sb && sb != work->sb)
+			continue;
+		list_move(&work->list, &dispose);
+	}
+	spin_unlock_bh(&bdi->wb_lock);
+
+	list_for_each_entry(work, &dispose, list)
+		wb_free_work(work);
+}
+
+/*
  * Remove the inode from the writeback list it is on.
  */
 void inode_wb_list_del(struct inode *inode)
@@ -833,6 +1088,24 @@ static unsigned long get_nr_dirty_pages(
 		get_nr_dirty_inodes();
 }
 
+/*
+ * Clean pages for page reclaim. Returns the number of pages written.
+ */
+static long wb_pageout(struct bdi_writeback *wb, struct wb_writeback_work *work)
+{
+	struct writeback_control wbc = {
+		.sync_mode = WB_SYNC_NONE,
+		.nr_to_write = LONG_MAX,
+		.range_start = (loff_t)work->start << PAGE_CACHE_SHIFT,
+		.range_end = (loff_t)(work->start + work->nr_pages - 1)
+						<< PAGE_CACHE_SHIFT,
+	};
+
+	do_writepages(work->inode->i_mapping, &wbc);
+
+	return LONG_MAX - wbc.nr_to_write;
+}
+
 static long wb_check_background_flush(struct bdi_writeback *wb)
 {
 	if (over_bground_thresh(wb->bdi)) {
@@ -905,16 +1178,12 @@ long wb_do_writeback(struct bdi_writebac
 
 		trace_writeback_exec(bdi, work);
 
-		wrote += wb_writeback(wb, work);
-
-		/*
-		 * Notify the caller of completion if this is a synchronous
-		 * work item, otherwise just free it.
-		 */
-		if (work->done)
-			complete(work->done);
+		if (!work->inode)
+			wrote += wb_writeback(wb, work);
 		else
-			kfree(work);
+			wrote += wb_pageout(wb, work);
+
+		wb_free_work(work);
 	}
 
 	/*
--- linux.orig/include/trace/events/writeback.h	2012-03-02 13:32:54.885717395 +0800
+++ linux/include/trace/events/writeback.h	2012-03-02 13:34:39.873719890 +0800
@@ -23,7 +23,7 @@
 
 #define WB_WORK_REASON							\
 		{WB_REASON_BACKGROUND,		"background"},		\
-		{WB_REASON_TRY_TO_FREE_PAGES,	"try_to_free_pages"},	\
+		{WB_REASON_PAGEOUT,		"pageout"},		\
 		{WB_REASON_SYNC,		"sync"},		\
 		{WB_REASON_PERIODIC,		"periodic"},		\
 		{WB_REASON_LAPTOP_TIMER,	"laptop_timer"},	\
@@ -45,6 +45,8 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__field(int, range_cyclic)
 		__field(int, for_background)
 		__field(int, reason)
+		__field(unsigned long, ino)
+		__field(unsigned long, start)
 	),
 	TP_fast_assign(
 		struct device *dev = bdi->dev;
@@ -58,9 +60,11 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		__entry->range_cyclic = work->range_cyclic;
 		__entry->for_background	= work->for_background;
 		__entry->reason = work->reason;
+		__entry->ino = work->inode ? work->inode->i_ino : 0;
+		__entry->start = work->start;
 	),
 	TP_printk("bdi %s: sb_dev %d:%d nr_pages=%ld sync_mode=%d "
-		  "kupdate=%d range_cyclic=%d background=%d reason=%s",
+		  "kupdate=%d range_cyclic=%d background=%d reason=%s ino=%lu start=%lu",
 		  __entry->name,
 		  MAJOR(__entry->sb_dev), MINOR(__entry->sb_dev),
 		  __entry->nr_pages,
@@ -68,7 +72,9 @@ DECLARE_EVENT_CLASS(writeback_work_class
 		  __entry->for_kupdate,
 		  __entry->range_cyclic,
 		  __entry->for_background,
-		  __print_symbolic(__entry->reason, WB_WORK_REASON)
+		  __print_symbolic(__entry->reason, WB_WORK_REASON),
+		  __entry->ino,
+		  __entry->start
 	)
 );
 #define DEFINE_WRITEBACK_WORK_EVENT(name) \
--- linux.orig/include/linux/writeback.h	2012-03-02 13:32:54.901717396 +0800
+++ linux/include/linux/writeback.h	2012-03-02 13:34:39.873719890 +0800
@@ -40,7 +40,7 @@ enum writeback_sync_modes {
  */
 enum wb_reason {
 	WB_REASON_BACKGROUND,
-	WB_REASON_TRY_TO_FREE_PAGES,
+	WB_REASON_PAGEOUT,
 	WB_REASON_SYNC,
 	WB_REASON_PERIODIC,
 	WB_REASON_LAPTOP_TIMER,
@@ -94,6 +94,20 @@ long writeback_inodes_wb(struct bdi_writ
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+int queue_pageout_work(struct address_space *mapping, struct page *page);
+
+/*
+ * Tailored for vmscan which may submit lots of pageout works. The page reclaim
+ * should try to slow down the pageout work submission rate when the queue size
+ * grows to LOTS_OF_WRITEBACK_WORKS. queue_pageout_work() will accordingly limit
+ * its search depth to (2 * LOTS_OF_WRITEBACK_WORKS).
+ *
+ * Note that the limited search depth and work pool are not a big problem: 1024
+ * IOs in flight are typically more than enough to saturate the disk. And the
+ * overheads of searching the work list didn't even show up in perf reports.
+ */
+#define WB_WORK_MEMPOOL_SIZE		1024
+#define LOTS_OF_WRITEBACK_WORKS		(WB_WORK_MEMPOOL_SIZE / 8)
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
--- linux.orig/fs/super.c	2012-03-02 13:32:54.909717395 +0800
+++ linux/fs/super.c	2012-03-02 13:34:39.873719890 +0800
@@ -389,6 +389,7 @@ void generic_shutdown_super(struct super
 
 		fsnotify_unmount_inodes(&sb->s_inodes);
 
+		bdi_remove_writeback_works(sb->s_bdi, sb);
 		evict_inodes(sb);
 
 		if (sop->put_super)
--- linux.orig/include/linux/backing-dev.h	2012-03-02 13:32:54.893717395 +0800
+++ linux/include/linux/backing-dev.h	2012-03-02 13:34:39.873719890 +0800
@@ -126,6 +126,8 @@ int bdi_has_dirty_io(struct backing_dev_
 void bdi_arm_supers_timer(void);
 void bdi_wakeup_thread_delayed(struct backing_dev_info *bdi);
 void bdi_lock_two(struct bdi_writeback *wb1, struct bdi_writeback *wb2);
+void bdi_remove_writeback_works(struct backing_dev_info *bdi,
+				struct super_block *sb);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux.orig/mm/vmscan.c	2012-03-02 13:32:54.873717395 +0800
+++ linux/mm/vmscan.c	2012-03-02 13:34:39.873719890 +0800
@@ -874,12 +874,22 @@ static unsigned long shrink_page_list(st
 			nr_dirty++;
 
 			/*
-			 * Only kswapd can writeback filesystem pages to
-			 * avoid risk of stack overflow but do not writeback
-			 * unless under significant pressure.
+			 * Pages may be dirtied anywhere inside the LRU. This
+			 * ensures they undergo a full period of LRU iteration
+			 * before considering pageout. The intention is to
+			 * delay writeout to the flusher thread, unless we run
+			 * into a long segment of dirty pages.
+			 */
+			if (references == PAGEREF_RECLAIM_CLEAN &&
+			    priority == DEF_PRIORITY)
+				goto keep_locked;
+
+			/*
+			 * Try relaying the pageout I/O to the flusher threads
+			 * for better I/O efficiency and avoid stack overflow.
 			 */
-			if (page_is_file_cache(page) &&
-					(!current_is_kswapd() || priority >= DEF_PRIORITY - 2)) {
+			if (page_is_file_cache(page) && mapping &&
+			    queue_pageout_work(mapping, page) >= 0) {
 				/*
 				 * Immediately reclaim when written back.
 				 * Similar in principal to deactivate_page()
@@ -892,8 +902,13 @@ static unsigned long shrink_page_list(st
 				goto keep_locked;
 			}
 
-			if (references == PAGEREF_RECLAIM_CLEAN)
+			/*
+			 * Only kswapd can writeback filesystem pages to
+			 * avoid risk of stack overflow.
+			 */
+			if (page_is_file_cache(page) && !current_is_kswapd())
 				goto keep_locked;
+
 			if (!may_enter_fs)
 				goto keep_locked;
 			if (!sc->may_writepage)
@@ -2373,17 +2388,8 @@ static unsigned long do_try_to_free_page
 		if (sc->nr_reclaimed >= sc->nr_to_reclaim)
 			goto out;
 
-		/*
-		 * Try to write back as many pages as we just scanned.  This
-		 * tends to cause slow streaming writers to write data to the
-		 * disk smoothly, at the dirtying rate, which is nice.   But
-		 * that's undesirable in laptop mode, where we *want* lumpy
-		 * writeout.  So in laptop mode, write out the whole world.
-		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
 		if (total_scanned > writeback_threshold) {
-			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned,
-						WB_REASON_TRY_TO_FREE_PAGES);
 			sc->may_writepage = 1;
 		}
 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [RFC PATCH] mm: don't treat anonymous pages as dirtyable pages
  2012-02-28 14:00   ` Fengguang Wu
@ 2012-03-02  6:59     ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-02  6:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linux Memory Management List, LKML

> 5) reset counters and stress it more.
> 
> 	# usemem 1G --sleep 1000&
> 	# free
> 		     total       used       free     shared    buffers     cached
> 	Mem:          6801       6758         42          0          0        994
> 	-/+ buffers/cache:       5764       1036
> 	Swap:        51106        235      50870
> 
> It's now obviously slow, it now takes seconds or even 10+ seconds to switch to
> the other windows:
> 
>   765.30    A System Monitor
>   769.72    A Dictionary
>   772.01    A Home
>   790.79    A Desktop Help
>   795.47    A *Unsaved Document 1 - gedit
>   813.01    A ALC888.svg  (1/11)
>   819.24    A Restore Session - Iceweasel
>   827.23    A Klondike
>   853.57    A urxvt
>   862.49    A xeyes
>   868.67    A Xpdf: /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
>   869.47    A snb:/home/wfg - ZSH
> 
> And it seems that the slowness is caused by huge number of pageout()s:
> 
> /debug/vm/nr_reclaim_throttle_clean:0
> /debug/vm/nr_reclaim_throttle_kswapd:0
> /debug/vm/nr_reclaim_throttle_recent_write:0
> /debug/vm/nr_reclaim_throttle_write:307
> /debug/vm/nr_congestion_wait:0
> /debug/vm/nr_reclaim_wait_congested:0
> /debug/vm/nr_reclaim_wait_writeback:0
> /debug/vm/nr_migrate_wait_writeback:0
> nr_vmscan_write 175085
> allocstall 669671

The heavy swapping is a big problem. This patch is found to
effectively eliminate it :-)

---
Subject: mm: don't treat anonymous pages as dirtyable pages

Assume a mem=1GB desktop (swap enabled) with 800MB anonymous pages and
200MB file pages.  When the user starts a heavy dirtier task, the file
LRU lists may be mostly filled with dirty pages since the global dirty
limit is calculated as

	(anon+file) * 20% = 1GB * 20% = 200MB

This makes the file LRU lists hard to reclaim, which in turn increases
the scan rate of the anon LRU lists and leads to a lot of swapping. This
is probably one big reason why some desktop users see bad responsiveness
during heavy file copies once swap is enabled.

The heavy swapping could mostly be avoided by calculating the global
dirty limit as

	file * 20% = 200MB * 20% = 40MB

The side effect would be that users see longer file copy times because
the copy task is throttled earlier than before. However, typical users
should be much more sensitive to interactive performance than to the
copy task, which may well be left running in the background.
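
For illustration, a minimal stand-alone sketch of the before/after dirty
limit arithmetic for the example machine above (the 800MB/200MB split and
the 20% dirty ratio are the assumed example values from this changelog):

/* dirty_limit_sketch.c: old vs. new global dirty limit for the 1GB example */
#include <stdio.h>

int main(void)
{
	unsigned long anon_mb = 800, file_mb = 200, dirty_ratio = 20;

	printf("old limit ((anon+file) * 20%%): %lu MB\n",
	       (anon_mb + file_mb) * dirty_ratio / 100);
	printf("new limit (file * 20%%):        %lu MB\n",
	       file_mb * dirty_ratio / 100);
	return 0;
}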

Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
---
 include/linux/vmstat.h |    1 -
 mm/page-writeback.c    |   10 ++++++----
 mm/vmscan.c            |   14 --------------
 3 files changed, 6 insertions(+), 19 deletions(-)

--- linux.orig/include/linux/vmstat.h	2012-03-02 13:55:28.569749568 +0800
+++ linux/include/linux/vmstat.h	2012-03-02 13:56:06.585750471 +0800
@@ -139,7 +139,6 @@ static inline unsigned long zone_page_st
 	return x;
 }
 
-extern unsigned long global_reclaimable_pages(void);
 extern unsigned long zone_reclaimable_pages(struct zone *zone);
 
 #ifdef CONFIG_NUMA
--- linux.orig/mm/page-writeback.c	2012-03-02 13:55:28.549749567 +0800
+++ linux/mm/page-writeback.c	2012-03-02 13:56:26.257750938 +0800
@@ -181,8 +181,7 @@ static unsigned long highmem_dirtyable_m
 		struct zone *z =
 			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
 
-		x += zone_page_state(z, NR_FREE_PAGES) +
-		     zone_reclaimable_pages(z) - z->dirty_balance_reserve;
+		x += zone_dirtyable_memory(z);
 	}
 	/*
 	 * Make sure that the number of highmem pages is never larger
@@ -206,7 +205,9 @@ unsigned long global_dirtyable_memory(vo
 {
 	unsigned long x;
 
-	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
+	x = global_page_state(NR_FREE_PAGES) +
+	    global_page_state(NR_ACTIVE_FILE) +
+	    global_page_state(NR_INACTIVE_FILE) -
 	    dirty_balance_reserve;
 
 	if (!vm_highmem_is_dirtyable)
@@ -275,7 +276,8 @@ unsigned long zone_dirtyable_memory(stru
 	 * care about vm_highmem_is_dirtyable here.
 	 */
 	return zone_page_state(zone, NR_FREE_PAGES) +
-	       zone_reclaimable_pages(zone) -
+	       zone_page_state(zone, NR_ACTIVE_FILE) +
+	       zone_page_state(zone, NR_INACTIVE_FILE) -
 	       zone->dirty_balance_reserve;
 }
 
--- linux.orig/mm/vmscan.c	2012-03-02 13:55:28.561749567 +0800
+++ linux/mm/vmscan.c	2012-03-02 13:56:06.585750471 +0800
@@ -3315,20 +3315,6 @@ void wakeup_kswapd(struct zone *zone, in
  * - mapped pages, which may require several travels to be reclaimed
  * - dirty pages, which is not "instantly" reclaimable
  */
-unsigned long global_reclaimable_pages(void)
-{
-	int nr;
-
-	nr = global_page_state(NR_ACTIVE_FILE) +
-	     global_page_state(NR_INACTIVE_FILE);
-
-	if (nr_swap_pages > 0)
-		nr += global_page_state(NR_ACTIVE_ANON) +
-		      global_page_state(NR_INACTIVE_ANON);
-
-	return nr;
-}
-
 unsigned long zone_reclaimable_pages(struct zone *zone)
 {
 	int nr;

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC PATCH] mm: don't treat anonymous pages as dirtyable pages
  2012-03-02  6:59     ` Fengguang Wu
@ 2012-03-02  7:18       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-02  7:18 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Greg Thelen, Jan Kara, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Linux Memory Management List, LKML

The test results:

With the heavy memory usage shown below and one file copy from a sparse
file to a USB key under way,

root@snb /home/wfg/memcg-dirty/snb# free
             total       used       free     shared    buffers     cached
Mem:          6801       6750         50          0          0        893
-/+ buffers/cache:       5857        944
Swap:        51106         34      51072

There is not a single reclaim wait:

/debug/vm/nr_reclaim_throttle_clean:0
/debug/vm/nr_reclaim_throttle_kswapd:0
/debug/vm/nr_reclaim_throttle_recent_write:0
/debug/vm/nr_reclaim_throttle_write:0
/debug/vm/nr_reclaim_wait_congested:0
/debug/vm/nr_reclaim_wait_writeback:0
/debug/vm/nr_migrate_wait_writeback:0

and only occasional increases of

        /debug/vm/nr_congestion_wait (from kswapd)
        nr_vmscan_write
        allocstall

And the most visible thing: window switching remains swift:

 time         window title
-----------------------------------------------------------------------------
 3024.91    A LibreOffice 3.4
 3024.97    A Restore Session - Iceweasel
 3024.98    A System Settings
 3025.13    A urxvt
 3025.14    A xeyes
 3025.15    A snb:/home/wfg - ZSH
 3025.16    A snb:/home/wfg - ZSH
 3025.17    A Xpdf: /usr/share/doc/shared-mime-info/shared-mime-info-spec.pdf
 3025.18    A OpenOffice.org
 3025.23    A OpenOffice.org
 3025.25    A OpenOffice.org
 3025.26    A OpenOffice.org
 3025.27    A OpenOffice.org
 3025.28    A Chess
 3025.29    A Dictionary
 3025.31    A System Monitor
 3025.35    A snb:/home/wfg - ZSH
 3025.41    A Desktop Help
 3025.43    A Mines
 3025.49    A Tetravex
 3025.54    A Iagno
 3025.55    A Four-in-a-row
 3025.60    A Mahjongg - Easy
 3025.64    A Klotski
 3025.66    A Five or More
 3025.68    A Tali
 3025.69    A Robots
 3025.71    A Klondike
 3025.79    A Home
 3025.82    A Home
 3025.86    A *Unsaved Document 1 - gedit
 3025.87    A Sudoku
 3025.93    A LibreOffice 3.4
 3025.98    A Restore Session - Iceweasel
 3025.99    A System Settings
 3026.13    A urxvt

Thanks,
Fengguang

> Assume a mem=1GB desktop (swap enabled) with 800MB anonymous pages and
> 200MB file pages.  When the user starts a heavy dirtier task, the file
> LRU lists may be mostly filled with dirty pages since the global dirty
> limit is calculated as
> 
> 	(anon+file) * 20% = 1GB * 20% = 200MB
> 
> This makes the file LRU lists hard to reclaim, which in turn increases
> the scan rate of the anon LRU lists and lead to a lot of swapping. This
> is probably one big reason why some desktop users see bad responsiveness
> during heavy file copies once the swap is enabled.
> 
> The heavy swapping could mostly be avoided by calculating the global
> dirty limit as
> 
> 	file * 20% = 200MB * 20% = 40MB
> 
> The side effect would be that users feel longer file copy time because
> the copy task is throttled earlier than before. However typical users
> should be much more sensible to interactive performance rather than the
> copy task which may well be leaved in the background.
> 
> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> ---
>  include/linux/vmstat.h |    1 -
>  mm/page-writeback.c    |   10 ++++++----
>  mm/vmscan.c            |   14 --------------
>  3 files changed, 6 insertions(+), 19 deletions(-)
> 
> --- linux.orig/include/linux/vmstat.h	2012-03-02 13:55:28.569749568 +0800
> +++ linux/include/linux/vmstat.h	2012-03-02 13:56:06.585750471 +0800
> @@ -139,7 +139,6 @@ static inline unsigned long zone_page_st
>  	return x;
>  }
>  
> -extern unsigned long global_reclaimable_pages(void);
>  extern unsigned long zone_reclaimable_pages(struct zone *zone);
>  
>  #ifdef CONFIG_NUMA
> --- linux.orig/mm/page-writeback.c	2012-03-02 13:55:28.549749567 +0800
> +++ linux/mm/page-writeback.c	2012-03-02 13:56:26.257750938 +0800
> @@ -181,8 +181,7 @@ static unsigned long highmem_dirtyable_m
>  		struct zone *z =
>  			&NODE_DATA(node)->node_zones[ZONE_HIGHMEM];
>  
> -		x += zone_page_state(z, NR_FREE_PAGES) +
> -		     zone_reclaimable_pages(z) - z->dirty_balance_reserve;
> +		x += zone_dirtyable_memory(z);
>  	}
>  	/*
>  	 * Make sure that the number of highmem pages is never larger
> @@ -206,7 +205,9 @@ unsigned long global_dirtyable_memory(vo
>  {
>  	unsigned long x;
>  
> -	x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages() -
> +	x = global_page_state(NR_FREE_PAGES) +
> +	    global_page_state(NR_ACTIVE_FILE) +
> +	    global_page_state(NR_INACTIVE_FILE) -
>  	    dirty_balance_reserve;
>  
>  	if (!vm_highmem_is_dirtyable)
> @@ -275,7 +276,8 @@ unsigned long zone_dirtyable_memory(stru
>  	 * care about vm_highmem_is_dirtyable here.
>  	 */
>  	return zone_page_state(zone, NR_FREE_PAGES) +
> -	       zone_reclaimable_pages(zone) -
> +	       zone_page_state(zone, NR_ACTIVE_FILE) +
> +	       zone_page_state(zone, NR_INACTIVE_FILE) -
>  	       zone->dirty_balance_reserve;
>  }
>  
> --- linux.orig/mm/vmscan.c	2012-03-02 13:55:28.561749567 +0800
> +++ linux/mm/vmscan.c	2012-03-02 13:56:06.585750471 +0800
> @@ -3315,20 +3315,6 @@ void wakeup_kswapd(struct zone *zone, in
>   * - mapped pages, which may require several travels to be reclaimed
>   * - dirty pages, which is not "instantly" reclaimable
>   */
> -unsigned long global_reclaimable_pages(void)
> -{
> -	int nr;
> -
> -	nr = global_page_state(NR_ACTIVE_FILE) +
> -	     global_page_state(NR_INACTIVE_FILE);
> -
> -	if (nr_swap_pages > 0)
> -		nr += global_page_state(NR_ACTIVE_ANON) +
> -		      global_page_state(NR_INACTIVE_ANON);
> -
> -	return nr;
> -}
> -
>  unsigned long zone_reclaimable_pages(struct zone *zone)
>  {
>  	int nr;

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-02  4:48           ` Fengguang Wu
@ 2012-03-02  9:59             ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-02  9:59 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Fri 02-03-12 12:48:58, Wu Fengguang wrote:
> On Thu, Mar 01, 2012 at 05:38:37PM +0100, Jan Kara wrote:
> > On Thu 01-03-12 20:36:40, Wu Fengguang wrote:
> > > > Please have a think about all of this and see if you can demonstrate
> > > > how the iput() here is guaranteed safe.
> > > 
> > > There are already several __iget()/iput() calls inside fs-writeback.c.
> > > The existing iput() calls already demonstrate its safety?
> > > 
> > > Basically the flusher works in this way
> > > 
> > > - the dirty inode list i_wb_list does not reference count the inode at all
> > > 
> > > - the flusher thread does something analog to igrab() and set I_SYNC
> > >   before going off to writeout the inode
> > > 
> > > - evict() will wait for completion of I_SYNC
> >   Yes, you are right that currently writeback code already holds inode
> > references and so it can happen that flusher thread drops the last inode
> > reference. But currently that could create problems only if someone waits
> > for flusher thread to make progress while effectively blocking e.g.
> > truncate from happening. Currently flusher thread handles sync(2) and
> > background writeback and filesystems take care to not hold any locks
> > blocking IO / truncate while possibly waiting for these.
> > 
> > But with your addition situation changes significantly - now anyone doing
> > allocation can block and do allocation from all sorts of places including
> > ones where we hold locks blocking other fs activity. The good news is that
> > we use GFP_NOFS in such places. So if GFP_NOFS allocation cannot possibly
> > depend on a completion of some writeback work, then I'd still be
> > comfortable with dropping inode references from writeback code. But Andrew
> > is right this at least needs some arguing...
> 
> You seem to miss the point that we don't do wait or page allocations
> inside queue_pageout_work().
  I didn't miss this point. I know we don't wait directly. But if the only
way to free pages from the zone where we need to do the allocation is via
the flusher thread, then we effectively *are* waiting for the work to
complete. And if the flusher thread is blocked, we have a problem. And I
agree it's unlikely, but given enough time and people, I believe someone
will find a way to (inadvertently) trigger this.

> The final iput() will not block the
> random tasks because the latter don't wait for completion of the work.
> 
>         random task                     flusher thread
> 
>         page allocation
>           page reclaim
>             queue_pageout_work()
>               igrab()
> 
>                   ......  after a while  ......
> 
>                                         execute pageout work                
>                                         iput()
>                                         <work completed>
> 
> There will be some reclaim_wait()s if the pageout works are not
> executed quickly, in which case vmscan will be impacted and slowed
> down. However it's not waiting for any specific work to complete, so
> there is no chance to form a loop of dependencies leading to deadlocks.
> 
> The iput() does have the theoretic possibility to deadlock the flusher
> thread itself (but not with the other random tasks). Since the flusher
> thread has always been doing iput() w/o running into such bugs, we can
> reasonably expect the new iput() to be as safe in practical.
  But so far, kswapd could do the writeout itself, so even if the flusher
thread is blocked in iput(), we could still do writeout from kswapd to clean
the zones.

Now I don't think blocking on iput() can be a problem, for the reasons I
outlined in another email yesterday (GFP_NOFS allocations and such). I just
don't agree with your reasoning that it cannot be a problem because it
was not a problem previously. That's just not true.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-02  9:59             ` Jan Kara
@ 2012-03-02 10:39               ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-02 10:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Fri, Mar 02, 2012 at 10:59:10AM +0100, Jan Kara wrote:
> On Fri 02-03-12 12:48:58, Wu Fengguang wrote:
> > On Thu, Mar 01, 2012 at 05:38:37PM +0100, Jan Kara wrote:
> > > On Thu 01-03-12 20:36:40, Wu Fengguang wrote:
> > > > > Please have a think about all of this and see if you can demonstrate
> > > > > how the iput() here is guaranteed safe.
> > > > 
> > > > There are already several __iget()/iput() calls inside fs-writeback.c.
> > > > The existing iput() calls already demonstrate its safety?
> > > > 
> > > > Basically the flusher works in this way
> > > > 
> > > > - the dirty inode list i_wb_list does not reference count the inode at all
> > > > 
> > > > - the flusher thread does something analog to igrab() and set I_SYNC
> > > >   before going off to writeout the inode
> > > > 
> > > > - evict() will wait for completion of I_SYNC
> > >   Yes, you are right that currently writeback code already holds inode
> > > references and so it can happen that flusher thread drops the last inode
> > > reference. But currently that could create problems only if someone waits
> > > for flusher thread to make progress while effectively blocking e.g.
> > > truncate from happening. Currently flusher thread handles sync(2) and
> > > background writeback and filesystems take care to not hold any locks
> > > blocking IO / truncate while possibly waiting for these.
> > > 
> > > But with your addition situation changes significantly - now anyone doing
> > > allocation can block and do allocation from all sorts of places including
> > > ones where we hold locks blocking other fs activity. The good news is that
> > > we use GFP_NOFS in such places. So if GFP_NOFS allocation cannot possibly
> > > depend on a completion of some writeback work, then I'd still be
> > > comfortable with dropping inode references from writeback code. But Andrew
> > > is right this at least needs some arguing...
> > 
> > You seem to miss the point that we don't do wait or page allocations
> > inside queue_pageout_work().
>   I didn't miss this point. I know we don't wait directly. But if the only

Ah OK.

> way to free pages from the zone where we need to do allocation is via flusher
> thread, then we effectively *are* waiting for the work to complete. And if
> the flusher thread is blocked, we have a problem.

Right. If the flusher ever deadlocks itself, page reclaim may be in trouble.

What's more, the global dirty threshold may also be exceeded
(especially when it's the only bdi in the system). Then
balance_dirty_pages() kicks in and blocks every writer in the system,
including the occasional writers. For example, /bin/bash will be blocked
when writing to .bash_history. The system effectively becomes
unusable.

> And I agree it's unlikely but given enough time and people, I
> believe someone finds a way to (inadvertedly) trigger this.

Right. The pageout works could add lots more iput() calls to the flusher
and turn some hidden, statistically impossible bugs into real ones.

Fortunately the "flusher deadlocks itself" case is easy to detect and
prevent as illustrated in another email.
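
(Just to make the idea concrete, the check could be a one-liner along the
lines of the hypothetical sketch below; the exact form and the field names
are assumptions here, the real check was posted in that other email:)

	/* in the queue-and-wait path: never let the flusher wait on itself */
	if (current == bdi->wb.task)
		return;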

> > The final iput() will not block the
> > random tasks because the latter don't wait for completion of the work.
> > 
> >         random task                     flusher thread
> > 
> >         page allocation
> >           page reclaim
> >             queue_pageout_work()
> >               igrab()
> > 
> >                   ......  after a while  ......
> > 
> >                                         execute pageout work                
> >                                         iput()
> >                                         <work completed>
> > 
> > There will be some reclaim_wait()s if the pageout works are not
> > executed quickly, in which case vmscan will be impacted and slowed
> > down. However it's not waiting for any specific work to complete, so
> > there is no chance to form a loop of dependencies leading to deadlocks.
> > 
> > The iput() does have the theoretic possibility to deadlock the flusher
> > thread itself (but not with the other random tasks). Since the flusher
> > thread has always been doing iput() w/o running into such bugs, we can
> > reasonably expect the new iput() to be as safe in practical.
>   But so far, kswapd could do writeout itself so even if flusher thread is
> blocked in iput(), we could still do writeout from kswapd to clean zones.
> 
> Now I don't think blocking on iput() can be a problem because of reasons I
> outlined in another email yesterday (GFP_NOFS allocations and such). Just
> I don't agree with your reasoning that it cannot be a problem because it
> was not problem previously. That's just not true.

Heh, the dilemma for GFP_NOFS is: vmscan only calls pageout() on inode
pages for __GFP_FS allocations. So the only hope for this kind of
allocation is to relay pageout works to the flusher... and hope that it
does not deadlock itself.
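
In vmscan terms the split looks roughly like this (a much simplified
sketch, not the exact shrink_page_list() code; queue_pageout_work() is
the helper added by this series and its argument list here is assumed):

	if (PageDirty(page)) {
		if (!(sc->gfp_mask & __GFP_FS)) {
			/*
			 * Not allowed to enter the filesystem from this
			 * context: relay the I/O to the flusher instead.
			 */
			queue_pageout_work(mapping, page);
			goto keep_locked;
		}
		/* __GFP_FS callers may still write the page directly */
		pageout(page, mapping, sc);
	}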

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-02 10:39               ` Fengguang Wu
@ 2012-03-02 19:57                 ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-03-02 19:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Fri, 2 Mar 2012 18:39:51 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> > And I agree it's unlikely but given enough time and people, I
> > believe someone finds a way to (inadvertedly) trigger this.
> 
> Right. The pageout works could add lots more iput() to the flusher
> and turn some hidden statistical impossible bugs into real ones.
> 
> Fortunately the "flusher deadlocks itself" case is easy to detect and
> prevent as illustrated in another email.

It would be a heck of a lot safer and saner to avoid the iput().  We
know how to do this, so why not do it?


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-01 19:46           ` Andrew Morton
@ 2012-03-03 13:25             ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-03 13:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Thu, Mar 01, 2012 at 11:46:34AM -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2012 19:41:51 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > >   I think using get_page() might be a good way to go. Naive implementation:
> > > If we need to write a page from kswapd, we do get_page(), attach page to
> > > wb_writeback_work and push it to flusher thread to deal with it.
> > > Flusher thread sees the work, takes a page lock, verifies the page is still
> > > attached to some inode & dirty (it could have been truncated / cleaned by
> > > someone else) and if yes, it submits page for IO (possibly with some
> > > writearound). This scheme won't have problems with iput() and won't have
> > > problems with umount. Also we guarantee some progress - either flusher
> > > thread does it, or some else must have done the work before flusher thread
> > > got to it.
> > 
> > I like this idea.
> > 
> > get_page() looks the perfect solution to verify if the struct inode
> > pointer (w/o igrab) is still live and valid.
> > 
> > [...upon rethinking...] Oh but still we need to lock some page to pin
> > the inode during the writeout. Then there is the dilemma: if the page
> > is locked, we effectively keep it from being written out...
> 
> No, all you need to do is to structure the code so that after the page
> gets unlocked, the kernel thread does not touch the address_space.  So
> the processing within the kthread is along the lines of
> 
> writearound(locked_page)
> {
> 	write some pages preceding locked_page;	/* touches address_space */

It seems the above line will lead to an ABBA deadlock.

At least btrfs will lock a number of pages in lock_delalloc_pages().
This demands that all page locks be taken in ascending order of file
offset. Otherwise it's possible for some task doing
__filemap_fdatawrite_range(), which in turn calls into
lock_delalloc_pages(), to deadlock with the writearound() here, which
takes some page in the middle first. The fix is to only do "write
ahead", which will obviously lead to more, smaller I/Os.

> 	write locked_page;
> 	write pages following locked_page;	/* touches address_space */
> 	unlock_page(locked_page);
> }

It is still a lock in general, which implies the danger of deadlocks. If
some filesystem does smart things like piggybacking more pages than we
asked for, it may try to lock locked_page inside writearound() and block
the flusher forever.

Grabbing the page lock at work enqueue time is particularly problematic.
It's susceptible to the above ABBA deadlock scheme because we will be
taking one page lock per pageout work and the pages are likely in
_random_ order. Another scenario is when the flusher is running a sync
work (or running some final iput() and therefore a truncate) and
vmscan queues a pageout work with one page locked. The flusher
will then deadlock on that page: the current sync/truncate is
trying to lock a page that can only be unlocked when the flusher goes
forward to execute the pageout work. The fix is to do get_page() at
work enqueue time and only take the page lock at work execution time.
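
So the ordering I have in mind looks like this (a minimal sketch;
queue_pageout_work() is the helper from this series with an assumed
argument list, and do_writeahead() is a hypothetical helper that writes
pages in ascending offset order only):

	/* vmscan, at work enqueue time: no page lock taken here */
	get_page(page);
	queue_pageout_work(mapping, page);

	/* flusher, at work execution time */
	lock_page(page);
	if (page->mapping && PageDirty(page))
		do_writeahead(page->mapping, page->index);
	unlock_page(page);
	put_page(page);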

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-02 19:57                 ` Andrew Morton
@ 2012-03-03 13:55                   ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-03 13:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter,
	Artem Bityutskiy

On Fri, Mar 02, 2012 at 11:57:00AM -0800, Andrew Morton wrote:
> On Fri, 2 Mar 2012 18:39:51 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > > And I agree it's unlikely but given enough time and people, I
> > > believe someone finds a way to (inadvertedly) trigger this.
> > 
> > Right. The pageout works could add lots more iput() to the flusher
> > and turn some hidden statistical impossible bugs into real ones.
> > 
> > Fortunately the "flusher deadlocks itself" case is easy to detect and
> > prevent as illustrated in another email.
> 
> It would be a heck of a lot safer and saner to avoid the iput().  We
> know how to do this, so why not do it?

My concern about the page lock is that it costs more code and sounds like
hacking around the problem. It seems we (including me) have been trying
to shy away from the iput() problem. Since it's unlikely we will ever
get rid of the already existing iput() calls from the flusher context,
why not face the problem, sort it out and use iput() with confidence in
new code?

Let me try it now. The only way iput() can deadlock the flusher is for
the iput() path to come back to queue some work and wait for it.
Here is the exhaustive list of the queue+wait paths:

writeback_inodes_sb_nr_if_idle
  ext4_nonda_switch
    ext4_page_mkwrite                   # from page fault
    ext4_da_write_begin                 # from user writes

writeback_inodes_sb_nr
  quotactl syscall                      # from syscall
  __sync_filesystem                     # from sync/umount
  shrink_liability                      # ubifs
    make_free_space
      ubifs_budget_space                # from all over ubifs:

   2    274  /c/linux/fs/ubifs/dir.c <<ubifs_create>>
   3    531  /c/linux/fs/ubifs/dir.c <<ubifs_link>>
   4    586  /c/linux/fs/ubifs/dir.c <<ubifs_unlink>>
   5    675  /c/linux/fs/ubifs/dir.c <<ubifs_rmdir>>
   6    731  /c/linux/fs/ubifs/dir.c <<ubifs_mkdir>>
   7    803  /c/linux/fs/ubifs/dir.c <<ubifs_mknod>>
   8    871  /c/linux/fs/ubifs/dir.c <<ubifs_symlink>>
   9   1006  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
  10   1009  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
  11    246  /c/linux/fs/ubifs/file.c <<write_begin_slow>>
  12    388  /c/linux/fs/ubifs/file.c <<allocate_budget>>
  13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable
  14   1217  /c/linux/fs/ubifs/file.c <<do_setattr>>
  15   1381  /c/linux/fs/ubifs/file.c <<update_mctime>>
  16   1486  /c/linux/fs/ubifs/file.c <<ubifs_vm_page_mkwrite>>
  17    110  /c/linux/fs/ubifs/ioctl.c <<setflags>>
  19    122  /c/linux/fs/ubifs/xattr.c <<create_xattr>>
  20    201  /c/linux/fs/ubifs/xattr.c <<change_xattr>>
  21    494  /c/linux/fs/ubifs/xattr.c <<remove_xattr>>

It seems they are all safe except for ubifs. ubifs may actually
deadlock via the above do_truncation() caller. However it should be
fixable, because the ubifs call to writeback_inodes_sb_nr() is a rather
brute-force writeback-and-wait and there may well be a better way
out.

CCing ubifs developers for possible thoughts..

Thanks,
Fengguang

PS. I'll be traveling next week and won't have much time to reply to
emails. Sorry about that.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-03 13:55                   ` Fengguang Wu
@ 2012-03-03 14:27                     ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-03 14:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Artem Bityutskiy,
	Adrian Hunter

[correct email addresses for Artem and Adrian]

On Sat, Mar 03, 2012 at 09:55:58PM +0800, Fengguang Wu wrote:
> On Fri, Mar 02, 2012 at 11:57:00AM -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2012 18:39:51 +0800
> > Fengguang Wu <fengguang.wu@intel.com> wrote:
> > 
> > > > And I agree it's unlikely but given enough time and people, I
> > > > believe someone finds a way to (inadvertedly) trigger this.
> > > 
> > > Right. The pageout works could add lots more iput() to the flusher
> > > and turn some hidden statistical impossible bugs into real ones.
> > > 
> > > Fortunately the "flusher deadlocks itself" case is easy to detect and
> > > prevent as illustrated in another email.
> > 
> > It would be a heck of a lot safer and saner to avoid the iput().  We
> > know how to do this, so why not do it?
> 
> My concern about the page lock is, it costs more code and sounds like
> hacking around something. It seems we (including me) have been trying
> to shun away from the iput() problem. Since it's unlikely we are to
> get rid of the already existing iput() calls from the flusher context,
> why not face the problem, sort it out and use it with confident in new
> code?
> 
> Let me try it now. The only scheme iput() can deadlock the flusher is
> for the iput() path to come back to queue some work and wait for it.
> Here are the exhaust list of the queue+wait paths:
> 
> writeback_inodes_sb_nr_if_idle
>   ext4_nonda_switch
>     ext4_page_mkwrite                   # from page fault
>     ext4_da_write_begin                 # from user writes
> 
> writeback_inodes_sb_nr
>   quotactl syscall                      # from syscall
>   __sync_filesystem                     # from sync/umount
>   shrink_liability                      # ubifs
>     make_free_space
>       ubifs_budget_space                # from all over ubifs:
> 
>    2    274  /c/linux/fs/ubifs/dir.c <<ubifs_create>>
>    3    531  /c/linux/fs/ubifs/dir.c <<ubifs_link>>
>    4    586  /c/linux/fs/ubifs/dir.c <<ubifs_unlink>>
>    5    675  /c/linux/fs/ubifs/dir.c <<ubifs_rmdir>>
>    6    731  /c/linux/fs/ubifs/dir.c <<ubifs_mkdir>>
>    7    803  /c/linux/fs/ubifs/dir.c <<ubifs_mknod>>
>    8    871  /c/linux/fs/ubifs/dir.c <<ubifs_symlink>>
>    9   1006  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
>   10   1009  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
>   11    246  /c/linux/fs/ubifs/file.c <<write_begin_slow>>
>   12    388  /c/linux/fs/ubifs/file.c <<allocate_budget>>
>   13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable
>   14   1217  /c/linux/fs/ubifs/file.c <<do_setattr>>
>   15   1381  /c/linux/fs/ubifs/file.c <<update_mctime>>
>   16   1486  /c/linux/fs/ubifs/file.c <<ubifs_vm_page_mkwrite>>
>   17    110  /c/linux/fs/ubifs/ioctl.c <<setflags>>
>   19    122  /c/linux/fs/ubifs/xattr.c <<create_xattr>>
>   20    201  /c/linux/fs/ubifs/xattr.c <<change_xattr>>
>   21    494  /c/linux/fs/ubifs/xattr.c <<remove_xattr>>
> 
> It seems they are all safe except for ubifs. ubifs may actually
> deadlock from the above do_truncation() caller. However it should be

Sorry, do_truncation() is actually called from ubifs_setattr(),
which is not related to iput().

Are there other ways for iput() to end up in the above list of ubifs
functions, then start a writeback work and wait for it, which would
deadlock the flusher? ubifs_unlink() and perhaps remove_xattr()?

> fixable because the ubifs call for writeback_inodes_sb_nr() sounds
> very brute force writeback and wait and there may well be better way
> out.
> 
> CCing ubifs developers for possible thoughts..
> 
> Thanks,
> Fengguang
> 
> PS. I'll be on travel in the following week and won't have much time
> for replying emails. Sorry about that.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 1/9] memcg: add page_cgroup flags for dirty page tracking
  2012-02-29  0:50     ` KAMEZAWA Hiroyuki
@ 2012-03-04  1:29       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-04  1:29 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Andrew Morton, Greg Thelen, Jan Kara, Ying Han, hannes,
	Rik van Riel, Andrea Righi, Minchan Kim,
	Linux Memory Management List, LKML

On Wed, Feb 29, 2012 at 09:50:51AM +0900, KAMEZAWA Hiroyuki wrote:
> On Tue, 28 Feb 2012 22:00:23 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > From: Greg Thelen <gthelen@google.com>
> > 
> > Add additional flags to page_cgroup to track dirty pages
> > within a mem_cgroup.
> > 
> > Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> > Signed-off-by: Andrea Righi <andrea@betterlinux.com>
> > Signed-off-by: Greg Thelen <gthelen@google.com>
> > Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
> > Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
> 
> I'm sorry but I changed the design of page_cgroup's flags update
> and never want to add new flags (I'd like to remove page_cgroup->flags.)

No need to be sorry - it makes good sense to reuse the native page flags :)

> Please see linux-next.
> 
> A good example is PCG_FILE_MAPPED, which I removed.
> 
> memcg: use new logic for page stat accounting
> memcg: remove PCG_FILE_MAPPED
> 
> You can make use of PageDirty() and PageWriteback() instead of new flags.. (I hope.)

The dirty page accounting is currently done in account_page_dirtied()
which is called from

__set_page_dirty <= __set_page_dirty_buffers
__set_page_dirty_nobuffers
ceph_set_page_dirty

inside &mapping->tree_lock. TestSetPageDirty() is also called inside
&mapping->private_lock. So we'll be nesting &memcg->move_lock inside
those two mapping locks, and possibly &ci->i_ceph_lock, if doing

         move_lock_mem_cgroup(page) # may take &memcg->move_lock
         TestSetPageDirty(page)
         update page stats (without any checks)
         move_unlock_mem_cgroup(page)

It should be feasible if that lock dependency is fine.

The PG_writeback accounting is very similar to the PG_dirty accounting
and can be handled in the same way.
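
For illustration, that sequence could look like the code below. This is
only a sketch: mem_cgroup_move_lock()/unlock() and MEMCG_NR_FILE_DIRTY
are assumed names standing in for "take &memcg->move_lock" and the new
dirty counter, not the actual linux-next API.

/*
 * Sketch only: account a newly dirtied page under the memcg move lock,
 * relying on the native PG_dirty bit instead of a new PCG flag.
 */
static void set_page_dirty_account(struct page *page)
{
        unsigned long flags;

        mem_cgroup_move_lock(page, &flags);     /* may take &memcg->move_lock */
        if (!TestSetPageDirty(page))
                mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_DIRTY);
        mem_cgroup_move_unlock(page, &flags);
}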

Thanks,
Fengguang

> > ---
> >  include/linux/page_cgroup.h |   23 +++++++++++++++++++++++
> >  1 file changed, 23 insertions(+)
> > 
> > --- linux.orig/include/linux/page_cgroup.h	2012-02-19 10:53:14.000000000 +0800
> > +++ linux/include/linux/page_cgroup.h	2012-02-19 10:53:16.000000000 +0800
> > @@ -10,6 +10,9 @@ enum {
> >  	/* flags for mem_cgroup and file and I/O status */
> >  	PCG_MOVE_LOCK, /* For race between move_account v.s. following bits */
> >  	PCG_FILE_MAPPED, /* page is accounted as "mapped" */
> > +	PCG_FILE_DIRTY, /* page is dirty */
> > +	PCG_FILE_WRITEBACK, /* page is under writeback */
> > +	PCG_FILE_UNSTABLE_NFS, /* page is NFS unstable */
> >  	__NR_PCG_FLAGS,
> >  };
> >  
> > @@ -64,6 +67,10 @@ static inline void ClearPageCgroup##unam
> >  static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
> >  	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
> >  
> > +#define TESTSETPCGFLAG(uname, lname)			\
> > +static inline int TestSetPageCgroup##uname(struct page_cgroup *pc)	\
> > +	{ return test_and_set_bit(PCG_##lname, &pc->flags); }
> > +
> >  /* Cache flag is set only once (at allocation) */
> >  TESTPCGFLAG(Cache, CACHE)
> >  CLEARPCGFLAG(Cache, CACHE)
> > @@ -77,6 +84,22 @@ SETPCGFLAG(FileMapped, FILE_MAPPED)
> >  CLEARPCGFLAG(FileMapped, FILE_MAPPED)
> >  TESTPCGFLAG(FileMapped, FILE_MAPPED)
> >  
> > +SETPCGFLAG(FileDirty, FILE_DIRTY)
> > +CLEARPCGFLAG(FileDirty, FILE_DIRTY)
> > +TESTPCGFLAG(FileDirty, FILE_DIRTY)
> > +TESTCLEARPCGFLAG(FileDirty, FILE_DIRTY)
> > +TESTSETPCGFLAG(FileDirty, FILE_DIRTY)
> > +
> > +SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
> > +CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
> > +TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
> > +
> > +SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> > +CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> > +TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> > +TESTCLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> > +TESTSETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
> > +
> >  SETPCGFLAG(Migration, MIGRATION)
> >  CLEARPCGFLAG(Migration, MIGRATION)
> >  TESTPCGFLAG(Migration, MIGRATION)
> > 
> > 
> > 

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-03 14:27                     ` Fengguang Wu
@ 2012-03-04 11:13                       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-04 11:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Artem Bityutskiy,
	Adrian Hunter, Chris Mason, linux-fsdevel

On Sat, Mar 03, 2012 at 10:27:45PM +0800, Fengguang Wu wrote:
> [correct email addresses for Artem and Adrian]
> 
> On Sat, Mar 03, 2012 at 09:55:58PM +0800, Fengguang Wu wrote:
> > On Fri, Mar 02, 2012 at 11:57:00AM -0800, Andrew Morton wrote:
> > > On Fri, 2 Mar 2012 18:39:51 +0800
> > > Fengguang Wu <fengguang.wu@intel.com> wrote:
> > > 
> > > > > And I agree it's unlikely but given enough time and people, I
> > > > > believe someone finds a way to (inadvertedly) trigger this.
> > > > 
> > > > Right. The pageout works could add lots more iput() to the flusher
> > > > and turn some hidden statistical impossible bugs into real ones.
> > > > 
> > > > Fortunately the "flusher deadlocks itself" case is easy to detect and
> > > > prevent as illustrated in another email.
> > > 
> > > It would be a heck of a lot safer and saner to avoid the iput().  We
> > > know how to do this, so why not do it?
> > 
> > My concern about the page lock is, it costs more code and sounds like
> > hacking around something. It seems we (including me) have been trying
> > to shun away from the iput() problem. Since it's unlikely we are to
> > get rid of the already existing iput() calls from the flusher context,
> > why not face the problem, sort it out and use it with confident in new
> > code?
> > 
> > Let me try it now. The only scheme iput() can deadlock the flusher is
> > for the iput() path to come back to queue some work and wait for it.
> > Here are the exhaust list of the queue+wait paths:


> > writeback_inodes_sb_nr_if_idle

Sorry, the above function is actually called from all over btrfs;
ext4 uses the much heavier weight writeback_inodes_sb_if_idle(). I'm
not sure the ext4/ubifs developers are fully aware that these
functions may take seconds or even dozens of seconds to complete and
show up as long delays to users, because they write and wait for ALL
dirty pages on the superblock. Even without the iput() deadlock
problem, it's still questionable to start and wait for such big
writeback works from fs code. If these waits could be turned into
congestion_wait() style throttling, that would not only completely
remove the possibility of the iput() deadlock, but also make the
delays much smaller. congestion_wait() is just an example and may not
be a good fit, but the *_if_idle naming does indicate that the
calling ext4/btrfs sites are pretty flexible (or careless) about the
exact policy used.
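
To make that a bit more concrete, a throttled helper could look
something like the sketch below. This is only an illustration under my
own assumptions (throttle_dirty_fs() is made up for the example,
wakeup_flusher_threads() works bdi-wide rather than per superblock, and
the 1024-page / 100ms numbers are arbitrary), not a tested patch:

static void throttle_dirty_fs(struct super_block *sb)
{
        /*
         * Kick background writeback; sb is not used here because
         * wakeup_flusher_threads() operates on all bdis -- one of the
         * details glossed over in this sketch.
         */
        wakeup_flusher_threads(1024, WB_REASON_FS_FREE_SPACE);

        /* bounded sleep: the caller is throttled, but can never deadlock */
        congestion_wait(BLK_RW_ASYNC, HZ / 10);
}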

Thanks,
Fengguang

> >   ext4_nonda_switch
> >     ext4_page_mkwrite                   # from page fault
> >     ext4_da_write_begin                 # from user writes
> > 
> > writeback_inodes_sb_nr
> >   quotactl syscall                      # from syscall
> >   __sync_filesystem                     # from sync/umount
> >   shrink_liability                      # ubifs
> >     make_free_space
> >       ubifs_budget_space                # from all over ubifs:
> > 
> >    2    274  /c/linux/fs/ubifs/dir.c <<ubifs_create>>
> >    3    531  /c/linux/fs/ubifs/dir.c <<ubifs_link>>
> >    4    586  /c/linux/fs/ubifs/dir.c <<ubifs_unlink>>
> >    5    675  /c/linux/fs/ubifs/dir.c <<ubifs_rmdir>>
> >    6    731  /c/linux/fs/ubifs/dir.c <<ubifs_mkdir>>
> >    7    803  /c/linux/fs/ubifs/dir.c <<ubifs_mknod>>
> >    8    871  /c/linux/fs/ubifs/dir.c <<ubifs_symlink>>
> >    9   1006  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
> >   10   1009  /c/linux/fs/ubifs/dir.c <<ubifs_rename>>
> >   11    246  /c/linux/fs/ubifs/file.c <<write_begin_slow>>
> >   12    388  /c/linux/fs/ubifs/file.c <<allocate_budget>>
> >   13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable
> >   14   1217  /c/linux/fs/ubifs/file.c <<do_setattr>>
> >   15   1381  /c/linux/fs/ubifs/file.c <<update_mctime>>
> >   16   1486  /c/linux/fs/ubifs/file.c <<ubifs_vm_page_mkwrite>>
> >   17    110  /c/linux/fs/ubifs/ioctl.c <<setflags>>
> >   19    122  /c/linux/fs/ubifs/xattr.c <<create_xattr>>
> >   20    201  /c/linux/fs/ubifs/xattr.c <<change_xattr>>
> >   21    494  /c/linux/fs/ubifs/xattr.c <<remove_xattr>>
> > 
> > It seems they are all safe except for ubifs. ubifs may actually
> > deadlock from the above do_truncation() caller. However it should be
> 
> Sorry that do_truncation() is actually called from ubifs_setattr()
> which is not related to iput().
> 
> Are there other possibilities for iput() to call into the above list
> of ubifs functions, then start writeback work and wait for it which
> will deadlock the flusher? ubifs_unlink() and perhaps remove_xattr()?
> 
> > fixable because the ubifs call for writeback_inodes_sb_nr() sounds
> > very brute force writeback and wait and there may well be better way
> > out.
> > 
> > CCing ubifs developers for possible thoughts..
> > 
> > Thanks,
> > Fengguang
> > 
> > PS. I'll be on travel in the following week and won't have much time
> > for replying emails. Sorry about that.

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-03 13:25             ` Fengguang Wu
@ 2012-03-07  0:37               ` Andrew Morton
  -1 siblings, 0 replies; 116+ messages in thread
From: Andrew Morton @ 2012-03-07  0:37 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Sat, 3 Mar 2012 21:25:55 +0800
Fengguang Wu <fengguang.wu@intel.com> wrote:

> > > get_page() looks the perfect solution to verify if the struct inode
> > > pointer (w/o igrab) is still live and valid.
> > > 
> > > [...upon rethinking...] Oh but still we need to lock some page to pin
> > > the inode during the writeout. Then there is the dilemma: if the page
> > > is locked, we effectively keep it from being written out...
> > 
> > No, all you need to do is to structure the code so that after the page
> > gets unlocked, the kernel thread does not touch the address_space.  So
> > the processing within the kthread is along the lines of
> > 
> > writearound(locked_page)
> > {
> > 	write some pages preceding locked_page;	/* touches address_space */
> 
> It seems the above line will lead to ABBA deadlock.
> 
> At least btrfs will lock a number of pages in lock_delalloc_pages().

Well, this code locks multiple pages too.  I forget what I did about
that - probably trylock.  Dirty pages aren't locked for very long.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-07  0:37               ` Andrew Morton
@ 2012-03-07  5:40                 ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-07  5:40 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML

On Tue, Mar 06, 2012 at 04:37:42PM -0800, Andrew Morton wrote:
> On Sat, 3 Mar 2012 21:25:55 +0800
> Fengguang Wu <fengguang.wu@intel.com> wrote:
> 
> > > > get_page() looks the perfect solution to verify if the struct inode
> > > > pointer (w/o igrab) is still live and valid.
> > > > 
> > > > [...upon rethinking...] Oh but still we need to lock some page to pin
> > > > the inode during the writeout. Then there is the dilemma: if the page
> > > > is locked, we effectively keep it from being written out...
> > > 
> > > No, all you need to do is to structure the code so that after the page
> > > gets unlocked, the kernel thread does not touch the address_space.  So
> > > the processing within the kthread is along the lines of
> > > 
> > > writearound(locked_page)
> > > {
> > > 	write some pages preceding locked_page;	/* touches address_space */
> > 
> > It seems the above line will lead to ABBA deadlock.
> > 
> > At least btrfs will lock a number of pages in lock_delalloc_pages().
> 
> Well, this code locks multiple pages too.  I forget what I did about
> that - probably trylock.  Dirty pages aren't locked for very long.

Yeah, trylock will be OK. And if the filesystems all use trylocks
when writing out any extra pages they try to piggyback, the ABBA
deadlock can be avoided. That makes the get_page()+trylock_page()
scheme feasible.
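
For the record, here is roughly how I picture that scheme. Purely a
sketch: pageout_writearound() and the 16-page window are made up for
the example, and the actual writeout calls are elided.

/*
 * The pageout work pins one page with get_page() when it is queued (no
 * igrab/iput).  Every page touched afterwards -- the anchor and any
 * piggybacked neighbours -- is only ever trylocked.
 */
static void pageout_writearound(struct page *anchor)
{
        struct address_space *mapping;
        struct page *page;
        pgoff_t index;
        int i;

        if (!trylock_page(anchor))
                goto out;
        mapping = anchor->mapping;
        if (!mapping) {                 /* truncated while the work was queued */
                unlock_page(anchor);
                goto out;
        }
        index = anchor->index;

        /*
         * While the locked anchor pins the inode, write around it; the
         * extra pages are only trylocked, so no ABBA with filesystems
         * that lock page ranges themselves.
         */
        for (i = 1; i < 16; i++) {
                page = find_get_page(mapping, index + i);
                if (!page)
                        break;
                if (trylock_page(page)) {
                        if (page->mapping == mapping && PageDirty(page)) {
                                /* ... write out the extra page ... */
                        }
                        unlock_page(page);
                }
                put_page(page);
        }

        /* write out the anchor last; after unlock, never touch mapping */
        unlock_page(anchor);
out:
        put_page(anchor);               /* drop the reference taken at queue time */
}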

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-03 13:55                   ` Fengguang Wu
@ 2012-03-07 15:48                     ` Artem Bityutskiy
  -1 siblings, 0 replies; 116+ messages in thread
From: Artem Bityutskiy @ 2012-03-07 15:48 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Jan Kara, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter,
	Artem Bityutskiy

On Sat, 2012-03-03 at 21:55 +0800, Fengguang Wu wrote:
>   13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable

Sorry, but could you please explain once again how the deadlock may
happen?

> It seems they are all safe except for ubifs. ubifs may actually
> deadlock from the above do_truncation() caller. However it should be
> fixable because the ubifs call for writeback_inodes_sb_nr() sounds
> very brute force writeback and wait and there may well be better way
> out.

I do not think this is "fixable" - it is part of the UBIFS design to
force write-back when we are not sure we have enough space.

The problem is that we do not know how much space the dirty data in RAM
will take on the flash media (after it is actually written back) - e.g.,
because we compress all the data (UBIFS performs on-the-fly
compression). So we make pessimistic assumptions and allow dirtying more
and more data as long as we know for sure that there is enough flash
space on the media for the worst-case scenario (data are not
compressible). This is what the UBIFS budgeting subsystem does.

Once the budgeting sub-system sees that we are not going to have enough
flash space for the worst-case scenario, it starts forcing write-back to
push some dirty data out to the flash media, updates the budgeting
numbers, and gets a more realistic picture.

So basically, before you can change _anything_ on a UBIFS file-system,
you need to budget for the space. Even when you truncate - because
truncation is also about allocating more space for writing the updated
inode and updating the FS index. (Remember, all writes are out-of-place
in UBIFS because we work with raw flash, not a block device.)
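
In code, the budgeting flow is roughly the following. This is a much
simplified sketch, not the real fs/ubifs/budget.c: the retry limit, the
garbage-collection fallback and most error handling are omitted.

static int budget_space_sketch(struct ubifs_info *c)
{
        int err;

        for (;;) {
                spin_lock(&c->space_lock);
                /* assume the worst case: the dirty data is incompressible */
                err = do_budget_space(c);
                spin_unlock(&c->space_lock);
                if (err != -ENOSPC)
                        return err;

                /* worst case no longer fits: force write-back, then retry */
                shrink_liability(c, NR_TO_WRITE);
        }
}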

-- 
Best Regards,
Artem Bityutskiy


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-07 15:48                     ` Artem Bityutskiy
@ 2012-03-09  7:31                       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-09  7:31 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Andrew Morton, Jan Kara, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

Artem,

On Wed, Mar 07, 2012 at 05:48:21PM +0200, Artem Bityutskiy wrote:
> On Sat, 2012-03-03 at 21:55 +0800, Fengguang Wu wrote:
> >   13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable
> 
> Sorry, but could you please explain once again how the deadlock may
> happen?

Sorry, I confused ubifs' do_truncation() with the truncate_inode_pages()
that may be called from iput().

The once-suspected deadlock scenario is the flusher thread calling
the final iput:

        flusher thread
          iput_final
            <some ubifs function>
              ubifs_budget_space
                shrink_liability
                  writeback_inodes_sb
                    writeback_inodes_sb_nr
                      bdi_queue_work
                      wait_for_completion  => end up waiting for the flusher itself

However, I cannot find any ubifs function that would form the above
loop, so ubifs should be safe for now.
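
That said, a queue+wait path could guard against being entered from the
flusher itself with something as simple as the sketch below (illustrative
only; the bdi->wb.task comparison is an assumed way of detecting "I am
the flusher for this bdi"):

static bool safe_to_wait_for_flusher(struct super_block *sb)
{
        struct backing_dev_info *bdi = sb->s_bdi;

        /*
         * If we are the flusher for this bdi, the queued work could only
         * be executed by ourselves, so waiting for it would deadlock.
         */
        return bdi && current != bdi->wb.task;
}

shrink_liability() and friends would then skip the synchronous
writeback_inodes_sb() call whenever this returns false.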

> > It seems they are all safe except for ubifs. ubifs may actually
> > deadlock from the above do_truncation() caller. However it should be
> > fixable because the ubifs call for writeback_inodes_sb_nr() sounds
> > very brute force writeback and wait and there may well be better way
> > out.
> 
> I do not think this "fixable" - this is part of UBIFS design to force
> write-back when we are not sure we have enough space.
> 
> The problem is that we do not know how much space the dirty data in RAM
> will take on the flash media (after it is actually written-back) - e.g.,
> because we compress all the data (UBIFS performs on-the-flight
> compression). So we do pessimistic assumptions and allow dirtying more
> and more data as long as we know for sure that there is enough flash
> space on the media for the worst-case scenario (data are not
> compressible). This is what the UBIFS budgeting subsystem does.
> 
> Once the budgeting sub-system sees that we are not going to have enough
> flash space for the worst-case scenario, it starts forcing write-back to
> push some dirty data out to the flash media and update the budgeting
> numbers, and get more realistic picture.
> 
> So basically, before you can change _anything_ on UBIFS file-system, you
> need to budget for the space. Even when you truncate - because
> truncation is also about allocating more space for writing the updated
> inode and update the FS index. (Remember, all writes are out-of-place in
> UBIFS because we work with raw flash, not a block device).

Thanks for the detailed explanations!

Judging from the git log, ubifs started out flushing NR_TO_WRITE=16
pages at a time (commit 2acf80675800d, "UBIFS: simplify
make_free_space") and was later changed to flushing *the whole*
superblock by a writeback change ("writeback: get rid of
generic_sync_sb_inodes() export"). That can greatly increase the wait
time. I'd suggest limiting the write chunk to about 125ms worth of
writeback, as in the change below (avg_write_bandwidth is tracked in
pages per second, so avg_write_bandwidth / 8 is roughly the number of
pages writable in 1/8 of a second):

--- linux.orig/fs/ubifs/budget.c	2012-03-08 23:16:01.661194026 -0800
+++ linux/fs/ubifs/budget.c	2012-03-08 23:16:02.477194003 -0800
@@ -63,7 +63,9 @@
 static void shrink_liability(struct ubifs_info *c, int nr_to_write)
 {
 	down_read(&c->vfs_sb->s_umount);
-	writeback_inodes_sb(c->vfs_sb, WB_REASON_FS_FREE_SPACE);
+	writeback_inodes_sb_nr(c->vfs_sb,
+			       c->bdi.avg_write_bandwidth / 8 + nr_to_write,
+			       WB_REASON_FS_FREE_SPACE);
 	up_read(&c->vfs_sb->s_umount);
 }
 
Here nr_to_write=16 merely serves as a minimal safeguard in case
bdi.avg_write_bandwidth drops to 0. Perhaps we could eliminate the
parameter and use the constant directly.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09  7:31                       ` Fengguang Wu
@ 2012-03-09  9:51                         ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-09  9:51 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Artem Bityutskiy, Andrew Morton, Jan Kara, Greg Thelen, Ying Han,
	hannes, KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

  Hello,

On Thu 08-03-12 23:31:13, Wu Fengguang wrote:
> On Wed, Mar 07, 2012 at 05:48:21PM +0200, Artem Bityutskiy wrote:
> > On Sat, 2012-03-03 at 21:55 +0800, Fengguang Wu wrote:
> > >   13   1125  /c/linux/fs/ubifs/file.c <<do_truncation>>   <===== deadlockable
> > 
> > Sorry, but could you please explain once again how the deadlock may
> > happen?
> 
> Sorry I confused ubifs do_truncation() with the truncate_inode_pages()
> that may be called from iput().
> 
> The once suspected deadlock scheme is when the flusher thread calls
> the final iput:
> 
>         flusher thread
>           iput_final
>             <some ubifs function>
>               ubifs_budget_space
>                 shrink_liability
>                   writeback_inodes_sb
>                     writeback_inodes_sb_nr
>                       bdi_queue_work
>                       wait_for_completion  => end up waiting for the flusher itself
> 
> However I cannot find any ubifs functions to form the above loop, so
> ubifs should be safe for now.
  Yeah, me neither but I also failed to find a place where
ubifs_evict_inode() truncates inode space when deleting the inode... Artem?

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-03 13:55                   ` Fengguang Wu
@ 2012-03-09 10:15                     ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-09 10:15 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Jan Kara, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter,
	Artem Bityutskiy

On Sat 03-03-12 21:55:58, Wu Fengguang wrote:
> On Fri, Mar 02, 2012 at 11:57:00AM -0800, Andrew Morton wrote:
> > On Fri, 2 Mar 2012 18:39:51 +0800
> > Fengguang Wu <fengguang.wu@intel.com> wrote:
> > 
> > > > And I agree it's unlikely but given enough time and people, I
> > > > believe someone finds a way to (inadvertedly) trigger this.
> > > 
> > > Right. The pageout works could add lots more iput() to the flusher
> > > and turn some hidden statistical impossible bugs into real ones.
> > > 
> > > Fortunately the "flusher deadlocks itself" case is easy to detect and
> > > prevent as illustrated in another email.
> > 
> > It would be a heck of a lot safer and saner to avoid the iput().  We
> > know how to do this, so why not do it?
> 
> My concern about the page lock is, it costs more code and sounds like
> hacking around something. It seems we (including me) have been trying
> to shun away from the iput() problem. Since it's unlikely we are to
> get rid of the already existing iput() calls from the flusher context,
> why not face the problem, sort it out and use it with confident in new
> code?
  We can get rid of it in the current code - see my patch set. And we also
don't have to introduce a new iput() with your patch set... I don't think
using ->writepage() directly on a locked page would be a good thing,
because filesystems tend to either ignore it completely (e.g. ext4 if it
needs to do an allocation, or btrfs) or handle it much less efficiently
than ->writepages(). So I'd prefer going through writeback_single_inode()
like the rest of the flusher thread does.

> Let me try it now. The only scheme iput() can deadlock the flusher is
> for the iput() path to come back to queue some work and wait for it.
  Let me stop you right here. You severely underestimate the complexity of
filesystems :). Take ext4 for example. To do a truncate you need to start
a transaction; to start a transaction, you have to have space in the
journal. To have space in the journal, you may have to wait for other
processes to finish writing. If one of those processes needs to wait for
the flusher thread before it can finish writing, you have a deadlock. And
there are other implicit dependencies like this, and it's similar for
other filesystems as well. So you really want to keep the flusher thread
as light on dependencies as possible.
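
To make the ext4 example concrete, the kind of chain I mean looks
roughly like this (an illustration, not a literal trace):

        flusher thread
          iput_final
            ext4_evict_inode
              ext4_journal_start          # needs space in the journal
                start_this_handle
                  waits for the running transaction to commit
                    the commit waits for tasks holding journal handles
                      one of which is waiting for the flusher  => deadlock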

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09  9:51                         ` Jan Kara
@ 2012-03-09 10:24                           ` Artem Bityutskiy
  -1 siblings, 0 replies; 116+ messages in thread
From: Artem Bityutskiy @ 2012-03-09 10:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Fri, 2012-03-09 at 10:51 +0100, Jan Kara wrote:
> > However I cannot find any ubifs functions to form the above loop, so
> > ubifs should be safe for now.
>   Yeah, me neither but I also failed to find a place where
> ubifs_evict_inode() truncates inode space when deleting the inode... Artem?

The evict_inode() stuff was introduced by Al relatively recently and I
have not even looked at what it does, so I do not know what to answer.
I'll look at this and answer.

-- 
Best Regards,
Artem Bityutskiy


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09 10:15                     ` Jan Kara
@ 2012-03-09 15:10                       ` Fengguang Wu
  -1 siblings, 0 replies; 116+ messages in thread
From: Fengguang Wu @ 2012-03-09 15:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Greg Thelen, Ying Han, hannes, KAMEZAWA Hiroyuki,
	Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter,
	Artem Bityutskiy

On Fri, Mar 09, 2012 at 11:15:46AM +0100, Jan Kara wrote:
> On Sat 03-03-12 21:55:58, Wu Fengguang wrote:
> > On Fri, Mar 02, 2012 at 11:57:00AM -0800, Andrew Morton wrote:
> > > On Fri, 2 Mar 2012 18:39:51 +0800
> > > Fengguang Wu <fengguang.wu@intel.com> wrote:
> > > 
> > > > > And I agree it's unlikely but given enough time and people, I
> > > > > believe someone will find a way to (inadvertently) trigger this.
> > > > 
> > > > Right. The pageout works could add lots more iput() to the flusher
> > > > and turn some hidden statistically impossible bugs into real ones.
> > > > 
> > > > Fortunately the "flusher deadlocks itself" case is easy to detect and
> > > > prevent as illustrated in another email.
> > > 
> > > It would be a heck of a lot safer and saner to avoid the iput().  We
> > > know how to do this, so why not do it?
> > 
> > My concern about the page lock is that it costs more code and sounds
> > like hacking around something. It seems we (including me) have been
> > trying to shy away from the iput() problem. Since it's unlikely we will
> > get rid of the already existing iput() calls from the flusher context,
> > why not face the problem, sort it out and use it with confidence in new
> > code?
>   We can get rid of it in the current code - see my patch set. And also we
> don't have to introduce a new iput() with your patch set... I don't think
> using ->writepage() directly on a locked page would be a good thing,
> because filesystems tend to ignore it completely (e.g. ext4 if it needs to
> do an allocation, or btrfs) or are much less efficient than when
> ->writepages() is used.  So I'd prefer going through
> writeback_single_inode() as the rest of the flusher thread does.

Totally agreed. I was also not comfortable using ->writepage() on the
locked page. It looks very nice to pin the inode with I_SYNC rather
than with igrab() or lock_page().
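
To spell out why the state-bit pin is so much lighter, here is a toy
standalone C contrast (not VFS code; the struct and helpers below are
invented for illustration): dropping the last reference may run eviction
work, while clearing a busy bit only means waking up waiters.

#include <stdio.h>

struct toy_inode {
        int refcount;
        int busy;                     /* plays the role of I_SYNC */
};

static void evict(struct toy_inode *inode)
{
        /* in a real filesystem this may start transactions, wait for
         * journal space, etc. -- exactly the work we do not want to
         * run from the flusher */
        printf("evicting %p: heavyweight filesystem work runs here\n",
               (void *)inode);
}

static void toy_iput(struct toy_inode *inode)
{
        if (--inode->refcount == 0)
                evict(inode);         /* the dangerous path */
}

static void unpin_busy(struct toy_inode *inode)
{
        inode->busy = 0;              /* nothing to tear down */
        printf("busy bit cleared, waiters can be woken\n");
}

int main(void)
{
        struct toy_inode a = { .refcount = 1, .busy = 0 };
        struct toy_inode b = { .refcount = 2, .busy = 1 };

        toy_iput(&a);                 /* last reference: eviction runs */
        unpin_busy(&b);               /* flusher-style pin release     */
        return 0;
}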

> > Let me try it now. The only way iput() can deadlock the flusher is for
> > the iput() path to come back to queue some work and wait for it.
>   Let me stop you right here. You severely underestimate the complexity of
> filesystems :). Take ext4 for example. To do a truncate you need to start
> a transaction; to start a transaction, you have to have space in the
> journal; to have space in the journal, you may have to wait for some other
> process to finish writing. If that process needs to wait for the flusher
> thread to be able to finish writing, you have a deadlock. There are other
> implicit dependencies like this, and it's similar for other filesystems as
> well. So you really want to keep the flusher thread as light on
> dependencies as possible.

Ah OK, please forgive my ignorance. Let's get rid of the existing
iput()s in the flusher thread.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09  9:51                         ` Jan Kara
@ 2012-03-09 16:10                           ` Artem Bityutskiy
  -1 siblings, 0 replies; 116+ messages in thread
From: Artem Bityutskiy @ 2012-03-09 16:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Fri, 2012-03-09 at 10:51 +0100, Jan Kara wrote:
> > However I cannot find any ubifs functions to form the above loop, so
> > ubifs should be safe for now.
>   Yeah, me neither but I also failed to find a place where
> ubifs_evict_inode() truncates inode space when deleting the inode... Artem?

We do call 'truncate_inode_pages()':

static void ubifs_evict_inode(struct inode *inode)
{
	...

        truncate_inode_pages(&inode->i_data, 0);

        ...
}

-- 
Best Regards,
Artem Bityutskiy


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09 16:10                           ` Artem Bityutskiy
@ 2012-03-09 21:11                             ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-09 21:11 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jan Kara, Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han,
	hannes, KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Fri 09-03-12 18:10:51, Artem Bityutskiy wrote:
> On Fri, 2012-03-09 at 10:51 +0100, Jan Kara wrote:
> > > However I cannot find any ubifs functions to form the above loop, so
> > > ubifs should be safe for now.
> >   Yeah, me neither but I also failed to find a place where
> > ubifs_evict_inode() truncates inode space when deleting the inode... Artem?
> 
> We do call 'truncate_inode_pages()':
> 
> static void ubifs_evict_inode(struct inode *inode)
> {
> 	...
> 
>         truncate_inode_pages(&inode->i_data, 0);
> 
>         ...
> }
  Well, but that just removes pages from the page cache. You should also
free the allocated blocks somewhere and free the inode... And I'm sure you
do, otherwise you would pretty quickly notice that file deletion does not
work :) I just could not find which function does it.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-09 21:11                             ` Jan Kara
@ 2012-03-12 12:36                               ` Artem Bityutskiy
  -1 siblings, 0 replies; 116+ messages in thread
From: Artem Bityutskiy @ 2012-03-12 12:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Fri, 2012-03-09 at 22:11 +0100, Jan Kara wrote:
> On Fri 09-03-12 18:10:51, Artem Bityutskiy wrote:
> > On Fri, 2012-03-09 at 10:51 +0100, Jan Kara wrote:
> > > > However I cannot find any ubifs functions to form the above loop, so
> > > > ubifs should be safe for now.
> > >   Yeah, me neither but I also failed to find a place where
> > > ubifs_evict_inode() truncates inode space when deleting the inode... Artem?
> > 
> > We do call 'truncate_inode_pages()':
> > 
> > static void ubifs_evict_inode(struct inode *inode)
> > {
> > 	...
> > 
> >         truncate_inode_pages(&inode->i_data, 0);
> > 
> >         ...
> > }
>   Well, but that just removes pages from the page cache. You should also
> free the allocated blocks somewhere and free the inode... And I'm sure you
> do, otherwise you would pretty quickly notice that file deletion does not
> work :) I just could not find which function does it.

ubifs_evict_inode() -> ubifs_jnl_delete_inode() ->
ubifs_tnc_remove_ino()

Basically, deletion in UBIFS is about writing a so-called "deletion
inode" to the journal and then removing all the data nodes of the
truncated inode from the TNC (the in-memory cache of the FS index, which
is just a huge B-tree, like in reiser4, which inspired me a long time
ago, and like in btrfs).

The second part of the overall deletion job will be when we commit - the
updated version of the FS index will be written to the flash media.

If we get a power cut before the commit, the journal replay will see the
deletion inode and will clean up the index. The deletion inode is never
erased before the commit.

Basically, this design is dictated by the fact that we do not have a
cheap way of doing in-place updates.

This is a short version of the story. Here are some docs as well:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_documentation
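
For a rough feel of how this out-of-place deletion works, here is a tiny
standalone C model (this is not UBIFS code; every structure and name below
is invented for illustration): a deletion appends a record to the journal
and updates the in-memory index immediately, the on-flash index is only
rewritten at commit, and replay re-applies uncommitted deletions after a
power cut.

#include <stdio.h>
#include <string.h>

#define MAX_INO 16

static int on_flash_index[MAX_INO];  /* 1 = inode exists in committed index */
static int in_memory_index[MAX_INO]; /* TNC-like runtime view               */
static int journal[MAX_INO];         /* inode numbers with deletion records */
static int journal_len;              /* journal survives a power cut        */

static void delete_inode(int ino)
{
        journal[journal_len++] = ino; /* cheap out-of-place journal write */
        in_memory_index[ino] = 0;     /* drop it from the in-memory index */
}

static void commit(void)
{
        /* write out the updated index; deletion records become redundant */
        memcpy(on_flash_index, in_memory_index, sizeof(on_flash_index));
        journal_len = 0;
}

static void replay_after_power_cut(void)
{
        /* rebuild the runtime view from the committed index + journal */
        memcpy(in_memory_index, on_flash_index, sizeof(in_memory_index));
        for (int i = 0; i < journal_len; i++)
                in_memory_index[journal[i]] = 0;
}

int main(void)
{
        for (int i = 0; i < MAX_INO; i++)
                on_flash_index[i] = in_memory_index[i] = 1;

        delete_inode(3);              /* journalled, not yet committed */
        replay_after_power_cut();     /* power cut before the commit   */
        printf("inode 3 after replay: %s\n",
               in_memory_index[3] ? "present" : "deleted");

        delete_inode(5);
        commit();                     /* index now reflects the delete */
        printf("inode 5 after commit: %s\n",
               on_flash_index[5] ? "present" : "deleted");
        return 0;
}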

-- 
Best Regards,
Artem Bityutskiy


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-12 12:36                               ` Artem Bityutskiy
@ 2012-03-12 14:02                                 ` Jan Kara
  -1 siblings, 0 replies; 116+ messages in thread
From: Jan Kara @ 2012-03-12 14:02 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jan Kara, Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han,
	hannes, KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Mon 12-03-12 14:36:14, Artem Bityutskiy wrote:
> On Fri, 2012-03-09 at 22:11 +0100, Jan Kara wrote:
> > On Fri 09-03-12 18:10:51, Artem Bityutskiy wrote:
> > > On Fri, 2012-03-09 at 10:51 +0100, Jan Kara wrote:
> > > > > However I cannot find any ubifs functions to form the above loop, so
> > > > > ubifs should be safe for now.
> > > >   Yeah, me neither but I also failed to find a place where
> > > > ubifs_evict_inode() truncates inode space when deleting the inode... Artem?
> > > 
> > > We do call 'truncate_inode_pages()':
> > > 
> > > static void ubifs_evict_inode(struct inode *inode)
> > > {
> > > 	...
> > > 
> > >         truncate_inode_pages(&inode->i_data, 0);
> > > 
> > >         ...
> > > }
> >   Well, but that just removes pages from the page cache. You should also
> > free the allocated blocks somewhere and free the inode... And I'm sure you
> > do, otherwise you would pretty quickly notice that file deletion does not
> > work :) I just could not find which function does it.
> 
> ubifs_evict_inode() -> ubifs_jnl_delete_inode() ->
> ubifs_tnc_remove_ino()
> 
> Basically, deletion in UBIFS is about writing a so-called "deletion
> inode" to the journal and then removing all the data nodes of the
> truncated inode from the TNC (the in-memory cache of the FS index, which
> is just a huge B-tree, like in reiser4, which inspired me a long time
> ago, and like in btrfs).
> 
> The second part of the overall deletion job will be when we commit - the
> updated version of the FS index will be written to the flash media.
  Oh, I see. This is what I was missing. And I presume you always make sure
to have enough space for new FS index so it cannot deadlock when trying to
push out dirty pages.

> If we get a power cut before the commit, the journal replay will see the
> deletion inode and will clean up the index. The deletion inode is never
> erased before the commit.
> 
> Basically, this design is dictated by the fact that we do not have a
> cheap way of doing in-place updates.
> 
> This is a short version of the story. Here are some docs as well:
> http://www.linux-mtd.infradead.org/doc/ubifs.html#L_documentation
  I see. Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [PATCH 5/9] writeback: introduce the pageout work
  2012-03-12 14:02                                 ` Jan Kara
@ 2012-03-12 14:21                                   ` Artem Bityutskiy
  -1 siblings, 0 replies; 116+ messages in thread
From: Artem Bityutskiy @ 2012-03-12 14:21 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Andrew Morton, Greg Thelen, Ying Han, hannes,
	KAMEZAWA Hiroyuki, Rik van Riel, Mel Gorman, Minchan Kim,
	Linux Memory Management List, LKML, Adrian Hunter

On Mon, 2012-03-12 at 15:02 +0100, Jan Kara wrote:
> > The second part of the overall deletion job will be when we commit - the
> > updated version of the FS index will be written to the flash media.
>   Oh, I see. This is what I was missing. And I presume you always make sure
> to have enough space for new FS index so it cannot deadlock when trying to
> push out dirty pages.

Yes, this is one of the hardest parts and this is what the budgeting
subsystem does. Every VFS call (even unlink()) first invokes something
like 'ubifs_budget_space()' with arguments describing the space needs,
and the budgeting subsystem will account for the space, including the
possibility of index growth. The budgeting subsystem actually forces
write-back when it sees that there is not enough free space for the
operation. Because all the calculations are pessimistic, write-back
helps: the data nodes are compressed, and so on. The budgeting subsystem
may also force a commit, which resolves many uncertainties and makes the
calculations more precise. If nothing helps, -ENOSPC is reported. For
deletions we also have a bit of reserve space to prevent -ENOSPC when
you actually want to delete a file on a full file-system.

But the shorter answer: yes, we reserve twice the current index size
worth of space for index growth.

Long time ago I tried to describe this and the surrounding issues here:
http://www.linux-mtd.infradead.org/doc/ubifs.html#L_spaceacc
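
A very rough sketch of the budgeting idea in standalone C (this is not the
real ubifs_budget_space(); the numbers, formula and helpers below are made
up for illustration): reserve a pessimistic amount up front, try write-back
and then a commit when space is short, and give up with -ENOSPC only if
that still is not enough.

#include <errno.h>
#include <stdio.h>

static long free_space = 4096;       /* bytes left on the medium    */
static long index_size = 512;        /* current on-flash index size */

/* stand-ins for the real mechanisms; both reclaim a bit of space */
static long force_writeback(void) { return 256; /* compressed data nodes */ }
static long force_commit(void)    { return 128; /* tighter accounting    */ }

static int budget_space(long data_bytes)
{
        /* pessimistic estimate: the data plus headroom for index growth */
        long need = data_bytes + 2 * index_size;

        if (free_space >= need)
                goto ok;
        free_space += force_writeback();
        if (free_space >= need)
                goto ok;
        free_space += force_commit();
        if (free_space < need)
                return -ENOSPC;
ok:
        free_space -= need;          /* reserve; released when the op ends */
        return 0;
}

int main(void)
{
        int err = budget_space(1024);
        printf("budget 1024 bytes: %s (free now %ld)\n",
               err ? "-ENOSPC" : "ok", free_space);

        err = budget_space(8192);
        printf("budget 8192 bytes: %s (free now %ld)\n",
               err ? "-ENOSPC" : "ok", free_space);
        return 0;
}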

-- 
Best Regards,
Artem Bityutskiy


^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2012-03-12 14:19 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-02-28 14:00 [PATCH 0/9] [RFC] pageout work and dirty reclaim throttling Fengguang Wu
2012-02-28 14:00 ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 1/9] memcg: add page_cgroup flags for dirty page tracking Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-29  0:50   ` KAMEZAWA Hiroyuki
2012-02-29  0:50     ` KAMEZAWA Hiroyuki
2012-03-04  1:29     ` Fengguang Wu
2012-03-04  1:29       ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 2/9] memcg: add dirty page accounting infrastructure Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-28 22:37   ` Andrew Morton
2012-02-28 22:37     ` Andrew Morton
2012-02-29  0:27     ` Fengguang Wu
2012-02-29  0:27       ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 3/9] memcg: add kernel calls for memcg dirty page stats Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-29  1:10   ` KAMEZAWA Hiroyuki
2012-02-29  1:10     ` KAMEZAWA Hiroyuki
2012-02-28 14:00 ` [PATCH 4/9] memcg: dirty page accounting support routines Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-28 15:15   ` Fengguang Wu
2012-02-28 15:15     ` Fengguang Wu
2012-02-28 22:45   ` Andrew Morton
2012-02-28 22:45     ` Andrew Morton
2012-02-29  1:15     ` KAMEZAWA Hiroyuki
2012-02-29  1:15       ` KAMEZAWA Hiroyuki
2012-02-28 14:00 ` [PATCH 5/9] writeback: introduce the pageout work Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-29  0:04   ` Andrew Morton
2012-02-29  0:04     ` Andrew Morton
2012-02-29  2:31     ` Fengguang Wu
2012-02-29  2:31       ` Fengguang Wu
2012-02-29 13:28     ` Fengguang Wu
2012-02-29 13:28       ` Fengguang Wu
2012-03-01 11:04     ` Jan Kara
2012-03-01 11:04       ` Jan Kara
2012-03-01 11:41       ` Fengguang Wu
2012-03-01 11:41         ` Fengguang Wu
2012-03-01 16:50         ` Jan Kara
2012-03-01 16:50           ` Jan Kara
2012-03-01 19:46         ` Andrew Morton
2012-03-01 19:46           ` Andrew Morton
2012-03-03 13:25           ` Fengguang Wu
2012-03-03 13:25             ` Fengguang Wu
2012-03-07  0:37             ` Andrew Morton
2012-03-07  0:37               ` Andrew Morton
2012-03-07  5:40               ` Fengguang Wu
2012-03-07  5:40                 ` Fengguang Wu
2012-03-01 19:42       ` Andrew Morton
2012-03-01 19:42         ` Andrew Morton
2012-03-01 21:15         ` Jan Kara
2012-03-01 21:15           ` Jan Kara
2012-03-01 21:22           ` Andrew Morton
2012-03-01 21:22             ` Andrew Morton
2012-03-01 12:36     ` Fengguang Wu
2012-03-01 12:36       ` Fengguang Wu
2012-03-01 16:38       ` Jan Kara
2012-03-01 16:38         ` Jan Kara
2012-03-02  4:48         ` Fengguang Wu
2012-03-02  4:48           ` Fengguang Wu
2012-03-02  9:59           ` Jan Kara
2012-03-02  9:59             ` Jan Kara
2012-03-02 10:39             ` Fengguang Wu
2012-03-02 10:39               ` Fengguang Wu
2012-03-02 19:57               ` Andrew Morton
2012-03-02 19:57                 ` Andrew Morton
2012-03-03 13:55                 ` Fengguang Wu
2012-03-03 13:55                   ` Fengguang Wu
2012-03-03 14:27                   ` Fengguang Wu
2012-03-03 14:27                     ` Fengguang Wu
2012-03-04 11:13                     ` Fengguang Wu
2012-03-04 11:13                       ` Fengguang Wu
2012-03-07 15:48                   ` Artem Bityutskiy
2012-03-07 15:48                     ` Artem Bityutskiy
2012-03-09  7:31                     ` Fengguang Wu
2012-03-09  7:31                       ` Fengguang Wu
2012-03-09  9:51                       ` Jan Kara
2012-03-09  9:51                         ` Jan Kara
2012-03-09 10:24                         ` Artem Bityutskiy
2012-03-09 10:24                           ` Artem Bityutskiy
2012-03-09 16:10                         ` Artem Bityutskiy
2012-03-09 16:10                           ` Artem Bityutskiy
2012-03-09 21:11                           ` Jan Kara
2012-03-09 21:11                             ` Jan Kara
2012-03-12 12:36                             ` Artem Bityutskiy
2012-03-12 12:36                               ` Artem Bityutskiy
2012-03-12 14:02                               ` Jan Kara
2012-03-12 14:02                                 ` Jan Kara
2012-03-12 14:21                                 ` Artem Bityutskiy
2012-03-12 14:21                                   ` Artem Bityutskiy
2012-03-09 10:15                   ` Jan Kara
2012-03-09 10:15                     ` Jan Kara
2012-03-09 15:10                     ` Fengguang Wu
2012-03-09 15:10                       ` Fengguang Wu
2012-02-29 13:51   ` [PATCH v2 " Fengguang Wu
2012-02-29 13:51     ` Fengguang Wu
2012-03-01 13:35     ` Fengguang Wu
2012-03-01 13:35       ` Fengguang Wu
2012-03-02  6:22       ` [PATCH v3 " Fengguang Wu
2012-03-02  6:22         ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 6/9] vmscan: dirty reclaim throttling Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 7/9] mm: pass __GFP_WRITE to memcg charge and reclaim routines Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 8/9] mm: dont set __GFP_WRITE on ramfs/sysfs writes Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-03-01 10:13   ` Johannes Weiner
2012-03-01 10:13     ` Johannes Weiner
2012-03-01 10:30     ` Fengguang Wu
2012-03-01 10:30       ` Fengguang Wu
2012-02-28 14:00 ` [PATCH 9/9] mm: debug vmscan waits Fengguang Wu
2012-02-28 14:00   ` Fengguang Wu
2012-03-02  6:59   ` [RFC PATCH] mm: don't treat anonymous pages as dirtyable pages Fengguang Wu
2012-03-02  6:59     ` Fengguang Wu
2012-03-02  7:18     ` Fengguang Wu
2012-03-02  7:18       ` Fengguang Wu
