* [PATCH 0/8] Per-bdi writeback flusher threads v19
@ 2009-09-08  9:23 Jens Axboe
  2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
                   ` (7 more replies)
  0 siblings, 8 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

Hi,

This is the 19th release of the writeback patchset. Changes since
v18 include:

- Switch unpin_sb_for_writeback() to using put_super() instead of
  __put_super_and_need_start(). This means making put_super()
  non-static, but we don't have to export it.
- Always check and clean old data.
- Don't copy *wbc twice in wb_do_writeback().
- Tweak 'when to break' logic in wb_writeback().
- Get rid of wb_start_writeback() and bdi_sched_work(), fold them into
  bdi_queue_work().

Thanks to Jan Kara and Christoph Hellwig for their suggestions and
review!

 b/block/blk-core.c                 |    1 
 b/drivers/block/aoe/aoeblk.c       |    1 
 b/drivers/char/mem.c               |    1 
 b/drivers/staging/pohmelfs/inode.c |    9 
 b/fs/btrfs/disk-io.c               |    1 
 b/fs/buffer.c                      |    2 
 b/fs/char_dev.c                    |    1 
 b/fs/configfs/inode.c              |    1 
 b/fs/fs-writeback.c                | 1058 +++++++++++++++++++++--------
 b/fs/fuse/inode.c                  |    1 
 b/fs/hugetlbfs/inode.c             |    1 
 b/fs/nfs/client.c                  |    1 
 b/fs/ocfs2/dlm/dlmfs.c             |    1 
 b/fs/ramfs/inode.c                 |    1 
 b/fs/super.c                       |    5 
 b/fs/sync.c                        |   20 
 b/fs/sysfs/inode.c                 |    1 
 b/fs/ubifs/budget.c                |   16 
 b/fs/ubifs/super.c                 |    9 
 b/include/linux/backing-dev.h      |   55 +
 b/include/linux/fs.h               |    9 
 b/include/linux/writeback.h        |   24 
 b/kernel/cgroup.c                  |    1 
 b/kernel/sysctl.c                  |    8 
 b/mm/Makefile                      |    2 
 b/mm/backing-dev.c                 |  380 ++++++++++
 b/mm/page-writeback.c              |  188 -----
 b/mm/swap_state.c                  |    1 
 b/mm/vmscan.c                      |    2 
 mm/pdflush.c                       |  269 -------
 30 files changed, 1292 insertions(+), 778 deletions(-)

-- 
Jens Axboe



* [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08 10:27     ` Artem Bityutskiy
  2009-09-08  9:23 ` [PATCH 2/8] writeback: move dirty inodes from super_block to backing_dev_info Jens Axboe
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

This adds two new exported functions:

- writeback_inodes_sb(), which only attempts to write back dirty inodes on
  this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
  and also waits for the IO to complete (see the usage sketch after this list).
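
A minimal usage sketch, assuming a filesystem that wants non-blocking
writeout in one path and a data integrity sync in another. The example_*
wrappers below are hypothetical and not part of this patch:

	#include <linux/fs.h>
	#include <linux/writeback.h>

	/* opportunistic writeout: submit IO for our dirty inodes, don't wait */
	static void example_kick_writeback(struct super_block *sb)
	{
		long nr = writeback_inodes_sb(sb);	/* WB_SYNC_NONE */

		pr_debug("submitted writeback for %ld pages\n", nr);
	}

	/* data integrity: write out all dirty inodes and wait for the IO */
	static void example_sync_everything(struct super_block *sb)
	{
		long nr = sync_inodes_sb(sb);		/* WB_SYNC_ALL */

		pr_debug("synced %ld pages\n", nr);
	}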

Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 drivers/staging/pohmelfs/inode.c |    9 +----
 fs/fs-writeback.c                |   70 ++++++++++++++++++++++---------------
 fs/sync.c                        |   18 +++++----
 fs/ubifs/budget.c                |   16 +-------
 fs/ubifs/super.c                 |    8 +----
 include/linux/fs.h               |    2 -
 include/linux/writeback.h        |    3 +-
 7 files changed, 58 insertions(+), 68 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7b60579..e63c9be 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1950,14 +1950,7 @@ static int pohmelfs_get_sb(struct file_system_type *fs_type,
  */
 static void pohmelfs_kill_super(struct super_block *sb)
 {
-	struct writeback_control wbc = {
-		.sync_mode	= WB_SYNC_ALL,
-		.range_start	= 0,
-		.range_end	= LLONG_MAX,
-		.nr_to_write	= LONG_MAX,
-	};
-	generic_sync_sb_inodes(sb, &wbc);
-
+	sync_inodes_sb(sb);
 	kill_anon_super(sb);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c54226b..271e5f4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -458,8 +458,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
  * on the writer throttling path, and we get decent balancing between many
  * throttled threads: we don't want them all piling up on inode_sync_wait.
  */
-void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
+static void generic_sync_sb_inodes(struct super_block *sb,
+				   struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
 	int sync = wbc->sync_mode == WB_SYNC_ALL;
@@ -593,13 +593,6 @@ void generic_sync_sb_inodes(struct super_block *sb,
 
 	return;		/* Leave any unwritten inodes on s_io */
 }
-EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
-
-static void sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
-{
-	generic_sync_sb_inodes(sb, wbc);
-}
 
 /*
  * Start writeback of dirty pagecache data against all unlocked inodes.
@@ -640,7 +633,7 @@ restart:
 			 */
 			if (down_read_trylock(&sb->s_umount)) {
 				if (sb->s_root)
-					sync_sb_inodes(sb, wbc);
+					generic_sync_sb_inodes(sb, wbc);
 				up_read(&sb->s_umount);
 			}
 			spin_lock(&sb_lock);
@@ -653,35 +646,56 @@ restart:
 	spin_unlock(&sb_lock);
 }
 
-/*
- * writeback and wait upon the filesystem's dirty inodes.  The caller will
- * do this in two passes - one to write, and one to wait.
- *
- * A finite limit is set on the number of pages which will be written.
- * To prevent infinite livelock of sys_sync().
+/**
+ * writeback_inodes_sb	-	writeback dirty inodes from given super_block
+ * @sb: the superblock
  *
- * We add in the number of potentially dirty inodes, because each inode write
- * can dirty pagecache in the underlying blockdev.
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO. The number of pages submitted is
+ * returned.
  */
-void sync_inodes_sb(struct super_block *sb, int wait)
+long writeback_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
-		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
+		.sync_mode	= WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
+	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
+	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+	long nr_to_write;
 
-	if (!wait) {
-		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
-
-		wbc.nr_to_write = nr_dirty + nr_unstable +
+	nr_to_write = nr_dirty + nr_unstable +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	} else
-		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
 
-	sync_sb_inodes(sb, &wbc);
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
+}
+EXPORT_SYMBOL(writeback_inodes_sb);
+
+/**
+ * sync_inodes_sb	-	sync sb inode pages
+ * @sb: the superblock
+ *
+ * This function writes and waits on any dirty inode belonging to this
+ * super_block. The number of pages synced is returned.
+ */
+long sync_inodes_sb(struct super_block *sb)
+{
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_ALL,
+		.range_start	= 0,
+		.range_end	= LLONG_MAX,
+	};
+	long nr_to_write = LONG_MAX; /* doesn't actually matter */
+
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
 }
+EXPORT_SYMBOL(sync_inodes_sb);
 
 /**
  * write_inode_now	-	write an inode to disk
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..66f2104 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -19,20 +19,22 @@
 			SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
- * Do the filesystem syncing work. For simple filesystems sync_inodes_sb(sb, 0)
- * just dirties buffers with inodes so we have to submit IO for these buffers
- * via __sync_blockdev(). This also speeds up the wait == 1 case since in that
- * case write_inode() functions do sync_dirty_buffer() and thus effectively
- * write one block at a time.
+ * Do the filesystem syncing work. For simple filesystems
+ * writeback_inodes_sb(sb) just dirties buffers with inodes so we have to
+ * submit IO for these buffers via __sync_blockdev(). This also speeds up the
+ * wait == 1 case since in that case write_inode() functions do
+ * sync_dirty_buffer() and thus effectively write one block at a time.
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
 	/* Avoid doing twice syncing and cache pruning for quota sync */
-	if (!wait)
+	if (!wait) {
 		writeout_quota_sb(sb, -1);
-	else
+		writeback_inodes_sb(sb);
+	} else {
 		sync_quota_sb(sb, -1);
-	sync_inodes_sb(sb, wait);
+		sync_inodes_sb(sb);
+	}
 	if (sb->s_op->sync_fs)
 		sb->s_op->sync_fs(sb, wait);
 	return __sync_blockdev(sb->s_bdev, wait);
diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
index eaf6d89..1c8991b 100644
--- a/fs/ubifs/budget.c
+++ b/fs/ubifs/budget.c
@@ -65,26 +65,14 @@
 static int shrink_liability(struct ubifs_info *c, int nr_to_write)
 {
 	int nr_written;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_NONE,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = nr_to_write,
-	};
-
-	generic_sync_sb_inodes(c->vfs_sb, &wbc);
-	nr_written = nr_to_write - wbc.nr_to_write;
 
+	nr_written = writeback_inodes_sb(c->vfs_sb);
 	if (!nr_written) {
 		/*
 		 * Re-try again but wait on pages/inodes which are being
 		 * written-back concurrently (e.g., by pdflush).
 		 */
-		memset(&wbc, 0, sizeof(struct writeback_control));
-		wbc.sync_mode   = WB_SYNC_ALL;
-		wbc.range_end   = LLONG_MAX;
-		wbc.nr_to_write = nr_to_write;
-		generic_sync_sb_inodes(c->vfs_sb, &wbc);
-		nr_written = nr_to_write - wbc.nr_to_write;
+		nr_written = sync_inodes_sb(c->vfs_sb);
 	}
 
 	dbg_budg("%d pages were written back", nr_written);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 26d2e0d..8d6050a 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -438,12 +438,6 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 {
 	int i, err;
 	struct ubifs_info *c = sb->s_fs_info;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_ALL,
-		.range_start = 0,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = LONG_MAX,
-	};
 
 	/*
 	 * Zero @wait is just an advisory thing to help the file system shove
@@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 	 * the user be able to get more accurate results of 'statfs()' after
 	 * they synchronize the file system.
 	 */
-	generic_sync_sb_inodes(sb, &wbc);
+	sync_inodes_sb(sb);
 
 	/*
 	 * Synchronize write buffers, because 'ubifs_run_commit()' does not
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..07b0f66 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2070,8 +2070,6 @@ static inline void invalidate_remote_inode(struct inode *inode)
 extern int invalidate_inode_pages2(struct address_space *mapping);
 extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
-extern void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..0703929 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,7 +78,8 @@ struct writeback_control {
  */	
 void writeback_inodes(struct writeback_control *wbc);
 int inode_wait(void *);
-void sync_inodes_sb(struct super_block *, int wait);
+long writeback_inodes_sb(struct super_block *);
+long sync_inodes_sb(struct super_block *);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
-- 
1.6.4.1.207.g68ea



* [PATCH 2/8] writeback: move dirty inodes from super_block to backing_dev_info
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
  2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08  9:23 ` [PATCH 3/8] writeback: switch to per-bdi threads for flushing data Jens Axboe
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

This is a first step toward introducing per-bdi flusher threads. There
should be no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, since with dirty inodes tracked per bdi there is no
longer a cheap way to tell whether a given super_block has any. Not a huge
problem, since it'll be deleted in subsequent patches.
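
To illustrate the structural change (condensed from the diff below, not a
complete code path), dirty inodes used to be parked on per-super_block
lists and are now parked on per-bdi lists reached through the inode's
mapping:

	/* before: dirty inodes lived on the super_block */
	list_move(&inode->i_list, &sb->s_dirty);

	/* after: they live on the backing_dev_info behind the inode's mapping */
	#define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)

	list_move(&inode->i_list, &inode_to_bdi(inode)->b_dirty);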

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |  197 ++++++++++++++++++++++++++++---------------
 fs/super.c                  |    3 -
 include/linux/backing-dev.h |    9 ++
 include/linux/fs.h          |    5 +-
 mm/backing-dev.c            |   24 +++++
 mm/page-writeback.c         |   11 +--
 6 files changed, 165 insertions(+), 84 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 271e5f4..45ad4bb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -25,6 +25,7 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+#define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
 
 /**
  * writeback_acquire - attempt to get exclusive writeback access to a device
@@ -165,12 +166,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			goto out;
 
 		/*
-		 * If the inode was already on s_dirty/s_io/s_more_io, don't
-		 * reposition it (that would break s_dirty time-ordering).
+		 * If the inode was already on b_dirty/b_io/b_more_io, don't
+		 * reposition it (that would break b_dirty time-ordering).
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
+			list_move(&inode->i_list,
+					&inode_to_bdi(inode)->b_dirty);
 		}
 	}
 out:
@@ -191,31 +193,30 @@ static int write_inode(struct inode *inode, int sync)
  * furthest end of its superblock's dirty-inode list.
  *
  * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list.  If that is
+ * already the most-recently-dirtied inode on the b_dirty list.  If that is
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
 static void redirty_tail(struct inode *inode)
 {
-	struct super_block *sb = inode->i_sb;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
+	if (!list_empty(&bdi->b_dirty)) {
+		struct inode *tail;
 
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (time_before(inode->dirtied_when,
-				tail_inode->dirtied_when))
+		tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &sb->s_dirty);
+	list_move(&inode->i_list, &bdi->b_dirty);
 }
 
 /*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
+ * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
 static void requeue_io(struct inode *inode)
 {
-	list_move(&inode->i_list, &inode->i_sb->s_more_io);
+	list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -262,18 +263,50 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 /*
  * Queue all expired dirty inodes for io, eldest first.
  */
-static void queue_io(struct super_block *sb,
-				unsigned long *older_than_this)
+static void queue_io(struct backing_dev_info *bdi,
+		     unsigned long *older_than_this)
+{
+	list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
+	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+}
+
+static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
 {
-	list_splice_init(&sb->s_more_io, sb->s_io.prev);
-	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+	struct inode *inode;
+	int ret = 0;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry(inode, list, i_list) {
+		if (inode->i_sb == sb) {
+			ret = 1;
+			break;
+		}
+	}
+	spin_unlock(&inode_lock);
+	return ret;
 }
 
 int sb_has_dirty_inodes(struct super_block *sb)
 {
-	return !list_empty(&sb->s_dirty) ||
-	       !list_empty(&sb->s_io) ||
-	       !list_empty(&sb->s_more_io);
+	struct backing_dev_info *bdi;
+	int ret = 0;
+
+	/*
+	 * This is REALLY expensive right now, but it'll go away
+	 * when the bdi writeback is introduced
+	 */
+	mutex_lock(&bdi_lock);
+	list_for_each_entry(bdi, &bdi_list, bdi_list) {
+		if (sb_on_inode_list(sb, &bdi->b_dirty) ||
+		    sb_on_inode_list(sb, &bdi->b_io) ||
+		    sb_on_inode_list(sb, &bdi->b_more_io)) {
+			ret = 1;
+			break;
+		}
+	}
+	mutex_unlock(&bdi_lock);
+
+	return ret;
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);
 
@@ -322,11 +355,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if (inode->i_state & I_SYNC) {
 		/*
 		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to s_more_io so that
+		 * writeback-for-data-integrity, move it to b_more_io so that
 		 * writeback can proceed with the other inodes on s_io.
 		 *
 		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of s_io.
+		 * completed a full scan of b_io.
 		 */
 		if (!wait) {
 			requeue_io(inode);
@@ -371,11 +404,11 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from s_io onto s_more_io/s_dirty.
+			 * the inode; Move it from b_io onto b_more_io/b_dirty.
 			 */
 			/*
 			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of s_dirty so it gets first
+			 * this inode at the head of b_dirty so it gets first
 			 * consideration.  Otherwise, move it to the tail, for
 			 * the reasons described there.  I'm not really sure
 			 * how much sense this makes.  Presumably I had a good
@@ -385,7 +418,7 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 			if (wbc->for_kupdate) {
 				/*
 				 * For the kupdate function we move the inode
-				 * to s_more_io so it will get more writeout as
+				 * to b_more_io so it will get more writeout as
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
@@ -433,51 +466,34 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 
-/*
- * Write out a superblock's list of dirty inodes.  A wait will be performed
- * upon no inodes, all inodes or the final one, depending upon sync_mode.
- *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
- * If we're a pdflush thread, then implement pdflush collision avoidance
- * against the entire list.
- *
- * If `bdi' is non-zero then we're being asked to writeback a specific queue.
- * This function assumes that the blockdev superblock's inodes are backed by
- * a variety of queues, so all inodes are searched.  For other superblocks,
- * assume that all inodes are backed by the same queue.
- *
- * FIXME: this linear search could get expensive with many fileystems.  But
- * how to fix?  We need to go from an address_space to all inodes which share
- * a queue with that address_space.  (Easy: have a global "dirty superblocks"
- * list).
- *
- * The inodes to be written are parked on sb->s_io.  They are moved back onto
- * sb->s_dirty as they are selected for writing.  This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
- */
-static void generic_sync_sb_inodes(struct super_block *sb,
-				   struct writeback_control *wbc)
+static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
+				    struct writeback_control *wbc,
+				    struct super_block *sb)
 {
+	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
 	const unsigned long start = jiffies;	/* livelock avoidance */
-	int sync = wbc->sync_mode == WB_SYNC_ALL;
 
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
-		queue_io(sb, wbc->older_than_this);
 
-	while (!list_empty(&sb->s_io)) {
-		struct inode *inode = list_entry(sb->s_io.prev,
+	if (!wbc->for_kupdate || list_empty(&bdi->b_io))
+		queue_io(bdi, wbc->older_than_this);
+
+	while (!list_empty(&bdi->b_io)) {
+		struct inode *inode = list_entry(bdi->b_io.prev,
 						struct inode, i_list);
-		struct address_space *mapping = inode->i_mapping;
-		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		long pages_skipped;
 
+		/*
+		 * super block given and doesn't match, skip this inode
+		 */
+		if (sb && sb != inode->i_sb) {
+			redirty_tail(inode);
+			continue;
+		}
+
 		if (!bdi_cap_writeback_dirty(bdi)) {
 			redirty_tail(inode);
-			if (sb_is_blkdev_sb(sb)) {
+			if (is_blkdev_sb) {
 				/*
 				 * Dirty memory-backed blockdev: the ramdisk
 				 * driver does this.  Skip just this inode
@@ -499,14 +515,14 @@ static void generic_sync_sb_inodes(struct super_block *sb,
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
 			wbc->encountered_congestion = 1;
-			if (!sb_is_blkdev_sb(sb))
+			if (!is_blkdev_sb)
 				break;		/* Skip a congested fs */
 			requeue_io(inode);
 			continue;		/* Skip a congested blockdev */
 		}
 
 		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!sb_is_blkdev_sb(sb))
+			if (!is_blkdev_sb)
 				break;		/* fs has the wrong queue */
 			requeue_io(inode);
 			continue;		/* blockdev has wrong queue */
@@ -544,13 +560,57 @@ static void generic_sync_sb_inodes(struct super_block *sb,
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&sb->s_more_io))
+		if (!list_empty(&bdi->b_more_io))
 			wbc->more_io = 1;
 	}
 
-	if (sync) {
+	spin_unlock(&inode_lock);
+	/* Leave any unwritten inodes on b_io */
+}
+
+/*
+ * Write out a superblock's list of dirty inodes.  A wait will be performed
+ * upon no inodes, all inodes or the final one, depending upon sync_mode.
+ *
+ * If older_than_this is non-NULL, then only write out inodes which
+ * had their first dirtying at a time earlier than *older_than_this.
+ *
+ * If we're a pdlfush thread, then implement pdflush collision avoidance
+ * against the entire list.
+ *
+ * If `bdi' is non-zero then we're being asked to writeback a specific queue.
+ * This function assumes that the blockdev superblock's inodes are backed by
+ * a variety of queues, so all inodes are searched.  For other superblocks,
+ * assume that all inodes are backed by the same queue.
+ *
+ * FIXME: this linear search could get expensive with many fileystems.  But
+ * how to fix?  We need to go from an address_space to all inodes which share
+ * a queue with that address_space.  (Easy: have a global "dirty superblocks"
+ * list).
+ *
+ * The inodes to be written are parked on bdi->b_io.  They are moved back onto
+ * bdi->b_dirty as they are selected for writing.  This way, none can be missed
+ * on the writer throttling path, and we get decent balancing between many
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
+ */
+static void generic_sync_sb_inodes(struct super_block *sb,
+				   struct writeback_control *wbc)
+{
+	struct backing_dev_info *bdi;
+
+	if (!wbc->bdi) {
+		mutex_lock(&bdi_lock);
+		list_for_each_entry(bdi, &bdi_list, bdi_list)
+			generic_sync_bdi_inodes(bdi, wbc, sb);
+		mutex_unlock(&bdi_lock);
+	} else
+		generic_sync_bdi_inodes(wbc->bdi, wbc, sb);
+
+	if (wbc->sync_mode == WB_SYNC_ALL) {
 		struct inode *inode, *old_inode = NULL;
 
+		spin_lock(&inode_lock);
+
 		/*
 		 * Data integrity sync. Must wait for all pages under writeback,
 		 * because there may have been pages dirtied before our sync
@@ -588,10 +648,7 @@ static void generic_sync_sb_inodes(struct super_block *sb,
 		}
 		spin_unlock(&inode_lock);
 		iput(old_inode);
-	} else
-		spin_unlock(&inode_lock);
-
-	return;		/* Leave any unwritten inodes on s_io */
+	}
 }
 
 /*
@@ -599,8 +656,8 @@ static void generic_sync_sb_inodes(struct super_block *sb,
  *
  * Note:
  * We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
+ * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
  * empty. Since __sync_single_inode() regains inode_lock before it finally moves
  * inode from superblock lists we are OK.
  *
diff --git a/fs/super.c b/fs/super.c
index 2761d3e..0d22ce3 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -62,9 +62,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
 			s = NULL;
 			goto out;
 		}
-		INIT_LIST_HEAD(&s->s_dirty);
-		INIT_LIST_HEAD(&s->s_io);
-		INIT_LIST_HEAD(&s->s_more_io);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 1d52425..928cd54 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,8 @@ enum bdi_stat_item {
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
 struct backing_dev_info {
+	struct list_head bdi_list;
+
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
@@ -58,6 +60,10 @@ struct backing_dev_info {
 
 	struct device *dev;
 
+	struct list_head	b_dirty;	/* dirty inodes */
+	struct list_head	b_io;		/* parked for writeback */
+	struct list_head	b_more_io;	/* parked for more writeback */
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
@@ -72,6 +78,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
 
+extern struct mutex bdi_lock;
+extern struct list_head bdi_list;
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item, s64 amount)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 07b0f66..97949b7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -715,7 +715,7 @@ struct posix_acl;
 
 struct inode {
 	struct hlist_node	i_hash;
-	struct list_head	i_list;
+	struct list_head	i_list;		/* backing dev IO list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
@@ -1336,9 +1336,6 @@ struct super_block {
 	struct xattr_handler	**s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
-	struct list_head	s_dirty;	/* dirty inodes */
-	struct list_head	s_io;		/* parked for writeback */
-	struct list_head	s_more_io;	/* parked for more writeback */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index c86edd2..6f163e0 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -22,6 +22,8 @@ struct backing_dev_info default_backing_dev_info = {
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
 static struct class *bdi_class;
+DEFINE_MUTEX(bdi_lock);
+LIST_HEAD(bdi_list);
 
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
@@ -211,6 +213,10 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		goto exit;
 	}
 
+	mutex_lock(&bdi_lock);
+	list_add_tail(&bdi->bdi_list, &bdi_list);
+	mutex_unlock(&bdi_lock);
+
 	bdi->dev = dev;
 	bdi_debug_register(bdi, dev_name(dev));
 
@@ -225,9 +231,17 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+	mutex_lock(&bdi_lock);
+	list_del(&bdi->bdi_list);
+	mutex_unlock(&bdi_lock);
+}
+
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
+		bdi_remove_from_list(bdi);
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -245,6 +259,10 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
+	INIT_LIST_HEAD(&bdi->bdi_list);
+	INIT_LIST_HEAD(&bdi->b_io);
+	INIT_LIST_HEAD(&bdi->b_dirty);
+	INIT_LIST_HEAD(&bdi->b_more_io);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -259,6 +277,8 @@ int bdi_init(struct backing_dev_info *bdi)
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+		bdi_remove_from_list(bdi);
 	}
 
 	return err;
@@ -269,6 +289,10 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
+	WARN_ON(!list_empty(&bdi->b_dirty));
+	WARN_ON(!list_empty(&bdi->b_io));
+	WARN_ON(!list_empty(&bdi->b_more_io));
+
 	bdi_unregister(bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 81627eb..f8341b6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -320,15 +320,13 @@ static void task_dirty_limit(struct task_struct *tsk, unsigned long *pdirty)
 /*
  *
  */
-static DEFINE_SPINLOCK(bdi_lock);
 static unsigned int bdi_min_ratio;
 
 int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 {
 	int ret = 0;
-	unsigned long flags;
 
-	spin_lock_irqsave(&bdi_lock, flags);
+	mutex_lock(&bdi_lock);
 	if (min_ratio > bdi->max_ratio) {
 		ret = -EINVAL;
 	} else {
@@ -340,27 +338,26 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 			ret = -EINVAL;
 		}
 	}
-	spin_unlock_irqrestore(&bdi_lock, flags);
+	mutex_unlock(&bdi_lock);
 
 	return ret;
 }
 
 int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 {
-	unsigned long flags;
 	int ret = 0;
 
 	if (max_ratio > 100)
 		return -EINVAL;
 
-	spin_lock_irqsave(&bdi_lock, flags);
+	mutex_lock(&bdi_lock);
 	if (bdi->min_ratio > max_ratio) {
 		ret = -EINVAL;
 	} else {
 		bdi->max_ratio = max_ratio;
 		bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
 	}
-	spin_unlock_irqrestore(&bdi_lock, flags);
+	mutex_unlock(&bdi_lock);
 
 	return ret;
 }
-- 
1.6.4.1.207.g68ea



* [PATCH 3/8] writeback: switch to per-bdi threads for flushing data
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
  2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
  2009-09-08  9:23 ` [PATCH 2/8] writeback: move dirty inodes from super_block to backing_dev_info Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08 13:46   ` Daniel Walker
  2009-09-08  9:23 ` [PATCH 4/8] writeback: get rid of pdflush completely Jens Axboe
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
pdflush writeout suffers from lack of locality and also requires more
threads to handle the same workload, since it has to work in a
non-blocking fashion against each queue. This also introduces lumpy
behaviour and potential request starvation, since pdflush can be starved
for queue access if others are accessing it. A sample ffsb workload that
does random writes to files is about 8% faster here on a simple SATA drive
during the benchmark phase. File layout also looks a LOT smoother in
vmstat:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
 0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
 1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
 0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
 0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
 0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
 0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
 0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
 0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
 0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
 1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
 0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
 0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
 1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
 0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
 0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
 1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
 0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
 1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
 1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
 0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54

A 10-disk test with btrfs performs 26% faster with per-bdi flushing. An
SSD-based writeback test on XFS performs over 20% better as well, with
the throughput being very stable around 1GB/sec, where pdflush only
manages 750MB/sec and fluctuates wildly while doing so. Random buffered
writes to many files behave a lot better as well, as do random mmap'ed
writes.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/buffer.c                 |    2 +-
 fs/fs-writeback.c           | 1000 ++++++++++++++++++++++++++++++-------------
 fs/super.c                  |    2 +-
 fs/sync.c                   |    2 +-
 include/linux/backing-dev.h |   55 ++-
 include/linux/fs.h          |    2 +-
 include/linux/writeback.h   |    8 +-
 mm/backing-dev.c            |  341 ++++++++++++++-
 mm/page-writeback.c         |  179 ++-------
 mm/vmscan.c                 |    2 +-
 10 files changed, 1121 insertions(+), 472 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 28f320f..90a9886 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -281,7 +281,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_pdflush(1024);
+	wakeup_flusher_threads(1024);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 45ad4bb..0e3a14a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -27,165 +29,208 @@
 
 #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
 
-/**
- * writeback_acquire - attempt to get exclusive writeback access to a device
- * @bdi: the device's backing_dev_info structure
- *
- * It is a waste of resources to have more than one pdflush thread blocked on
- * a single request queue.  Exclusion at the request_queue level is obtained
- * via a flag in the request_queue's backing_dev_info.state.
- *
- * Non-request_queue-backed address_spaces will share default_backing_dev_info,
- * unless they implement their own.  Which is somewhat inefficient, as this
- * may prevent concurrent writeback against multiple devices.
+/*
+ * Work items for the bdi_writeback threads
  */
-static int writeback_acquire(struct backing_dev_info *bdi)
+struct bdi_work {
+	struct list_head list;
+	struct list_head wait_list;
+	struct rcu_head rcu_head;
+
+	unsigned long seen;
+	atomic_t pending;
+
+	struct super_block *sb;
+	unsigned long nr_pages;
+	enum writeback_sync_modes sync_mode;
+
+	unsigned long state;
+};
+
+enum {
+	WS_USED_B = 0,
+	WS_ONSTACK_B,
+};
+
+#define WS_USED (1 << WS_USED_B)
+#define WS_ONSTACK (1 << WS_ONSTACK_B)
+
+static inline bool bdi_work_on_stack(struct bdi_work *work)
+{
+	return test_bit(WS_ONSTACK_B, &work->state);
+}
+
+static inline void bdi_work_init(struct bdi_work *work,
+				 struct writeback_control *wbc)
+{
+	INIT_RCU_HEAD(&work->rcu_head);
+	work->sb = wbc->sb;
+	work->nr_pages = wbc->nr_to_write;
+	work->sync_mode = wbc->sync_mode;
+	work->state = WS_USED;
+}
+
+static inline void bdi_work_init_on_stack(struct bdi_work *work,
+					  struct writeback_control *wbc)
 {
-	return !test_and_set_bit(BDI_pdflush, &bdi->state);
+	bdi_work_init(work, wbc);
+	work->state |= WS_ONSTACK;
 }
 
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
  *
- * Determine whether there is writeback in progress against a backing device.
+ * Determine whether there is writeback waiting to be handled against a
+ * backing device.
  */
 int writeback_in_progress(struct backing_dev_info *bdi)
 {
-	return test_bit(BDI_pdflush, &bdi->state);
+	return !list_empty(&bdi->work_list);
 }
 
-/**
- * writeback_release - relinquish exclusive writeback access against a device.
- * @bdi: the device's backing_dev_info structure
- */
-static void writeback_release(struct backing_dev_info *bdi)
+static void bdi_work_clear(struct bdi_work *work)
 {
-	BUG_ON(!writeback_in_progress(bdi));
-	clear_bit(BDI_pdflush, &bdi->state);
+	clear_bit(WS_USED_B, &work->state);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&work->state, WS_USED_B);
 }
 
-static noinline void block_dump___mark_inode_dirty(struct inode *inode)
+static void bdi_work_free(struct rcu_head *head)
 {
-	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
-		struct dentry *dentry;
-		const char *name = "?";
+	struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
 
-		dentry = d_find_alias(inode);
-		if (dentry) {
-			spin_lock(&dentry->d_lock);
-			name = (const char *) dentry->d_name.name;
-		}
-		printk(KERN_DEBUG
-		       "%s(%d): dirtied inode %lu (%s) on %s\n",
-		       current->comm, task_pid_nr(current), inode->i_ino,
-		       name, inode->i_sb->s_id);
-		if (dentry) {
-			spin_unlock(&dentry->d_lock);
-			dput(dentry);
-		}
-	}
+	if (!bdi_work_on_stack(work))
+		kfree(work);
+	else
+		bdi_work_clear(work);
 }
 
-/**
- *	__mark_inode_dirty -	internal function
- *	@inode: inode to mark
- *	@flags: what kind of dirty (i.e. I_DIRTY_SYNC)
- *	Mark an inode as dirty. Callers should use mark_inode_dirty or
- *  	mark_inode_dirty_sync.
- *
- * Put the inode on the super block's dirty list.
- *
- * CAREFUL! We mark it dirty unconditionally, but move it onto the
- * dirty list only if it is hashed or if it refers to a blockdev.
- * If it was not hashed, it will never be added to the dirty list
- * even if it is later hashed, as it will have been marked dirty already.
- *
- * In short, make sure you hash any inodes _before_ you start marking
- * them dirty.
- *
- * This function *must* be atomic for the I_DIRTY_PAGES case -
- * set_page_dirty() is called under spinlock in several places.
- *
- * Note that for blockdevs, inode->dirtied_when represents the dirtying time of
- * the block-special inode (/dev/hda1) itself.  And the ->dirtied_when field of
- * the kernel-internal blockdev inode represents the dirtying time of the
- * blockdev's pages.  This is why for I_DIRTY_PAGES we always use
- * page->mapping->host, so the page-dirtying time is recorded in the internal
- * blockdev inode.
- */
-void __mark_inode_dirty(struct inode *inode, int flags)
+static void wb_work_complete(struct bdi_work *work)
 {
-	struct super_block *sb = inode->i_sb;
+	const enum writeback_sync_modes sync_mode = work->sync_mode;
 
 	/*
-	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
-	 * dirty the inode itself
+	 * For allocated work, we can clear the done/seen bit right here.
+	 * For on-stack work, we need to postpone both the clear and free
+	 * to after the RCU grace period, since the stack could be invalidated
+	 * as soon as bdi_work_clear() has done the wakeup.
 	 */
-	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
-		if (sb->s_op->dirty_inode)
-			sb->s_op->dirty_inode(inode);
-	}
+	if (!bdi_work_on_stack(work))
+		bdi_work_clear(work);
+	if (sync_mode == WB_SYNC_NONE || bdi_work_on_stack(work))
+		call_rcu(&work->rcu_head, bdi_work_free);
+}
 
+static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
+{
 	/*
-	 * make sure that changes are seen by all cpus before we test i_state
-	 * -- mikulas
+	 * The caller has retrieved the work arguments from this work,
+	 * drop our reference. If this is the last ref, delete and free it
 	 */
-	smp_mb();
+	if (atomic_dec_and_test(&work->pending)) {
+		struct backing_dev_info *bdi = wb->bdi;
 
-	/* avoid the locking if we can */
-	if ((inode->i_state & flags) == flags)
-		return;
-
-	if (unlikely(block_dump))
-		block_dump___mark_inode_dirty(inode);
+		spin_lock(&bdi->wb_lock);
+		list_del_rcu(&work->list);
+		spin_unlock(&bdi->wb_lock);
 
-	spin_lock(&inode_lock);
-	if ((inode->i_state & flags) != flags) {
-		const int was_dirty = inode->i_state & I_DIRTY;
+		wb_work_complete(work);
+	}
+}
 
-		inode->i_state |= flags;
+static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+	if (work) {
+		work->seen = bdi->wb_mask;
+		BUG_ON(!work->seen);
+		atomic_set(&work->pending, bdi->wb_cnt);
+		BUG_ON(!bdi->wb_cnt);
 
 		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
+		 * Make sure stores are seen before it appears on the list
 		 */
-		if (inode->i_state & I_SYNC)
-			goto out;
+		smp_mb();
 
-		/*
-		 * Only add valid (hashed) inodes to the superblock's
-		 * dirty list.  Add blockdev inodes as well.
-		 */
-		if (!S_ISBLK(inode->i_mode)) {
-			if (hlist_unhashed(&inode->i_hash))
-				goto out;
-		}
-		if (inode->i_state & (I_FREEING|I_CLEAR))
-			goto out;
+		spin_lock(&bdi->wb_lock);
+		list_add_tail_rcu(&work->list, &bdi->work_list);
+		spin_unlock(&bdi->wb_lock);
+	}
+
+	/*
+	 * If the default thread isn't there, make sure we add it. When
+	 * it gets created and wakes up, we'll run this work.
+	 */
+	if (unlikely(list_empty_careful(&bdi->wb_list)))
+		wake_up_process(default_backing_dev_info.wb.task);
+	else {
+		struct bdi_writeback *wb = &bdi->wb;
 
 		/*
-		 * If the inode was already on b_dirty/b_io/b_more_io, don't
-		 * reposition it (that would break b_dirty time-ordering).
+		 * If we failed allocating the bdi work item, wake up the wb
+		 * thread always. As a safety precaution, it'll flush out
+		 * everything
 		 */
-		if (!was_dirty) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list,
-					&inode_to_bdi(inode)->b_dirty);
-		}
+		if (!wb_has_dirty_io(wb)) {
+			if (work)
+				wb_clear_pending(wb, work);
+		} else if (wb->task);
+			wake_up_process(wb->task);
 	}
-out:
-	spin_unlock(&inode_lock);
 }
 
-EXPORT_SYMBOL(__mark_inode_dirty);
+/*
+ * Used for on-stack allocated work items. The caller needs to wait until
+ * the wb threads have acked the work before it's safe to continue.
+ */
+static void bdi_wait_on_work_clear(struct bdi_work *work)
+{
+	wait_on_bit(&work->state, WS_USED_B, bdi_sched_wait,
+		    TASK_UNINTERRUPTIBLE);
+}
 
-static int write_inode(struct inode *inode, int sync)
+static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
 {
-	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode))
-		return inode->i_sb->s_op->write_inode(inode, sync);
-	return 0;
+	struct bdi_work *work;
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work)
+		bdi_work_init(work, wbc);
+
+	return work;
+}
+
+void bdi_start_writeback(struct writeback_control *wbc)
+{
+	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
+	struct bdi_work work_stack, *work = NULL;
+
+	if (!must_wait)
+		work = bdi_alloc_work(wbc);
+
+	if (!work) {
+		work = &work_stack;
+		bdi_work_init_on_stack(work, wbc);
+	}
+
+	bdi_queue_work(wbc->bdi, work);
+
+	/*
+	 * If the sync mode is WB_SYNC_ALL, block waiting for the work to
+	 * complete. If not, we only need to wait for the work to be started,
+	 * if we allocated it on-stack. We use the same mechanism, if the
+	 * wait bit is set in the bdi_work struct, then threads will not
+	 * clear pending until after they are done.
+	 *
+	 * Note that work == &work_stack if must_wait is true, so we don't
+	 * need to do call_rcu() here ever, since the completion path will
+	 * have done that for us.
+	 */
+	if (must_wait || work == &work_stack) {
+		bdi_wait_on_work_clear(work);
+		if (work != &work_stack)
+			call_rcu(&work->rcu_head, bdi_work_free);
+	}
 }
 
 /*
@@ -199,16 +244,16 @@ static int write_inode(struct inode *inode, int sync)
  */
 static void redirty_tail(struct inode *inode)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 
-	if (!list_empty(&bdi->b_dirty)) {
+	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &bdi->b_dirty);
+	list_move(&inode->i_list, &wb->b_dirty);
 }
 
 /*
@@ -216,7 +261,9 @@ static void redirty_tail(struct inode *inode)
  */
 static void requeue_io(struct inode *inode)
 {
-	list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
+	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+
+	list_move(&inode->i_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -263,52 +310,18 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 /*
  * Queue all expired dirty inodes for io, eldest first.
  */
-static void queue_io(struct backing_dev_info *bdi,
-		     unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
-	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+	list_splice_init(&wb->b_more_io, wb->b_io.prev);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
 
-static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
-{
-	struct inode *inode;
-	int ret = 0;
-
-	spin_lock(&inode_lock);
-	list_for_each_entry(inode, list, i_list) {
-		if (inode->i_sb == sb) {
-			ret = 1;
-			break;
-		}
-	}
-	spin_unlock(&inode_lock);
-	return ret;
-}
-
-int sb_has_dirty_inodes(struct super_block *sb)
+static int write_inode(struct inode *inode, int sync)
 {
-	struct backing_dev_info *bdi;
-	int ret = 0;
-
-	/*
-	 * This is REALLY expensive right now, but it'll go away
-	 * when the bdi writeback is introduced
-	 */
-	mutex_lock(&bdi_lock);
-	list_for_each_entry(bdi, &bdi_list, bdi_list) {
-		if (sb_on_inode_list(sb, &bdi->b_dirty) ||
-		    sb_on_inode_list(sb, &bdi->b_io) ||
-		    sb_on_inode_list(sb, &bdi->b_more_io)) {
-			ret = 1;
-			break;
-		}
-	}
-	mutex_unlock(&bdi_lock);
-
-	return ret;
+	if (inode->i_sb->s_op->write_inode && !is_bad_inode(inode))
+		return inode->i_sb->s_op->write_inode(inode, sync);
+	return 0;
 }
-EXPORT_SYMBOL(sb_has_dirty_inodes);
 
 /*
  * Wait for writeback on an inode to complete.
@@ -466,20 +479,71 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return ret;
 }
 
-static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
-				    struct writeback_control *wbc,
-				    struct super_block *sb)
+/*
+ * For WB_SYNC_NONE writeback, the caller does not have the sb pinned
+ * before calling writeback. So make sure that we do pin it, so it doesn't
+ * go away while we are writing inodes from it.
+ *
+ * Returns 0 if the super was successfully pinned (or pinning wasn't needed),
+ * 1 if we failed.
+ */
+static int pin_sb_for_writeback(struct writeback_control *wbc,
+				   struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	/*
+	 * Caller must already hold the ref for this
+	 */
+	if (wbc->sync_mode == WB_SYNC_ALL) {
+		WARN_ON(!rwsem_is_locked(&sb->s_umount));
+		return 0;
+	}
+
+	spin_lock(&sb_lock);
+	sb->s_count++;
+	if (down_read_trylock(&sb->s_umount)) {
+		if (sb->s_root) {
+			spin_unlock(&sb_lock);
+			return 0;
+		}
+		/*
+		 * umounted, drop rwsem again and fall through to failure
+		 */
+		up_read(&sb->s_umount);
+	}
+
+	sb->s_count--;
+	spin_unlock(&sb_lock);
+	return 1;
+}
+
+static void unpin_sb_for_writeback(struct writeback_control *wbc,
+				   struct inode *inode)
+{
+	struct super_block *sb = inode->i_sb;
+
+	if (wbc->sync_mode == WB_SYNC_ALL)
+		return;
+
+	up_read(&sb->s_umount);
+	put_super(sb);
+}
+
+static void writeback_inodes_wb(struct bdi_writeback *wb,
+				struct writeback_control *wbc)
 {
+	struct super_block *sb = wbc->sb;
 	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
 	const unsigned long start = jiffies;	/* livelock avoidance */
 
 	spin_lock(&inode_lock);
 
-	if (!wbc->for_kupdate || list_empty(&bdi->b_io))
-		queue_io(bdi, wbc->older_than_this);
+	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+		queue_io(wb, wbc->older_than_this);
 
-	while (!list_empty(&bdi->b_io)) {
-		struct inode *inode = list_entry(bdi->b_io.prev,
+	while (!list_empty(&wb->b_io)) {
+		struct inode *inode = list_entry(wb->b_io.prev,
 						struct inode, i_list);
 		long pages_skipped;
 
@@ -491,7 +555,7 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			continue;
 		}
 
-		if (!bdi_cap_writeback_dirty(bdi)) {
+		if (!bdi_cap_writeback_dirty(wb->bdi)) {
 			redirty_tail(inode);
 			if (is_blkdev_sb) {
 				/*
@@ -513,7 +577,7 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			continue;
 		}
 
-		if (wbc->nonblocking && bdi_write_congested(bdi)) {
+		if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
 			wbc->encountered_congestion = 1;
 			if (!is_blkdev_sb)
 				break;		/* Skip a congested fs */
@@ -521,13 +585,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!is_blkdev_sb)
-				break;		/* fs has the wrong queue */
-			requeue_io(inode);
-			continue;		/* blockdev has wrong queue */
-		}
-
 		/*
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
@@ -535,16 +592,16 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 		if (inode_dirtied_after(inode, start))
 			break;
 
-		/* Is another pdflush already flushing this queue? */
-		if (current_is_pdflush() && !writeback_acquire(bdi))
-			break;
+		if (pin_sb_for_writeback(wbc, inode)) {
+			requeue_io(inode);
+			continue;
+		}
 
 		BUG_ON(inode->i_state & (I_FREEING | I_CLEAR));
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		writeback_single_inode(inode, wbc);
-		if (current_is_pdflush())
-			writeback_release(bdi);
+		unpin_sb_for_writeback(wbc, inode);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -560,7 +617,7 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&bdi->b_more_io))
+		if (!list_empty(&wb->b_more_io))
 			wbc->more_io = 1;
 	}
 
@@ -568,139 +625,501 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 	/* Leave any unwritten inodes on b_io */
 }
 
+void writeback_inodes_wbc(struct writeback_control *wbc)
+{
+	struct backing_dev_info *bdi = wbc->bdi;
+
+	writeback_inodes_wb(&bdi->wb, wbc);
+}
+
 /*
- * Write out a superblock's list of dirty inodes.  A wait will be performed
- * upon no inodes, all inodes or the final one, depending upon sync_mode.
- *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
- * If we're a pdlfush thread, then implement pdflush collision avoidance
- * against the entire list.
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024
+
+static inline bool over_bground_thresh(void)
+{
+	unsigned long background_thresh, dirty_thresh;
+
+	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+
+	return (global_page_state(NR_FILE_DIRTY) +
+		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+}
+
+/*
+ * Explicit flushing or periodic writeback of "old" data.
  *
- * If `bdi' is non-zero then we're being asked to writeback a specific queue.
- * This function assumes that the blockdev superblock's inodes are backed by
- * a variety of queues, so all inodes are searched.  For other superblocks,
- * assume that all inodes are backed by the same queue.
+ * Define "old": the first time one of an inode's pages is dirtied, we mark the
+ * dirtying-time in the inode's address_space.  So this periodic writeback code
+ * just walks the superblock inode list, writing back any inodes which are
+ * older than a specific point in time.
  *
- * FIXME: this linear search could get expensive with many fileystems.  But
- * how to fix?  We need to go from an address_space to all inodes which share
- * a queue with that address_space.  (Easy: have a global "dirty superblocks"
- * list).
+ * Try to run once per dirty_writeback_interval.  But if a writeback event
+ * takes longer than a dirty_writeback_interval interval, then leave a
+ * one-second gap.
  *
- * The inodes to be written are parked on bdi->b_io.  They are moved back onto
- * bdi->b_dirty as they are selected for writing.  This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
+ * older_than_this takes precedence over nr_to_write.  So we'll only write back
+ * all dirty pages if they are all attached to "old" mappings.
  */
-static void generic_sync_sb_inodes(struct super_block *sb,
-				   struct writeback_control *wbc)
+static long wb_writeback(struct bdi_writeback *wb, long nr_pages,
+			 struct super_block *sb,
+			 enum writeback_sync_modes sync_mode, int for_kupdate)
 {
-	struct backing_dev_info *bdi;
-
-	if (!wbc->bdi) {
-		mutex_lock(&bdi_lock);
-		list_for_each_entry(bdi, &bdi_list, bdi_list)
-			generic_sync_bdi_inodes(bdi, wbc, sb);
-		mutex_unlock(&bdi_lock);
-	} else
-		generic_sync_bdi_inodes(wbc->bdi, wbc, sb);
+	struct writeback_control wbc = {
+		.bdi			= wb->bdi,
+		.sb			= sb,
+		.sync_mode		= sync_mode,
+		.older_than_this	= NULL,
+		.for_kupdate		= for_kupdate,
+		.range_cyclic		= 1,
+	};
+	unsigned long oldest_jif;
+	long wrote = 0;
 
-	if (wbc->sync_mode == WB_SYNC_ALL) {
-		struct inode *inode, *old_inode = NULL;
+	if (wbc.for_kupdate) {
+		wbc.older_than_this = &oldest_jif;
+		oldest_jif = jiffies -
+				msecs_to_jiffies(dirty_expire_interval * 10);
+	}
 
-		spin_lock(&inode_lock);
+	for (;;) {
+		/*
+		 * Don't flush anything for non-integrity writeback where
+		 * no nr_pages was given
+		 */
+		if (!for_kupdate && nr_pages <= 0 && sync_mode == WB_SYNC_NONE)
+			break;
 
 		/*
-		 * Data integrity sync. Must wait for all pages under writeback,
-		 * because there may have been pages dirtied before our sync
-		 * call, but which had writeout started before we write it out.
-		 * In which case, the inode may not be on the dirty list, but
-		 * we still have to wait for that writeout.
+		 * If no specific pages were given and this is just a
+		 * periodic background writeout and we are below the
+		 * background dirty threshold, don't do anything
 		 */
-		list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
-			struct address_space *mapping;
+		if (for_kupdate && nr_pages <= 0 && !over_bground_thresh())
+			break;
 
-			if (inode->i_state &
-					(I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
-				continue;
-			mapping = inode->i_mapping;
-			if (mapping->nrpages == 0)
+		wbc.more_io = 0;
+		wbc.encountered_congestion = 0;
+		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.pages_skipped = 0;
+		writeback_inodes_wb(wb, &wbc);
+		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+
+		/*
+		 * If we ran out of stuff to write, bail unless more_io got set
+		 */
+		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
+			if (wbc.more_io && !wbc.for_kupdate)
 				continue;
-			__iget(inode);
-			spin_unlock(&inode_lock);
+			break;
+		}
+	}
+
+	return wrote;
+}
+
+/*
+ * Return the next bdi_work struct that hasn't been processed by this
+ * wb thread yet
+ */
+static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
+					   struct bdi_writeback *wb)
+{
+	struct bdi_work *work, *ret = NULL;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(work, &bdi->work_list, list) {
+		if (!test_and_clear_bit(wb->nr, &work->seen))
+			continue;
+
+		ret = work;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+static long wb_check_old_data_flush(struct bdi_writeback *wb)
+{
+	unsigned long expired;
+	long nr_pages;
+
+	expired = wb->last_old_flush +
+			msecs_to_jiffies(dirty_writeback_interval * 10);
+	if (time_before(jiffies, expired))
+		return 0;
+
+	wb->last_old_flush = jiffies;
+	nr_pages = global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS) +
+			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+
+	if (nr_pages)
+		return wb_writeback(wb, nr_pages, NULL, WB_SYNC_NONE, 1);
+
+	return 0;
+}
+
+/*
+ * Retrieve work items and do the writeback they describe
+ */
+long wb_do_writeback(struct bdi_writeback *wb, int force_wait)
+{
+	struct backing_dev_info *bdi = wb->bdi;
+	struct bdi_work *work;
+	long nr_pages, wrote = 0;
+
+	while ((work = get_next_work_item(bdi, wb)) != NULL) {
+		enum writeback_sync_modes sync_mode;
+
+		nr_pages = work->nr_pages;
+
+		/*
+		 * Override sync mode, in case we must wait for completion
+		 */
+		if (force_wait)
+			work->sync_mode = sync_mode = WB_SYNC_ALL;
+		else
+			sync_mode = work->sync_mode;
+
+		/*
+		 * If this isn't a data integrity operation, just notify
+		 * that we have seen this work and we are now starting it.
+		 */
+		if (sync_mode == WB_SYNC_NONE)
+			wb_clear_pending(wb, work);
+
+		wrote += wb_writeback(wb, nr_pages, work->sb, sync_mode, 0);
+
+		/*
+		 * This is a data integrity writeback, so only do the
+		 * notification when we have completed the work.
+		 */
+		if (sync_mode == WB_SYNC_ALL)
+			wb_clear_pending(wb, work);
+	}
+
+	/*
+	 * Check for periodic writeback, kupdated() style
+	 */
+	wrote += wb_check_old_data_flush(wb);
+
+	return wrote;
+}
+
+/*
+ * Handle writeback of dirty data for the device backed by this bdi. Also
+ * wakes up periodically and does kupdated style flushing.
+ */
+int bdi_writeback_task(struct bdi_writeback *wb)
+{
+	unsigned long last_active = jiffies;
+	unsigned long wait_jiffies = -1UL;
+	long pages_written;
+
+	while (!kthread_should_stop()) {
+		pages_written = wb_do_writeback(wb, 0);
+
+		if (pages_written)
+			last_active = jiffies;
+		else if (wait_jiffies != -1UL) {
+			unsigned long max_idle;
+
 			/*
-			 * We hold a reference to 'inode' so it couldn't have
-			 * been removed from s_inodes list while we dropped the
-			 * inode_lock.  We cannot iput the inode now as we can
-			 * be holding the last reference and we cannot iput it
-			 * under inode_lock. So we keep the reference and iput
-			 * it later.
+			 * Longest period of inactivity that we tolerate. If we
+			 * see dirty data again later, the task will get
+			 * recreated automatically.
 			 */
-			iput(old_inode);
-			old_inode = inode;
+			max_idle = max(5UL * 60 * HZ, wait_jiffies);
+			if (time_after(jiffies, max_idle + last_active))
+				break;
+		}
+
+		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(wait_jiffies);
+		try_to_freeze();
+	}
 
-			filemap_fdatawait(mapping);
+	return 0;
+}
 
-			cond_resched();
+/*
+ * Schedule writeback for all backing devices. Expensive! If this is a data
+ * integrity operation, writeback will be complete when this returns. If
+ * we are simply called for WB_SYNC_NONE, then writeback will merely be
+ * scheduled to run.
+ */
+static void bdi_writeback_all(struct writeback_control *wbc)
+{
+	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
+	struct backing_dev_info *bdi;
+	struct bdi_work *work;
+	LIST_HEAD(list);
+
+restart:
+	spin_lock(&bdi_lock);
+
+	list_for_each_entry(bdi, &bdi_list, bdi_list) {
+		struct bdi_work *work;
 
-			spin_lock(&inode_lock);
+		if (!bdi_has_dirty_io(bdi))
+			continue;
+
+		/*
+		 * If work allocation fails, do the writes inline. We drop
+		 * the lock and restart the list writeout. This should be OK,
+		 * since this happens rarely and because the writeout should
+		 * eventually make more free memory available.
+		 */
+		work = bdi_alloc_work(wbc);
+		if (!work) {
+			struct writeback_control __wbc;
+
+			/*
+			 * Not a data integrity writeout, just continue
+			 */
+			if (!must_wait)
+				continue;
+
+			spin_unlock(&bdi_lock);
+			__wbc = *wbc;
+			__wbc.bdi = bdi;
+			writeback_inodes_wbc(&__wbc);
+			goto restart;
 		}
-		spin_unlock(&inode_lock);
-		iput(old_inode);
+		if (must_wait)
+			list_add_tail(&work->wait_list, &list);
+
+		bdi_queue_work(bdi, work);
+	}
+
+	spin_unlock(&bdi_lock);
+
+	/*
+	 * If this is for WB_SYNC_ALL, wait for pending work to complete
+	 * before returning.
+	 */
+	while (!list_empty(&list)) {
+		work = list_entry(list.next, struct bdi_work, wait_list);
+		list_del(&work->wait_list);
+		bdi_wait_on_work_clear(work);
+		call_rcu(&work->rcu_head, bdi_work_free);
 	}
 }
 
 /*
- * Start writeback of dirty pagecache data against all unlocked inodes.
+ * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
+ * the whole world.
+ */
+void wakeup_flusher_threads(long nr_pages)
+{
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_NONE,
+		.older_than_this = NULL,
+		.range_cyclic	= 1,
+	};
+
+	if (nr_pages == 0)
+		nr_pages = global_page_state(NR_FILE_DIRTY) +
+				global_page_state(NR_UNSTABLE_NFS);
+	wbc.nr_to_write = nr_pages;
+	bdi_writeback_all(&wbc);
+}
+
+static noinline void block_dump___mark_inode_dirty(struct inode *inode)
+{
+	if (inode->i_ino || strcmp(inode->i_sb->s_id, "bdev")) {
+		struct dentry *dentry;
+		const char *name = "?";
+
+		dentry = d_find_alias(inode);
+		if (dentry) {
+			spin_lock(&dentry->d_lock);
+			name = (const char *) dentry->d_name.name;
+		}
+		printk(KERN_DEBUG
+		       "%s(%d): dirtied inode %lu (%s) on %s\n",
+		       current->comm, task_pid_nr(current), inode->i_ino,
+		       name, inode->i_sb->s_id);
+		if (dentry) {
+			spin_unlock(&dentry->d_lock);
+			dput(dentry);
+		}
+	}
+}
+
+/**
+ *	__mark_inode_dirty -	internal function
+ *	@inode: inode to mark
+ *	@flags: what kind of dirty (i.e. I_DIRTY_SYNC)
+ *	Mark an inode as dirty. Callers should use mark_inode_dirty or
+ *  	mark_inode_dirty_sync.
  *
- * Note:
- * We don't need to grab a reference to superblock here. If it has non-empty
- * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
- * inode from superblock lists we are OK.
+ * Put the inode on the super block's dirty list.
  *
- * If `older_than_this' is non-zero then only flush inodes which have a
- * flushtime older than *older_than_this.
+ * CAREFUL! We mark it dirty unconditionally, but move it onto the
+ * dirty list only if it is hashed or if it refers to a blockdev.
+ * If it was not hashed, it will never be added to the dirty list
+ * even if it is later hashed, as it will have been marked dirty already.
  *
- * If `bdi' is non-zero then we will scan the first inode against each
- * superblock until we find the matching ones.  One group will be the dirty
- * inodes against a filesystem.  Then when we hit the dummy blockdev superblock,
- * sync_sb_inodes will seekout the blockdev which matches `bdi'.  Maybe not
- * super-efficient but we're about to do a ton of I/O...
+ * In short, make sure you hash any inodes _before_ you start marking
+ * them dirty.
+ *
+ * This function *must* be atomic for the I_DIRTY_PAGES case -
+ * set_page_dirty() is called under spinlock in several places.
+ *
+ * Note that for blockdevs, inode->dirtied_when represents the dirtying time of
+ * the block-special inode (/dev/hda1) itself.  And the ->dirtied_when field of
+ * the kernel-internal blockdev inode represents the dirtying time of the
+ * blockdev's pages.  This is why for I_DIRTY_PAGES we always use
+ * page->mapping->host, so the page-dirtying time is recorded in the internal
+ * blockdev inode.
  */
-void
-writeback_inodes(struct writeback_control *wbc)
+void __mark_inode_dirty(struct inode *inode, int flags)
 {
-	struct super_block *sb;
+	struct super_block *sb = inode->i_sb;
 
-	might_sleep();
-	spin_lock(&sb_lock);
-restart:
-	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
-		if (sb_has_dirty_inodes(sb)) {
-			/* we're making our own get_super here */
-			sb->s_count++;
-			spin_unlock(&sb_lock);
-			/*
-			 * If we can't get the readlock, there's no sense in
-			 * waiting around, most of the time the FS is going to
-			 * be unmounted by the time it is released.
-			 */
-			if (down_read_trylock(&sb->s_umount)) {
-				if (sb->s_root)
-					generic_sync_sb_inodes(sb, wbc);
-				up_read(&sb->s_umount);
-			}
-			spin_lock(&sb_lock);
-			if (__put_super_and_need_restart(sb))
-				goto restart;
+	/*
+	 * Don't do this for I_DIRTY_PAGES - that doesn't actually
+	 * dirty the inode itself
+	 */
+	if (flags & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
+		if (sb->s_op->dirty_inode)
+			sb->s_op->dirty_inode(inode);
+	}
+
+	/*
+	 * make sure that changes are seen by all cpus before we test i_state
+	 * -- mikulas
+	 */
+	smp_mb();
+
+	/* avoid the locking if we can */
+	if ((inode->i_state & flags) == flags)
+		return;
+
+	if (unlikely(block_dump))
+		block_dump___mark_inode_dirty(inode);
+
+	spin_lock(&inode_lock);
+	if ((inode->i_state & flags) != flags) {
+		const int was_dirty = inode->i_state & I_DIRTY;
+
+		inode->i_state |= flags;
+
+		/*
+		 * If the inode is being synced, just update its dirty state.
+		 * The unlocker will place the inode on the appropriate
+		 * superblock list, based upon its state.
+		 */
+		if (inode->i_state & I_SYNC)
+			goto out;
+
+		/*
+		 * Only add valid (hashed) inodes to the superblock's
+		 * dirty list.  Add blockdev inodes as well.
+		 */
+		if (!S_ISBLK(inode->i_mode)) {
+			if (hlist_unhashed(&inode->i_hash))
+				goto out;
+		}
+		if (inode->i_state & (I_FREEING|I_CLEAR))
+			goto out;
+
+		/*
+		 * If the inode was already on b_dirty/b_io/b_more_io, don't
+		 * reposition it (that would break b_dirty time-ordering).
+		 */
+		if (!was_dirty) {
+			struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+
+			inode->dirtied_when = jiffies;
+			list_move(&inode->i_list, &wb->b_dirty);
 		}
-		if (wbc->nr_to_write <= 0)
-			break;
 	}
-	spin_unlock(&sb_lock);
+out:
+	spin_unlock(&inode_lock);
+}
+
+EXPORT_SYMBOL(__mark_inode_dirty);
+
+/*
+ * Write out a superblock's list of dirty inodes.  A wait will be performed
+ * upon no inodes, all inodes or the final one, depending upon sync_mode.
+ *
+ * If older_than_this is non-NULL, then only write out inodes which
+ * had their first dirtying at a time earlier than *older_than_this.
+ *
+ * If we're a pdflush thread, then implement pdflush collision avoidance
+ * against the entire list.
+ *
+ * If `bdi' is non-zero then we're being asked to writeback a specific queue.
+ * This function assumes that the blockdev superblock's inodes are backed by
+ * a variety of queues, so all inodes are searched.  For other superblocks,
+ * assume that all inodes are backed by the same queue.
+ *
+ * The inodes to be written are parked on bdi->b_io.  They are moved back onto
+ * bdi->b_dirty as they are selected for writing.  This way, none can be missed
+ * on the writer throttling path, and we get decent balancing between many
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
+ */
+static void wait_sb_inodes(struct writeback_control *wbc)
+{
+	struct inode *inode, *old_inode = NULL;
+
+	/*
+	 * We need to be protected against the filesystem going from
+	 * r/o to r/w or vice versa.
+	 */
+	WARN_ON(!rwsem_is_locked(&wbc->sb->s_umount));
+
+	spin_lock(&inode_lock);
+
+	/*
+	 * Data integrity sync. Must wait for all pages under writeback,
+	 * because there may have been pages dirtied before our sync
+	 * call, but which had writeout started before we write it out.
+	 * In which case, the inode may not be on the dirty list, but
+	 * we still have to wait for that writeout.
+	 */
+	list_for_each_entry(inode, &wbc->sb->s_inodes, i_sb_list) {
+		struct address_space *mapping;
+
+		if (inode->i_state & (I_FREEING|I_CLEAR|I_WILL_FREE|I_NEW))
+			continue;
+		mapping = inode->i_mapping;
+		if (mapping->nrpages == 0)
+			continue;
+		__iget(inode);
+		spin_unlock(&inode_lock);
+		/*
+		 * We hold a reference to 'inode' so it couldn't have
+		 * been removed from s_inodes list while we dropped the
+		 * inode_lock.  We cannot iput the inode now as we can
+		 * be holding the last reference and we cannot iput it
+		 * under inode_lock. So we keep the reference and iput
+		 * it later.
+		 */
+		iput(old_inode);
+		old_inode = inode;
+
+		filemap_fdatawait(mapping);
+
+		cond_resched();
+
+		spin_lock(&inode_lock);
+	}
+	spin_unlock(&inode_lock);
+	iput(old_inode);
 }
 
 /**
@@ -715,6 +1134,7 @@ restart:
 long writeback_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
+		.sb		= sb,
 		.sync_mode	= WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
@@ -727,7 +1147,7 @@ long writeback_inodes_sb(struct super_block *sb)
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
 
 	wbc.nr_to_write = nr_to_write;
-	generic_sync_sb_inodes(sb, &wbc);
+	bdi_writeback_all(&wbc);
 	return nr_to_write - wbc.nr_to_write;
 }
 EXPORT_SYMBOL(writeback_inodes_sb);
@@ -742,6 +1162,7 @@ EXPORT_SYMBOL(writeback_inodes_sb);
 long sync_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
+		.sb		= sb,
 		.sync_mode	= WB_SYNC_ALL,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
@@ -749,7 +1170,8 @@ long sync_inodes_sb(struct super_block *sb)
 	long nr_to_write = LONG_MAX; /* doesn't actually matter */
 
 	wbc.nr_to_write = nr_to_write;
-	generic_sync_sb_inodes(sb, &wbc);
+	bdi_writeback_all(&wbc);
+	wait_sb_inodes(&wbc);
 	return nr_to_write - wbc.nr_to_write;
 }
 EXPORT_SYMBOL(sync_inodes_sb);
diff --git a/fs/super.c b/fs/super.c
index 0d22ce3..9cda337 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -168,7 +168,7 @@ int __put_super_and_need_restart(struct super_block *sb)
  *	Drops a temporary reference, frees superblock if there's no
  *	references left.
  */
-static void put_super(struct super_block *sb)
+void put_super(struct super_block *sb)
 {
 	spin_lock(&sb_lock);
 	__put_super(sb);
diff --git a/fs/sync.c b/fs/sync.c
index 66f2104..103cc7f 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -120,7 +120,7 @@ restart:
  */
 SYSCALL_DEFINE0(sync)
 {
-	wakeup_pdflush(0);
+	wakeup_flusher_threads(0);
 	sync_filesystems(0);
 	sync_filesystems(1);
 	if (unlikely(laptop_mode))
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 928cd54..d045f5f 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,8 @@
 #include <linux/proportions.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/writeback.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -23,7 +25,8 @@ struct dentry;
  * Bits in backing_dev_info.state
  */
 enum bdi_state {
-	BDI_pdflush,		/* A pdflush thread is working this device */
+	BDI_pending,		/* On its way to being activated */
+	BDI_wb_alloc,		/* Default embedded wb allocated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_unused,		/* Available bits start here */
@@ -39,9 +42,22 @@ enum bdi_stat_item {
 
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+struct bdi_writeback {
+	struct list_head list;			/* hangs off the bdi */
+
+	struct backing_dev_info *bdi;		/* our parent bdi */
+	unsigned int nr;
+
+	unsigned long last_old_flush;		/* last old data flush */
+
+	struct task_struct	*task;		/* writeback task */
+	struct list_head	b_dirty;	/* dirty inodes */
+	struct list_head	b_io;		/* parked for writeback */
+	struct list_head	b_more_io;	/* parked for more writeback */
+};
+
 struct backing_dev_info {
 	struct list_head bdi_list;
-
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
@@ -58,11 +74,15 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
 
-	struct device *dev;
+	struct bdi_writeback wb;  /* default writeback info for this bdi */
+	spinlock_t wb_lock;	  /* protects update side of wb_list */
+	struct list_head wb_list; /* the flusher threads hanging off this bdi */
+	unsigned long wb_mask;	  /* bitmask of registered tasks */
+	unsigned int wb_cnt;	  /* number of registered tasks */
 
-	struct list_head	b_dirty;	/* dirty inodes */
-	struct list_head	b_io;		/* parked for writeback */
-	struct list_head	b_more_io;	/* parked for more writeback */
+	struct list_head work_list;
+
+	struct device *dev;
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
@@ -77,10 +97,20 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
+void bdi_start_writeback(struct writeback_control *wbc);
+int bdi_writeback_task(struct bdi_writeback *wb);
+int bdi_has_dirty_io(struct backing_dev_info *bdi);
 
-extern struct mutex bdi_lock;
+extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
 
+static inline int wb_has_dirty_io(struct bdi_writeback *wb)
+{
+	return !list_empty(&wb->b_dirty) ||
+	       !list_empty(&wb->b_io) ||
+	       !list_empty(&wb->b_more_io);
+}
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item, s64 amount)
 {
@@ -270,6 +300,11 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
 	return bdi->capabilities & BDI_CAP_SWAP_BACKED;
 }
 
+static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
+{
+	return bdi == &default_backing_dev_info;
+}
+
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
 	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
@@ -285,4 +320,10 @@ static inline bool mapping_cap_swap_backed(struct address_space *mapping)
 	return bdi_cap_swap_backed(mapping->backing_dev_info);
 }
 
+static inline int bdi_sched_wait(void *word)
+{
+	schedule();
+	return 0;
+}
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 97949b7..8fe571f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1785,6 +1785,7 @@ extern int get_sb_pseudo(struct file_system_type *, char *,
 	struct vfsmount *mnt);
 extern void simple_set_mnt(struct vfsmount *mnt, struct super_block *sb);
 int __put_super_and_need_restart(struct super_block *sb);
+void put_super(struct super_block *sb);
 
 /* Alas, no aliases. Too much hassle with bringing module.h everywhere */
 #define fops_get(fops) \
@@ -2181,7 +2182,6 @@ extern int bdev_read_only(struct block_device *);
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
-extern int sb_has_dirty_inodes(struct super_block *);
 
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 0703929..cef7552 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -40,6 +40,8 @@ enum writeback_sync_modes {
 struct writeback_control {
 	struct backing_dev_info *bdi;	/* If !NULL, only write back this
 					   queue */
+	struct super_block *sb;		/* if !NULL, only write inodes from
+					   this super_block */
 	enum writeback_sync_modes sync_mode;
 	unsigned long *older_than_this;	/* If !NULL, only write back inodes
 					   older than this */
@@ -76,10 +78,13 @@ struct writeback_control {
 /*
  * fs/fs-writeback.c
  */	
-void writeback_inodes(struct writeback_control *wbc);
+struct bdi_writeback;
 int inode_wait(void *);
 long writeback_inodes_sb(struct super_block *);
 long sync_inodes_sb(struct super_block *);
+void writeback_inodes_wbc(struct writeback_control *wbc);
+long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
+void wakeup_flusher_threads(long nr_pages);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
@@ -99,7 +104,6 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 6f163e0..7f3fa79 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,8 +1,11 @@
 
 #include <linux/wait.h>
 #include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/writeback.h>
@@ -22,8 +25,18 @@ struct backing_dev_info default_backing_dev_info = {
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
 static struct class *bdi_class;
-DEFINE_MUTEX(bdi_lock);
+DEFINE_SPINLOCK(bdi_lock);
 LIST_HEAD(bdi_list);
+LIST_HEAD(bdi_pending_list);
+
+static struct task_struct *sync_supers_tsk;
+static struct timer_list sync_supers_timer;
+
+static int bdi_sync_supers(void *);
+static void sync_supers_timer_fn(unsigned long);
+static void arm_supers_timer(void);
+
+static void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
 
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
@@ -187,6 +200,13 @@ static int __init default_bdi_init(void)
 {
 	int err;
 
+	sync_supers_tsk = kthread_run(bdi_sync_supers, NULL, "sync_supers");
+	BUG_ON(IS_ERR(sync_supers_tsk));
+
+	init_timer(&sync_supers_timer);
+	setup_timer(&sync_supers_timer, sync_supers_timer_fn, 0);
+	arm_supers_timer();
+
 	err = bdi_init(&default_backing_dev_info);
 	if (!err)
 		bdi_register(&default_backing_dev_info, NULL, "default");
@@ -195,6 +215,242 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
+static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+{
+	memset(wb, 0, sizeof(*wb));
+
+	wb->bdi = bdi;
+	wb->last_old_flush = jiffies;
+	INIT_LIST_HEAD(&wb->b_dirty);
+	INIT_LIST_HEAD(&wb->b_io);
+	INIT_LIST_HEAD(&wb->b_more_io);
+}
+
+static void bdi_task_init(struct backing_dev_info *bdi,
+			  struct bdi_writeback *wb)
+{
+	struct task_struct *tsk = current;
+
+	spin_lock(&bdi->wb_lock);
+	list_add_tail_rcu(&wb->list, &bdi->wb_list);
+	spin_unlock(&bdi->wb_lock);
+
+	tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
+	set_freezable();
+
+	/*
+	 * Our parent may run at a different priority, just set us to normal
+	 */
+	set_user_nice(tsk, 0);
+}
+
+static int bdi_start_fn(void *ptr)
+{
+	struct bdi_writeback *wb = ptr;
+	struct backing_dev_info *bdi = wb->bdi;
+	int ret;
+
+	/*
+	 * Add us to the active bdi_list
+	 */
+	spin_lock(&bdi_lock);
+	list_add(&bdi->bdi_list, &bdi_list);
+	spin_unlock(&bdi_lock);
+
+	bdi_task_init(bdi, wb);
+
+	/*
+	 * Clear pending bit and wakeup anybody waiting to tear us down
+	 */
+	clear_bit(BDI_pending, &bdi->state);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&bdi->state, BDI_pending);
+
+	ret = bdi_writeback_task(wb);
+
+	/*
+	 * Remove us from the list
+	 */
+	spin_lock(&bdi->wb_lock);
+	list_del_rcu(&wb->list);
+	spin_unlock(&bdi->wb_lock);
+
+	/*
+	 * Flush any work that raced with us exiting. No new work
+	 * will be added, since this bdi isn't discoverable anymore.
+	 */
+	if (!list_empty(&bdi->work_list))
+		wb_do_writeback(wb, 1);
+
+	wb->task = NULL;
+	return ret;
+}
+
+int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+	return wb_has_dirty_io(&bdi->wb);
+}
+
+static void bdi_flush_io(struct backing_dev_info *bdi)
+{
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+		.nr_to_write		= 1024,
+	};
+
+	writeback_inodes_wbc(&wbc);
+}
+
+/*
+ * kupdated() used to do this. We cannot do it from the bdi_forker_task()
+ * or we risk deadlocking on ->s_umount. The longer term solution would be
+ * to implement sync_supers_bdi() or similar and simply do it from the
+ * bdi writeback tasks individually.
+ */
+static int bdi_sync_supers(void *unused)
+{
+	set_user_nice(current, 0);
+
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule();
+
+		/*
+		 * Do this periodically, like kupdated() did before.
+		 */
+		sync_supers();
+	}
+
+	return 0;
+}
+
+static void arm_supers_timer(void)
+{
+	unsigned long next;
+
+	next = msecs_to_jiffies(dirty_writeback_interval * 10) + jiffies;
+	mod_timer(&sync_supers_timer, round_jiffies_up(next));
+}
+
+static void sync_supers_timer_fn(unsigned long unused)
+{
+	wake_up_process(sync_supers_tsk);
+	arm_supers_timer();
+}
+
+static int bdi_forker_task(void *ptr)
+{
+	struct bdi_writeback *me = ptr;
+
+	bdi_task_init(me->bdi, me);
+
+	for (;;) {
+		struct backing_dev_info *bdi, *tmp;
+		struct bdi_writeback *wb;
+
+		/*
+		 * Temporary measure, we want to make sure we don't see
+		 * dirty data on the default backing_dev_info
+		 */
+		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
+			wb_do_writeback(me, 0);
+
+		spin_lock(&bdi_lock);
+
+		/*
+		 * Check if any existing bdi's have dirty data without
+		 * a thread registered. If so, set that up.
+		 */
+		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+			if (bdi->wb.task)
+				continue;
+			if (list_empty(&bdi->work_list) &&
+			    !bdi_has_dirty_io(bdi))
+				continue;
+
+			bdi_add_default_flusher_task(bdi);
+		}
+
+		set_current_state(TASK_INTERRUPTIBLE);
+
+		if (list_empty(&bdi_pending_list)) {
+			unsigned long wait;
+
+			spin_unlock(&bdi_lock);
+			wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+			schedule_timeout(wait);
+			try_to_freeze();
+			continue;
+		}
+
+		__set_current_state(TASK_RUNNING);
+
+		/*
+		 * This is our real job - check for pending entries in
+		 * bdi_pending_list, and create the tasks that got added
+		 */
+		bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
+				 bdi_list);
+		list_del_init(&bdi->bdi_list);
+		spin_unlock(&bdi_lock);
+
+		wb = &bdi->wb;
+		wb->task = kthread_run(bdi_start_fn, wb, "flush-%s",
+					dev_name(bdi->dev));
+		/*
+		 * If task creation fails, then readd the bdi to
+		 * the pending list and force writeout of the bdi
+		 * from this forker thread. That will free some memory
+		 * and we can try again.
+		 */
+		if (IS_ERR(wb->task)) {
+			wb->task = NULL;
+
+			/*
+			 * Add this 'bdi' to the back, so we get
+			 * a chance to flush other bdi's to free
+			 * memory.
+			 */
+			spin_lock(&bdi_lock);
+			list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+			spin_unlock(&bdi_lock);
+
+			bdi_flush_io(bdi);
+		}
+	}
+
+	return 0;
+}
+
+/*
+ * Add the default flusher task that gets created for any bdi
+ * that has dirty data pending writeout
+ */
+static void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
+	/*
+	 * Check with the helper whether to proceed adding a task. Will only
+	 * abort if two or more simultaneous calls to
+	 * bdi_add_default_flusher_task() occur; further additions will block
+	 * waiting for previous additions to finish.
+	 */
+	if (!test_and_set_bit(BDI_pending, &bdi->state)) {
+		list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+
+		/*
+		 * We are now on the pending list, wake up bdi_forker_task()
+		 * to finish the job and add us back to the active bdi_list
+		 */
+		wake_up_process(default_backing_dev_info.wb.task);
+	}
+}
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...)
 {
@@ -213,13 +469,34 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		goto exit;
 	}
 
-	mutex_lock(&bdi_lock);
+	spin_lock(&bdi_lock);
 	list_add_tail(&bdi->bdi_list, &bdi_list);
-	mutex_unlock(&bdi_lock);
+	spin_unlock(&bdi_lock);
 
 	bdi->dev = dev;
-	bdi_debug_register(bdi, dev_name(dev));
 
+	/*
+	 * Just start the forker thread for our default backing_dev_info,
+	 * and add other bdi's to the list. They will get a thread created
+	 * on-demand when they need it.
+	 */
+	if (bdi_cap_flush_forker(bdi)) {
+		struct bdi_writeback *wb = &bdi->wb;
+
+		wb->task = kthread_run(bdi_forker_task, wb, "bdi-%s",
+						dev_name(dev));
+		if (IS_ERR(wb->task)) {
+			wb->task = NULL;
+			ret = -ENOMEM;
+
+			spin_lock(&bdi_lock);
+			list_del(&bdi->bdi_list);
+			spin_unlock(&bdi_lock);
+			goto exit;
+		}
+	}
+
+	bdi_debug_register(bdi, dev_name(dev));
 exit:
 	return ret;
 }
@@ -231,17 +508,42 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+/*
+ * Remove bdi from the global list and shutdown any threads we have running
+ */
+static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
-	mutex_lock(&bdi_lock);
+	struct bdi_writeback *wb;
+
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
+	/*
+	 * If setup is pending, wait for that to complete first
+	 */
+	wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
+			TASK_UNINTERRUPTIBLE);
+
+	/*
+	 * Make sure nobody finds us on the bdi_list anymore
+	 */
+	spin_lock(&bdi_lock);
 	list_del(&bdi->bdi_list);
-	mutex_unlock(&bdi_lock);
+	spin_unlock(&bdi_lock);
+
+	/*
+	 * Finally, kill the kernel threads. We don't need to be RCU
+	 * safe anymore, since the bdi is gone from visibility.
+	 */
+	list_for_each_entry(wb, &bdi->wb_list, list)
+		kthread_stop(wb->task);
 }
 
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
-		bdi_remove_from_list(bdi);
+		if (!bdi_cap_flush_forker(bdi))
+			bdi_wb_shutdown(bdi);
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -251,18 +553,25 @@ EXPORT_SYMBOL(bdi_unregister);
 
 int bdi_init(struct backing_dev_info *bdi)
 {
-	int i;
-	int err;
+	int i, err;
 
 	bdi->dev = NULL;
 
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
+	spin_lock_init(&bdi->wb_lock);
 	INIT_LIST_HEAD(&bdi->bdi_list);
-	INIT_LIST_HEAD(&bdi->b_io);
-	INIT_LIST_HEAD(&bdi->b_dirty);
-	INIT_LIST_HEAD(&bdi->b_more_io);
+	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->work_list);
+
+	bdi_wb_init(&bdi->wb, bdi);
+
+	/*
+	 * Just one thread support for now, hard code mask and count
+	 */
+	bdi->wb_mask = 1;
+	bdi->wb_cnt = 1;
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -277,8 +586,6 @@ int bdi_init(struct backing_dev_info *bdi)
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
-
-		bdi_remove_from_list(bdi);
 	}
 
 	return err;
@@ -289,9 +596,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
-	WARN_ON(!list_empty(&bdi->b_dirty));
-	WARN_ON(!list_empty(&bdi->b_io));
-	WARN_ON(!list_empty(&bdi->b_more_io));
+	WARN_ON(bdi_has_dirty_io(bdi));
 
 	bdi_unregister(bdi);
 
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index f8341b6..25e7770 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
 #include <linux/pagevec.h>
 
 /*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
-/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -117,8 +108,6 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
-
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
  *
@@ -326,7 +315,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 {
 	int ret = 0;
 
-	mutex_lock(&bdi_lock);
+	spin_lock(&bdi_lock);
 	if (min_ratio > bdi->max_ratio) {
 		ret = -EINVAL;
 	} else {
@@ -338,7 +327,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 			ret = -EINVAL;
 		}
 	}
-	mutex_unlock(&bdi_lock);
+	spin_unlock(&bdi_lock);
 
 	return ret;
 }
@@ -350,14 +339,14 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 	if (max_ratio > 100)
 		return -EINVAL;
 
-	mutex_lock(&bdi_lock);
+	spin_lock(&bdi_lock);
 	if (bdi->min_ratio > max_ratio) {
 		ret = -EINVAL;
 	} else {
 		bdi->max_ratio = max_ratio;
 		bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
 	}
-	mutex_unlock(&bdi_lock);
+	spin_unlock(&bdi_lock);
 
 	return ret;
 }
@@ -543,7 +532,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * up.
 		 */
 		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes(&wbc);
+			writeback_inodes_wbc(&wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 				       &bdi_thresh, bdi);
@@ -572,7 +561,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		if (pages_written >= write_chunk)
 			break;		/* We've done our duty */
 
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		schedule_timeout(1);
 	}
 
 	if (bdi_nr_reclaimable + bdi_nr_writeback < bdi_thresh &&
@@ -591,10 +580,18 @@ static void balance_dirty_pages(struct address_space *mapping)
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if ((laptop_mode && pages_written) ||
-			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
-					  + global_page_state(NR_UNSTABLE_NFS)
-					  > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+	    (!laptop_mode && ((nr_writeback = global_page_state(NR_FILE_DIRTY)
+					  + global_page_state(NR_UNSTABLE_NFS))
+					  > background_thresh))) {
+		struct writeback_control wbc = {
+			.bdi		= bdi,
+			.sync_mode	= WB_SYNC_NONE,
+			.nr_to_write	= nr_writeback,
+		};
+
+
+		bdi_start_writeback(&wbc);
+	}
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -678,153 +675,35 @@ void throttle_vm_writeout(gfp_t gfp_mask)
         }
 }
 
-/*
- * writeback at least _min_pages, and keep writing until the amount of dirty
- * memory is less than the background threshold, or until we're all clean.
- */
-static void background_writeout(unsigned long _min_pages)
-{
-	long min_pages = _min_pages;
-	struct writeback_control wbc = {
-		.bdi		= NULL,
-		.sync_mode	= WB_SYNC_NONE,
-		.older_than_this = NULL,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
-		.range_cyclic	= 1,
-	};
-
-	for ( ; ; ) {
-		unsigned long background_thresh;
-		unsigned long dirty_thresh;
-
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
-				&& min_pages <= 0)
-			break;
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		wbc.pages_skipped = 0;
-		writeback_inodes(&wbc);
-		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			/* Wrote less than expected */
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
-			else
-				break;
-		}
-	}
-}
-
-/*
- * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
- * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
- * -1 if all pdflush threads were busy.
- */
-int wakeup_pdflush(long nr_pages)
-{
-	if (nr_pages == 0)
-		nr_pages = global_page_state(NR_FILE_DIRTY) +
-				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
-}
-
-static void wb_timer_fn(unsigned long unused);
 static void laptop_timer_fn(unsigned long unused);
 
-static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
 static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
- * Periodic writeback of "old" data.
- *
- * Define "old": the first time one of an inode's pages is dirtied, we mark the
- * dirtying-time in the inode's address_space.  So this periodic writeback code
- * just walks the superblock inode list, writing back any inodes which are
- * older than a specific point in time.
- *
- * Try to run once per dirty_writeback_interval.  But if a writeback event
- * takes longer than a dirty_writeback_interval interval, then leave a
- * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
- */
-static void wb_kupdate(unsigned long arg)
-{
-	unsigned long oldest_jif;
-	unsigned long start_jif;
-	unsigned long next_jif;
-	long nr_to_write;
-	struct writeback_control wbc = {
-		.bdi		= NULL,
-		.sync_mode	= WB_SYNC_NONE,
-		.older_than_this = &oldest_jif,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
-		.for_kupdate	= 1,
-		.range_cyclic	= 1,
-	};
-
-	sync_supers();
-
-	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
-	start_jif = jiffies;
-	next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
-	nr_to_write = global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	while (nr_to_write > 0) {
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		writeback_inodes(&wbc);
-		if (wbc.nr_to_write > 0) {
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(BLK_RW_ASYNC, HZ/10);
-			else
-				break;	/* All the old data is written */
-		}
-		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-	}
-	if (time_before(next_jif, jiffies + HZ))
-		next_jif = jiffies + HZ;
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, next_jif);
-}
-
-/*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
 int dirty_writeback_centisecs_handler(ctl_table *table, int write,
 	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
 {
 	proc_dointvec(table, write, file, buffer, length, ppos);
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, jiffies +
-			msecs_to_jiffies(dirty_writeback_interval * 10));
-	else
-		del_timer(&wb_timer);
 	return 0;
 }
 
-static void wb_timer_fn(unsigned long unused)
-{
-	if (pdflush_operation(wb_kupdate, 0) < 0)
-		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
-}
-
-static void laptop_flush(unsigned long unused)
+static void do_laptop_sync(struct work_struct *work)
 {
-	sys_sync();
+	wakeup_flusher_threads(0);
+	kfree(work);
 }
 
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	struct work_struct *work;
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work) {
+		INIT_WORK(work, do_laptop_sync);
+		schedule_work(work);
+	}
 }
 
 /*
@@ -907,8 +786,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	mod_timer(&wb_timer,
-		  jiffies + msecs_to_jiffies(dirty_writeback_interval * 10));
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 94e86dd..ba8228e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1720,7 +1720,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
 			sc->may_writepage = 1;
 		}
 
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 4/8] writeback: get rid of pdflush completely
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
                   ` (2 preceding siblings ...)
  2009-09-08  9:23 ` [PATCH 3/8] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08  9:23 ` [PATCH 5/8] writeback: add some debug inode list counters to bdi stats Jens Axboe
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

It is now unused, so kill it off.
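
The nr_pdflush_threads sysctl stays exported for compatibility, but it is
now just a stub in fs/fs-writeback.c (see below) that nothing increments
anymore. As a quick illustration of the expected behaviour (the value shown
is made up; the point is only that the file keeps existing while no longer
reflecting a real thread count):

  $ cat /proc/sys/vm/nr_pdflush_threads
  0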

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c         |    5 +
 include/linux/writeback.h |   12 --
 mm/Makefile               |    2 +-
 mm/pdflush.c              |  269 ---------------------------------------------
 4 files changed, 6 insertions(+), 282 deletions(-)
 delete mode 100644 mm/pdflush.c

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0e3a14a..7c79ff5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -30,6 +30,11 @@
 #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
 
 /*
+ * We don't actually have pdflush, but this one is exported through /proc...
+ */
+int nr_pdflush_threads;
+
+/*
  * Work items for the bdi_writeback threads
  */
 struct bdi_work {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index cef7552..78b1e46 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -14,17 +14,6 @@ extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
- * Yes, writeback.h requires sched.h
- * No, sched.h is not included from here.
- */
-static inline int task_is_pdflush(struct task_struct *task)
-{
-	return task->flags & PF_FLUSHER;
-}
-
-#define current_is_pdflush()	task_is_pdflush(current)
-
-/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
@@ -155,7 +144,6 @@ balance_dirty_pages_ratelimited(struct address_space *mapping)
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 				void *data);
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/Makefile b/mm/Makefile
index 5e0bd64..147a7a7 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o pdflush.o \
+			   maccess.o page_alloc.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o $(mmu-y)
diff --git a/mm/pdflush.c b/mm/pdflush.c
deleted file mode 100644
index 235ac44..0000000
--- a/mm/pdflush.c
+++ /dev/null
@@ -1,269 +0,0 @@
-/*
- * mm/pdflush.c - worker threads for writing back filesystem data
- *
- * Copyright (C) 2002, Linus Torvalds.
- *
- * 09Apr2002	Andrew Morton
- *		Initial version
- * 29Feb2004	kaos@sgi.com
- *		Move worker thread creation to kthread to avoid chewing
- *		up stack space with nested calls to kernel_thread.
- */
-
-#include <linux/sched.h>
-#include <linux/list.h>
-#include <linux/signal.h>
-#include <linux/spinlock.h>
-#include <linux/gfp.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/fs.h>		/* Needed by writeback.h	  */
-#include <linux/writeback.h>	/* Prototypes pdflush_operation() */
-#include <linux/kthread.h>
-#include <linux/cpuset.h>
-#include <linux/freezer.h>
-
-
-/*
- * Minimum and maximum number of pdflush instances
- */
-#define MIN_PDFLUSH_THREADS	2
-#define MAX_PDFLUSH_THREADS	8
-
-static void start_one_pdflush_thread(void);
-
-
-/*
- * The pdflush threads are worker threads for writing back dirty data.
- * Ideally, we'd like one thread per active disk spindle.  But the disk
- * topology is very hard to divine at this level.   Instead, we take
- * care in various places to prevent more than one pdflush thread from
- * performing writeback against a single filesystem.  pdflush threads
- * have the PF_FLUSHER flag set in current->flags to aid in this.
- */
-
-/*
- * All the pdflush threads.  Protected by pdflush_lock
- */
-static LIST_HEAD(pdflush_list);
-static DEFINE_SPINLOCK(pdflush_lock);
-
-/*
- * The count of currently-running pdflush threads.  Protected
- * by pdflush_lock.
- *
- * Readable by sysctl, but not writable.  Published to userspace at
- * /proc/sys/vm/nr_pdflush_threads.
- */
-int nr_pdflush_threads = 0;
-
-/*
- * The time at which the pdflush thread pool last went empty
- */
-static unsigned long last_empty_jifs;
-
-/*
- * The pdflush thread.
- *
- * Thread pool management algorithm:
- * 
- * - The minimum and maximum number of pdflush instances are bound
- *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
- * 
- * - If there have been no idle pdflush instances for 1 second, create
- *   a new one.
- * 
- * - If the least-recently-went-to-sleep pdflush thread has been asleep
- *   for more than one second, terminate a thread.
- */
-
-/*
- * A structure for passing work to a pdflush thread.  Also for passing
- * state information between pdflush threads.  Protected by pdflush_lock.
- */
-struct pdflush_work {
-	struct task_struct *who;	/* The thread */
-	void (*fn)(unsigned long);	/* A callback function */
-	unsigned long arg0;		/* An argument to the callback */
-	struct list_head list;		/* On pdflush_list, when idle */
-	unsigned long when_i_went_to_sleep;
-};
-
-static int __pdflush(struct pdflush_work *my_work)
-{
-	current->flags |= PF_FLUSHER | PF_SWAPWRITE;
-	set_freezable();
-	my_work->fn = NULL;
-	my_work->who = current;
-	INIT_LIST_HEAD(&my_work->list);
-
-	spin_lock_irq(&pdflush_lock);
-	for ( ; ; ) {
-		struct pdflush_work *pdf;
-
-		set_current_state(TASK_INTERRUPTIBLE);
-		list_move(&my_work->list, &pdflush_list);
-		my_work->when_i_went_to_sleep = jiffies;
-		spin_unlock_irq(&pdflush_lock);
-		schedule();
-		try_to_freeze();
-		spin_lock_irq(&pdflush_lock);
-		if (!list_empty(&my_work->list)) {
-			/*
-			 * Someone woke us up, but without removing our control
-			 * structure from the global list.  swsusp will do this
-			 * in try_to_freeze()->refrigerator().  Handle it.
-			 */
-			my_work->fn = NULL;
-			continue;
-		}
-		if (my_work->fn == NULL) {
-			printk("pdflush: bogus wakeup\n");
-			continue;
-		}
-		spin_unlock_irq(&pdflush_lock);
-
-		(*my_work->fn)(my_work->arg0);
-
-		spin_lock_irq(&pdflush_lock);
-
-		/*
-		 * Thread creation: For how long have there been zero
-		 * available threads?
-		 *
-		 * To throttle creation, we reset last_empty_jifs.
-		 */
-		if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
-			if (list_empty(&pdflush_list)) {
-				if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
-					last_empty_jifs = jiffies;
-					nr_pdflush_threads++;
-					spin_unlock_irq(&pdflush_lock);
-					start_one_pdflush_thread();
-					spin_lock_irq(&pdflush_lock);
-				}
-			}
-		}
-
-		my_work->fn = NULL;
-
-		/*
-		 * Thread destruction: For how long has the sleepiest
-		 * thread slept?
-		 */
-		if (list_empty(&pdflush_list))
-			continue;
-		if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
-			continue;
-		pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
-		if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
-			/* Limit exit rate */
-			pdf->when_i_went_to_sleep = jiffies;
-			break;					/* exeunt */
-		}
-	}
-	nr_pdflush_threads--;
-	spin_unlock_irq(&pdflush_lock);
-	return 0;
-}
-
-/*
- * Of course, my_work wants to be just a local in __pdflush().  It is
- * separated out in this manner to hopefully prevent the compiler from
- * performing unfortunate optimisations against the auto variables.  Because
- * these are visible to other tasks and CPUs.  (No problem has actually
- * been observed.  This is just paranoia).
- */
-static int pdflush(void *dummy)
-{
-	struct pdflush_work my_work;
-	cpumask_var_t cpus_allowed;
-
-	/*
-	 * Since the caller doesn't even check kthread_run() worked, let's not
-	 * freak out too much if this fails.
-	 */
-	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
-		printk(KERN_WARNING "pdflush failed to allocate cpumask\n");
-		return 0;
-	}
-
-	/*
-	 * pdflush can spend a lot of time doing encryption via dm-crypt.  We
-	 * don't want to do that at keventd's priority.
-	 */
-	set_user_nice(current, 0);
-
-	/*
-	 * Some configs put our parent kthread in a limited cpuset,
-	 * which kthread() overrides, forcing cpus_allowed == cpu_all_mask.
-	 * Our needs are more modest - cut back to our cpusets cpus_allowed.
-	 * This is needed as pdflush's are dynamically created and destroyed.
-	 * The boottime pdflush's are easily placed w/o these 2 lines.
-	 */
-	cpuset_cpus_allowed(current, cpus_allowed);
-	set_cpus_allowed_ptr(current, cpus_allowed);
-	free_cpumask_var(cpus_allowed);
-
-	return __pdflush(&my_work);
-}
-
-/*
- * Attempt to wake up a pdflush thread, and get it to do some work for you.
- * Returns zero if it indeed managed to find a worker thread, and passed your
- * payload to it.
- */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	BUG_ON(fn == NULL);	/* Hard to diagnose if it's deferred */
-
-	spin_lock_irqsave(&pdflush_lock, flags);
-	if (list_empty(&pdflush_list)) {
-		ret = -1;
-	} else {
-		struct pdflush_work *pdf;
-
-		pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
-		list_del_init(&pdf->list);
-		if (list_empty(&pdflush_list))
-			last_empty_jifs = jiffies;
-		pdf->fn = fn;
-		pdf->arg0 = arg0;
-		wake_up_process(pdf->who);
-	}
-	spin_unlock_irqrestore(&pdflush_lock, flags);
-
-	return ret;
-}
-
-static void start_one_pdflush_thread(void)
-{
-	struct task_struct *k;
-
-	k = kthread_run(pdflush, NULL, "pdflush");
-	if (unlikely(IS_ERR(k))) {
-		spin_lock_irq(&pdflush_lock);
-		nr_pdflush_threads--;
-		spin_unlock_irq(&pdflush_lock);
-	}
-}
-
-static int __init pdflush_init(void)
-{
-	int i;
-
-	/*
-	 * Pre-set nr_pdflush_threads...  If we fail to create,
-	 * the count will be decremented.
-	 */
-	nr_pdflush_threads = MIN_PDFLUSH_THREADS;
-
-	for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
-		start_one_pdflush_thread();
-	return 0;
-}
-
-module_init(pdflush_init);
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 5/8] writeback: add some debug inode list counters to bdi stats
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
                   ` (3 preceding siblings ...)
  2009-09-08  9:23 ` [PATCH 4/8] writeback: get rid of pdflush completely Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08  9:23 ` [PATCH 6/8] writeback: add name to backing_dev_info Jens Axboe
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

Add some debugfs entries so the internal state of the bdi writeback
threads and inode lists can be inspected.
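
For reference, with this applied the per-bdi stats file (typically
<debugfs>/bdi/<device>/stats, depending on where the existing bdi debugfs
dir is mounted) would look roughly like the below. All numbers are made up
and only meant to show the new fields added by this patch:

  BdiWriteback:          128 kB
  BdiReclaimable:       4096 kB
  BdiDirtyThresh:      12288 kB
  DirtyThresh:         49152 kB
  BackgroundThresh:    24576 kB
  WriteBack threads:       1
  b_dirty:                 3
  b_io:                    0
  b_more_io:               0
  bdi_list:                1
  state:                   2
  wb_mask:                 1
  wb_list:                 1
  wb_cnt:                  1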

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 mm/backing-dev.c |   38 ++++++++++++++++++++++++++++++++++----
 1 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7f3fa79..22c45e9 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -52,9 +52,29 @@ static void bdi_debug_init(void)
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
+	struct bdi_writeback *wb;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
+	struct inode *inode;
+
+	/*
+	 * inode lock is enough here, the bdi->wb_list is protected by
+	 * RCU on the reader side
+	 */
+	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
+	spin_lock(&inode_lock);
+	list_for_each_entry(wb, &bdi->wb_list, list) {
+		nr_wb++;
+		list_for_each_entry(inode, &wb->b_dirty, i_list)
+			nr_dirty++;
+		list_for_each_entry(inode, &wb->b_io, i_list)
+			nr_io++;
+		list_for_each_entry(inode, &wb->b_more_io, i_list)
+			nr_more_io++;
+	}
+	spin_unlock(&inode_lock);
 
 	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
 
@@ -64,12 +84,22 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BdiReclaimable:   %8lu kB\n"
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n",
+		   "BackgroundThresh: %8lu kB\n"
+		   "WriteBack threads:%8lu\n"
+		   "b_dirty:          %8lu\n"
+		   "b_io:             %8lu\n"
+		   "b_more_io:        %8lu\n"
+		   "bdi_list:         %8u\n"
+		   "state:            %8lx\n"
+		   "wb_mask:          %8lx\n"
+		   "wb_list:          %8u\n"
+		   "wb_cnt:           %8u\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh),
-		   K(dirty_thresh),
-		   K(background_thresh));
+		   K(bdi_thresh), K(dirty_thresh),
+		   K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io,
+		   !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
+		   !list_empty(&bdi->wb_list), bdi->wb_cnt);
 #undef K
 
 	return 0;
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 6/8] writeback: add name to backing_dev_info
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
                   ` (4 preceding siblings ...)
  2009-09-08  9:23 ` [PATCH 5/8] writeback: add some debug inode list counters to bdi stats Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08  9:23 ` [PATCH 7/8] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
  2009-09-08  9:23 ` [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb Jens Axboe
  7 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.
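
The per-driver conversion is mechanical; as a sketch, for a hypothetical
driver-private bdi it looks like this (the "example" name is just an
illustration, the real hunks below use the subsystem name):

	static struct backing_dev_info example_bdi = {
		.name		= "example",
		.ra_pages	= 0,	/* no readahead */
		.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
	};

and for dynamically initialized bdis, the name is simply assigned before
bdi_init():

	bdi->name = "example";
	err = bdi_init(bdi);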

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 block/blk-core.c            |    1 +
 drivers/block/aoe/aoeblk.c  |    1 +
 drivers/char/mem.c          |    1 +
 fs/btrfs/disk-io.c          |    1 +
 fs/char_dev.c               |    1 +
 fs/configfs/inode.c         |    1 +
 fs/fuse/inode.c             |    1 +
 fs/hugetlbfs/inode.c        |    1 +
 fs/nfs/client.c             |    1 +
 fs/ocfs2/dlm/dlmfs.c        |    1 +
 fs/ramfs/inode.c            |    1 +
 fs/sysfs/inode.c            |    1 +
 fs/ubifs/super.c            |    1 +
 include/linux/backing-dev.h |    2 ++
 kernel/cgroup.c             |    1 +
 mm/backing-dev.c            |    1 +
 mm/swap_state.c             |    1 +
 17 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index e3299a7..e695634 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -501,6 +501,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 			(VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	q->backing_dev_info.state = 0;
 	q->backing_dev_info.capabilities = BDI_CAP_MAP_COPY;
+	q->backing_dev_info.name = "block";
 
 	err = bdi_init(&q->backing_dev_info);
 	if (err) {
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 2307a27..0efb8fc 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -265,6 +265,7 @@ aoeblk_gdalloc(void *vp)
 	}
 
 	blk_queue_make_request(&d->blkq, aoeblk_make_request);
+	d->blkq.backing_dev_info.name = "aoe";
 	if (bdi_init(&d->blkq.backing_dev_info))
 		goto err_mempool;
 	spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index afa8813..645237b 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -822,6 +822,7 @@ static const struct file_operations zero_fops = {
  * - permits private mappings, "copies" are taken of the source of zeros
  */
 static struct backing_dev_info zero_bdi = {
+	.name		= "char/mem",
 	.capabilities	= BDI_CAP_MAP_COPY,
 };
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index e83be2e..15831d5 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1352,6 +1352,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 {
 	int err;
 
+	bdi->name = "btrfs";
 	bdi->capabilities = BDI_CAP_MAP_COPY;
 	err = bdi_init(bdi);
 	if (err)
diff --git a/fs/char_dev.c b/fs/char_dev.c
index a173551..7c27a8e 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -31,6 +31,7 @@
  * - no readahead or I/O queue unplugging required
  */
 struct backing_dev_info directly_mappable_cdev_bdi = {
+	.name = "char",
 	.capabilities	= (
 #ifdef CONFIG_MMU
 		/* permit private copies of the data to be taken */
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 4921e74..a2f7460 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -51,6 +51,7 @@ static const struct address_space_operations configfs_aops = {
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
+	.name		= "configfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index f91ccc4..4567db6 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -801,6 +801,7 @@ static int fuse_bdi_init(struct fuse_conn *fc, struct super_block *sb)
 {
 	int err;
 
+	fc->bdi.name = "fuse";
 	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	fc->bdi.unplug_io_fn = default_unplug_io_fn;
 	/* fuse does it's own writeback accounting */
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index cb88dac..a93b885 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -44,6 +44,7 @@ static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
 static struct backing_dev_info hugetlbfs_backing_dev_info = {
+	.name		= "hugetlbfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 8d25ccb..c6be84a 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -879,6 +879,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
+	server->backing_dev_info.name = "nfs";
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
diff --git a/fs/ocfs2/dlm/dlmfs.c b/fs/ocfs2/dlm/dlmfs.c
index 1c9efb4..02bf178 100644
--- a/fs/ocfs2/dlm/dlmfs.c
+++ b/fs/ocfs2/dlm/dlmfs.c
@@ -325,6 +325,7 @@ clear_fields:
 }
 
 static struct backing_dev_info dlmfs_backing_dev_info = {
+	.name		= "ocfs2-dlmfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 0ff7566..a7f0110 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -46,6 +46,7 @@ static const struct super_operations ramfs_ops;
 static const struct inode_operations ramfs_dir_inode_operations;
 
 static struct backing_dev_info ramfs_backing_dev_info = {
+	.name		= "ramfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK |
 			  BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY |
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 555f0ff..e57f98e 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -29,6 +29,7 @@ static const struct address_space_operations sysfs_aops = {
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
+	.name		= "sysfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 8d6050a..51763aa 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1965,6 +1965,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	 *
 	 * Read-ahead will be disabled because @c->bdi.ra_pages is 0.
 	 */
+	c->bdi.name = "ubifs";
 	c->bdi.capabilities = BDI_CAP_MAP_COPY;
 	c->bdi.unplug_io_fn = default_unplug_io_fn;
 	err  = bdi_init(&c->bdi);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index d045f5f..2f218b7 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -66,6 +66,8 @@ struct backing_dev_info {
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
 
+	char *name;
+
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	struct prop_local_percpu completions;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b6eadfe..c7ece8f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -600,6 +600,7 @@ static struct inode_operations cgroup_dir_inode_operations;
 static struct file_operations proc_cgroupstats_operations;
 
 static struct backing_dev_info cgroup_backing_dev_info = {
+	.name		= "cgroup",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 22c45e9..5cb32c5 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -17,6 +17,7 @@ void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
+	.name		= "default",
 	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 42cd38e..5ae6b8b 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -34,6 +34,7 @@ static const struct address_space_operations swap_aops = {
 };
 
 static struct backing_dev_info swap_backing_dev_info = {
+	.name		= "swap",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
 	.unplug_io_fn	= swap_unplug_io_fn,
 };
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 7/8] writeback: check for registered bdi in flusher add and inode dirty
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
                   ` (5 preceding siblings ...)
  2009-09-08  9:23 ` [PATCH 6/8] writeback: add name to backing_dev_info Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08  9:23 ` [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb Jens Axboe
  7 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Jens Axboe

Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++++
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 7c79ff5..6ccdfbb 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1046,6 +1046,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!was_dirty) {
 			struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
+			struct backing_dev_info *bdi = wb->bdi;
+
+			if (bdi_cap_writeback_dirty(bdi) &&
+			    !test_bit(BDI_registered, &bdi->state)) {
+				WARN_ON(1);
+				printk("bdi-%s not registered\n", bdi->name);
+			}
 
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_list, &wb->b_dirty);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2f218b7..f169bcb 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -29,6 +29,7 @@ enum bdi_state {
 	BDI_wb_alloc,		/* Default embedded wb allocated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
+	BDI_registered,		/* bdi_register() was done */
 	BDI_unused,		/* Available bits start here */
 };
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5cb32c5..8629ea8 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -465,6 +465,11 @@ void static bdi_add_default_flusher_task(struct backing_dev_info *bdi)
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
+	if (WARN_ON(!test_bit(BDI_registered, &bdi->state))) {
+		printk("bdi %p/%s is not registered!\n", bdi, bdi->name);
+		return;
+	}
+
 	/*
 	 * Check with the helper whether to proceed adding a task. Will only
 	 * abort if we two or more simultanous calls to
@@ -528,6 +533,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	}
 
 	bdi_debug_register(bdi, dev_name(dev));
+	set_bit(BDI_registered, &bdi->state);
 exit:
 	return ret;
 }
-- 
1.6.4.1.207.g68ea
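
The new warning fires when a writeback-capable bdi has inodes dirtied
against it before bdi_register() has run.  For drivers that embed their
own backing_dev_info, the expected ordering is roughly the sketch below;
everything named example_* is made up for illustration, only bdi_init(),
bdi_register() and bdi_destroy() are the existing API:

	/* sketch only: hypothetical driver-private bdi setup order */
	static struct backing_dev_info example_bdi = {
		.name		= "example",
		.capabilities	= BDI_CAP_MAP_COPY,
	};

	static int example_bdi_setup(void)
	{
		int err;

		err = bdi_init(&example_bdi);
		if (err)
			return err;

		/*
		 * Register before any inode backed by this bdi can be
		 * dirtied, otherwise the new WARN_ON in
		 * __mark_inode_dirty() triggers.
		 */
		err = bdi_register(&example_bdi, NULL, "example");
		if (err)
			bdi_destroy(&example_bdi);
		return err;
	}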


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
                   ` (6 preceding siblings ...)
  2009-09-08  9:23 ` [PATCH 7/8] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
@ 2009-09-08  9:23 ` Jens Axboe
  2009-09-08 10:37     ` Artem Bityutskiy
  7 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08  9:23 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack
  Cc: Theodore Ts'o, Jens Axboe

From: Theodore Ts'o <tytso@mit.edu>

Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
concern of not holding I_SYNC for too long.  (At least, that was the
comment previously.)  This doesn't make sense now because the only
time we wait for I_SYNC is if we are calling sync or fsync, and in
that case we need to write out all of the data anyway.  Previously
there may have been other code paths that waited on I_SYNC, but not
any more.

According to Christoph, the current writeback size is way too small,
and XFS had a hack that bumped out nr_to_write to four times the value
sent by the VM to be able to saturate medium-sized RAID arrays.  This
value was also problematic for ext4, as it caused large files to
become interleaved on disk in 8 megabyte chunks (we bumped up the
nr_to_write by a factor of two).

So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
max_writeback_mb, and set it to a default value of 128 megabytes.

http://bugzilla.kernel.org/show_bug.cgi?id=13930

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c         |    9 +--------
 include/linux/writeback.h |    1 +
 kernel/sysctl.c           |    8 ++++++++
 mm/page-writeback.c       |    6 ++++++
 4 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6ccdfbb..7800798 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -637,14 +637,7 @@ void writeback_inodes_wbc(struct writeback_control *wbc)
 	writeback_inodes_wb(&bdi->wb, wbc);
 }
 
-/*
- * The maximum number of pages to writeout in a single bdi flush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES     1024
+#define MAX_WRITEBACK_PAGES	(max_writeback_mb << (20 - PAGE_SHIFT))
 
 static inline bool over_bground_thresh(void)
 {
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 78b1e46..fbed759 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -104,6 +104,7 @@ extern int vm_dirty_ratio;
 extern unsigned long vm_dirty_bytes;
 extern unsigned int dirty_writeback_interval;
 extern unsigned int dirty_expire_interval;
+extern unsigned int max_writeback_mb;
 extern int vm_highmem_is_dirtyable;
 extern int block_dump;
 extern int laptop_mode;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 58be760..315fc30 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1104,6 +1104,14 @@ static struct ctl_table vm_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "max_writeback_mb",
+		.data		= &max_writeback_mb,
+		.maxlen		= sizeof(max_writeback_mb),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= VM_NR_PDFLUSH_THREADS,
 		.procname	= "nr_pdflush_threads",
 		.data		= &nr_pdflush_threads,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 25e7770..7f821e6 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -55,6 +55,12 @@ static inline long sync_writeback_pages(void)
 /* The following parameters are exported via /proc/sys/vm */
 
 /*
+ * The maximum amount of memory (in megabytes) to write out in a
+ * single bdflush/kupdate operation.
+ */
+unsigned int max_writeback_mb = 128;
+
+/*
  * Start background writeback (via pdflush) at this percentage
  */
 int dirty_background_ratio = 10;
-- 
1.6.4.1.207.g68ea
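
To put the new default in concrete terms, assuming the common 4 KB page
size (PAGE_SHIFT == 12):

	old hard-coded limit:  1024 pages                      =   4 MB per pass
	new default:           128 << (20 - 12) = 32768 pages  = 128 MB per pass

The value can be changed at runtime through /proc/sys/vm/max_writeback_mb
(sysctl vm.max_writeback_mb); that is all the ctl_table entry above
registers, no other interface is assumed here.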


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
@ 2009-09-08 10:27     ` Artem Bityutskiy
  0 siblings, 0 replies; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-08 10:27 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

Hi Jens,

On 09/08/2009 12:23 PM, Jens Axboe wrote:
>   	int i, err;
>   	struct ubifs_info *c = sb->s_fs_info;
> -	struct writeback_control wbc = {
> -		.sync_mode   = WB_SYNC_ALL,
> -		.range_start = 0,
> -		.range_end   = LLONG_MAX,
> -		.nr_to_write = LONG_MAX,
> -	};
>
>   	/*
>   	 * Zero @wait is just an advisory thing to help the file system shove
> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>   	 * the user be able to get more accurate results of 'statfs()' after
>   	 * they synchronize the file system.
>   	 */
> -	generic_sync_sb_inodes(sb,&wbc);
> +	sync_inodes_sb(sb);

This call is unnecessary and I've removed it and the patch is sitting in
linux-next for long time:
http://git.infradead.org/ubifs-2.6.git/commit/887ee17117fd23e962332b353d250ac9e090b20f

Stephen e-mailed about the conflict recently. Could we please resolve the
conflict? I guess if you pick up my patch then git will be able to resolve
stuff automatically.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08  9:23 ` [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb Jens Axboe
@ 2009-09-08 10:37     ` Artem Bityutskiy
  0 siblings, 0 replies; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-08 10:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
	Theodore Ts'o

Hi,

On 09/08/2009 12:23 PM, Jens Axboe wrote:
> From: Theodore Ts'o<tytso@mit.edu>
>
> Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> concern of not holding I_SYNC for too long.  (At least, that was the
> comment previously.)  This doesn't make sense now because the only
> time we wait for I_SYNC is if we are calling sync or fsync, and in
> that case we need to write out all of the data anyway.  Previously
> there may have been other code paths that waited on I_SYNC, but not
> any more.
>
> According to Christoph, the current writeback size is way too small,
> and XFS had a hack that bumped out nr_to_write to four times the value
> sent by the VM to be able to saturate medium-sized RAID arrays.  This
> value was also problematic for ext4 as well, as it caused large files
> to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> the nr_to_write by a factor of two).
>
> So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> max_writeback_mb, and set it to a default value of 128 megabytes.
>
> http://bugzilla.kernel.org/show_bug.cgi?id=13930
>
> Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> Signed-off-by: Jens Axboe<jens.axboe@oracle.com>

Would be nice to update doc files like

Documentation/sysctl/vm.txt
Documentation/filesystems/proc.txt

as well.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 10:27     ` Artem Bityutskiy
  (?)
@ 2009-09-08 10:41     ` Jens Axboe
  2009-09-08 10:52       ` Artem Bityutskiy
  -1 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08 10:41 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, Sep 08 2009, Artem Bityutskiy wrote:
> Hi Jens,
>
> On 09/08/2009 12:23 PM, Jens Axboe wrote:
>>   	int i, err;
>>   	struct ubifs_info *c = sb->s_fs_info;
>> -	struct writeback_control wbc = {
>> -		.sync_mode   = WB_SYNC_ALL,
>> -		.range_start = 0,
>> -		.range_end   = LLONG_MAX,
>> -		.nr_to_write = LONG_MAX,
>> -	};
>>
>>   	/*
>>   	 * Zero @wait is just an advisory thing to help the file system shove
>> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>>   	 * the user be able to get more accurate results of 'statfs()' after
>>   	 * they synchronize the file system.
>>   	 */
>> -	generic_sync_sb_inodes(sb,&wbc);
>> +	sync_inodes_sb(sb);
>
> This call is unnecessary and I've removed it and the patch is sitting in
> linux-next for long time:
> http://git.infradead.org/ubifs-2.6.git/commit/887ee17117fd23e962332b353d250ac9e090b20f
>
> Stephen e-mailed about the conflict recently. Could we please resolve the
> conflict? I guess if you pick up my patch then git will be able to resolve
> stuff automatically.

Would seem weird for me to carry your patch. As the issue is resolved in
-next, I'd say we just let whomever gets to merge last resolve it at
their end.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 10:41     ` Jens Axboe
@ 2009-09-08 10:52       ` Artem Bityutskiy
  2009-09-08 10:57         ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-08 10:52 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, 2009-09-08 at 12:41 +0200, Jens Axboe wrote:
> On Tue, Sep 08 2009, Artem Bityutskiy wrote:
> > Hi Jens,
> >
> > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> >>   	int i, err;
> >>   	struct ubifs_info *c = sb->s_fs_info;
> >> -	struct writeback_control wbc = {
> >> -		.sync_mode   = WB_SYNC_ALL,
> >> -		.range_start = 0,
> >> -		.range_end   = LLONG_MAX,
> >> -		.nr_to_write = LONG_MAX,
> >> -	};
> >>
> >>   	/*
> >>   	 * Zero @wait is just an advisory thing to help the file system shove
> >> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
> >>   	 * the user be able to get more accurate results of 'statfs()' after
> >>   	 * they synchronize the file system.
> >>   	 */
> >> -	generic_sync_sb_inodes(sb,&wbc);
> >> +	sync_inodes_sb(sb);
> >
> > This call is unnecessary and I've removed it and the patch is sitting in
> > linux-next for long time:
> > http://git.infradead.org/ubifs-2.6.git/commit/887ee17117fd23e962332b353d250ac9e090b20f
> >
> > Stephen e-mailed about the conflict recently. Could we please resolve the
> > conflict? I guess if you pick up my patch then git will be able to resolve
> > stuff automatically.
> 
> Would seem weird for me to carry your patch. As the issue is resolved in
> -next, I'd say we just let whomever gets to merge last resolve it at
> their end.

That's Linus. Do you think it is nice to send him a pull request which
for sure requires manual work?

But well, if you do not want to carry my patch, then I'll have to
re-base my tree later, fix stuff, and send a pull request. I mean,
your stuff will for sure be merged first, because I send pull requests
late, just because UBIFS is a minor thing in the kernel.

:-(

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 10:52       ` Artem Bityutskiy
@ 2009-09-08 10:57         ` Jens Axboe
  2009-09-08 11:01             ` Artem Bityutskiy
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08 10:57 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, Sep 08 2009, Artem Bityutskiy wrote:
> On Tue, 2009-09-08 at 12:41 +0200, Jens Axboe wrote:
> > On Tue, Sep 08 2009, Artem Bityutskiy wrote:
> > > Hi Jens,
> > >
> > > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > >>   	int i, err;
> > >>   	struct ubifs_info *c = sb->s_fs_info;
> > >> -	struct writeback_control wbc = {
> > >> -		.sync_mode   = WB_SYNC_ALL,
> > >> -		.range_start = 0,
> > >> -		.range_end   = LLONG_MAX,
> > >> -		.nr_to_write = LONG_MAX,
> > >> -	};
> > >>
> > >>   	/*
> > >>   	 * Zero @wait is just an advisory thing to help the file system shove
> > >> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
> > >>   	 * the user be able to get more accurate results of 'statfs()' after
> > >>   	 * they synchronize the file system.
> > >>   	 */
> > >> -	generic_sync_sb_inodes(sb,&wbc);
> > >> +	sync_inodes_sb(sb);
> > >
> > > This call is unnecessary and I've removed it and the patch is sitting in
> > > linux-next for long time:
> > > http://git.infradead.org/ubifs-2.6.git/commit/887ee17117fd23e962332b353d250ac9e090b20f
> > >
> > > Stephen e-mailed about the conflict recently. Could we please resolve the
> > > conflict? I guess if you pick up my patch then git will be able to resolve
> > > stuff automatically.
> > 
> > Would seem weird for me to carry your patch. As the issue is resolved in
> > -next, I'd say we just let whomever gets to merge last resolve it at
> > their end.
> 
> That's Linus. Do you think it is nice to send him a pull request which
> for sure requires requires manual work?

No, that's not what I wrote, you should never send Linus a pull request
that requires manual merging. One of us gets to resolve it, depending on
who gets to send the later pull request.

> But well, if you do not want to carry my patch, then I'll have to
> re-base my tree later, fix stuff, and send a pull request. I mean,
> your stuff will for sure be merged first, because I send pull requests
> late, just because UBIFS is a minor thing in the kernel.

You don't have to rebase, if my work is merged first then you just merge
Linus' tree into yours and fixup the conflict before asking Linus to
pull.

It's a trivial conflict, I don't understand what the fuzz is about.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 10:57         ` Jens Axboe
@ 2009-09-08 11:01             ` Artem Bityutskiy
  0 siblings, 0 replies; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-08 11:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On 09/08/2009 01:57 PM, Jens Axboe wrote:
>> But well, if you do not want to carry my patch, then I'll have to
>> re-base my tree later, fix stuff, and send a pull request. I mean,
>> your stuff will for sure be merged first, because I send pull requests
>> late, just because UBIFS is a minor thing in the kernel.
>
> You don't have to rebase, if my work is merged first then you just merge
> Linus' tree into yours and fixup the conflict before asking Linus to
> pull.

I thought Linus asked to avoid merge commits in pull requests at some
point, no? So I thought that I'd re-base then, which Linus also
dislikes :-)

> It's a trivial conflict, I don't understand what the fuzz is about.

Well, I was just thinking how to avoid merge commits. But I guess I can
do what you suggest.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 11:01             ` Artem Bityutskiy
  (?)
@ 2009-09-08 11:05             ` Jens Axboe
  2009-09-08 11:31                 ` Artem Bityutskiy
  -1 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-08 11:05 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, Sep 08 2009, Artem Bityutskiy wrote:
> On 09/08/2009 01:57 PM, Jens Axboe wrote:
>>> But well, if you do not want to carry my patch, then I'll have to
>>> re-base my tree later, fix stuff, and send a pull request. I mean,
>>> your stuff will for sure be merged first, because I send pull requests
>>> late, just because UBIFS is a minor thing in the kernel.
>>
>> You don't have to rebase, if my work is merged first then you just merge
>> Linus' tree into yours and fixup the conflict before asking Linus to
>> pull.
>
> I thought Linus asked to avoid merge commits in pull requests at some
> point, no? So I thought that I'd re-base then, which Linus also
> dislikes :-)

Pointless merges are discouraged, but this one isn't pointless since it
resolves a conflict. If you rebase, Linus will flame your ass to a
crisp, I know from personal experience :-)

>> It's a trivial conflict, I don't understand what the fuzz is about.
>
> Well, I was just thinking how to avoid merge commits. But I guess I can
> do what you suggest.

Just don't worry about it, things are fine as-is. When the conflict
happens with the mainline tree, fix it up. If it was a more involved
dependency chain, we could do something more about it. But for this, I'd
say it's a lot more work to attempt to "fix" something that is a 10s
merge issue at pull time.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-08 11:05             ` Jens Axboe
@ 2009-09-08 11:31                 ` Artem Bityutskiy
  0 siblings, 0 replies; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-08 11:31 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On 09/08/2009 02:05 PM, Jens Axboe wrote:
> Just don't worry about it, things are fine as-is. When the conflict
> happens with the mainline tree, fix it up. If it was a more involved
> depdency chain, we could do something more about it. But for this, I'd
> say it's a lot more work to attempt to "fix" something that is a 10s
> merge issue at pull time.

All right, I'll do this, thanks.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/8] writeback: switch to per-bdi threads for flushing data
  2009-09-08  9:23 ` [PATCH 3/8] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-09-08 13:46   ` Daniel Walker
  2009-09-08 14:21     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Daniel Walker @ 2009-09-08 13:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, 2009-09-08 at 11:23 +0200, Jens Axboe wrote:
> This gets rid of pdflush for bdi writeout and kupdated style cleaning.
> pdflush writeout suffers from lack of locality and also requires more
> threads to handle the same workload, since it has to work in a
> non-blocking fashion against each queue. This also introduces lumpy
> behaviour and potential request starvation, since pdflush can be starved
> for queue access if others are accessing it. A sample ffsb workload that
> does random writes to files is about 8% faster here on a simple SATA drive
> during the benchmark phase. File layout also seems a LOT more smooth in
> vmstat:


This patch has a checkpatch error and a couple of warnings. Here's one
of the warnings which I thought was concerning:

WARNING: trailing semicolon indicates no statements, indent implies
otherwise
#388: FILE: fs/fs-writeback.c:177:
+               } else if (wb->task);
+                       wake_up_process(wb->task);

I suppose that could be a defect .. btw, patch 7 of 8 also has a few
trivial warnings.

Daniel


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 3/8] writeback: switch to per-bdi threads for flushing data
  2009-09-08 13:46   ` Daniel Walker
@ 2009-09-08 14:21     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-08 14:21 UTC (permalink / raw)
  To: Daniel Walker
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack

On Tue, Sep 08 2009, Daniel Walker wrote:
> On Tue, 2009-09-08 at 11:23 +0200, Jens Axboe wrote:
> > This gets rid of pdflush for bdi writeout and kupdated style cleaning.
> > pdflush writeout suffers from lack of locality and also requires more
> > threads to handle the same workload, since it has to work in a
> > non-blocking fashion against each queue. This also introduces lumpy
> > behaviour and potential request starvation, since pdflush can be starved
> > for queue access if others are accessing it. A sample ffsb workload that
> > does random writes to files is about 8% faster here on a simple SATA drive
> > during the benchmark phase. File layout also seems a LOT more smooth in
> > vmstat:
> 
> 
> This patch has a checkpatch error, and couple of warnings.. Here's one
> of the warnings which I though was concerning..
> 
> WARNING: trailing semicolon indicates no statements, indent implies
> otherwise
> #388: FILE: fs/fs-writeback.c:177:
> +               } else if (wb->task);
> +                       wake_up_process(wb->task);
> 
> I suppose that could be a defect .. btw, patch 7 of 8 also has a few
> trivial warnings.

Oops yes, that was added between -v18 and -19 with the moving of that
code. Will fix that up, thanks for spotting that.

I'll check the series for checkpatch cleanliness. I did at some point,
but that was a few revisions ago.

-- 
Jens Axboe
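
The warning Daniel quotes boils down to a stray semicolon terminating
the else-if branch; the intended code is presumably just:

	} else if (wb->task)
		wake_up_process(wb->task);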


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 10:37     ` Artem Bityutskiy
  (?)
@ 2009-09-08 16:06     ` Peter Zijlstra
  2009-09-08 16:29       ` Chris Mason
  -1 siblings, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 16:06 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, david, hch,
	akpm, jack, Theodore Ts'o

On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> Hi,
> 
> On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > From: Theodore Ts'o<tytso@mit.edu>
> >
> > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > concern of not holding I_SYNC for too long.  (At least, that was the
> > comment previously.)  This doesn't make sense now because the only
> > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > that case we need to write out all of the data anyway.  Previously
> > there may have been other code paths that waited on I_SYNC, but not
> > any more.
> >
> > According to Christoph, the current writeback size is way too small,
> > and XFS had a hack that bumped out nr_to_write to four times the value
> > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > value was also problematic for ext4 as well, as it caused large files
> > to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> > the nr_to_write by a factor of two).
> >
> > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > max_writeback_mb, and set it to a default value of 128 megabytes.
> >
> > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> >
> > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> 
> Would be nice to update doc files like
> 
> Documentation/sysctl/vm.txt
> Documentation/filesystems/proc.txt

I'm still not convinced this knob is worth the patch and I'm inclined to
flat out NAK it..

The whole point of MAX_WRITEBACK_PAGES seems to be to occasionally check the
dirty stats again and not write out too much.

Clearly the current limit isn't sufficient for some people,
 - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
congestion_wait()
 - ext4 generates inconveniently small extents


The first seems to suggest to me the number isn't well balanced against
whatever drives congestion_wait() (that thing still gives me a
head-ache).

# git grep clear_bdi_congested
drivers/block/pktcdvd.c:                clear_bdi_congested(&pd->disk->queue->backing_dev_info,
fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
fs/nfs/write.c:         clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
include/linux/backing-dev.h:void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
include/linux/blkdev.h: clear_bdi_congested(&q->backing_dev_info, sync);
mm/backing-dev.c:void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
mm/backing-dev.c:EXPORT_SYMBOL(clear_bdi_congested);

Suggests that regular block devices don't even manage device congestion
and it reverts to a simple timeout -- should we fix that?

Now, suppose it were to do something useful, I'd think we'd want to
limit write-out to whatever it takes to saturate the BDI.


As to the extents, shouldn't ext4 allocate extents based on the amount
of dirty pages in the file instead of however much we're going to write
out now?



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:06     ` Peter Zijlstra
@ 2009-09-08 16:29       ` Chris Mason
  2009-09-08 16:56         ` Peter Zijlstra
                           ` (2 more replies)
  0 siblings, 3 replies; 76+ messages in thread
From: Chris Mason @ 2009-09-08 16:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o

On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> > Hi,
> > 
> > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > > From: Theodore Ts'o<tytso@mit.edu>
> > >
> > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > > concern of not holding I_SYNC for too long.  (At least, that was the
> > > comment previously.)  This doesn't make sense now because the only
> > > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > > that case we need to write out all of the data anyway.  Previously
> > > there may have been other code paths that waited on I_SYNC, but not
> > > any more.
> > >
> > > According to Christoph, the current writeback size is way too small,
> > > and XFS had a hack that bumped out nr_to_write to four times the value
> > > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > > value was also problematic for ext4 as well, as it caused large files
> > > to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> > > the nr_to_write by a factor of two).
> > >
> > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > > max_writeback_mb, and set it to a default value of 128 megabytes.
> > >
> > > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> > >
> > > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> > 
> > Would be nice to update doc files like
> > 
> > Documentation/sysctl/vm.txt
> > Documentation/filesystems/proc.txt
> 
> I'm still not convinced this knob is worth the patch and I'm inclined to
> flat out NAK it..
> 
> The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> dirty stats again and not write out too much.

The problem is that 'too much' is a very abstract thing.  When a process
is stuck in balance_dirty_pages, we want them to do the minimal amount
of work (or waiting) required to get them safely back inside file_write().

> 
> Clearly the current limit isn't sufficient for some people,
>  - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> congestion_wait()
>  - ext4 generates inconveniently small extents

This is actually two different sides of the same problem.  The filesystem
knows that bytes 0-N in the file are setup for delayed allocation.
Writepage is called on byte 0, and now the filesystem gets to decide how
big an extent to make.

It could decide to make an extent based on the total number of bytes
under delayed allocation, and hope the caller of writepage will be kind
enough to send down the pages contiguously afterward (xfs), or it could
make a smaller extent based on something closer to the total number of
bytes this particular writepages() call plans on writing (I guess what
ext4 is doing).
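
In rough pseudo-C the two sizing strategies amount to the lines below.
This is illustrative only: total_delalloc_bytes() is a made-up helper
and neither line is lifted from xfs or ext4.

	/* ->writepage(s) starts at byte 0 of a file with delalloc data */

	/* xfs-ish: size for everything outstanding, hope the remaining
	 * pages are sent down contiguously afterwards */
	len = total_delalloc_bytes(inode);

	/* roughly ext4: size for what this writeback pass will push */
	len = (loff_t)wbc->nr_to_write << PAGE_CACHE_SHIFT;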

Either way, if pdflush or the bdi thread or whoever ends up switching to
another file during a big streaming write, the end result is that we
fragment.  We may fragment the file (ext4) or we may fragment the
writeback (xfs), but the end result isn't good.

Looking at two xfs examples, this is the IO for two concurrent streaming
writers (two different files) on 2.6.31-rc8 (pdflush is doing all the IO
in this graph, sorry the legend colors wrapped on me).  If you squint,
you can kind of see the fingers of IO as pdflush switches between files.

http://oss.oracle.com/~mason/seekwatcher/xfs-tag.png

And here is the IO when XFS forces nr_to_write much higher with a patch
from Christoph:

http://oss.oracle.com/~mason/seekwatcher/xfs-extend-tag.png

These graphs would look the same no matter what I did with
congestion_wait().  The first graph is slower just because pdflush
switches from one file to another.

> 
> 
> The first seems to suggest to me the number isn't well balanced against
> whatever drives congestion_wait() (that thing still gives me a
> head-ache).
> 
> # git grep clear_bdi_congested
> drivers/block/pktcdvd.c:                clear_bdi_congested(&pd->disk->queue->backing_dev_info,
> fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> fs/nfs/write.c:         clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> include/linux/backing-dev.h:void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> include/linux/blkdev.h: clear_bdi_congested(&q->backing_dev_info, sync);
> mm/backing-dev.c:void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> mm/backing-dev.c:EXPORT_SYMBOL(clear_bdi_congested);
> 
> Suggests that regular block devices don't even manage device congestion
> and it reverts to a simple timeout -- should we fix that?

Look for blk_clear_queue_congested().  It is managed, I personally don't
think it is very useful.  But, that's a different thread ;)
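
For reference, the block layer wrappers are thin shims over the bdi
congestion helpers; in 2.6.31-era blkdev.h they look roughly like this
(sketch, not a verbatim copy):

	static inline void blk_set_queue_congested(struct request_queue *q,
						   int sync)
	{
		set_bdi_congested(&q->backing_dev_info, sync);
	}

	static inline void blk_clear_queue_congested(struct request_queue *q,
						     int sync)
	{
		clear_bdi_congested(&q->backing_dev_info, sync);
	}

The request allocation path flips these as the queue fills and drains,
which is the "managed" part; the open question in this sub-thread is
whether balance_dirty_pages() makes good use of that signal.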

> 
> Now, suppose it were to do something useful, I'd think we'd want to
> limit write-out to whatever it takes so saturate the BDI.

If we don't want a blanket increase, I'd suggest that we just give the
FS a way to say: 'I know nr_to_write is only 32, but if you just write a
few blocks more, the system will be better off'.

Something like wbc->fs_write_hint

This way, when the FS allocates a great big contiguous delalloc extent,
it can set the wbc to reflect that we've got cheap and easy IO here.
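
A minimal sketch of what such a hint could look like; nothing below
exists in the tree, fs_write_hint is just the name floated above and
nr_contiguous_pages is hypothetical:

	/* fs side: set when a large contiguous delalloc extent is created */
	wbc->fs_write_hint = nr_contiguous_pages;

	/* writeback side: before giving up on this inode */
	if (wbc->nr_to_write <= 0 && wbc->fs_write_hint > 0)
		wbc->nr_to_write = wbc->fs_write_hint;	/* keep the cheap IO going */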

> 
> 
> As to the extends, shouldn't ext4 allocate extends based on the amount
> of dirty pages in the file instead of however much we're going to write
> out now?

It probably does a mixture of both.

-chris


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:29       ` Chris Mason
@ 2009-09-08 16:56         ` Peter Zijlstra
  2009-09-08 17:28           ` Chris Mason
  2009-09-09  1:53           ` Dave Chinner
  2009-09-08 18:06           ` Theodore Tso
  2009-09-09  9:29           ` Wu Fengguang
  2 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 16:56 UTC (permalink / raw)
  To: Chris Mason
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, 2009-09-08 at 12:29 -0400, Chris Mason wrote:

> > I'm still not convinced this knob is worth the patch and I'm inclined to
> > flat out NAK it..
> > 
> > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> > dirty stats again and not write out too much.
> 
> The problem is that 'too much' is a very abstract thing.  When a process
> is stuck in balance_dirty_pages, we want them to do the minimal amount
> of work (or waiting) required to get them safely back inside file_write().

From the VM's POV I think we'd like to keep near the dirty limit as that
maximizes the write cache efficiency. Of course that needs to be
balanced against write out efficiency.

> > Clearly the current limit isn't sufficient for some people,
> >  - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> > congestion_wait()
> >  - ext4 generates inconveniently small extents
> 
> This is actually two different side of the same problem.  The filesystem
> knows that bytes 0-N in the file are setup for delayed allocation.
> Writepage is called on byte 0, and now the filesystem gets to decide how
> big an extent to make.
> 
> It could decide to make an extent based on the total number of bytes
> under delayed allocation, and hope the caller of writepage will be kind
> enough to send down the pages contiguously afterward (xfs), or it could
> make a smaller extent based on something closer to the total number of
> bytes this particular writepages() call plans on writing (I guess what
> ext4 is doing).
> 
> Either way, if pdflush or the bdi thread or whoever ends up switching to
> another file during a big streaming write, the end result is that we
> fragment.  We may fragment the file (ext4) or we may fragment the
> writeback (xfs), but the end result isn't good.

OK, so what we want is a way to re-enter the whole
writeback_inodes() path onto the same file, right?

That would result in the writeback continuing where it left off last.

Wu, can we make writeback_inodes() do something like that? Pass some
magic along in wbc maybe?

> Looking at two xfs examples, this is the IO for two concurrent streaming
> writers (two different files) on 2.6.31-rc8 (pdflush is doing all the IO
> in this graph, sorry the legend colors wrapped on me).  If you squint,
> you can kind of see the fingers of IO as pdflush switches between files.
> 
> http://oss.oracle.com/~mason/seekwatcher/xfs-tag.png
> 
> And here is the IO when XFS forces nr_to_write much higher with a patch
> from Christoph:
> 
> http://oss.oracle.com/~mason/seekwatcher/xfs-extend-tag.png
> 
> These graphs would look the same no matter what I did with
> congestion_wait().  The first graph is slower just because pdflush
> switches from one file to another.
> 
> > 
> > 
> > The first seems to suggest to me the number isn't well balanced against
> > whatever drives congestion_wait() (that thing still gives me a
> > head-ache).
> > 
> > # git grep clear_bdi_congested
> > drivers/block/pktcdvd.c:                clear_bdi_congested(&pd->disk->queue->backing_dev_info,
> > fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> > fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> > fs/nfs/write.c:         clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > include/linux/backing-dev.h:void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> > include/linux/blkdev.h: clear_bdi_congested(&q->backing_dev_info, sync);
> > mm/backing-dev.c:void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > mm/backing-dev.c:EXPORT_SYMBOL(clear_bdi_congested);
> > 
> > Suggests that regular block devices don't even manage device congestion
> > and it reverts to a simple timeout -- should we fix that?
> 
> Look for blk_clear_queue_congested().  It is managed, I personally don't
> think it is very useful.  But, that's a different thread ;)

Ah, how blind I am ;-)

Right, so what can we do to make it useful? I think the intent is to
limit the number of pages in writeback and provide some progress
feedback to the vm.

Going by your experience we're failing there.

> > Now, suppose it were to do something useful, I'd think we'd want to
> > limit write-out to whatever it takes so saturate the BDI.
> 
> If we don't want a blanket increase, 

The thing is, this sysctl seems an utter cop out, we can't even explain
how to calculate a number that'll work for a situation, the best we can
do is say, prod at it and pray -- that's not good.

Last time I also asked if an increased number is good for every
situation, I have a machine with a RAID5 array and USB storage, will it
harm either situation?

> I'd suggest that we just give the
> FS a way to say: 'I know nr_to_write is only 32, but if you just write a
> few blocks more, the system will be better off'.
> 
> Something like wbc->fs_write_hint
> 
> This way, when the FS allocates a great big contiguous delalloc extent,
> it can set the wbc to reflect that we've got cheap and easy IO here.

I think that's certainly a possibility.

What's the down-side of allocating extents based on the available dirty
pages instead of the current write-out request? As long as we're good at
generating sequential IO in general (yeah, I know we suck now) it
doesn't really matter when it will be filled, as we know it will
eventually be.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:56         ` Peter Zijlstra
@ 2009-09-08 17:28           ` Chris Mason
  2009-09-08 17:46             ` Peter Zijlstra
  2009-09-09  1:53           ` Dave Chinner
  1 sibling, 1 reply; 76+ messages in thread
From: Chris Mason @ 2009-09-08 17:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, Sep 08, 2009 at 06:56:23PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 12:29 -0400, Chris Mason wrote:
> 
> > > I'm still not convinced this knob is worth the patch and I'm inclined to
> > > flat out NAK it..
> > > 
> > > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> > > dirty stats again and not write out too much.
> > 
> > The problem is that 'too much' is a very abstract thing.  When a process
> > is stuck in balance_dirty_pages, we want them to do the minimal amount
> > of work (or waiting) required to get them safely back inside file_write().
> 
> From the VM's POV I think we'd like to keep near the dirty limit as that
> maximizes the write cache efficiency. Of course that needs to be
> balanced against write out efficiency.
> 
> > > Clearly the current limit isn't sufficient for some people,
> > >  - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> > > congestion_wait()
> > >  - ext4 generates inconveniently small extents
> > 
> > This is actually two different side of the same problem.  The filesystem
> > knows that bytes 0-N in the file are setup for delayed allocation.
> > Writepage is called on byte 0, and now the filesystem gets to decide how
> > big an extent to make.
> > 
> > It could decide to make an extent based on the total number of bytes
> > under delayed allocation, and hope the caller of writepage will be kind
> > enough to send down the pages contiguously afterward (xfs), or it could
> > make a smaller extent based on something closer to the total number of
> > bytes this particular writepages() call plans on writing (I guess what
> > ext4 is doing).
> > 
> > Either way, if pdflush or the bdi thread or whoever ends up switching to
> > another file during a big streaming write, the end result is that we
> > fragment.  We may fragment the file (ext4) or we may fragment the
> > writeback (xfs), but the end result isn't good.
> 
> OK, so what we want is for a way to re-enter the whole
> writeback_inodes() path onto the same file, right?

It would help.

> 
> That would result in the writeback continuing where it left off last.
> 
> Wu, can we make writeback_inodes() do something like that? Pass some
> magic along in wbc maybe?
> 
> > Looking at two xfs examples, this is the IO for two concurrent streaming
> > writers (two different files) on 2.6.31-rc8 (pdflush is doing all the IO
> > in this graph, sorry the legend colors wrapped on me).  If you squint,
> > you can kind of see the fingers of IO as pdflush switches between files.
> > 
> > http://oss.oracle.com/~mason/seekwatcher/xfs-tag.png
> > 
> > And here is the IO when XFS forces nr_to_write much higher with a patch
> > from Christoph:
> > 
> > http://oss.oracle.com/~mason/seekwatcher/xfs-extend-tag.png
> > 
> > These graphs would look the same no matter what I did with
> > congestion_wait().  The first graph is slower just because pdflush
> > switches from one file to another.
> > 
> > > 
> > > 
> > > The first seems to suggest to me the number isn't well balanced against
> > > whatever drives congestion_wait() (that thing still gives me a
> > > head-ache).
> > > 
> > > # git grep clear_bdi_congested
> > > drivers/block/pktcdvd.c:                clear_bdi_congested(&pd->disk->queue->backing_dev_info,
> > > fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_SYNC);
> > > fs/fuse/dev.c:                  clear_bdi_congested(&fc->bdi, BLK_RW_ASYNC);
> > > fs/nfs/write.c:         clear_bdi_congested(&nfss->backing_dev_info, BLK_RW_ASYNC);
> > > include/linux/backing-dev.h:void clear_bdi_congested(struct backing_dev_info *bdi, int sync);
> > > include/linux/blkdev.h: clear_bdi_congested(&q->backing_dev_info, sync);
> > > mm/backing-dev.c:void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
> > > mm/backing-dev.c:EXPORT_SYMBOL(clear_bdi_congested);
> > > 
> > > Suggests that regular block devices don't even manage device congestion
> > > and it reverts to a simple timeout -- should we fix that?
> > 
> > Look for blk_clear_queue_congested().  It is managed, I personally don't
> > think it is very useful.  But, that's a different thread ;)
> 
> Ah, how blind I am ;-)
> 
> Right, so what can we do to make it useful? I think the intent is to
> limit the number of pages in writeback and provide some progress
> feedback to the vm.
> 
> Going by your experience we're failing there.

Well, congestion_wait is a stop sign but not a queue.  So, if you're
being nice and honoring congestion but another process (say O_DIRECT
random writes) doesn't, then you back off forever and none of your IO
gets done.

To get around this, you can add code to make sure that you do
_some_ io, but this isn't enough for your work to get done
quickly, and you do end up waiting in get_request() so the async
benefits of using the congestion test go away.

If we changed everyone to honor congestion, we'd end up with a poll
model, because a ton of congestion_wait() callers create a thundering herd.

So, we could add a queue, and then congestion_wait() would look a lot
like get_request_wait().  I'd rather that everyone just used
get_request_wait, and then have us fix any latency problems in the
elevator.
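
To make that concrete, here is a rough sketch of what such a per-bdi
queue could look like.  The write_waiters field and both helpers are
invented purely for illustration; nothing like them exists in the tree:

#include <linux/wait.h>
#include <linux/sched.h>
#include <linux/backing-dev.h>

/*
 * Sketch only: FIFO wait on a per-bdi queue, so throttled writers are
 * released one at a time as the device makes progress, instead of
 * everybody polling congestion_wait() timeouts.
 */
static void bdi_writer_wait(struct backing_dev_info *bdi)
{
	DEFINE_WAIT(wait);

	while (bdi_write_congested(bdi)) {
		/* exclusive waiters are woken one by one, in order */
		prepare_to_wait_exclusive(&bdi->write_waiters, &wait,
					  TASK_UNINTERRUPTIBLE);
		if (bdi_write_congested(bdi))
			io_schedule();
	}
	finish_wait(&bdi->write_waiters, &wait);
}

static void bdi_writer_done(struct backing_dev_info *bdi)
{
	/* called as IO completes, much like the request queue wakeups */
	if (waitqueue_active(&bdi->write_waiters))
		wake_up(&bdi->write_waiters);
}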

For me, perfect would be one or more threads per-bdi doing the
writeback, and never checking for congestion (like what Jens' code
does).  The congestion_wait inside balance_dirty_pages() is really just
a schedule_timeout(); on a fully loaded box the congestion doesn't go
away anyway.  We should switch that to a saner system of waiting for
progress on the bdi writeback + dirty thresholds.

Btrfs would love to be able to send down a bio non-blocking.  That would
let me get rid of the congestion check I have today (I think Jens said
that would be an easy change and then I talked him into some small mods
of the writeback path).

> 
> > > Now, suppose it were to do something useful, I'd think we'd want to
> > > limit write-out to whatever it takes so saturate the BDI.
> > 
> > If we don't want a blanket increase, 
> 
> The thing is, this sysctl seems an utter cop out, we can't even explain
> how to calculate a number that'll work for a situation, the best we can
> do is say, prod at it and pray -- that's not good.
> 
> Last time I also asked if an increased number is good for every
> situation, I have a machine with a RAID5 array and USB storage, will it
> harm either situation?

If the goal is to make sure that pdflush or balance_dirty_pages only
does IO until some condition is met, we should add a flag to the bdi
that gets set when that condition is met.  Things will go a lot more
smoothly than with magic numbers.

Then we can add the fs_hint as another change so the FS can tell
write_cache_pages callers how to do optimal IO based on its allocation
decisions.
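
As a tiny illustration of the shape this could take (the fs_write_hint
field and all three fragments below are made up; nothing like this
exists today):

/* hypothetical addition to struct writeback_control */
	long fs_write_hint;	/* FS: "IO is cheap for this many more pages" */

/* set by the filesystem in ->writepages() after carving out a big
 * delalloc extent */
	wbc->fs_write_hint = extent_remaining_pages;

/* honored by write_cache_pages(): keep going past nr_to_write while
 * the FS says the IO is still cheap */
	if (wbc->nr_to_write <= 0 && wbc->fs_write_hint <= 0)
		break;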

> 
> > I'd suggest that we just give the
> > FS a way to say: 'I know nr_to_write is only 32, but if you just write a
> > few blocks more, the system will be better off'.
> > 
> > Something like wbc->fs_write_hint
> > 
> > This way, when the FS allocates a great big contiguous delalloc extent,
> > it can set the wbc to reflect that we've got cheap and easy IO here.
> 
> I think that's certainly a possibility.
> 
> What's the down-side of allocating extents based on the available dirty
> pages instead of the current write-out request? As long as we're good at
> generating sequential IO in general (yeah, I know we suck now) it
> doesn't really matter when it will be filled, as we know it will
> eventually be.

I'm guessing the small extents from ext4 come from tuning the allocator
for writeback performance instead of anti-fragmentation.  But I'm
guessing.

-chris


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:28           ` Chris Mason
@ 2009-09-08 17:46             ` Peter Zijlstra
  2009-09-08 17:55               ` Peter Zijlstra
  2009-09-08 17:57               ` Chris Mason
  0 siblings, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 17:46 UTC (permalink / raw)
  To: Chris Mason
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > Right, so what can we do to make it useful? I think the intent is to
> > limit the number of pages in writeback and provide some progress
> > feedback to the vm.
> > 
> > Going by your experience we're failing there.
> 
> Well, congestion_wait is a stop sign but not a queue.  So, if you're
> being nice and honoring congestion but another process (say O_DIRECT
> random writes) doesn't, then you back off forever and none of your IO
> gets done.
> 
> To get around this, you can add code to make sure that you do
> _some_ io, but this isn't enough for your work to get done
> quickly, and you do end up waiting in get_request() so the async
> benefits of using the congestion test go away.
> 
> If we changed everyone to honor congestion, we end up with a poll model
> because a ton of congestion_wait() callers create a thundering herd.
> 
> So, we could add a queue, and then congestion_wait() would look a lot
> like get_request_wait().  I'd rather that everyone just used
> get_request_wait, and then have us fix any latency problems in the
> elevator.

Except you'd need to lift it to the BDI layer, because not all backing
devices are block devices.

Making it into a per-bdi queue sounds good to me though.

> For me, perfect would be one or more threads per-bdi doing the
> writeback, and never checking for congestion (like what Jens' code
> does).  The congestion_wait inside balance_dirty_pages() is really just
> a schedule_timeout(), on a fully loaded box the congestion doesn't go
> away anyway.  We should switch that to a saner system of waiting for
> progress on the bdi writeback + dirty thresholds.

Right, one of the things we could possibly do is tie into
__bdi_writeout_inc() and test levels there once every so often and then
flip a bit when we're low enough to stop writing.

> Btrfs would love to be able to send down a bio non-blocking.  That would
> let me get rid of the congestion check I have today (I think Jens said
> that would be an easy change and then I talked him into some small mods
> of the writeback path).

Won't that land us in trouble because the amount of writeback will
become unwieldy?

> > > > Now, suppose it were to do something useful, I'd think we'd want to
> > > > limit write-out to whatever it takes so saturate the BDI.
> > > 
> > > If we don't want a blanket increase, 
> > 
> > The thing is, this sysctl seems an utter cop out, we can't even explain
> > how to calculate a number that'll work for a situation, the best we can
> > do is say, prod at it and pray -- that's not good.
> > 
> > Last time I also asked if an increased number is good for every
> > situation, I have a machine with a RAID5 array and USB storage, will it
> > harm either situation?
> 
> If the goal is to make sure that pdflush or balance_dirty_pages only
> does IO until some condition is met, we should add a flag to the bdi
> that gets set when that condition is met.  Things will go a lot more
> smoothly than magic numbers.

Agreed - and from what I can make out, that really is the only goal
here.

> Then we can add the fs_hint as another change so the FS can tell
> write_cache_pages callers how to do optimal IO based on its allocation
> decisions.

I think you lost me here, but I think you mean to provide some FS
specific feedback to the generic write page routines -- whatever
works ;-)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:46             ` Peter Zijlstra
@ 2009-09-08 17:55               ` Peter Zijlstra
  2009-09-08 18:32                 ` Peter Zijlstra
  2009-09-08 18:35                 ` Chris Mason
  2009-09-08 17:57               ` Chris Mason
  1 sibling, 2 replies; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 17:55 UTC (permalink / raw)
  To: Chris Mason
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, 2009-09-08 at 19:46 +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > Right, so what can we do to make it useful? I think the intent is to
> > > limit the number of pages in writeback and provide some progress
> > > feedback to the vm.
> > > 
> > > Going by your experience we're failing there.
> > 
> > Well, congestion_wait is a stop sign but not a queue.  So, if you're
> > being nice and honoring congestion but another process (say O_DIRECT
> > random writes) doesn't, then you back off forever and none of your IO
> > gets done.
> > 
> > To get around this, you can add code to make sure that you do
> > _some_ io, but this isn't enough for your work to get done
> > quickly, and you do end up waiting in get_request() so the async
> > benefits of using the congestion test go away.
> > 
> > If we changed everyone to honor congestion, we end up with a poll model
> > because a ton of congestion_wait() callers create a thundering herd.
> > 
> > So, we could add a queue, and then congestion_wait() would look a lot
> > like get_request_wait().  I'd rather that everyone just used
> > get_request_wait, and then have us fix any latency problems in the
> > elevator.
> 
> Except you'd need to lift it to the BDI layer, because not all backing
> devices are a block device.
> 
> Making it into a per-bdi queue sounds good to me though.
> 
> > For me, perfect would be one or more threads per-bdi doing the
> > writeback, and never checking for congestion (like what Jens' code
> > does).  The congestion_wait inside balance_dirty_pages() is really just
> > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > away anyway.  We should switch that to a saner system of waiting for
> > progress on the bdi writeback + dirty thresholds.
> 
> Right, one of the things we could possibly do is tie into
> __bdi_writeout_inc() and test levels there once every so often and then
> flip a bit when we're low enough to stop writing.

I think I'm somewhat confused here though...

There are kernel threads doing writeout, and there are apps getting stuck
in balance_dirty_pages().

If we want all writeout to be done by kernel threads (bdi/pdflush-like
things) then we still need to manage the actual apps and delay them.

As things stand now, we kick pdflush into action when dirty levels are
above the background level, and start writing out from the app task when
we hit the full dirty level.

Moving all writeout to a kernel thread sounds good from the point of view
of writing linear data, but what do we make the apps wait on then?

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:46             ` Peter Zijlstra
  2009-09-08 17:55               ` Peter Zijlstra
@ 2009-09-08 17:57               ` Chris Mason
  2009-09-08 18:28                 ` Peter Zijlstra
  1 sibling, 1 reply; 76+ messages in thread
From: Chris Mason @ 2009-09-08 17:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, Sep 08, 2009 at 07:46:14PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > Right, so what can we do to make it useful? I think the intent is to
> > > limit the number of pages in writeback and provide some progress
> > > feedback to the vm.
> > > 
> > > Going by your experience we're failing there.
> > 
> > Well, congestion_wait is a stop sign but not a queue.  So, if you're
> > being nice and honoring congestion but another process (say O_DIRECT
> > random writes) doesn't, then you back off forever and none of your IO
> > gets done.
> > 
> > To get around this, you can add code to make sure that you do
> > _some_ io, but this isn't enough for your work to get done
> > quickly, and you do end up waiting in get_request() so the async
> > benefits of using the congestion test go away.
> > 
> > If we changed everyone to honor congestion, we end up with a poll model
> > because a ton of congestion_wait() callers create a thundering herd.
> > 
> > So, we could add a queue, and then congestion_wait() would look a lot
> > like get_request_wait().  I'd rather that everyone just used
> > get_request_wait, and then have us fix any latency problems in the
> > elevator.
> 
> Except you'd need to lift it to the BDI layer, because not all backing
> devices are a block device.
> 
> Making it into a per-bdi queue sounds good to me though.
> 
> > For me, perfect would be one or more threads per-bdi doing the
> > writeback, and never checking for congestion (like what Jens' code
> > does).  The congestion_wait inside balance_dirty_pages() is really just
> > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > away anyway.  We should switch that to a saner system of waiting for
> > progress on the bdi writeback + dirty thresholds.
> 
> Right, one of the things we could possibly do is tie into
> __bdi_writeout_inc() and test levels there once every so often and then
> flip a bit when we're low enough to stop writing.
> 
> > Btrfs would love to be able to send down a bio non-blocking.  That would
> > let me get rid of the congestion check I have today (I think Jens said
> > that would be an easy change and then I talked him into some small mods
> > of the writeback path).
> 
> Wont that land us into trouble because the amount of writeback will
> become unwieldy?

The btrfs usage is a little different.  I've got a pile of bios all
set up and ready for submission, and I'm trying to send them down to N
devices from one thread.  So, if a given submit_bio call is going to
block, I'd rather move on to another device.

This is really what pdflush is using congestion for too; the difference
is that I've already got the bios made.
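
Roughly the pattern I mean, with the device array and bio-list helpers
made up for illustration (this is not the actual btrfs code):

#include <linux/bio.h>
#include <linux/backing-dev.h>

/* example_dev, have_pending_bios() and next_pending_bio() are
 * placeholders for the per-device lists of already-built bios */
static void submit_premade_bios(struct example_dev *devs, int ndevs)
{
	while (have_pending_bios(devs, ndevs)) {
		int i, progress = 0;

		for (i = 0; i < ndevs; i++) {
			struct bio *bio;

			/* busy device: try the others first instead of
			 * blocking in get_request() on this one */
			if (bdi_write_congested(devs[i].bdi))
				continue;

			bio = next_pending_bio(&devs[i]);
			if (!bio)
				continue;

			submit_bio(WRITE, bio);
			progress = 1;
		}

		/* everybody is busy: wait a bit so we still make progress */
		if (!progress)
			congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
}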

> 
> > > > > Now, suppose it were to do something useful, I'd think we'd want to
> > > > > limit write-out to whatever it takes so saturate the BDI.
> > > > 
> > > > If we don't want a blanket increase, 
> > > 
> > > The thing is, this sysctl seems an utter cop out, we can't even explain
> > > how to calculate a number that'll work for a situation, the best we can
> > > do is say, prod at it and pray -- that's not good.
> > > 
> > > Last time I also asked if an increased number is good for every
> > > situation, I have a machine with a RAID5 array and USB storage, will it
> > > harm either situation?
> > 
> > If the goal is to make sure that pdflush or balance_dirty_pages only
> > does IO until some condition is met, we should add a flag to the bdi
> > that gets set when that condition is met.  Things will go a lot more
> > smoothly than magic numbers.
> 
> Agreed - and from what I can make out, that really is the only goal
> here.
> 
> > Then we can add the fs_hint as another change so the FS can tell
> > write_cache_pages callers how to do optimal IO based on its allocation
> > decisions.
> 
> I think you lost me here, but I think you mean to provide some FS
> specific feedback to the generic write page routines -- whatever
> works ;-)

Going back to the streaming writer case, pretend the FS just created a
nice fat 256MB extent out of delalloc pages, but after we wrote the first
4k, we dropped below the dirty threshold and IO is no longer "required".

It would be silly to just write 4k.  We know we have a contiguous
area 256MB long on disk and 256MB of dirty pages.  In this case, pdflush
(or Jens' bdi threads) want to write some large portion of that 256MB.

You might argue a balance_dirty_pages caller wants to return quickly,
but even then we'd want to write at least 128k.

-chris


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:29       ` Chris Mason
@ 2009-09-08 18:06           ` Theodore Tso
  2009-09-08 18:06           ` Theodore Tso
  2009-09-09  9:29           ` Wu Fengguang
  2 siblings, 0 replies; 76+ messages in thread
From: Theodore Tso @ 2009-09-08 18:06 UTC (permalink / raw)
  To: Chris Mason, Peter Zijlstra, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, jack

On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote:
> > 
> > Clearly the current limit isn't sufficient for some people,
> >  - xfs/btrfs seem generally stuck in balance_dirty_pages()'s
> > congestion_wait()
> >  - ext4 generates inconveniently small extents
> 
> This is actually two different side of the same problem.  The filesystem
> knows that bytes 0-N in the file are setup for delayed allocation.
> Writepage is called on byte 0, and now the filesystem gets to decide how
> big an extent to make.
> 
> It could decide to make an extent based on the total number of bytes
> under delayed allocation, and hope the caller of writepage will be kind
> enough to send down the pages contiguously afterward (xfs), or it could
> make a smaller extent based on something closer to the total number of
> bytes this particular writepages() call plans on writing (I guess what
> ext4 is doing).
>
> Either way, if pdflush or the bdi thread or whoever ends up switching to
> another file during a big streaming write, the end result is that we
> fragment.  We may fragment the file (ext4) or we may fragment the
> writeback (xfs), but the end result isn't good.

Yep; the question is whether we want to fragment the read operation in
the future (ext4) or the write operation now (XFS).

> > Now, suppose it were to do something useful, I'd think we'd want to
> > limit write-out to whatever it takes so saturate the BDI.
> 
> If we don't want a blanket increase, I'd suggest that we just give the
> FS a way to say: 'I know nr_to_write is only 32, but if you just write a
> few blocks more, the system will be better off'.

Well, we can mostly do this now, using the XFS hack:

      wbc->nr_to_write *= 4;

Which is another way of saying, we *know* the page writeback routines
are on crack, so we'll ignore their suggestion of how many pages to
write, and we'll try to write more than what they asked us to write.

(This wasn't a proposed change; it's in Linux 2.6 mainline already;
see fs/xfs/linux-2.6/xfs_aops.c, in xfs_vm_writepage).  The fact that
filesystems are playing games like this should be a clear indication
that things are badly broken above....

> > As to the extends, shouldn't ext4 allocate extends based on the amount
> > of dirty pages in the file instead of however much we're going to write
> > out now?
> 
> It probably does a mixture of both.

It does do a mixture, but in a fairly primitive way.  I was thinking
about writing some ugly code to more precisely determine how many
dirty-and-delayed-allocation pages exist beyond what we've currently
requested to write, but it seemed like most of the problem would be
solved simply by having the page writeback routines send more pages
down to the filesystem, instead of having the file system work around
brain damage in the VM writeback routines.

						- Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 18:06           ` Theodore Tso
  (?)
@ 2009-09-08 18:19           ` Christoph Hellwig
  2009-09-08 19:34             ` Theodore Tso
  -1 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2009-09-08 18:19 UTC (permalink / raw)
  To: Theodore Tso, Chris Mason, Peter Zijlstra, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm, jack

On Tue, Sep 08, 2009 at 02:06:01PM -0400, Theodore Tso wrote:
> Well, we can mostly do this now, using the XFS hack:
> 
>       wbc->nr_to_write *= 4;
> 
> Which is another way of saying, we *know* the page writeback routines
> are on crack, so we'll ignore their suggestion of how many pages to
> write, and we'll try to write more than what they asked us to write.
> 
> (This wasn't a proposed change; it's in Linux 2.6 mainline already;
> see fs/xfs/linux-2.6/xfs_aops.c, in xfs_vm_writepage).  The fact that
> filesystems are playing games like this should be a clear indication
> that things are badly broken above....

Note that we did not put in this hack behind anyone's back.  The first
version from Chris was posted on fsdevel, lkml, the ext3 list and so on:

	http://lkml.org/lkml/2008/10/9/237


And when we finally decided that we absolutely needed it, it also made
another roundtrip to linux-mm in the hope that we'd get something better
from the VM people:

	http://article.gmane.org/gmane.comp.file-systems.xfs.general/29663


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:57               ` Chris Mason
@ 2009-09-08 18:28                 ` Peter Zijlstra
  0 siblings, 0 replies; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 18:28 UTC (permalink / raw)
  To: Chris Mason
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, 2009-09-08 at 13:57 -0400, Chris Mason wrote:
> Going back to the streaming writer case, pretend the FS just created a
> nice fat 256MB extent out of dealloc pages, but after we wrote the first
> 4k, we dropped below the dirty threshold and IO is no longer "required".
> 
> It would be silly to just write 4k.  We know we have a contiguous
> area 256MB long on disk and 256MB of dirty pages.  In this case, pdflush
> (or Jens' bdi threads) want to write some large portion of that 256MB.
> 
> You might argue a balance_dirty_pages callers wants to return quickly,
> but even then we'd want to write at least 128k.

Sure, and that's no problem at all.  I'm thinking something like a
fraction of the dirty limit, maybe something like
(dirty_ratio-background_ratio) / 4 as chunk size.  That gives a sizable
amount and scales with the writeback cache stuff.

Especially if we move all write activity into the bdi threads and have
the application tasks wait.  In that case we can release the app tasks
to generate more dirty pages while still writing out data in a linear
fashion.
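
For example, taking the ratios as fractions of dirtyable memory and
assuming the common defaults of dirty_ratio=20 and background_ratio=10
on a box with roughly 4GB of dirtyable memory, that chunk works out to
(20 - 10) / 4 = 2.5% of it, i.e. around 100MB per go, which is big
enough for decent extents while still bounded by the size of the write
cache.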


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:55               ` Peter Zijlstra
@ 2009-09-08 18:32                 ` Peter Zijlstra
  2009-09-09 14:23                   ` Jan Kara
  2009-09-08 18:35                 ` Chris Mason
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-08 18:32 UTC (permalink / raw)
  To: Chris Mason
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, 2009-09-08 at 19:55 +0200, Peter Zijlstra wrote:
> 
> I think I'm somewhat confused here though..
> 
> There's kernel threads doing writeout, and there's apps getting stuck in
> balance_dirty_pages().
> 
> If we want all writeout to be done by kernel threads (bdi/pd-flush like
> things) then we still need to manage the actual apps and delay them.
> 
> As things stand now, we kick pdflush into action when dirty levels are
> above the background level, and start writing out from the app task when
> we hit the full dirty level.
> 
> Moving all writeout to a kernel thread sounds good from writing linear
> stuff pov, but what do we make apps wait on then?

OK, so as I said in the previous email, we could have these app tasks
simply sleep on a waitqueue which gets periodic wakeups from
__bdi_writeout_inc() every time the dirty threshold drops.

The woken tasks would then check their bdi dirty limit (it's task
dependent) against the current values and either go back to sleep or
back to work.

The only problem would be the mass wakeups when lots of tasks are
blocked on dirty, but I'm guessing there's no way around that anyway,
and it's better to have a limited number of writers than have everybody
write something, which would result in massive write fragmentation.
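
Something like the sketch below, where the dirty_wait queue on the bdi
and both helpers are made up for illustration; bdi_stat() and the
per-bdi threshold are what balance_dirty_pages() already computes:

#include <linux/wait.h>
#include <linux/backing-dev.h>

/* app side: sleep until this bdi is back under its share of the limit */
static void wait_for_bdi_dirty(struct backing_dev_info *bdi,
			       unsigned long bdi_thresh)
{
	wait_event(bdi->dirty_wait,
		   bdi_stat(bdi, BDI_RECLAIMABLE) +
		   bdi_stat(bdi, BDI_WRITEBACK) <= bdi_thresh);
}

/* writeout side: poked from __bdi_writeout_inc() every so often */
static void bdi_dirty_progress(struct backing_dev_info *bdi)
{
	if (waitqueue_active(&bdi->dirty_wait))
		wake_up_all(&bdi->dirty_wait);	/* the mass wakeup issue */
}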


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 17:55               ` Peter Zijlstra
  2009-09-08 18:32                 ` Peter Zijlstra
@ 2009-09-08 18:35                 ` Chris Mason
  1 sibling, 0 replies; 76+ messages in thread
From: Chris Mason @ 2009-09-08 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Artem Bityutskiy, Jens Axboe, linux-kernel, linux-fsdevel, david,
	hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, Sep 08, 2009 at 07:55:01PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 19:46 +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:28 -0400, Chris Mason wrote:
> > > > Right, so what can we do to make it useful? I think the intent is to
> > > > limit the number of pages in writeback and provide some progress
> > > > feedback to the vm.
> > > > 
> > > > Going by your experience we're failing there.
> > > 
> > > Well, congestion_wait is a stop sign but not a queue.  So, if you're
> > > being nice and honoring congestion but another process (say O_DIRECT
> > > random writes) doesn't, then you back off forever and none of your IO
> > > gets done.
> > > 
> > > To get around this, you can add code to make sure that you do
> > > _some_ io, but this isn't enough for your work to get done
> > > quickly, and you do end up waiting in get_request() so the async
> > > benefits of using the congestion test go away.
> > > 
> > > If we changed everyone to honor congestion, we end up with a poll model
> > > because a ton of congestion_wait() callers create a thundering herd.
> > > 
> > > So, we could add a queue, and then congestion_wait() would look a lot
> > > like get_request_wait().  I'd rather that everyone just used
> > > get_request_wait, and then have us fix any latency problems in the
> > > elevator.
> > 
> > Except you'd need to lift it to the BDI layer, because not all backing
> > devices are a block device.
> > 
> > Making it into a per-bdi queue sounds good to me though.
> > 
> > > For me, perfect would be one or more threads per-bdi doing the
> > > writeback, and never checking for congestion (like what Jens' code
> > > does).  The congestion_wait inside balance_dirty_pages() is really just
> > > a schedule_timeout(), on a fully loaded box the congestion doesn't go
> > > away anyway.  We should switch that to a saner system of waiting for
> > > progress on the bdi writeback + dirty thresholds.
> > 
> > Right, one of the things we could possibly do is tie into
> > __bdi_writeout_inc() and test levels there once every so often and then
> > flip a bit when we're low enough to stop writing.
> 
> I think I'm somewhat confused here though..
> 
> There's kernel threads doing writeout, and there's apps getting stuck in
> balance_dirty_pages().
> 
> If we want all writeout to be done by kernel threads (bdi/pd-flush like
> things) then we still need to manage the actual apps and delay them.
> 
> As things stand now, we kick pdflush into action when dirty levels are
> above the background level, and start writing out from the app task when
> we hit the full dirty level.
> 
> Moving all writeout to a kernel thread sounds good from writing linear
> stuff pov, but what do we make apps wait on then?

I suppose we could come up with the perfect queuing system where procs
got in line and came out as the bdi became less busy.  The problem is
that schedule_timeout(HZ/10) isn't really a great idea because HZ/10
might be much much too long for fast devices.

congestion_wait() isn't a great idea because the block device might stay
congested long after we've crossed below the threshold.

If there were a flag on the bdi that got cleared as things improved, we
could wait on that.

Otherwise, schedule_timeout() with increasing timeout values per
iteration and a poll on the thresholds isn't too far from what we have
now.
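
Something along these lines for that fallback; over_bdi_thresh() here
just stands in for the existing limit checks in balance_dirty_pages()
and is not a real function:

#include <linux/sched.h>
#include <linux/backing-dev.h>

static void wait_for_bdi_progress(struct backing_dev_info *bdi,
				  unsigned long bdi_thresh)
{
	long timeout = HZ / 100;	/* start short for fast devices */

	while (over_bdi_thresh(bdi, bdi_thresh)) {
		schedule_timeout_uninterruptible(timeout);
		/* back off if the device really is that slow */
		if (timeout < HZ / 4)
			timeout <<= 1;
	}
}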

-chris


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 18:19           ` Christoph Hellwig
@ 2009-09-08 19:34             ` Theodore Tso
  0 siblings, 0 replies; 76+ messages in thread
From: Theodore Tso @ 2009-09-08 19:34 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Peter Zijlstra, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, akpm, jack

On Tue, Sep 08, 2009 at 02:19:37PM -0400, Christoph Hellwig wrote:
> > (This wasn't a proposed change; it's in Linux 2.6 mainline already;
> > see fs/xfs/linux-2.6/xfs_aops.c, in xfs_vm_writepage).  The fact that
> > filesystems are playing games like this should be a clear indication
> > that things are badly broken above....
> 
> Note that we did not put in this hack behind anyones back.  The first
> version from Chris was posted on fsdevel, lkml, the ext3 list and so on:
> 
> An when we finally decided that we absolute need it it also made another
> roundtrip to linux-mm to hope that we'd get something better from the VM
> people:

Sorry, I didn't want to imply that this was done behind anyone's back
--- rather that the fact that we need to do this sort of thing is an
indication that something is badly broken in the page writeback
functions.

I was reacting to Peter's argument that we shouldn't have a
knob/tunable to adjust this limit.  If we can figure out something
which is auto-tuning, that's all very well and good, but given that
people have been pointing out a problem here for at least a full year,
maybe we should have the tunable now; then later on, if the VM crowd
can figure out something more clever, we can retire the tunable at some
point in the future.

	   		     	   	     - Ted

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:56         ` Peter Zijlstra
  2009-09-08 17:28           ` Chris Mason
@ 2009-09-09  1:53           ` Dave Chinner
  2009-09-09  3:52             ` Wu Fengguang
  1 sibling, 1 reply; 76+ messages in thread
From: Dave Chinner @ 2009-09-09  1:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Artem Bityutskiy, Jens Axboe, linux-kernel,
	linux-fsdevel, hch, akpm, jack, Theodore Ts'o, Wu Fengguang

On Tue, Sep 08, 2009 at 06:56:23PM +0200, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 12:29 -0400, Chris Mason wrote:
> > Either way, if pdflush or the bdi thread or whoever ends up switching to
> > another file during a big streaming write, the end result is that we
> > fragment.  We may fragment the file (ext4) or we may fragment the
> > writeback (xfs), but the end result isn't good.
> 
> OK, so what we want is for a way to re-enter the whole
> writeback_inodes() path onto the same file, right?

No, that would take us back to the Bad Old Days where one large
file write could starve out the other 10,000 small files that need to
be written.  The old writeback code used to end up this way
because it didn't rotate large files to the back of the dirty inode
queue once wbc->nr_to_write was exhausted.  This could cause files
not to be written back for tens of minutes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09  1:53           ` Dave Chinner
@ 2009-09-09  3:52             ` Wu Fengguang
  0 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-09  3:52 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, hch, akpm, jack, Theodore Ts'o

On Wed, Sep 09, 2009 at 09:53:59AM +0800, Dave Chinner wrote:
> On Tue, Sep 08, 2009 at 06:56:23PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 12:29 -0400, Chris Mason wrote:
> > > Either way, if pdflush or the bdi thread or whoever ends up switching to
> > > another file during a big streaming write, the end result is that we
> > > fragment.  We may fragment the file (ext4) or we may fragment the
> > > writeback (xfs), but the end result isn't good.
> > 
> > OK, so what we want is for a way to re-enter the whole
> > writeback_inodes() path onto the same file, right?
> 
> No, that would take use back to the Bad Old Days where one large
> file write can starve out the other 10,000 small files that need to
> be written. The old writeback code used to end up in this way
> because it didn't rotate large files to the back of the dirty inode
> queue once wbc->nr_to_write was exhausted. This could cause files
> not to be written back for tens of minutes....

Problem is, there is no per-file writeback quota.

Here is a quick demo of the idea: continue writeback of the last file if
its quota has not been exceeded.  It also fixes the problem of premature
abort on congestion.  The end result is that writeback of big files
won't be reduced to small chunks by intermixed small files or by
congestion.

Thanks,
Fengguang
---

writeback: ensure big files are written in MAX_WRITEBACK_PAGES chunks

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c         |   39 ++++++++++++++++++++++++++++++++++--
 include/linux/writeback.h |   11 ++++++++++
 mm/page-writeback.c       |    9 --------
 3 files changed, 48 insertions(+), 11 deletions(-)

--- linux.orig/fs/fs-writeback.c	2009-09-09 10:02:30.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-09 11:42:19.000000000 +0800
@@ -218,6 +218,19 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_list, &inode->i_sb->s_more_io);
 }
 
+/*
+ * continue io on this inode on next writeback if
+ * it has not accumulated large enough writeback io chunk
+ */
+static void requeue_partial_io(struct writeback_control *wbc, struct inode *inode)
+{
+	if (wbc->last_file_written == 0 ||
+	    wbc->last_file_written >= MAX_WRITEBACK_PAGES)
+		return requeue_io(inode);
+
+	list_move_tail(&inode->i_list, &inode->i_sb->s_io);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -311,6 +324,8 @@ writeback_single_inode(struct inode *ino
 {
 	struct address_space *mapping = inode->i_mapping;
 	int wait = wbc->sync_mode == WB_SYNC_ALL;
+	long last_file_written;
+	long nr_to_write;
 	unsigned dirty;
 	int ret;
 
@@ -348,8 +363,21 @@ writeback_single_inode(struct inode *ino
 
 	spin_unlock(&inode_lock);
 
+	if (wbc->last_file != inode->i_ino)
+		last_file_written = 0;
+	else
+		last_file_written = wbc->last_file_written;
+	wbc->nr_to_write -= last_file_written;
+	nr_to_write = wbc->nr_to_write;
+
 	ret = do_writepages(mapping, wbc);
 
+	if (wbc->last_file != inode->i_ino) {
+		wbc->last_file = inode->i_ino;
+		wbc->last_file_written = nr_to_write - wbc->nr_to_write;
+	} else
+		wbc->last_file_written += nr_to_write - wbc->nr_to_write;
+
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wait);
@@ -378,11 +406,16 @@ writeback_single_inode(struct inode *ino
 			 * sometimes bales out without doing anything.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
+			if (wbc->encountered_congestion) {
+				/*
+				 * keep and retry after congestion
+				 */
+				requeue_partial_io(wbc, inode);
+			} else if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
 				 */
-				requeue_io(inode);
+				requeue_partial_io(wbc, inode);
 			} else {
 				/*
 				 * somehow blocked: retry later
@@ -402,6 +435,8 @@ writeback_single_inode(struct inode *ino
 		}
 	}
 	inode_sync_complete(inode);
+	wbc->nr_to_write += last_file_written;
+
 	return ret;
 }
 
--- linux.orig/include/linux/writeback.h	2009-09-09 11:13:43.000000000 +0800
+++ linux/include/linux/writeback.h	2009-09-09 11:41:40.000000000 +0800
@@ -25,6 +25,15 @@ static inline int task_is_pdflush(struct
 #define current_is_pdflush()	task_is_pdflush(current)
 
 /*
+ * The maximum number of pages to writeout in a single bdflush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES	1024
+
+/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
@@ -45,6 +54,8 @@ struct writeback_control {
 					   older than this */
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
+	unsigned long last_file;	/* Inode number of last written file */
+	long last_file_written;		/* Total pages written for last file */
 	long pages_skipped;		/* Pages which were not written */
 
 	/*
--- linux.orig/mm/page-writeback.c	2009-09-09 10:05:02.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-09 11:41:01.000000000 +0800
@@ -36,15 +36,6 @@
 #include <linux/pagevec.h>
 
 /*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
-/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 16:29       ` Chris Mason
@ 2009-09-09  9:29           ` Wu Fengguang
  2009-09-08 18:06           ` Theodore Tso
  2009-09-09  9:29           ` Wu Fengguang
  2 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-09  9:29 UTC (permalink / raw)
  To: Chris Mason, Peter Zijlstra, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, jack,
	Theodore Ts'o

On Tue, Sep 08, 2009 at 12:29:36PM -0400, Chris Mason wrote:
> On Tue, Sep 08, 2009 at 06:06:23PM +0200, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 13:37 +0300, Artem Bityutskiy wrote:
> > > Hi,
> > > 
> > > On 09/08/2009 12:23 PM, Jens Axboe wrote:
> > > > From: Theodore Ts'o<tytso@mit.edu>
> > > >
> > > > Originally, MAX_WRITEBACK_PAGES was hard-coded to 1024 because of a
> > > > concern of not holding I_SYNC for too long.  (At least, that was the
> > > > comment previously.)  This doesn't make sense now because the only
> > > > time we wait for I_SYNC is if we are calling sync or fsync, and in
> > > > that case we need to write out all of the data anyway.  Previously
> > > > there may have been other code paths that waited on I_SYNC, but not
> > > > any more.
> > > >
> > > > According to Christoph, the current writeback size is way too small,
> > > > and XFS had a hack that bumped out nr_to_write to four times the value
> > > > sent by the VM to be able to saturate medium-sized RAID arrays.  This
> > > > value was also problematic for ext4 as well, as it caused large files
> > > > to be come interleaved on disk by in 8 megabyte chunks (we bumped up
> > > > the nr_to_write by a factor of two).
> > > >
> > > > So, in this patch, we make the MAX_WRITEBACK_PAGES a tunable,
> > > > max_writeback_mb, and set it to a default value of 128 megabytes.
> > > >
> > > > http://bugzilla.kernel.org/show_bug.cgi?id=13930
> > > >
> > > > Signed-off-by: "Theodore Ts'o"<tytso@mit.edu>
> > > > Signed-off-by: Jens Axboe<jens.axboe@oracle.com>
> > > 
> > > Would be nice to update doc files like
> > > 
> > > Documentation/sysctl/vm.txt
> > > Documentation/filesystems/proc.txt
> > 
> > I'm still not convinced this knob is worth the patch and I'm inclined to
> > flat out NAK it..
> > 
> > The whole point of MAX_WRITEBACK_PAGES seems to occasionally check the
> > dirty stats again and not write out too much.
> 
> The problem is that 'too much' is a very abstract thing.  When a process
> is stuck in balance_dirty_pages, we want them to do the minimal amount
> of work (or waiting) required to get them safely back inside file_write().

It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).

So I feel that we could just increase MAX_WRITEBACK_PAGES.  It won't
lead to bumpy throttled writes.  It does affect the fairness of background
writes, i.e. small files will have to wait longer for large files.
But I'm fine with MAX_WRITEBACK_PAGES=64MB, which means on a desktop
that a large file may only delay others for about 1 second, which is
small enough compared to the 30 second dirty expire time.
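
(Assuming a disk that sustains something like 60-70MB/s of sequential
writeback, a 64MB chunk works out to roughly one second of exclusive
service for a single file, hence the 1 second figure above.)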

On the other hand, I find that the (ratelimit_pages + ratelimit_pages / 2)
used for balance_dirty_pages() may fall below the real number of
dirtied pages, which is not safe if some filesystem chooses to dirty
2 * ratelimit_pages before calling balance_dirty_pages_ratelimited_nr().

So, how about this patch?

Thanks,
Fengguang
---

writeback: balance_dirty_pages() shall write more than dirtied pages

Some filesystems may choose to write much more than ratelimit_pages
before calling balance_dirty_pages_ratelimited_nr().  So it is safer to
determine the number of pages to write based on the real number of
dirtied pages.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-09 16:53:15.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-09 17:05:14.000000000 +0800
@@ -44,12 +44,12 @@ static long ratelimit_pages = 32;
 /*
  * When balance_dirty_pages decides that the caller needs to perform some
  * non-background writeback, this is how many pages it will attempt to write.
- * It should be somewhat larger than RATELIMIT_PAGES to ensure that reasonably
+ * It should be somewhat larger than dirtied pages to ensure that reasonably
  * large amounts of I/O are submitted.
  */
-static inline long sync_writeback_pages(void)
+static inline long sync_writeback_pages(unsigned long dirtied)
 {
-	return ratelimit_pages + ratelimit_pages / 2;
+	return dirtied + dirtied / 2;
 }
 
 /* The following parameters are exported via /proc/sys/vm */
@@ -481,7 +481,8 @@ get_dirty_limits(unsigned long *pbackgro
  * If we're over `background_thresh' then pdflush is woken to perform some
  * writeout.
  */
-static void balance_dirty_pages(struct address_space *mapping)
+static void balance_dirty_pages(struct address_space *mapping,
+				unsigned long write_chunk)
 {
 	long nr_reclaimable, bdi_nr_reclaimable;
 	long nr_writeback, bdi_nr_writeback;
@@ -489,7 +490,6 @@ static void balance_dirty_pages(struct a
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
 	unsigned long pages_written = 0;
-	unsigned long write_chunk = sync_writeback_pages();
 
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
@@ -634,9 +634,10 @@ void balance_dirty_pages_ratelimited_nr(
 	p =  &__get_cpu_var(ratelimits);
 	*p += nr_pages_dirtied;
 	if (unlikely(*p >= ratelimit)) {
+		ratelimit = sync_writeback_pages(*p);
 		*p = 0;
 		preempt_enable();
-		balance_dirty_pages(mapping);
+		balance_dirty_pages(mapping, ratelimit);
 		return;
 	}
 	preempt_enable();

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09  9:29           ` Wu Fengguang
  (?)
@ 2009-09-09 12:28           ` Christoph Hellwig
  2009-09-09 12:32             ` Wu Fengguang
  -1 siblings, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2009-09-09 12:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Chris Mason, Peter Zijlstra, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09, 2009 at 05:29:01PM +0800, Wu Fengguang wrote:
> It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
> Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).

With Jens' writeback patches applied, balance_dirty_pages does not start
writeback itself anymore but calls bdi_start_writeback to let the
flusher thread do it.

It would be good if we do any writeback tuning on top of these patches.

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:28           ` Christoph Hellwig
@ 2009-09-09 12:32             ` Wu Fengguang
  2009-09-09 12:36                 ` Artem Bityutskiy
  2009-09-09 12:37               ` Jens Axboe
  0 siblings, 2 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-09 12:32 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Chris Mason, Peter Zijlstra, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09, 2009 at 08:28:06PM +0800, Christoph Hellwig wrote:
> On Wed, Sep 09, 2009 at 05:29:01PM +0800, Wu Fengguang wrote:
> > It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
> > Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).
> 
> With Jen's writeback patches applied balance_dirty_pages does not start
> writeback itself anymore but calls bdi_start_writeback to let the
> flusher thread do it.
> 
> it would be good if we do any writeback tuning ontop of these patches..

Ah OK. I'm using the latest linux-next and expected his patches to be there.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:32             ` Wu Fengguang
@ 2009-09-09 12:36                 ` Artem Bityutskiy
  2009-09-09 12:37               ` Jens Axboe
  1 sibling, 0 replies; 76+ messages in thread
From: Artem Bityutskiy @ 2009-09-09 12:36 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Chris Mason, Peter Zijlstra, Jens Axboe,
	linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On 09/09/2009 03:32 PM, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 08:28:06PM +0800, Christoph Hellwig wrote:
>> On Wed, Sep 09, 2009 at 05:29:01PM +0800, Wu Fengguang wrote:
>>> It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES.
>>> Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).
>>
>> With Jen's writeback patches applied balance_dirty_pages does not start
>> writeback itself anymore but calls bdi_start_writeback to let the
>> flusher thread do it.
>>
>> it would be good if we do any writeback tuning ontop of these patches..
>
> Ah OK. I'm using latest linux-next and expected his patches to be there..

At least a few days ago they were there.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:32             ` Wu Fengguang
  2009-09-09 12:36                 ` Artem Bityutskiy
@ 2009-09-09 12:37               ` Jens Axboe
  2009-09-09 12:43                 ` Christoph Hellwig
  2009-09-09 12:57                 ` Wu Fengguang
  1 sibling, 2 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-09 12:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Chris Mason, Peter Zijlstra, Artem Bityutskiy,
	linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09 2009, Wu Fengguang wrote:
> On Wed, Sep 09, 2009 at 08:28:06PM +0800, Christoph Hellwig wrote:
> > On Wed, Sep 09, 2009 at 05:29:01PM +0800, Wu Fengguang wrote:
> > > It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
> > > Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).
> > 
> > With Jen's writeback patches applied balance_dirty_pages does not start
> > writeback itself anymore but calls bdi_start_writeback to let the
> > flusher thread do it.
> > 
> > it would be good if we do any writeback tuning ontop of these patches..
> 
> Ah OK. I'm using latest linux-next and expected his patches to be there..

They are there, have been for months! But I think Christoph is a little
confused, we'll still do writeback inline from balance_dirty_pages(). It
does writeback_inodes_wbc(), which does not schedule async writeout.

So if your patches are based and tested off -next, you should be good.
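
Schematically, the split being described is something like the following
(a stand-alone paraphrase with plain integers standing in for the real
counters and the two actions reduced to return values; not the actual
kernel code):

enum bdp_action { BDP_NONE, BDP_WRITEBACK_INLINE, BDP_KICK_FLUSHER };

/*
 * Over the bdi's share of the dirty limit: write back inline from the
 * dirtying task (writeback_inodes_wbc).  Merely over the background
 * threshold: only kick the flusher thread (bdi_start_writeback).
 */
static enum bdp_action balance_dirty_pages_model(unsigned long bdi_nr_reclaimable,
						 unsigned long bdi_thresh,
						 unsigned long nr_reclaimable,
						 unsigned long background_thresh)
{
	if (bdi_nr_reclaimable > bdi_thresh)
		return BDP_WRITEBACK_INLINE;
	if (nr_reclaimable > background_thresh)
		return BDP_KICK_FLUSHER;
	return BDP_NONE;
}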

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:37               ` Jens Axboe
@ 2009-09-09 12:43                 ` Christoph Hellwig
  2009-09-09 12:44                   ` Jens Axboe
  2009-09-09 12:57                 ` Wu Fengguang
  1 sibling, 1 reply; 76+ messages in thread
From: Christoph Hellwig @ 2009-09-09 12:43 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Wu Fengguang, Christoph Hellwig, Chris Mason, Peter Zijlstra,
	Artem Bityutskiy, linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09, 2009 at 02:37:46PM +0200, Jens Axboe wrote:
> They are there, have been for months! But I think Christoph is a little
> confused, we'll still do writeback inline from balance_dirty_pages(). It
> does writeback_inodes_wbc(), which does not schedule async writeout.

Hmm, at least the version I have applied right now calls
bdi_start_writeback and not writeback_inodes_wbc from
balance_dirty_pages.  But you sent out another spin or two after those,
so maybe it changed again.


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:43                 ` Christoph Hellwig
@ 2009-09-09 12:44                   ` Jens Axboe
  2009-09-09 12:51                     ` Christoph Hellwig
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-09 12:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Wu Fengguang, Chris Mason, Peter Zijlstra, Artem Bityutskiy,
	linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09 2009, Christoph Hellwig wrote:
> On Wed, Sep 09, 2009 at 02:37:46PM +0200, Jens Axboe wrote:
> > They are there, have been for months! But I think Christoph is a little
> > confused, we'll still do writeback inline from balance_dirty_pages(). It
> > does writeback_inodes_wbc(), which does not schedule async writeout.
> 
> Hmm, at least the version I have applied right now calls
> bdi_start_writeback and not writeback_inodes_wbc from
> > balance_dirty_pages.  But you sent out another spin or two after those,
> so maybe it changed again.

Yes, that's the very end check, the old code does the same. Further up
it does writeback_inodes_wbc() if we are over the threshold.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:44                   ` Jens Axboe
@ 2009-09-09 12:51                     ` Christoph Hellwig
  0 siblings, 0 replies; 76+ messages in thread
From: Christoph Hellwig @ 2009-09-09 12:51 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Wu Fengguang, Chris Mason, Peter Zijlstra,
	Artem Bityutskiy, linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09, 2009 at 02:44:50PM +0200, Jens Axboe wrote:
> > bdi_start_writeback and not writeback_inodes_wbc from
> > balance_dirty_pages.  But you sent out another spin or two after those,
> > so maybe it changed again.
> 
> Yes, that's the very end check, the old code does the same. Further up
> it does writeback_inodes_wbc() if we are over the threshold.

Aehm, true.  It was still a sync_inodes_wbc in my tree..


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 12:37               ` Jens Axboe
  2009-09-09 12:43                 ` Christoph Hellwig
@ 2009-09-09 12:57                 ` Wu Fengguang
  1 sibling, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-09 12:57 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Chris Mason, Peter Zijlstra, Artem Bityutskiy,
	linux-kernel, linux-fsdevel, david, akpm, jack,
	Theodore Ts'o

On Wed, Sep 09, 2009 at 08:37:46PM +0800, Jens Axboe wrote:
> On Wed, Sep 09 2009, Wu Fengguang wrote:
> > On Wed, Sep 09, 2009 at 08:28:06PM +0800, Christoph Hellwig wrote:
> > > On Wed, Sep 09, 2009 at 05:29:01PM +0800, Wu Fengguang wrote:
> > > > It seems that balance_dirty_pages() is not coupled with MAX_WRITEBACK_PAGES. 
> > > > Instead it uses the much smaller (ratelimit_pages + ratelimit_pages / 2).
> > > 
> > > With Jen's writeback patches applied balance_dirty_pages does not start
> > > writeback itself anymore but calls bdi_start_writeback to let the
> > > flusher thread do it.
> > > 
> > > it would be good if we do any writeback tuning ontop of these patches..
> > 
> > Ah OK. I'm using latest linux-next and expected his patches to be there..
> 
> They are there, have been for months! But I think Christoph is a little
> confused, we'll still do writeback inline from balance_dirty_pages(). It
> does writeback_inodes_wbc(), which does not schedule async writeout.
> 
> So if your patches are based and tested off -next, you should be good.

Just found that I was in the wrong branch..  Now I see
writeback_inodes_wbc(), thanks.

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-08 18:32                 ` Peter Zijlstra
@ 2009-09-09 14:23                   ` Jan Kara
  2009-09-09 14:37                     ` Wu Fengguang
  2009-09-10 15:49                     ` Peter Zijlstra
  0 siblings, 2 replies; 76+ messages in thread
From: Jan Kara @ 2009-09-09 14:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Chris Mason, Artem Bityutskiy, Jens Axboe, linux-kernel,
	linux-fsdevel, david, hch, akpm, jack, Theodore Ts'o,
	Wu Fengguang

On Tue 08-09-09 20:32:26, Peter Zijlstra wrote:
> On Tue, 2009-09-08 at 19:55 +0200, Peter Zijlstra wrote:
> > 
> > I think I'm somewhat confused here though..
> > 
> > There's kernel threads doing writeout, and there's apps getting stuck in
> > balance_dirty_pages().
> > 
> > If we want all writeout to be done by kernel threads (bdi/pd-flush like
> > things) then we still need to manage the actual apps and delay them.
> > 
> > As things stand now, we kick pdflush into action when dirty levels are
> > above the background level, and start writing out from the app task when
> > we hit the full dirty level.
> > 
> > Moving all writeout to a kernel thread sounds good from writing linear
> > stuff pov, but what do we make apps wait on then?
> 
> OK, so like said in the previous email, we could have these app tasks
> simply sleep on a waitqueue which gets periodic wakeups from
> __bdi_writeback_inc() every time the dirty threshold drops.
> 
> The woken tasks would then check their bdi dirty limit (its task
> dependent) against the current values and either go back to sleep or
> back to work.
  Well, what I imagined we could do is:
Have a per-bdi variable 'pages_written' - that would reflect the amount of
pages written to the bdi since boot (OK, we'd have to handle overflows but
that's doable).

There will be a per-bdi variable 'pages_waited'. When a thread should sleep
in balance_dirty_pages() because we are over limits, it kicks writeback thread
and does:
  to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
whatever number we decide)
  pages_waited = to_wait
  sleep until pages_written reaches to_wait or we drop below dirty limits.

That will make sure each thread will sleep until writeback threads have done
their duty for the writing thread.

If we make sure sleeping threads are properly ordered on the wait queue,
we could always wakeup just the first one and thus avoid the herding
effect. When we drop below dirty limits, we would just wakeup the whole
waitqueue.

Does this sound reasonable?
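
To make the accounting concrete, here is a minimal user-space model of the
scheme (pages_written, pages_waited and to_wait are the names from the
description above; the pthread mutex/condvar and everything else are just
stand-ins for whatever the kernel would use, and the "drop below dirty
limits" exit is left out):

#include <pthread.h>

struct bdi_model {
	unsigned long	pages_written;	/* pages written back since boot */
	unsigned long	pages_waited;	/* last wait target handed out */
	pthread_mutex_t	lock;
	pthread_cond_t	more_written;
};

/* writeback side: account one completed page and prod the waiters */
static void model_writeout_inc(struct bdi_model *bdi)
{
	pthread_mutex_lock(&bdi->lock);
	bdi->pages_written++;
	pthread_cond_broadcast(&bdi->more_written);
	pthread_mutex_unlock(&bdi->lock);
}

/* dirtier side: stack our demand on top of everyone else's and wait */
static void model_balance_dirty_pages(struct bdi_model *bdi, unsigned long chunk)
{
	unsigned long to_wait;

	pthread_mutex_lock(&bdi->lock);
	to_wait = bdi->pages_written > bdi->pages_waited ?
		  bdi->pages_written : bdi->pages_waited;
	to_wait += chunk;
	bdi->pages_waited = to_wait;
	while (bdi->pages_written < to_wait)
		pthread_cond_wait(&bdi->more_written, &bdi->lock);
	pthread_mutex_unlock(&bdi->lock);
}

A real version would of course not broadcast on every written page, and
would keep the waiters FIFO-ordered so that only the head needs to be
woken, as already noted above.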

> The only problem would be the mass wakeups when lots of tasks are
> blocked on dirty, but I'm guessing there's no way around that anyway,
> and its better to have a limited number of writers than have everybody
> write something, which would result in massive write fragmentation.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 14:23                   ` Jan Kara
@ 2009-09-09 14:37                     ` Wu Fengguang
  2009-09-10 15:49                     ` Peter Zijlstra
  1 sibling, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-09 14:37 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Wed, Sep 09, 2009 at 10:23:15PM +0800, Jan Kara wrote:
> On Tue 08-09-09 20:32:26, Peter Zijlstra wrote:
> > On Tue, 2009-09-08 at 19:55 +0200, Peter Zijlstra wrote:
> > > 
> > > I think I'm somewhat confused here though..
> > > 
> > > There's kernel threads doing writeout, and there's apps getting stuck in
> > > balance_dirty_pages().
> > > 
> > > If we want all writeout to be done by kernel threads (bdi/pd-flush like
> > > things) then we still need to manage the actual apps and delay them.
> > > 
> > > As things stand now, we kick pdflush into action when dirty levels are
> > > above the background level, and start writing out from the app task when
> > > we hit the full dirty level.
> > > 
> > > Moving all writeout to a kernel thread sounds good from writing linear
> > > stuff pov, but what do we make apps wait on then?
> > 
> > OK, so like said in the previous email, we could have these app tasks
> > simply sleep on a waitqueue which gets periodic wakeups from
> > __bdi_writeback_inc() every time the dirty threshold drops.
> > 
> > The woken tasks would then check their bdi dirty limit (its task
> > dependent) against the current values and either go back to sleep or
> > back to work.
>   Well, what I imagined we could do is:
> Have a per-bdi variable 'pages_written' - that would reflect the amount of
> pages written to the bdi since boot (OK, we'd have to handle overflows but
> that's doable).
> 
> There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> in balance_dirty_pages() because we are over limits, it kicks writeback thread
> and does:
>   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> whatever number we decide)
>   pages_waited = to_wait
>   sleep until pages_written reaches to_wait or we drop below dirty limits.
> 
> That will make sure each thread will sleep until writeback threads have done
> their duty for the writing thread.
> 
> If we make sure sleeping threads are properly ordered on the wait queue,
> we could always wakeup just the first one and thus avoid the herding
> effect. When we drop below dirty limits, we would just wakeup the whole
> waitqueue.
> 
> Does this sound reasonable?

Yup! I have a similar idea: for each chunk the kernel writeback thread
has synced, it 'honours' that many pages of quota to some waiting/sleeping
dirtier task to consume (so that it can continue to dirty that many pages).

This makes it possible to control the relative/absolute writeback
bandwidth for each dirtier task. Something like an IO controller.

Thanks,
Fengguang

> > The only problem would be the mass wakeups when lots of tasks are
> > blocked on dirty, but I'm guessing there's no way around that anyway,
> > and its better to have a limited number of writers than have everybody
> > write something, which would result in massive write fragmentation.
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-09 14:23                   ` Jan Kara
  2009-09-09 14:37                     ` Wu Fengguang
@ 2009-09-10 15:49                     ` Peter Zijlstra
  2009-09-14 11:17                       ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-10 15:49 UTC (permalink / raw)
  To: Jan Kara
  Cc: Chris Mason, Artem Bityutskiy, Jens Axboe, linux-kernel,
	linux-fsdevel, david, hch, akpm, Theodore Ts'o, Wu Fengguang

On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
>   Well, what I imagined we could do is:
> Have a per-bdi variable 'pages_written' - that would reflect the amount of
> pages written to the bdi since boot (OK, we'd have to handle overflows but
> that's doable).
> 
> There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> in balance_dirty_pages() because we are over limits, it kicks writeback thread
> and does:
>   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> whatever number we decide)
>   pages_waited = to_wait
>   sleep until pages_written reaches to_wait or we drop below dirty limits.
> 
> That will make sure each thread will sleep until writeback threads have done
> their duty for the writing thread.
> 
> If we make sure sleeping threads are properly ordered on the wait queue,
> we could always wakeup just the first one and thus avoid the herding
> effect. When we drop below dirty limits, we would just wakeup the whole
> waitqueue.
> 
> Does this sound reasonable?

That seems to go wrong when there's multiple tasks waiting on the same
bdi, you'd count each page for 1/n its weight.

Suppose pages_written = 1024, and 4 tasks block and compute their to
wait as pages_written + 256 = 1280, then we'd release all 4 of them
after 256 pages are written, instead of 4*256, which would be
pages_written = 2048.




^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-10 15:49                     ` Peter Zijlstra
@ 2009-09-14 11:17                       ` Jan Kara
  2009-09-24  8:33                         ` Wu Fengguang
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-09-14 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o,
	Wu Fengguang

On Thu 10-09-09 17:49:10, Peter Zijlstra wrote:
> On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
> >   Well, what I imagined we could do is:
> > Have a per-bdi variable 'pages_written' - that would reflect the amount of
> > pages written to the bdi since boot (OK, we'd have to handle overflows but
> > that's doable).
> > 
> > There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> > in balance_dirty_pages() because we are over limits, it kicks writeback thread
> > and does:
> >   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> > whatever number we decide)
> >   pages_waited = to_wait
> >   sleep until pages_written reaches to_wait or we drop below dirty limits.
> > 
> > That will make sure each thread will sleep until writeback threads have done
> > their duty for the writing thread.
> > 
> > If we make sure sleeping threads are properly ordered on the wait queue,
> > we could always wakeup just the first one and thus avoid the herding
> > effect. When we drop below dirty limits, we would just wakeup the whole
> > waitqueue.
> > 
> > Does this sound reasonable?
> 
> That seems to go wrong when there's multiple tasks waiting on the same
> bdi, you'd count each page for 1/n its weight.
> 
> Suppose pages_written = 1024, and 4 tasks block and compute their to
> wait as pages_written + 256 = 1280, then we'd release all 4 of them
> after 256 pages are written, instead of 4*256, which would be
> pages_written = 2048.
  Well, there's some locking needed of course. The intent is to stack
demands as they come. So in case pages_written = 1024, pages_waited = 1024
we would do:
THREAD 1:

spin_lock
to_wait = 1024 + 256
pages_waited = 1280
spin_unlock

THREAD 2:

spin_lock
to_wait = 1280 + 256
pages_waited = 1536
spin_unlock

  So weight of each page will be kept. The fact that second thread
effectively waits until the first thread has its demand satisfied looks
strange at the first sight but we don't do better currently and I think
it's fine - if they were two writer threads, then soon the thread released
first will queue behind the thread still waiting so long term the behavior
should be fair.
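
(Worked through with the numbers above: starting from pages_written = 1024
and pages_waited = 1024, four blocking tasks that each add 256 pages get the
stacked targets 1280, 1536, 1792 and 2048, so the last one is released only
after 4*256 pages have been written back and no page is counted at 1/n of
its weight.)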

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-14 11:17                       ` Jan Kara
@ 2009-09-24  8:33                         ` Wu Fengguang
  2009-09-24 15:38                           ` Peter Zijlstra
  2009-09-29 17:35                           ` Jan Kara
  0 siblings, 2 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-24  8:33 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Mon, Sep 14, 2009 at 07:17:21PM +0800, Jan Kara wrote:
> On Thu 10-09-09 17:49:10, Peter Zijlstra wrote:
> > On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
> > >   Well, what I imagined we could do is:
> > > Have a per-bdi variable 'pages_written' - that would reflect the amount of
> > > pages written to the bdi since boot (OK, we'd have to handle overflows but
> > > that's doable).
> > > 
> > > There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> > > in balance_dirty_pages() because we are over limits, it kicks writeback thread
> > > and does:
> > >   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> > > whatever number we decide)
> > >   pages_waited = to_wait
> > >   sleep until pages_written reaches to_wait or we drop below dirty limits.
> > > 
> > > That will make sure each thread will sleep until writeback threads have done
> > > their duty for the writing thread.
> > > 
> > > If we make sure sleeping threads are properly ordered on the wait queue,
> > > we could always wakeup just the first one and thus avoid the herding
> > > effect. When we drop below dirty limits, we would just wakeup the whole
> > > waitqueue.
> > > 
> > > Does this sound reasonable?
> > 
> > That seems to go wrong when there's multiple tasks waiting on the same
> > bdi, you'd count each page for 1/n its weight.
> > 
> > Suppose pages_written = 1024, and 4 tasks block and compute their to
> > wait as pages_written + 256 = 1280, then we'd release all 4 of them
> > after 256 pages are written, instead of 4*256, which would be
> > pages_written = 2048.
>   Well, there's some locking needed of course. The intent is to stack
> demands as they come. So in case pages_written = 1024, pages_waited = 1024
> we would do:
> THREAD 1:
> 
> spin_lock
> to_wait = 1024 + 256
> pages_waited = 1280
> spin_unlock
> 
> THREAD 2:
> 
> spin_lock
> to_wait = 1280 + 256
> pages_waited = 1536
> spin_unlock
> 
>   So weight of each page will be kept. The fact that second thread
> effectively waits until the first thread has its demand satisfied looks
> strange at the first sight but we don't do better currently and I think
> it's fine - if they were two writer threads, then soon the thread released
> first will queue behind the thread still waiting so long term the behavior
> should be fair.

Yeah, FIFO queuing should be good enough.

I'd like to propose one more data structure for evaluation :)

- bdi->throttle_lock
- bdi->throttle_list    pages to sync for each waiting task, taken from sync_writeback_pages()
- bdi->throttle_pages   (counted down) pages to sync for the head task, shall be atomic_t

In balance_dirty_pages(), it would do

        nr_to_sync = sync_writeback_pages()
        if (list_empty(bdi->throttle_list))  # I'm the only task
                bdi->throttle_pages = nr_to_sync
        append nr_to_sync to bdi->throttle_list
        kick off background writeback
        wait
        remove itself from bdi->throttle_list and wait list
        set bdi->throttle_pages for new head task (or LONG_MAX)

In __bdi_writeout_inc(), it would do

        if (--bdi->throttle_pages <= 0)
                check and wake up head task

In wb_writeback(), it would do

        if (args->for_background && exiting)
                wake up all throttled tasks

To prevent wake up too many tasks at the same time, it can relax the
background threshold a bit, so that __bdi_writeout_inc() become the
only wake up point in normal cases.

        if (args->for_background && !list_empty(bdi->throttle_list) &&
                over background_thresh - background_thresh / 32)
                keep write pages;
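
A stand-alone sketch of those three fields and the countdown, for
illustration only (pthread mutex/condvar, a plain singly-linked FIFO and
C11 atomics stand in for the kernel's spinlock, completion, list_head and
atomic_t, and LONG_MAX plays the role of the "nobody is throttled"
sentinel):

#include <limits.h>
#include <pthread.h>
#include <stdatomic.h>

struct throttle_task {
	long			nr_pages;	/* pages to sync for this waiter */
	int			done;
	struct throttle_task	*next;
};

struct bdi_throttle {
	pthread_mutex_t		throttle_lock;
	pthread_cond_t		wait;
	struct throttle_task	*throttle_list;	/* FIFO, head is being counted down */
	atomic_long		throttle_pages;	/* countdown for the head task */
};

/* wake the head task and start counting down for the next one */
static void throttle_wakeup(struct bdi_throttle *bdi)
{
	struct throttle_task *head;

	pthread_mutex_lock(&bdi->throttle_lock);
	head = bdi->throttle_list;
	if (head) {
		bdi->throttle_list = head->next;
		head->done = 1;
	}
	atomic_store(&bdi->throttle_pages,
		     bdi->throttle_list ? bdi->throttle_list->nr_pages : LONG_MAX);
	pthread_cond_broadcast(&bdi->wait);
	pthread_mutex_unlock(&bdi->throttle_lock);
}

/* __bdi_writeout_inc() side: account one page written back */
static void throttle_writeout_inc(struct bdi_throttle *bdi)
{
	if (atomic_load(&bdi->throttle_pages) < LONG_MAX &&
	    atomic_fetch_sub(&bdi->throttle_pages, 1) == 1)
		throttle_wakeup(bdi);
}

/* balance_dirty_pages() side: queue our demand and wait for it to be met */
static void throttle_wait(struct bdi_throttle *bdi, long nr_to_sync)
{
	struct throttle_task me = { .nr_pages = nr_to_sync };
	struct throttle_task **p;

	pthread_mutex_lock(&bdi->throttle_lock);
	if (!bdi->throttle_list)		/* I'm the only task */
		atomic_store(&bdi->throttle_pages, nr_to_sync);
	for (p = &bdi->throttle_list; *p; p = &(*p)->next)
		;				/* append at the tail */
	*p = &me;
	/* the real code would kick off background writeback here */
	while (!me.done)
		pthread_cond_wait(&bdi->wait, &bdi->throttle_lock);
	pthread_mutex_unlock(&bdi->throttle_lock);
}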

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-24  8:33                         ` Wu Fengguang
@ 2009-09-24 15:38                           ` Peter Zijlstra
  2009-09-25  1:33                             ` Wu Fengguang
  2009-09-29 17:35                           ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Peter Zijlstra @ 2009-09-24 15:38 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Thu, 2009-09-24 at 16:33 +0800, Wu Fengguang wrote:

> Yeah, FIFO queuing should be good enough.
> 
> I'd like to propose one more data structure for evaluation :)
> 
> - bdi->throttle_lock
> - bdi->throttle_list    pages to sync for each waiting task, taken from sync_writeback_pages()
> - bdi->throttle_pages   (counted down) pages to sync for the head task, shall be atomic_t
> 
> In balance_dirty_pages(), it would do
> 
>         nr_to_sync = sync_writeback_pages()
>         if (list_empty(bdi->throttle_list))  # I'm the only task
>                 bdi->throttle_pages = nr_to_sync
>         append nr_to_sync to bdi->throttle_list
>         kick off background writeback
>         wait
>         remove itself from bdi->throttle_list and wait list
>         set bdi->throttle_pages for new head task (or LONG_MAX)
> 
> In __bdi_writeout_inc(), it would do
> 
>         if (--bdi->throttle_pages <= 0)
>                 check and wake up head task
> 
> In wb_writeback(), it would do
> 
>         if (args->for_background && exiting)
>                 wake up all throttled tasks
> 
> To prevent wake up too many tasks at the same time, it can relax the
> background threshold a bit, so that __bdi_writeout_inc() become the
> only wake up point in normal cases.
> 
>         if (args->for_background && !list_empty(bdi->throttle_list) &&
>                 over background_thresh - background_thresh / 32)
>                 keep write pages;

Right, something like that ought to work well, or at least sounds like
worth a try ;-)


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-24 15:38                           ` Peter Zijlstra
@ 2009-09-25  1:33                             ` Wu Fengguang
  0 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-09-25  1:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jan Kara, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Thu, Sep 24, 2009 at 11:38:16PM +0800, Peter Zijlstra wrote:
> On Thu, 2009-09-24 at 16:33 +0800, Wu Fengguang wrote:
> 
> > Yeah, FIFO queuing should be good enough.
> > 
> > I'd like to propose one more data structure for evaluation :)
> > 
> > - bdi->throttle_lock
> > - bdi->throttle_list    pages to sync for each waiting task, taken from sync_writeback_pages()
> > - bdi->throttle_pages   (counted down) pages to sync for the head task, shall be atomic_t
> > 
> > In balance_dirty_pages(), it would do
> > 
> >         nr_to_sync = sync_writeback_pages()
> >         if (list_empty(bdi->throttle_list))  # I'm the only task
> >                 bdi->throttle_pages = nr_to_sync
> >         append nr_to_sync to bdi->throttle_list
> >         kick off background writeback
> >         wait
> >         remove itself from bdi->throttle_list and wait list
> >         set bdi->throttle_pages for new head task (or LONG_MAX)
> > 
> > In __bdi_writeout_inc(), it would do
> > 
> >         if (--bdi->throttle_pages <= 0)
> >                 check and wake up head task
> > 
> > In wb_writeback(), it would do
> > 
> >         if (args->for_background && exiting)
> >                 wake up all throttled tasks

> > To prevent wake up too many tasks at the same time, it can relax the
> > background threshold a bit, so that __bdi_writeout_inc() become the
> > only wake up point in normal cases.
> > 
> >         if (args->for_background && !list_empty(bdi->throttle_list) &&
> >                 over background_thresh - background_thresh / 32)
> >                 keep write pages;

I realized this last change is not necessary, because we already
have a big enough buffer area:

(dirty_thresh + background_thresh)/2  ==> background_thresh

> Right, something like that ought to work well, or at least sounds like
> worth a try ;-)

Thanks :)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-24  8:33                         ` Wu Fengguang
  2009-09-24 15:38                           ` Peter Zijlstra
@ 2009-09-29 17:35                           ` Jan Kara
  2009-09-30  1:24                             ` Wu Fengguang
  1 sibling, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-09-29 17:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm,
	Theodore Ts'o

On Thu 24-09-09 16:33:42, Wu Fengguang wrote:
> On Mon, Sep 14, 2009 at 07:17:21PM +0800, Jan Kara wrote:
> > On Thu 10-09-09 17:49:10, Peter Zijlstra wrote:
> > > On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
> > > >   Well, what I imagined we could do is:
> > > > Have a per-bdi variable 'pages_written' - that would reflect the amount of
> > > > pages written to the bdi since boot (OK, we'd have to handle overflows but
> > > > that's doable).
> > > > 
> > > > There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> > > > in balance_dirty_pages() because we are over limits, it kicks writeback thread
> > > > and does:
> > > >   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> > > > whatever number we decide)
> > > >   pages_waited = to_wait
> > > >   sleep until pages_written reaches to_wait or we drop below dirty limits.
> > > > 
> > > > That will make sure each thread will sleep until writeback threads have done
> > > > their duty for the writing thread.
> > > > 
> > > > If we make sure sleeping threads are properly ordered on the wait queue,
> > > > we could always wakeup just the first one and thus avoid the herding
> > > > effect. When we drop below dirty limits, we would just wakeup the whole
> > > > waitqueue.
> > > > 
> > > > Does this sound reasonable?
> > > 
> > > That seems to go wrong when there's multiple tasks waiting on the same
> > > bdi, you'd count each page for 1/n its weight.
> > > 
> > > Suppose pages_written = 1024, and 4 tasks block and compute their to
> > > wait as pages_written + 256 = 1280, then we'd release all 4 of them
> > > after 256 pages are written, instead of 4*256, which would be
> > > pages_written = 2048.
> >   Well, there's some locking needed of course. The intent is to stack
> > demands as they come. So in case pages_written = 1024, pages_waited = 1024
> > we would do:
> > THREAD 1:
> > 
> > spin_lock
> > to_wait = 1024 + 256
> > pages_waited = 1280
> > spin_unlock
> > 
> > THREAD 2:
> > 
> > spin_lock
> > to_wait = 1280 + 256
> > pages_waited = 1536
> > spin_unlock
> > 
> >   So weight of each page will be kept. The fact that second thread
> > effectively waits until the first thread has its demand satisfied looks
> > strange at the first sight but we don't do better currently and I think
> > it's fine - if they were two writer threads, then soon the thread released
> > first will queue behind the thread still waiting so long term the behavior
> > should be fair.
> 
> Yeah, FIFO queuing should be good enough.
> 
> I'd like to propose one more data structure for evaluation :)
> 
> - bdi->throttle_lock
> - bdi->throttle_list    pages to sync for each waiting task, taken from sync_writeback_pages()
> - bdi->throttle_pages   (counted down) pages to sync for the head task, shall be atomic_t
> 
> In balance_dirty_pages(), it would do
> 
>         nr_to_sync = sync_writeback_pages()
>         if (list_empty(bdi->throttle_list))  # I'm the only task
>                 bdi->throttle_pages = nr_to_sync
>         append nr_to_sync to bdi->throttle_list
>         kick off background writeback
>         wait
>         remove itself from bdi->throttle_list and wait list
>         set bdi->throttle_pages for new head task (or LONG_MAX)
> 
> In __bdi_writeout_inc(), it would do
> 
>         if (--bdi->throttle_pages <= 0)
>                 check and wake up head task
  Yeah, this would work as well. I don't see a big difference between my
approach and this so if you get to implementing this, I'm happy :).

> In wb_writeback(), it would do
> 
>         if (args->for_background && exiting)
>                 wake up all throttled tasks
> To prevent wake up too many tasks at the same time, it can relax the
> background threshold a bit, so that __bdi_writeout_inc() become the
> only wake up point in normal cases.
> 
>         if (args->for_background && !list_empty(bdi->throttle_list) &&
>                 over background_thresh - background_thresh / 32)
>                 keep write pages;
  We want to wakeup tasks when we get below dirty_limit (either global
or per-bdi). Not when we get below background threshold...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-29 17:35                           ` Jan Kara
@ 2009-09-30  1:24                             ` Wu Fengguang
  2009-09-30 11:55                               ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Wu Fengguang @ 2009-09-30  1:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Wed, Sep 30, 2009 at 01:35:06AM +0800, Jan Kara wrote:
> On Thu 24-09-09 16:33:42, Wu Fengguang wrote:
> > On Mon, Sep 14, 2009 at 07:17:21PM +0800, Jan Kara wrote:
> > > On Thu 10-09-09 17:49:10, Peter Zijlstra wrote:
> > > > On Wed, 2009-09-09 at 16:23 +0200, Jan Kara wrote:
> > > > >   Well, what I imagined we could do is:
> > > > > Have a per-bdi variable 'pages_written' - that would reflect the amount of
> > > > > pages written to the bdi since boot (OK, we'd have to handle overflows but
> > > > > that's doable).
> > > > > 
> > > > > There will be a per-bdi variable 'pages_waited'. When a thread should sleep
> > > > > in balance_dirty_pages() because we are over limits, it kicks writeback thread
> > > > > and does:
> > > > >   to_wait =  max(pages_waited, pages_written) + sync_dirty_pages() (or
> > > > > whatever number we decide)
> > > > >   pages_waited = to_wait
> > > > >   sleep until pages_written reaches to_wait or we drop below dirty limits.
> > > > > 
> > > > > That will make sure each thread will sleep until writeback threads have done
> > > > > their duty for the writing thread.
> > > > > 
> > > > > If we make sure sleeping threads are properly ordered on the wait queue,
> > > > > we could always wakeup just the first one and thus avoid the herding
> > > > > effect. When we drop below dirty limits, we would just wakeup the whole
> > > > > waitqueue.
> > > > > 
> > > > > Does this sound reasonable?
> > > > 
> > > > That seems to go wrong when there's multiple tasks waiting on the same
> > > > bdi, you'd count each page for 1/n its weight.
> > > > 
> > > > Suppose pages_written = 1024, and 4 tasks block and compute their to
> > > > wait as pages_written + 256 = 1280, then we'd release all 4 of them
> > > > after 256 pages are written, instead of 4*256, which would be
> > > > pages_written = 2048.
> > >   Well, there's some locking needed of course. The intent is to stack
> > > demands as they come. So in case pages_written = 1024, pages_waited = 1024
> > > we would do:
> > > THREAD 1:
> > > 
> > > spin_lock
> > > to_wait = 1024 + 256
> > > pages_waited = 1280
> > > spin_unlock
> > > 
> > > THREAD 2:
> > > 
> > > spin_lock
> > > to_wait = 1280 + 256
> > > pages_waited = 1536
> > > spin_unlock
> > > 
> > >   So weight of each page will be kept. The fact that second thread
> > > effectively waits until the first thread has its demand satisfied looks
> > > strange at the first sight but we don't do better currently and I think
> > > it's fine - if they were two writer threads, then soon the thread released
> > > first will queue behind the thread still waiting so long term the behavior
> > > should be fair.
> > 
> > Yeah, FIFO queuing should be good enough.
> > 
> > I'd like to propose one more data structure for evaluation :)
> > 
> > - bdi->throttle_lock
> > - bdi->throttle_list    pages to sync for each waiting task, taken from sync_writeback_pages()
> > - bdi->throttle_pages   (counted down) pages to sync for the head task, shall be atomic_t
> > 
> > In balance_dirty_pages(), it would do
> > 
> >         nr_to_sync = sync_writeback_pages()
> >         if (list_empty(bdi->throttle_list))  # I'm the only task
> >                 bdi->throttle_pages = nr_to_sync
> >         append nr_to_sync to bdi->throttle_list
> >         kick off background writeback
> >         wait
> >         remove itself from bdi->throttle_list and wait list
> >         set bdi->throttle_pages for new head task (or LONG_MAX)
> > 
> > In __bdi_writeout_inc(), it would do
> > 
> >         if (--bdi->throttle_pages <= 0)
> >                 check and wake up head task
>   Yeah, this would work as well. I don't see a big difference between my
> approach and this so if you get to implementing this, I'm happy :).

Thanks. Here is a prototype implementation for preview :)

> > In wb_writeback(), it would do
> > 
> >         if (args->for_background && exiting)
> >                 wake up all throttled tasks
> > To prevent wake up too many tasks at the same time, it can relax the
> > background threshold a bit, so that __bdi_writeout_inc() become the
> > only wake up point in normal cases.
> > 
> >         if (args->for_background && !list_empty(bdi->throttle_list) &&
> >                 over background_thresh - background_thresh / 32)
> >                 keep write pages;
>   We want to wakeup tasks when we get below dirty_limit (either global
> or per-bdi). Not when we get below background threshold...

I did a trick to add one bdi work from each waiting task, and remove
it when the task is woken up :)

Thanks,
Fengguang

---
writeback: let balance_dirty_pages() wait on background writeback

CC: Chris Mason <chris.mason@oracle.com> 
CC: Dave Chinner <david@fromorbit.com> 
CC: Jan Kara <jack@suse.cz> 
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
CC: Jens Axboe <jens.axboe@oracle.com> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   89 ++++++++++++++++++++++++++++++++--
 include/linux/backing-dev.h |   15 +++++
 mm/backing-dev.c            |    4 +
 mm/page-writeback.c         |   43 ++--------------
 4 files changed, 109 insertions(+), 42 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-09-28 19:01:40.000000000 +0800
+++ linux/mm/page-writeback.c	2009-09-28 19:02:48.000000000 +0800
@@ -218,6 +218,10 @@ static inline void __bdi_writeout_inc(st
 {
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
+
+	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
+	    atomic_dec_and_test(&bdi->throttle_pages))
+		bdi_writeback_wakeup(bdi);
 }
 
 void bdi_writeout_inc(struct backing_dev_info *bdi)
@@ -458,20 +462,10 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
 	int dirty_exceeded;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.bdi		= bdi,
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 				 global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK) +
@@ -518,31 +512,7 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wbc(&wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			/* don't wait if we've done enough */
-			if (pages_written >= write_chunk)
-				break;
-		}
-		schedule_timeout_interruptible(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
+		bdi_writeback_wait(bdi, write_chunk);
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -559,8 +529,7 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (!laptop_mode && (nr_reclaimable > background_thresh))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
--- linux.orig/include/linux/backing-dev.h	2009-09-28 18:52:51.000000000 +0800
+++ linux/include/linux/backing-dev.h	2009-09-28 19:02:45.000000000 +0800
@@ -86,6 +86,13 @@ struct backing_dev_info {
 
 	struct list_head work_list;
 
+	/*
+	 * dirtier process throttling
+	 */
+	spinlock_t		throttle_lock;
+	struct list_head	throttle_list;	/* nr to sync for each task */
+	atomic_t		throttle_pages; /* nr to sync for head task */
+
 	struct device *dev;
 
 #ifdef CONFIG_DEBUG_FS
@@ -94,6 +101,12 @@ struct backing_dev_info {
 #endif
 };
 
+/*
+ * when no task is throttled, set throttle_pages to larger than this,
+ * to avoid unnecessary atomic decreases.
+ */
+#define DIRTY_THROTTLE_PAGES_STOP	(1 << 22)
+
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
@@ -105,6 +118,8 @@ void bdi_start_writeback(struct backing_
 				long nr_pages);
 int bdi_writeback_task(struct bdi_writeback *wb);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
+int bdi_writeback_wakeup(struct backing_dev_info *bdi);
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
--- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
@@ -25,6 +25,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/completion.h>
 #include "internal.h"
 
 #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
@@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
 		call_rcu(&work->rcu_head, bdi_work_free);
 }
 
-static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
+static void wb_clear_pending(struct backing_dev_info *bdi,
+			     struct bdi_work *work)
 {
 	/*
 	 * The caller has retrieved the work arguments from this work,
 	 * drop our reference. If this is the last ref, delete and free it
 	 */
 	if (atomic_dec_and_test(&work->pending)) {
-		struct backing_dev_info *bdi = wb->bdi;
 
 		spin_lock(&bdi->wb_lock);
 		list_del_rcu(&work->list);
@@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
 	bdi_alloc_queue_work(bdi, &args);
 }
 
+struct dirty_throttle_task {
+	long			nr_pages;
+	struct list_head	list;
+	struct completion	complete;
+};
+
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
+{
+	struct dirty_throttle_task tt = {
+		.nr_pages = nr_pages,
+		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
+	};
+	struct wb_writeback_args args = {
+		.sync_mode	= WB_SYNC_NONE,
+		.nr_pages	= LONG_MAX,
+		.range_cyclic	= 1,
+		.for_background	= 1,
+	};
+	struct bdi_work work;
+
+	bdi_work_init(&work, &args);
+	work.state |= WS_ONSTACK;
+
+	/*
+	 * make sure we will be waken up by someone
+	 */
+	bdi_queue_work(bdi, &work);
+
+	/*
+	 * register throttle pages
+	 */
+	spin_lock(&bdi->throttle_lock);
+	if (list_empty(&bdi->throttle_list))
+		atomic_set(&bdi->throttle_pages, nr_pages);
+	list_add(&tt.list, &bdi->throttle_list);
+	spin_unlock(&bdi->throttle_lock);
+
+	wait_for_completion(&tt.complete);
+
+	wb_clear_pending(bdi, &work); /* XXX */
+}
+
+/*
+ * return 1 if there are more waiting tasks.
+ */
+int bdi_writeback_wakeup(struct backing_dev_info *bdi)
+{
+	struct dirty_throttle_task *tt;
+
+	spin_lock(&bdi->throttle_lock);
+	/*
+	 * remove and wakeup head task
+	 */
+	if (!list_empty(&bdi->throttle_list)) {
+		tt = list_entry(bdi->throttle_list.prev,
+				struct dirty_throttle_task, list);
+		list_del(&tt->list);
+		complete(&tt->complete);
+	}
+	/*
+	 * update throttle pages
+	 */
+	if (!list_empty(&bdi->throttle_list)) {
+		tt = list_entry(bdi->throttle_list.prev,
+				struct dirty_throttle_task, list);
+		atomic_set(&bdi->throttle_pages, tt->nr_pages);
+	} else {
+		tt = NULL;
+		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+	}
+	spin_unlock(&bdi->throttle_lock);
+
+	return tt != NULL;
+}
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (args->for_background && !over_bground_thresh())
+		if (args->for_background && !over_bground_thresh()) {
+			while (bdi_writeback_wakeup(wb->bdi))
+				;  /* unthrottle all tasks */
 			break;
+		}
 
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
@@ -911,7 +990,7 @@ long wb_do_writeback(struct bdi_writebac
 		 * that we have seen this work and we are now starting it.
 		 */
 		if (args.sync_mode == WB_SYNC_NONE)
-			wb_clear_pending(wb, work);
+			wb_clear_pending(bdi, work);
 
 		wrote += wb_writeback(wb, &args);
 
@@ -920,7 +999,7 @@ long wb_do_writeback(struct bdi_writebac
 		 * notification when we have completed the work.
 		 */
 		if (args.sync_mode == WB_SYNC_ALL)
-			wb_clear_pending(wb, work);
+			wb_clear_pending(bdi, work);
 	}
 
 	/*
--- linux.orig/mm/backing-dev.c	2009-09-28 18:52:18.000000000 +0800
+++ linux/mm/backing-dev.c	2009-09-28 19:02:45.000000000 +0800
@@ -645,6 +645,10 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->wb_mask = 1;
 	bdi->wb_cnt = 1;
 
+	spin_lock_init(&bdi->throttle_lock);
+	INIT_LIST_HEAD(&bdi->throttle_list);
+	atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
 		if (err)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-30  1:24                             ` Wu Fengguang
@ 2009-09-30 11:55                               ` Jan Kara
  2009-09-30 12:10                                 ` Jens Axboe
  2009-10-01 13:36                                 ` Wu Fengguang
  0 siblings, 2 replies; 76+ messages in thread
From: Jan Kara @ 2009-09-30 11:55 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm,
	Theodore Ts'o

> writeback: let balance_dirty_pages() wait on background writeback
> 
> CC: Chris Mason <chris.mason@oracle.com> 
> CC: Dave Chinner <david@fromorbit.com> 
> CC: Jan Kara <jack@suse.cz> 
> CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
> CC: Jens Axboe <jens.axboe@oracle.com> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   89 ++++++++++++++++++++++++++++++++--
>  include/linux/backing-dev.h |   15 +++++
>  mm/backing-dev.c            |    4 +
>  mm/page-writeback.c         |   43 ++--------------
>  4 files changed, 109 insertions(+), 42 deletions(-)
> 
> --- linux.orig/mm/page-writeback.c	2009-09-28 19:01:40.000000000 +0800
> +++ linux/mm/page-writeback.c	2009-09-28 19:02:48.000000000 +0800
> @@ -218,6 +218,10 @@ static inline void __bdi_writeout_inc(st
>  {
>  	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
>  			      bdi->max_prop_frac);
> +
> +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
> +	    atomic_dec_and_test(&bdi->throttle_pages))
> +		bdi_writeback_wakeup(bdi);
>  }
>  
>  void bdi_writeout_inc(struct backing_dev_info *bdi)
> @@ -458,20 +462,10 @@ static void balance_dirty_pages(struct a
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long pages_written = 0;
> -	unsigned long pause = 1;
>  	int dirty_exceeded;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  
>  	for (;;) {
> -		struct writeback_control wbc = {
> -			.bdi		= bdi,
> -			.sync_mode	= WB_SYNC_NONE,
> -			.older_than_this = NULL,
> -			.nr_to_write	= write_chunk,
> -			.range_cyclic	= 1,
> -		};
> -
>  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
>  				 global_page_state(NR_UNSTABLE_NFS);
>  		nr_writeback = global_page_state(NR_WRITEBACK) +
> @@ -518,31 +512,7 @@ static void balance_dirty_pages(struct a
>  		if (!bdi->dirty_exceeded)
>  			bdi->dirty_exceeded = 1;
>  
> -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> -		 * Unstable writes are a feature of certain networked
> -		 * filesystems (i.e. NFS) in which data may have been
> -		 * written to the server's write cache, but has not yet
> -		 * been flushed to permanent storage.
> -		 * Only move pages to writeback if this bdi is over its
> -		 * threshold otherwise wait until the disk writes catch
> -		 * up.
> -		 */
> -		if (bdi_nr_reclaimable > bdi_thresh) {
> -			writeback_inodes_wbc(&wbc);
> -			pages_written += write_chunk - wbc.nr_to_write;
> -			/* don't wait if we've done enough */
> -			if (pages_written >= write_chunk)
> -				break;
> -		}
> -		schedule_timeout_interruptible(pause);
> -
> -		/*
> -		 * Increase the delay for each loop, up to our previous
> -		 * default of taking a 100ms nap.
> -		 */
> -		pause <<= 1;
> -		if (pause > HZ / 10)
> -			pause = HZ / 10;
> +		bdi_writeback_wait(bdi, write_chunk);
>  	}
>  
>  	if (!dirty_exceeded && bdi->dirty_exceeded)
> @@ -559,8 +529,7 @@ static void balance_dirty_pages(struct a
>  	 * In normal mode, we start background writeout at the lower
>  	 * background_thresh, to keep the amount of dirty memory low.
>  	 */
> -	if ((laptop_mode && pages_written) ||
> -	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> +	if (!laptop_mode && (nr_reclaimable > background_thresh))
>  		bdi_start_writeback(bdi, NULL, 0);
>  }
>  
> --- linux.orig/include/linux/backing-dev.h	2009-09-28 18:52:51.000000000 +0800
> +++ linux/include/linux/backing-dev.h	2009-09-28 19:02:45.000000000 +0800
> @@ -86,6 +86,13 @@ struct backing_dev_info {
>  
>  	struct list_head work_list;
>  
> +	/*
> +	 * dirtier process throttling
> +	 */
> +	spinlock_t		throttle_lock;
> +	struct list_head	throttle_list;	/* nr to sync for each task */
> +	atomic_t		throttle_pages; /* nr to sync for head task */
> +
>  	struct device *dev;
>  
>  #ifdef CONFIG_DEBUG_FS
> @@ -94,6 +101,12 @@ struct backing_dev_info {
>  #endif
>  };
>  
> +/*
> + * when no task is throttled, set throttle_pages to larger than this,
> + * to avoid unnecessary atomic decreases.
> + */
> +#define DIRTY_THROTTLE_PAGES_STOP	(1 << 22)
> +
>  int bdi_init(struct backing_dev_info *bdi);
>  void bdi_destroy(struct backing_dev_info *bdi);
>  
> @@ -105,6 +118,8 @@ void bdi_start_writeback(struct backing_
>  				long nr_pages);
>  int bdi_writeback_task(struct bdi_writeback *wb);
>  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi);
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
>  
>  extern spinlock_t bdi_lock;
>  extern struct list_head bdi_list;
> --- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
> +++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
> @@ -25,6 +25,7 @@
>  #include <linux/blkdev.h>
>  #include <linux/backing-dev.h>
>  #include <linux/buffer_head.h>
> +#include <linux/completion.h>
>  #include "internal.h"
>  
>  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> @@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
>  		call_rcu(&work->rcu_head, bdi_work_free);
>  }
>  
> -static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
> +static void wb_clear_pending(struct backing_dev_info *bdi,
> +			     struct bdi_work *work)
>  {
>  	/*
>  	 * The caller has retrieved the work arguments from this work,
>  	 * drop our reference. If this is the last ref, delete and free it
>  	 */
>  	if (atomic_dec_and_test(&work->pending)) {
> -		struct backing_dev_info *bdi = wb->bdi;
>  
>  		spin_lock(&bdi->wb_lock);
>  		list_del_rcu(&work->list);
> @@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
>  	bdi_alloc_queue_work(bdi, &args);
>  }
>  
> +struct dirty_throttle_task {
> +	long			nr_pages;
> +	struct list_head	list;
> +	struct completion	complete;
> +};
> +
> +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> +{
> +	struct dirty_throttle_task tt = {
> +		.nr_pages = nr_pages,
> +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> +	};
> +	struct wb_writeback_args args = {
> +		.sync_mode	= WB_SYNC_NONE,
> +		.nr_pages	= LONG_MAX,
> +		.range_cyclic	= 1,
> +		.for_background	= 1,
> +	};
> +	struct bdi_work work;
> +
> +	bdi_work_init(&work, &args);
> +	work.state |= WS_ONSTACK;
> +
> +	/*
> +	 * make sure we will be waken up by someone
> +	 */
> +	bdi_queue_work(bdi, &work);
  This is wrong, you shouldn't submit the work like this because you'll
have to wait for completion (wb_clear_pending below is just bogus). You
should rather do bdi_start_writeback(bdi, NULL, 0).

> +
> +	/*
> +	 * register throttle pages
> +	 */
> +	spin_lock(&bdi->throttle_lock);
> +	if (list_empty(&bdi->throttle_list))
> +		atomic_set(&bdi->throttle_pages, nr_pages);
> +	list_add(&tt.list, &bdi->throttle_list);
> +	spin_unlock(&bdi->throttle_lock);
> +
> +	wait_for_completion(&tt.complete);
> +
> +	wb_clear_pending(bdi, &work); /* XXX */
> +}
> +
> +/*
> + * return 1 if there are more waiting tasks.
> + */
> +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> +{
> +	struct dirty_throttle_task *tt;
> +
> +	spin_lock(&bdi->throttle_lock);
> +	/*
> +	 * remove and wakeup head task
> +	 */
> +	if (!list_empty(&bdi->throttle_list)) {
> +		tt = list_entry(bdi->throttle_list.prev,
> +				struct dirty_throttle_task, list);
> +		list_del(&tt->list);
> +		complete(&tt->complete);
> +	}
> +	/*
> +	 * update throttle pages
> +	 */
> +	if (!list_empty(&bdi->throttle_list)) {
> +		tt = list_entry(bdi->throttle_list.prev,
> +				struct dirty_throttle_task, list);
> +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> +	} else {
> +		tt = NULL;
> +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
  Why is here * 2?

> +	}
> +	spin_unlock(&bdi->throttle_lock);
> +
> +	return tt != NULL;
> +}
> +
>  /*
>   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
>   * furthest end of its superblock's dirty-inode list.
> @@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> -		if (args->for_background && !over_bground_thresh())
> +		if (args->for_background && !over_bground_thresh()) {
> +			while (bdi_writeback_wakeup(wb->bdi))
> +				;  /* unthrottle all tasks */
>  			break;
> +		}
  You probably didn't understand my comment in the previous email. This is
too late to wake up all the tasks. There are two limits - background_limit
(set to 5%) and dirty_limit (set to 10%). When the amount of dirty data is
above background_limit, we start the writeback but we don't throttle tasks
yet. We start throttling tasks only when the amount of dirty data on the bdi
exceeds the part of the dirty limit belonging to the bdi. In case of a
single bdi, this means we start throttling threads only when 10% of memory
is dirty. To keep this behavior, we have to wake up waiting threads as soon
as their BDI gets below the dirty limit or when the global number of dirty
pages gets below (background_limit + dirty_limit) / 2.
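
(Expressed as code, that condition would be something along these lines;
the helper and its parameters are purely illustrative, not existing kernel
interfaces:)

#include <stdbool.h>

/* may throttled dirtiers be released? */
static bool can_unthrottle(unsigned long bdi_dirty, unsigned long bdi_dirty_limit,
			   unsigned long nr_dirty,
			   unsigned long background_limit,
			   unsigned long dirty_limit)
{
	return bdi_dirty < bdi_dirty_limit ||
	       nr_dirty < (background_limit + dirty_limit) / 2;
}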

>  
>  		wbc.more_io = 0;
>  		wbc.encountered_congestion = 0;
> @@ -911,7 +990,7 @@ long wb_do_writeback(struct bdi_writebac
>  		 * that we have seen this work and we are now starting it.
>  		 */
>  		if (args.sync_mode == WB_SYNC_NONE)
> -			wb_clear_pending(wb, work);
> +			wb_clear_pending(bdi, work);
>  
>  		wrote += wb_writeback(wb, &args);
>  
> @@ -920,7 +999,7 @@ long wb_do_writeback(struct bdi_writebac
>  		 * notification when we have completed the work.
>  		 */
>  		if (args.sync_mode == WB_SYNC_ALL)
> -			wb_clear_pending(wb, work);
> +			wb_clear_pending(bdi, work);
>  	}
>  
>  	/*
> --- linux.orig/mm/backing-dev.c	2009-09-28 18:52:18.000000000 +0800
> +++ linux/mm/backing-dev.c	2009-09-28 19:02:45.000000000 +0800
> @@ -645,6 +645,10 @@ int bdi_init(struct backing_dev_info *bd
>  	bdi->wb_mask = 1;
>  	bdi->wb_cnt = 1;
>  
> +	spin_lock_init(&bdi->throttle_lock);
> +	INIT_LIST_HEAD(&bdi->throttle_list);
> +	atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> +
  Again, why is the * 2 here? I'd just set DIRTY_THROTTLE_PAGES_STOP to some
magic value (like ~0) and use it directly...

>  	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
>  		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
>  		if (err)

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-30 11:55                               ` Jan Kara
@ 2009-09-30 12:10                                 ` Jens Axboe
  2009-10-01 15:17                                   ` Wu Fengguang
  2009-10-01 13:36                                 ` Wu Fengguang
  1 sibling, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-30 12:10 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Wed, Sep 30 2009, Jan Kara wrote:
> > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > +{
> > +	struct dirty_throttle_task tt = {
> > +		.nr_pages = nr_pages,
> > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > +	};
> > +	struct wb_writeback_args args = {
> > +		.sync_mode	= WB_SYNC_NONE,
> > +		.nr_pages	= LONG_MAX,
> > +		.range_cyclic	= 1,
> > +		.for_background	= 1,
> > +	};
> > +	struct bdi_work work;
> > +
> > +	bdi_work_init(&work, &args);
> > +	work.state |= WS_ONSTACK;
> > +
> > +	/*
> > +	 * make sure we will be waken up by someone
> > +	 */
> > +	bdi_queue_work(bdi, &work);
>   This is wrong, you shouldn't submit the work like this because you'll
> have to wait for completion (wb_clear_pending below is just bogus). You
> should rather do bdi_start_writeback(bdi, NULL, 0).

Indeed, the above will die a horrible death fairly soon. But we can add
some "barrier"-like synchronization if you just wish to wait for
previously submitted work to have completed.
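
Something like the sketch below is what I have in mind, purely as an
illustration: the 'done' completion and the flusher-side hook that would
fire it do not exist today, so treat every name here as hypothetical.

struct bdi_work_barrier {
	struct bdi_work		work;
	struct completion	done;
};

static void bdi_writeback_barrier(struct backing_dev_info *bdi,
				  struct wb_writeback_args *args)
{
	struct bdi_work_barrier barrier = {
		.done = COMPLETION_INITIALIZER_ONSTACK(barrier.done),
	};

	bdi_work_init(&barrier.work, args);
	barrier.work.state |= WS_ONSTACK;

	/* queued behind all work submitted to this bdi before us */
	bdi_queue_work(bdi, &barrier.work);

	/* the flusher thread would complete this once it reaches our entry */
	wait_for_completion(&barrier.done);
}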


-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-30 11:55                               ` Jan Kara
  2009-09-30 12:10                                 ` Jens Axboe
@ 2009-10-01 13:36                                 ` Wu Fengguang
  2009-10-01 14:22                                   ` Jan Kara
  1 sibling, 1 reply; 76+ messages in thread
From: Wu Fengguang @ 2009-10-01 13:36 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Wed, Sep 30, 2009 at 07:55:39PM +0800, Jan Kara wrote:
> > writeback: let balance_dirty_pages() wait on background writeback
> > 
> > CC: Chris Mason <chris.mason@oracle.com> 
> > CC: Dave Chinner <david@fromorbit.com> 
> > CC: Jan Kara <jack@suse.cz> 
> > CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
> > CC: Jens Axboe <jens.axboe@oracle.com> 
> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> > ---
> >  fs/fs-writeback.c           |   89 ++++++++++++++++++++++++++++++++--
> >  include/linux/backing-dev.h |   15 +++++
> >  mm/backing-dev.c            |    4 +
> >  mm/page-writeback.c         |   43 ++--------------
> >  4 files changed, 109 insertions(+), 42 deletions(-)
> > 
> > --- linux.orig/mm/page-writeback.c	2009-09-28 19:01:40.000000000 +0800
> > +++ linux/mm/page-writeback.c	2009-09-28 19:02:48.000000000 +0800
> > @@ -218,6 +218,10 @@ static inline void __bdi_writeout_inc(st
> >  {
> >  	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
> >  			      bdi->max_prop_frac);
> > +
> > +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
> > +	    atomic_dec_and_test(&bdi->throttle_pages))
> > +		bdi_writeback_wakeup(bdi);
> >  }
> >  
> >  void bdi_writeout_inc(struct backing_dev_info *bdi)
> > @@ -458,20 +462,10 @@ static void balance_dirty_pages(struct a
> >  	unsigned long background_thresh;
> >  	unsigned long dirty_thresh;
> >  	unsigned long bdi_thresh;
> > -	unsigned long pages_written = 0;
> > -	unsigned long pause = 1;
> >  	int dirty_exceeded;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  
> >  	for (;;) {
> > -		struct writeback_control wbc = {
> > -			.bdi		= bdi,
> > -			.sync_mode	= WB_SYNC_NONE,
> > -			.older_than_this = NULL,
> > -			.nr_to_write	= write_chunk,
> > -			.range_cyclic	= 1,
> > -		};
> > -
> >  		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
> >  				 global_page_state(NR_UNSTABLE_NFS);
> >  		nr_writeback = global_page_state(NR_WRITEBACK) +
> > @@ -518,31 +512,7 @@ static void balance_dirty_pages(struct a
> >  		if (!bdi->dirty_exceeded)
> >  			bdi->dirty_exceeded = 1;
> >  
> > -		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
> > -		 * Unstable writes are a feature of certain networked
> > -		 * filesystems (i.e. NFS) in which data may have been
> > -		 * written to the server's write cache, but has not yet
> > -		 * been flushed to permanent storage.
> > -		 * Only move pages to writeback if this bdi is over its
> > -		 * threshold otherwise wait until the disk writes catch
> > -		 * up.
> > -		 */
> > -		if (bdi_nr_reclaimable > bdi_thresh) {
> > -			writeback_inodes_wbc(&wbc);
> > -			pages_written += write_chunk - wbc.nr_to_write;
> > -			/* don't wait if we've done enough */
> > -			if (pages_written >= write_chunk)
> > -				break;
> > -		}
> > -		schedule_timeout_interruptible(pause);
> > -
> > -		/*
> > -		 * Increase the delay for each loop, up to our previous
> > -		 * default of taking a 100ms nap.
> > -		 */
> > -		pause <<= 1;
> > -		if (pause > HZ / 10)
> > -			pause = HZ / 10;
> > +		bdi_writeback_wait(bdi, write_chunk);

Added a "break;" line here: we can remove the loop now :)

> >  	}
> >  
> >  	if (!dirty_exceeded && bdi->dirty_exceeded)
> > @@ -559,8 +529,7 @@ static void balance_dirty_pages(struct a
> >  	 * In normal mode, we start background writeout at the lower
> >  	 * background_thresh, to keep the amount of dirty memory low.
> >  	 */
> > -	if ((laptop_mode && pages_written) ||
> > -	    (!laptop_mode && (nr_reclaimable > background_thresh)))
> > +	if (!laptop_mode && (nr_reclaimable > background_thresh))
> >  		bdi_start_writeback(bdi, NULL, 0);
> >  }
> >  
> > --- linux.orig/include/linux/backing-dev.h	2009-09-28 18:52:51.000000000 +0800
> > +++ linux/include/linux/backing-dev.h	2009-09-28 19:02:45.000000000 +0800
> > @@ -86,6 +86,13 @@ struct backing_dev_info {
> >  
> >  	struct list_head work_list;
> >  
> > +	/*
> > +	 * dirtier process throttling
> > +	 */
> > +	spinlock_t		throttle_lock;
> > +	struct list_head	throttle_list;	/* nr to sync for each task */
> > +	atomic_t		throttle_pages; /* nr to sync for head task */
> > +
> >  	struct device *dev;
> >  
> >  #ifdef CONFIG_DEBUG_FS
> > @@ -94,6 +101,12 @@ struct backing_dev_info {
> >  #endif
> >  };
> >  
> > +/*
> > + * when no task is throttled, set throttle_pages to larger than this,
> > + * to avoid unnecessary atomic decreases.
> > + */
> > +#define DIRTY_THROTTLE_PAGES_STOP	(1 << 22)
> > +
> >  int bdi_init(struct backing_dev_info *bdi);
> >  void bdi_destroy(struct backing_dev_info *bdi);
> >  
> > @@ -105,6 +118,8 @@ void bdi_start_writeback(struct backing_
> >  				long nr_pages);
> >  int bdi_writeback_task(struct bdi_writeback *wb);
> >  int bdi_has_dirty_io(struct backing_dev_info *bdi);
> > +int bdi_writeback_wakeup(struct backing_dev_info *bdi);
> > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
> >  
> >  extern spinlock_t bdi_lock;
> >  extern struct list_head bdi_list;
> > --- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
> > +++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
> > @@ -25,6 +25,7 @@
> >  #include <linux/blkdev.h>
> >  #include <linux/backing-dev.h>
> >  #include <linux/buffer_head.h>
> > +#include <linux/completion.h>
> >  #include "internal.h"
> >  
> >  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> > @@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
> >  		call_rcu(&work->rcu_head, bdi_work_free);
> >  }
> >  
> > -static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
> > +static void wb_clear_pending(struct backing_dev_info *bdi,
> > +			     struct bdi_work *work)
> >  {
> >  	/*
> >  	 * The caller has retrieved the work arguments from this work,
> >  	 * drop our reference. If this is the last ref, delete and free it
> >  	 */
> >  	if (atomic_dec_and_test(&work->pending)) {
> > -		struct backing_dev_info *bdi = wb->bdi;
> >  
> >  		spin_lock(&bdi->wb_lock);
> >  		list_del_rcu(&work->list);
> > @@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
> >  	bdi_alloc_queue_work(bdi, &args);
> >  }
> >  
> > +struct dirty_throttle_task {
> > +	long			nr_pages;
> > +	struct list_head	list;
> > +	struct completion	complete;
> > +};
> > +
> > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > +{
> > +	struct dirty_throttle_task tt = {
> > +		.nr_pages = nr_pages,
> > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > +	};
> > +	struct wb_writeback_args args = {
> > +		.sync_mode	= WB_SYNC_NONE,
> > +		.nr_pages	= LONG_MAX,
> > +		.range_cyclic	= 1,
> > +		.for_background	= 1,
> > +	};
> > +	struct bdi_work work;
> > +
> > +	bdi_work_init(&work, &args);
> > +	work.state |= WS_ONSTACK;
> > +
> > +	/*
> > +	 * make sure we will be waken up by someone
> > +	 */
> > +	bdi_queue_work(bdi, &work);
>   This is wrong, you shouldn't submit the work like this because you'll
> have to wait for completion (wb_clear_pending below is just bogus). You
> should rather do bdi_start_writeback(bdi, NULL, 0).

No, I don't intend to wait for completion of this work (that would
wait too long). This bdi work is to ensure writeback IO submissions
are now in progress. Thus __bdi_writeout_inc() will be called to
decrease bdi->throttle_pages, and when it counts down to 0, wake up
this process.

The alternative way is to do

        if (no background work queued)
                bdi_start_writeback(bdi, NULL, 0)

It looks like a saner solution, thanks for the suggestion :)
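
Concretely, something like the sketch below (only a sketch: the bit name
is made up, and test_and_set_bit() returns the old bit value, so a zero
return means no background work was queued yet):

	/* dirtier side: queue at most one background work per bdi */
	if (!test_and_set_bit(BDI_background_queued, &bdi->state))
		bdi_start_writeback(bdi, NULL, 0);

	/* flusher side: when the for_background work is retired */
	clear_bit(BDI_background_queued, &bdi->state);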

> > +
> > +	/*
> > +	 * register throttle pages
> > +	 */
> > +	spin_lock(&bdi->throttle_lock);
> > +	if (list_empty(&bdi->throttle_list))
> > +		atomic_set(&bdi->throttle_pages, nr_pages);
> > +	list_add(&tt.list, &bdi->throttle_list);
> > +	spin_unlock(&bdi->throttle_lock);
> > +
> > +	wait_for_completion(&tt.complete);

> > +	wb_clear_pending(bdi, &work); /* XXX */

For the above reason, I remove the work here and don't care whether it
has been executed, is running, or was never seen at all. We have been woken up.

Sorry, I definitely "misused" wb_clear_pending() for a slightly
different purpose.

This doesn't really cancel the work if it is already running.
So bdi_writeback_wait() achieves another goal of starting background
writeback if the bdi-flush thread was previously idle.

> > +}
> > +
> > +/*
> > + * return 1 if there are more waiting tasks.
> > + */
> > +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> > +{
> > +	struct dirty_throttle_task *tt;
> > +
> > +	spin_lock(&bdi->throttle_lock);
> > +	/*
> > +	 * remove and wakeup head task
> > +	 */
> > +	if (!list_empty(&bdi->throttle_list)) {
> > +		tt = list_entry(bdi->throttle_list.prev,
> > +				struct dirty_throttle_task, list);
> > +		list_del(&tt->list);
> > +		complete(&tt->complete);
> > +	}
> > +	/*
> > +	 * update throttle pages
> > +	 */
> > +	if (!list_empty(&bdi->throttle_list)) {
> > +		tt = list_entry(bdi->throttle_list.prev,
> > +				struct dirty_throttle_task, list);
> > +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> > +	} else {
> > +		tt = NULL;
> > +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
>   Why is here * 2?
 
Because we do a racy test in another place:

    +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
    +	    atomic_dec_and_test(&bdi->throttle_pages))
    +		bdi_writeback_wakeup(bdi);

The *2 is for reducing the race possibility. It might still be racy, but
that's OK, because it's mainly an optimization. It's perfectly correct
if we simply do 

    +	if (atomic_dec_and_test(&bdi->throttle_pages))
    +		bdi_writeback_wakeup(bdi);

> > +	}
> > +	spin_unlock(&bdi->throttle_lock);
> > +
> > +	return tt != NULL;
> > +}
> > +
> >  /*
> >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> >   * furthest end of its superblock's dirty-inode list.
> > @@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
> >  		 * For background writeout, stop when we are below the
> >  		 * background dirty threshold
> >  		 */
> > -		if (args->for_background && !over_bground_thresh())
> > +		if (args->for_background && !over_bground_thresh()) {
> > +			while (bdi_writeback_wakeup(wb->bdi))
> > +				;  /* unthrottle all tasks */
> >  			break;
> > +		}
>   You probably didn't understand my comment in the previous email. This is
> too late to wakeup all the tasks. There are two limits - background_limit
> (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> above background_limit, we start the writeback but we don't throttle tasks
> yet. We start throttling tasks only when amount of dirty data on the bdi
> exceeds the part of the dirty limit belonging to the bdi. In case of a
> single bdi, this means we start throttling threads only when 10% of memory
> is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> as their BDI gets below the dirty limit or when global number of dirty
> pages gets below (background_limit + dirty_limit) / 2.

Sure, but the design goal is to wake up the throttled tasks in the
__bdi_writeout_inc() path instead of here. As long as some (background)
writeback is running, __bdi_writeout_inc() will be called to wake up
the tasks.  This "unthrottle all on exit of background writeback" is
merely a safeguard, since once background writeback (which could be
queued by the throttled task itself, in bdi_writeback_wait) exits, the
calls to __bdi_writeout_inc() are likely to stop.

> >  
> >  		wbc.more_io = 0;
> >  		wbc.encountered_congestion = 0;
> > @@ -911,7 +990,7 @@ long wb_do_writeback(struct bdi_writebac
> >  		 * that we have seen this work and we are now starting it.
> >  		 */
> >  		if (args.sync_mode == WB_SYNC_NONE)
> > -			wb_clear_pending(wb, work);
> > +			wb_clear_pending(bdi, work);
> >  
> >  		wrote += wb_writeback(wb, &args);
> >  
> > @@ -920,7 +999,7 @@ long wb_do_writeback(struct bdi_writebac
> >  		 * notification when we have completed the work.
> >  		 */
> >  		if (args.sync_mode == WB_SYNC_ALL)
> > -			wb_clear_pending(wb, work);
> > +			wb_clear_pending(bdi, work);
> >  	}
> >  
> >  	/*
> > --- linux.orig/mm/backing-dev.c	2009-09-28 18:52:18.000000000 +0800
> > +++ linux/mm/backing-dev.c	2009-09-28 19:02:45.000000000 +0800
> > @@ -645,6 +645,10 @@ int bdi_init(struct backing_dev_info *bd
> >  	bdi->wb_mask = 1;
> >  	bdi->wb_cnt = 1;
> >  
> > +	spin_lock_init(&bdi->throttle_lock);
> > +	INIT_LIST_HEAD(&bdi->throttle_list);
> > +	atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> > +
>   Again, why is * 2 here? I'd just set DIRTY_THROTTLE_PAGES_STOP to some
> magic value (like ~0) and use it directly...

See above. ~0 is not used because atomic_t only guarantees 24 bits of
usable data space, and the larger value also reduces a small race :)

Thanks,
Fengguang

> >  	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
> >  		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
> >  		if (err)
> 
> 								Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-10-01 13:36                                 ` Wu Fengguang
@ 2009-10-01 14:22                                   ` Jan Kara
  2009-10-01 14:54                                     ` Wu Fengguang
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-10-01 14:22 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm,
	Theodore Ts'o

On Thu 01-10-09 21:36:10, Wu Fengguang wrote:
> > > --- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
> > > +++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
> > > @@ -25,6 +25,7 @@
> > >  #include <linux/blkdev.h>
> > >  #include <linux/backing-dev.h>
> > >  #include <linux/buffer_head.h>
> > > +#include <linux/completion.h>
> > >  #include "internal.h"
> > >  
> > >  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> > > @@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
> > >  		call_rcu(&work->rcu_head, bdi_work_free);
> > >  }
> > >  
> > > -static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
> > > +static void wb_clear_pending(struct backing_dev_info *bdi,
> > > +			     struct bdi_work *work)
> > >  {
> > >  	/*
> > >  	 * The caller has retrieved the work arguments from this work,
> > >  	 * drop our reference. If this is the last ref, delete and free it
> > >  	 */
> > >  	if (atomic_dec_and_test(&work->pending)) {
> > > -		struct backing_dev_info *bdi = wb->bdi;
> > >  
> > >  		spin_lock(&bdi->wb_lock);
> > >  		list_del_rcu(&work->list);
> > > @@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
> > >  	bdi_alloc_queue_work(bdi, &args);
> > >  }
> > >  
> > > +struct dirty_throttle_task {
> > > +	long			nr_pages;
> > > +	struct list_head	list;
> > > +	struct completion	complete;
> > > +};
> > > +
> > > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > > +{
> > > +	struct dirty_throttle_task tt = {
> > > +		.nr_pages = nr_pages,
> > > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > > +	};
> > > +	struct wb_writeback_args args = {
> > > +		.sync_mode	= WB_SYNC_NONE,
> > > +		.nr_pages	= LONG_MAX,
> > > +		.range_cyclic	= 1,
> > > +		.for_background	= 1,
> > > +	};
> > > +	struct bdi_work work;
> > > +
> > > +	bdi_work_init(&work, &args);
> > > +	work.state |= WS_ONSTACK;
> > > +
> > > +	/*
> > > +	 * make sure we will be waken up by someone
> > > +	 */
> > > +	bdi_queue_work(bdi, &work);
> >   This is wrong, you shouldn't submit the work like this because you'll
> > have to wait for completion (wb_clear_pending below is just bogus). You
> > should rather do bdi_start_writeback(bdi, NULL, 0).
> 
> No I don't intent to wait for completion of this work (that would
> wait too long). This bdi work is to ensure writeback IO submissions
> are now in progress. Thus __bdi_writeout_inc() will be called to
> decrease bdi->throttle_pages, and when it counts down to 0, wake up
> this process.
> 
> The alternative way is to do
> 
>         if (no background work queued)
>                 bdi_start_writeback(bdi, NULL, 0)
> 
> It looks a saner solution, thanks for the suggestion :)
  Yes, but you'll have a hard time finding whether there's background work
queued or not. So I suggest you just queue the background writeout
unconditionally.

> > > +
> > > +	/*
> > > +	 * register throttle pages
> > > +	 */
> > > +	spin_lock(&bdi->throttle_lock);
> > > +	if (list_empty(&bdi->throttle_list))
> > > +		atomic_set(&bdi->throttle_pages, nr_pages);
> > > +	list_add(&tt.list, &bdi->throttle_list);
> > > +	spin_unlock(&bdi->throttle_lock);
> > > +
> > > +	wait_for_completion(&tt.complete);
> 
> > > +	wb_clear_pending(bdi, &work); /* XXX */
> 
> For the above reason, I remove the work here and don't care whether it
> has been executed or is running or not seen at all. We have been waken up.
> 
> Sorry I definitely "misused" wb_clear_pending() for a slightly
> different purpose..
> 
> This didn't really cancel the work if it has already been running.
> So bdi_writeback_wait() achieves another goal of starting background
> writeback if bdi-flush is previously idle.
> 
> > > +}
> > > +
> > > +/*
> > > + * return 1 if there are more waiting tasks.
> > > + */
> > > +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> > > +{
> > > +	struct dirty_throttle_task *tt;
> > > +
> > > +	spin_lock(&bdi->throttle_lock);
> > > +	/*
> > > +	 * remove and wakeup head task
> > > +	 */
> > > +	if (!list_empty(&bdi->throttle_list)) {
> > > +		tt = list_entry(bdi->throttle_list.prev,
> > > +				struct dirty_throttle_task, list);
> > > +		list_del(&tt->list);
> > > +		complete(&tt->complete);
> > > +	}
> > > +	/*
> > > +	 * update throttle pages
> > > +	 */
> > > +	if (!list_empty(&bdi->throttle_list)) {
> > > +		tt = list_entry(bdi->throttle_list.prev,
> > > +				struct dirty_throttle_task, list);
> > > +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> > > +	} else {
> > > +		tt = NULL;
> > > +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> >   Why is here * 2?
>  
> Because we do a racy test in another place:
> 
>     +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
>     +	    atomic_dec_and_test(&bdi->throttle_pages))
>     +		bdi_writeback_wakeup(bdi);
> 
> The *2 is for reducing the race possibility. It might still be racy, but
> that's OK, because it's mainly an optimization. It's perfectly correct
> if we simply do 
  Ah, I see. OK, then it deserves at least a comment...

>     +	if (atomic_dec_and_test(&bdi->throttle_pages))
>     +		bdi_writeback_wakeup(bdi);
> 
> > > +	}
> > > +	spin_unlock(&bdi->throttle_lock);
> > > +
> > > +	return tt != NULL;
> > > +}
> > > +
> > >  /*
> > >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> > >   * furthest end of its superblock's dirty-inode list.
> > > @@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
> > >  		 * For background writeout, stop when we are below the
> > >  		 * background dirty threshold
> > >  		 */
> > > -		if (args->for_background && !over_bground_thresh())
> > > +		if (args->for_background && !over_bground_thresh()) {
> > > +			while (bdi_writeback_wakeup(wb->bdi))
> > > +				;  /* unthrottle all tasks */
> > >  			break;
> > > +		}
> >   You probably didn't understand my comment in the previous email. This is
> > too late to wakeup all the tasks. There are two limits - background_limit
> > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > above background_limit, we start the writeback but we don't throttle tasks
> > yet. We start throttling tasks only when amount of dirty data on the bdi
> > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > single bdi, this means we start throttling threads only when 10% of memory
> > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > as their BDI gets below the dirty limit or when global number of dirty
> > pages gets below (background_limit + dirty_limit) / 2.
> 
> Sure, but the design goal is to wakeup the throttled tasks in the
> __bdi_writeout_inc() path instead of here. As long as some (background)
> writeback is running, __bdi_writeout_inc() will be called to wakeup
> the tasks.  This "unthrottle all on exit of background writeback" is
> merely a safeguard, since once background writeback (which could be
> queued by the throttled task itself, in bdi_writeback_wait) exits, the
> calls to __bdi_writeout_inc() is likely to stop.
  The thing is: In the old code, tasks returned from balance_dirty_pages()
as soon as we got below dirty_limit, regardless of how much they managed to
write. So we want to wake them up from waiting as soon as we get below the
dirty limit (maybe a bit later so that they don't immediately block again
but I hope you get the point).

							Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-10-01 14:22                                   ` Jan Kara
@ 2009-10-01 14:54                                     ` Wu Fengguang
  2009-10-01 21:35                                       ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Wu Fengguang @ 2009-10-01 14:54 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Thu, Oct 01, 2009 at 10:22:43PM +0800, Jan Kara wrote:
> On Thu 01-10-09 21:36:10, Wu Fengguang wrote:
> > > > --- linux.orig/fs/fs-writeback.c	2009-09-28 18:57:51.000000000 +0800
> > > > +++ linux/fs/fs-writeback.c	2009-09-28 19:02:45.000000000 +0800
> > > > @@ -25,6 +25,7 @@
> > > >  #include <linux/blkdev.h>
> > > >  #include <linux/backing-dev.h>
> > > >  #include <linux/buffer_head.h>
> > > > +#include <linux/completion.h>
> > > >  #include "internal.h"
> > > >  
> > > >  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
> > > > @@ -136,14 +137,14 @@ static void wb_work_complete(struct bdi_
> > > >  		call_rcu(&work->rcu_head, bdi_work_free);
> > > >  }
> > > >  
> > > > -static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
> > > > +static void wb_clear_pending(struct backing_dev_info *bdi,
> > > > +			     struct bdi_work *work)
> > > >  {
> > > >  	/*
> > > >  	 * The caller has retrieved the work arguments from this work,
> > > >  	 * drop our reference. If this is the last ref, delete and free it
> > > >  	 */
> > > >  	if (atomic_dec_and_test(&work->pending)) {
> > > > -		struct backing_dev_info *bdi = wb->bdi;
> > > >  
> > > >  		spin_lock(&bdi->wb_lock);
> > > >  		list_del_rcu(&work->list);
> > > > @@ -275,6 +276,81 @@ void bdi_start_writeback(struct backing_
> > > >  	bdi_alloc_queue_work(bdi, &args);
> > > >  }
> > > >  
> > > > +struct dirty_throttle_task {
> > > > +	long			nr_pages;
> > > > +	struct list_head	list;
> > > > +	struct completion	complete;
> > > > +};
> > > > +
> > > > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > > > +{
> > > > +	struct dirty_throttle_task tt = {
> > > > +		.nr_pages = nr_pages,
> > > > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > > > +	};
> > > > +	struct wb_writeback_args args = {
> > > > +		.sync_mode	= WB_SYNC_NONE,
> > > > +		.nr_pages	= LONG_MAX,
> > > > +		.range_cyclic	= 1,
> > > > +		.for_background	= 1,
> > > > +	};
> > > > +	struct bdi_work work;
> > > > +
> > > > +	bdi_work_init(&work, &args);
> > > > +	work.state |= WS_ONSTACK;
> > > > +
> > > > +	/*
> > > > +	 * make sure we will be waken up by someone
> > > > +	 */
> > > > +	bdi_queue_work(bdi, &work);
> > >   This is wrong, you shouldn't submit the work like this because you'll
> > > have to wait for completion (wb_clear_pending below is just bogus). You
> > > should rather do bdi_start_writeback(bdi, NULL, 0).
> > 
> > No I don't intent to wait for completion of this work (that would
> > wait too long). This bdi work is to ensure writeback IO submissions
> > are now in progress. Thus __bdi_writeout_inc() will be called to
> > decrease bdi->throttle_pages, and when it counts down to 0, wake up
> > this process.
> > 
> > The alternative way is to do
> > 
> >         if (no background work queued)
> >                 bdi_start_writeback(bdi, NULL, 0)
> > 
> > It looks a saner solution, thanks for the suggestion :)
>   Yes, but you'll have hard time finding whether there's background work
> queued or not. So I suggest you just queue the background writeout
> unconditionally.

I added an atomic flag bit WB_FLAG_BACKGROUND_WORK for it :)

It is necessary because balance_dirty_pages() is called frequently and
one background work typically takes a long time to finish, so a huge
amount of memory could be pinned by all the queued works.
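
Condensed, the intended protocol is (sketch only, the real hunks are in
the patch below; note test_and_set_bit() returns the previous bit value,
so a zero return means we are the first to queue background work):

	/* dirtier side (balance_dirty_pages / bdi_writeback_wait) */
	if (!test_and_set_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask))
		bdi_start_writeback(bdi, NULL, 0);

	/* flusher side (wb_clear_pending), when the for_background work is retired */
	clear_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);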

> > > > +
> > > > +	/*
> > > > +	 * register throttle pages
> > > > +	 */
> > > > +	spin_lock(&bdi->throttle_lock);
> > > > +	if (list_empty(&bdi->throttle_list))
> > > > +		atomic_set(&bdi->throttle_pages, nr_pages);
> > > > +	list_add(&tt.list, &bdi->throttle_list);
> > > > +	spin_unlock(&bdi->throttle_lock);
> > > > +
> > > > +	wait_for_completion(&tt.complete);
> > 
> > > > +	wb_clear_pending(bdi, &work); /* XXX */
> > 
> > For the above reason, I remove the work here and don't care whether it
> > has been executed or is running or not seen at all. We have been waken up.
> > 
> > Sorry I definitely "misused" wb_clear_pending() for a slightly
> > different purpose..
> > 
> > This didn't really cancel the work if it has already been running.
> > So bdi_writeback_wait() achieves another goal of starting background
> > writeback if bdi-flush is previously idle.
> > 
> > > > +}
> > > > +
> > > > +/*
> > > > + * return 1 if there are more waiting tasks.
> > > > + */
> > > > +int bdi_writeback_wakeup(struct backing_dev_info *bdi)
> > > > +{
> > > > +	struct dirty_throttle_task *tt;
> > > > +
> > > > +	spin_lock(&bdi->throttle_lock);
> > > > +	/*
> > > > +	 * remove and wakeup head task
> > > > +	 */
> > > > +	if (!list_empty(&bdi->throttle_list)) {
> > > > +		tt = list_entry(bdi->throttle_list.prev,
> > > > +				struct dirty_throttle_task, list);
> > > > +		list_del(&tt->list);
> > > > +		complete(&tt->complete);
> > > > +	}
> > > > +	/*
> > > > +	 * update throttle pages
> > > > +	 */
> > > > +	if (!list_empty(&bdi->throttle_list)) {
> > > > +		tt = list_entry(bdi->throttle_list.prev,
> > > > +				struct dirty_throttle_task, list);
> > > > +		atomic_set(&bdi->throttle_pages, tt->nr_pages);
> > > > +	} else {
> > > > +		tt = NULL;
> > > > +		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
> > >   Why is here * 2?
> >  
> > Because we do a racy test in another place:
> > 
> >     +	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
> >     +	    atomic_dec_and_test(&bdi->throttle_pages))
> >     +		bdi_writeback_wakeup(bdi);
> > 
> > The *2 is for reducing the race possibility. It might still be racy, but
> > that's OK, because it's mainly an optimization. It's perfectly correct
> > if we simply do 
>   Ah, I see. OK, then it deserves at least a comment...

Good suggestion. Here is one:

        /*
         * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
         * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
         * places to reduce the race possibility.
         */     

> >     +	if (atomic_dec_and_test(&bdi->throttle_pages))
> >     +		bdi_writeback_wakeup(bdi);
> > 
> > > > +	}
> > > > +	spin_unlock(&bdi->throttle_lock);
> > > > +
> > > > +	return tt != NULL;
> > > > +}
> > > > +
> > > >  /*
> > > >   * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
> > > >   * furthest end of its superblock's dirty-inode list.
> > > > @@ -788,8 +864,11 @@ static long wb_writeback(struct bdi_writ
> > > >  		 * For background writeout, stop when we are below the
> > > >  		 * background dirty threshold
> > > >  		 */
> > > > -		if (args->for_background && !over_bground_thresh())
> > > > +		if (args->for_background && !over_bground_thresh()) {
> > > > +			while (bdi_writeback_wakeup(wb->bdi))
> > > > +				;  /* unthrottle all tasks */
> > > >  			break;
> > > > +		}
> > >   You probably didn't understand my comment in the previous email. This is
> > > too late to wakeup all the tasks. There are two limits - background_limit
> > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > above background_limit, we start the writeback but we don't throttle tasks
> > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > single bdi, this means we start throttling threads only when 10% of memory
> > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > as their BDI gets below the dirty limit or when global number of dirty
> > > pages gets below (background_limit + dirty_limit) / 2.
> > 
> > Sure, but the design goal is to wakeup the throttled tasks in the
> > __bdi_writeout_inc() path instead of here. As long as some (background)
> > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > the tasks.  This "unthrottle all on exit of background writeback" is
> > merely a safeguard, since once background writeback (which could be
> > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > calls to __bdi_writeout_inc() is likely to stop.
>   The thing is: In the old code, tasks returned from balance_dirty_pages()
> as soon as we got below dirty_limit, regardless of how much they managed to
> write. So we want to wake them up from waiting as soon as we get below the
> dirty limit (maybe a bit later so that they don't immediately block again
> but I hope you get the point).

Ah, good catch!  However, overshooting the threshold by 1MB (maybe more with
concurrent dirtiers) should not be a problem. As you said, that avoids the
task being immediately blocked again.

The old code does the dirty_limit check in an opportunistic manner. There was
no guarantee. 2.6.32 further weakens it with the removal of the congestion backoff.

Here is the updated patch :)

Thanks,
Fengguang
---

writeback: let balance_dirty_pages() wait on background writeback


CC: Chris Mason <chris.mason@oracle.com> 
CC: Dave Chinner <david@fromorbit.com> 
CC: Jan Kara <jack@suse.cz> 
CC: Peter Zijlstra <a.p.zijlstra@chello.nl> 
CC: Jens Axboe <jens.axboe@oracle.com> 
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   92 +++++++++++++++++++++++++++-------
 include/linux/backing-dev.h |   41 ++++++++++++++-
 mm/backing-dev.c            |    4 +
 mm/page-writeback.c         |   53 ++++---------------
 4 files changed, 132 insertions(+), 58 deletions(-)

--- linux.orig/mm/page-writeback.c	2009-10-01 13:34:29.000000000 +0800
+++ linux/mm/page-writeback.c	2009-10-01 22:30:32.000000000 +0800
@@ -218,6 +218,15 @@ static inline void __bdi_writeout_inc(st
 {
 	__prop_inc_percpu_max(&vm_completions, &bdi->completions,
 			      bdi->max_prop_frac);
+
+	/*
+	 * The DIRTY_THROTTLE_PAGES_STOP test is an optional optimization, so
+	 * it's OK to be racy. We set DIRTY_THROTTLE_PAGES_STOP*2 in other
+	 * places to reduce the race possibility.
+	 */
+	if (atomic_read(&bdi->throttle_pages) < DIRTY_THROTTLE_PAGES_STOP &&
+	    atomic_dec_and_test(&bdi->throttle_pages))
+		bdi_writeback_wakeup(bdi);
 }
 
 void bdi_writeout_inc(struct backing_dev_info *bdi)
@@ -458,20 +467,10 @@ static void balance_dirty_pages(struct a
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long pages_written = 0;
-	unsigned long pause = 1;
 	int dirty_exceeded;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 
 	for (;;) {
-		struct writeback_control wbc = {
-			.bdi		= bdi,
-			.sync_mode	= WB_SYNC_NONE,
-			.older_than_this = NULL,
-			.nr_to_write	= write_chunk,
-			.range_cyclic	= 1,
-		};
-
 		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
 				 global_page_state(NR_UNSTABLE_NFS);
 		nr_writeback = global_page_state(NR_WRITEBACK) +
@@ -518,39 +517,13 @@ static void balance_dirty_pages(struct a
 		if (!bdi->dirty_exceeded)
 			bdi->dirty_exceeded = 1;
 
-		/* Note: nr_reclaimable denotes nr_dirty + nr_unstable.
-		 * Unstable writes are a feature of certain networked
-		 * filesystems (i.e. NFS) in which data may have been
-		 * written to the server's write cache, but has not yet
-		 * been flushed to permanent storage.
-		 * Only move pages to writeback if this bdi is over its
-		 * threshold otherwise wait until the disk writes catch
-		 * up.
-		 */
-		if (bdi_nr_reclaimable > bdi_thresh) {
-			writeback_inodes_wbc(&wbc);
-			pages_written += write_chunk - wbc.nr_to_write;
-			/* don't wait if we've done enough */
-			if (pages_written >= write_chunk)
-				break;
-		}
-		schedule_timeout_interruptible(pause);
-
-		/*
-		 * Increase the delay for each loop, up to our previous
-		 * default of taking a 100ms nap.
-		 */
-		pause <<= 1;
-		if (pause > HZ / 10)
-			pause = HZ / 10;
+		bdi_writeback_wait(bdi, write_chunk);
+		break;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
 		bdi->dirty_exceeded = 0;
 
-	if (writeback_in_progress(bdi))
-		return;
-
 	/*
 	 * In laptop mode, we wait until hitting the higher threshold before
 	 * starting background writeout, and then write out all the way down
@@ -559,8 +532,8 @@ static void balance_dirty_pages(struct a
 	 * In normal mode, we start background writeout at the lower
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
-	if ((laptop_mode && pages_written) ||
-	    (!laptop_mode && (nr_reclaimable > background_thresh)))
+	if (!laptop_mode && (nr_reclaimable > background_thresh) &&
+	    can_submit_background_writeback(bdi))
 		bdi_start_writeback(bdi, NULL, 0);
 }
 
--- linux.orig/include/linux/backing-dev.h	2009-10-01 12:37:21.000000000 +0800
+++ linux/include/linux/backing-dev.h	2009-10-01 22:21:52.000000000 +0800
@@ -86,6 +86,13 @@ struct backing_dev_info {
 
 	struct list_head work_list;
 
+	/*
+	 * dirtier process throttling
+	 */
+	spinlock_t		throttle_lock;
+	struct list_head	throttle_list;	/* nr to sync for each task */
+	atomic_t		throttle_pages; /* nr to sync for head task */
+
 	struct device *dev;
 
 #ifdef CONFIG_DEBUG_FS
@@ -94,6 +101,17 @@ struct backing_dev_info {
 #endif
 };
 
+/*
+ * when no task is throttled, set throttle_pages to larger than this,
+ * to avoid unnecessary atomic decreases.
+ */
+#define DIRTY_THROTTLE_PAGES_STOP	(1 << 22)
+
+/*
+ * background work queued; set to avoid queuing redundant background works
+ */
+#define WB_FLAG_BACKGROUND_WORK		30
+
 int bdi_init(struct backing_dev_info *bdi);
 void bdi_destroy(struct backing_dev_info *bdi);
 
@@ -105,6 +123,8 @@ void bdi_start_writeback(struct backing_
 				long nr_pages);
 int bdi_writeback_task(struct bdi_writeback *wb);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
+int bdi_writeback_wakeup(struct backing_dev_info *bdi);
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages);
 
 extern spinlock_t bdi_lock;
 extern struct list_head bdi_list;
@@ -248,7 +268,26 @@ int bdi_set_max_ratio(struct backing_dev
 extern struct backing_dev_info default_backing_dev_info;
 void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page);
 
-int writeback_in_progress(struct backing_dev_info *bdi);
+/**
+ * writeback_in_progress - determine whether there is writeback in progress
+ * @bdi: the device's backing_dev_info structure.
+ *
+ * Determine whether there is writeback waiting to be handled against a
+ * backing device.
+ */
+static inline int writeback_in_progress(struct backing_dev_info *bdi)
+{
+	return !list_empty(&bdi->work_list);
+}
+
+/*
+ * This prevents > 2 for_background writeback works in circulation.
+ * (one running and another queued)
+ */
+static inline int can_submit_background_writeback(struct backing_dev_info *bdi)
+{
+	return !test_and_set_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
+}
 
 static inline int bdi_congested(struct backing_dev_info *bdi, int bdi_bits)
 {
--- linux.orig/fs/fs-writeback.c	2009-10-01 13:34:29.000000000 +0800
+++ linux/fs/fs-writeback.c	2009-10-01 22:31:54.000000000 +0800
@@ -25,6 +25,7 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/completion.h>
 #include "internal.h"
 
 #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
@@ -85,18 +86,6 @@ static inline void bdi_work_init(struct 
 int sysctl_dirty_debug __read_mostly;
 
 
-/**
- * writeback_in_progress - determine whether there is writeback in progress
- * @bdi: the device's backing_dev_info structure.
- *
- * Determine whether there is writeback waiting to be handled against a
- * backing device.
- */
-int writeback_in_progress(struct backing_dev_info *bdi)
-{
-	return !list_empty(&bdi->work_list);
-}
-
 static void bdi_work_clear(struct bdi_work *work)
 {
 	clear_bit(WS_USED_B, &work->state);
@@ -136,17 +125,19 @@ static void wb_work_complete(struct bdi_
 		call_rcu(&work->rcu_head, bdi_work_free);
 }
 
-static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
+static void wb_clear_pending(struct backing_dev_info *bdi,
+			     struct bdi_work *work)
 {
 	/*
 	 * The caller has retrieved the work arguments from this work,
 	 * drop our reference. If this is the last ref, delete and free it
 	 */
 	if (atomic_dec_and_test(&work->pending)) {
-		struct backing_dev_info *bdi = wb->bdi;
 
 		spin_lock(&bdi->wb_lock);
 		list_del_rcu(&work->list);
+		if (work->args.for_background)
+			clear_bit(WB_FLAG_BACKGROUND_WORK, &bdi->wb_mask);
 		spin_unlock(&bdi->wb_lock);
 
 		wb_work_complete(work);
@@ -275,6 +266,70 @@ void bdi_start_writeback(struct backing_
 	bdi_alloc_queue_work(bdi, &args);
 }
 
+struct dirty_throttle_task {
+	long			nr_pages;
+	struct list_head	list;
+	struct completion	complete;
+};
+
+void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
+{
+	struct dirty_throttle_task tt = {
+		.nr_pages = nr_pages,
+		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
+	};
+
+	/*
+	 * make sure we will be woke up by someone
+	 */
+	if (can_submit_background_writeback(bdi))
+		bdi_start_writeback(bdi, NULL, 0);
+
+	/*
+	 * register throttle pages
+	 */
+	spin_lock(&bdi->throttle_lock);
+	if (list_empty(&bdi->throttle_list))
+		atomic_set(&bdi->throttle_pages, nr_pages);
+	list_add(&tt.list, &bdi->throttle_list);
+	spin_unlock(&bdi->throttle_lock);
+
+	wait_for_completion(&tt.complete);
+}
+
+/*
+ * return 1 if there are more waiting tasks.
+ */
+int bdi_writeback_wakeup(struct backing_dev_info *bdi)
+{
+	struct dirty_throttle_task *tt;
+
+	spin_lock(&bdi->throttle_lock);
+	/*
+	 * remove and wakeup head task
+	 */
+	if (!list_empty(&bdi->throttle_list)) {
+		tt = list_entry(bdi->throttle_list.prev,
+				struct dirty_throttle_task, list);
+		list_del(&tt->list);
+		complete(&tt->complete);
+	}
+	/*
+	 * update throttle pages
+	 */
+	if (!list_empty(&bdi->throttle_list)) {
+		tt = list_entry(bdi->throttle_list.prev,
+				struct dirty_throttle_task, list);
+		atomic_set(&bdi->throttle_pages, tt->nr_pages);
+	} else {
+		tt = NULL;
+		atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+	}
+	spin_unlock(&bdi->throttle_lock);
+
+	return tt != NULL;
+}
+
 /*
  * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
  * furthest end of its superblock's dirty-inode list.
@@ -756,8 +811,11 @@ static long wb_writeback(struct bdi_writ
 		 * For background writeout, stop when we are below the
 		 * background dirty threshold
 		 */
-		if (args->for_background && !over_bground_thresh())
+		if (args->for_background && !over_bground_thresh()) {
+			while (bdi_writeback_wakeup(wb->bdi))
+				;  /* unthrottle all tasks */
 			break;
+		}
 
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
@@ -879,7 +937,7 @@ long wb_do_writeback(struct bdi_writebac
 		 * that we have seen this work and we are now starting it.
 		 */
 		if (args.sync_mode == WB_SYNC_NONE)
-			wb_clear_pending(wb, work);
+			wb_clear_pending(bdi, work);
 
 		wrote += wb_writeback(wb, &args);
 
@@ -888,7 +946,7 @@ long wb_do_writeback(struct bdi_writebac
 		 * notification when we have completed the work.
 		 */
 		if (args.sync_mode == WB_SYNC_ALL)
-			wb_clear_pending(wb, work);
+			wb_clear_pending(bdi, work);
 	}
 
 	/*
--- linux.orig/mm/backing-dev.c	2009-10-01 13:34:29.000000000 +0800
+++ linux/mm/backing-dev.c	2009-10-01 22:17:05.000000000 +0800
@@ -646,6 +646,10 @@ int bdi_init(struct backing_dev_info *bd
 	bdi->wb_mask = 1;
 	bdi->wb_cnt = 1;
 
+	spin_lock_init(&bdi->throttle_lock);
+	INIT_LIST_HEAD(&bdi->throttle_list);
+	atomic_set(&bdi->throttle_pages, DIRTY_THROTTLE_PAGES_STOP * 2);
+
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
 		if (err)

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-09-30 12:10                                 ` Jens Axboe
@ 2009-10-01 15:17                                   ` Wu Fengguang
  0 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-10-01 15:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Wed, Sep 30, 2009 at 08:10:30PM +0800, Jens Axboe wrote:
> On Wed, Sep 30 2009, Jan Kara wrote:
> > > +void bdi_writeback_wait(struct backing_dev_info *bdi, long nr_pages)
> > > +{
> > > +	struct dirty_throttle_task tt = {
> > > +		.nr_pages = nr_pages,
> > > +		.complete = COMPLETION_INITIALIZER_ONSTACK(tt.complete),
> > > +	};
> > > +	struct wb_writeback_args args = {
> > > +		.sync_mode	= WB_SYNC_NONE,
> > > +		.nr_pages	= LONG_MAX,
> > > +		.range_cyclic	= 1,
> > > +		.for_background	= 1,
> > > +	};
> > > +	struct bdi_work work;
> > > +
> > > +	bdi_work_init(&work, &args);
> > > +	work.state |= WS_ONSTACK;
> > > +
> > > +	/*
> > > +	 * make sure we will be waken up by someone
> > > +	 */
> > > +	bdi_queue_work(bdi, &work);
> >   This is wrong, you shouldn't submit the work like this because you'll
> > have to wait for completion (wb_clear_pending below is just bogus). You
> > should rather do bdi_start_writeback(bdi, NULL, 0).
> 
> Indeed, the above will die a horrible death fairly soon. But we can add
> some "barrier" like synchronization, if you just wish to wait for
> previously submitted work to have been completed.

Thanks, I just purged that hack and went for bdi_start_writeback :)

Thanks,
Fengguang


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-10-01 14:54                                     ` Wu Fengguang
@ 2009-10-01 21:35                                       ` Jan Kara
  2009-10-02  2:25                                         ` Wu Fengguang
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-10-01 21:35 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm,
	Theodore Ts'o

On Thu 01-10-09 22:54:43, Wu Fengguang wrote:
> > > >   You probably didn't understand my comment in the previous email. This is
> > > > too late to wakeup all the tasks. There are two limits - background_limit
> > > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > > above background_limit, we start the writeback but we don't throttle tasks
> > > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > > single bdi, this means we start throttling threads only when 10% of memory
> > > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > > as their BDI gets below the dirty limit or when global number of dirty
> > > > pages gets below (background_limit + dirty_limit) / 2.
> > > 
> > > Sure, but the design goal is to wakeup the throttled tasks in the
> > > __bdi_writeout_inc() path instead of here. As long as some (background)
> > > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > > the tasks.  This "unthrottle all on exit of background writeback" is
> > > merely a safeguard, since once background writeback (which could be
> > > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > > calls to __bdi_writeout_inc() is likely to stop.
> >   The thing is: In the old code, tasks returned from balance_dirty_pages()
> > as soon as we got below dirty_limit, regardless of how much they managed to
> > write. So we want to wake them up from waiting as soon as we get below the
> > dirty limit (maybe a bit later so that they don't immediately block again
> > but I hope you get the point).
> 
> Ah good catch!  However overhitting the threshold by 1MB (maybe more with
> concurrent dirtiers) should not be a problem. As you said, that avoids the
> task being immediately blocked again.
> 
> The old code does the dirty_limit check in an opportunistic manner. There were
> no guarantee. 2.6.32 further weakens it with the removal of congestion back off.
  Sure, there are no guarantees, but if we let threads sleep in
balance_dirty_pages longer than necessary it will have a performance impact
(the application will sleep instead of doing useful work). So we had better
make sure applications sleep as little as necessary in balance_dirty_pages.

> @@ -756,8 +811,11 @@ static long wb_writeback(struct bdi_writ
>  		 * For background writeout, stop when we are below the
>  		 * background dirty threshold
>  		 */
> -		if (args->for_background && !over_bground_thresh())
> +		if (args->for_background && !over_bground_thresh()) {
> +			while (bdi_writeback_wakeup(wb->bdi))
> +				;  /* unthrottle all tasks */
>  			break;
> +		}
  Thus the check here should rather be
if (args->for_background && !over_dirty_limit())
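
where over_dirty_limit() would be something along the lines of the sketch
below (the helper does not exist yet; it follows the shape of
over_bground_thresh() and uses the (background_thresh + dirty_thresh) / 2
midpoint mentioned earlier):

static int over_dirty_limit(void)
{
	unsigned long background_thresh, dirty_thresh;

	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);

	return global_page_state(NR_FILE_DIRTY) +
	       global_page_state(NR_UNSTABLE_NFS) >
			(background_thresh + dirty_thresh) / 2;
}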

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-10-01 21:35                                       ` Jan Kara
@ 2009-10-02  2:25                                         ` Wu Fengguang
  2009-10-02  9:54                                           ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Wu Fengguang @ 2009-10-02  2:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Fri, Oct 02, 2009 at 05:35:23AM +0800, Jan Kara wrote:
> On Thu 01-10-09 22:54:43, Wu Fengguang wrote:
> > > > >   You probably didn't understand my comment in the previous email. This is
> > > > > too late to wakeup all the tasks. There are two limits - background_limit
> > > > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > > > above background_limit, we start the writeback but we don't throttle tasks
> > > > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > > > single bdi, this means we start throttling threads only when 10% of memory
> > > > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > > > as their BDI gets below the dirty limit or when global number of dirty
> > > > > pages gets below (background_limit + dirty_limit) / 2.
> > > > 
> > > > Sure, but the design goal is to wakeup the throttled tasks in the
> > > > __bdi_writeout_inc() path instead of here. As long as some (background)
> > > > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > > > the tasks.  This "unthrottle all on exit of background writeback" is
> > > > merely a safeguard, since once background writeback (which could be
> > > > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > > > calls to __bdi_writeout_inc() is likely to stop.
> > >   The thing is: In the old code, tasks returned from balance_dirty_pages()
> > > as soon as we got below dirty_limit, regardless of how much they managed to
> > > write. So we want to wake them up from waiting as soon as we get below the
> > > dirty limit (maybe a bit later so that they don't immediately block again
> > > but I hope you get the point).
> > 
> > Ah good catch!  However overhitting the threshold by 1MB (maybe more with
> > concurrent dirtiers) should not be a problem. As you said, that avoids the
> > task being immediately blocked again.
> > 
> > The old code does the dirty_limit check in an opportunistic manner. There were
> > no guarantee. 2.6.32 further weakens it with the removal of congestion back off.
>   Sure, there are no guarantees but if we let threads sleep in
> balance_dirty_pages longer than necessary it will have a performance impact
> (application will sleep instead of doing useful work). So we should better
> make sure applications sleep as few as necessary in balance_dirty_pages.

To avoid long sleeps, we limit the write_chunk size for balance_dirty_pages.
That's all we need.  The "abort earlier if below dirty_limit" logic is
unnecessary (or even undesirable) for three reasons:
- I just found that pre-31 kernels will normally succeed in writing the
  whole write_chunk because nonblocking=0, so they won't back off on
  congestion. So it's not over_bground_thresh() but over_dirty_limit()
  that would change behavior.
- whether we abort on over_bground_thresh() or over_dirty_limit(),
  there is some constant threshold around which applications are
  throttled. The exact threshold level won't change the throttled
  dirty throughput; that is determined by the write IO throughput the
  block device can handle.
- the over_bground_thresh() check is merely a safeguard which is not
  relevant 99.9% of the time. But when raised to over_dirty_limit(), it
  may become a hot wakeup path comparable to the __bdi_writeout_inc()
  path.  The problem with this wakeup path is that it is a "wake up all";
  it's preferable to wake up processes one by one in __bdi_writeout_inc().

I assume dirty_limit to be (background_thresh + dirty_thresh) / 2.

> > @@ -756,8 +811,11 @@ static long wb_writeback(struct bdi_writ
> >  		 * For background writeout, stop when we are below the
> >  		 * background dirty threshold
> >  		 */
> > -		if (args->for_background && !over_bground_thresh())
> > +		if (args->for_background && !over_bground_thresh()) {
> > +			while (bdi_writeback_wakeup(wb->bdi))
> > +				;  /* unthrottle all tasks */
> >  			break;
> > +		}
>   Thus the check here should rather be
> if (args->for_background && !over_dirty_limit())

Sorry, for the above reasons, I don't think we need to add the dirty_limit
check here.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add an tuning knob for vm.max_writeback_mb
  2009-10-02  2:25                                         ` Wu Fengguang
@ 2009-10-02  9:54                                           ` Jan Kara
  2009-10-02 10:34                                             ` Wu Fengguang
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-10-02  9:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Peter Zijlstra, Chris Mason, Artem Bityutskiy,
	Jens Axboe, linux-kernel, linux-fsdevel, david, hch, akpm,
	Theodore Ts'o

On Fri 02-10-09 10:25:12, Wu Fengguang wrote:
> On Fri, Oct 02, 2009 at 05:35:23AM +0800, Jan Kara wrote:
> > On Thu 01-10-09 22:54:43, Wu Fengguang wrote:
> > > > > >   You probably didn't understand my comment in the previous email. This is
> > > > > > too late to wakeup all the tasks. There are two limits - background_limit
> > > > > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > > > > above background_limit, we start the writeback but we don't throttle tasks
> > > > > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > > > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > > > > single bdi, this means we start throttling threads only when 10% of memory
> > > > > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > > > > as their BDI gets below the dirty limit or when global number of dirty
> > > > > > pages gets below (background_limit + dirty_limit) / 2.
> > > > > 
> > > > > Sure, but the design goal is to wakeup the throttled tasks in the
> > > > > __bdi_writeout_inc() path instead of here. As long as some (background)
> > > > > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > > > > the tasks.  This "unthrottle all on exit of background writeback" is
> > > > > merely a safeguard, since once background writeback (which could be
> > > > > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > > > > calls to __bdi_writeout_inc() is likely to stop.
> > > >   The thing is: In the old code, tasks returned from balance_dirty_pages()
> > > > as soon as we got below dirty_limit, regardless of how much they managed to
> > > > write. So we want to wake them up from waiting as soon as we get below the
> > > > dirty limit (maybe a bit later so that they don't immediately block again
> > > > but I hope you get the point).
> > > 
> > > Ah, good catch!  However, overshooting the threshold by 1MB (maybe more with
> > > concurrent dirtiers) should not be a problem. As you said, that avoids the
> > > task being immediately blocked again.
> > > 
> > > The old code does the dirty_limit check in an opportunistic manner. There
> > > was no guarantee. 2.6.32 further weakens it with the removal of the
> > > congestion backoff.
> >   Sure, there are no guarantees, but if we let threads sleep in
> > balance_dirty_pages longer than necessary it will have a performance impact
> > (the application will sleep instead of doing useful work). So we should
> > make sure applications sleep as little as necessary in balance_dirty_pages.
> 
> To avoid long sleeps, we limit the write_chunk size for balance_dirty_pages.
> That's all we need.  The "abort earlier if below dirty_limit" logic is
> not necessary (and may even be undesirable) for three reasons.
> - I just found that pre-31 kernels will normally succeed in writing the
>   whole write_chunk because nonblocking=0, so they won't back off on
>   congestion. So it's not over_bground_thresh() but over_dirty_limit()
>   that will change behavior.
  OK, good point.

> - whether it be abort on over_bground_thresh() or over_dirty_limit(),
>   there is some constant threshold around which applications are
>   throttled. The exact threshold level won't change the throttled
>   dirty throughput. It is determined by the write IO throughput the
>   block device can handle.
  But the aim is to throttle applications at a higher limit than the one at
which we start pdflush-style writeback, so that if the writeback thread is
fast enough to flush the data, applications don't get throttled at all.
That's the reason for the difference between dirty_thresh and background_thresh.
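(For a concrete sense of the window: taking the 5% / 10% limits mentioned
earlier in this thread and assuming, say, 4 GB of dirtyable memory,
background writeback starts at roughly 200 MB of dirty data while
applications are only throttled once dirty data approaches roughly 400 MB;
as long as the flusher keeps the total inside that window, dirtiers never
block.)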

> - The over_bground_thresh() check is merely a safeguard which is not
>   relevant in 99.9% time. But when increased to over_dirty_limit(), it
>   may become a hot wakeup path comparable to the __bdi_writeout_inc()
>   path.  The problem of this wakeup path is, it is "wakeup all". It's
>   preferable to wake up processes one by one in __bdi_writeout_inc().
  Well, it depends on the number of applications writing data (if there are
100 threads writing data, the last one would get unblocked only after 400 MB
have been written, assuming ratelimit_pages = 1024). So in this case there is
a high chance that quite a few threads will get woken up because we even
reach background_thresh.
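(With 4 KB pages, that is 100 threads * 1024 pages * 4 KB = 400 MB of
completed writeback before the last waiter gets its turn, which is where the
figure above comes from.)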
  What I'm in fact a bit worried about is the latency - in the example
above it can take quite a long time for an application to be woken in
balance_dirty_pages (that's not a new problem I agree). When the threads
are continuously writing lots of data, there's no way around this. But
when it was just a short spike of IO, we'd win if we woke those threads
earlier. But OK, probably we can sort that out later.

								Honza

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 8/8] vm: Add a tuning knob for vm.max_writeback_mb
  2009-10-02  9:54                                           ` Jan Kara
@ 2009-10-02 10:34                                             ` Wu Fengguang
  0 siblings, 0 replies; 76+ messages in thread
From: Wu Fengguang @ 2009-10-02 10:34 UTC (permalink / raw)
  To: Jan Kara
  Cc: Peter Zijlstra, Chris Mason, Artem Bityutskiy, Jens Axboe,
	linux-kernel, linux-fsdevel, david, hch, akpm, Theodore Ts'o

On Fri, Oct 02, 2009 at 05:54:59PM +0800, Jan Kara wrote:
> On Fri 02-10-09 10:25:12, Wu Fengguang wrote:
> > On Fri, Oct 02, 2009 at 05:35:23AM +0800, Jan Kara wrote:
> > > On Thu 01-10-09 22:54:43, Wu Fengguang wrote:
> > > > > > >   You probably didn't understand my comment in the previous email. This is
> > > > > > > too late to wakeup all the tasks. There are two limits - background_limit
> > > > > > > (set to 5%) and dirty_limit (set to 10%). When amount of dirty data is
> > > > > > > above background_limit, we start the writeback but we don't throttle tasks
> > > > > > > yet. We start throttling tasks only when amount of dirty data on the bdi
> > > > > > > exceeds the part of the dirty limit belonging to the bdi. In case of a
> > > > > > > single bdi, this means we start throttling threads only when 10% of memory
> > > > > > > is dirty. To keep this behavior, we have to wakeup waiting threads as soon
> > > > > > > as their BDI gets below the dirty limit or when global number of dirty
> > > > > > > pages gets below (background_limit + dirty_limit) / 2.
> > > > > > 
> > > > > > Sure, but the design goal is to wakeup the throttled tasks in the
> > > > > > __bdi_writeout_inc() path instead of here. As long as some (background)
> > > > > > writeback is running, __bdi_writeout_inc() will be called to wakeup
> > > > > > the tasks.  This "unthrottle all on exit of background writeback" is
> > > > > > merely a safeguard, since once background writeback (which could be
> > > > > > queued by the throttled task itself, in bdi_writeback_wait) exits, the
> > > > > > calls to __bdi_writeout_inc() are likely to stop.
> > > > >   The thing is: In the old code, tasks returned from balance_dirty_pages()
> > > > > as soon as we got below dirty_limit, regardless of how much they managed to
> > > > > write. So we want to wake them up from waiting as soon as we get below the
> > > > > dirty limit (maybe a bit later so that they don't immediately block again
> > > > > but I hope you get the point).
> > > > 
> > > > Ah, good catch!  However, overshooting the threshold by 1MB (maybe more with
> > > > concurrent dirtiers) should not be a problem. As you said, that avoids the
> > > > task being immediately blocked again.
> > > > 
> > > > The old code does the dirty_limit check in an opportunistic manner. There
> > > > was no guarantee. 2.6.32 further weakens it with the removal of the
> > > > congestion backoff.
> > >   Sure, there are no guarantees, but if we let threads sleep in
> > > balance_dirty_pages longer than necessary it will have a performance impact
> > > (the application will sleep instead of doing useful work). So we should
> > > make sure applications sleep as little as necessary in balance_dirty_pages.
> > 
> > To avoid long sleeps, we limit the write_chunk size for balance_dirty_pages.
> > That's all we need.  The "abort earlier if below dirty_limit" logic is
> > not necessary (and may even be undesirable) for three reasons.
> > - I just found that pre-31 kernels will normally succeed in writing the
> >   whole write_chunk because nonblocking=0, so they won't back off on
> >   congestion. So it's not over_bground_thresh() but over_dirty_limit()
> >   that will change behavior.
>   OK, good point.
> 
> > - whether it be abort on over_bground_thresh() or over_dirty_limit(),
> >   there is some constant threshold around which applications are
> >   throttled. The exact threshold level won't change the throttled
> >   dirty throughput. It is determined by the write IO throughput the
> >   block device can handle.
>   But the aim is to throttle applications at a higher limit than the one at
> which we start pdflush-style writeback, so that if the writeback thread is
> fast enough to flush the data, applications don't get throttled at all.
> That's the reason for the difference between dirty_thresh and background_thresh.

When doing over_bground_thresh(), the real threshold won't be far from dirty_limit.
- for a single dirtier, the threshold may be (dirty_limit - 4MB).
- for N dirtiers, it may be (dirty_limit - N*1MB) in the worst case (the
  ratelimit will back off on dirty_exceeded). However it's highly
  unlikely to reach the worst case: with so many dirtiers and so much
  dirtying pressure, even the small fraction of "unthrottled at the
  moment" dirtiers will be able to pump the dirty pages up to the
  dirty limit. Since the dirtiers are unthrottled one by one, it is
  unlikely for them to block at the same time. Statistically, the
  larger N is, the lower the probability that N processes enqueue at
  the same time; it decreases exponentially.
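(Plugging numbers into the above: a single dirtier would block around
dirty_limit - 4MB, and even N = 32 concurrent dirtiers would only hit the
worst case at about dirty_limit - 32MB; and, as argued, the chance of all 32
queueing up at once falls off exponentially with N.)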

> > - The over_bground_thresh() check is merely a safeguard which is not
> >   relevant 99.9% of the time. But when raised to over_dirty_limit(), it
> >   may become a hot wakeup path comparable to the __bdi_writeout_inc()
> >   path.  The problem with this wakeup path is that it is a "wakeup all".
> >   It's preferable to wake up processes one by one in __bdi_writeout_inc().
>   Well, it depends on the number of applications writing data (if there are
> 100 threads writing data, the last one would get unblocked only after 400 MB
> have been written, assuming ratelimit_pages = 1024). So in this case there is
> a high chance that quite a few threads will get woken up because we even
> reach background_thresh.

There is such a chance, but it should be extremely unlikely :)

>   What I'm in fact a bit worried about is the latency - in the example
> above it can take quite a long time for an application to be woken in
> balance_dirty_pages (that's not a new problem I agree). When the threads

No worries, it's fine :) The over_dirty_limit() check could make things
better, but it is not a guarantee. In fact there is no guarantee of latency
at all when there are so many dirtiers competing for the IO channel.

> are continuously writing lots of data, there's no way around this. But
> when it was just a short spike of IO, we'd win if we woke those threads
> earlier. But OK, probably we can sort that out later.

Yes, in this case it would be beneficial. The good thing is that
over_dirty_limit() would be trivial to add if necessary.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-04  8:28   ` Jan Kara
@ 2009-09-04 11:59     ` Jens Axboe
  0 siblings, 0 replies; 76+ messages in thread
From: Jens Axboe @ 2009-09-04 11:59 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, tytso, akpm

On Fri, Sep 04 2009, Jan Kara wrote:
> On Fri 04-09-09 09:46:39, Jens Axboe wrote:
> > This adds two new exported functions:
> > 
> > - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
> >   this super_block, for WB_SYNC_NONE writeout.
> > - sync_inodes_sb(), which writes out all dirty inodes on this super_block
> >   and also waits for the IO to complete.
> > 
> > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
>   The patch looks good. A nice cleanup.
> Acked-by: Jan Kara <jack@suse.cz>

Thanks, ack added.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-04  7:46 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
@ 2009-09-04  8:28   ` Jan Kara
  2009-09-04 11:59     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-09-04  8:28 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, tytso, akpm, jack

On Fri 04-09-09 09:46:39, Jens Axboe wrote:
> This adds two new exported functions:
> 
> - writeback_inodes_sb(), which only attempts to writeback dirty inodes on
>   this super_block, for WB_SYNC_NONE writeout.
> - sync_inodes_sb(), which writes out all dirty inodes on this super_block
>   and also waits for the IO to complete.
> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  The patch looks good. A nice cleanup.
Acked-by: Jan Kara <jack@suse.cz>

> ---
>  drivers/staging/pohmelfs/inode.c |    9 +----
>  fs/fs-writeback.c                |   70 ++++++++++++++++++++++---------------
>  fs/sync.c                        |   18 +++++----
>  fs/ubifs/budget.c                |   16 +-------
>  fs/ubifs/super.c                 |    8 +----
>  include/linux/fs.h               |    2 -
>  include/linux/writeback.h        |    3 +-
>  7 files changed, 58 insertions(+), 68 deletions(-)
> 
> diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
> index 7b60579..e63c9be 100644
> --- a/drivers/staging/pohmelfs/inode.c
> +++ b/drivers/staging/pohmelfs/inode.c
> @@ -1950,14 +1950,7 @@ static int pohmelfs_get_sb(struct file_system_type *fs_type,
>   */
>  static void pohmelfs_kill_super(struct super_block *sb)
>  {
> -	struct writeback_control wbc = {
> -		.sync_mode	= WB_SYNC_ALL,
> -		.range_start	= 0,
> -		.range_end	= LLONG_MAX,
> -		.nr_to_write	= LONG_MAX,
> -	};
> -	generic_sync_sb_inodes(sb, &wbc);
> -
> +	sync_inodes_sb(sb);
>  	kill_anon_super(sb);
>  }
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index c54226b..271e5f4 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -458,8 +458,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   * on the writer throttling path, and we get decent balancing between many
>   * throttled threads: we don't want them all piling up on inode_sync_wait.
>   */
> -void generic_sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc)
> +static void generic_sync_sb_inodes(struct super_block *sb,
> +				   struct writeback_control *wbc)
>  {
>  	const unsigned long start = jiffies;	/* livelock avoidance */
>  	int sync = wbc->sync_mode == WB_SYNC_ALL;
> @@ -593,13 +593,6 @@ void generic_sync_sb_inodes(struct super_block *sb,
>  
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
> -EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
> -
> -static void sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc)
> -{
> -	generic_sync_sb_inodes(sb, wbc);
> -}
>  
>  /*
>   * Start writeback of dirty pagecache data against all unlocked inodes.
> @@ -640,7 +633,7 @@ restart:
>  			 */
>  			if (down_read_trylock(&sb->s_umount)) {
>  				if (sb->s_root)
> -					sync_sb_inodes(sb, wbc);
> +					generic_sync_sb_inodes(sb, wbc);
>  				up_read(&sb->s_umount);
>  			}
>  			spin_lock(&sb_lock);
> @@ -653,35 +646,56 @@ restart:
>  	spin_unlock(&sb_lock);
>  }
>  
> -/*
> - * writeback and wait upon the filesystem's dirty inodes.  The caller will
> - * do this in two passes - one to write, and one to wait.
> - *
> - * A finite limit is set on the number of pages which will be written.
> - * To prevent infinite livelock of sys_sync().
> +/**
> + * writeback_inodes_sb	-	writeback dirty inodes from given super_block
> + * @sb: the superblock
>   *
> - * We add in the number of potentially dirty inodes, because each inode write
> - * can dirty pagecache in the underlying blockdev.
> + * Start writeback on some inodes on this super_block. No guarantees are made
> + * on how many (if any) will be written, and this function does not wait
> + * for IO completion of submitted IO. The number of pages submitted is
> + * returned.
>   */
> -void sync_inodes_sb(struct super_block *sb, int wait)
> +long writeback_inodes_sb(struct super_block *sb)
>  {
>  	struct writeback_control wbc = {
> -		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
> +		.sync_mode	= WB_SYNC_NONE,
>  		.range_start	= 0,
>  		.range_end	= LLONG_MAX,
>  	};
> +	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> +	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> +	long nr_to_write;
>  
> -	if (!wait) {
> -		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> -		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> -
> -		wbc.nr_to_write = nr_dirty + nr_unstable +
> +	nr_to_write = nr_dirty + nr_unstable +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> -	} else
> -		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
>  
> -	sync_sb_inodes(sb, &wbc);
> +	wbc.nr_to_write = nr_to_write;
> +	generic_sync_sb_inodes(sb, &wbc);
> +	return nr_to_write - wbc.nr_to_write;
> +}
> +EXPORT_SYMBOL(writeback_inodes_sb);
> +
> +/**
> + * sync_inodes_sb	-	sync sb inode pages
> + * @sb: the superblock
> + *
> + * This function writes and waits on any dirty inode belonging to this
> + * super_block. The number of pages synced is returned.
> + */
> +long sync_inodes_sb(struct super_block *sb)
> +{
> +	struct writeback_control wbc = {
> +		.sync_mode	= WB_SYNC_ALL,
> +		.range_start	= 0,
> +		.range_end	= LLONG_MAX,
> +	};
> +	long nr_to_write = LONG_MAX; /* doesn't actually matter */
> +
> +	wbc.nr_to_write = nr_to_write;
> +	generic_sync_sb_inodes(sb, &wbc);
> +	return nr_to_write - wbc.nr_to_write;
>  }
> +EXPORT_SYMBOL(sync_inodes_sb);
>  
>  /**
>   * write_inode_now	-	write an inode to disk
> diff --git a/fs/sync.c b/fs/sync.c
> index 3422ba6..66f2104 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -19,20 +19,22 @@
>  			SYNC_FILE_RANGE_WAIT_AFTER)
>  
>  /*
> - * Do the filesystem syncing work. For simple filesystems sync_inodes_sb(sb, 0)
> - * just dirties buffers with inodes so we have to submit IO for these buffers
> - * via __sync_blockdev(). This also speeds up the wait == 1 case since in that
> - * case write_inode() functions do sync_dirty_buffer() and thus effectively
> - * write one block at a time.
> + * Do the filesystem syncing work. For simple filesystems
> + * writeback_inodes_sb(sb) just dirties buffers with inodes so we have to
> + * submit IO for these buffers via __sync_blockdev(). This also speeds up the
> + * wait == 1 case since in that case write_inode() functions do
> + * sync_dirty_buffer() and thus effectively write one block at a time.
>   */
>  static int __sync_filesystem(struct super_block *sb, int wait)
>  {
>  	/* Avoid doing twice syncing and cache pruning for quota sync */
> -	if (!wait)
> +	if (!wait) {
>  		writeout_quota_sb(sb, -1);
> -	else
> +		writeback_inodes_sb(sb);
> +	} else {
>  		sync_quota_sb(sb, -1);
> -	sync_inodes_sb(sb, wait);
> +		sync_inodes_sb(sb);
> +	}
>  	if (sb->s_op->sync_fs)
>  		sb->s_op->sync_fs(sb, wait);
>  	return __sync_blockdev(sb->s_bdev, wait);
> diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
> index eaf6d89..1c8991b 100644
> --- a/fs/ubifs/budget.c
> +++ b/fs/ubifs/budget.c
> @@ -65,26 +65,14 @@
>  static int shrink_liability(struct ubifs_info *c, int nr_to_write)
>  {
>  	int nr_written;
> -	struct writeback_control wbc = {
> -		.sync_mode   = WB_SYNC_NONE,
> -		.range_end   = LLONG_MAX,
> -		.nr_to_write = nr_to_write,
> -	};
> -
> -	generic_sync_sb_inodes(c->vfs_sb, &wbc);
> -	nr_written = nr_to_write - wbc.nr_to_write;
>  
> +	nr_written = writeback_inodes_sb(c->vfs_sb);
>  	if (!nr_written) {
>  		/*
>  		 * Re-try again but wait on pages/inodes which are being
>  		 * written-back concurrently (e.g., by pdflush).
>  		 */
> -		memset(&wbc, 0, sizeof(struct writeback_control));
> -		wbc.sync_mode   = WB_SYNC_ALL;
> -		wbc.range_end   = LLONG_MAX;
> -		wbc.nr_to_write = nr_to_write;
> -		generic_sync_sb_inodes(c->vfs_sb, &wbc);
> -		nr_written = nr_to_write - wbc.nr_to_write;
> +		nr_written = sync_inodes_sb(c->vfs_sb);
>  	}
>  
>  	dbg_budg("%d pages were written back", nr_written);
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 26d2e0d..8d6050a 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -438,12 +438,6 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>  {
>  	int i, err;
>  	struct ubifs_info *c = sb->s_fs_info;
> -	struct writeback_control wbc = {
> -		.sync_mode   = WB_SYNC_ALL,
> -		.range_start = 0,
> -		.range_end   = LLONG_MAX,
> -		.nr_to_write = LONG_MAX,
> -	};
>  
>  	/*
>  	 * Zero @wait is just an advisory thing to help the file system shove
> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>  	 * the user be able to get more accurate results of 'statfs()' after
>  	 * they synchronize the file system.
>  	 */
> -	generic_sync_sb_inodes(sb, &wbc);
> +	sync_inodes_sb(sb);
>  
>  	/*
>  	 * Synchronize write buffers, because 'ubifs_run_commit()' does not
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 73e9b64..07b0f66 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2070,8 +2070,6 @@ static inline void invalidate_remote_inode(struct inode *inode)
>  extern int invalidate_inode_pages2(struct address_space *mapping);
>  extern int invalidate_inode_pages2_range(struct address_space *mapping,
>  					 pgoff_t start, pgoff_t end);
> -extern void generic_sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc);
>  extern int write_inode_now(struct inode *, int);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 3224820..0703929 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -78,7 +78,8 @@ struct writeback_control {
>   */	
>  void writeback_inodes(struct writeback_control *wbc);
>  int inode_wait(void *);
> -void sync_inodes_sb(struct super_block *, int wait);
> +long writeback_inodes_sb(struct super_block *);
> +long sync_inodes_sb(struct super_block *);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-04  7:46 [PATCH 0/8] Per-bdi writeback flusher threads v18 Jens Axboe
@ 2009-09-04  7:46 ` Jens Axboe
  2009-09-04  8:28   ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-04  7:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, david, hch, tytso, akpm, jack, Jens Axboe

This adds two new exported functions:

- writeback_inodes_sb(), which only attempts to writeback dirty inodes on
  this super_block, for WB_SYNC_NONE writeout.
- sync_inodes_sb(), which writes out all dirty inodes on this super_block
  and also waits for the IO to complete.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 drivers/staging/pohmelfs/inode.c |    9 +----
 fs/fs-writeback.c                |   70 ++++++++++++++++++++++---------------
 fs/sync.c                        |   18 +++++----
 fs/ubifs/budget.c                |   16 +-------
 fs/ubifs/super.c                 |    8 +----
 include/linux/fs.h               |    2 -
 include/linux/writeback.h        |    3 +-
 7 files changed, 58 insertions(+), 68 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7b60579..e63c9be 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1950,14 +1950,7 @@ static int pohmelfs_get_sb(struct file_system_type *fs_type,
  */
 static void pohmelfs_kill_super(struct super_block *sb)
 {
-	struct writeback_control wbc = {
-		.sync_mode	= WB_SYNC_ALL,
-		.range_start	= 0,
-		.range_end	= LLONG_MAX,
-		.nr_to_write	= LONG_MAX,
-	};
-	generic_sync_sb_inodes(sb, &wbc);
-
+	sync_inodes_sb(sb);
 	kill_anon_super(sb);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c54226b..271e5f4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -458,8 +458,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
  * on the writer throttling path, and we get decent balancing between many
  * throttled threads: we don't want them all piling up on inode_sync_wait.
  */
-void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
+static void generic_sync_sb_inodes(struct super_block *sb,
+				   struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
 	int sync = wbc->sync_mode == WB_SYNC_ALL;
@@ -593,13 +593,6 @@ void generic_sync_sb_inodes(struct super_block *sb,
 
 	return;		/* Leave any unwritten inodes on s_io */
 }
-EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
-
-static void sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
-{
-	generic_sync_sb_inodes(sb, wbc);
-}
 
 /*
  * Start writeback of dirty pagecache data against all unlocked inodes.
@@ -640,7 +633,7 @@ restart:
 			 */
 			if (down_read_trylock(&sb->s_umount)) {
 				if (sb->s_root)
-					sync_sb_inodes(sb, wbc);
+					generic_sync_sb_inodes(sb, wbc);
 				up_read(&sb->s_umount);
 			}
 			spin_lock(&sb_lock);
@@ -653,35 +646,56 @@ restart:
 	spin_unlock(&sb_lock);
 }
 
-/*
- * writeback and wait upon the filesystem's dirty inodes.  The caller will
- * do this in two passes - one to write, and one to wait.
- *
- * A finite limit is set on the number of pages which will be written.
- * To prevent infinite livelock of sys_sync().
+/**
+ * writeback_inodes_sb	-	writeback dirty inodes from given super_block
+ * @sb: the superblock
  *
- * We add in the number of potentially dirty inodes, because each inode write
- * can dirty pagecache in the underlying blockdev.
+ * Start writeback on some inodes on this super_block. No guarantees are made
+ * on how many (if any) will be written, and this function does not wait
+ * for IO completion of submitted IO. The number of pages submitted is
+ * returned.
  */
-void sync_inodes_sb(struct super_block *sb, int wait)
+long writeback_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
-		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
+		.sync_mode	= WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
+	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
+	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+	long nr_to_write;
 
-	if (!wait) {
-		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
-
-		wbc.nr_to_write = nr_dirty + nr_unstable +
+	nr_to_write = nr_dirty + nr_unstable +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	} else
-		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
 
-	sync_sb_inodes(sb, &wbc);
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
+}
+EXPORT_SYMBOL(writeback_inodes_sb);
+
+/**
+ * sync_inodes_sb	-	sync sb inode pages
+ * @sb: the superblock
+ *
+ * This function writes and waits on any dirty inode belonging to this
+ * super_block. The number of pages synced is returned.
+ */
+long sync_inodes_sb(struct super_block *sb)
+{
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_ALL,
+		.range_start	= 0,
+		.range_end	= LLONG_MAX,
+	};
+	long nr_to_write = LONG_MAX; /* doesn't actually matter */
+
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
 }
+EXPORT_SYMBOL(sync_inodes_sb);
 
 /**
  * write_inode_now	-	write an inode to disk
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..66f2104 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -19,20 +19,22 @@
 			SYNC_FILE_RANGE_WAIT_AFTER)
 
 /*
- * Do the filesystem syncing work. For simple filesystems sync_inodes_sb(sb, 0)
- * just dirties buffers with inodes so we have to submit IO for these buffers
- * via __sync_blockdev(). This also speeds up the wait == 1 case since in that
- * case write_inode() functions do sync_dirty_buffer() and thus effectively
- * write one block at a time.
+ * Do the filesystem syncing work. For simple filesystems
+ * writeback_inodes_sb(sb) just dirties buffers with inodes so we have to
+ * submit IO for these buffers via __sync_blockdev(). This also speeds up the
+ * wait == 1 case since in that case write_inode() functions do
+ * sync_dirty_buffer() and thus effectively write one block at a time.
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
 	/* Avoid doing twice syncing and cache pruning for quota sync */
-	if (!wait)
+	if (!wait) {
 		writeout_quota_sb(sb, -1);
-	else
+		writeback_inodes_sb(sb);
+	} else {
 		sync_quota_sb(sb, -1);
-	sync_inodes_sb(sb, wait);
+		sync_inodes_sb(sb);
+	}
 	if (sb->s_op->sync_fs)
 		sb->s_op->sync_fs(sb, wait);
 	return __sync_blockdev(sb->s_bdev, wait);
diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
index eaf6d89..1c8991b 100644
--- a/fs/ubifs/budget.c
+++ b/fs/ubifs/budget.c
@@ -65,26 +65,14 @@
 static int shrink_liability(struct ubifs_info *c, int nr_to_write)
 {
 	int nr_written;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_NONE,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = nr_to_write,
-	};
-
-	generic_sync_sb_inodes(c->vfs_sb, &wbc);
-	nr_written = nr_to_write - wbc.nr_to_write;
 
+	nr_written = writeback_inodes_sb(c->vfs_sb);
 	if (!nr_written) {
 		/*
 		 * Re-try again but wait on pages/inodes which are being
 		 * written-back concurrently (e.g., by pdflush).
 		 */
-		memset(&wbc, 0, sizeof(struct writeback_control));
-		wbc.sync_mode   = WB_SYNC_ALL;
-		wbc.range_end   = LLONG_MAX;
-		wbc.nr_to_write = nr_to_write;
-		generic_sync_sb_inodes(c->vfs_sb, &wbc);
-		nr_written = nr_to_write - wbc.nr_to_write;
+		nr_written = sync_inodes_sb(c->vfs_sb);
 	}
 
 	dbg_budg("%d pages were written back", nr_written);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 26d2e0d..8d6050a 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -438,12 +438,6 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 {
 	int i, err;
 	struct ubifs_info *c = sb->s_fs_info;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_ALL,
-		.range_start = 0,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = LONG_MAX,
-	};
 
 	/*
 	 * Zero @wait is just an advisory thing to help the file system shove
@@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 	 * the user be able to get more accurate results of 'statfs()' after
 	 * they synchronize the file system.
 	 */
-	generic_sync_sb_inodes(sb, &wbc);
+	sync_inodes_sb(sb);
 
 	/*
 	 * Synchronize write buffers, because 'ubifs_run_commit()' does not
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..07b0f66 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2070,8 +2070,6 @@ static inline void invalidate_remote_inode(struct inode *inode)
 extern int invalidate_inode_pages2(struct address_space *mapping);
 extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
-extern void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..0703929 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,7 +78,8 @@ struct writeback_control {
  */	
 void writeback_inodes(struct writeback_control *wbc);
 int inode_wait(void *);
-void sync_inodes_sb(struct super_block *, int wait);
+long writeback_inodes_sb(struct super_block *);
+long sync_inodes_sb(struct super_block *);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
-- 
1.6.4.1.207.g68ea
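
For a quick sense of how the two new exports are meant to be used, here is
an illustrative caller modelled on the fs/sync.c hunk above; the function
name example_flush_sb() is made up and not part of the patch.

#include <linux/fs.h>
#include <linux/writeback.h>

/* illustrative only: mirrors the !wait / wait split in __sync_filesystem() */
static long example_flush_sb(struct super_block *sb, int wait)
{
        if (!wait)
                /* best effort: may skip inodes/pages, does not wait for IO */
                return writeback_inodes_sb(sb);

        /* data integrity: write out and wait on every dirty inode of this sb */
        return sync_inodes_sb(sb);
}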


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-02 10:26     ` Jens Axboe
@ 2009-09-02 14:01       ` Jan Kara
  0 siblings, 0 replies; 76+ messages in thread
From: Jan Kara @ 2009-09-02 14:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, tytso, akpm

On Wed 02-09-09 12:26:26, Jens Axboe wrote:
> On Wed, Sep 02 2009, Jan Kara wrote:
> > On Wed 02-09-09 10:42:40, Jens Axboe wrote:
> > > This adds two new exported functions:
> > > 
> > > - sync_inodes_sb(), which writes out dirty inodes on a super_block, and
> > > - sync_inodes_sb_wait(), which does the same but also waits for IO
> > >   completion.
> >   This is a nice cleanup. I only find the name sync_inodes_sb() slightly
> > misleading and the comment by that function as well. The name should rather
> > be something like writeback_inodes_sb() (and sync_inodes_sb_wait() could
> > stay just sync_inodes_sb()) - the writeback it does does not really
> > guarantee anything. For example it can skip inodes or pages it does not
> > like for some reason. What that function really does is - try to write some
> > dirty pages on that superblock and don't try too hard.
>   I don't insist on the renaming of the function but I really think the
> > comment should be improved.
> 
> I don't disagree, I was a bit torn on the naming as well. I will make
> that change, thanks for the feedback!
  OK, thanks.

> I'd really like your feedback on the pin_sb_for_writeback() stuff too,
> since that is the contentious bit. And, this goes for others as well,
> I'd appreciate any reviewed-by and/or acked-by on patches.
  I'll have a look at it probably tomorrow.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-02 10:13   ` Jan Kara
@ 2009-09-02 10:26     ` Jens Axboe
  2009-09-02 14:01       ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-02 10:26 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, tytso, akpm

On Wed, Sep 02 2009, Jan Kara wrote:
> On Wed 02-09-09 10:42:40, Jens Axboe wrote:
> > This adds two new exported functions:
> > 
> > - sync_inodes_sb(), which writes out dirty inodes on a super_block, and
> > - sync_inodes_sb_wait(), which does the same but also waits for IO
> >   completion.
>   This is a nice cleanup. I only find the name sync_inodes_sb() slightly
> misleading and the comment by that function as well. The name should rather
> be something like writeback_inodes_sb() (and sync_inodes_sb_wait() could
> stay just sync_inodes_sb()) - the writeback it does does not really
> guarantee anything. For example it can skip inodes or pages it does not
> like for some reason. What that function really does is - try to write some
> dirty pages on that superblock and don't try too hard.
>   I don't insist on the renaming of the function but I really thing the
> comment should be improved.

I don't disagree, I was a bit torn on the naming as well. I will make
that change, thanks for the feedback!

I'd really like your feedback on the pin_sb_for_writeback() stuff too,
since that is the contentious bit. And, this goes for others as well,
I'd appreciate any reviewed-by and/or acked-by on patches.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-02  8:42 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
@ 2009-09-02 10:13   ` Jan Kara
  2009-09-02 10:26     ` Jens Axboe
  0 siblings, 1 reply; 76+ messages in thread
From: Jan Kara @ 2009-09-02 10:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, tytso, akpm, jack

On Wed 02-09-09 10:42:40, Jens Axboe wrote:
> This adds two new exported functions:
> 
> - sync_inodes_sb(), which writes out dirty inodes on a super_block, and
> - sync_inodes_sb_wait(), which does the same but also waits for IO
>   completion.
  This is a nice cleanup. I only find the name sync_inodes_sb() slightly
misleading and the comment by that function as well. The name should rather
be something like writeback_inodes_sb() (and sync_inodes_sb_wait() could
stay just sync_inodes_sb()) - the writeback it does does not really
guarantee anything. For example it can skip inodes or pages it does not
like for some reason. What that function really does is - try to write some
dirty pages on that superblock and don't try too hard.
  I don't insist on the renaming of the function but I really think the
comment should be improved.

								Honza

> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  drivers/staging/pohmelfs/inode.c |    9 +----
>  fs/fs-writeback.c                |   68 ++++++++++++++++++++++---------------
>  fs/sync.c                        |    8 +++--
>  fs/ubifs/budget.c                |   16 +--------
>  fs/ubifs/super.c                 |    8 +----
>  include/linux/fs.h               |    2 -
>  include/linux/writeback.h        |    3 +-
>  7 files changed, 51 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
> index 7b60579..bb6db36 100644
> --- a/drivers/staging/pohmelfs/inode.c
> +++ b/drivers/staging/pohmelfs/inode.c
> @@ -1950,14 +1950,7 @@ static int pohmelfs_get_sb(struct file_system_type *fs_type,
>   */
>  static void pohmelfs_kill_super(struct super_block *sb)
>  {
> -	struct writeback_control wbc = {
> -		.sync_mode	= WB_SYNC_ALL,
> -		.range_start	= 0,
> -		.range_end	= LLONG_MAX,
> -		.nr_to_write	= LONG_MAX,
> -	};
> -	generic_sync_sb_inodes(sb, &wbc);
> -
> +	sync_inodes_sb_wait(sb);
>  	kill_anon_super(sb);
>  }
>  
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index c54226b..382b15c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -458,8 +458,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
>   * on the writer throttling path, and we get decent balancing between many
>   * throttled threads: we don't want them all piling up on inode_sync_wait.
>   */
> -void generic_sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc)
> +static void generic_sync_sb_inodes(struct super_block *sb,
> +				   struct writeback_control *wbc)
>  {
>  	const unsigned long start = jiffies;	/* livelock avoidance */
>  	int sync = wbc->sync_mode == WB_SYNC_ALL;
> @@ -593,13 +593,6 @@ void generic_sync_sb_inodes(struct super_block *sb,
>  
>  	return;		/* Leave any unwritten inodes on s_io */
>  }
> -EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
> -
> -static void sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc)
> -{
> -	generic_sync_sb_inodes(sb, wbc);
> -}
>  
>  /*
>   * Start writeback of dirty pagecache data against all unlocked inodes.
> @@ -640,7 +633,7 @@ restart:
>  			 */
>  			if (down_read_trylock(&sb->s_umount)) {
>  				if (sb->s_root)
> -					sync_sb_inodes(sb, wbc);
> +					generic_sync_sb_inodes(sb, wbc);
>  				up_read(&sb->s_umount);
>  			}
>  			spin_lock(&sb_lock);
> @@ -653,35 +646,54 @@ restart:
>  	spin_unlock(&sb_lock);
>  }
>  
> -/*
> - * writeback and wait upon the filesystem's dirty inodes.  The caller will
> - * do this in two passes - one to write, and one to wait.
> - *
> - * A finite limit is set on the number of pages which will be written.
> - * To prevent infinite livelock of sys_sync().
> +/**
> + * sync_inodes_sb	-	sync sb inode pages
> + * @sb: the superblock
>   *
> - * We add in the number of potentially dirty inodes, because each inode write
> - * can dirty pagecache in the underlying blockdev.
> + * This function writes dirty inodes belonging to this super_block. It does
> + * not wait for completion of IO.
>   */
> -void sync_inodes_sb(struct super_block *sb, int wait)
> +long sync_inodes_sb(struct super_block *sb)
>  {
>  	struct writeback_control wbc = {
> -		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
> +		.sync_mode	= WB_SYNC_NONE,
>  		.range_start	= 0,
>  		.range_end	= LLONG_MAX,
>  	};
> +	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> +	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> +	long nr_to_write;
>  
> -	if (!wait) {
> -		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
> -		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
> -
> -		wbc.nr_to_write = nr_dirty + nr_unstable +
> +	nr_to_write = nr_dirty + nr_unstable +
>  			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
> -	} else
> -		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
>  
> -	sync_sb_inodes(sb, &wbc);
> +	wbc.nr_to_write = nr_to_write;
> +	generic_sync_sb_inodes(sb, &wbc);
> +	return nr_to_write - wbc.nr_to_write;
> +}
> +EXPORT_SYMBOL(sync_inodes_sb);
> +
> +/**
> + * sync_inodes_sb_wait	-	sync sb inode pages
> + * @sb: the superblock
> + *
> + * This function writes and waits on any dirty inode belonging to this
> + * super_block.
> + */
> +long sync_inodes_sb_wait(struct super_block *sb)
> +{
> +	struct writeback_control wbc = {
> +		.sync_mode	= WB_SYNC_ALL,
> +		.range_start	= 0,
> +		.range_end	= LLONG_MAX,
> +	};
> +	long nr_to_write = LLONG_MAX; /* doesn't actually matter */
> +
> +	wbc.nr_to_write = nr_to_write;
> +	generic_sync_sb_inodes(sb, &wbc);
> +	return nr_to_write - wbc.nr_to_write;
>  }
> +EXPORT_SYMBOL(sync_inodes_sb_wait);
>  
>  /**
>   * write_inode_now	-	write an inode to disk
> diff --git a/fs/sync.c b/fs/sync.c
> index 3422ba6..431dba2 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -28,11 +28,13 @@
>  static int __sync_filesystem(struct super_block *sb, int wait)
>  {
>  	/* Avoid doing twice syncing and cache pruning for quota sync */
> -	if (!wait)
> +	if (!wait) {
>  		writeout_quota_sb(sb, -1);
> -	else
> +		sync_inodes_sb(sb);
> +	} else {
>  		sync_quota_sb(sb, -1);
> -	sync_inodes_sb(sb, wait);
> +		sync_inodes_sb_wait(sb);
> +	}
>  	if (sb->s_op->sync_fs)
>  		sb->s_op->sync_fs(sb, wait);
>  	return __sync_blockdev(sb->s_bdev, wait);
> diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
> index eaf6d89..341edd1 100644
> --- a/fs/ubifs/budget.c
> +++ b/fs/ubifs/budget.c
> @@ -65,26 +65,14 @@
>  static int shrink_liability(struct ubifs_info *c, int nr_to_write)
>  {
>  	int nr_written;
> -	struct writeback_control wbc = {
> -		.sync_mode   = WB_SYNC_NONE,
> -		.range_end   = LLONG_MAX,
> -		.nr_to_write = nr_to_write,
> -	};
> -
> -	generic_sync_sb_inodes(c->vfs_sb, &wbc);
> -	nr_written = nr_to_write - wbc.nr_to_write;
>  
> +	nr_written = sync_inodes_sb(c->vfs_sb);
>  	if (!nr_written) {
>  		/*
>  		 * Re-try again but wait on pages/inodes which are being
>  		 * written-back concurrently (e.g., by pdflush).
>  		 */
> -		memset(&wbc, 0, sizeof(struct writeback_control));
> -		wbc.sync_mode   = WB_SYNC_ALL;
> -		wbc.range_end   = LLONG_MAX;
> -		wbc.nr_to_write = nr_to_write;
> -		generic_sync_sb_inodes(c->vfs_sb, &wbc);
> -		nr_written = nr_to_write - wbc.nr_to_write;
> +		nr_written = sync_inodes_sb_wait(c->vfs_sb);
>  	}
>  
>  	dbg_budg("%d pages were written back", nr_written);
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 26d2e0d..0caa3f1 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -438,12 +438,6 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>  {
>  	int i, err;
>  	struct ubifs_info *c = sb->s_fs_info;
> -	struct writeback_control wbc = {
> -		.sync_mode   = WB_SYNC_ALL,
> -		.range_start = 0,
> -		.range_end   = LLONG_MAX,
> -		.nr_to_write = LONG_MAX,
> -	};
>  
>  	/*
>  	 * Zero @wait is just an advisory thing to help the file system shove
> @@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
>  	 * the user be able to get more accurate results of 'statfs()' after
>  	 * they synchronize the file system.
>  	 */
> -	generic_sync_sb_inodes(sb, &wbc);
> +	sync_inodes_sb_wait(sb);
>  
>  	/*
>  	 * Synchronize write buffers, because 'ubifs_run_commit()' does not
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 73e9b64..07b0f66 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2070,8 +2070,6 @@ static inline void invalidate_remote_inode(struct inode *inode)
>  extern int invalidate_inode_pages2(struct address_space *mapping);
>  extern int invalidate_inode_pages2_range(struct address_space *mapping,
>  					 pgoff_t start, pgoff_t end);
> -extern void generic_sync_sb_inodes(struct super_block *sb,
> -				struct writeback_control *wbc);
>  extern int write_inode_now(struct inode *, int);
>  extern int filemap_fdatawrite(struct address_space *);
>  extern int filemap_flush(struct address_space *);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index 3224820..f26a60b 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -78,7 +78,8 @@ struct writeback_control {
>   */	
>  void writeback_inodes(struct writeback_control *wbc);
>  int inode_wait(void *);
> -void sync_inodes_sb(struct super_block *, int wait);
> +long sync_inodes_sb(struct super_block *);
> +long sync_inodes_sb_wait(struct super_block *);
>  
>  /* writeback.h requires fs.h; it, too, is not included from here. */
>  static inline void wait_on_inode(struct inode *inode)
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export
  2009-09-02  8:42 [PATCH 0/8] Per-bdi writeback flusher threads v17 Jens Axboe
@ 2009-09-02  8:42 ` Jens Axboe
  2009-09-02 10:13   ` Jan Kara
  0 siblings, 1 reply; 76+ messages in thread
From: Jens Axboe @ 2009-09-02  8:42 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, david, hch, tytso, akpm, jack, Jens Axboe

This adds two new exported functions:

- sync_inodes_sb(), which writes out dirty inodes on a super_block, and
- sync_inodes_sb_wait(), which does the same but also waits for IO
  completion.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 drivers/staging/pohmelfs/inode.c |    9 +----
 fs/fs-writeback.c                |   68 ++++++++++++++++++++++---------------
 fs/sync.c                        |    8 +++--
 fs/ubifs/budget.c                |   16 +--------
 fs/ubifs/super.c                 |    8 +----
 include/linux/fs.h               |    2 -
 include/linux/writeback.h        |    3 +-
 7 files changed, 51 insertions(+), 63 deletions(-)

diff --git a/drivers/staging/pohmelfs/inode.c b/drivers/staging/pohmelfs/inode.c
index 7b60579..bb6db36 100644
--- a/drivers/staging/pohmelfs/inode.c
+++ b/drivers/staging/pohmelfs/inode.c
@@ -1950,14 +1950,7 @@ static int pohmelfs_get_sb(struct file_system_type *fs_type,
  */
 static void pohmelfs_kill_super(struct super_block *sb)
 {
-	struct writeback_control wbc = {
-		.sync_mode	= WB_SYNC_ALL,
-		.range_start	= 0,
-		.range_end	= LLONG_MAX,
-		.nr_to_write	= LONG_MAX,
-	};
-	generic_sync_sb_inodes(sb, &wbc);
-
+	sync_inodes_sb_wait(sb);
 	kill_anon_super(sb);
 }
 
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index c54226b..382b15c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -458,8 +458,8 @@ writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
  * on the writer throttling path, and we get decent balancing between many
  * throttled threads: we don't want them all piling up on inode_sync_wait.
  */
-void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
+static void generic_sync_sb_inodes(struct super_block *sb,
+				   struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
 	int sync = wbc->sync_mode == WB_SYNC_ALL;
@@ -593,13 +593,6 @@ void generic_sync_sb_inodes(struct super_block *sb,
 
 	return;		/* Leave any unwritten inodes on s_io */
 }
-EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
-
-static void sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
-{
-	generic_sync_sb_inodes(sb, wbc);
-}
 
 /*
  * Start writeback of dirty pagecache data against all unlocked inodes.
@@ -640,7 +633,7 @@ restart:
 			 */
 			if (down_read_trylock(&sb->s_umount)) {
 				if (sb->s_root)
-					sync_sb_inodes(sb, wbc);
+					generic_sync_sb_inodes(sb, wbc);
 				up_read(&sb->s_umount);
 			}
 			spin_lock(&sb_lock);
@@ -653,35 +646,54 @@ restart:
 	spin_unlock(&sb_lock);
 }
 
-/*
- * writeback and wait upon the filesystem's dirty inodes.  The caller will
- * do this in two passes - one to write, and one to wait.
- *
- * A finite limit is set on the number of pages which will be written.
- * To prevent infinite livelock of sys_sync().
+/**
+ * sync_inodes_sb	-	sync sb inode pages
+ * @sb: the superblock
  *
- * We add in the number of potentially dirty inodes, because each inode write
- * can dirty pagecache in the underlying blockdev.
+ * This function writes dirty inodes belonging to this super_block. It does
+ * not wait for completion of IO.
  */
-void sync_inodes_sb(struct super_block *sb, int wait)
+long sync_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
-		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
+		.sync_mode	= WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
+	unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
+	unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+	long nr_to_write;
 
-	if (!wait) {
-		unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
-		unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
-
-		wbc.nr_to_write = nr_dirty + nr_unstable +
+	nr_to_write = nr_dirty + nr_unstable +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	} else
-		wbc.nr_to_write = LONG_MAX; /* doesn't actually matter */
 
-	sync_sb_inodes(sb, &wbc);
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
+}
+EXPORT_SYMBOL(sync_inodes_sb);
+
+/**
+ * sync_inodes_sb_wait	-	sync sb inode pages
+ * @sb: the superblock
+ *
+ * This function writes and waits on any dirty inode belonging to this
+ * super_block.
+ */
+long sync_inodes_sb_wait(struct super_block *sb)
+{
+	struct writeback_control wbc = {
+		.sync_mode	= WB_SYNC_ALL,
+		.range_start	= 0,
+		.range_end	= LLONG_MAX,
+	};
+	long nr_to_write = LLONG_MAX; /* doesn't actually matter */
+
+	wbc.nr_to_write = nr_to_write;
+	generic_sync_sb_inodes(sb, &wbc);
+	return nr_to_write - wbc.nr_to_write;
 }
+EXPORT_SYMBOL(sync_inodes_sb_wait);
 
 /**
  * write_inode_now	-	write an inode to disk
diff --git a/fs/sync.c b/fs/sync.c
index 3422ba6..431dba2 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -28,11 +28,13 @@
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
 	/* Avoid doing twice syncing and cache pruning for quota sync */
-	if (!wait)
+	if (!wait) {
 		writeout_quota_sb(sb, -1);
-	else
+		sync_inodes_sb(sb);
+	} else {
 		sync_quota_sb(sb, -1);
-	sync_inodes_sb(sb, wait);
+		sync_inodes_sb_wait(sb);
+	}
 	if (sb->s_op->sync_fs)
 		sb->s_op->sync_fs(sb, wait);
 	return __sync_blockdev(sb->s_bdev, wait);
diff --git a/fs/ubifs/budget.c b/fs/ubifs/budget.c
index eaf6d89..341edd1 100644
--- a/fs/ubifs/budget.c
+++ b/fs/ubifs/budget.c
@@ -65,26 +65,14 @@
 static int shrink_liability(struct ubifs_info *c, int nr_to_write)
 {
 	int nr_written;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_NONE,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = nr_to_write,
-	};
-
-	generic_sync_sb_inodes(c->vfs_sb, &wbc);
-	nr_written = nr_to_write - wbc.nr_to_write;
 
+	nr_written = sync_inodes_sb(c->vfs_sb);
 	if (!nr_written) {
 		/*
 		 * Re-try again but wait on pages/inodes which are being
 		 * written-back concurrently (e.g., by pdflush).
 		 */
-		memset(&wbc, 0, sizeof(struct writeback_control));
-		wbc.sync_mode   = WB_SYNC_ALL;
-		wbc.range_end   = LLONG_MAX;
-		wbc.nr_to_write = nr_to_write;
-		generic_sync_sb_inodes(c->vfs_sb, &wbc);
-		nr_written = nr_to_write - wbc.nr_to_write;
+		nr_written = sync_inodes_sb_wait(c->vfs_sb);
 	}
 
 	dbg_budg("%d pages were written back", nr_written);
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 26d2e0d..0caa3f1 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -438,12 +438,6 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 {
 	int i, err;
 	struct ubifs_info *c = sb->s_fs_info;
-	struct writeback_control wbc = {
-		.sync_mode   = WB_SYNC_ALL,
-		.range_start = 0,
-		.range_end   = LLONG_MAX,
-		.nr_to_write = LONG_MAX,
-	};
 
 	/*
 	 * Zero @wait is just an advisory thing to help the file system shove
@@ -462,7 +456,7 @@ static int ubifs_sync_fs(struct super_block *sb, int wait)
 	 * the user be able to get more accurate results of 'statfs()' after
 	 * they synchronize the file system.
 	 */
-	generic_sync_sb_inodes(sb, &wbc);
+	sync_inodes_sb_wait(sb);
 
 	/*
 	 * Synchronize write buffers, because 'ubifs_run_commit()' does not
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 73e9b64..07b0f66 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2070,8 +2070,6 @@ static inline void invalidate_remote_inode(struct inode *inode)
 extern int invalidate_inode_pages2(struct address_space *mapping);
 extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
-extern void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 3224820..f26a60b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -78,7 +78,8 @@ struct writeback_control {
  */	
 void writeback_inodes(struct writeback_control *wbc);
 int inode_wait(void *);
-void sync_inodes_sb(struct super_block *, int wait);
+long sync_inodes_sb(struct super_block *);
+long sync_inodes_sb_wait(struct super_block *);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 76+ messages in thread

end of thread, other threads:[~2009-10-02 10:35 UTC | newest]

Thread overview: 76+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-08  9:23 [PATCH 0/8] Per-bdi writeback flusher threads v19 Jens Axboe
2009-09-08  9:23 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
2009-09-08 10:27   ` Artem Bityutskiy
2009-09-08 10:27     ` Artem Bityutskiy
2009-09-08 10:41     ` Jens Axboe
2009-09-08 10:52       ` Artem Bityutskiy
2009-09-08 10:57         ` Jens Axboe
2009-09-08 11:01           ` Artem Bityutskiy
2009-09-08 11:01             ` Artem Bityutskiy
2009-09-08 11:05             ` Jens Axboe
2009-09-08 11:31               ` Artem Bityutskiy
2009-09-08 11:31                 ` Artem Bityutskiy
2009-09-08  9:23 ` [PATCH 2/8] writeback: move dirty inodes from super_block to backing_dev_info Jens Axboe
2009-09-08  9:23 ` [PATCH 3/8] writeback: switch to per-bdi threads for flushing data Jens Axboe
2009-09-08 13:46   ` Daniel Walker
2009-09-08 14:21     ` Jens Axboe
2009-09-08  9:23 ` [PATCH 4/8] writeback: get rid of pdflush completely Jens Axboe
2009-09-08  9:23 ` [PATCH 5/8] writeback: add some debug inode list counters to bdi stats Jens Axboe
2009-09-08  9:23 ` [PATCH 6/8] writeback: add name to backing_dev_info Jens Axboe
2009-09-08  9:23 ` [PATCH 7/8] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
2009-09-08  9:23 ` [PATCH 8/8] vm: Add a tuning knob for vm.max_writeback_mb Jens Axboe
2009-09-08 10:37   ` Artem Bityutskiy
2009-09-08 10:37     ` Artem Bityutskiy
2009-09-08 16:06     ` Peter Zijlstra
2009-09-08 16:29       ` Chris Mason
2009-09-08 16:56         ` Peter Zijlstra
2009-09-08 17:28           ` Chris Mason
2009-09-08 17:46             ` Peter Zijlstra
2009-09-08 17:55               ` Peter Zijlstra
2009-09-08 18:32                 ` Peter Zijlstra
2009-09-09 14:23                   ` Jan Kara
2009-09-09 14:37                     ` Wu Fengguang
2009-09-10 15:49                     ` Peter Zijlstra
2009-09-14 11:17                       ` Jan Kara
2009-09-24  8:33                         ` Wu Fengguang
2009-09-24 15:38                           ` Peter Zijlstra
2009-09-25  1:33                             ` Wu Fengguang
2009-09-29 17:35                           ` Jan Kara
2009-09-30  1:24                             ` Wu Fengguang
2009-09-30 11:55                               ` Jan Kara
2009-09-30 12:10                                 ` Jens Axboe
2009-10-01 15:17                                   ` Wu Fengguang
2009-10-01 13:36                                 ` Wu Fengguang
2009-10-01 14:22                                   ` Jan Kara
2009-10-01 14:54                                     ` Wu Fengguang
2009-10-01 21:35                                       ` Jan Kara
2009-10-02  2:25                                         ` Wu Fengguang
2009-10-02  9:54                                           ` Jan Kara
2009-10-02 10:34                                             ` Wu Fengguang
2009-09-08 18:35                 ` Chris Mason
2009-09-08 17:57               ` Chris Mason
2009-09-08 18:28                 ` Peter Zijlstra
2009-09-09  1:53           ` Dave Chinner
2009-09-09  3:52             ` Wu Fengguang
2009-09-08 18:06         ` Theodore Tso
2009-09-08 18:06           ` Theodore Tso
2009-09-08 18:19           ` Christoph Hellwig
2009-09-08 19:34             ` Theodore Tso
2009-09-09  9:29         ` Wu Fengguang
2009-09-09  9:29           ` Wu Fengguang
2009-09-09 12:28           ` Christoph Hellwig
2009-09-09 12:32             ` Wu Fengguang
2009-09-09 12:36               ` Artem Bityutskiy
2009-09-09 12:36                 ` Artem Bityutskiy
2009-09-09 12:37               ` Jens Axboe
2009-09-09 12:43                 ` Christoph Hellwig
2009-09-09 12:44                   ` Jens Axboe
2009-09-09 12:51                     ` Christoph Hellwig
2009-09-09 12:57                 ` Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2009-09-04  7:46 [PATCH 0/8] Per-bdi writeback flusher threads v18 Jens Axboe
2009-09-04  7:46 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
2009-09-04  8:28   ` Jan Kara
2009-09-04 11:59     ` Jens Axboe
2009-09-02  8:42 [PATCH 0/8] Per-bdi writeback flusher threads v17 Jens Axboe
2009-09-02  8:42 ` [PATCH 1/8] writeback: get rid of generic_sync_sb_inodes() export Jens Axboe
2009-09-02 10:13   ` Jan Kara
2009-09-02 10:26     ` Jens Axboe
2009-09-02 14:01       ` Jan Kara
