* [PATCH 0/11] Per-bdi writeback flusher threads v9
From: Jens Axboe @ 2009-05-28 11:46 UTC
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

Hi,

Here's the 9th version of the writeback patches. Changes since v8:

- Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
  issue.
- Get rid of the explicit wait queues; we can just use wake_up_process()
  since it's just for that one task.
- Add a separate "sync_supers" thread that makes sure the dirty
  super blocks get written. We cannot safely do this from bdi_forker_task(),
  as that risks deadlocking on ->s_umount. Artem, I implemented this
  by doing the wake-ups from a timer, so that it will be easier for you
  to just deactivate the timer when there are no super blocks (sketched
  below).
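
Condensed from the mm/backing-dev.c changes in patch 04, the split is
that the timer only rearms itself and wakes a dedicated thread, which
then does the actual sync_supers() call:

  static struct task_struct *sync_supers_tsk;
  static struct timer_list sync_supers_timer;

  /* self-rearming timer: wakes the thread, never touches ->s_umount */
  static void sync_supers_timer_fn(unsigned long unused)
  {
  	wake_up_process(sync_supers_tsk);
  	arm_supers_timer();
  }

  /* dedicated thread that does the actual super block syncing */
  static int bdi_sync_supers(void *unused)
  {
  	while (!kthread_should_stop()) {
  		set_current_state(TASK_INTERRUPTIBLE);
  		schedule();
  		sync_supers();
  	}
  	return 0;
  }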

For ease of patching, I've put the full diff here:

  http://kernel.dk/writeback-v9.patch

and also stored it in a writeback-v9 branch that will not change. You
can pull that into Linus' tree from here:

  git://git.kernel.dk/linux-2.6-block.git writeback-v9

 block/blk-core.c            |    1 +
 drivers/block/aoe/aoeblk.c  |    1 +
 drivers/char/mem.c          |    1 +
 fs/btrfs/disk-io.c          |   24 +-
 fs/buffer.c                 |    2 +-
 fs/char_dev.c               |    1 +
 fs/configfs/inode.c         |    1 +
 fs/fs-writeback.c           |  804 ++++++++++++++++++++++++++++-------
 fs/fuse/inode.c             |    1 +
 fs/hugetlbfs/inode.c        |    1 +
 fs/nfs/client.c             |    1 +
 fs/ntfs/super.c             |   33 +--
 fs/ocfs2/dlm/dlmfs.c        |    1 +
 fs/ramfs/inode.c            |    1 +
 fs/super.c                  |    3 -
 fs/sync.c                   |    2 +-
 fs/sysfs/inode.c            |    1 +
 fs/ubifs/super.c            |    1 +
 include/linux/backing-dev.h |   73 ++++-
 include/linux/fs.h          |   11 +-
 include/linux/writeback.h   |   15 +-
 kernel/cgroup.c             |    1 +
 mm/Makefile                 |    2 +-
 mm/backing-dev.c            |  518 ++++++++++++++++++++++-
 mm/page-writeback.c         |  151 +------
 mm/pdflush.c                |  269 ------------
 mm/swap_state.c             |    1 +
 mm/vmscan.c                 |    2 +-
 28 files changed, 1286 insertions(+), 637 deletions(-)

-- 
Jens Axboe




* [PATCH 01/11] ntfs: remove old debug check for dirty data in ntfs_put_super()
From: Jens Axboe @ 2009-05-28 11:46 UTC
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

This should not trigger anymore, so kill it.

Acked-by: Anton Altaparmakov <aia21@cam.ac.uk>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/ntfs/super.c |   33 +++------------------------------
 1 files changed, 3 insertions(+), 30 deletions(-)

diff --git a/fs/ntfs/super.c b/fs/ntfs/super.c
index f76951d..3fc03bd 100644
--- a/fs/ntfs/super.c
+++ b/fs/ntfs/super.c
@@ -2373,39 +2373,12 @@ static void ntfs_put_super(struct super_block *sb)
 		vol->mftmirr_ino = NULL;
 	}
 	/*
-	 * If any dirty inodes are left, throw away all mft data page cache
-	 * pages to allow a clean umount.  This should never happen any more
-	 * due to mft.c::ntfs_mft_writepage() cleaning all the dirty pages as
-	 * the underlying mft records are written out and cleaned.  If it does,
-	 * happen anyway, we want to know...
+	 * We should have no dirty inodes left, due to
+	 * mft.c::ntfs_mft_writepage() cleaning all the dirty pages as
+	 * the underlying mft records are written out and cleaned.
 	 */
 	ntfs_commit_inode(vol->mft_ino);
 	write_inode_now(vol->mft_ino, 1);
-	if (sb_has_dirty_inodes(sb)) {
-		const char *s1, *s2;
-
-		mutex_lock(&vol->mft_ino->i_mutex);
-		truncate_inode_pages(vol->mft_ino->i_mapping, 0);
-		mutex_unlock(&vol->mft_ino->i_mutex);
-		write_inode_now(vol->mft_ino, 1);
-		if (sb_has_dirty_inodes(sb)) {
-			static const char *_s1 = "inodes";
-			static const char *_s2 = "";
-			s1 = _s1;
-			s2 = _s2;
-		} else {
-			static const char *_s1 = "mft pages";
-			static const char *_s2 = "They have been thrown "
-					"away.  ";
-			s1 = _s1;
-			s2 = _s2;
-		}
-		ntfs_error(sb, "Dirty %s found at umount time.  %sYou should "
-				"run chkdsk.  Please email "
-				"linux-ntfs-dev@lists.sourceforge.net and say "
-				"that you saw this message.  Thank you.", s1,
-				s2);
-	}
 #endif /* NTFS_RW */
 
 	iput(vol->mft_ino);
-- 
1.6.3.rc0.1.gf800



* [PATCH 02/11] btrfs: properly register fs backing device
From: Jens Axboe @ 2009-05-28 11:46 UTC
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

btrfs assigns this bdi to all inodes on that file system, so make
sure it's registered. This isn't really important now, but it will be
when we put dirty inodes there. Even now, we miss the stats when the
bdi isn't visible.

Also fix the failure to check the bdi_init() return value, and the bad
inheritance of the ->capabilities flags from the default bdi.
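
Condensed from the diff below, setup_bdi() now propagates failures and
open_ctree() unwinds through bdi_destroy():

  	if (setup_bdi(fs_info, &fs_info->bdi))
  		goto fail_bdi;
  	...
  fail_bdi:
  	bdi_destroy(&fs_info->bdi);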

Acked-by: Chris Mason <chris.mason@oracle.com>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/btrfs/disk-io.c |   23 ++++++++++++++++++-----
 1 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 4b0ea0b..2dc19c9 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1345,12 +1345,24 @@ static void btrfs_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 	free_extent_map(em);
 }
 
+/*
+ * If this fails, caller must call bdi_destroy() to get rid of the
+ * bdi again.
+ */
 static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 {
-	bdi_init(bdi);
+	int err;
+
+	bdi->capabilities = BDI_CAP_MAP_COPY;
+	err = bdi_init(bdi);
+	if (err)
+		return err;
+
+	err = bdi_register(bdi, NULL, "btrfs");
+	if (err)
+		return err;
+
 	bdi->ra_pages	= default_backing_dev_info.ra_pages;
-	bdi->state		= 0;
-	bdi->capabilities	= default_backing_dev_info.capabilities;
 	bdi->unplug_io_fn	= btrfs_unplug_io_fn;
 	bdi->unplug_io_data	= info;
 	bdi->congested_fn	= btrfs_congested_fn;
@@ -1574,7 +1586,8 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 	fs_info->sb = sb;
 	fs_info->max_extent = (u64)-1;
 	fs_info->max_inline = 8192 * 1024;
-	setup_bdi(fs_info, &fs_info->bdi);
+	if (setup_bdi(fs_info, &fs_info->bdi))
+		goto fail_bdi;
 	fs_info->btree_inode = new_inode(sb);
 	fs_info->btree_inode->i_ino = 1;
 	fs_info->btree_inode->i_nlink = 1;
@@ -1931,8 +1944,8 @@ fail_iput:
 
 	btrfs_close_devices(fs_info->fs_devices);
 	btrfs_mapping_tree_free(&fs_info->mapping_tree);
+fail_bdi:
 	bdi_destroy(&fs_info->bdi);
-
 fail:
 	kfree(extent_root);
 	kfree(tree_root);
-- 
1.6.3.rc0.1.gf800



* [PATCH 03/11] writeback: move dirty inodes from super_block to backing_dev_info
From: Jens Axboe @ 2009-05-28 11:46 UTC
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

This is a first step towards introducing per-bdi flusher threads. There
should be no change in behaviour, although sb_has_dirty_inodes() is now
ridiculously expensive, as answering that question now means scanning
the dirty lists of every bdi. Not a huge problem, since it'll be deleted
in subsequent patches.
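
The core of the change, condensed from the diff below: the dirty lists
hang off the bdi behind the inode's mapping instead of off the
super_block.

  #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)

  	/* in __mark_inode_dirty(): queue on the bdi, not the sb */
  	if (!was_dirty) {
  		inode->dirtied_when = jiffies;
  		list_move(&inode->i_list, &inode_to_bdi(inode)->b_dirty);
  	}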

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |  196 +++++++++++++++++++++++++++---------------
 fs/super.c                  |    3 -
 include/linux/backing-dev.h |    9 ++
 include/linux/fs.h          |    5 +-
 mm/backing-dev.c            |   24 +++++
 mm/page-writeback.c         |   11 +--
 6 files changed, 164 insertions(+), 84 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 91013ff..1137408 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -25,6 +25,7 @@
 #include <linux/buffer_head.h>
 #include "internal.h"
 
+#define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
 
 /**
  * writeback_acquire - attempt to get exclusive writeback access to a device
@@ -158,12 +159,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 			goto out;
 
 		/*
-		 * If the inode was already on s_dirty/s_io/s_more_io, don't
-		 * reposition it (that would break s_dirty time-ordering).
+		 * If the inode was already on b_dirty/b_io/b_more_io, don't
+		 * reposition it (that would break b_dirty time-ordering).
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
+			list_move(&inode->i_list,
+					&inode_to_bdi(inode)->b_dirty);
 		}
 	}
 out:
@@ -184,31 +186,30 @@ static int write_inode(struct inode *inode, int sync)
  * furthest end of its superblock's dirty-inode list.
  *
  * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list.  If that is
+ * already the most-recently-dirtied inode on the b_dirty list.  If that is
  * the case then the inode must have been redirtied while it was being written
  * out and we don't reset its dirtied_when.
  */
 static void redirty_tail(struct inode *inode)
 {
-	struct super_block *sb = inode->i_sb;
+	struct backing_dev_info *bdi = inode_to_bdi(inode);
 
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
+	if (!list_empty(&bdi->b_dirty)) {
+		struct inode *tail;
 
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (time_before(inode->dirtied_when,
-				tail_inode->dirtied_when))
+		tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &sb->s_dirty);
+	list_move(&inode->i_list, &bdi->b_dirty);
 }
 
 /*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
+ * requeue inode for re-scanning after bdi->b_io list is exhausted.
  */
 static void requeue_io(struct inode *inode)
 {
-	list_move(&inode->i_list, &inode->i_sb->s_more_io);
+	list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -255,18 +256,50 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 /*
  * Queue all expired dirty inodes for io, eldest first.
  */
-static void queue_io(struct super_block *sb,
-				unsigned long *older_than_this)
+static void queue_io(struct backing_dev_info *bdi,
+		     unsigned long *older_than_this)
+{
+	list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
+	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+}
+
+static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
 {
-	list_splice_init(&sb->s_more_io, sb->s_io.prev);
-	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
+	struct inode *inode;
+	int ret = 0;
+
+	spin_lock(&inode_lock);
+	list_for_each_entry(inode, list, i_list) {
+		if (inode->i_sb == sb) {
+			ret = 1;
+			break;
+		}
+	}
+	spin_unlock(&inode_lock);
+	return ret;
 }
 
 int sb_has_dirty_inodes(struct super_block *sb)
 {
-	return !list_empty(&sb->s_dirty) ||
-	       !list_empty(&sb->s_io) ||
-	       !list_empty(&sb->s_more_io);
+	struct backing_dev_info *bdi;
+	int ret = 0;
+
+	/*
+	 * This is REALLY expensive right now, but it'll go away
+	 * when the bdi writeback is introduced
+	 */
+	mutex_lock(&bdi_lock);
+	list_for_each_entry(bdi, &bdi_list, bdi_list) {
+		if (sb_on_inode_list(sb, &bdi->b_dirty) ||
+		    sb_on_inode_list(sb, &bdi->b_io) ||
+		    sb_on_inode_list(sb, &bdi->b_more_io)) {
+			ret = 1;
+			break;
+		}
+	}
+	mutex_unlock(&bdi_lock);
+
+	return ret;
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);
 
@@ -322,11 +355,11 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from s_io onto s_more_io/s_dirty.
+			 * the inode; Move it from b_io onto b_more_io/b_dirty.
 			 */
 			/*
 			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of s_dirty so it gets first
+			 * this inode at the head of b_dirty so it gets first
 			 * consideration.  Otherwise, move it to the tail, for
 			 * the reasons described there.  I'm not really sure
 			 * how much sense this makes.  Presumably I had a good
@@ -336,7 +369,7 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)
 			if (wbc->for_kupdate) {
 				/*
 				 * For the kupdate function we move the inode
-				 * to s_more_io so it will get more writeout as
+				 * to b_more_io so it will get more writeout as
 				 * soon as the queue becomes uncongested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
@@ -402,10 +435,10 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
 		/*
 		 * We're skipping this inode because it's locked, and we're not
-		 * doing writeback-for-data-integrity.  Move it to s_more_io so
-		 * that writeback can proceed with the other inodes on s_io.
+		 * doing writeback-for-data-integrity.  Move it to b_more_io so
+		 * that writeback can proceed with the other inodes on b_io.
 		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of s_io.
+		 * completed a full scan of b_io.
 		 */
 		requeue_io(inode);
 		return 0;
@@ -428,51 +461,34 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return __sync_single_inode(inode, wbc);
 }
 
-/*
- * Write out a superblock's list of dirty inodes.  A wait will be performed
- * upon no inodes, all inodes or the final one, depending upon sync_mode.
- *
- * If older_than_this is non-NULL, then only write out inodes which
- * had their first dirtying at a time earlier than *older_than_this.
- *
- * If we're a pdflush thread, then implement pdflush collision avoidance
- * against the entire list.
- *
- * If `bdi' is non-zero then we're being asked to writeback a specific queue.
- * This function assumes that the blockdev superblock's inodes are backed by
- * a variety of queues, so all inodes are searched.  For other superblocks,
- * assume that all inodes are backed by the same queue.
- *
- * FIXME: this linear search could get expensive with many fileystems.  But
- * how to fix?  We need to go from an address_space to all inodes which share
- * a queue with that address_space.  (Easy: have a global "dirty superblocks"
- * list).
- *
- * The inodes to be written are parked on sb->s_io.  They are moved back onto
- * sb->s_dirty as they are selected for writing.  This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
- */
-void generic_sync_sb_inodes(struct super_block *sb,
-				struct writeback_control *wbc)
+static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
+				    struct writeback_control *wbc,
+				    struct super_block *sb,
+				    int is_blkdev_sb)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
-	int sync = wbc->sync_mode == WB_SYNC_ALL;
 
 	spin_lock(&inode_lock);
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
-		queue_io(sb, wbc->older_than_this);
 
-	while (!list_empty(&sb->s_io)) {
-		struct inode *inode = list_entry(sb->s_io.prev,
+	if (!wbc->for_kupdate || list_empty(&bdi->b_io))
+		queue_io(bdi, wbc->older_than_this);
+
+	while (!list_empty(&bdi->b_io)) {
+		struct inode *inode = list_entry(bdi->b_io.prev,
 						struct inode, i_list);
-		struct address_space *mapping = inode->i_mapping;
-		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		long pages_skipped;
 
+		/*
+		 * super block given and doesn't match, skip this inode
+		 */
+		if (sb && sb != inode->i_sb) {
+			redirty_tail(inode);
+			continue;
+		}
+
 		if (!bdi_cap_writeback_dirty(bdi)) {
 			redirty_tail(inode);
-			if (sb_is_blkdev_sb(sb)) {
+			if (is_blkdev_sb) {
 				/*
 				 * Dirty memory-backed blockdev: the ramdisk
 				 * driver does this.  Skip just this inode
@@ -494,14 +510,14 @@ void generic_sync_sb_inodes(struct super_block *sb,
 
 		if (wbc->nonblocking && bdi_write_congested(bdi)) {
 			wbc->encountered_congestion = 1;
-			if (!sb_is_blkdev_sb(sb))
+			if (!is_blkdev_sb)
 				break;		/* Skip a congested fs */
 			requeue_io(inode);
 			continue;		/* Skip a congested blockdev */
 		}
 
 		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!sb_is_blkdev_sb(sb))
+			if (!is_blkdev_sb)
 				break;		/* fs has the wrong queue */
 			requeue_io(inode);
 			continue;		/* blockdev has wrong queue */
@@ -539,13 +555,55 @@ void generic_sync_sb_inodes(struct super_block *sb,
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&sb->s_more_io))
+		if (!list_empty(&bdi->b_more_io))
 			wbc->more_io = 1;
 	}
 
-	if (sync) {
+	spin_unlock(&inode_lock);
+	/* Leave any unwritten inodes on b_io */
+}
+
+/*
+ * Write out a superblock's list of dirty inodes.  A wait will be performed
+ * upon no inodes, all inodes or the final one, depending upon sync_mode.
+ *
+ * If older_than_this is non-NULL, then only write out inodes which
+ * had their first dirtying at a time earlier than *older_than_this.
+ *
+ * If we're a pdflush thread, then implement pdflush collision avoidance
+ * against the entire list.
+ *
+ * If `bdi' is non-zero then we're being asked to writeback a specific queue.
+ * This function assumes that the blockdev superblock's inodes are backed by
+ * a variety of queues, so all inodes are searched.  For other superblocks,
+ * assume that all inodes are backed by the same queue.
+ *
+ * FIXME: this linear search could get expensive with many filesystems.  But
+ * how to fix?  We need to go from an address_space to all inodes which share
+ * a queue with that address_space.  (Easy: have a global "dirty superblocks"
+ * list).
+ *
+ * The inodes to be written are parked on bdi->b_io.  They are moved back onto
+ * bdi->b_dirty as they are selected for writing.  This way, none can be missed
+ * on the writer throttling path, and we get decent balancing between many
+ * throttled threads: we don't want them all piling up on inode_sync_wait.
+ */
+void generic_sync_sb_inodes(struct super_block *sb,
+				struct writeback_control *wbc)
+{
+	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+	struct backing_dev_info *bdi;
+
+	mutex_lock(&bdi_lock);
+	list_for_each_entry(bdi, &bdi_list, bdi_list)
+		generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
+	mutex_unlock(&bdi_lock);
+
+	if (wbc->sync_mode == WB_SYNC_ALL) {
 		struct inode *inode, *old_inode = NULL;
 
+		spin_lock(&inode_lock);
+
 		/*
 		 * Data integrity sync. Must wait for all pages under writeback,
 		 * because there may have been pages dirtied before our sync
@@ -583,10 +641,8 @@ void generic_sync_sb_inodes(struct super_block *sb,
 		}
 		spin_unlock(&inode_lock);
 		iput(old_inode);
-	} else
-		spin_unlock(&inode_lock);
+	}
 
-	return;		/* Leave any unwritten inodes on s_io */
 }
 EXPORT_SYMBOL_GPL(generic_sync_sb_inodes);
 
@@ -601,8 +657,8 @@ static void sync_sb_inodes(struct super_block *sb,
  *
  * Note:
  * We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
+ * ->b_dirty it hasn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
  * empty. Since __sync_single_inode() regains inode_lock before it finally moves
  * inode from superblock lists we are OK.
  *
diff --git a/fs/super.c b/fs/super.c
index 1943fdf..76dd5b2 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -64,9 +64,6 @@ static struct super_block *alloc_super(struct file_system_type *type)
 			s = NULL;
 			goto out;
 		}
-		INIT_LIST_HEAD(&s->s_dirty);
-		INIT_LIST_HEAD(&s->s_io);
-		INIT_LIST_HEAD(&s->s_more_io);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 0ec2c59..8719c87 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -40,6 +40,8 @@ enum bdi_stat_item {
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
 struct backing_dev_info {
+	struct list_head bdi_list;
+
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
@@ -58,6 +60,10 @@ struct backing_dev_info {
 
 	struct device *dev;
 
+	struct list_head	b_dirty;	/* dirty inodes */
+	struct list_head	b_io;		/* parked for writeback */
+	struct list_head	b_more_io;	/* parked for more writeback */
+
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
 	struct dentry *debug_stats;
@@ -72,6 +78,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
 
+extern struct mutex bdi_lock;
+extern struct list_head bdi_list;
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item, s64 amount)
 {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3b534e5..6b475d4 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -712,7 +712,7 @@ static inline int mapping_writably_mapped(struct address_space *mapping)
 
 struct inode {
 	struct hlist_node	i_hash;
-	struct list_head	i_list;
+	struct list_head	i_list;		/* backing dev IO list */
 	struct list_head	i_sb_list;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
@@ -1329,9 +1329,6 @@ struct super_block {
 	struct xattr_handler	**s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
-	struct list_head	s_dirty;	/* dirty inodes */
-	struct list_head	s_io;		/* parked for writeback */
-	struct list_head	s_more_io;	/* parked for more writeback */
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
 	/* s_dentry_lru and s_nr_dentry_unused are protected by dcache_lock */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 493b468..de0bbfe 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -22,6 +22,8 @@ struct backing_dev_info default_backing_dev_info = {
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
 static struct class *bdi_class;
+DEFINE_MUTEX(bdi_lock);
+LIST_HEAD(bdi_list);
 
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
@@ -211,6 +213,10 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		goto exit;
 	}
 
+	mutex_lock(&bdi_lock);
+	list_add_tail(&bdi->bdi_list, &bdi_list);
+	mutex_unlock(&bdi_lock);
+
 	bdi->dev = dev;
 	bdi_debug_register(bdi, dev_name(dev));
 
@@ -225,9 +231,17 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+	mutex_lock(&bdi_lock);
+	list_del(&bdi->bdi_list);
+	mutex_unlock(&bdi_lock);
+}
+
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
+		bdi_remove_from_list(bdi);
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -245,6 +259,10 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
+	INIT_LIST_HEAD(&bdi->bdi_list);
+	INIT_LIST_HEAD(&bdi->b_io);
+	INIT_LIST_HEAD(&bdi->b_dirty);
+	INIT_LIST_HEAD(&bdi->b_more_io);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -259,6 +277,8 @@ int bdi_init(struct backing_dev_info *bdi)
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
+
+		bdi_remove_from_list(bdi);
 	}
 
 	return err;
@@ -269,6 +289,10 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
+	WARN_ON(!list_empty(&bdi->b_dirty));
+	WARN_ON(!list_empty(&bdi->b_io));
+	WARN_ON(!list_empty(&bdi->b_more_io));
+
 	bdi_unregister(bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index bb553c3..7c44314 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -319,15 +319,13 @@ static void task_dirty_limit(struct task_struct *tsk, long *pdirty)
 /*
  *
  */
-static DEFINE_SPINLOCK(bdi_lock);
 static unsigned int bdi_min_ratio;
 
 int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 {
 	int ret = 0;
-	unsigned long flags;
 
-	spin_lock_irqsave(&bdi_lock, flags);
+	mutex_lock(&bdi_lock);
 	if (min_ratio > bdi->max_ratio) {
 		ret = -EINVAL;
 	} else {
@@ -339,27 +337,26 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 			ret = -EINVAL;
 		}
 	}
-	spin_unlock_irqrestore(&bdi_lock, flags);
+	mutex_unlock(&bdi_lock);
 
 	return ret;
 }
 
 int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 {
-	unsigned long flags;
 	int ret = 0;
 
 	if (max_ratio > 100)
 		return -EINVAL;
 
-	spin_lock_irqsave(&bdi_lock, flags);
+	mutex_lock(&bdi_lock);
 	if (bdi->min_ratio > max_ratio) {
 		ret = -EINVAL;
 	} else {
 		bdi->max_ratio = max_ratio;
 		bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
 	}
-	spin_unlock_irqrestore(&bdi_lock, flags);
+	mutex_unlock(&bdi_lock);
 
 	return ret;
 }
-- 
1.6.3.rc0.1.gf800



* [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
From: Jens Axboe @ 2009-05-28 11:46 UTC
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

This gets rid of pdflush for bdi writeout and kupdated-style cleaning.
This is an experiment to see if we get better writeout behaviour with
per-bdi flushing. Some initial tests look pretty encouraging. A sample
ffsb workload that does random writes to files is about 8% faster here
on a simple SATA drive during the benchmark phase. File layout also
looks a LOT smoother in vmstat:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
 0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
 1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
 0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
 0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
 0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
 0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
 0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
 0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
 0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
 1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
 0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
 0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
 1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
 0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
 0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
 1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
 0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
 1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
 1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
 0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54

So apart from seemingly behaving better for buffered writeout, this also
allows us to potentially have more than one bdi thread flushing out data.
This may be useful for NUMA-type setups.

A 10-disk test with btrfs performs 26% faster with per-bdi flushing.
Other tests are pending. mmap-heavy writing also improves considerably.

A separate thread is added to sync the super blocks. In the long term,
adding sync_supers_bdi() functionality could get rid of this thread
again.
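
For callers, kicking writeback now boils down to the call below instead
of a pdflush_operation(); condensed from the balance_dirty_pages() hunk
in this patch. A NULL sb means inodes from any superblock are eligible,
and nr_pages == 0 with WB_SYNC_NONE means "clean until we are below the
background threshold":

  	bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);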

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/buffer.c                 |    2 +-
 fs/fs-writeback.c           |  309 ++++++++++++++++++++++++++-----------------
 fs/sync.c                   |    2 +-
 include/linux/backing-dev.h |   28 ++++
 include/linux/fs.h          |    3 +-
 include/linux/writeback.h   |    2 +-
 mm/backing-dev.c            |  231 +++++++++++++++++++++++++++++++-
 mm/page-writeback.c         |  140 +------------------
 mm/vmscan.c                 |    2 +-
 9 files changed, 452 insertions(+), 267 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index aed2977..14f0802 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -281,7 +281,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_pdflush(1024);
+	wakeup_flusher_threads(1024);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1137408..aa0b560 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -61,10 +63,186 @@ int writeback_in_progress(struct backing_dev_info *bdi)
  */
 static void writeback_release(struct backing_dev_info *bdi)
 {
-	BUG_ON(!writeback_in_progress(bdi));
+	WARN_ON_ONCE(!writeback_in_progress(bdi));
+	bdi->wb_arg.nr_pages = 0;
+	bdi->wb_arg.sb = NULL;
 	clear_bit(BDI_pdflush, &bdi->state);
 }
 
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+			 long nr_pages, enum writeback_sync_modes sync_mode)
+{
+	/*
+	 * This only happens the first time someone kicks this bdi, so put
+	 * it out-of-line.
+	 */
+	if (unlikely(!bdi->task)) {
+		bdi_add_default_flusher_task(bdi);
+		return 1;
+	}
+
+	if (writeback_acquire(bdi)) {
+		bdi->wb_arg.nr_pages = nr_pages;
+		bdi->wb_arg.sb = sb;
+		bdi->wb_arg.sync_mode = sync_mode;
+
+		if (bdi->task)
+			wake_up_process(bdi->task);
+	}
+
+	return 0;
+}
+
+/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024
+
+/*
+ * Periodic writeback of "old" data.
+ *
+ * Define "old": the first time one of an inode's pages is dirtied, we mark the
+ * dirtying-time in the inode's address_space.  So this periodic writeback code
+ * just walks the superblock inode list, writing back any inodes which are
+ * older than a specific point in time.
+ *
+ * Try to run once per dirty_writeback_interval.  But if a writeback event
+ * takes longer than a dirty_writeback_interval interval, then leave a
+ * one-second gap.
+ *
+ * older_than_this takes precedence over nr_to_write.  So we'll only write back
+ * all dirty pages if they are all attached to "old" mappings.
+ */
+static void bdi_kupdated(struct backing_dev_info *bdi)
+{
+	unsigned long oldest_jif;
+	long nr_to_write;
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= &oldest_jif,
+		.nr_to_write		= 0,
+		.for_kupdate		= 1,
+		.range_cyclic		= 1,
+	};
+
+	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
+
+	nr_to_write = global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS) +
+			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+
+	while (nr_to_write > 0) {
+		wbc.more_io = 0;
+		wbc.encountered_congestion = 0;
+		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		generic_sync_bdi_inodes(NULL, &wbc);
+		if (wbc.nr_to_write > 0)
+			break;	/* All the old data is written */
+		nr_to_write -= MAX_WRITEBACK_PAGES;
+	}
+}
+
+static inline bool over_bground_thresh(void)
+{
+	unsigned long background_thresh, dirty_thresh;
+
+	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+
+	return (global_page_state(NR_FILE_DIRTY) +
+		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+}
+
+static void bdi_pdflush(struct backing_dev_info *bdi)
+{
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= bdi->wb_arg.sync_mode,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+	};
+	long nr_pages = bdi->wb_arg.nr_pages;
+
+	for (;;) {
+		if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+		    !over_bground_thresh())
+			break;
+
+		wbc.more_io = 0;
+		wbc.encountered_congestion = 0;
+		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.pages_skipped = 0;
+		generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		/*
+		 * If we ran out of stuff to write, bail unless more_io got set
+		 */
+		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
+			if (wbc.more_io)
+				continue;
+			break;
+		}
+	}
+}
+
+/*
+ * Handle writeback of dirty data for the device backed by this bdi. Also
+ * wakes up periodically and does kupdated style flushing.
+ */
+int bdi_writeback_task(struct backing_dev_info *bdi)
+{
+	while (!kthread_should_stop()) {
+		unsigned long wait_jiffies;
+
+		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(wait_jiffies);
+		try_to_freeze();
+
+		/*
+		 * We get here in two cases:
+		 *
+		 *  schedule_timeout() returned because the dirty writeback
+		 *  interval has elapsed. If that happens, we will be able
+		 *  to acquire the writeback lock and will proceed to do
+		 *  kupdated style writeout.
+		 *
+		 *  Someone called bdi_start_writeback(), which will acquire
+		 *  the writeback lock. This means our writeback_acquire()
+		 *  below will fail and we call into bdi_pdflush() for
+		 *  pdflush style writeout.
+		 *
+		 */
+		if (writeback_acquire(bdi))
+			bdi_kupdated(bdi);
+		else
+			bdi_pdflush(bdi);
+
+		writeback_release(bdi);
+	}
+
+	return 0;
+}
+
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
+{
+	struct backing_dev_info *bdi, *tmp;
+
+	mutex_lock(&bdi_lock);
+
+	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+		if (!bdi_has_dirty_io(bdi))
+			continue;
+		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+	}
+
+	mutex_unlock(&bdi_lock);
+}
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -263,46 +441,6 @@ static void queue_io(struct backing_dev_info *bdi,
 	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
 }
 
-static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
-{
-	struct inode *inode;
-	int ret = 0;
-
-	spin_lock(&inode_lock);
-	list_for_each_entry(inode, list, i_list) {
-		if (inode->i_sb == sb) {
-			ret = 1;
-			break;
-		}
-	}
-	spin_unlock(&inode_lock);
-	return ret;
-}
-
-int sb_has_dirty_inodes(struct super_block *sb)
-{
-	struct backing_dev_info *bdi;
-	int ret = 0;
-
-	/*
-	 * This is REALLY expensive right now, but it'll go away
-	 * when the bdi writeback is introduced
-	 */
-	mutex_lock(&bdi_lock);
-	list_for_each_entry(bdi, &bdi_list, bdi_list) {
-		if (sb_on_inode_list(sb, &bdi->b_dirty) ||
-		    sb_on_inode_list(sb, &bdi->b_io) ||
-		    sb_on_inode_list(sb, &bdi->b_more_io)) {
-			ret = 1;
-			break;
-		}
-	}
-	mutex_unlock(&bdi_lock);
-
-	return ret;
-}
-EXPORT_SYMBOL(sb_has_dirty_inodes);
-
 /*
  * Write a single inode's dirty pages and inode data out to disk.
  * If `wait' is set, wait on the writeout.
@@ -461,11 +599,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return __sync_single_inode(inode, wbc);
 }
 
-static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
-				    struct writeback_control *wbc,
-				    struct super_block *sb,
-				    int is_blkdev_sb)
+void generic_sync_bdi_inodes(struct super_block *sb,
+			     struct writeback_control *wbc)
 {
+	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+	struct backing_dev_info *bdi = wbc->bdi;
 	const unsigned long start = jiffies;	/* livelock avoidance */
 
 	spin_lock(&inode_lock);
@@ -516,13 +654,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!is_blkdev_sb)
-				break;		/* fs has the wrong queue */
-			requeue_io(inode);
-			continue;		/* blockdev has wrong queue */
-		}
-
 		/*
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
@@ -530,16 +661,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 		if (inode_dirtied_after(inode, start))
 			break;
 
-		/* Is another pdflush already flushing this queue? */
-		if (current_is_pdflush() && !writeback_acquire(bdi))
-			break;
-
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);
-		if (current_is_pdflush())
-			writeback_release(bdi);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -578,11 +703,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
  * a variety of queues, so all inodes are searched.  For other superblocks,
  * assume that all inodes are backed by the same queue.
  *
- * FIXME: this linear search could get expensive with many filesystems.  But
- * how to fix?  We need to go from an address_space to all inodes which share
- * a queue with that address_space.  (Easy: have a global "dirty superblocks"
- * list).
- *
  * The inodes to be written are parked on bdi->b_io.  They are moved back onto
  * bdi->b_dirty as they are selected for writing.  This way, none can be missed
  * on the writer throttling path, and we get decent balancing between many
@@ -591,13 +711,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 void generic_sync_sb_inodes(struct super_block *sb,
 				struct writeback_control *wbc)
 {
-	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
-	struct backing_dev_info *bdi;
-
-	mutex_lock(&bdi_lock);
-	list_for_each_entry(bdi, &bdi_list, bdi_list)
-		generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
-	mutex_unlock(&bdi_lock);
+	if (wbc->bdi)
+		generic_sync_bdi_inodes(sb, wbc);
+	else
+		bdi_writeback_all(sb, wbc);
 
 	if (wbc->sync_mode == WB_SYNC_ALL) {
 		struct inode *inode, *old_inode = NULL;
@@ -653,58 +770,6 @@ static void sync_sb_inodes(struct super_block *sb,
 }
 
 /*
- * Start writeback of dirty pagecache data against all unlocked inodes.
- *
- * Note:
- * We don't need to grab a reference to superblock here. If it has non-empty
- * ->b_dirty it hasn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
- * inode from superblock lists we are OK.
- *
- * If `older_than_this' is non-zero then only flush inodes which have a
- * flushtime older than *older_than_this.
- *
- * If `bdi' is non-zero then we will scan the first inode against each
- * superblock until we find the matching ones.  One group will be the dirty
- * inodes against a filesystem.  Then when we hit the dummy blockdev superblock,
- * sync_sb_inodes will seekout the blockdev which matches `bdi'.  Maybe not
- * super-efficient but we're about to do a ton of I/O...
- */
-void
-writeback_inodes(struct writeback_control *wbc)
-{
-	struct super_block *sb;
-
-	might_sleep();
-	spin_lock(&sb_lock);
-restart:
-	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
-		if (sb_has_dirty_inodes(sb)) {
-			/* we're making our own get_super here */
-			sb->s_count++;
-			spin_unlock(&sb_lock);
-			/*
-			 * If we can't get the readlock, there's no sense in
-			 * waiting around, most of the time the FS is going to
-			 * be unmounted by the time it is released.
-			 */
-			if (down_read_trylock(&sb->s_umount)) {
-				if (sb->s_root)
-					sync_sb_inodes(sb, wbc);
-				up_read(&sb->s_umount);
-			}
-			spin_lock(&sb_lock);
-			if (__put_super_and_need_restart(sb))
-				goto restart;
-		}
-		if (wbc->nr_to_write <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-}
-
-/*
  * writeback and wait upon the filesystem's dirty inodes.  The caller will
  * do this in two passes - one to write, and one to wait.
  *
diff --git a/fs/sync.c b/fs/sync.c
index 7abc65f..3887f10 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,7 +23,7 @@
  */
 static void do_sync(unsigned long wait)
 {
-	wakeup_pdflush(0);
+	wakeup_flusher_threads(0);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	vfs_dq_sync(NULL);
 	sync_supers();		/* Write the superblocks */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8719c87..4a312e9 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
 #include <linux/proportions.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/writeback.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,7 @@ struct dentry;
  */
 enum bdi_state {
 	BDI_pdflush,		/* A pdflush thread is working this device */
+	BDI_pending,		/* On its way to being activated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_unused,		/* Available bits start here */
@@ -39,6 +41,12 @@ enum bdi_stat_item {
 
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+struct bdi_writeback_arg {
+	unsigned long nr_pages;
+	struct super_block *sb;
+	enum writeback_sync_modes sync_mode;
+};
+
 struct backing_dev_info {
 	struct list_head bdi_list;
 
@@ -60,6 +68,8 @@ struct backing_dev_info {
 
 	struct device *dev;
 
+	struct task_struct	*task;		/* writeback task */
+	struct bdi_writeback_arg wb_arg;	/* protected by BDI_pdflush */
 	struct list_head	b_dirty;	/* dirty inodes */
 	struct list_head	b_io;		/* parked for writeback */
 	struct list_head	b_more_io;	/* parked for more writeback */
@@ -77,10 +87,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+			 long nr_pages, enum writeback_sync_modes sync_mode);
+int bdi_writeback_task(struct backing_dev_info *bdi);
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
 
 extern struct mutex bdi_lock;
 extern struct list_head bdi_list;
 
+static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+	return !list_empty(&bdi->b_dirty) ||
+	       !list_empty(&bdi->b_io) ||
+	       !list_empty(&bdi->b_more_io);
+}
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item, s64 amount)
 {
@@ -196,6 +218,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_EXEC_MAP	0x00000040
 #define BDI_CAP_NO_ACCT_WB	0x00000080
 #define BDI_CAP_SWAP_BACKED	0x00000100
+#define BDI_CAP_FLUSH_FORKER	0x00000200
 
 #define BDI_CAP_VMFLAGS \
 	(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -265,6 +288,11 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
 	return bdi->capabilities & BDI_CAP_SWAP_BACKED;
 }
 
+static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_FLUSH_FORKER;
+}
+
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
 	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b475d4..ecdc544 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2063,6 +2063,8 @@ extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
 extern void generic_sync_sb_inodes(struct super_block *sb,
 				struct writeback_control *wbc);
+extern void generic_sync_bdi_inodes(struct super_block *sb,
+				struct writeback_control *);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
@@ -2180,7 +2182,6 @@ extern int bdev_read_only(struct block_device *);
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
-extern int sb_has_dirty_inodes(struct super_block *);
 
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9344547..a8e9f78 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,7 +99,7 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+void wakeup_flusher_threads(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index de0bbfe..3dbfc76 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,8 +1,11 @@
 
 #include <linux/wait.h>
 #include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/writeback.h>
@@ -16,7 +19,7 @@ EXPORT_SYMBOL(default_unplug_io_fn);
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
 	.state		= 0,
-	.capabilities	= BDI_CAP_MAP_COPY,
+	.capabilities	= BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
 	.unplug_io_fn	= default_unplug_io_fn,
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
@@ -24,6 +27,14 @@ EXPORT_SYMBOL_GPL(default_backing_dev_info);
 static struct class *bdi_class;
 DEFINE_MUTEX(bdi_lock);
 LIST_HEAD(bdi_list);
+LIST_HEAD(bdi_pending_list);
+
+static struct task_struct *sync_supers_tsk;
+static struct timer_list sync_supers_timer;
+
+static int bdi_sync_supers(void *);
+static void sync_supers_timer_fn(unsigned long);
+static void arm_supers_timer(void);
 
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
@@ -187,6 +198,13 @@ static int __init default_bdi_init(void)
 {
 	int err;
 
+	sync_supers_tsk = kthread_run(bdi_sync_supers, NULL, "sync_supers");
+	BUG_ON(!sync_supers_tsk);
+
+	init_timer(&sync_supers_timer);
+	setup_timer(&sync_supers_timer, sync_supers_timer_fn, 0);
+	arm_supers_timer();
+
 	err = bdi_init(&default_backing_dev_info);
 	if (!err)
 		bdi_register(&default_backing_dev_info, NULL, "default");
@@ -195,6 +213,172 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
+static int bdi_start_fn(void *ptr)
+{
+	struct backing_dev_info *bdi = ptr;
+	struct task_struct *tsk = current;
+
+	/*
+	 * Add us to the active bdi_list
+	 */
+	mutex_lock(&bdi_lock);
+	list_add(&bdi->bdi_list, &bdi_list);
+	mutex_unlock(&bdi_lock);
+
+	tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
+	set_freezable();
+
+	/*
+	 * Our parent may run at a different priority, just set us to normal
+	 */
+	set_user_nice(tsk, 0);
+
+	/*
+	 * Clear pending bit and wakeup anybody waiting to tear us down
+	 */
+	clear_bit(BDI_pending, &bdi->state);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&bdi->state, BDI_pending);
+
+	return bdi_writeback_task(bdi);
+}
+
+static void bdi_flush_io(struct backing_dev_info *bdi)
+{
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+		.nr_to_write		= 1024,
+	};
+
+	generic_sync_bdi_inodes(NULL, &wbc);
+}
+
+/*
+ * kupdated() used to do this. We cannot do it from the bdi_forker_task()
+ * or we risk deadlocking on ->s_umount. The longer term solution would be
+ * to implement sync_supers_bdi() or similar and simply do it from the
+ * bdi writeback tasks individually.
+ */
+static int bdi_sync_supers(void *unused)
+{
+	set_user_nice(current, 0);
+
+	while (!kthread_should_stop()) {
+		set_current_state(TASK_INTERRUPTIBLE);
+		schedule();
+
+		/*
+		 * Do this periodically, like kupdated() did before.
+		 */
+		sync_supers();
+	}
+
+	return 0;
+}
+
+static void arm_supers_timer(void)
+{
+	unsigned long next;
+
+	next = msecs_to_jiffies(dirty_writeback_interval * 10) + jiffies;
+	mod_timer(&sync_supers_timer, round_jiffies_up(next));
+}
+
+static void sync_supers_timer_fn(unsigned long unused)
+{
+	wake_up_process(sync_supers_tsk);
+	arm_supers_timer();
+}
+
+static int bdi_forker_task(void *ptr)
+{
+	struct backing_dev_info *me = ptr;
+
+	for (;;) {
+		struct backing_dev_info *bdi, *tmp;
+
+		/*
+		 * Temporary measure, we want to make sure we don't see
+		 * dirty data on the default backing_dev_info
+		 */
+		if (bdi_has_dirty_io(me))
+			bdi_flush_io(me);
+
+		mutex_lock(&bdi_lock);
+
+		/*
+		 * Check if any existing bdi's have dirty data without
+		 * a thread registered. If so, set that up.
+		 */
+		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+			if (bdi->task || !bdi_has_dirty_io(bdi))
+				continue;
+
+			bdi_add_default_flusher_task(bdi);
+		}
+
+		if (list_empty(&bdi_pending_list)) {
+			unsigned long wait;
+
+			mutex_unlock(&bdi_lock);
+			wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+			set_current_state(TASK_INTERRUPTIBLE);
+			schedule_timeout(wait);
+			try_to_freeze();
+			continue;
+		}
+
+		/*
+		 * This is our real job - check for pending entries in
+		 * bdi_pending_list, and create the tasks that got added
+		 */
+		bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
+				 bdi_list);
+		list_del_init(&bdi->bdi_list);
+		mutex_unlock(&bdi_lock);
+
+		BUG_ON(bdi->task);
+
+		bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+					dev_name(bdi->dev));
+		/*
+		 * If task creation fails, then readd the bdi to
+		 * the pending list and force writeout of the bdi
+		 * from this forker thread. That will free some memory
+		 * and we can try again.
+		 */
+		if (!bdi->task) {
+			/*
+			 * Add this 'bdi' to the back, so we get
+			 * a chance to flush other bdi's to free
+			 * memory.
+			 */
+			mutex_lock(&bdi_lock);
+			list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+			mutex_unlock(&bdi_lock);
+
+			bdi_flush_io(bdi);
+		}
+	}
+
+	return 0;
+}
+
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+	if (test_and_set_bit(BDI_pending, &bdi->state))
+		return;
+
+	mutex_lock(&bdi_lock);
+	list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+	mutex_unlock(&bdi_lock);
+
+	wake_up_process(default_backing_dev_info.task);
+}
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...)
 {
@@ -218,8 +402,25 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	mutex_unlock(&bdi_lock);
 
 	bdi->dev = dev;
-	bdi_debug_register(bdi, dev_name(dev));
 
+	/*
+	 * Just start the forker thread for our default backing_dev_info,
+	 * and add other bdi's to the list. They will get a thread created
+	 * on-demand when they need it.
+	 */
+	if (bdi_cap_flush_forker(bdi)) {
+		bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+						dev_name(dev));
+		if (!bdi->task) {
+			mutex_lock(&bdi_lock);
+			list_del(&bdi->bdi_list);
+			mutex_unlock(&bdi_lock);
+			ret = -ENOMEM;
+			goto exit;
+		}
+	}
+
+	bdi_debug_register(bdi, dev_name(dev));
 exit:
 	return ret;
 }
@@ -231,8 +432,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+static int sched_wait(void *word)
 {
+	schedule();
+	return 0;
+}
+
+static void bdi_wb_shutdown(struct backing_dev_info *bdi)
+{
+	/*
+	 * If setup is pending, wait for that to complete first
+	 */
+	wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+
 	mutex_lock(&bdi_lock);
 	list_del(&bdi->bdi_list);
 	mutex_unlock(&bdi_lock);
@@ -241,7 +453,13 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
-		bdi_remove_from_list(bdi);
+		if (!bdi_cap_flush_forker(bdi)) {
+			bdi_wb_shutdown(bdi);
+			if (bdi->task) {
+				kthread_stop(bdi->task);
+				bdi->task = NULL;
+			}
+		}
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -251,8 +469,7 @@ EXPORT_SYMBOL(bdi_unregister);
 
 int bdi_init(struct backing_dev_info *bdi)
 {
-	int i;
-	int err;
+	int i, err;
 
 	bdi->dev = NULL;
 
@@ -277,8 +494,6 @@ int bdi_init(struct backing_dev_info *bdi)
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
-
-		bdi_remove_from_list(bdi);
 	}
 
 	return err;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c44314..46c62b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
 #include <linux/pagevec.h>
 
 /*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
-/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -117,8 +108,6 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
-
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
  *
@@ -539,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * been flushed to permanent storage.
 		 */
 		if (bdi_nr_reclaimable) {
-			writeback_inodes(&wbc);
+			generic_sync_bdi_inodes(NULL, &wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 				       &bdi_thresh, bdi);
@@ -590,7 +579,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
 					  + global_page_state(NR_UNSTABLE_NFS)
 					  > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+		bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -675,152 +664,41 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 }
 
 /*
- * writeback at least _min_pages, and keep writing until the amount of dirty
- * memory is less than the background threshold, or until we're all clean.
+ * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
+ * the whole world.
  */
-static void background_writeout(unsigned long _min_pages)
+void wakeup_flusher_threads(long nr_pages)
 {
-	long min_pages = _min_pages;
 	struct writeback_control wbc = {
-		.bdi		= NULL,
 		.sync_mode	= WB_SYNC_NONE,
 		.older_than_this = NULL,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
 		.range_cyclic	= 1,
 	};
 
-	for ( ; ; ) {
-		unsigned long background_thresh;
-		unsigned long dirty_thresh;
-
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
-				&& min_pages <= 0)
-			break;
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		wbc.pages_skipped = 0;
-		writeback_inodes(&wbc);
-		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			/* Wrote less than expected */
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(WRITE, HZ/10);
-			else
-				break;
-		}
-	}
-}
-
-/*
- * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
- * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
- * -1 if all pdflush threads were busy.
- */
-int wakeup_pdflush(long nr_pages)
-{
 	if (nr_pages == 0)
 		nr_pages = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
+	wbc.nr_to_write = nr_pages;
+	bdi_writeback_all(NULL, &wbc);
 }
 
-static void wb_timer_fn(unsigned long unused);
 static void laptop_timer_fn(unsigned long unused);
 
-static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
 static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
- * Periodic writeback of "old" data.
- *
- * Define "old": the first time one of an inode's pages is dirtied, we mark the
- * dirtying-time in the inode's address_space.  So this periodic writeback code
- * just walks the superblock inode list, writing back any inodes which are
- * older than a specific point in time.
- *
- * Try to run once per dirty_writeback_interval.  But if a writeback event
- * takes longer than a dirty_writeback_interval interval, then leave a
- * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
- */
-static void wb_kupdate(unsigned long arg)
-{
-	unsigned long oldest_jif;
-	unsigned long start_jif;
-	unsigned long next_jif;
-	long nr_to_write;
-	struct writeback_control wbc = {
-		.bdi		= NULL,
-		.sync_mode	= WB_SYNC_NONE,
-		.older_than_this = &oldest_jif,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
-		.for_kupdate	= 1,
-		.range_cyclic	= 1,
-	};
-
-	sync_supers();
-
-	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
-	start_jif = jiffies;
-	next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
-	nr_to_write = global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	while (nr_to_write > 0) {
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		writeback_inodes(&wbc);
-		if (wbc.nr_to_write > 0) {
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(WRITE, HZ/10);
-			else
-				break;	/* All the old data is written */
-		}
-		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-	}
-	if (time_before(next_jif, jiffies + HZ))
-		next_jif = jiffies + HZ;
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, next_jif);
-}
-
-/*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
 int dirty_writeback_centisecs_handler(ctl_table *table, int write,
 	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
 {
 	proc_dointvec(table, write, file, buffer, length, ppos);
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, jiffies +
-			msecs_to_jiffies(dirty_writeback_interval * 10));
-	else
-		del_timer(&wb_timer);
 	return 0;
 }
 
-static void wb_timer_fn(unsigned long unused)
-{
-	if (pdflush_operation(wb_kupdate, 0) < 0)
-		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
-}
-
-static void laptop_flush(unsigned long unused)
-{
-	sys_sync();
-}
-
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	wakeup_flusher_threads(0);
 }
 
 /*
@@ -903,8 +781,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	mod_timer(&wb_timer,
-		  jiffies + msecs_to_jiffies(dirty_writeback_interval * 10));
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fa3eda..e37fd38 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
 			sc->may_writepage = 1;
 		}
 
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 05/11] writeback: get rid of pdflush completely
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (3 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 06/11] writeback: separate the flushing state/task from the bdi Jens Axboe
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

It is now unused, so kill it off.
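
The nr_pdflush_threads counter stays exported through
/proc/sys/vm/nr_pdflush_threads so existing tooling keeps working.
With pdflush gone, nothing in this patch updates it any more, so
reading the file is expected to simply show zero (illustrative check;
assuming no later patch in the series bumps the counter):

  $ cat /proc/sys/vm/nr_pdflush_threads
  0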

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c         |    5 +
 include/linux/writeback.h |   12 --
 mm/Makefile               |    2 +-
 mm/pdflush.c              |  269 ---------------------------------------------
 4 files changed, 6 insertions(+), 282 deletions(-)
 delete mode 100644 mm/pdflush.c

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index aa0b560..5ae0dd4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -29,6 +29,11 @@
 
 #define inode_to_bdi(inode)	((inode)->i_mapping->backing_dev_info)
 
+/*
+ * We don't actually have pdflush, but this one is exported through /proc...
+ */
+int nr_pdflush_threads;
+
 /**
  * writeback_acquire - attempt to get exclusive writeback access to a device
  * @bdi: the device's backing_dev_info structure
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a8e9f78..baf04a9 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -14,17 +14,6 @@ extern struct list_head inode_in_use;
 extern struct list_head inode_unused;
 
 /*
- * Yes, writeback.h requires sched.h
- * No, sched.h is not included from here.
- */
-static inline int task_is_pdflush(struct task_struct *task)
-{
-	return task->flags & PF_FLUSHER;
-}
-
-#define current_is_pdflush()	task_is_pdflush(current)
-
-/*
  * fs/fs-writeback.c
  */
 enum writeback_sync_modes {
@@ -151,7 +140,6 @@ balance_dirty_pages_ratelimited(struct address_space *mapping)
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
 				void *data);
 
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0);
 int generic_writepages(struct address_space *mapping,
 		       struct writeback_control *wbc);
 int write_cache_pages(struct address_space *mapping,
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..2adb811 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -8,7 +8,7 @@ mmu-$(CONFIG_MMU)	:= fremap.o highmem.o madvise.o memory.o mincore.o \
 			   vmalloc.o
 
 obj-y			:= bootmem.o filemap.o mempool.o oom_kill.o fadvise.o \
-			   maccess.o page_alloc.o page-writeback.o pdflush.o \
+			   maccess.o page_alloc.o page-writeback.o \
 			   readahead.o swap.o truncate.o vmscan.o shmem.o \
 			   prio_tree.o util.o mmzone.o vmstat.o backing-dev.o \
 			   page_isolation.o mm_init.o $(mmu-y)
diff --git a/mm/pdflush.c b/mm/pdflush.c
deleted file mode 100644
index 235ac44..0000000
--- a/mm/pdflush.c
+++ /dev/null
@@ -1,269 +0,0 @@
-/*
- * mm/pdflush.c - worker threads for writing back filesystem data
- *
- * Copyright (C) 2002, Linus Torvalds.
- *
- * 09Apr2002	Andrew Morton
- *		Initial version
- * 29Feb2004	kaos@sgi.com
- *		Move worker thread creation to kthread to avoid chewing
- *		up stack space with nested calls to kernel_thread.
- */
-
-#include <linux/sched.h>
-#include <linux/list.h>
-#include <linux/signal.h>
-#include <linux/spinlock.h>
-#include <linux/gfp.h>
-#include <linux/init.h>
-#include <linux/module.h>
-#include <linux/fs.h>		/* Needed by writeback.h	  */
-#include <linux/writeback.h>	/* Prototypes pdflush_operation() */
-#include <linux/kthread.h>
-#include <linux/cpuset.h>
-#include <linux/freezer.h>
-
-
-/*
- * Minimum and maximum number of pdflush instances
- */
-#define MIN_PDFLUSH_THREADS	2
-#define MAX_PDFLUSH_THREADS	8
-
-static void start_one_pdflush_thread(void);
-
-
-/*
- * The pdflush threads are worker threads for writing back dirty data.
- * Ideally, we'd like one thread per active disk spindle.  But the disk
- * topology is very hard to divine at this level.   Instead, we take
- * care in various places to prevent more than one pdflush thread from
- * performing writeback against a single filesystem.  pdflush threads
- * have the PF_FLUSHER flag set in current->flags to aid in this.
- */
-
-/*
- * All the pdflush threads.  Protected by pdflush_lock
- */
-static LIST_HEAD(pdflush_list);
-static DEFINE_SPINLOCK(pdflush_lock);
-
-/*
- * The count of currently-running pdflush threads.  Protected
- * by pdflush_lock.
- *
- * Readable by sysctl, but not writable.  Published to userspace at
- * /proc/sys/vm/nr_pdflush_threads.
- */
-int nr_pdflush_threads = 0;
-
-/*
- * The time at which the pdflush thread pool last went empty
- */
-static unsigned long last_empty_jifs;
-
-/*
- * The pdflush thread.
- *
- * Thread pool management algorithm:
- * 
- * - The minimum and maximum number of pdflush instances are bound
- *   by MIN_PDFLUSH_THREADS and MAX_PDFLUSH_THREADS.
- * 
- * - If there have been no idle pdflush instances for 1 second, create
- *   a new one.
- * 
- * - If the least-recently-went-to-sleep pdflush thread has been asleep
- *   for more than one second, terminate a thread.
- */
-
-/*
- * A structure for passing work to a pdflush thread.  Also for passing
- * state information between pdflush threads.  Protected by pdflush_lock.
- */
-struct pdflush_work {
-	struct task_struct *who;	/* The thread */
-	void (*fn)(unsigned long);	/* A callback function */
-	unsigned long arg0;		/* An argument to the callback */
-	struct list_head list;		/* On pdflush_list, when idle */
-	unsigned long when_i_went_to_sleep;
-};
-
-static int __pdflush(struct pdflush_work *my_work)
-{
-	current->flags |= PF_FLUSHER | PF_SWAPWRITE;
-	set_freezable();
-	my_work->fn = NULL;
-	my_work->who = current;
-	INIT_LIST_HEAD(&my_work->list);
-
-	spin_lock_irq(&pdflush_lock);
-	for ( ; ; ) {
-		struct pdflush_work *pdf;
-
-		set_current_state(TASK_INTERRUPTIBLE);
-		list_move(&my_work->list, &pdflush_list);
-		my_work->when_i_went_to_sleep = jiffies;
-		spin_unlock_irq(&pdflush_lock);
-		schedule();
-		try_to_freeze();
-		spin_lock_irq(&pdflush_lock);
-		if (!list_empty(&my_work->list)) {
-			/*
-			 * Someone woke us up, but without removing our control
-			 * structure from the global list.  swsusp will do this
-			 * in try_to_freeze()->refrigerator().  Handle it.
-			 */
-			my_work->fn = NULL;
-			continue;
-		}
-		if (my_work->fn == NULL) {
-			printk("pdflush: bogus wakeup\n");
-			continue;
-		}
-		spin_unlock_irq(&pdflush_lock);
-
-		(*my_work->fn)(my_work->arg0);
-
-		spin_lock_irq(&pdflush_lock);
-
-		/*
-		 * Thread creation: For how long have there been zero
-		 * available threads?
-		 *
-		 * To throttle creation, we reset last_empty_jifs.
-		 */
-		if (time_after(jiffies, last_empty_jifs + 1 * HZ)) {
-			if (list_empty(&pdflush_list)) {
-				if (nr_pdflush_threads < MAX_PDFLUSH_THREADS) {
-					last_empty_jifs = jiffies;
-					nr_pdflush_threads++;
-					spin_unlock_irq(&pdflush_lock);
-					start_one_pdflush_thread();
-					spin_lock_irq(&pdflush_lock);
-				}
-			}
-		}
-
-		my_work->fn = NULL;
-
-		/*
-		 * Thread destruction: For how long has the sleepiest
-		 * thread slept?
-		 */
-		if (list_empty(&pdflush_list))
-			continue;
-		if (nr_pdflush_threads <= MIN_PDFLUSH_THREADS)
-			continue;
-		pdf = list_entry(pdflush_list.prev, struct pdflush_work, list);
-		if (time_after(jiffies, pdf->when_i_went_to_sleep + 1 * HZ)) {
-			/* Limit exit rate */
-			pdf->when_i_went_to_sleep = jiffies;
-			break;					/* exeunt */
-		}
-	}
-	nr_pdflush_threads--;
-	spin_unlock_irq(&pdflush_lock);
-	return 0;
-}
-
-/*
- * Of course, my_work wants to be just a local in __pdflush().  It is
- * separated out in this manner to hopefully prevent the compiler from
- * performing unfortunate optimisations against the auto variables.  Because
- * these are visible to other tasks and CPUs.  (No problem has actually
- * been observed.  This is just paranoia).
- */
-static int pdflush(void *dummy)
-{
-	struct pdflush_work my_work;
-	cpumask_var_t cpus_allowed;
-
-	/*
-	 * Since the caller doesn't even check kthread_run() worked, let's not
-	 * freak out too much if this fails.
-	 */
-	if (!alloc_cpumask_var(&cpus_allowed, GFP_KERNEL)) {
-		printk(KERN_WARNING "pdflush failed to allocate cpumask\n");
-		return 0;
-	}
-
-	/*
-	 * pdflush can spend a lot of time doing encryption via dm-crypt.  We
-	 * don't want to do that at keventd's priority.
-	 */
-	set_user_nice(current, 0);
-
-	/*
-	 * Some configs put our parent kthread in a limited cpuset,
-	 * which kthread() overrides, forcing cpus_allowed == cpu_all_mask.
-	 * Our needs are more modest - cut back to our cpusets cpus_allowed.
-	 * This is needed as pdflush's are dynamically created and destroyed.
-	 * The boottime pdflush's are easily placed w/o these 2 lines.
-	 */
-	cpuset_cpus_allowed(current, cpus_allowed);
-	set_cpus_allowed_ptr(current, cpus_allowed);
-	free_cpumask_var(cpus_allowed);
-
-	return __pdflush(&my_work);
-}
-
-/*
- * Attempt to wake up a pdflush thread, and get it to do some work for you.
- * Returns zero if it indeed managed to find a worker thread, and passed your
- * payload to it.
- */
-int pdflush_operation(void (*fn)(unsigned long), unsigned long arg0)
-{
-	unsigned long flags;
-	int ret = 0;
-
-	BUG_ON(fn == NULL);	/* Hard to diagnose if it's deferred */
-
-	spin_lock_irqsave(&pdflush_lock, flags);
-	if (list_empty(&pdflush_list)) {
-		ret = -1;
-	} else {
-		struct pdflush_work *pdf;
-
-		pdf = list_entry(pdflush_list.next, struct pdflush_work, list);
-		list_del_init(&pdf->list);
-		if (list_empty(&pdflush_list))
-			last_empty_jifs = jiffies;
-		pdf->fn = fn;
-		pdf->arg0 = arg0;
-		wake_up_process(pdf->who);
-	}
-	spin_unlock_irqrestore(&pdflush_lock, flags);
-
-	return ret;
-}
-
-static void start_one_pdflush_thread(void)
-{
-	struct task_struct *k;
-
-	k = kthread_run(pdflush, NULL, "pdflush");
-	if (unlikely(IS_ERR(k))) {
-		spin_lock_irq(&pdflush_lock);
-		nr_pdflush_threads--;
-		spin_unlock_irq(&pdflush_lock);
-	}
-}
-
-static int __init pdflush_init(void)
-{
-	int i;
-
-	/*
-	 * Pre-set nr_pdflush_threads...  If we fail to create,
-	 * the count will be decremented.
-	 */
-	nr_pdflush_threads = MIN_PDFLUSH_THREADS;
-
-	for (i = 0; i < MIN_PDFLUSH_THREADS; i++)
-		start_one_pdflush_thread();
-	return 0;
-}
-
-module_init(pdflush_init);
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 06/11] writeback: separate the flushing state/task from the bdi
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (4 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 05/11] writeback: get rid of pdflush completely Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

Add a struct bdi_writeback for tracking and handling dirty IO. This
is in preparation for adding > 1 flusher task per bdi.
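
In condensed form, the new layering looks like this (field subset
copied from the hunks below, not a complete definition):

	struct bdi_writeback {
		struct backing_dev_info *bdi;	/* our parent bdi */
		struct task_struct *task;	/* writeback task */
		struct list_head b_dirty, b_io, b_more_io;
	};

	struct backing_dev_info {
		/* ... */
		struct bdi_writeback wb; /* default writeback info for this bdi */
	};

The dirty inode lists and the flusher task thus move off the bdi and
into the writeback structure, of which the bdi embeds exactly one for
now.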

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |  136 +++++++++++++++++++++++++++----------------
 include/linux/backing-dev.h |   38 +++++++-----
 mm/backing-dev.c            |  126 ++++++++++++++++++++++++++++++++--------
 3 files changed, 208 insertions(+), 92 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5ae0dd4..ed242d5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -46,9 +46,11 @@ int nr_pdflush_threads;
  * unless they implement their own.  Which is somewhat inefficient, as this
  * may prevent concurrent writeback against multiple devices.
  */
-static int writeback_acquire(struct backing_dev_info *bdi)
+static int writeback_acquire(struct bdi_writeback *wb)
 {
-	return !test_and_set_bit(BDI_pdflush, &bdi->state);
+	struct backing_dev_info *bdi = wb->bdi;
+
+	return !test_and_set_bit(wb->nr, &bdi->wb_active);
 }
 
 /**
@@ -59,19 +61,37 @@ static int writeback_acquire(struct backing_dev_info *bdi)
  */
 int writeback_in_progress(struct backing_dev_info *bdi)
 {
-	return test_bit(BDI_pdflush, &bdi->state);
+	return bdi->wb_active != 0;
 }
 
 /**
  * writeback_release - relinquish exclusive writeback access against a device.
  * @bdi: the device's backing_dev_info structure
  */
-static void writeback_release(struct backing_dev_info *bdi)
+static void writeback_release(struct bdi_writeback *wb)
 {
-	WARN_ON_ONCE(!writeback_in_progress(bdi));
-	bdi->wb_arg.nr_pages = 0;
-	bdi->wb_arg.sb = NULL;
-	clear_bit(BDI_pdflush, &bdi->state);
+	struct backing_dev_info *bdi = wb->bdi;
+
+	wb->nr_pages = 0;
+	wb->sb = NULL;
+	clear_bit(wb->nr, &bdi->wb_active);
+}
+
+static void wb_start_writeback(struct bdi_writeback *wb, struct super_block *sb,
+			       long nr_pages,
+			       enum writeback_sync_modes sync_mode)
+{
+	if (!wb_has_dirty_io(wb))
+		return;
+
+	if (writeback_acquire(wb)) {
+		wb->nr_pages = nr_pages;
+		wb->sb = sb;
+		wb->sync_mode = sync_mode;
+
+		if (wb->task)
+			wake_up_process(wb->task);
+	}
 }
 
 int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
@@ -81,20 +101,12 @@ int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
 	 * This only happens the first time someone kicks this bdi, so put
 	 * it out-of-line.
 	 */
-	if (unlikely(!bdi->task)) {
+	if (unlikely(!bdi->wb.task)) {
 		bdi_add_default_flusher_task(bdi);
 		return 1;
 	}
 
-	if (writeback_acquire(bdi)) {
-		bdi->wb_arg.nr_pages = nr_pages;
-		bdi->wb_arg.sb = sb;
-		bdi->wb_arg.sync_mode = sync_mode;
-
-		if (bdi->task)
-			wake_up_process(bdi->task);
-	}
-
+	wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
 	return 0;
 }
 
@@ -122,12 +134,12 @@ int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void bdi_kupdated(struct backing_dev_info *bdi)
+static void wb_kupdated(struct bdi_writeback *wb)
 {
 	unsigned long oldest_jif;
 	long nr_to_write;
 	struct writeback_control wbc = {
-		.bdi			= bdi,
+		.bdi			= wb->bdi,
 		.sync_mode		= WB_SYNC_NONE,
 		.older_than_this	= &oldest_jif,
 		.nr_to_write		= 0,
@@ -162,15 +174,19 @@ static inline bool over_bground_thresh(void)
 		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
 }
 
-static void bdi_pdflush(struct backing_dev_info *bdi)
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+				   struct super_block *sb,
+				   struct writeback_control *wbc);
+
+static void wb_writeback(struct bdi_writeback *wb)
 {
 	struct writeback_control wbc = {
-		.bdi			= bdi,
-		.sync_mode		= bdi->wb_arg.sync_mode,
+		.bdi			= wb->bdi,
+		.sync_mode		= wb->sync_mode,
 		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 	};
-	long nr_pages = bdi->wb_arg.nr_pages;
+	long nr_pages = wb->nr_pages;
 
 	for (;;) {
 		if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
@@ -181,7 +197,7 @@ static void bdi_pdflush(struct backing_dev_info *bdi)
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
-		generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+		generic_sync_wb_inodes(wb, wb->sb, &wbc);
 		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
@@ -198,7 +214,7 @@ static void bdi_pdflush(struct backing_dev_info *bdi)
  * Handle writeback of dirty data for the device backed by this bdi. Also
  * wakes up periodically and does kupdated style flushing.
  */
-int bdi_writeback_task(struct backing_dev_info *bdi)
+int bdi_writeback_task(struct bdi_writeback *wb)
 {
 	while (!kthread_should_stop()) {
 		unsigned long wait_jiffies;
@@ -222,12 +238,12 @@ int bdi_writeback_task(struct backing_dev_info *bdi)
 		 *  pdflush style writeout.
 		 *
 		 */
-		if (writeback_acquire(bdi))
-			bdi_kupdated(bdi);
+		if (writeback_acquire(wb))
+			wb_kupdated(wb);
 		else
-			bdi_pdflush(bdi);
+			wb_writeback(wb);
 
-		writeback_release(bdi);
+		writeback_release(wb);
 	}
 
 	return 0;
@@ -248,6 +264,14 @@ void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
 	mutex_unlock(&bdi_lock);
 }
 
+/*
+ * We have only a single wb per bdi, so just return that.
+ */
+static inline struct bdi_writeback *inode_get_wb(struct inode *inode)
+{
+	return &inode_to_bdi(inode)->wb;
+}
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -346,9 +370,10 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 * reposition it (that would break b_dirty time-ordering).
 		 */
 		if (!was_dirty) {
+			struct bdi_writeback *wb = inode_get_wb(inode);
+
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list,
-					&inode_to_bdi(inode)->b_dirty);
+			list_move(&inode->i_list, &wb->b_dirty);
 		}
 	}
 out:
@@ -375,16 +400,16 @@ static int write_inode(struct inode *inode, int sync)
  */
 static void redirty_tail(struct inode *inode)
 {
-	struct backing_dev_info *bdi = inode_to_bdi(inode);
+	struct bdi_writeback *wb = inode_get_wb(inode);
 
-	if (!list_empty(&bdi->b_dirty)) {
+	if (!list_empty(&wb->b_dirty)) {
 		struct inode *tail;
 
-		tail = list_entry(bdi->b_dirty.next, struct inode, i_list);
+		tail = list_entry(wb->b_dirty.next, struct inode, i_list);
 		if (time_before(inode->dirtied_when, tail->dirtied_when))
 			inode->dirtied_when = jiffies;
 	}
-	list_move(&inode->i_list, &bdi->b_dirty);
+	list_move(&inode->i_list, &wb->b_dirty);
 }
 
 /*
@@ -392,7 +417,9 @@ static void redirty_tail(struct inode *inode)
  */
 static void requeue_io(struct inode *inode)
 {
-	list_move(&inode->i_list, &inode_to_bdi(inode)->b_more_io);
+	struct bdi_writeback *wb = inode_get_wb(inode);
+
+	list_move(&inode->i_list, &wb->b_more_io);
 }
 
 static void inode_sync_complete(struct inode *inode)
@@ -439,11 +466,10 @@ static void move_expired_inodes(struct list_head *delaying_queue,
 /*
  * Queue all expired dirty inodes for io, eldest first.
  */
-static void queue_io(struct backing_dev_info *bdi,
-		     unsigned long *older_than_this)
+static void queue_io(struct bdi_writeback *wb, unsigned long *older_than_this)
 {
-	list_splice_init(&bdi->b_more_io, bdi->b_io.prev);
-	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
+	list_splice_init(&wb->b_more_io, wb->b_io.prev);
+	move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 }
 
 /*
@@ -604,20 +630,20 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return __sync_single_inode(inode, wbc);
 }
 
-void generic_sync_bdi_inodes(struct super_block *sb,
-			     struct writeback_control *wbc)
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+				   struct super_block *sb,
+				   struct writeback_control *wbc)
 {
 	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
-	struct backing_dev_info *bdi = wbc->bdi;
 	const unsigned long start = jiffies;	/* livelock avoidance */
 
 	spin_lock(&inode_lock);
 
-	if (!wbc->for_kupdate || list_empty(&bdi->b_io))
-		queue_io(bdi, wbc->older_than_this);
+	if (!wbc->for_kupdate || list_empty(&wb->b_io))
+		queue_io(wb, wbc->older_than_this);
 
-	while (!list_empty(&bdi->b_io)) {
-		struct inode *inode = list_entry(bdi->b_io.prev,
+	while (!list_empty(&wb->b_io)) {
+		struct inode *inode = list_entry(wb->b_io.prev,
 						struct inode, i_list);
 		long pages_skipped;
 
@@ -629,7 +655,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
 			continue;
 		}
 
-		if (!bdi_cap_writeback_dirty(bdi)) {
+		if (!bdi_cap_writeback_dirty(wb->bdi)) {
 			redirty_tail(inode);
 			if (is_blkdev_sb) {
 				/*
@@ -651,7 +677,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
 			continue;
 		}
 
-		if (wbc->nonblocking && bdi_write_congested(bdi)) {
+		if (wbc->nonblocking && bdi_write_congested(wb->bdi)) {
 			wbc->encountered_congestion = 1;
 			if (!is_blkdev_sb)
 				break;		/* Skip a congested fs */
@@ -685,7 +711,7 @@ void generic_sync_bdi_inodes(struct super_block *sb,
 			wbc->more_io = 1;
 			break;
 		}
-		if (!list_empty(&bdi->b_more_io))
+		if (!list_empty(&wb->b_more_io))
 			wbc->more_io = 1;
 	}
 
@@ -693,6 +719,14 @@ void generic_sync_bdi_inodes(struct super_block *sb,
 	/* Leave any unwritten inodes on b_io */
 }
 
+void generic_sync_bdi_inodes(struct super_block *sb,
+			     struct writeback_control *wbc)
+{
+	struct backing_dev_info *bdi = wbc->bdi;
+
+	generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+}
+
 /*
  * Write out a superblock's list of dirty inodes.  A wait will be performed
  * upon no inodes, all inodes or the final one, depending upon sync_mode.
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 4a312e9..59f88e5 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -24,8 +24,8 @@ struct dentry;
  * Bits in backing_dev_info.state
  */
 enum bdi_state {
-	BDI_pdflush,		/* A pdflush thread is working this device */
 	BDI_pending,		/* On its way to being activated */
+	BDI_wb_alloc,		/* Default embedded wb allocated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_unused,		/* Available bits start here */
@@ -41,15 +41,22 @@ enum bdi_stat_item {
 
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
-struct bdi_writeback_arg {
-	unsigned long nr_pages;
-	struct super_block *sb;
+struct bdi_writeback {
+	struct backing_dev_info *bdi;		/* our parent bdi */
+	unsigned int nr;
+
+	struct task_struct	*task;		/* writeback task */
+	struct list_head	b_dirty;	/* dirty inodes */
+	struct list_head	b_io;		/* parked for writeback */
+	struct list_head	b_more_io;	/* parked for more writeback */
+
+	unsigned long		nr_pages;
+	struct super_block	*sb;
 	enum writeback_sync_modes sync_mode;
 };
 
 struct backing_dev_info {
 	struct list_head bdi_list;
-
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
@@ -66,13 +73,11 @@ struct backing_dev_info {
 	unsigned int min_ratio;
 	unsigned int max_ratio, max_prop_frac;
 
-	struct device *dev;
+	struct bdi_writeback wb;  /* default writeback info for this bdi */
+	unsigned long wb_active;  /* bitmap of active tasks */
+	unsigned long wb_mask;	  /* number of registered tasks */
 
-	struct task_struct	*task;		/* writeback task */
-	struct bdi_writeback_arg wb_arg;	/* protected by BDI_pdflush */
-	struct list_head	b_dirty;	/* dirty inodes */
-	struct list_head	b_io;		/* parked for writeback */
-	struct list_head	b_more_io;	/* parked for more writeback */
+	struct device *dev;
 
 #ifdef CONFIG_DEBUG_FS
 	struct dentry *debug_dir;
@@ -89,18 +94,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
 int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
 			 long nr_pages, enum writeback_sync_modes sync_mode);
-int bdi_writeback_task(struct backing_dev_info *bdi);
+int bdi_writeback_task(struct bdi_writeback *wb);
 void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
 void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
+int bdi_has_dirty_io(struct backing_dev_info *bdi);
 
 extern struct mutex bdi_lock;
 extern struct list_head bdi_list;
 
-static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&bdi->b_dirty) ||
-	       !list_empty(&bdi->b_io) ||
-	       !list_empty(&bdi->b_more_io);
+	return !list_empty(&wb->b_dirty) ||
+	       !list_empty(&wb->b_io) ||
+	       !list_empty(&wb->b_more_io);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 3dbfc76..75c9054 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -213,10 +213,45 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
+static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+{
+	memset(wb, 0, sizeof(*wb));
+
+	wb->bdi = bdi;
+	INIT_LIST_HEAD(&wb->b_dirty);
+	INIT_LIST_HEAD(&wb->b_io);
+	INIT_LIST_HEAD(&wb->b_more_io);
+}
+
+static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+	set_bit(0, &bdi->wb_mask);
+	wb->nr = 0;
+	return 0;
+}
+
+static void bdi_put_wb(struct backing_dev_info *bdi, struct bdi_writeback *wb)
+{
+	clear_bit(wb->nr, &bdi->wb_mask);
+	clear_bit(BDI_wb_alloc, &bdi->state);
+}
+
+static struct bdi_writeback *bdi_new_wb(struct backing_dev_info *bdi)
+{
+	struct bdi_writeback *wb;
+
+	set_bit(BDI_wb_alloc, &bdi->state);
+	wb = &bdi->wb;
+	wb_assign_nr(bdi, wb);
+	return wb;
+}
+
 static int bdi_start_fn(void *ptr)
 {
-	struct backing_dev_info *bdi = ptr;
+	struct bdi_writeback *wb = ptr;
+	struct backing_dev_info *bdi = wb->bdi;
 	struct task_struct *tsk = current;
+	int ret;
 
 	/*
 	 * Add us to the active bdi_list
@@ -240,7 +275,15 @@ static int bdi_start_fn(void *ptr)
 	smp_mb__after_clear_bit();
 	wake_up_bit(&bdi->state, BDI_pending);
 
-	return bdi_writeback_task(bdi);
+	ret = bdi_writeback_task(wb);
+
+	bdi_put_wb(bdi, wb);
+	return ret;
+}
+
+int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+	return wb_has_dirty_io(&bdi->wb);
 }
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
@@ -295,17 +338,18 @@ static void sync_supers_timer_fn(unsigned long unused)
 
 static int bdi_forker_task(void *ptr)
 {
-	struct backing_dev_info *me = ptr;
+	struct bdi_writeback *me = ptr;
 
 	for (;;) {
 		struct backing_dev_info *bdi, *tmp;
+		struct bdi_writeback *wb;
 
 		/*
 		 * Temporary measure, we want to make sure we don't see
 		 * dirty data on the default backing_dev_info
 		 */
-		if (bdi_has_dirty_io(me))
-			bdi_flush_io(me);
+		if (wb_has_dirty_io(me))
+			bdi_flush_io(me->bdi);
 
 		mutex_lock(&bdi_lock);
 
@@ -314,7 +358,7 @@ static int bdi_forker_task(void *ptr)
 		 * a thread registered. If so, set that up.
 		 */
 		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
-			if (bdi->task || !bdi_has_dirty_io(bdi))
+			if (bdi->wb.task || !bdi_has_dirty_io(bdi))
 				continue;
 
 			bdi_add_default_flusher_task(bdi);
@@ -340,17 +384,22 @@ static int bdi_forker_task(void *ptr)
 		list_del_init(&bdi->bdi_list);
 		mutex_unlock(&bdi_lock);
 
-		BUG_ON(bdi->task);
+		wb = bdi_new_wb(bdi);
+		if (!wb)
+			goto readd_flush;
 
-		bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+		wb->task = kthread_run(bdi_start_fn, wb, "bdi-%s",
 					dev_name(bdi->dev));
+
 		/*
 		 * If task creation fails, then readd the bdi to
 		 * the pending list and force writeout of the bdi
 		 * from this forker thread. That will free some memory
 		 * and we can try again.
 		 */
-		if (!bdi->task) {
+		if (!wb->task) {
+			bdi_put_wb(bdi, wb);
+readd_flush:
 			/*
 			 * Add this 'bdi' to the back, so we get
 			 * a chance to flush other bdi's to free
@@ -367,8 +416,18 @@ static int bdi_forker_task(void *ptr)
 	return 0;
 }
 
+/*
+ * Add a new flusher task that gets created for any bdi
+ * that has dirty data pending writeout
+ */
 void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
 {
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
+	/*
+	 * Someone already marked this pending for task creation
+	 */
 	if (test_and_set_bit(BDI_pending, &bdi->state))
 		return;
 
@@ -376,7 +435,7 @@ void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
 	list_move_tail(&bdi->bdi_list, &bdi_pending_list);
 	mutex_unlock(&bdi_lock);
 
-	wake_up_process(default_backing_dev_info.task);
+	wake_up_process(default_backing_dev_info.wb.task);
 }
 
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
@@ -409,13 +468,23 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	 * on-demand when they need it.
 	 */
 	if (bdi_cap_flush_forker(bdi)) {
-		bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+		struct bdi_writeback *wb;
+
+		wb = bdi_new_wb(bdi);
+		if (!wb) {
+			ret = -ENOMEM;
+			goto remove_err;
+		}
+
+		wb->task = kthread_run(bdi_forker_task, wb, "bdi-%s",
 						dev_name(dev));
-		if (!bdi->task) {
+		if (!wb->task) {
+			bdi_put_wb(bdi, wb);
+			ret = -ENOMEM;
+remove_err:
 			mutex_lock(&bdi_lock);
 			list_del(&bdi->bdi_list);
 			mutex_unlock(&bdi_lock);
-			ret = -ENOMEM;
 			goto exit;
 		}
 	}
@@ -438,28 +507,37 @@ static int sched_wait(void *word)
 	return 0;
 }
 
+/*
+ * Remove bdi from global list and shutdown any threads we have running
+ */
 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
+	if (!bdi_cap_writeback_dirty(bdi))
+		return;
+
 	/*
 	 * If setup is pending, wait for that to complete first
 	 */
 	wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
 
+	/*
+	 * Make sure nobody finds us on the bdi_list anymore
+	 */
 	mutex_lock(&bdi_lock);
 	list_del(&bdi->bdi_list);
 	mutex_unlock(&bdi_lock);
+
+	/*
+	 * Finally, kill the kernel thread
+	 */
+	kthread_stop(bdi->wb.task);
 }
 
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
-		if (!bdi_cap_flush_forker(bdi)) {
+		if (!bdi_cap_flush_forker(bdi))
 			bdi_wb_shutdown(bdi);
-			if (bdi->task) {
-				kthread_stop(bdi->task);
-				bdi->task = NULL;
-			}
-		}
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -477,9 +555,9 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
 	INIT_LIST_HEAD(&bdi->bdi_list);
-	INIT_LIST_HEAD(&bdi->b_io);
-	INIT_LIST_HEAD(&bdi->b_dirty);
-	INIT_LIST_HEAD(&bdi->b_more_io);
+	bdi->wb_mask = bdi->wb_active = 0;
+
+	bdi_wb_init(&bdi->wb, bdi);
 
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++) {
 		err = percpu_counter_init(&bdi->bdi_stat[i], 0);
@@ -504,9 +582,7 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
-	WARN_ON(!list_empty(&bdi->b_dirty));
-	WARN_ON(!list_empty(&bdi->b_io));
-	WARN_ON(!list_empty(&bdi->b_more_io));
+	WARN_ON(bdi_has_dirty_io(bdi));
 
 	bdi_unregister(bdi);
 
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 07/11] writeback: support > 1 flusher thread per bdi
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (5 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 06/11] writeback: separate the flushing state/task from the bdi Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 08/11] writeback: allow sleepy exit of default writeback task Jens Axboe
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

Build on the bdi_writeback support by allowing registration of
more than 1 flusher thread. File systems can call bdi_add_flusher_task(bdi)
to add more flusher threads to the device. If they do so, they must also
provide a super_operations function to return the suitable bdi_writeback
struct from any given inode.
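
For illustration only, a filesystem that wants two flusher threads per
device might wire this up roughly as follows. This is hypothetical
code, not part of the patch; myfs_pick_wb() stands in for whatever
fs-private policy maps an inode to one of the registered writeback
structs:

	/* at mount time, once the bdi has been registered */
	bdi_add_flusher_task(bdi);

	static struct bdi_writeback *myfs_inode_get_wb(struct inode *inode)
	{
		/* fs-private: e.g. hash i_ino over the available threads */
		return myfs_pick_wb(inode);
	}

	static const struct super_operations myfs_sops = {
		.inode_get_wb	= myfs_inode_get_wb,
	};

If ->inode_get_wb is not provided, all inodes map to the single
embedded default thread, as before.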

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |  445 +++++++++++++++++++++++++++++++++++--------
 include/linux/backing-dev.h |   34 +++-
 include/linux/fs.h          |    3 +
 include/linux/writeback.h   |    1 +
 mm/backing-dev.c            |  242 +++++++++++++++++++-----
 5 files changed, 592 insertions(+), 133 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index ed242d5..f3db578 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -34,80 +34,249 @@
  */
 int nr_pdflush_threads;
 
-/**
- * writeback_acquire - attempt to get exclusive writeback access to a device
- * @bdi: the device's backing_dev_info structure
- *
- * It is a waste of resources to have more than one pdflush thread blocked on
- * a single request queue.  Exclusion at the request_queue level is obtained
- * via a flag in the request_queue's backing_dev_info.state.
- *
- * Non-request_queue-backed address_spaces will share default_backing_dev_info,
- * unless they implement their own.  Which is somewhat inefficient, as this
- * may prevent concurrent writeback against multiple devices.
+static void generic_sync_wb_inodes(struct bdi_writeback *wb,
+				   struct super_block *sb,
+				   struct writeback_control *wbc);
+
+/*
+ * Work items for the bdi_writeback threads
  */
-static int writeback_acquire(struct bdi_writeback *wb)
+struct bdi_work {
+	struct list_head list;
+	struct list_head wait_list;
+	struct rcu_head rcu_head;
+
+	unsigned long seen;
+	atomic_t pending;
+
+	unsigned long sb_data;
+	unsigned long nr_pages;
+	enum writeback_sync_modes sync_mode;
+
+	unsigned long state;
+};
+
+static struct super_block *bdi_work_sb(struct bdi_work *work)
 {
-	struct backing_dev_info *bdi = wb->bdi;
+	return (struct super_block *) (work->sb_data & ~1UL);
+}
+
+static inline bool bdi_work_on_stack(struct bdi_work *work)
+{
+	return work->sb_data & 1UL;
+}
 
-	return !test_and_set_bit(wb->nr, &bdi->wb_active);
+static inline void bdi_work_init(struct bdi_work *work, struct super_block *sb,
+				 unsigned long nr_pages,
+				 enum writeback_sync_modes sync_mode)
+{
+	INIT_RCU_HEAD(&work->rcu_head);
+	work->sb_data = (unsigned long) sb;
+	work->nr_pages = nr_pages;
+	work->sync_mode = sync_mode;
+	work->state = 1;
+}
+
+static inline void bdi_work_init_on_stack(struct bdi_work *work,
+					  struct super_block *sb,
+					  unsigned long nr_pages,
+					  enum writeback_sync_modes sync_mode)
+{
+	bdi_work_init(work, sb, nr_pages, sync_mode);
+	work->sb_data |= 1UL;
 }
 
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
  *
- * Determine whether there is writeback in progress against a backing device.
+ * Determine whether there is writeback waiting to be handled against a
+ * backing device.
  */
 int writeback_in_progress(struct backing_dev_info *bdi)
 {
-	return bdi->wb_active != 0;
+	return !list_empty(&bdi->work_list);
 }
 
-/**
- * writeback_release - relinquish exclusive writeback access against a device.
- * @bdi: the device's backing_dev_info structure
- */
-static void writeback_release(struct bdi_writeback *wb)
+static void bdi_work_clear(struct bdi_work *work)
 {
-	struct backing_dev_info *bdi = wb->bdi;
+	clear_bit(0, &work->state);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&work->state, 0);
+}
 
-	wb->nr_pages = 0;
-	wb->sb = NULL;
-	clear_bit(wb->nr, &bdi->wb_active);
+static void bdi_work_free(struct rcu_head *head)
+{
+	struct bdi_work *work = container_of(head, struct bdi_work, rcu_head);
+
+	if (!bdi_work_on_stack(work))
+		kfree(work);
+	else
+		bdi_work_clear(work);
 }
 
-static void wb_start_writeback(struct bdi_writeback *wb, struct super_block *sb,
-			       long nr_pages,
-			       enum writeback_sync_modes sync_mode)
+static void wb_work_complete(struct bdi_work *work)
 {
-	if (!wb_has_dirty_io(wb))
-		return;
+	const enum writeback_sync_modes sync_mode = work->sync_mode;
 
-	if (writeback_acquire(wb)) {
-		wb->nr_pages = nr_pages;
-		wb->sb = sb;
-		wb->sync_mode = sync_mode;
+	/*
+	 * For allocated work, we can clear the done/seen bit right here.
+	 * For on-stack work, we need to postpone both the clear and free
+	 * to after the RCU grace period, since the stack could be invalidated
+	 * as soon as bdi_work_clear() has done the wakeup.
+	 */
+	if (!bdi_work_on_stack(work))
+		bdi_work_clear(work);
+	if (sync_mode == WB_SYNC_NONE || bdi_work_on_stack(work))
+		call_rcu(&work->rcu_head, bdi_work_free);
+}
 
-		if (wb->task)
-			wake_up_process(wb->task);
+static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
+{
+	/*
+	 * The caller has retrieved the work arguments from this work,
+	 * drop our reference. If this is the last ref, delete and free it
+	 */
+	if (atomic_dec_and_test(&work->pending)) {
+		struct backing_dev_info *bdi = wb->bdi;
+
+		spin_lock(&bdi->wb_lock);
+		list_del_rcu(&work->list);
+		spin_unlock(&bdi->wb_lock);
+
+		wb_work_complete(work);
 	}
 }
 
-int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
-			 long nr_pages, enum writeback_sync_modes sync_mode)
+static void wb_start_writeback(struct bdi_writeback *wb, struct bdi_work *work)
 {
 	/*
-	 * This only happens the first time someone kicks this bdi, so put
-	 * it out-of-line.
+	 * If we failed to allocate the bdi work item, always wake up the
+	 * wb thread. As a safety precaution, it'll then flush out everything
 	 */
-	if (unlikely(!bdi->wb.task)) {
+	if (!wb_has_dirty_io(wb) && work)
+		wb_clear_pending(wb, work);
+	else if (wb->task)
+		wake_up_process(wb->task);
+}
+
+static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+	if (work) {
+		work->seen = bdi->wb_mask;
+		BUG_ON(!work->seen);
+		atomic_set(&work->pending, bdi->wb_cnt);
+		BUG_ON(!bdi->wb_cnt);
+
+		/*
+		 * Make sure stores are seen before it appears on the list
+		 */
+		smp_mb();
+
+		spin_lock(&bdi->wb_lock);
+		list_add_tail_rcu(&work->list, &bdi->work_list);
+		spin_unlock(&bdi->wb_lock);
+	}
+}
+
+static void bdi_sched_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+	if (!bdi_wblist_needs_lock(bdi))
+		wb_start_writeback(&bdi->wb, work);
+	else {
+		struct bdi_writeback *wb;
+		int idx;
+
+		idx = srcu_read_lock(&bdi->srcu);
+
+		list_for_each_entry_rcu(wb, &bdi->wb_list, list)
+			wb_start_writeback(wb, work);
+
+		srcu_read_unlock(&bdi->srcu, idx);
+	}
+}
+
+static void __bdi_start_work(struct backing_dev_info *bdi,
+			     struct bdi_work *work)
+{
+	/*
+	 * If the default thread isn't there, make sure we add it. When
+	 * it gets created and wakes up, we'll run this work.
+	 */
+	if (unlikely(list_empty_careful(&bdi->wb_list)))
 		bdi_add_default_flusher_task(bdi);
-		return 1;
+	else
+		bdi_sched_work(bdi, work);
+}
+
+static void bdi_start_work(struct backing_dev_info *bdi, struct bdi_work *work)
+{
+	/*
+	 * If the default thread isn't there, make sure we add it. When
+	 * it gets created and wakes up, we'll run this work.
+	 */
+	if (unlikely(list_empty_careful(&bdi->wb_list))) {
+		mutex_lock(&bdi_lock);
+		bdi_add_default_flusher_task(bdi);
+		mutex_unlock(&bdi_lock);
+	} else
+		bdi_sched_work(bdi, work);
+}
+
+/*
+ * Used for on-stack allocated work items. The caller needs to wait until
+ * the wb threads have acked the work before it's safe to continue.
+ */
+static void bdi_wait_on_work_clear(struct bdi_work *work)
+{
+	wait_on_bit(&work->state, 0, bdi_sched_wait, TASK_UNINTERRUPTIBLE);
+}
+
+static struct bdi_work *bdi_alloc_work(struct super_block *sb, long nr_pages,
+				       enum writeback_sync_modes sync_mode)
+{
+	struct bdi_work *work;
+
+	work = kmalloc(sizeof(*work), GFP_ATOMIC);
+	if (work)
+		bdi_work_init(work, sb, nr_pages, sync_mode);
+
+	return work;
+}
+
+void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+			 long nr_pages, enum writeback_sync_modes sync_mode)
+{
+	const bool must_wait = sync_mode == WB_SYNC_ALL;
+	struct bdi_work work_stack, *work = NULL;
+
+	if (!must_wait)
+		work = bdi_alloc_work(sb, nr_pages, sync_mode);
+
+	if (!work) {
+		work = &work_stack;
+		bdi_work_init_on_stack(work, sb, nr_pages, sync_mode);
 	}
 
-	wb_start_writeback(&bdi->wb, sb, nr_pages, sync_mode);
-	return 0;
+	bdi_queue_work(bdi, work);
+	bdi_start_work(bdi, work);
+
+	/*
+	 * If the sync mode is WB_SYNC_ALL, block waiting for the work to
+	 * complete. If not, we only need to wait for the work to be started,
+	 * if we allocated it on-stack. We use the same mechanism, if the
+	 * wait bit is set in the bdi_work struct, then threads will not
+	 * clear pending until after they are done.
+	 *
+	 * Note that work == &work_stack if must_wait is true, so we don't
+	 * need to do call_rcu() here ever, since the completion path will
+	 * have done that for us.
+	 */
+	if (must_wait || work == &work_stack) {
+		bdi_wait_on_work_clear(work);
+		if (work != &work_stack)
+			call_rcu(&work->rcu_head, bdi_work_free);
+	}
 }
 
 /*
@@ -157,7 +326,7 @@ static void wb_kupdated(struct bdi_writeback *wb)
 		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		generic_sync_bdi_inodes(NULL, &wbc);
+		generic_sync_wb_inodes(wb, NULL, &wbc);
 		if (wbc.nr_to_write > 0)
 			break;	/* All the old data is written */
 		nr_to_write -= MAX_WRITEBACK_PAGES;
@@ -174,22 +343,19 @@ static inline bool over_bground_thresh(void)
 		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
 }
 
-static void generic_sync_wb_inodes(struct bdi_writeback *wb,
-				   struct super_block *sb,
-				   struct writeback_control *wbc);
-
-static void wb_writeback(struct bdi_writeback *wb)
+static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
+			   struct super_block *sb,
+			   enum writeback_sync_modes sync_mode)
 {
 	struct writeback_control wbc = {
 		.bdi			= wb->bdi,
-		.sync_mode		= wb->sync_mode,
+		.sync_mode		= sync_mode,
 		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 	};
-	long nr_pages = wb->nr_pages;
 
 	for (;;) {
-		if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+		if (sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
 		    !over_bground_thresh())
 			break;
 
@@ -197,7 +363,7 @@ static void wb_writeback(struct bdi_writeback *wb)
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
-		generic_sync_wb_inodes(wb, wb->sb, &wbc);
+		generic_sync_wb_inodes(wb, sb, &wbc);
 		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
@@ -211,6 +377,82 @@ static void wb_writeback(struct bdi_writeback *wb)
 }
 
 /*
+ * Return the next bdi_work struct that hasn't been processed by this
+ * wb thread yet
+ */
+static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
+					   struct bdi_writeback *wb)
+{
+	struct bdi_work *work, *ret = NULL;
+
+	rcu_read_lock();
+
+	list_for_each_entry_rcu(work, &bdi->work_list, list) {
+		if (!test_and_clear_bit(wb->nr, &work->seen))
+			continue;
+
+		ret = work;
+		break;
+	}
+
+	rcu_read_unlock();
+	return ret;
+}
+
+/*
+ * Retrieve work items and do the writeback they describe
+ */
+static void wb_writeback(struct bdi_writeback *wb)
+{
+	struct backing_dev_info *bdi = wb->bdi;
+	struct bdi_work *work;
+
+	while ((work = get_next_work_item(bdi, wb)) != NULL) {
+		struct super_block *sb = bdi_work_sb(work);
+		long nr_pages = work->nr_pages;
+		enum writeback_sync_modes sync_mode = work->sync_mode;
+
+		/*
+		 * If this isn't a data integrity operation, just notify
+		 * that we have seen this work and we are now starting it.
+		 */
+		if (sync_mode == WB_SYNC_NONE)
+			wb_clear_pending(wb, work);
+
+		__wb_writeback(wb, nr_pages, sb, sync_mode);
+
+		/*
+		 * This is a data integrity writeback, so only do the
+		 * notification when we have completed the work.
+		 */
+		if (sync_mode == WB_SYNC_ALL)
+			wb_clear_pending(wb, work);
+	}
+}
+
+/*
+ * This will be inlined in bdi_writeback_task() once we get rid of any
+ * dirty inodes on the default_backing_dev_info
+ */
+void wb_do_writeback(struct bdi_writeback *wb)
+{
+	/*
+	 * We get here in two cases:
+	 *
+	 *  schedule_timeout() returned because the dirty writeback
+	 *  interval has elapsed. If that happens, the work item list
+	 *  will be empty and we will proceed to do kupdated style writeout.
+	 *
+	 *  Someone called bdi_start_writeback(), which put one or more work
+	 *  items on the work_list. Process those.
+	 */
+	if (list_empty(&wb->bdi->work_list))
+		wb_kupdated(wb);
+	else
+		wb_writeback(wb);
+}
+
+/*
  * Handle writeback of dirty data for the device backed by this bdi. Also
  * wakes up periodically and does kupdated style flushing.
  */
@@ -219,57 +461,84 @@ int bdi_writeback_task(struct bdi_writeback *wb)
 	while (!kthread_should_stop()) {
 		unsigned long wait_jiffies;
 
+		wb_do_writeback(wb);
+
 		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
 		set_current_state(TASK_INTERRUPTIBLE);
 		schedule_timeout(wait_jiffies);
 		try_to_freeze();
-
-		/*
-		 * We get here in two cases:
-		 *
-		 *  schedule_timeout() returned because the dirty writeback
-		 *  interval has elapsed. If that happens, we will be able
-		 *  to acquire the writeback lock and will proceed to do
-		 *  kupdated style writeout.
-		 *
-		 *  Someone called bdi_start_writeback(), which will acquire
-		 *  the writeback lock. This means our writeback_acquire()
-		 *  below will fail and we call into bdi_pdflush() for
-		 *  pdflush style writeout.
-		 *
-		 */
-		if (writeback_acquire(wb))
-			wb_kupdated(wb);
-		else
-			wb_writeback(wb);
-
-		writeback_release(wb);
 	}
 
 	return 0;
 }
 
+/*
+ * Schedule writeback for all backing devices. Expensive! If this is a data
+ * integrity operation, writeback will be complete when this returns. If
+ * we are simply called for WB_SYNC_NONE, then writeback will merely be
+ * scheduled to run.
+ */
 void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
 {
+	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
 	struct backing_dev_info *bdi, *tmp;
+	struct bdi_work *work;
+	LIST_HEAD(list);
 
 	mutex_lock(&bdi_lock);
 
 	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+		struct bdi_work *work;
+
 		if (!bdi_has_dirty_io(bdi))
 			continue;
-		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+
+		/*
+		 * If work allocation fails, do the writes inline. An
+		 * alternative approach would be to fall back to an on-stack
+		 * allocation of work. For that we need to drop the bdi_lock
+		 * and restart the scan afterwards, though.
+		 */
+		work = bdi_alloc_work(sb, wbc->nr_to_write, wbc->sync_mode);
+		if (!work) {
+			wbc->bdi = bdi;
+			generic_sync_bdi_inodes(sb, wbc);
+			continue;
+		}
+		if (must_wait)
+			list_add_tail(&work->wait_list, &list);
+
+		bdi_queue_work(bdi, work);
+		__bdi_start_work(bdi, work);
 	}
 
 	mutex_unlock(&bdi_lock);
+
+	/*
+	 * If this is for WB_SYNC_ALL, wait for pending work to complete
+	 * before returning.
+	 */
+	while (!list_empty(&list)) {
+		work = list_entry(list.next, struct bdi_work, wait_list);
+		list_del(&work->wait_list);
+		bdi_wait_on_work_clear(work);
+		call_rcu(&work->rcu_head, bdi_work_free);
+	}
 }
 
 /*
- * We have only a single wb per bdi, so just return that.
+ * If the filesystem didn't provide a way to map an inode to a dedicated
+ * flusher thread, it doesn't support more than 1 thread. So we know it's
+ * the default thread, return that.
  */
 static inline struct bdi_writeback *inode_get_wb(struct inode *inode)
 {
-	return &inode_to_bdi(inode)->wb;
+	const struct super_operations *sop = inode->i_sb->s_op;
+
+	if (!sop->inode_get_wb)
+		return &inode_to_bdi(inode)->wb;
+
+	return sop->inode_get_wb(inode);
 }
 
 /**
@@ -723,8 +992,24 @@ void generic_sync_bdi_inodes(struct super_block *sb,
 			     struct writeback_control *wbc)
 {
 	struct backing_dev_info *bdi = wbc->bdi;
+	struct bdi_writeback *wb;
 
-	generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+	/*
+	 * Common case is just a single wb thread and that is embedded in
+	 * the bdi, so it doesn't need locking
+	 */
+	if (!bdi_wblist_needs_lock(bdi))
+		generic_sync_wb_inodes(&bdi->wb, sb, wbc);
+	else {
+		int idx;
+
+		idx = srcu_read_lock(&bdi->srcu);
+
+		list_for_each_entry_rcu(wb, &bdi->wb_list, list)
+			generic_sync_wb_inodes(wb, sb, wbc);
+
+		srcu_read_unlock(&bdi->srcu, idx);
+	}
 }
 
 /*
@@ -751,7 +1036,7 @@ void generic_sync_sb_inodes(struct super_block *sb,
 				struct writeback_control *wbc)
 {
 	if (wbc->bdi)
-		generic_sync_bdi_inodes(sb, wbc);
+		bdi_start_writeback(wbc->bdi, sb, wbc->nr_to_write, wbc->sync_mode);
 	else
 		bdi_writeback_all(sb, wbc);
 
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 59f88e5..8584438 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,8 @@
 #include <linux/proportions.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/sched.h>
+#include <linux/srcu.h>
 #include <linux/writeback.h>
 #include <asm/atomic.h>
 
@@ -26,6 +28,7 @@ struct dentry;
 enum bdi_state {
 	BDI_pending,		/* On its way to being activated */
 	BDI_wb_alloc,		/* Default embedded wb allocated */
+	BDI_wblist_lock,	/* bdi->wb_list now needs locking */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_unused,		/* Available bits start here */
@@ -42,6 +45,8 @@ enum bdi_stat_item {
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
 struct bdi_writeback {
+	struct list_head list;			/* hangs off the bdi */
+
 	struct backing_dev_info *bdi;		/* our parent bdi */
 	unsigned int nr;
 
@@ -49,13 +54,12 @@ struct bdi_writeback {
 	struct list_head	b_dirty;	/* dirty inodes */
 	struct list_head	b_io;		/* parked for writeback */
 	struct list_head	b_more_io;	/* parked for more writeback */
-
-	unsigned long		nr_pages;
-	struct super_block	*sb;
-	enum writeback_sync_modes sync_mode;
 };
 
+#define BDI_MAX_FLUSHERS	32
+
 struct backing_dev_info {
+	struct srcu_struct srcu; /* for wb_list read side protection */
 	struct list_head bdi_list;
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
@@ -74,8 +78,12 @@ struct backing_dev_info {
 	unsigned int max_ratio, max_prop_frac;
 
 	struct bdi_writeback wb;  /* default writeback info for this bdi */
-	unsigned long wb_active;  /* bitmap of active tasks */
-	unsigned long wb_mask;	  /* number of registered tasks */
+	spinlock_t wb_lock;	  /* protects update side of wb_list */
+	struct list_head wb_list; /* the flusher threads hanging off this bdi */
+	unsigned long wb_mask;	  /* bitmask of registered tasks */
+	unsigned int wb_cnt;	  /* number of registered tasks */
+
+	struct list_head work_list;
 
 	struct device *dev;
 
@@ -92,16 +100,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
-int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
 			 long nr_pages, enum writeback_sync_modes sync_mode);
 int bdi_writeback_task(struct bdi_writeback *wb);
 void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
 void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
+void bdi_add_flusher_task(struct backing_dev_info *bdi);
 int bdi_has_dirty_io(struct backing_dev_info *bdi);
 
 extern struct mutex bdi_lock;
 extern struct list_head bdi_list;
 
+static inline int bdi_wblist_needs_lock(struct backing_dev_info *bdi)
+{
+	return test_bit(BDI_wblist_lock, &bdi->state);
+}
+
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
 	return !list_empty(&wb->b_dirty) ||
@@ -314,4 +328,10 @@ static inline bool mapping_cap_swap_backed(struct address_space *mapping)
 	return bdi_cap_swap_backed(mapping->backing_dev_info);
 }
 
+static inline int bdi_sched_wait(void *word)
+{
+	schedule();
+	return 0;
+}
+
 #endif		/* _LINUX_BACKING_DEV_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ecdc544..d3bda5d 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1550,11 +1550,14 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
 extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
 		unsigned long, loff_t *);
 
+struct bdi_writeback;
+
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
 	void (*destroy_inode)(struct inode *);
 
    	void (*dirty_inode) (struct inode *);
+	struct bdi_writeback *(*inode_get_wb) (struct inode *);
 	int (*write_inode) (struct inode *, int);
 	void (*drop_inode) (struct inode *);
 	void (*delete_inode) (struct inode *);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index baf04a9..e414702 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -69,6 +69,7 @@ void writeback_inodes(struct writeback_control *wbc);
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
 void sync_inodes(int wait);
+void wb_do_writeback(struct bdi_writeback *wb);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 75c9054..8980f6f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -213,52 +213,100 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
-static void bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
 {
-	memset(wb, 0, sizeof(*wb));
+	unsigned long mask = BDI_MAX_FLUSHERS - 1;
+	unsigned int nr;
 
-	wb->bdi = bdi;
-	INIT_LIST_HEAD(&wb->b_dirty);
-	INIT_LIST_HEAD(&wb->b_io);
-	INIT_LIST_HEAD(&wb->b_more_io);
-}
+	do {
+		if ((bdi->wb_mask & mask) == mask)
+			return 1;
+
+		nr = find_first_zero_bit(&bdi->wb_mask, BDI_MAX_FLUSHERS);
+	} while (test_and_set_bit(nr, &bdi->wb_mask));
+
+	wb->nr = nr;
+
+	spin_lock(&bdi->wb_lock);
+	bdi->wb_cnt++;
+	spin_unlock(&bdi->wb_lock);
 
-static int wb_assign_nr(struct backing_dev_info *bdi, struct bdi_writeback *wb)
-{
-	set_bit(0, &bdi->wb_mask);
-	wb->nr = 0;
 	return 0;
 }
 
 static void bdi_put_wb(struct backing_dev_info *bdi, struct bdi_writeback *wb)
 {
-	clear_bit(wb->nr, &bdi->wb_mask);
-	clear_bit(BDI_wb_alloc, &bdi->state);
+	/*
+	 * If this is the default wb thread exiting, leave the bit set
+	 * in the wb mask, as we set that before the thread is created as
+	 * well. This is done to make sure that work assigned while no
+	 * thread is running has at least one recipient.
+	 */
+	if (wb == &bdi->wb)
+		clear_bit(BDI_wb_alloc, &bdi->state);
+	else {
+		clear_bit(wb->nr, &bdi->wb_mask);
+		kfree(wb);
+		spin_lock(&bdi->wb_lock);
+		bdi->wb_cnt--;
+		spin_unlock(&bdi->wb_lock);
+	}
+}
+
+static int bdi_wb_init(struct bdi_writeback *wb, struct backing_dev_info *bdi)
+{
+	memset(wb, 0, sizeof(*wb));
+
+	wb->bdi = bdi;
+	INIT_LIST_HEAD(&wb->b_dirty);
+	INIT_LIST_HEAD(&wb->b_io);
+	INIT_LIST_HEAD(&wb->b_more_io);
+
+	return wb_assign_nr(bdi, wb);
 }
 
 static struct bdi_writeback *bdi_new_wb(struct backing_dev_info *bdi)
 {
 	struct bdi_writeback *wb;
 
-	set_bit(BDI_wb_alloc, &bdi->state);
-	wb = &bdi->wb;
-	wb_assign_nr(bdi, wb);
+	/*
+	 * Default bdi->wb is already assigned, so just return it
+	 */
+	if (!test_and_set_bit(BDI_wb_alloc, &bdi->state))
+		wb = &bdi->wb;
+	else {
+		wb = kmalloc(sizeof(struct bdi_writeback), GFP_KERNEL);
+		if (wb) {
+			if (bdi_wb_init(wb, bdi)) {
+				kfree(wb);
+				wb = NULL;
+			}
+		}
+	}
+
 	return wb;
 }
 
-static int bdi_start_fn(void *ptr)
+static void bdi_task_init(struct backing_dev_info *bdi,
+			  struct bdi_writeback *wb)
 {
-	struct bdi_writeback *wb = ptr;
-	struct backing_dev_info *bdi = wb->bdi;
 	struct task_struct *tsk = current;
-	int ret;
+	int was_empty;
 
 	/*
-	 * Add us to the active bdi_list
+	 * Add us to the active bdi_list. If we are adding threads beyond
+	 * the default embedded bdi_writeback, then we need to start using
+	 * proper locking. Check whether the list was empty first, and set
+	 * the BDI_wblist_lock flag if there is more than one entry now.
 	 */
-	mutex_lock(&bdi_lock);
-	list_add(&bdi->bdi_list, &bdi_list);
-	mutex_unlock(&bdi_lock);
+	spin_lock(&bdi->wb_lock);
+
+	was_empty = list_empty(&bdi->wb_list);
+	list_add_tail_rcu(&wb->list, &bdi->wb_list);
+	if (!was_empty)
+		set_bit(BDI_wblist_lock, &bdi->state);
+
+	spin_unlock(&bdi->wb_lock);
 
 	tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
 	set_freezable();
@@ -267,6 +315,22 @@ static int bdi_start_fn(void *ptr)
 	 * Our parent may run at a different priority, just set us to normal
 	 */
 	set_user_nice(tsk, 0);
+}
+
+static int bdi_start_fn(void *ptr)
+{
+	struct bdi_writeback *wb = ptr;
+	struct backing_dev_info *bdi = wb->bdi;
+	int ret;
+
+	/*
+	 * Add us to the active bdi_list
+	 */
+	mutex_lock(&bdi_lock);
+	list_add(&bdi->bdi_list, &bdi_list);
+	mutex_unlock(&bdi_lock);
+
+	bdi_task_init(bdi, wb);
 
 	/*
 	 * Clear pending bit and wakeup anybody waiting to tear us down
@@ -277,13 +341,44 @@ static int bdi_start_fn(void *ptr)
 
 	ret = bdi_writeback_task(wb);
 
+	/*
+	 * Remove us from the list
+	 */
+	spin_lock(&bdi->wb_lock);
+	list_del_rcu(&wb->list);
+	spin_unlock(&bdi->wb_lock);
+
+	/*
+	 * wait for rcu grace period to end, so we can free wb
+	 */
+	synchronize_srcu(&bdi->srcu);
+
 	bdi_put_wb(bdi, wb);
 	return ret;
 }
 
 int bdi_has_dirty_io(struct backing_dev_info *bdi)
 {
-	return wb_has_dirty_io(&bdi->wb);
+	struct bdi_writeback *wb;
+	int ret = 0;
+
+	if (!bdi_wblist_needs_lock(bdi))
+		ret = wb_has_dirty_io(&bdi->wb);
+	else {
+		int idx;
+
+		idx = srcu_read_lock(&bdi->srcu);
+
+		list_for_each_entry_rcu(wb, &bdi->wb_list, list) {
+			ret = wb_has_dirty_io(wb);
+			if (ret)
+				break;
+		}
+
+		srcu_read_unlock(&bdi->srcu, idx);
+	}
+
+	return ret;
 }
 
 static void bdi_flush_io(struct backing_dev_info *bdi)
@@ -340,6 +435,8 @@ static int bdi_forker_task(void *ptr)
 {
 	struct bdi_writeback *me = ptr;
 
+	bdi_task_init(me->bdi, me);
+
 	for (;;) {
 		struct backing_dev_info *bdi, *tmp;
 		struct bdi_writeback *wb;
@@ -348,8 +445,8 @@ static int bdi_forker_task(void *ptr)
 		 * Temporary measure, we want to make sure we don't see
 		 * dirty data on the default backing_dev_info
 		 */
-		if (wb_has_dirty_io(me))
-			bdi_flush_io(me->bdi);
+		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
+			wb_do_writeback(me);
 
 		mutex_lock(&bdi_lock);
 
@@ -417,27 +514,70 @@ readd_flush:
 }
 
 /*
- * Add a new flusher task that gets created for any bdi
- * that has dirty data pending writeout
+ * bdi_lock held on entry
  */
-void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+static void bdi_add_one_flusher_task(struct backing_dev_info *bdi,
+				     int(*func)(struct backing_dev_info *))
 {
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
 	/*
-	 * Someone already marked this pending for task creation
+	 * Check with the helper whether to proceed adding a task. This will
+	 * only abort if two or more simultaneous calls to
+	 * bdi_add_default_flusher_task() occurred; further additions will
+	 * block waiting for previous additions to finish.
 	 */
-	if (test_and_set_bit(BDI_pending, &bdi->state))
-		return;
+	if (!func(bdi)) {
+		list_move_tail(&bdi->bdi_list, &bdi_pending_list);
 
-	mutex_lock(&bdi_lock);
-	list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+		/*
+		 * We are now on the pending list, wake up bdi_forker_task()
+		 * to finish the job and add us back to the active bdi_list
+		 */
+		wake_up_process(default_backing_dev_info.wb.task);
+	}
+}
+
+static int flusher_add_helper_block(struct backing_dev_info *bdi)
+{
 	mutex_unlock(&bdi_lock);
+	wait_on_bit_lock(&bdi->state, BDI_pending, bdi_sched_wait,
+				TASK_UNINTERRUPTIBLE);
+	mutex_lock(&bdi_lock);
+	return 0;
+}
 
-	wake_up_process(default_backing_dev_info.wb.task);
+static int flusher_add_helper_test(struct backing_dev_info *bdi)
+{
+	return test_and_set_bit(BDI_pending, &bdi->state);
+}
+
+/*
+ * Add the default flusher task that gets created for any bdi
+ * that has dirty data pending writeout
+ */
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+	bdi_add_one_flusher_task(bdi, flusher_add_helper_test);
 }
 
+/**
+ * bdi_add_flusher_task - add one more flusher task to this @bdi
+ * @bdi:	the bdi
+ *
+ * Add an additional flusher task to this @bdi. Will block waiting on
+ * previous additions, if any.
+ *
+ */
+void bdi_add_flusher_task(struct backing_dev_info *bdi)
+{
+	mutex_lock(&bdi_lock);
+	bdi_add_one_flusher_task(bdi, flusher_add_helper_block);
+	mutex_unlock(&bdi_lock);
+}
+EXPORT_SYMBOL(bdi_add_flusher_task);
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...)
 {
@@ -501,24 +641,21 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
-static int sched_wait(void *word)
-{
-	schedule();
-	return 0;
-}
-
 /*
  * Remove bdi from global list and shutdown any threads we have running
  */
 static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
+	struct bdi_writeback *wb;
+
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
 	/*
 	 * If setup is pending, wait for that to complete first
 	 */
-	wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+	wait_on_bit(&bdi->state, BDI_pending, bdi_sched_wait,
+			TASK_UNINTERRUPTIBLE);
 
 	/*
 	 * Make sure nobody finds us on the bdi_list anymore
@@ -528,9 +665,11 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	mutex_unlock(&bdi_lock);
 
 	/*
-	 * Finally, kill the kernel thread
+	 * Finally, kill the kernel threads. We don't need to be RCU
+	 * safe anymore, since the bdi is no longer visible.
 	 */
-	kthread_stop(bdi->wb.task);
+	list_for_each_entry(wb, &bdi->wb_list, list)
+		kthread_stop(wb->task);
 }
 
 void bdi_unregister(struct backing_dev_info *bdi)
@@ -554,8 +693,12 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
+	spin_lock_init(&bdi->wb_lock);
+	bdi->wb_mask = 0;
+	bdi->wb_cnt = 0;
 	INIT_LIST_HEAD(&bdi->bdi_list);
-	bdi->wb_mask = bdi->wb_active = 0;
+	INIT_LIST_HEAD(&bdi->wb_list);
+	INIT_LIST_HEAD(&bdi->work_list);
 
 	bdi_wb_init(&bdi->wb, bdi);
 
@@ -565,10 +708,15 @@ int bdi_init(struct backing_dev_info *bdi)
 			goto err;
 	}
 
+	err = init_srcu_struct(&bdi->srcu);
+	if (err)
+		goto err;
+
 	bdi->dirty_exceeded = 0;
 	err = prop_local_init_percpu(&bdi->completions);
 
 	if (err) {
+		cleanup_srcu_struct(&bdi->srcu);
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
@@ -586,6 +734,8 @@ void bdi_destroy(struct backing_dev_info *bdi)
 
 	bdi_unregister(bdi);
 
+	cleanup_srcu_struct(&bdi->srcu);
+
 	for (i = 0; i < NR_BDI_STAT_ITEMS; i++)
 		percpu_counter_destroy(&bdi->bdi_stat[i]);
 
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread
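
The SRCU scheme introduced here, reduced to its essentials: readers
walk bdi->wb_list under srcu_read_lock(), while an exiting thread
unlinks itself under wb_lock and waits for a grace period before the
bdi_writeback can be freed. A minimal sketch of the two sides, with
the surrounding code elided:

	/* reader side, as in bdi_has_dirty_io() */
	idx = srcu_read_lock(&bdi->srcu);
	list_for_each_entry_rcu(wb, &bdi->wb_list, list)
		if (wb_has_dirty_io(wb))
			break;
	srcu_read_unlock(&bdi->srcu, idx);

	/* writer side, as in bdi_start_fn() at thread exit */
	spin_lock(&bdi->wb_lock);
	list_del_rcu(&wb->list);
	spin_unlock(&bdi->wb_lock);
	synchronize_srcu(&bdi->srcu);	/* wait out all readers */
	bdi_put_wb(bdi, wb);		/* now safe to free */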

* [PATCH 08/11] writeback: allow sleepy exit of default writeback task
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (6 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 09/11] writeback: add some debug inode list counters to bdi stats Jens Axboe
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

Since we lazily create the default writeback task for a bdi, we can
allow it to exit if it has been completely idle for 5 minutes. It will
be recreated on demand if dirty data shows up again.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |   54 ++++++++++++++++++++++++++++++++++--------
 include/linux/backing-dev.h |    5 ++++
 include/linux/writeback.h   |    2 +-
 3 files changed, 49 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f3db578..d1d47c0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -303,10 +303,10 @@ void bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
  * older_than_this takes precedence over nr_to_write.  So we'll only write back
  * all dirty pages if they are all attached to "old" mappings.
  */
-static void wb_kupdated(struct bdi_writeback *wb)
+static long wb_kupdated(struct bdi_writeback *wb)
 {
 	unsigned long oldest_jif;
-	long nr_to_write;
+	long nr_to_write, wrote = 0;
 	struct writeback_control wbc = {
 		.bdi			= wb->bdi,
 		.sync_mode		= WB_SYNC_NONE,
@@ -327,10 +327,13 @@ static void wb_kupdated(struct bdi_writeback *wb)
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		generic_sync_wb_inodes(wb, NULL, &wbc);
+		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		if (wbc.nr_to_write > 0)
 			break;	/* All the old data is written */
 		nr_to_write -= MAX_WRITEBACK_PAGES;
 	}
+
+	return wrote;
 }
 
 static inline bool over_bground_thresh(void)
@@ -343,7 +346,7 @@ static inline bool over_bground_thresh(void)
 		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
 }
 
-static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
+static long __wb_writeback(struct bdi_writeback *wb, long nr_pages,
 			   struct super_block *sb,
 			   enum writeback_sync_modes sync_mode)
 {
@@ -353,6 +356,7 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
 		.older_than_this	= NULL,
 		.range_cyclic		= 1,
 	};
+	long wrote = 0;
 
 	for (;;) {
 		if (sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
@@ -365,6 +369,7 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
 		wbc.pages_skipped = 0;
 		generic_sync_wb_inodes(wb, sb, &wbc);
 		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		wrote += MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		/*
 		 * If we ran out of stuff to write, bail unless more_io got set
 		 */
@@ -374,6 +379,8 @@ static void __wb_writeback(struct bdi_writeback *wb, long nr_pages,
 			break;
 		}
 	}
+
+	return wrote;
 }
 
 /*
@@ -402,10 +409,11 @@ static struct bdi_work *get_next_work_item(struct backing_dev_info *bdi,
 /*
  * Retrieve work items and do the writeback they describe
  */
-static void wb_writeback(struct bdi_writeback *wb)
+static long wb_writeback(struct bdi_writeback *wb)
 {
 	struct backing_dev_info *bdi = wb->bdi;
 	struct bdi_work *work;
+	long wrote = 0;
 
 	while ((work = get_next_work_item(bdi, wb)) != NULL) {
 		struct super_block *sb = bdi_work_sb(work);
@@ -419,7 +427,7 @@ static void wb_writeback(struct bdi_writeback *wb)
 		if (sync_mode == WB_SYNC_NONE)
 			wb_clear_pending(wb, work);
 
-		__wb_writeback(wb, nr_pages, sb, sync_mode);
+		wrote += __wb_writeback(wb, nr_pages, sb, sync_mode);
 
 		/*
 		 * This is a data integrity writeback, so only do the
@@ -428,14 +436,18 @@ static void wb_writeback(struct bdi_writeback *wb)
 		if (sync_mode == WB_SYNC_ALL)
 			wb_clear_pending(wb, work);
 	}
+
+	return wrote;
 }
 
 /*
  * This will be inlined in bdi_writeback_task() once we get rid of any
  * dirty inodes on the default_backing_dev_info
  */
-void wb_do_writeback(struct bdi_writeback *wb)
+long wb_do_writeback(struct bdi_writeback *wb)
 {
+	long wrote;
+
 	/*
 	 * We get here in two cases:
 	 *
@@ -447,9 +459,11 @@ void wb_do_writeback(struct bdi_writeback *wb)
 	 *  items on the work_list. Process those.
 	 */
 	if (list_empty(&wb->bdi->work_list))
-		wb_kupdated(wb);
+		wrote = wb_kupdated(wb);
 	else
-		wb_writeback(wb);
+		wrote = wb_writeback(wb);
+
+	return wrote;
 }
 
 /*
@@ -458,10 +472,28 @@ void wb_do_writeback(struct bdi_writeback *wb)
  */
 int bdi_writeback_task(struct bdi_writeback *wb)
 {
+	unsigned long last_active = jiffies;
+	unsigned long wait_jiffies = -1UL;
+	long pages_written;
+
 	while (!kthread_should_stop()) {
-		unsigned long wait_jiffies;
+		pages_written = wb_do_writeback(wb);
+
+		if (pages_written)
+			last_active = jiffies;
+		else if (wait_jiffies != -1UL) {
+			unsigned long max_idle;
 
-		wb_do_writeback(wb);
+			/*
+			 * Longest period of inactivity that we tolerate. If we
+			 * see dirty data again later, the task will get
+			 * recreated automatically.
+			 */
+			max_idle = max(5UL * 60 * HZ, wait_jiffies);
+			if (time_after(jiffies, max_idle + last_active) &&
+			    wb_is_default_task(wb))
+				break;
+		}
 
 		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
 		set_current_state(TASK_INTERRUPTIBLE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8584438..d55553d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -111,6 +111,11 @@ int bdi_has_dirty_io(struct backing_dev_info *bdi);
 extern struct mutex bdi_lock;
 extern struct list_head bdi_list;
 
+static inline int wb_is_default_task(struct bdi_writeback *wb)
+{
+	return wb == &wb->bdi->wb;
+}
+
 static inline int bdi_wblist_needs_lock(struct backing_dev_info *bdi)
 {
 	return test_bit(BDI_wblist_lock, &bdi->state);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index e414702..30e318b 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -69,7 +69,7 @@ void writeback_inodes(struct writeback_control *wbc);
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
 void sync_inodes(int wait);
-void wb_do_writeback(struct bdi_writeback *wb);
+long wb_do_writeback(struct bdi_writeback *wb);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread
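
The exit condition above is the standard jiffies timeout idiom: record
when work was last done and test it with time_after(). A trimmed sketch
of just that logic, where did_work() is a stand-in for the real
wb_do_writeback() call:

	unsigned long last_active = jiffies;
	unsigned long max_idle = 5UL * 60 * HZ;	/* five minutes */

	while (!kthread_should_stop()) {
		if (did_work())		/* stand-in for wb_do_writeback() */
			last_active = jiffies;
		else if (time_after(jiffies, last_active + max_idle))
			break;	/* idle too long; the forker task will
				   recreate us if dirty data reappears */
		schedule_timeout_interruptible(
			msecs_to_jiffies(dirty_writeback_interval * 10));
	}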

* [PATCH 09/11] writeback: add some debug inode list counters to bdi stats
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (7 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 08/11] writeback: allow sleepy exit of default writeback task Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 10/11] writeback: add name to backing_dev_info Jens Axboe
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

Add some debug entries to be able to inspect the internal state of
the writeback details.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 mm/backing-dev.c |   38 ++++++++++++++++++++++++++++++++++----
 1 files changed, 34 insertions(+), 4 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8980f6f..b981118 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -50,9 +50,29 @@ static void bdi_debug_init(void)
 static int bdi_debug_stats_show(struct seq_file *m, void *v)
 {
 	struct backing_dev_info *bdi = m->private;
+	struct bdi_writeback *wb;
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_wb;
+	struct inode *inode;
+
+	/*
+	 * inode lock is enough here, the bdi->wb_list is protected by
+	 * RCU on the reader side
+	 */
+	nr_wb = nr_dirty = nr_io = nr_more_io = 0;
+	spin_lock(&inode_lock);
+	list_for_each_entry(wb, &bdi->wb_list, list) {
+		nr_wb++;
+		list_for_each_entry(inode, &wb->b_dirty, i_list)
+			nr_dirty++;
+		list_for_each_entry(inode, &wb->b_io, i_list)
+			nr_io++;
+		list_for_each_entry(inode, &wb->b_more_io, i_list)
+			nr_more_io++;
+	}
+	spin_unlock(&inode_lock);
 
 	get_dirty_limits(&background_thresh, &dirty_thresh, &bdi_thresh, bdi);
 
@@ -62,12 +82,22 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
 		   "BdiReclaimable:   %8lu kB\n"
 		   "BdiDirtyThresh:   %8lu kB\n"
 		   "DirtyThresh:      %8lu kB\n"
-		   "BackgroundThresh: %8lu kB\n",
+		   "BackgroundThresh: %8lu kB\n"
+		   "WriteBack threads:%8lu\n"
+		   "b_dirty:          %8lu\n"
+		   "b_io:             %8lu\n"
+		   "b_more_io:        %8lu\n"
+		   "bdi_list:         %8u\n"
+		   "state:            %8lx\n"
+		   "wb_mask:          %8lx\n"
+		   "wb_list:          %8u\n"
+		   "wb_cnt:           %8u\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
 		   (unsigned long) K(bdi_stat(bdi, BDI_RECLAIMABLE)),
-		   K(bdi_thresh),
-		   K(dirty_thresh),
-		   K(background_thresh));
+		   K(bdi_thresh), K(dirty_thresh),
+		   K(background_thresh), nr_wb, nr_dirty, nr_io, nr_more_io,
+		   !list_empty(&bdi->bdi_list), bdi->state, bdi->wb_mask,
+		   !list_empty(&bdi->wb_list), bdi->wb_cnt);
 #undef K
 
 	return 0;
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread
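
With this applied, the extended per-bdi stats (typically exposed at
/sys/kernel/debug/bdi/<dev>/stats, assuming debugfs is mounted) would
read along these lines, with field names as in the patch and values
invented purely for illustration:

	BdiWriteback:            0 kB
	BdiReclaimable:        144 kB
	BdiDirtyThresh:      49428 kB
	DirtyThresh:         98857 kB
	BackgroundThresh:    49428 kB
	WriteBack threads:       1
	b_dirty:                 3
	b_io:                    0
	b_more_io:               0
	bdi_list:                1
	state:                   8
	wb_mask:                 1
	wb_list:                 1
	wb_cnt:                  1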

* [PATCH 10/11] writeback: add name to backing_dev_info
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (8 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 09/11] writeback: add some debug inode list counters to bdi stats Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 11:46 ` [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

This enables us to track who does what and print info. Its main use
is catching dirty inodes on the default_backing_dev_info, so we can
fix that up.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 block/blk-core.c            |    1 +
 drivers/block/aoe/aoeblk.c  |    1 +
 drivers/char/mem.c          |    1 +
 fs/btrfs/disk-io.c          |    1 +
 fs/char_dev.c               |    1 +
 fs/configfs/inode.c         |    1 +
 fs/fuse/inode.c             |    1 +
 fs/hugetlbfs/inode.c        |    1 +
 fs/nfs/client.c             |    1 +
 fs/ocfs2/dlm/dlmfs.c        |    1 +
 fs/ramfs/inode.c            |    1 +
 fs/sysfs/inode.c            |    1 +
 fs/ubifs/super.c            |    1 +
 include/linux/backing-dev.h |    2 ++
 kernel/cgroup.c             |    1 +
 mm/backing-dev.c            |    1 +
 mm/swap_state.c             |    1 +
 17 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index c89883b..d3f18b5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -517,6 +517,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 
 	q->backing_dev_info.unplug_io_fn = blk_backing_dev_unplug;
 	q->backing_dev_info.unplug_io_data = q;
+	q->backing_dev_info.name = "block";
 	err = bdi_init(&q->backing_dev_info);
 	if (err) {
 		kmem_cache_free(blk_requestq_cachep, q);
diff --git a/drivers/block/aoe/aoeblk.c b/drivers/block/aoe/aoeblk.c
index 2307a27..0efb8fc 100644
--- a/drivers/block/aoe/aoeblk.c
+++ b/drivers/block/aoe/aoeblk.c
@@ -265,6 +265,7 @@ aoeblk_gdalloc(void *vp)
 	}
 
 	blk_queue_make_request(&d->blkq, aoeblk_make_request);
+	d->blkq.backing_dev_info.name = "aoe";
 	if (bdi_init(&d->blkq.backing_dev_info))
 		goto err_mempool;
 	spin_lock_irqsave(&d->lock, flags);
diff --git a/drivers/char/mem.c b/drivers/char/mem.c
index 8f05c38..3b38093 100644
--- a/drivers/char/mem.c
+++ b/drivers/char/mem.c
@@ -820,6 +820,7 @@ static const struct file_operations zero_fops = {
  * - permits private mappings, "copies" are taken of the source of zeros
  */
 static struct backing_dev_info zero_bdi = {
+	.name		= "char/mem",
 	.capabilities	= BDI_CAP_MAP_COPY,
 };
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2dc19c9..eff2a82 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1353,6 +1353,7 @@ static int setup_bdi(struct btrfs_fs_info *info, struct backing_dev_info *bdi)
 {
 	int err;
 
+	bdi->name = "btrfs";
 	bdi->capabilities = BDI_CAP_MAP_COPY;
 	err = bdi_init(bdi);
 	if (err)
diff --git a/fs/char_dev.c b/fs/char_dev.c
index 38f7122..350ef9c 100644
--- a/fs/char_dev.c
+++ b/fs/char_dev.c
@@ -32,6 +32,7 @@
  * - no readahead or I/O queue unplugging required
  */
 struct backing_dev_info directly_mappable_cdev_bdi = {
+	.name = "char",
 	.capabilities	= (
 #ifdef CONFIG_MMU
 		/* permit private copies of the data to be taken */
diff --git a/fs/configfs/inode.c b/fs/configfs/inode.c
index 5d349d3..9a266cd 100644
--- a/fs/configfs/inode.c
+++ b/fs/configfs/inode.c
@@ -46,6 +46,7 @@ static const struct address_space_operations configfs_aops = {
 };
 
 static struct backing_dev_info configfs_backing_dev_info = {
+	.name		= "configfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 91f7c85..e5e8b03 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -484,6 +484,7 @@ int fuse_conn_init(struct fuse_conn *fc, struct super_block *sb)
 	INIT_LIST_HEAD(&fc->bg_queue);
 	INIT_LIST_HEAD(&fc->entry);
 	atomic_set(&fc->num_waiting, 0);
+	fc->bdi.name = "fuse";
 	fc->bdi.ra_pages = (VM_MAX_READAHEAD * 1024) / PAGE_CACHE_SIZE;
 	fc->bdi.unplug_io_fn = default_unplug_io_fn;
 	/* fuse does it's own writeback accounting */
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c1462d4..db1e537 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -43,6 +43,7 @@ static const struct inode_operations hugetlbfs_dir_inode_operations;
 static const struct inode_operations hugetlbfs_inode_operations;
 
 static struct backing_dev_info hugetlbfs_backing_dev_info = {
+	.name		= "hugetlbfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index 75c9cd2..3a26d06 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -836,6 +836,7 @@ static void nfs_server_set_fsinfo(struct nfs_server *server, struct nfs_fsinfo *
 		server->rsize = NFS_MAX_FILE_IO_SIZE;
 	server->rpages = (server->rsize + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 
+	server->backing_dev_info.name = "nfs";
 	server->backing_dev_info.ra_pages = server->rpages * NFS_MAX_READAHEAD;
 
 	if (server->wsize > max_rpc_payload)
diff --git a/fs/ocfs2/dlm/dlmfs.c b/fs/ocfs2/dlm/dlmfs.c
index 1c9efb4..02bf178 100644
--- a/fs/ocfs2/dlm/dlmfs.c
+++ b/fs/ocfs2/dlm/dlmfs.c
@@ -325,6 +325,7 @@ clear_fields:
 }
 
 static struct backing_dev_info dlmfs_backing_dev_info = {
+	.name		= "ocfs2-dlmfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/ramfs/inode.c b/fs/ramfs/inode.c
index 3a6b193..5a24199 100644
--- a/fs/ramfs/inode.c
+++ b/fs/ramfs/inode.c
@@ -46,6 +46,7 @@ static const struct super_operations ramfs_ops;
 static const struct inode_operations ramfs_dir_inode_operations;
 
 static struct backing_dev_info ramfs_backing_dev_info = {
+	.name		= "ramfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK |
 			  BDI_CAP_MAP_DIRECT | BDI_CAP_MAP_COPY |
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index 555f0ff..e57f98e 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -29,6 +29,7 @@ static const struct address_space_operations sysfs_aops = {
 };
 
 static struct backing_dev_info sysfs_backing_dev_info = {
+	.name		= "sysfs",
 	.ra_pages	= 0,	/* No readahead */
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index e9f7a75..2349e2c 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1923,6 +1923,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	 *
 	 * Read-ahead will be disabled because @c->bdi.ra_pages is 0.
 	 */
+	c->bdi.name = "ubifs",
 	c->bdi.capabilities = BDI_CAP_MAP_COPY;
 	c->bdi.unplug_io_fn = default_unplug_io_fn;
 	err  = bdi_init(&c->bdi);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index d55553d..653a652 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -69,6 +69,8 @@ struct backing_dev_info {
 	void (*unplug_io_fn)(struct backing_dev_info *, struct page *);
 	void *unplug_io_data;
 
+	char *name;
+
 	struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
 
 	struct prop_local_percpu completions;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index a7267bf..0863c5f 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -598,6 +598,7 @@ static struct inode_operations cgroup_dir_inode_operations;
 static struct file_operations proc_cgroupstats_operations;
 
 static struct backing_dev_info cgroup_backing_dev_info = {
+	.name		= "cgroup",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index b981118..e6991d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -17,6 +17,7 @@ void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
 EXPORT_SYMBOL(default_unplug_io_fn);
 
 struct backing_dev_info default_backing_dev_info = {
+	.name		= "default",
 	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
 	.state		= 0,
 	.capabilities	= BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 3ecea98..323da00 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -34,6 +34,7 @@ static const struct address_space_operations swap_aops = {
 };
 
 static struct backing_dev_info swap_backing_dev_info = {
+	.name		= "swap",
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK | BDI_CAP_SWAP_BACKED,
 	.unplug_io_fn	= swap_unplug_io_fn,
 };
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread

* [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (9 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 10/11] writeback: add name to backing_dev_info Jens Axboe
@ 2009-05-28 11:46 ` Jens Axboe
  2009-05-28 13:56 ` [PATCH 0/11] Per-bdi writeback flusher threads v9 Peter Zijlstra
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 11:46 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, tytso
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

Also a debugging aid. We want to catch dirty inodes being added to
backing devices that don't do writeback.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |    7 +++++++
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |    6 ++++++
 3 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d1d47c0..d6fbfa7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -672,6 +672,13 @@ void __mark_inode_dirty(struct inode *inode, int flags)
 		 */
 		if (!was_dirty) {
 			struct bdi_writeback *wb = inode_get_wb(inode);
+			struct backing_dev_info *bdi = wb->bdi;
+
+			if (bdi_cap_writeback_dirty(bdi) &&
+			    !test_bit(BDI_registered, &bdi->state)) {
+				WARN_ON(1);
+				printk("bdi-%s not registered\n", bdi->name);
+			}
 
 			inode->dirtied_when = jiffies;
 			list_move(&inode->i_list, &wb->b_dirty);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 653a652..2831c81 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -31,6 +31,7 @@ enum bdi_state {
 	BDI_wblist_lock,	/* bdi->wb_list now needs locking */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
+	BDI_registered,		/* bdi_register() was done */
 	BDI_unused,		/* Available bits start here */
 };
 
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index e6991d6..3882ac3 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -553,6 +553,11 @@ static void bdi_add_one_flusher_task(struct backing_dev_info *bdi,
 	if (!bdi_cap_writeback_dirty(bdi))
 		return;
 
+	if (WARN_ON(!test_bit(BDI_registered, &bdi->state))) {
+		printk("bdi %p/%s is not registered!\n", bdi, bdi->name);
+		return;
+	}
+
 	/*
 	 * Check with the helper whether to proceed adding a task. This will
 	 * only abort if two or more simultaneous calls to
@@ -661,6 +666,7 @@ remove_err:
 	}
 
 	bdi_debug_register(bdi, dev_name(dev));
+	set_bit(BDI_registered, &bdi->state);
 exit:
 	return ret;
 }
-- 
1.6.3.rc0.1.gf800


^ permalink raw reply related	[flat|nested] 66+ messages in thread
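
As an aside, the WARN_ON(1) plus printk() pair in __mark_inode_dirty()
could be collapsed into a single WARN(), which takes a format string
and returns the tested condition. A sketch of the equivalent check:

	if (bdi_cap_writeback_dirty(bdi))
		WARN(!test_bit(BDI_registered, &bdi->state),
		     "bdi-%s not registered\n", bdi->name);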

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (10 preceding siblings ...)
  2009-05-28 11:46 ` [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
@ 2009-05-28 13:56 ` Peter Zijlstra
  2009-05-28 22:28   ` Jens Axboe
  2009-05-28 14:17 ` Artem Bityutskiy
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2009-05-28 13:56 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Thu, 2009-05-28 at 13:46 +0200, Jens Axboe wrote:
> - Get rid of the explicit wait queues, we can just use wake_up_process()
>   since it's just for that one task.

Ah, good, should clean up those funny prepare/finish_wait thingies that
looked odd.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-28 11:46 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-05-28 14:13   ` Artem Bityutskiy
  2009-05-28 22:28     ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-28 14:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
> +#define BDI_CAP_FLUSH_FORKER	0x00000200

Would it be possible to please add a comment saying
what this flag is for, and whether it is for internal
use only? It is not immediately obvious to me.

Artem.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (11 preceding siblings ...)
  2009-05-28 13:56 ` [PATCH 0/11] Per-bdi writeback flusher threads v9 Peter Zijlstra
@ 2009-05-28 14:17 ` Artem Bityutskiy
  2009-05-28 14:19   ` Artem Bityutskiy
  2009-05-28 14:41 ` Theodore Tso
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-28 14:17 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
> Here's the 9th version of the writeback patches. Changes since v8:
> 
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>   issue.
> - Get rid of the explicit wait queues, we can just use wake_up_process()
>   since it's just for that one task.
> - Add separate "sync_supers" thread that makes sure that the dirty
>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>   by doing the wake ups from a timer so that it would be easier for you
>   to just deactivate the timer when there are no super blocks.

Thanks. 

I've just tried to test UBIFS with your patches (writeback-v9)
and got lots of these warnings:
------------[ cut here ]------------
WARNING: at fs/fs-writeback.c:679 __mark_inode_dirty+0x1b6/0x212()
Hardware name: HP xw6600 Workstation
Modules linked in: deflate zlib_deflate lzo lzo_decompress lzo_compress ubifs crc16 ubi nandsim nand nand_ids nand_ecc mtd cpufreq_ondemand acpi_cpufreq freq_table iTCO_wdt iTCO_vendor_support tg3 libphy wmi mptsas mptscsih mptbase scsi_transport_sas [last unloaded: microcode]
Pid: 2210, comm: integck Tainted: G        W  2.6.30-rc7-block-2.6 #1
Call Trace:
 [<ffffffff810ecf78>] ? __mark_inode_dirty+0x1b6/0x212
 [<ffffffff8103ffe2>] warn_slowpath_common+0x77/0xa4
 [<ffffffff8104001e>] warn_slowpath_null+0xf/0x11
 [<ffffffff810ecf78>] __mark_inode_dirty+0x1b6/0x212
 [<ffffffff810a4faa>] __set_page_dirty_nobuffers+0xf5/0x105
 [<ffffffffa00c4399>] ubifs_write_end+0x1a9/0x236 [ubifs]
 [<ffffffff8109c7c1>] ? pagefault_enable+0x28/0x33
 [<ffffffff8109cc8f>] ? iov_iter_copy_from_user_atomic+0xfb/0x10a
 [<ffffffff8109e2da>] generic_file_buffered_write+0x18c/0x2d9
 [<ffffffff8109e828>] __generic_file_aio_write_nolock+0x261/0x295
 [<ffffffff8109f09f>] generic_file_aio_write+0x69/0xc5
 [<ffffffffa00c39d6>] ubifs_aio_write+0x14c/0x19e [ubifs]
 [<ffffffff810d1a89>] do_sync_write+0xe7/0x12d
 [<ffffffff812f51c5>] ? __mutex_lock_common+0x36f/0x419
 [<ffffffff812f5218>] ? __mutex_lock_common+0x3c2/0x419
 [<ffffffff81054bd4>] ? autoremove_wake_function+0x0/0x38
 [<ffffffff812f4cae>] ? __mutex_unlock_slowpath+0x10d/0x13c
 [<ffffffff8106211f>] ? trace_hardirqs_on+0xd/0xf
 [<ffffffff812f4ccb>] ? __mutex_unlock_slowpath+0x12a/0x13c
 [<ffffffff811578d0>] ? security_file_permission+0x11/0x13
 [<ffffffff810d24ae>] vfs_write+0xab/0x105
 [<ffffffff810d25cc>] sys_write+0x47/0x70
 [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b
---[ end trace 7205fe43ac3aa184 ]---

And then eventually my test failed. It yells at this code:

if (bdi_cap_writeback_dirty(bdi) &&
    !test_bit(BDI_registered, &bdi->state)) {
	WARN_ON(1);
	printk(KERN_ERR "bdi-%s not registered\n", bdi->name);
}

UBIFS is a flash file-system. It works on top of MTD devices,
not block devices. Well, to be correct, it works on top of
UBI volumes, which sit on top of MTD devices, which represent
raw flash.

UBIFS needs write-back, but it does not need a full BDI
device. So we used to have a fake BDI device. Also, UBIFS
wants to disable read-ahead. We do not need anything else
from the block sub-system.

I guess the reason for the complaint is that UBIFS does
not call 'bdi_register()' or 'bdi_register_dev()'. The
question is - should it? 'bdi_register()' a block device,
but we do not have one.

Suggestions?

Artem.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 14:17 ` Artem Bityutskiy
@ 2009-05-28 14:19   ` Artem Bityutskiy
  2009-05-28 20:35     ` Peter Zijlstra
  0 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-28 14:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Artem Bityutskiy wrote:
> question is - should it? 'bdi_register()' a block device,
> but we do not have one.

Sorry, wanted to say: 'bdi_register()' registers a block
device.

Artem.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (12 preceding siblings ...)
  2009-05-28 14:17 ` Artem Bityutskiy
@ 2009-05-28 14:41 ` Theodore Tso
  2009-05-29 16:07 ` Artem Bityutskiy
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 66+ messages in thread
From: Theodore Tso @ 2009-05-28 14:41 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
	yanmin_zhang, richard, damien.wyart

On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> Hi,
> 
> Here's the 9th version of the writeback patches. Changes since v8:
> 
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>   issue.

It appears to have fixed the soft lockup hang when running fsstress,
thanks!!

					- Ted


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 14:19   ` Artem Bityutskiy
@ 2009-05-28 20:35     ` Peter Zijlstra
  2009-05-28 22:27       ` Jens Axboe
  2009-05-29 15:37       ` Artem Bityutskiy
  0 siblings, 2 replies; 66+ messages in thread
From: Peter Zijlstra @ 2009-05-28 20:35 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Thu, 2009-05-28 at 17:19 +0300, Artem Bityutskiy wrote:
> Artem Bityutskiy wrote:
> > question is - should it? 'bdi_register()' a block device,
> > but we do not have one.
> 
> Sorry, wanted to say: 'bdi_register()' registers a block
> device.

BDI stands for backing device info and is not related to block devices
other than that block devices can also be (ok, always are) backing
devices.

If you want to do writeback, you need a backing device to write to. The
BDI is the device abstraction for writeback.
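
For a filesystem without a block device that amounts to little more
than embedding a backing_dev_info, initialising it and registering it
under some name. Untested sketch, names purely illustrative:

	static struct backing_dev_info myfs_bdi = {
		.name		= "myfs",
		.ra_pages	= 0,	/* no readahead */
		.capabilities	= BDI_CAP_MAP_COPY,
	};

	err = bdi_init(&myfs_bdi);
	if (!err)
		err = bdi_register(&myfs_bdi, NULL, "myfs");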




^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 20:35     ` Peter Zijlstra
@ 2009-05-28 22:27       ` Jens Axboe
  2009-05-29 15:37       ` Artem Bityutskiy
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 22:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Artem Bityutskiy, linux-kernel, linux-fsdevel, tytso,
	chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart

On Thu, May 28 2009, Peter Zijlstra wrote:
> On Thu, 2009-05-28 at 17:19 +0300, Artem Bityutskiy wrote:
> > Artem Bityutskiy wrote:
> > > question is - should it? 'bdi_register()' a block device,
> > > but we do not have one.
> > 
> > Sorry, wanted to say: 'bdi_register()' registers a block
> > device.
> 
> BDI stands for backing device info and is not related to block devices
> other than that block devices can also be (ok, always are) backing
> devices.
> 
> If you want to do writeback, you need a backing device to write to. The
> BDI is the device abstraction for writeback.

Precisely. Apparently ubifs doesn't register its backing device. I fixed
a similar issue in btrfs, I'll do an audit of the file systems and fix
that up.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing  data
  2009-05-28 14:13   ` Artem Bityutskiy
@ 2009-05-28 22:28     ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 22:28 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Thu, May 28 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> +#define BDI_CAP_FLUSH_FORKER	0x00000200
>
> Would it be possible to please add a comment saying
> what this flag is for, and whether it is for internal
> use only? It is not immediately obvious to me.

It's internal, probably I should just replace it with a check for
&default_backing_dev_info. If not, I'll add a comment.
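
Roughly (untested):

	static inline int bdi_cap_flush_forker(struct backing_dev_info *bdi)
	{
		return bdi == &default_backing_dev_info;
	}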

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 13:56 ` [PATCH 0/11] Per-bdi writeback flusher threads v9 Peter Zijlstra
@ 2009-05-28 22:28   ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-28 22:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Thu, May 28 2009, Peter Zijlstra wrote:
> On Thu, 2009-05-28 at 13:46 +0200, Jens Axboe wrote:
> > - Get rid of the explicit wait queues, we can just use wake_up_process()
> >   since it's just for that one task.
> 
> Ah, good, should clean up those funny prepare/finish_wait thingies that
> looked odd.

Precisely, they are gone now :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 20:35     ` Peter Zijlstra
  2009-05-28 22:27       ` Jens Axboe
@ 2009-05-29 15:37       ` Artem Bityutskiy
  2009-05-29 15:50         ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-29 15:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

Peter Zijlstra wrote:
> On Thu, 2009-05-28 at 17:19 +0300, Artem Bityutskiy wrote:
>> Artem Bityutskiy wrote:
>>> question is - should it? 'bdi_register()' a block device,
>>> but we do not have one.
>> Sorry, wanted to say: 'bdi_register()' registers a block
>> device.
> 
> BDI stands for backing device info and is not related to block devices
> other than that block devices can also be (ok, always are) backing
> devices.
> 
> If you want to do writeback, you need a backing device to write to. The
> BDI is the device abstraction for writeback.

I see, thanks. The below UBIFS patch fixes the issue. I'll
push it to the ubifs-2.6.git tree, unless there are objections.

From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Subject: [PATCH] UBIFS: do not forget to register BDI device

Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
---
 fs/ubifs/super.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 2349e2c..d1ac967 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	err  = bdi_init(&c->bdi);
 	if (err)
 		goto out_close;
+	err = bdi_register(&c->bdi, NULL, "ubifs");
+	if (err)
+		goto out_close;
 
 	err = ubifs_parse_options(c, data, 0);
 	if (err)
-- 
1.6.0.6

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 15:37       ` Artem Bityutskiy
@ 2009-05-29 15:50         ` Jens Axboe
  2009-05-29 16:02           ` Artem Bityutskiy
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-05-29 15:50 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Fri, May 29 2009, Artem Bityutskiy wrote:
> Peter Zijlstra wrote:
>> On Thu, 2009-05-28 at 17:19 +0300, Artem Bityutskiy wrote:
>>> Artem Bityutskiy wrote:
>>>> question is - should it? 'bdi_register()' a block device,
>>>> but we do not have one.
>>> Sorry, wanted to say: 'bdi_register()' registers a block
>>> device.
>>
>> BDI stands for backing device info and is not related to block devices
>> other than that block devices can also be (ok, always are) backing
>> devices.
>>
>> If you want to do writeback, you need a backing device to write to. The
>> BDI is the device abstraction for writeback.
>
> I see, thanks. The below UBIFS patch fixes the issue. I'll
> push it to the ubifs-2.6.git tree, unless there are objections.
>
> From: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> Subject: [PATCH] UBIFS: do not forget to register BDI device
>
> Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
> ---
> fs/ubifs/super.c |    3 +++
> 1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 2349e2c..d1ac967 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
> 	err  = bdi_init(&c->bdi);
> 	if (err)
> 		goto out_close;
> +	err = bdi_register(&c->bdi, NULL, "ubifs");
> +	if (err)
> +		goto out_close;

Not quite right, you need to call bdi_destroy() if you have done the
init.
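
Roughly like this (sketch), so that the error path unwinds the init:

	err  = bdi_init(&c->bdi);
	if (err)
		goto out_close;
	err = bdi_register(&c->bdi, NULL, "ubifs");
	if (err)
		goto out_bdi;
	...
out_bdi:
	bdi_destroy(&c->bdi);
out_close:
	...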

I committed this one this morning:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=570a2fe1df85741988ad0ca22aa406744436e281

But feel free to commit/submit to the ubifs tree directly, then it'll
disappear from my tree once it is merged.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 15:50         ` Jens Axboe
@ 2009-05-29 16:02           ` Artem Bityutskiy
  2009-05-29 17:07             ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-29 16:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
>> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
>> index 2349e2c..d1ac967 100644
>> --- a/fs/ubifs/super.c
>> +++ b/fs/ubifs/super.c
>> @@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
>> 	err  = bdi_init(&c->bdi);
>> 	if (err)
>> 		goto out_close;
>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
>> +	if (err)
>> +		goto out_close;
> 
> Not quite right, you need to call bdi_destroy() if you have done the
> init.

Right, bdi_destroy() has already been there for a long time.
I'm confused.

> I committed this one this morning:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=570a2fe1df85741988ad0ca22aa406744436e281

Hmm, it is the same as my patch, but you do
+       err = bdi_register(&c->bdi);
while I do
+	err = bdi_register(&c->bdi, NULL, "ubifs");

> But feel free to commit/submit to the ubifs tree directly, then it'll
> disappear from my tree once it is merged.

Yeah, I think it can go via my tree. I'd merge it in
the 2.6.31 window. This change does not depend on your
work anyway.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (13 preceding siblings ...)
  2009-05-28 14:41 ` Theodore Tso
@ 2009-05-29 16:07 ` Artem Bityutskiy
  2009-05-29 16:20   ` Artem Bityutskiy
  2009-05-29 17:08   ` Jens Axboe
  2009-06-03 11:12 ` Artem Bityutskiy
  2009-06-04 15:20 ` Frederic Weisbecker
  16 siblings, 2 replies; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-29 16:07 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
> Hi,
> 
> Here's the 9th version of the writeback patches. Changes since v8:
> 
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>   issue.
> - Get rid of the explicit wait queues, we can just use wake_up_process()
>   since it's just for that one task.
> - Add separate "sync_supers" thread that makes sure that the dirty
>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>   by doing the wake ups from a timer so that it would be easier for you
>   to just deactivate the timer when there are no super blocks.
> 
> For ease of patching, I've put the full diff here:
> 
>   http://kernel.dk/writeback-v9.patch
> 
> and also stored this in a writeback-v9 branch that will not change,
> you can pull that into Linus tree from here:
> 
>   git://git.kernel.dk/linux-2.6-block.git writeback-v9

I'm working with the above branch and got the following twice.
I'm not sure what triggers it; it probably happens when I am
doing nothing and cpufreq starts doing its magic.

And I'm not sure it has anything to do with your changes; it
is just that I have only seen this with your tree. Please
ignore if this is not relevant.

=======================================================
[ INFO: possible circular locking dependency detected ]
2.6.30-rc7-block-2.6 #1
-------------------------------------------------------
K99cpuspeed/9923 is trying to acquire lock:
 (&(&dbs_info->work)->work){+.+...}, at: [<ffffffff81051155>] __cancel_work_timer+0xd9/0x21d

but task is already holding lock:
 (dbs_mutex){+.+.+.}, at: [<ffffffffa0073aa8>] cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]

which lock already depends on the new lock.


the existing dependency chain (in reverse order) is:

-> #2 (dbs_mutex){+.+.+.}:
       [<ffffffff81063529>] __lock_acquire+0xa63/0xbeb
       [<ffffffff8106379f>] lock_acquire+0xee/0x112   
       [<ffffffff812f4eb0>] __mutex_lock_common+0x5a/0x419
       [<ffffffff812f5309>] mutex_lock_nested+0x30/0x35   
       [<ffffffffa00738f2>] cpufreq_governor_dbs+0x86/0x2cc [cpufreq_ondemand]
       [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2                      
       [<ffffffff8125ecae>] __cpufreq_set_policy+0x195/0x211                  
       [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223                
       [<ffffffff8126038f>] store+0x5f/0x83                                   
       [<ffffffff81125107>] sysfs_write_file+0xe4/0x119                       
       [<ffffffff810d24ae>] vfs_write+0xab/0x105                              
       [<ffffffff810d25cc>] sys_write+0x47/0x70                               
       [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b                    
       [<ffffffffffffffff>] 0xffffffffffffffff                                

-> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
       [<ffffffff81063529>] __lock_acquire+0xa63/0xbeb
       [<ffffffff8106379f>] lock_acquire+0xee/0x112   
       [<ffffffff812f5561>] down_write+0x3d/0x49      
       [<ffffffff8125fc31>] lock_policy_rwsem_write+0x48/0x78
       [<ffffffffa007364c>] do_dbs_timer+0x5f/0x27f [cpufreq_ondemand]
       [<ffffffff81050869>] worker_thread+0x24b/0x367                 
       [<ffffffff810547c1>] kthread+0x56/0x83                         
       [<ffffffff8100cd3a>] child_rip+0xa/0x20                        
       [<ffffffffffffffff>] 0xffffffffffffffff                        

-> #0 (&(&dbs_info->work)->work){+.+...}:
       [<ffffffff8106341d>] __lock_acquire+0x957/0xbeb
       [<ffffffff8106379f>] lock_acquire+0xee/0x112   
       [<ffffffff81051189>] __cancel_work_timer+0x10d/0x21d
       [<ffffffff810512a6>] cancel_delayed_work_sync+0xd/0xf
       [<ffffffffa0073abb>] cpufreq_governor_dbs+0x24f/0x2cc [cpufreq_ondemand]
       [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2                       
       [<ffffffff8125ec98>] __cpufreq_set_policy+0x17f/0x211                   
       [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223                 
       [<ffffffff8126038f>] store+0x5f/0x83                                    
       [<ffffffff81125107>] sysfs_write_file+0xe4/0x119                        
       [<ffffffff810d24ae>] vfs_write+0xab/0x105                               
       [<ffffffff810d25cc>] sys_write+0x47/0x70                                
       [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b                     
       [<ffffffffffffffff>] 0xffffffffffffffff                                 

other info that might help us debug this:

3 locks held by K99cpuspeed/9923:
 #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff8112505b>] sysfs_write_file+0x38/0x119
 #1:  (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8125fc31>] lock_policy_rwsem_write+0x48/0x78
 #2:  (dbs_mutex){+.+.+.}, at: [<ffffffffa0073aa8>] cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]

stack backtrace:
Pid: 9923, comm: K99cpuspeed Not tainted 2.6.30-rc7-block-2.6 #1
Call Trace:
 [<ffffffff81062750>] print_circular_bug_tail+0x71/0x7c
 [<ffffffff8106341d>] __lock_acquire+0x957/0xbeb
 [<ffffffff8106379f>] lock_acquire+0xee/0x112
 [<ffffffff81051155>] ? __cancel_work_timer+0xd9/0x21d
 [<ffffffff81051189>] __cancel_work_timer+0x10d/0x21d
 [<ffffffff81051155>] ? __cancel_work_timer+0xd9/0x21d
 [<ffffffff812f5218>] ? __mutex_lock_common+0x3c2/0x419
 [<ffffffffa0073aa8>] ? cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
 [<ffffffff81061e66>] ? mark_held_locks+0x4d/0x6b
 [<ffffffffa0073aa8>] ? cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
 [<ffffffff810512a6>] cancel_delayed_work_sync+0xd/0xf
 [<ffffffffa0073abb>] cpufreq_governor_dbs+0x24f/0x2cc [cpufreq_ondemand]
 [<ffffffff810580f1>] ? up_read+0x26/0x2b
 [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2
 [<ffffffff8125ec98>] __cpufreq_set_policy+0x17f/0x211
 [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223
 [<ffffffff812604dc>] ? handle_update+0x0/0x33
 [<ffffffff812f5569>] ? down_write+0x45/0x49
 [<ffffffff8126038f>] store+0x5f/0x83
 [<ffffffff81125107>] sysfs_write_file+0xe4/0x119
 [<ffffffff810d24ae>] vfs_write+0xab/0x105
 [<ffffffff810d25cc>] sys_write+0x47/0x70
 [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b
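
For reference, the cycle lockdep is describing above boils down to a
classic inversion: one path holds a mutex while synchronously waiting
for a work item to finish, while the work item itself depends on that
same mutex (here transitively, via the cpu_policy_rwsem chain). Below
is a minimal, hypothetical sketch of the same shape, collapsed to a
single lock; the names are illustrative and not the real
cpufreq_ondemand code.

  #include <linux/mutex.h>
  #include <linux/workqueue.h>

  static DEFINE_MUTEX(my_mutex);          /* stands in for dbs_mutex */
  static struct delayed_work my_work;     /* stands in for dbs_info->work */

  /* Like do_dbs_timer(): the work function ends up taking the mutex. */
  static void my_work_fn(struct work_struct *work)
  {
          mutex_lock(&my_mutex);
          /* ... periodic sampling ... */
          mutex_unlock(&my_mutex);
  }

  static void my_start(void)
  {
          INIT_DELAYED_WORK(&my_work, my_work_fn);
          schedule_delayed_work(&my_work, HZ);
  }

  /* Like the governor stop path: cancels the work with the mutex held. */
  static void my_stop(void)
  {
          mutex_lock(&my_mutex);
          /*
           * cancel_delayed_work_sync() waits for a running my_work_fn()
           * to complete.  If that instance is blocked on my_mutex, both
           * sides wait forever, which is the cycle in the report above.
           */
          cancel_delayed_work_sync(&my_work);
          mutex_unlock(&my_mutex);
  }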

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 16:07 ` Artem Bityutskiy
@ 2009-05-29 16:20   ` Artem Bityutskiy
  2009-05-29 17:09     ` Jens Axboe
  2009-05-29 17:08   ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-05-29 16:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> Hi,
>>
>> Here's the 9th version of the writeback patches. Changes since v8:
>>
>> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>>   issue.
>> - Get rid of the explicit wait queues, we can just use wake_up_process()
>>   since it's just for that one task.
>> - Add separate "sync_supers" thread that makes sure that the dirty
>>   super blocks get written. We cannot safely do this from 
>> bdi_forker_task(),
>>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>>   by doing the wake ups from a timer so that it would be easier for you
>>   to just deactivate the timer when there are no super blocks.
>>
>> For ease of patching, I've put the full diff here:
>>
>>   http://kernel.dk/writeback-v9.patch
>>
>> and also stored this in a writeback-v9 branch that will not change,
>> you can pull that into Linus tree from here:
>>
>>   git://git.kernel.dk/linux-2.6-block.git writeback-v9
> 
> I'm working with the above branch. Got the following twice.
> Not sure what triggers this, probably if I do nothing and
> cpufreq starts doing its magic, this is triggered.
> 
> And I'm not sure whether it has anything to do with your changes;
> it is just that I saw this only with your tree. Please
> ignore this if it is not relevant.

Sorry, probably I shouldn't have reported this before looking
closer. I'll investigate this later and find out whether it
is related to your work or not. Sorry for the too-early and
probably false alarm.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 16:02           ` Artem Bityutskiy
@ 2009-05-29 17:07             ` Jens Axboe
  2009-06-03  7:39               ` Artem Bityutskiy
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-05-29 17:07 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Fri, May 29 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>>> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
>>> index 2349e2c..d1ac967 100644
>>> --- a/fs/ubifs/super.c
>>> +++ b/fs/ubifs/super.c
>>> @@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
>>> 	err  = bdi_init(&c->bdi);
>>> 	if (err)
>>> 		goto out_close;
>>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
>>> +	if (err)
>>> +		goto out_close;
>>
>> Not quite right, you need to call bdi_destroy() if you have done the
>> init.
>
> Right, bdi_destroy() has already been there for a long time.
> I'm confused.
>
>> I committed this one this morning:
>>
>> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=570a2fe1df85741988ad0ca22aa406744436e281
>
> Hmm, it is the same as my patch, but you do
> +       err = bdi_register(&c->bdi);
> while I do
> +	err = bdi_register(&c->bdi, NULL, "ubifs");

Oops, that's my bad. If you combine the two, we should have a working
patch :-)

>> But feel free to commit/submit to the ubifs tree directly, then it'll
>> disappear from my tree once it is merged.
>
> Yeah, I think it can go via my tree. I'd merge it in the
> 2.6.31 window. This change does not depend on your
> work anyway.

Right, I'll just carry the fixup patches meanwhile as well, but won't
upstream them.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 16:07 ` Artem Bityutskiy
  2009-05-29 16:20   ` Artem Bityutskiy
@ 2009-05-29 17:08   ` Jens Axboe
  1 sibling, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-29 17:08 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Fri, May 29 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> Hi,
>>
>> Here's the 9th version of the writeback patches. Changes since v8:
>>
>> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>>   issue.
>> - Get rid of the explicit wait queues, we can just use wake_up_process()
>>   since it's just for that one task.
>> - Add separate "sync_supers" thread that makes sure that the dirty
>>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>>   by doing the wake ups from a timer so that it would be easier for you
>>   to just deactivate the timer when there are no super blocks.
>>
>> For ease of patching, I've put the full diff here:
>>
>>   http://kernel.dk/writeback-v9.patch
>>
>> and also stored this in a writeback-v9 branch that will not change,
>> you can pull that into Linus tree from here:
>>
>>   git://git.kernel.dk/linux-2.6-block.git writeback-v9
>
> I'm working with the above branch. Got the following twice.
> Not sure what triggers this, probably if I do nothing and
> cpufreq starts doing its magic, this is triggered.
>
> And I'm not sure whether it has anything to do with your changes;
> it is just that I saw this only with your tree. Please
> ignore this if it is not relevant.

OK, doesn't look related, but if it only triggers with the writeback
patches, something fishy is going on. I'll check up on it.

>
> =======================================================
> scaling: [ INFO: possible circular locking dependency detected ]
> 2.6.30-rc7-block-2.6 #1                                           
> -------------------------------------------------------           
> K99cpuspeed/9923 is trying to acquire lock:                       
>  (&(&dbs_info->work)->work){+.+...}, at: [<ffffffff81051155>] __cancel_work_timer+0xd9/0x21d
>
> but task is already holding lock:
>  (dbs_mutex){+.+.+.}, at: [<ffffffffa0073aa8>] cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
>
> which lock already depends on the new lock.
>
>
> the existing dependency chain (in reverse order) is:
>
> -> #2 (dbs_mutex){+.+.+.}:
>        [<ffffffff81063529>] __lock_acquire+0xa63/0xbeb
>        [<ffffffff8106379f>] lock_acquire+0xee/0x112
>        [<ffffffff812f4eb0>] __mutex_lock_common+0x5a/0x419
>        [<ffffffff812f5309>] mutex_lock_nested+0x30/0x35
>        [<ffffffffa00738f2>] cpufreq_governor_dbs+0x86/0x2cc [cpufreq_ondemand]
>        [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2
>        [<ffffffff8125ecae>] __cpufreq_set_policy+0x195/0x211
>        [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223
>        [<ffffffff8126038f>] store+0x5f/0x83
>        [<ffffffff81125107>] sysfs_write_file+0xe4/0x119
>        [<ffffffff810d24ae>] vfs_write+0xab/0x105
>        [<ffffffff810d25cc>] sys_write+0x47/0x70
>        [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b
>        [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #1 (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}:
>        [<ffffffff81063529>] __lock_acquire+0xa63/0xbeb
>        [<ffffffff8106379f>] lock_acquire+0xee/0x112
>        [<ffffffff812f5561>] down_write+0x3d/0x49
>        [<ffffffff8125fc31>] lock_policy_rwsem_write+0x48/0x78
>        [<ffffffffa007364c>] do_dbs_timer+0x5f/0x27f [cpufreq_ondemand]
>        [<ffffffff81050869>] worker_thread+0x24b/0x367
>        [<ffffffff810547c1>] kthread+0x56/0x83
>        [<ffffffff8100cd3a>] child_rip+0xa/0x20
>        [<ffffffffffffffff>] 0xffffffffffffffff
>
> -> #0 (&(&dbs_info->work)->work){+.+...}:
>        [<ffffffff8106341d>] __lock_acquire+0x957/0xbeb
>        [<ffffffff8106379f>] lock_acquire+0xee/0x112
>        [<ffffffff81051189>] __cancel_work_timer+0x10d/0x21d
>        [<ffffffff810512a6>] cancel_delayed_work_sync+0xd/0xf
>        [<ffffffffa0073abb>] cpufreq_governor_dbs+0x24f/0x2cc [cpufreq_ondemand]
>        [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2
>        [<ffffffff8125ec98>] __cpufreq_set_policy+0x17f/0x211
>        [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223
>        [<ffffffff8126038f>] store+0x5f/0x83
>        [<ffffffff81125107>] sysfs_write_file+0xe4/0x119
>        [<ffffffff810d24ae>] vfs_write+0xab/0x105
>        [<ffffffff810d25cc>] sys_write+0x47/0x70
>        [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b
>        [<ffffffffffffffff>] 0xffffffffffffffff
>
> other info that might help us debug this:
>
> 3 locks held by K99cpuspeed/9923:
> #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff8112505b>] sysfs_write_file+0x38/0x119
> #1:  (&per_cpu(cpu_policy_rwsem, cpu)){+++++.}, at: [<ffffffff8125fc31>] lock_policy_rwsem_write+0x48/0x78
> #2:  (dbs_mutex){+.+.+.}, at: [<ffffffffa0073aa8>] cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
>
> stack backtrace:
> Pid: 9923, comm: K99cpuspeed Not tainted 2.6.30-rc7-block-2.6 #1
> Call Trace:
> [<ffffffff81062750>] print_circular_bug_tail+0x71/0x7c
> [<ffffffff8106341d>] __lock_acquire+0x957/0xbeb
> [<ffffffff8106379f>] lock_acquire+0xee/0x112
> [<ffffffff81051155>] ? __cancel_work_timer+0xd9/0x21d
> [<ffffffff81051189>] __cancel_work_timer+0x10d/0x21d
> [<ffffffff81051155>] ? __cancel_work_timer+0xd9/0x21d
> [<ffffffff812f5218>] ? __mutex_lock_common+0x3c2/0x419
> [<ffffffffa0073aa8>] ? cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
> [<ffffffff81061e66>] ? mark_held_locks+0x4d/0x6b
> [<ffffffffa0073aa8>] ? cpufreq_governor_dbs+0x23c/0x2cc [cpufreq_ondemand]
> [<ffffffff810512a6>] cancel_delayed_work_sync+0xd/0xf
> [<ffffffffa0073abb>] cpufreq_governor_dbs+0x24f/0x2cc [cpufreq_ondemand]
> [<ffffffff810580f1>] ? up_read+0x26/0x2b
> [<ffffffff8125eaa4>] __cpufreq_governor+0x84/0xc2
> [<ffffffff8125ec98>] __cpufreq_set_policy+0x17f/0x211
> [<ffffffff8125f6fb>] store_scaling_governor+0x1e7/0x223
> [<ffffffff812604dc>] ? handle_update+0x0/0x33
> [<ffffffff812f5569>] ? down_write+0x45/0x49
> [<ffffffff8126038f>] store+0x5f/0x83
> [<ffffffff81125107>] sysfs_write_file+0xe4/0x119
> [<ffffffff810d24ae>] vfs_write+0xab/0x105
> [<ffffffff810d25cc>] sys_write+0x47/0x70
> [<ffffffff8100bc2b>] system_call_fastpath+0x16/0x1b
>

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 16:20   ` Artem Bityutskiy
@ 2009-05-29 17:09     ` Jens Axboe
  2009-06-03  8:11       ` Artem Bityutskiy
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-05-29 17:09 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Fri, May 29 2009, Artem Bityutskiy wrote:
> Artem Bityutskiy wrote:
>> Jens Axboe wrote:
>>> Hi,
>>>
>>> Here's the 9th version of the writeback patches. Changes since v8:
>>>
>>> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>>>   issue.
>>> - Get rid of the explicit wait queues, we can just use wake_up_process()
>>>   since it's just for that one task.
>>> - Add separate "sync_supers" thread that makes sure that the dirty
>>>   super blocks get written. We cannot safely do this from
>>> bdi_forker_task(),
>>>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>>>   by doing the wake ups from a timer so that it would be easier for you
>>>   to just deactivate the timer when there are no super blocks.
>>>
>>> For ease of patching, I've put the full diff here:
>>>
>>>   http://kernel.dk/writeback-v9.patch
>>>
>>> and also stored this in a writeback-v9 branch that will not change,
>>> you can pull that into Linus tree from here:
>>>
>>>   git://git.kernel.dk/linux-2.6-block.git writeback-v9
>>
>> I'm working with the above branch. Got the following twice.
>> Not sure what triggers this, probably if I do nothing and
>> cpufreq starts doing its magic, this is triggered.
>>
>> And I'm not sure whether it has anything to do with your changes;
>> it is just that I saw this only with your tree. Please
>> ignore this if it is not relevant.
>
> Sorry, probably I shouldn't have reported this before looking
> closer. I'll investigate this later and find out whether it
> is related to your work or not. Sorry for the too-early and
> probably false alarm.

No problem. If it does turn out to have some relation to the writeback
stuff, let me know.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 17:07             ` Jens Axboe
@ 2009-06-03  7:39               ` Artem Bityutskiy
  2009-06-03  7:44                 ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03  7:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
> On Fri, May 29 2009, Artem Bityutskiy wrote:
>> Jens Axboe wrote:
>>>> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
>>>> index 2349e2c..d1ac967 100644
>>>> --- a/fs/ubifs/super.c
>>>> +++ b/fs/ubifs/super.c
>>>> @@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
>>>> 	err  = bdi_init(&c->bdi);
>>>> 	if (err)
>>>> 		goto out_close;
>>>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
>>>> +	if (err)
>>>> +		goto out_close;
>>> Not quite right, you need to call bdi_destroy() if you have done the
>>> init.
>> Right, bdi_destroy() has already been there for a long time.
>> I'm confused.
>>
>>> I committed this one this morning:
>>>
>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=570a2fe1df85741988ad0ca22aa406744436e281
>> Hmm, it is the same as my patch, but you do
>> +       err = bdi_register(&c->bdi);
>> while I do
>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
> 
> Oops, that's my bad. If you combine the two, we should have a working
> patch :-)
> 
>>> But feel free to commit/submit to the ubifs tree directly, then it'll
>>> disappear from my tree once it is merged.
>> Yeah, I think it can go via my tree. I'd merge it in the
>> 2.6.31 window. This change does not depend on your
>> work anyway.
> 
> Right, I'll just carry the fixup patches meanwhile as well, but won't
> upstream them.

Just to make sure I understood you correctly: I assume my original
patch is fine (because there is bdi_destroy()) and will merge it
into the ubifs tree.


-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:39               ` Artem Bityutskiy
@ 2009-06-03  7:44                 ` Jens Axboe
  2009-06-03  7:46                   ` Artem Bityutskiy
  2009-06-03  7:59                   ` Artem Bityutskiy
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-03  7:44 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Wed, Jun 03 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> On Fri, May 29 2009, Artem Bityutskiy wrote:
>>> Jens Axboe wrote:
>>>>> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
>>>>> index 2349e2c..d1ac967 100644
>>>>> --- a/fs/ubifs/super.c
>>>>> +++ b/fs/ubifs/super.c
>>>>> @@ -1929,6 +1929,9 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
>>>>> 	err  = bdi_init(&c->bdi);
>>>>> 	if (err)
>>>>> 		goto out_close;
>>>>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
>>>>> +	if (err)
>>>>> +		goto out_close;
>>>> Not quite right, you need to call bdi_destroy() if you have done the
>>>> init.
>>> Right, bdi_destroy() has already been there for a long time.
>>> I'm confused.
>>>
>>>> I committed this one this morning:
>>>>
>>>> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=570a2fe1df85741988ad0ca22aa406744436e281
>>> Hmm, it is the same as my patch, but you do
>>> +       err = bdi_register(&c->bdi);
>>> while I do
>>> +	err = bdi_register(&c->bdi, NULL, "ubifs");
>>
>> Oops, that's my bad. If you combine the two, we should have a working
>> patch :-)
>>
>>>> But feel free to commit/submit to the ubifs tree directly, then it'll
>>>> disappear from my tree once it is merged.
>>> Yeah, I think it can go via my tree. I'd merge it in the
>>> 2.6.31 window. This change does not depend on your
>>> work anyway.
>>
>> Right, I'll just carry the fixup patches meanwhile as well, but won't
>> upstream them.
>
> Just to make sure I understood you correctly: I assume my original
> patch is fine (because there is bdi_destroy()) and will merge it
> into the ubifs tree.

It needs to be:

        err = bdi_register(&c->bdi, NULL, "ubifs");
        if (err)
                goto out_bdi;

so you hit the bdi_destroy() for that failure, not goto out_close;
Otherwise it was fine.
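
Spelled out, the resulting error path would look something like the
sketch below (a minimal reconstruction based on the hunks quoted in
this thread; the exact placement of the out_bdi/out_close labels in
ubifs_fill_super() is an assumption):

        err = bdi_init(&c->bdi);
        if (err)
                goto out_close;

        err = bdi_register(&c->bdi, NULL, "ubifs");
        if (err)
                goto out_bdi;

        /* ... rest of ubifs_fill_super() ... */

out_bdi:
        bdi_destroy(&c->bdi);   /* undoes bdi_init(), and any registration */
out_close:
        /* ... pre-existing cleanup ... */

That way a register failure falls through bdi_destroy() into the
existing cleanup, while an init failure skips bdi_destroy() entirely.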

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:44                 ` Jens Axboe
@ 2009-06-03  7:46                   ` Artem Bityutskiy
  2009-06-03  7:50                     ` Jens Axboe
  2009-06-03  7:59                   ` Artem Bityutskiy
  1 sibling, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03  7:46 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Artem Bityutskiy, Peter Zijlstra, linux-kernel, linux-fsdevel,
	tytso, chris.mason, david, hch, akpm, jack, yanmin_zhang,
	richard, damien.wyart

Jens Axboe wrote:
>> Just to make sure I understood you correctly: I assume my original
>> patch is fine (because there is bdi_destroy()) and will merge it
>> into the ubifs tree.
> 
> It needs to be:
> 
>         err = bdi_register(&c->bdi, NULL, "ubifs");
>         if (err)
>                 goto out_bdi;
> 
> so you hit the bdi_destroy() for that failure, not goto out_close;
> Otherwise it was fine.

Ah, I see. Rather atypical convention, though. I expected
bdi_register() to clean up after itself in case of failure.
Wouldn't that be a better interface?

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:46                   ` Artem Bityutskiy
@ 2009-06-03  7:50                     ` Jens Axboe
  2009-06-03  7:54                       ` Artem Bityutskiy
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-06-03  7:50 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Wed, Jun 03 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>>> Just to make sure I understood you correctly: I assume my original
>>> patch is fine (because there is bdi_destroy()) and will merge it
>>> into the ubifs tree.
>>
>> It needs to be:
>>
>>         err = bdi_register(&c->bdi, NULL, "ubifs");
>>         if (err)
>>                 goto out_bdi;
>>
>> so you hit the bdi_destroy() for that failure, not goto out_close;
>> Otherwise it was fine.
>
> Ah, I see. Rather atypical convention, though. I expected
> bdi_register() to clean up after itself in case of failure.
> Wouldn't that be a better interface?

You already did a bdi_init() at that point. bdi_destroy() must be used
to clean up after both bdi_init() and/or bdi_register().
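
Put differently, the pairing is asymmetric: bdi_init() needs no undo of
its own, and a single bdi_destroy() tears down whatever bdi_init() and
bdi_register() set up between them. A minimal sketch of that contract;
example_setup(), example_teardown() and the "example" name are
hypothetical, not an existing API:

        #include <linux/backing-dev.h>

        static int example_setup(struct backing_dev_info *bdi)
        {
                int err;

                err = bdi_init(bdi);
                if (err)
                        return err;             /* nothing to undo yet */

                err = bdi_register(bdi, NULL, "example");
                if (err)
                        bdi_destroy(bdi);       /* cleans up after bdi_init() too */

                return err;
        }

        static void example_teardown(struct backing_dev_info *bdi)
        {
                bdi_destroy(bdi);               /* one call covers init + register */
        }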

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:50                     ` Jens Axboe
@ 2009-06-03  7:54                       ` Artem Bityutskiy
  0 siblings, 0 replies; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03  7:54 UTC (permalink / raw)
  To: ext Jens Axboe
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
>> Ah, I see. Rather atypical convention, though. I expected
>> bdi_register() to clean up after itself in case of failure.
>> Wouldn't that be a better interface?
> 
> You already did a bdi_init() at that point. bdi_destroy() must be used
> to clean up after both bdi_init() and/or bdi_register().

Right, silly me.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:44                 ` Jens Axboe
  2009-06-03  7:46                   ` Artem Bityutskiy
@ 2009-06-03  7:59                   ` Artem Bityutskiy
  2009-06-03  8:07                     ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03  7:59 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Artem Bityutskiy, Peter Zijlstra, linux-kernel, linux-fsdevel,
	tytso, chris.mason, david, hch, akpm, jack, yanmin_zhang,
	richard, damien.wyart

Jens Axboe wrote:
>> Just to make sure I understood you correctly: I assume my original
>> patch is fine (because there is bdi_destroy()) and will merge it
>> into the ubifs tree.
> 
> It needs to be:
> 
>         err = bdi_register(&c->bdi, NULL, "ubifs");
>         if (err)
>                 goto out_bdi;
> 
> so you hit the bdi_destroy() for that failure, not goto out_close;
> Otherwise it was fine.

Did this, also added a
Reviewed-by: Jens Axboe <jens.axboe@oracle.com>

http://git.infradead.org/ubifs-2.6.git?a=commit;h=813fdc16ad591e79d0c1b88d31970dcd1c2aa3f1

Thanks.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03  7:59                   ` Artem Bityutskiy
@ 2009-06-03  8:07                     ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-03  8:07 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, yanmin_zhang, richard, damien.wyart

On Wed, Jun 03 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>>> Just to make sure I understood you correctly: I assume my original
>>> patch is fine (because there is bdi_destroy()) and will merge it
>>> into the ubifs tree.
>>
>> It needs to be:
>>
>>         err = bdi_register(&c->bdi, NULL, "ubifs");
>>         if (err)
>>                 goto out_bdi;
>>
>> so you hit the bdi_destroy() for that failure, not goto out_close;
>> Otherwise it was fine.
>
> Did this, also added a
> Reviewed-by: Jens Axboe <jens.axboe@oracle.com>
>
> http://git.infradead.org/ubifs-2.6.git?a=commit;h=813fdc16ad591e79d0c1b88d31970dcd1c2aa3f1

Looks good!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-29 17:09     ` Jens Axboe
@ 2009-06-03  8:11       ` Artem Bityutskiy
  0 siblings, 0 replies; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03  8:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

ext Jens Axboe wrote:
> On Fri, May 29 2009, Artem Bityutskiy wrote:
>> Artem Bityutskiy wrote:
>>> Jens Axboe wrote:
>>>> Hi,
>>>>
>>>> Here's the 9th version of the writeback patches. Changes since v8:
>>>>
>>>> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>>>>   issue.
>>>> - Get rid of the explicit wait queues, we can just use wake_up_process()
>>>>   since it's just for that one task.
>>>> - Add separate "sync_supers" thread that makes sure that the dirty
>>>>   super blocks get written. We cannot safely do this from
>>>> bdi_forker_task(),
>>>>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>>>>   by doing the wake ups from a timer so that it would be easier for you
>>>>   to just deactivate the timer when there are no super blocks.
>>>>
>>>> For ease of patching, I've put the full diff here:
>>>>
>>>>   http://kernel.dk/writeback-v9.patch
>>>>
>>>> and also stored this in a writeback-v9 branch that will not change,
>>>> you can pull that into Linus tree from here:
>>>>
>>>>   git://git.kernel.dk/linux-2.6-block.git writeback-v9
>>> I'm working with the above branch. Got the following twice.
>>> Not sure what triggers this, probably if I do nothing and
>>> cpufreq starts doing its magic, this is triggered.
>>>
>>> And I'm not sure whether it has anything to do with your changes;
>>> it is just that I saw this only with your tree. Please
>>> ignore this if it is not relevant.
>> Sorry, probably I shouldn't have reported this before looking
>> closer. I'll investigate this later and find out whether it
>> is related to your work or not. Sorry for the too-early and
>> probably false alarm.
> 
> No problem. If it does turn out to have some relation to the writeback
> stuff, let me know.

OK, I'm confirming that I observe this with pure 2.6.30-rc7
as well.


-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (14 preceding siblings ...)
  2009-05-29 16:07 ` Artem Bityutskiy
@ 2009-06-03 11:12 ` Artem Bityutskiy
  2009-06-03 11:42   ` Jens Axboe
  2009-06-04 15:20 ` Frederic Weisbecker
  16 siblings, 1 reply; 66+ messages in thread
From: Artem Bityutskiy @ 2009-06-03 11:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

Jens Axboe wrote:
> Here's the 9th version of the writeback patches. Changes since v8:
> 
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>   issue.
> - Get rid of the explicit wait queues, we can just use wake_up_process()
>   since it's just for that one task.
> - Add separate "sync_supers" thread that makes sure that the dirty
>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>   by doing the wake ups from a timer so that it would be easier for you
>   to just deactivate the timer when there are no super blocks.

I wonder if you would consider working on top of the latest VFS changes:

git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git for-next

For me the problem is that my original patches were created against
the VFS tree, and they do not apply nicely to your tree. So what I
tried to do was apply your patches on top of the VFS tree, but they
did not apply cleanly either. I'm currently working on merging them,
but I thought it better to ask whether you had already done this.
-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-03 11:12 ` Artem Bityutskiy
@ 2009-06-03 11:42   ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-03 11:42 UTC (permalink / raw)
  To: Artem Bityutskiy
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

On Wed, Jun 03 2009, Artem Bityutskiy wrote:
> Jens Axboe wrote:
>> Here's the 9th version of the writeback patches. Changes since v8:
>>
>> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>>   issue.
>> - Get rid of the explicit wait queues, we can just use wake_up_process()
>>   since it's just for that one task.
>> - Add separate "sync_supers" thread that makes sure that the dirty
>>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>>   by doing the wake ups from a timer so that it would be easier for you
>>   to just deactivate the timer when there are no super blocks.
>
> I wonder if you would consider working on top of the latest VFS changes:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git for-next
>
> For me the problem is that my original patches were created against
> the VFS tree, and they do not apply nicely to your tree. So what I
> tried to do was apply your patches on top of the VFS tree, but they
> did not apply cleanly either. I'm currently working on merging them,
> but I thought it better to ask whether you had already done this.

Al, what's the time frame for submitting these vfs changes? I'm assuming
2.6.31 since it's called for-next. If that is the case, then it would be
best if I rebased on top of those.

So, to answer your other ping mail as well, my writeback changes will
then be based on top of the vfs tree and then your 0-17 patches. Then
we should have a common base to work from.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
                   ` (15 preceding siblings ...)
  2009-06-03 11:12 ` Artem Bityutskiy
@ 2009-06-04 15:20 ` Frederic Weisbecker
  2009-06-04 19:07   ` Andrew Morton
  2009-06-05  1:14   ` Zhang, Yanmin
  16 siblings, 2 replies; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-04 15:20 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, tytso, chris.mason, david, hch,
	akpm, jack, yanmin_zhang, richard, damien.wyart

[-- Attachment #1: Type: text/plain, Size: 3380 bytes --]

Hi,


On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> Hi,
> 
> Here's the 9th version of the writeback patches. Changes since v8:
> 
> - Fix a bdi_work on-stack allocation hang. I hope this fixes Ted's
>   issue.
> - Get rid of the explicit wait queues, we can just use wake_up_process()
>   since it's just for that one task.
> - Add separate "sync_supers" thread that makes sure that the dirty
>   super blocks get written. We cannot safely do this from bdi_forker_task(),
>   as that risks deadlocking on ->s_umount. Artem, I implemented this
>   by doing the wake ups from a timer so that it would be easier for you
>   to just deactivate the timer when there are no super blocks.
> 
> For ease of patching, I've put the full diff here:
> 
>   http://kernel.dk/writeback-v9.patch
> 
> and also stored this in a writeback-v9 branch that will not change,
> you can pull that into Linus tree from here:
> 
>   git://git.kernel.dk/linux-2.6-block.git writeback-v9
> 
>  block/blk-core.c            |    1 +
>  drivers/block/aoe/aoeblk.c  |    1 +
>  drivers/char/mem.c          |    1 +
>  fs/btrfs/disk-io.c          |   24 +-
>  fs/buffer.c                 |    2 +-
>  fs/char_dev.c               |    1 +
>  fs/configfs/inode.c         |    1 +
>  fs/fs-writeback.c           |  804 ++++++++++++++++++++++++++++-------
>  fs/fuse/inode.c             |    1 +
>  fs/hugetlbfs/inode.c        |    1 +
>  fs/nfs/client.c             |    1 +
>  fs/ntfs/super.c             |   33 +--
>  fs/ocfs2/dlm/dlmfs.c        |    1 +
>  fs/ramfs/inode.c            |    1 +
>  fs/super.c                  |    3 -
>  fs/sync.c                   |    2 +-
>  fs/sysfs/inode.c            |    1 +
>  fs/ubifs/super.c            |    1 +
>  include/linux/backing-dev.h |   73 ++++-
>  include/linux/fs.h          |   11 +-
>  include/linux/writeback.h   |   15 +-
>  kernel/cgroup.c             |    1 +
>  mm/Makefile                 |    2 +-
>  mm/backing-dev.c            |  518 ++++++++++++++++++++++-
>  mm/page-writeback.c         |  151 +------
>  mm/pdflush.c                |  269 ------------
>  mm/swap_state.c             |    1 +
>  mm/vmscan.c                 |    2 +-
>  28 files changed, 1286 insertions(+), 637 deletions(-)
> 


I've just tested it on UP with a single disk.

I've run two parallel dbench tests on two partitions and
tried it both with this patch and without.

I used 30 processes each, running for 600 seconds.

You can see the result in attachment.
And also there:

http://kernel.org/pub/linux/kernel/people/frederic/dbench.pdf
http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda1.log
http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda3.log
http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda1.log
http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda3.log


As you can see, bdi writeback is faster than pdflush on hda1 and slower
on hda3. But, well, that's not the point.

What I can observe here is the difference in the standard deviation
of the rate between two parallel writers on the same device (but
two different partitions, and hence superblocks).

With pdflush, the rate is much better balanced between the writers
than with bdi writeback on a single device.

I'm not sure why. Is there something in these patches that makes
several bdi flusher threads for the same bdi poorly balanced
against each other?

Frederic.

[-- Attachment #2: dbench.pdf --]
[-- Type: application/pdf, Size: 21887 bytes --]

[-- Attachment #3: bdi-writeback-hda1.log --]
[-- Type: text/plain, Size: 26598 bytes --]

dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 600 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 120 secs
30 clients started
  30        48    47.25 MB/sec  warmup   1 sec   
  30        48    25.73 MB/sec  warmup   2 sec   
  30        48    17.67 MB/sec  warmup   3 sec   
  30        51    14.77 MB/sec  warmup   4 sec   
  30        55     5.01 MB/sec  warmup  14 sec   
  30        57     2.47 MB/sec  warmup  29 sec   
  30        60     2.29 MB/sec  warmup  33 sec   
  30        61     1.83 MB/sec  warmup  42 sec   
  30        66     1.90 MB/sec  warmup  45 sec   
  30        66     1.86 MB/sec  warmup  46 sec   
  30        66     1.82 MB/sec  warmup  47 sec   
  30        66     1.78 MB/sec  warmup  48 sec   
  30        94     2.43 MB/sec  warmup  52 sec   
  30        94     2.39 MB/sec  warmup  53 sec   
  30        99     2.08 MB/sec  warmup  64 sec   
  30       126     2.40 MB/sec  warmup  68 sec   
  30       126     2.37 MB/sec  warmup  69 sec   
  30       171     2.49 MB/sec  warmup  84 sec   
  30       171     2.46 MB/sec  warmup  85 sec   
  30       171     2.43 MB/sec  warmup  86 sec   
  30       171     2.40 MB/sec  warmup  87 sec   
  30       179     2.23 MB/sec  warmup  97 sec   
  30       186     2.15 MB/sec  warmup 104 sec   
  30       186     2.02 MB/sec  warmup 111 sec   
  30       189     1.93 MB/sec  warmup 117 sec   
  30       256     2.44 MB/sec  warmup 119 sec   
  30       256     2.42 MB/sec  warmup 120 sec   
  30       261     0.00 MB/sec  execute   1 sec   
  30       261     0.00 MB/sec  execute   2 sec   
  30       272     0.45 MB/sec  execute  23 sec   
  30       299     1.20 MB/sec  execute  30 sec   
  30       299     1.16 MB/sec  execute  31 sec   
  30       299     1.12 MB/sec  execute  32 sec   
  30       299     1.09 MB/sec  execute  33 sec   
  30       299     1.06 MB/sec  execute  34 sec   
  30       335     1.83 MB/sec  execute  38 sec   
  30       350     2.14 MB/sec  execute  39 sec   
  30       418     3.10 MB/sec  execute  48 sec   
  30       430     3.22 MB/sec  execute  49 sec   
  30       430     3.16 MB/sec  execute  50 sec   
  30       493     3.61 MB/sec  execute  59 sec   
  30       499     3.61 MB/sec  execute  60 sec   
  30       617     3.94 MB/sec  execute  67 sec   
  30       720     4.14 MB/sec  execute  68 sec   
  30       839     4.46 MB/sec  execute  69 sec   
  30      1171     5.26 MB/sec  execute  70 sec   
  30      1185     5.25 MB/sec  execute  71 sec   
  30      1185     5.17 MB/sec  execute  72 sec   
  30      1493     5.44 MB/sec  execute  81 sec   
  30      1493     5.37 MB/sec  execute  82 sec   
  30      1505     5.33 MB/sec  execute  83 sec   
  30      1559     5.39 MB/sec  execute  84 sec   
  30      1563     5.33 MB/sec  execute  85 sec   
  30      1646     5.43 MB/sec  execute  86 sec   
  30      1677     5.43 MB/sec  execute  87 sec   
  30      2030     6.06 MB/sec  execute  88 sec   
  30      2381     6.61 MB/sec  execute  89 sec   
  30      2738     7.17 MB/sec  execute  90 sec   
  30      3127     7.72 MB/sec  execute  91 sec   
  30      3451     8.32 MB/sec  execute  92 sec   
  30      3837     8.78 MB/sec  execute  93 sec   
  30      4188     9.35 MB/sec  execute  94 sec   
  30      4521     9.80 MB/sec  execute  95 sec   
  30      4903    10.34 MB/sec  execute  96 sec   
  30      5267    10.88 MB/sec  execute  97 sec   
  30      5376    11.02 MB/sec  execute  98 sec   
  30      5587    11.17 MB/sec  execute  99 sec   
  30      5868    11.52 MB/sec  execute 100 sec   
  30      6039    11.68 MB/sec  execute 101 sec   
  30      6047    11.57 MB/sec  execute 102 sec   
  30      6078    11.46 MB/sec  execute 103 sec   
  30      6170    11.46 MB/sec  execute 104 sec   
  30      6224    11.42 MB/sec  execute 105 sec   
  30      6374    11.56 MB/sec  execute 106 sec   
  30      6601    11.84 MB/sec  execute 107 sec   
  30      6839    12.08 MB/sec  execute 108 sec   
  30      7078    12.32 MB/sec  execute 109 sec   
  30      7320    12.56 MB/sec  execute 110 sec   
  30      7634    12.88 MB/sec  execute 111 sec   
  30      8001    13.31 MB/sec  execute 112 sec   
  30      8290    13.69 MB/sec  execute 113 sec   
  30      8638    14.13 MB/sec  execute 114 sec   
  30      9038    14.41 MB/sec  execute 115 sec   
  30      9367    14.86 MB/sec  execute 116 sec   
  30      9697    15.19 MB/sec  execute 117 sec   
  30     10052    15.52 MB/sec  execute 118 sec   
  30     10412    15.91 MB/sec  execute 119 sec   
  30     10613    16.01 MB/sec  execute 120 sec   
  30     10640    15.93 MB/sec  execute 121 sec   
  30     10838    16.08 MB/sec  execute 122 sec   
  30     11211    16.38 MB/sec  execute 123 sec   
  30     11558    16.69 MB/sec  execute 124 sec   
  30     11899    17.09 MB/sec  execute 125 sec   
  30     12267    17.39 MB/sec  execute 126 sec   
  30     12619    17.67 MB/sec  execute 127 sec   
  30     12891    17.94 MB/sec  execute 128 sec   
  30     13072    18.01 MB/sec  execute 129 sec   
  30     13263    18.05 MB/sec  execute 130 sec   
  30     13425    18.16 MB/sec  execute 131 sec   
  30     13572    18.23 MB/sec  execute 132 sec   
  30     13761    18.29 MB/sec  execute 133 sec   
  30     13901    18.29 MB/sec  execute 134 sec   
  30     14035    18.36 MB/sec  execute 135 sec   
  30     14129    18.37 MB/sec  execute 136 sec   
  30     14212    18.31 MB/sec  execute 137 sec   
  30     14279    18.26 MB/sec  execute 138 sec   
  30     14374    18.20 MB/sec  execute 139 sec   
  30     14460    18.12 MB/sec  execute 140 sec   
  30     14552    18.14 MB/sec  execute 141 sec   
  30     14565    18.02 MB/sec  execute 142 sec   
  30     14567    17.90 MB/sec  execute 143 sec   
  30     14567    17.77 MB/sec  execute 144 sec   
  30     14567    17.65 MB/sec  execute 145 sec   
  30     14567    17.53 MB/sec  execute 146 sec   
  30     14567    17.41 MB/sec  execute 147 sec   
  30     14728    17.51 MB/sec  execute 148 sec   
  30     14957    17.69 MB/sec  execute 149 sec   
  30     15027    17.61 MB/sec  execute 150 sec   
  30     15378    17.90 MB/sec  execute 151 sec   
  30     15742    18.13 MB/sec  execute 152 sec   
  30     16100    18.42 MB/sec  execute 153 sec   
  30     16466    18.68 MB/sec  execute 154 sec   
  30     16790    18.93 MB/sec  execute 155 sec   
  30     17055    19.08 MB/sec  execute 156 sec   
  30     17138    19.06 MB/sec  execute 157 sec   
  30     17230    19.03 MB/sec  execute 158 sec   
  30     17332    18.99 MB/sec  execute 159 sec   
  30     17522    19.07 MB/sec  execute 160 sec   
  30     17736    19.16 MB/sec  execute 161 sec   
  30     17840    18.18 MB/sec  execute 171 sec   
  30     17840    18.07 MB/sec  execute 172 sec   
  30     17851    17.97 MB/sec  execute 173 sec   
  30     17851    17.87 MB/sec  execute 174 sec   
  30     17851    17.77 MB/sec  execute 175 sec   
  30     17852    17.67 MB/sec  execute 176 sec   
  30     17858    17.58 MB/sec  execute 177 sec   
  30     17858    17.48 MB/sec  execute 178 sec   
  30     17899    17.43 MB/sec  execute 179 sec   
  30     18946    17.39 MB/sec  execute 189 sec   
  30     18946    17.29 MB/sec  execute 190 sec   
  30     19249    17.30 MB/sec  execute 192 sec   
  30     19415    17.35 MB/sec  execute 193 sec   
  30     19517    17.33 MB/sec  execute 194 sec   
  30     19589    17.31 MB/sec  execute 195 sec   
  30     19669    17.28 MB/sec  execute 196 sec   
  30     19709    17.24 MB/sec  execute 197 sec   
  30     19773    17.19 MB/sec  execute 198 sec   
  30     19847    17.18 MB/sec  execute 199 sec   
  30     19947    17.14 MB/sec  execute 200 sec   
  30     20045    17.14 MB/sec  execute 201 sec   
  30     20136    17.14 MB/sec  execute 202 sec   
  30     20203    17.10 MB/sec  execute 203 sec   
  30     20294    17.10 MB/sec  execute 204 sec   
  30     20316    17.05 MB/sec  execute 205 sec   
  30     20422    17.03 MB/sec  execute 206 sec   
  30     20470    16.97 MB/sec  execute 207 sec   
  30     20480    16.90 MB/sec  execute 208 sec   
  30     20480    16.82 MB/sec  execute 209 sec   
  30     20480    16.74 MB/sec  execute 210 sec   
  30     20480    16.66 MB/sec  execute 211 sec   
  30     20526    16.62 MB/sec  execute 212 sec   
  30     20555    16.56 MB/sec  execute 213 sec   
  30     20555    16.48 MB/sec  execute 214 sec   
  30     20768    16.56 MB/sec  execute 215 sec   
  30     21073    16.75 MB/sec  execute 216 sec   
  30     21427    16.92 MB/sec  execute 217 sec   
  30     21778    17.10 MB/sec  execute 218 sec   
  30     22150    17.27 MB/sec  execute 219 sec   
  30     22494    17.47 MB/sec  execute 220 sec   
  30     22837    17.63 MB/sec  execute 221 sec   
  30     23200    17.81 MB/sec  execute 222 sec   
  30     23552    18.00 MB/sec  execute 223 sec   
  30     23886    18.17 MB/sec  execute 224 sec   
  30     24037    18.21 MB/sec  execute 225 sec   
  30     24060    18.14 MB/sec  execute 226 sec   
  30     24293    17.72 MB/sec  execute 234 sec   
  30     24293    17.64 MB/sec  execute 235 sec   
  30     24293    17.57 MB/sec  execute 236 sec   
  30     24321    17.53 MB/sec  execute 237 sec   
  30     24547    17.58 MB/sec  execute 238 sec   
  30     24602    17.56 MB/sec  execute 239 sec   
  30     24950    17.71 MB/sec  execute 240 sec   
  30     25300    17.86 MB/sec  execute 241 sec   
  30     25654    18.03 MB/sec  execute 242 sec   
  30     26001    18.20 MB/sec  execute 243 sec   
  30     26340    18.34 MB/sec  execute 244 sec   
  30     27206    18.57 MB/sec  execute 248 sec   
  30     27288    18.56 MB/sec  execute 249 sec   
  30     27288    18.49 MB/sec  execute 250 sec   
  30     27290    18.41 MB/sec  execute 251 sec   
  30     27290    18.34 MB/sec  execute 252 sec   
  30     27347    18.32 MB/sec  execute 253 sec   
  30     27347    18.25 MB/sec  execute 254 sec   
  30     27347    18.18 MB/sec  execute 255 sec   
  30     27454    18.17 MB/sec  execute 256 sec   
  30     27728    18.28 MB/sec  execute 257 sec   
  30     28097    18.43 MB/sec  execute 258 sec   
  30     28464    18.59 MB/sec  execute 259 sec   
  30     28795    18.73 MB/sec  execute 260 sec   
  30     29002    18.83 MB/sec  execute 261 sec   
  30     29209    18.84 MB/sec  execute 262 sec   
  30     29428    18.88 MB/sec  execute 263 sec   
  30     29577    18.92 MB/sec  execute 264 sec   
  30     29725    18.94 MB/sec  execute 265 sec   
  30     29802    18.93 MB/sec  execute 266 sec   
  30     29835    18.88 MB/sec  execute 267 sec   
  30     29938    18.88 MB/sec  execute 268 sec   
  30     30150    18.93 MB/sec  execute 269 sec   
  30     30487    19.08 MB/sec  execute 270 sec   
  30     30853    19.22 MB/sec  execute 271 sec   
  30     31222    19.35 MB/sec  execute 272 sec   
  30     31579    19.49 MB/sec  execute 273 sec   
  30     31936    19.64 MB/sec  execute 274 sec   
  30     32085    19.67 MB/sec  execute 275 sec   
  30     32232    19.68 MB/sec  execute 276 sec   
  30     32399    19.71 MB/sec  execute 277 sec   
  30     32513    19.70 MB/sec  execute 278 sec   
  30     33554    19.35 MB/sec  execute 291 sec   
  30     33577    19.29 MB/sec  execute 292 sec   
  30     33577    19.23 MB/sec  execute 293 sec   
  30     33577    19.16 MB/sec  execute 294 sec   
  30     33577    19.10 MB/sec  execute 295 sec   
  30     33577    19.03 MB/sec  execute 296 sec   
  30     33577    18.97 MB/sec  execute 297 sec   
  30     33577    18.91 MB/sec  execute 298 sec   
  30     33577    18.84 MB/sec  execute 299 sec   
  30     33577    18.78 MB/sec  execute 300 sec   
  30     33577    18.72 MB/sec  execute 301 sec   
  30     33588    18.66 MB/sec  execute 302 sec   
  30     33667    18.64 MB/sec  execute 303 sec   
  30     33843    18.66 MB/sec  execute 304 sec   
  30     33872    18.62 MB/sec  execute 305 sec   
  30     34209    18.76 MB/sec  execute 306 sec   
  30     34558    18.88 MB/sec  execute 307 sec   
  30     34883    19.00 MB/sec  execute 308 sec   
  30     35233    19.13 MB/sec  execute 309 sec   
  30     35571    19.24 MB/sec  execute 310 sec   
  30     35939    19.36 MB/sec  execute 311 sec   
  30     36268    19.52 MB/sec  execute 312 sec   
  30     36588    19.65 MB/sec  execute 313 sec   
  30     36887    19.70 MB/sec  execute 314 sec   
  30     36887    19.64 MB/sec  execute 315 sec   
  30     36889    19.58 MB/sec  execute 316 sec   
  30     37176    19.68 MB/sec  execute 317 sec   
  30     37289    19.64 MB/sec  execute 318 sec   
  30     37321    19.59 MB/sec  execute 319 sec   
  30     37452    19.58 MB/sec  execute 320 sec   
  30     37677    19.62 MB/sec  execute 321 sec   
  30     38025    19.74 MB/sec  execute 322 sec   
  30     38379    19.86 MB/sec  execute 323 sec   
  30     38741    19.98 MB/sec  execute 324 sec   
  30     39109    20.09 MB/sec  execute 325 sec   
  30     39465    20.22 MB/sec  execute 326 sec   
  30     39831    20.33 MB/sec  execute 327 sec   
  30     40194    20.45 MB/sec  execute 328 sec   
  30     40530    20.54 MB/sec  execute 329 sec   
  30     40741    20.59 MB/sec  execute 330 sec   
  30     40882    20.59 MB/sec  execute 331 sec   
  30     40967    20.58 MB/sec  execute 332 sec   
  30     41068    20.58 MB/sec  execute 333 sec   
  30     41191    20.57 MB/sec  execute 334 sec   
  30     41249    20.53 MB/sec  execute 335 sec   
  30     41249    20.47 MB/sec  execute 336 sec   
  30     41249    20.40 MB/sec  execute 337 sec   
  30     41249    20.34 MB/sec  execute 338 sec   
  30     41249    20.28 MB/sec  execute 339 sec   
  30     41261    20.24 MB/sec  execute 340 sec   
  30     41261    20.18 MB/sec  execute 341 sec   
  30     41262    20.12 MB/sec  execute 342 sec   
  30     41279    20.06 MB/sec  execute 343 sec   
  30     41375    20.00 MB/sec  execute 345 sec   
  30     41375    19.94 MB/sec  execute 346 sec   
  30     41416    19.90 MB/sec  execute 347 sec   
  30     41704    19.98 MB/sec  execute 348 sec   
  30     42073    20.10 MB/sec  execute 349 sec   
  30     42437    20.21 MB/sec  execute 350 sec   
  30     42788    20.31 MB/sec  execute 351 sec   
  30     43159    20.42 MB/sec  execute 352 sec   
  30     43528    20.53 MB/sec  execute 353 sec   
  30     43878    20.64 MB/sec  execute 354 sec   
  30     44254    20.73 MB/sec  execute 355 sec   
  30     44585    20.85 MB/sec  execute 356 sec   
  30     44944    20.94 MB/sec  execute 357 sec   
  30     45246    21.01 MB/sec  execute 358 sec   
  30     45453    21.06 MB/sec  execute 359 sec   
  30     45662    21.12 MB/sec  execute 360 sec   
  30     45873    21.15 MB/sec  execute 361 sec   
  30     46057    21.18 MB/sec  execute 362 sec   
  30     46289    21.20 MB/sec  execute 363 sec   
  30     46469    21.22 MB/sec  execute 364 sec   
  30     46611    21.24 MB/sec  execute 365 sec   
  30     46719    21.25 MB/sec  execute 366 sec   
  30     46869    21.22 MB/sec  execute 367 sec   
  30     46898    21.17 MB/sec  execute 368 sec   
  30     46930    21.12 MB/sec  execute 369 sec   
  30     46960    21.08 MB/sec  execute 370 sec   
  30     47021    21.06 MB/sec  execute 371 sec   
  30     47043    21.01 MB/sec  execute 372 sec   
  30     47166    21.00 MB/sec  execute 373 sec   
  30     47219    20.97 MB/sec  execute 374 sec   
  30     47219    20.91 MB/sec  execute 375 sec   
  30     47219    20.86 MB/sec  execute 376 sec   
  30     47219    20.80 MB/sec  execute 377 sec   
  30     47219    20.75 MB/sec  execute 378 sec   
  30     47219    20.69 MB/sec  execute 379 sec   
  30     47219    20.64 MB/sec  execute 380 sec   
  30     47245    20.60 MB/sec  execute 381 sec   
  30     47296    20.56 MB/sec  execute 382 sec   
  30     47461    20.57 MB/sec  execute 383 sec   
  30     47678    20.61 MB/sec  execute 384 sec   
  30     48044    20.71 MB/sec  execute 385 sec   
  30     48370    20.82 MB/sec  execute 386 sec   
  30     48993    20.63 MB/sec  execute 395 sec   
  30     49009    20.58 MB/sec  execute 396 sec   
  30     49075    20.55 MB/sec  execute 397 sec   
  30     49075    20.50 MB/sec  execute 398 sec   
  30     49075    20.45 MB/sec  execute 399 sec   
  30     49075    20.40 MB/sec  execute 400 sec   
  30     49075    20.35 MB/sec  execute 401 sec   
  30     49075    20.30 MB/sec  execute 402 sec   
  30     49274    20.33 MB/sec  execute 403 sec   
  30     49623    20.43 MB/sec  execute 404 sec   
  30     49993    20.51 MB/sec  execute 405 sec   
  30     50324    20.62 MB/sec  execute 406 sec   
  30     50700    20.69 MB/sec  execute 407 sec   
  30     50845    20.70 MB/sec  execute 408 sec   
  30     52557    20.77 MB/sec  execute 420 sec   
  30     52924    20.85 MB/sec  execute 421 sec   
  30     53287    20.94 MB/sec  execute 422 sec   
  30     53631    21.02 MB/sec  execute 423 sec   
  30     53997    21.12 MB/sec  execute 424 sec   
  30     54316    21.19 MB/sec  execute 425 sec   
  30     54659    21.27 MB/sec  execute 426 sec   
  30     54845    21.31 MB/sec  execute 427 sec   
  30     55031    21.32 MB/sec  execute 428 sec   
  30     55175    21.33 MB/sec  execute 429 sec   
  30     55317    21.34 MB/sec  execute 430 sec   
  30     55437    21.33 MB/sec  execute 431 sec   
  30     55518    21.30 MB/sec  execute 432 sec   
  30     55518    21.26 MB/sec  execute 433 sec   
  30     55633    21.25 MB/sec  execute 434 sec   
  30     55734    21.24 MB/sec  execute 435 sec   
  30     55757    21.20 MB/sec  execute 436 sec   
  30     55780    21.16 MB/sec  execute 437 sec   
  30     55862    21.14 MB/sec  execute 438 sec   
  30     56199    21.22 MB/sec  execute 439 sec   
  30     56559    21.30 MB/sec  execute 440 sec   
  30     56920    21.39 MB/sec  execute 441 sec   
  30     57279    21.47 MB/sec  execute 442 sec   
  30     57642    21.55 MB/sec  execute 443 sec   
  30     58017    21.63 MB/sec  execute 444 sec   
  30     58374    21.72 MB/sec  execute 445 sec   
  30     58736    21.78 MB/sec  execute 446 sec   
  30     59070    21.86 MB/sec  execute 447 sec   
  30     59434    21.95 MB/sec  execute 448 sec   
  30     59619    21.98 MB/sec  execute 449 sec   
  30     59654    21.94 MB/sec  execute 450 sec   
  30     59983    22.05 MB/sec  execute 451 sec   
  30     60218    22.08 MB/sec  execute 452 sec   
  30     60495    22.13 MB/sec  execute 453 sec   
  30     60506    22.08 MB/sec  execute 454 sec   
  30     60584    22.06 MB/sec  execute 455 sec   
  30     60662    22.03 MB/sec  execute 456 sec   
  30     60854    22.02 MB/sec  execute 457 sec   
  30     61212    22.09 MB/sec  execute 458 sec   
  30     61523    22.16 MB/sec  execute 459 sec   
  30     61533    22.11 MB/sec  execute 460 sec   
  30     61536    22.06 MB/sec  execute 461 sec   
  30     61537    22.01 MB/sec  execute 462 sec   
  30     61538    21.97 MB/sec  execute 463 sec   
  30     61550    21.93 MB/sec  execute 464 sec   
  30     61550    21.88 MB/sec  execute 465 sec   
  30     61555    21.84 MB/sec  execute 466 sec   
  30     61555    21.79 MB/sec  execute 467 sec   
  30     61555    21.74 MB/sec  execute 468 sec   
  30     61556    21.70 MB/sec  execute 469 sec   
  30     61663    21.69 MB/sec  execute 470 sec   
  30     62013    21.76 MB/sec  execute 471 sec   
  30     62299    21.87 MB/sec  execute 472 sec   
  30     62632    21.95 MB/sec  execute 473 sec   
  30     62971    22.00 MB/sec  execute 474 sec   
  30     63260    22.05 MB/sec  execute 475 sec   
  30     63646    21.86 MB/sec  execute 480 sec   
  30     63711    21.83 MB/sec  execute 481 sec   
  30     63711    21.79 MB/sec  execute 482 sec   
  30     63825    21.78 MB/sec  execute 483 sec   
  30     64186    21.86 MB/sec  execute 484 sec   
  30     64537    21.93 MB/sec  execute 485 sec   
  30     64897    22.00 MB/sec  execute 486 sec   
  30     65243    22.09 MB/sec  execute 487 sec   
  30     65600    22.15 MB/sec  execute 488 sec   
  30     65961    22.23 MB/sec  execute 489 sec   
  30     66314    22.30 MB/sec  execute 490 sec   
  30     66410    22.29 MB/sec  execute 491 sec   
  30     66662    22.32 MB/sec  execute 492 sec   
  30     66870    22.35 MB/sec  execute 493 sec   
  30     67106    22.39 MB/sec  execute 494 sec   
  30     67305    22.40 MB/sec  execute 495 sec   
  30     67475    22.41 MB/sec  execute 496 sec   
  30     67482    22.37 MB/sec  execute 497 sec   
  30     67520    22.33 MB/sec  execute 498 sec   
  30     67530    22.29 MB/sec  execute 499 sec   
  30     67623    22.27 MB/sec  execute 500 sec   
  30     67679    22.24 MB/sec  execute 501 sec   
  30     67768    22.22 MB/sec  execute 502 sec   
  30     68060    22.28 MB/sec  execute 503 sec   
  30     68262    22.31 MB/sec  execute 504 sec   
  30     68335    22.28 MB/sec  execute 505 sec   
  30     68346    22.24 MB/sec  execute 506 sec   
  30     68395    22.21 MB/sec  execute 507 sec   
  30     68438    22.18 MB/sec  execute 508 sec   
  30     68440    22.14 MB/sec  execute 509 sec   
  30     68600    22.15 MB/sec  execute 510 sec   
  30     68952    22.22 MB/sec  execute 511 sec   
  30     69262    22.27 MB/sec  execute 512 sec   
  30     69504    22.31 MB/sec  execute 513 sec   
  30     69774    22.35 MB/sec  execute 514 sec   
  30     70182    22.41 MB/sec  execute 515 sec   
  30     70510    22.50 MB/sec  execute 516 sec   
  30     70834    22.58 MB/sec  execute 517 sec   
  30     71147    22.64 MB/sec  execute 518 sec   
  30     71177    22.61 MB/sec  execute 519 sec   
  30     71337    22.61 MB/sec  execute 520 sec   
  30     71354    22.57 MB/sec  execute 521 sec   
  30     71354    22.53 MB/sec  execute 522 sec   
  30     71372    22.49 MB/sec  execute 523 sec   
  30     71372    22.44 MB/sec  execute 524 sec   
  30     71483    22.44 MB/sec  execute 525 sec   
  30     71641    22.44 MB/sec  execute 526 sec   
  30     71823    22.46 MB/sec  execute 527 sec   
  30     72045    22.47 MB/sec  execute 528 sec   
  30     72211    22.48 MB/sec  execute 529 sec   
  30     72417    22.50 MB/sec  execute 530 sec   
  30     72778    22.57 MB/sec  execute 531 sec   
  30     73142    22.64 MB/sec  execute 532 sec   
  30     73511    22.70 MB/sec  execute 533 sec   
  30     73572    22.67 MB/sec  execute 534 sec   
  30     73671    22.66 MB/sec  execute 535 sec   
  30     73909    22.63 MB/sec  execute 538 sec   
  30     74121    22.68 MB/sec  execute 539 sec   
  30     74351    22.72 MB/sec  execute 540 sec   
  30     74443    22.69 MB/sec  execute 541 sec   
  30     74453    22.65 MB/sec  execute 542 sec   
  30     74532    22.62 MB/sec  execute 543 sec   
  30     74612    22.60 MB/sec  execute 544 sec   
  30     74721    22.50 MB/sec  execute 547 sec   
  30     74721    22.46 MB/sec  execute 548 sec   
  30     74735    22.42 MB/sec  execute 549 sec   
  30     74735    22.38 MB/sec  execute 550 sec   
  30     74750    22.34 MB/sec  execute 551 sec   
  30     74903    22.34 MB/sec  execute 552 sec   
  30     75279    22.41 MB/sec  execute 553 sec   
  30     75573    22.46 MB/sec  execute 554 sec   
  30     75873    22.50 MB/sec  execute 555 sec   
  30     76203    22.57 MB/sec  execute 556 sec   
  30     76562    22.64 MB/sec  execute 557 sec   
  30     76899    22.68 MB/sec  execute 558 sec   
  30     77253    22.74 MB/sec  execute 559 sec   
  30     77614    22.80 MB/sec  execute 560 sec   
  30     77942    22.87 MB/sec  execute 561 sec   
  30     78282    22.94 MB/sec  execute 562 sec   
  30     78566    22.99 MB/sec  execute 563 sec   
  30     78917    23.05 MB/sec  execute 564 sec   
  30     79191    23.08 MB/sec  execute 565 sec   
  30     79232    23.04 MB/sec  execute 566 sec   
  30     79232    23.00 MB/sec  execute 567 sec   
  30     79239    22.96 MB/sec  execute 568 sec   
  30     79239    22.92 MB/sec  execute 569 sec   
  30     79593    22.97 MB/sec  execute 570 sec   
  30     79954    23.03 MB/sec  execute 571 sec   
  30     80319    23.10 MB/sec  execute 572 sec   
  30     80680    23.16 MB/sec  execute 573 sec   
  30     81044    23.22 MB/sec  execute 574 sec   
  30     81404    23.28 MB/sec  execute 575 sec   
  30     81765    23.34 MB/sec  execute 576 sec   
  30     82122    23.41 MB/sec  execute 577 sec   
  30     82226    23.37 MB/sec  execute 578 sec   
  30     82226    23.33 MB/sec  execute 579 sec   
  30     82226    23.29 MB/sec  execute 580 sec   
  30     82226    23.25 MB/sec  execute 581 sec   
  30     82226    23.21 MB/sec  execute 582 sec   
  30     82226    23.17 MB/sec  execute 583 sec   
  30     82226    23.13 MB/sec  execute 584 sec   
  30     82226    23.09 MB/sec  execute 585 sec   
  30     82226    23.05 MB/sec  execute 586 sec   
  30     82226    23.01 MB/sec  execute 587 sec   
  30     82325    23.00 MB/sec  execute 588 sec   
  30     82358    22.96 MB/sec  execute 589 sec   
  30     82395    22.93 MB/sec  execute 590 sec   
  30     82485    22.91 MB/sec  execute 591 sec   
  30     82495    22.88 MB/sec  execute 592 sec   
  30     82682    22.89 MB/sec  execute 593 sec   
  30     83043    22.95 MB/sec  execute 594 sec   
  30     83407    23.00 MB/sec  execute 595 sec   
  30     83772    23.06 MB/sec  execute 596 sec   
  30     84137    23.12 MB/sec  execute 597 sec   
  30     84392    23.15 MB/sec  execute 598 sec   
  30     84523    23.16 MB/sec  execute 599 sec   
  30     84692    23.16 MB/sec  cleanup 600 sec   
  30     84692    23.12 MB/sec  cleanup 601 sec   
  30     84692    23.08 MB/sec  cleanup 602 sec   
  30     84692    23.05 MB/sec  cleanup 603 sec   
  30     84692    23.01 MB/sec  cleanup 604 sec   
  30     84692    22.97 MB/sec  cleanup 605 sec   
  30     84692    22.93 MB/sec  cleanup 606 sec   
  30     84692    22.87 MB/sec  cleanup 608 sec   
  30     84692    22.83 MB/sec  cleanup 609 sec   
  30     84692    22.79 MB/sec  cleanup 610 sec   
  30     84692    22.76 MB/sec  cleanup 611 sec   
  30     84692    22.72 MB/sec  cleanup 612 sec   
  30     84692    22.68 MB/sec  cleanup 613 sec   
  30     84692    22.64 MB/sec  cleanup 614 sec   
  30     84692    22.61 MB/sec  cleanup 615 sec   
  30     84692    22.57 MB/sec  cleanup 616 sec   
  30     84692    22.53 MB/sec  cleanup 617 sec   
  30     84692    22.50 MB/sec  cleanup 618 sec   
  30     84692    22.46 MB/sec  cleanup 619 sec   
  30     84692    22.44 MB/sec  cleanup 620 sec   

Throughput 23.1628 MB/sec 30 procs

[-- Attachment #4: bdi-writeback-hda3.log --]
[-- Type: text/plain, Size: 23517 bytes --]

dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 600 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 120 secs
30 clients started
  30        13     0.00 MB/sec  warmup   1 sec   
  30        14     0.00 MB/sec  warmup   2 sec   
  30        14     0.00 MB/sec  warmup   3 sec   
  30        14     0.00 MB/sec  warmup   4 sec   
  30        20     2.14 MB/sec  warmup   5 sec   
  30        26     3.29 MB/sec  warmup   6 sec   
  30        30     3.86 MB/sec  warmup   7 sec   
  30        34     4.19 MB/sec  warmup   8 sec   
  30        35     3.93 MB/sec  warmup   9 sec   
  30        44     4.58 MB/sec  warmup  10 sec   
  30        49     4.88 MB/sec  warmup  11 sec   
  30        52     4.88 MB/sec  warmup  12 sec   
  30        60     5.19 MB/sec  warmup  13 sec   
  30        68     5.65 MB/sec  warmup  14 sec   
  30        68     5.28 MB/sec  warmup  15 sec   
  30        83     5.94 MB/sec  warmup  16 sec   
  30        92     6.43 MB/sec  warmup  17 sec   
  30        96     6.35 MB/sec  warmup  18 sec   
  30       101     6.29 MB/sec  warmup  19 sec   
  30       112     6.62 MB/sec  warmup  20 sec   
  30       116     6.60 MB/sec  warmup  21 sec   
  30       123     6.69 MB/sec  warmup  22 sec   
  30       123     6.41 MB/sec  warmup  23 sec   
  30       142     6.92 MB/sec  warmup  24 sec   
  30       149     7.02 MB/sec  warmup  25 sec   
  30       158     7.13 MB/sec  warmup  26 sec   
  30       169     7.29 MB/sec  warmup  27 sec   
  30       177     7.33 MB/sec  warmup  28 sec   
  30       190     7.51 MB/sec  warmup  29 sec   
  30       225     7.94 MB/sec  warmup  32 sec   
  30       237     8.04 MB/sec  warmup  33 sec   
  30       250     8.16 MB/sec  warmup  34 sec   
  30       254     8.04 MB/sec  warmup  35 sec   
  30       269     8.25 MB/sec  warmup  36 sec   
  30       275     8.24 MB/sec  warmup  37 sec   
  30       288     8.35 MB/sec  warmup  38 sec   
  30       301     8.45 MB/sec  warmup  39 sec   
  30       315     8.55 MB/sec  warmup  40 sec   
  30       339     8.58 MB/sec  warmup  41 sec   
  30       347     8.57 MB/sec  warmup  42 sec   
  30       356     8.57 MB/sec  warmup  43 sec   
  30       374     8.62 MB/sec  warmup  44 sec   
  30       535     9.14 MB/sec  warmup  45 sec   
  30       606     9.09 MB/sec  warmup  46 sec   
  30       631     8.68 MB/sec  warmup  51 sec   
  30       631     8.52 MB/sec  warmup  52 sec   
  30       671     8.56 MB/sec  warmup  53 sec   
  30       949     9.24 MB/sec  warmup  54 sec   
  30      1102     9.86 MB/sec  warmup  55 sec   
  30      1196     9.99 MB/sec  warmup  56 sec   
  30      1339    10.34 MB/sec  warmup  57 sec   
  30      1460    10.46 MB/sec  warmup  58 sec   
  30      1544    10.67 MB/sec  warmup  59 sec   
  30      1593    10.76 MB/sec  warmup  60 sec   
  30      1621    10.73 MB/sec  warmup  61 sec   
  30      1644    10.73 MB/sec  warmup  62 sec   
  30      1661    10.79 MB/sec  warmup  63 sec   
  30      1689    10.85 MB/sec  warmup  64 sec   
  30      1740    10.46 MB/sec  warmup  68 sec   
  30      1761    10.36 MB/sec  warmup  69 sec   
  30      1817    10.44 MB/sec  warmup  70 sec   
  30      2138    10.95 MB/sec  warmup  71 sec   
  30      2446    11.50 MB/sec  warmup  72 sec   
  30      2532    11.76 MB/sec  warmup  73 sec   
  30      2538    11.69 MB/sec  warmup  74 sec   
  30      2572    11.58 MB/sec  warmup  75 sec   
  30      2731    11.87 MB/sec  warmup  76 sec   
  30      3047    12.36 MB/sec  warmup  77 sec   
  30      3104    12.33 MB/sec  warmup  78 sec   
  30      3107    12.15 MB/sec  warmup  79 sec   
  30      3124    12.01 MB/sec  warmup  80 sec   
  30      3143    11.96 MB/sec  warmup  81 sec   
  30      3231    11.82 MB/sec  warmup  84 sec   
  30      3231    11.69 MB/sec  warmup  85 sec   
  30      3231    11.55 MB/sec  warmup  86 sec   
  30      3231    11.42 MB/sec  warmup  87 sec   
  30      3231    11.29 MB/sec  warmup  88 sec   
  30      3428    11.29 MB/sec  warmup  92 sec   
  30      3663    11.70 MB/sec  warmup  93 sec   
  30      3785    11.85 MB/sec  warmup  94 sec   
  30      3923    11.94 MB/sec  warmup  95 sec   
  30      3937    11.86 MB/sec  warmup  96 sec   
  30      3980    11.78 MB/sec  warmup  97 sec   
  30      4298    12.17 MB/sec  warmup  98 sec   
  30      4616    12.59 MB/sec  warmup  99 sec   
  30      4905    12.95 MB/sec  warmup 100 sec   
  30      5228    13.32 MB/sec  warmup 101 sec   
  30      5513    13.58 MB/sec  warmup 102 sec   
  30      5826    13.97 MB/sec  warmup 103 sec   
  30      6116    14.29 MB/sec  warmup 104 sec   
  30      6409    14.58 MB/sec  warmup 105 sec   
  30      6674    14.89 MB/sec  warmup 106 sec   
  30      6913    15.14 MB/sec  warmup 107 sec   
  30      7043    15.19 MB/sec  warmup 108 sec   
  30      7052    15.05 MB/sec  warmup 109 sec   
  30      7052    14.91 MB/sec  warmup 110 sec   
  30      7052    14.78 MB/sec  warmup 111 sec   
  30      7052    14.65 MB/sec  warmup 112 sec   
  30      7155    13.85 MB/sec  warmup 120 sec   
  30      7658    49.68 MB/sec  execute   1 sec   
  30      7972    51.10 MB/sec  execute   2 sec   
  30      8295    51.55 MB/sec  execute   3 sec   
  30      8611    51.36 MB/sec  execute   4 sec   
  30      8916    51.52 MB/sec  execute   5 sec   
  30      9246    51.10 MB/sec  execute   6 sec   
  30      9544    51.43 MB/sec  execute   7 sec   
  30      9848    50.67 MB/sec  execute   8 sec   
  30     10068    49.09 MB/sec  execute   9 sec   
  30     10386    46.18 MB/sec  execute  11 sec   
  30     10695    47.04 MB/sec  execute  12 sec   
  30     10875    45.13 MB/sec  execute  13 sec   
  30     11177    45.71 MB/sec  execute  14 sec   
  30     11484    46.09 MB/sec  execute  15 sec   
  30     11807    46.10 MB/sec  execute  16 sec   
  30     12103    46.36 MB/sec  execute  17 sec   
  30     12405    46.66 MB/sec  execute  18 sec   
  30     12735    46.73 MB/sec  execute  19 sec   
  30     13569    46.71 MB/sec  execute  22 sec   
  30     13878    47.14 MB/sec  execute  23 sec   
  30     14149    47.36 MB/sec  execute  24 sec   
  30     14442    47.34 MB/sec  execute  25 sec   
  30     14759    47.20 MB/sec  execute  26 sec   
  30     14961    46.57 MB/sec  execute  27 sec   
  30     15008    43.28 MB/sec  execute  29 sec   
  30     15234    33.64 MB/sec  execute  38 sec   
  30     15234    32.77 MB/sec  execute  39 sec   
  30     15395    26.75 MB/sec  execute  49 sec   
  30     15421    26.24 MB/sec  execute  50 sec   
  30     15421    25.73 MB/sec  execute  51 sec   
  30     15559    22.94 MB/sec  execute  58 sec   
  30     15578    22.63 MB/sec  execute  59 sec   
  30     15665    22.51 MB/sec  execute  60 sec   
  30     15833    22.59 MB/sec  execute  61 sec   
  30     15833    22.22 MB/sec  execute  62 sec   
  30     15833    21.87 MB/sec  execute  63 sec   
  30     15920    19.76 MB/sec  execute  71 sec   
  30     15923    19.49 MB/sec  execute  72 sec   
  30     16229    19.88 MB/sec  execute  73 sec   
  30     16549    20.32 MB/sec  execute  74 sec   
  30     16806    20.58 MB/sec  execute  75 sec   
  30     16935    20.60 MB/sec  execute  76 sec   
  30     16942    20.34 MB/sec  execute  77 sec   
  30     17042    20.27 MB/sec  execute  78 sec   
  30     17125    20.14 MB/sec  execute  79 sec   
  30     17160    19.95 MB/sec  execute  80 sec   
  30     17201    19.81 MB/sec  execute  81 sec   
  30     17201    19.57 MB/sec  execute  82 sec   
  30     17212    19.35 MB/sec  execute  83 sec   
  30     17228    19.14 MB/sec  execute  84 sec   
  30     17376    19.22 MB/sec  execute  85 sec   
  30     17412    19.06 MB/sec  execute  86 sec   
  30     17624    17.30 MB/sec  execute  97 sec   
  30     17734    17.33 MB/sec  execute  98 sec   
  30     17817    17.32 MB/sec  execute  99 sec   
  30     17872    17.24 MB/sec  execute 100 sec   
  30     17912    17.09 MB/sec  execute 101 sec   
  30     17920    16.94 MB/sec  execute 102 sec   
  30     17922    16.76 MB/sec  execute 103 sec   
  30     18159    16.94 MB/sec  execute 104 sec   
  30     18298    16.98 MB/sec  execute 105 sec   
  30     18405    16.98 MB/sec  execute 106 sec   
  30     18512    17.00 MB/sec  execute 107 sec   
  30     18618    17.00 MB/sec  execute 108 sec   
  30     18715    16.98 MB/sec  execute 109 sec   
  30     18758    15.53 MB/sec  execute 119 sec   
  30     18758    14.94 MB/sec  execute 124 sec   
  30     18758    14.82 MB/sec  execute 125 sec   
  30     18758    14.70 MB/sec  execute 126 sec   
  30     18851    14.78 MB/sec  execute 127 sec   
  30     19019    14.87 MB/sec  execute 128 sec   
  30     19143    14.93 MB/sec  execute 129 sec   
  30     19301    15.00 MB/sec  execute 130 sec   
  30     19458    15.02 MB/sec  execute 131 sec   
  30     19636    15.10 MB/sec  execute 132 sec   
  30     19814    15.25 MB/sec  execute 133 sec   
  30     20000    15.36 MB/sec  execute 134 sec   
  30     20210    15.49 MB/sec  execute 135 sec   
  30     20419    15.64 MB/sec  execute 136 sec   
  30     20664    15.81 MB/sec  execute 137 sec   
  30     20910    15.95 MB/sec  execute 138 sec   
  30     21142    16.11 MB/sec  execute 139 sec   
  30     21377    16.29 MB/sec  execute 140 sec   
  30     21487    16.28 MB/sec  execute 141 sec   
  30     21516    16.20 MB/sec  execute 142 sec   
  30     21566    16.14 MB/sec  execute 143 sec   
  30     21752    16.24 MB/sec  execute 144 sec   
  30     22045    16.45 MB/sec  execute 145 sec   
  30     22285    15.60 MB/sec  execute 155 sec   
  30     22506    15.74 MB/sec  execute 156 sec   
  30     22740    15.79 MB/sec  execute 157 sec   
  30     22761    15.71 MB/sec  execute 158 sec   
  30     22761    15.61 MB/sec  execute 159 sec   
  30     22807    15.57 MB/sec  execute 160 sec   
  30     22854    15.56 MB/sec  execute 161 sec   
  30     24444    15.92 MB/sec  execute 174 sec   
  30     24744    16.05 MB/sec  execute 175 sec   
  30     25033    16.29 MB/sec  execute 176 sec   
  30     25233    16.37 MB/sec  execute 177 sec   
  30     25413    15.58 MB/sec  execute 188 sec   
  30     25413    15.50 MB/sec  execute 189 sec   
  30     25510    15.50 MB/sec  execute 190 sec   
  30     25575    15.46 MB/sec  execute 191 sec   
  30     25780    15.58 MB/sec  execute 192 sec   
  30     26020    15.66 MB/sec  execute 193 sec   
  30     26228    15.77 MB/sec  execute 194 sec   
  30     26496    15.89 MB/sec  execute 195 sec   
  30     26758    16.04 MB/sec  execute 196 sec   
  30     26976    16.16 MB/sec  execute 197 sec   
  30     27206    16.25 MB/sec  execute 198 sec   
  30     27417    16.35 MB/sec  execute 199 sec   
  30     27651    16.42 MB/sec  execute 200 sec   
  30     27875    16.53 MB/sec  execute 201 sec   
  30     28118    16.68 MB/sec  execute 202 sec   
  30     28283    16.71 MB/sec  execute 203 sec   
  30     28285    16.63 MB/sec  execute 204 sec   
  30     28299    16.55 MB/sec  execute 205 sec   
  30     28381    16.53 MB/sec  execute 206 sec   
  30     28429    16.49 MB/sec  execute 207 sec   
  30     28484    16.46 MB/sec  execute 208 sec   
  30     28563    16.45 MB/sec  execute 209 sec   
  30     28563    16.38 MB/sec  execute 210 sec   
  30     28563    16.30 MB/sec  execute 211 sec   
  30     28710    15.47 MB/sec  execute 224 sec   
  30     28740    15.42 MB/sec  execute 225 sec   
  30     28756    15.37 MB/sec  execute 226 sec   
  30     28808    15.33 MB/sec  execute 227 sec   
  30     28993    15.36 MB/sec  execute 228 sec   
  30     28993    15.29 MB/sec  execute 229 sec   
  30     29136    15.32 MB/sec  execute 230 sec   
  30     29136    15.26 MB/sec  execute 231 sec   
  30     29136    15.19 MB/sec  execute 232 sec   
  30     29136    15.12 MB/sec  execute 233 sec   
  30     29136    15.06 MB/sec  execute 234 sec   
  30     29136    15.00 MB/sec  execute 235 sec   
  30     29273    14.34 MB/sec  execute 248 sec   
  30     29292    14.31 MB/sec  execute 249 sec   
  30     29295    14.25 MB/sec  execute 250 sec   
  30     29327    14.21 MB/sec  execute 251 sec   
  30     29568    14.31 MB/sec  execute 252 sec   
  30     29837    14.43 MB/sec  execute 253 sec   
  30     30141    14.56 MB/sec  execute 254 sec   
  30     30309    14.62 MB/sec  execute 255 sec   
  30     30346    14.58 MB/sec  execute 256 sec   
  30     30347    14.51 MB/sec  execute 257 sec   
  30     30349    14.45 MB/sec  execute 258 sec   
  30     30423    14.46 MB/sec  execute 259 sec   
  30     30542    14.49 MB/sec  execute 260 sec   
  30     30656    14.51 MB/sec  execute 261 sec   
  30     30845    14.53 MB/sec  execute 262 sec   
  30     31008    14.60 MB/sec  execute 263 sec   
  30     31127    14.62 MB/sec  execute 264 sec   
  30     31170    14.58 MB/sec  execute 265 sec   
  30     31386    14.24 MB/sec  execute 274 sec   
  30     31521    14.27 MB/sec  execute 275 sec   
  30     31680    14.32 MB/sec  execute 276 sec   
  30     31895    14.38 MB/sec  execute 277 sec   
  30     32031    14.40 MB/sec  execute 278 sec   
  30     32168    14.40 MB/sec  execute 279 sec   
  30     32414    14.51 MB/sec  execute 280 sec   
  30     32667    14.59 MB/sec  execute 281 sec   
  30     32945    14.69 MB/sec  execute 282 sec   
  30     33242    14.81 MB/sec  execute 283 sec   
  30     33536    14.92 MB/sec  execute 284 sec   
  30     33735    15.00 MB/sec  execute 285 sec   
  30     33965    15.07 MB/sec  execute 286 sec   
  30     34200    15.13 MB/sec  execute 287 sec   
  30     34455    15.21 MB/sec  execute 288 sec   
  30     34524    15.20 MB/sec  execute 289 sec   
  30     34528    15.15 MB/sec  execute 290 sec   
  30     34544    15.11 MB/sec  execute 291 sec   
  30     34632    15.09 MB/sec  execute 292 sec   
  30     34901    15.19 MB/sec  execute 293 sec   
  30     35218    15.32 MB/sec  execute 294 sec   
  30     35515    15.43 MB/sec  execute 295 sec   
  30     35825    15.54 MB/sec  execute 296 sec   
  30     36138    15.67 MB/sec  execute 297 sec   
  30     36433    15.78 MB/sec  execute 298 sec   
  30     36724    15.90 MB/sec  execute 299 sec   
  30     37025    16.01 MB/sec  execute 300 sec   
  30     37336    16.09 MB/sec  execute 301 sec   
  30     37372    16.05 MB/sec  execute 302 sec   
  30     37625    15.57 MB/sec  execute 314 sec   
  30     37637    15.52 MB/sec  execute 315 sec   
  30     37652    15.48 MB/sec  execute 316 sec   
  30     37654    15.43 MB/sec  execute 317 sec   
  30     37654    15.38 MB/sec  execute 318 sec   
  30     37654    15.33 MB/sec  execute 319 sec   
  30     37654    15.29 MB/sec  execute 320 sec   
  30     37654    15.24 MB/sec  execute 321 sec   
  30     37654    15.19 MB/sec  execute 322 sec   
  30     37654    15.14 MB/sec  execute 323 sec   
  30     37654    15.10 MB/sec  execute 324 sec   
  30     37654    15.05 MB/sec  execute 325 sec   
  30     37654    15.00 MB/sec  execute 326 sec   
  30     37654    14.95 MB/sec  execute 327 sec   
  30     37691    14.92 MB/sec  execute 328 sec   
  30     37779    14.94 MB/sec  execute 329 sec   
  30     37963    14.99 MB/sec  execute 330 sec   
  30     38159    15.05 MB/sec  execute 331 sec   
  30     38352    15.11 MB/sec  execute 332 sec   
  30     38549    15.18 MB/sec  execute 333 sec   
  30     38733    15.20 MB/sec  execute 334 sec   
  30     38873    15.21 MB/sec  execute 335 sec   
  30     39050    15.25 MB/sec  execute 336 sec   
  30     39197    15.28 MB/sec  execute 337 sec   
  30     39289    15.27 MB/sec  execute 338 sec   
  30     39297    15.23 MB/sec  execute 339 sec   
  30     39554    15.25 MB/sec  execute 342 sec   
  30     39584    15.21 MB/sec  execute 343 sec   
  30     39587    15.16 MB/sec  execute 344 sec   
  30     39587    15.12 MB/sec  execute 345 sec   
  30     39587    15.08 MB/sec  execute 346 sec   
  30     39587    15.03 MB/sec  execute 347 sec   
  30     39587    14.99 MB/sec  execute 348 sec   
  30     39587    14.95 MB/sec  execute 349 sec   
  30     39696    14.67 MB/sec  execute 358 sec   
  30     39801    14.68 MB/sec  execute 359 sec   
  30     39927    14.71 MB/sec  execute 360 sec   
  30     40050    14.71 MB/sec  execute 361 sec   
  30     40162    14.72 MB/sec  execute 362 sec   
  30     40289    14.73 MB/sec  execute 363 sec   
  30     40451    14.78 MB/sec  execute 364 sec   
  30     40666    14.81 MB/sec  execute 365 sec   
  30     40864    14.86 MB/sec  execute 366 sec   
  30     41030    14.89 MB/sec  execute 367 sec   
  30     41073    14.86 MB/sec  execute 368 sec   
  30     41095    14.83 MB/sec  execute 369 sec   
  30     41107    14.79 MB/sec  execute 370 sec   
  30     41107    14.75 MB/sec  execute 371 sec   
  30     41107    14.71 MB/sec  execute 372 sec   
  30     41107    14.67 MB/sec  execute 373 sec   
  30     41107    14.64 MB/sec  execute 374 sec   
  30     41107    14.60 MB/sec  execute 375 sec   
  30     41107    14.56 MB/sec  execute 376 sec   
  30     41107    14.52 MB/sec  execute 377 sec   
  30     41107    14.48 MB/sec  execute 378 sec   
  30     41107    14.44 MB/sec  execute 379 sec   
  30     41107    14.40 MB/sec  execute 380 sec   
  30     41232    14.19 MB/sec  execute 388 sec   
  30     41458    14.10 MB/sec  execute 392 sec   
  30     41549    14.10 MB/sec  execute 393 sec   
  30     41549    14.06 MB/sec  execute 394 sec   
  30     41549    14.03 MB/sec  execute 395 sec   
  30     41549    13.99 MB/sec  execute 396 sec   
  30     41549    13.96 MB/sec  execute 397 sec   
  30     42290    13.74 MB/sec  execute 412 sec   
  30     42290    13.71 MB/sec  execute 413 sec   
  30     42290    13.67 MB/sec  execute 414 sec   
  30     42290    13.64 MB/sec  execute 415 sec   
  30     42290    13.61 MB/sec  execute 416 sec   
  30     42458    13.37 MB/sec  execute 426 sec   
  30     42607    13.39 MB/sec  execute 427 sec   
  30     42765    13.43 MB/sec  execute 428 sec   
  30     42961    13.45 MB/sec  execute 429 sec   
  30     43148    13.50 MB/sec  execute 430 sec   
  30     43339    13.55 MB/sec  execute 431 sec   
  30     43438    13.54 MB/sec  execute 432 sec   
  30     43451    13.51 MB/sec  execute 433 sec   
  30     43469    13.49 MB/sec  execute 434 sec   
  30     43532    13.48 MB/sec  execute 435 sec   
  30     43534    13.45 MB/sec  execute 436 sec   
  30     43534    13.42 MB/sec  execute 437 sec   
  30     43534    13.39 MB/sec  execute 438 sec   
  30     43534    13.36 MB/sec  execute 439 sec   
  30     43534    13.33 MB/sec  execute 440 sec   
  30     43653    13.03 MB/sec  execute 451 sec   
  30     43707    13.02 MB/sec  execute 452 sec   
  30     43727    13.01 MB/sec  execute 453 sec   
  30     43729    12.98 MB/sec  execute 454 sec   
  30     43756    12.96 MB/sec  execute 455 sec   
  30     43763    12.93 MB/sec  execute 456 sec   
  30     43789    12.92 MB/sec  execute 457 sec   
  30     43828    12.90 MB/sec  execute 458 sec   
  30     43875    12.89 MB/sec  execute 459 sec   
  30     43879    12.86 MB/sec  execute 460 sec   
  30     43879    12.84 MB/sec  execute 461 sec   
  30     43879    12.81 MB/sec  execute 462 sec   
  30     43879    12.78 MB/sec  execute 463 sec   
  30     43879    12.75 MB/sec  execute 464 sec   
  30     44099    12.54 MB/sec  execute 475 sec   
  30     44193    12.55 MB/sec  execute 476 sec   
  30     44252    12.55 MB/sec  execute 477 sec   
  30     44302    12.53 MB/sec  execute 478 sec   
  30     44313    12.51 MB/sec  execute 479 sec   
  30     44313    12.48 MB/sec  execute 480 sec   
  30     44313    12.45 MB/sec  execute 481 sec   
  30     44313    12.43 MB/sec  execute 482 sec   
  30     44511    12.29 MB/sec  execute 490 sec   
  30     44610    12.30 MB/sec  execute 491 sec   
  30     44691    12.30 MB/sec  execute 492 sec   
  30     44772    12.32 MB/sec  execute 493 sec   
  30     44921    12.34 MB/sec  execute 494 sec   
  30     45064    12.36 MB/sec  execute 495 sec   
  30     45169    12.36 MB/sec  execute 496 sec   
  30     45266    12.36 MB/sec  execute 497 sec   
  30     45377    12.37 MB/sec  execute 498 sec   
  30     45418    12.36 MB/sec  execute 499 sec   
  30     45419    12.33 MB/sec  execute 500 sec   
  30     45419    12.31 MB/sec  execute 501 sec   
  30     45479    12.30 MB/sec  execute 502 sec   
  30     45584    12.31 MB/sec  execute 503 sec   
  30     45584    12.28 MB/sec  execute 504 sec   
  30     45584    12.26 MB/sec  execute 505 sec   
  30     45584    12.23 MB/sec  execute 506 sec   
  30     45584    12.21 MB/sec  execute 507 sec   
  30     45584    12.19 MB/sec  execute 508 sec   
  30     45584    12.16 MB/sec  execute 509 sec   
  30     45584    12.14 MB/sec  execute 510 sec   
  30     45613    12.13 MB/sec  execute 511 sec   
  30     45756    11.96 MB/sec  execute 520 sec   
  30     45783    11.95 MB/sec  execute 521 sec   
  30     45795    11.93 MB/sec  execute 522 sec   
  30     45795    11.91 MB/sec  execute 523 sec   
  30     45814    11.89 MB/sec  execute 524 sec   
  30     45927    11.90 MB/sec  execute 525 sec   
  30     46076    11.93 MB/sec  execute 526 sec   
  30     46229    11.95 MB/sec  execute 527 sec   
  30     46367    11.97 MB/sec  execute 528 sec   
  30     46510    11.99 MB/sec  execute 529 sec   
  30     46531    11.98 MB/sec  execute 530 sec   
  30     46685    11.87 MB/sec  execute 537 sec   
  30     47084    11.79 MB/sec  execute 546 sec   
  30     47084    11.77 MB/sec  execute 547 sec   
  30     47084    11.75 MB/sec  execute 548 sec   
  30     47084    11.73 MB/sec  execute 549 sec   
  30     47084    11.71 MB/sec  execute 550 sec   
  30     47084    11.69 MB/sec  execute 551 sec   
  30     47108    11.67 MB/sec  execute 552 sec   
  30     47170    11.67 MB/sec  execute 553 sec   
  30     47258    11.47 MB/sec  execute 564 sec   
  30     47335    11.46 MB/sec  execute 565 sec   
  30     47346    11.44 MB/sec  execute 566 sec   
  30     47346    11.42 MB/sec  execute 567 sec   
  30     47346    11.40 MB/sec  execute 568 sec   
  30     47346    11.38 MB/sec  execute 569 sec   
  30     47346    11.36 MB/sec  execute 570 sec   
  30     47346    11.34 MB/sec  execute 571 sec   
  30     47527    11.27 MB/sec  execute 577 sec   
  30     47610    11.28 MB/sec  execute 578 sec   
  30     47658    11.29 MB/sec  execute 579 sec   
  30     47778    11.30 MB/sec  execute 580 sec   
  30     47912    11.32 MB/sec  execute 581 sec   
  30     48003    11.34 MB/sec  execute 582 sec   
  30     48125    11.36 MB/sec  execute 583 sec   
  30     48236    11.36 MB/sec  execute 584 sec   
  30     48445    11.38 MB/sec  execute 585 sec   
  30     48705    11.44 MB/sec  execute 586 sec   
  30     48750    11.43 MB/sec  execute 587 sec   
  30     48774    11.41 MB/sec  execute 588 sec   
  30     48817    11.40 MB/sec  execute 589 sec   
  30     48949    11.28 MB/sec  execute 597 sec   
  30     49124    11.31 MB/sec  execute 598 sec   
  30     49341    11.35 MB/sec  execute 599 sec   
  30     49649    11.41 MB/sec  cleanup 600 sec   
  30     49649    11.39 MB/sec  cleanup 601 sec   
  30     49649    11.37 MB/sec  cleanup 602 sec   
  30     49649    11.35 MB/sec  cleanup 603 sec   
  30     49649    11.33 MB/sec  cleanup 604 sec   
  30     49649    11.31 MB/sec  cleanup 605 sec   

Throughput 11.4073 MB/sec 30 procs

[-- Attachment #5: pdflush-hda1.log --]
[-- Type: text/plain, Size: 28719 bytes --]

dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 600 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 120 secs
30 clients started
  30        48    56.54 MB/sec  warmup   1 sec   
  30        48    28.00 MB/sec  warmup   2 sec   
  30        48    18.73 MB/sec  warmup   3 sec   
  30        48    14.07 MB/sec  warmup   4 sec   
  30        48    11.26 MB/sec  warmup   5 sec   
  30        48     9.39 MB/sec  warmup   6 sec   
  30        50     4.38 MB/sec  warmup  13 sec   
  30        50     2.57 MB/sec  warmup  23 sec   
  30        59     2.65 MB/sec  warmup  27 sec   
  30        59     2.56 MB/sec  warmup  28 sec   
  30        65     2.17 MB/sec  warmup  37 sec   
  30        68     2.06 MB/sec  warmup  42 sec   
  30        69     1.67 MB/sec  warmup  52 sec   
  30        72     1.52 MB/sec  warmup  60 sec   
  30        81     1.51 MB/sec  warmup  70 sec   
  30        86     1.48 MB/sec  warmup  76 sec   
  30       104     1.54 MB/sec  warmup  88 sec   
  30       121     1.71 MB/sec  warmup  89 sec   
  30       121     1.69 MB/sec  warmup  90 sec   
  30       125     1.72 MB/sec  warmup  91 sec   
  30       370     2.26 MB/sec  warmup  92 sec   
  30       681     2.90 MB/sec  warmup  93 sec   
  30       868     3.29 MB/sec  warmup  94 sec   
  30       879     3.38 MB/sec  warmup  95 sec   
  30       886     3.40 MB/sec  warmup  96 sec   
  30       902     3.41 MB/sec  warmup  97 sec   
  30       902     3.37 MB/sec  warmup  98 sec   
  30       902     3.34 MB/sec  warmup  99 sec   
  30       902     3.31 MB/sec  warmup 100 sec   
  30       902     3.27 MB/sec  warmup 101 sec   
  30       902     3.24 MB/sec  warmup 102 sec   
  30       961     3.67 MB/sec  warmup 103 sec   
  30       983     3.72 MB/sec  warmup 104 sec   
  30       983     3.69 MB/sec  warmup 105 sec   
  30       983     3.65 MB/sec  warmup 106 sec   
  30       983     3.62 MB/sec  warmup 107 sec   
  30       983     3.59 MB/sec  warmup 108 sec   
  30       990     3.61 MB/sec  warmup 109 sec   
  30      1073     3.95 MB/sec  warmup 110 sec   
  30      1078     3.95 MB/sec  warmup 111 sec   
  30      1078     3.91 MB/sec  warmup 112 sec   
  30      1078     3.88 MB/sec  warmup 113 sec   
  30      1078     3.84 MB/sec  warmup 114 sec   
  30      1078     3.81 MB/sec  warmup 115 sec   
  30      1078     3.78 MB/sec  warmup 116 sec   
  30      1079     3.77 MB/sec  warmup 117 sec   
  30      1079     3.74 MB/sec  warmup 118 sec   
  30      1079     3.71 MB/sec  warmup 119 sec   
  30      1079     0.00 MB/sec  execute   1 sec   
  30      1089     4.10 MB/sec  execute   2 sec   
  30      1099     5.52 MB/sec  execute   3 sec   
  30      1107     6.18 MB/sec  execute   4 sec   
  30      1148     7.56 MB/sec  execute   5 sec   
  30      1151     6.70 MB/sec  execute   6 sec   
  30      1151     5.74 MB/sec  execute   7 sec   
  30      1151     5.02 MB/sec  execute   8 sec   
  30      1151     4.46 MB/sec  execute   9 sec   
  30      1151     4.02 MB/sec  execute  10 sec   
  30      1151     3.66 MB/sec  execute  11 sec   
  30      1164     3.91 MB/sec  execute  12 sec   
  30      1172     4.23 MB/sec  execute  13 sec   
  30      1305     7.88 MB/sec  execute  15 sec   
  30      1324     7.53 MB/sec  execute  16 sec   
  30      1428     7.96 MB/sec  execute  17 sec   
  30      1793    10.90 MB/sec  execute  18 sec   
  30      2150    13.32 MB/sec  execute  19 sec   
  30      2514    15.80 MB/sec  execute  20 sec   
  30      2891    17.63 MB/sec  execute  21 sec   
  30      3109    18.79 MB/sec  execute  22 sec   
  30      3173    19.45 MB/sec  execute  23 sec   
  30      3392    20.10 MB/sec  execute  24 sec   
  30      3435    19.51 MB/sec  execute  25 sec   
  30      3435    18.76 MB/sec  execute  26 sec   
  30      3537    18.72 MB/sec  execute  27 sec   
  30      3742    19.16 MB/sec  execute  28 sec   
  30      3790    18.80 MB/sec  execute  29 sec   
  30      3801    18.19 MB/sec  execute  30 sec   
  30      3801    17.61 MB/sec  execute  31 sec   
  30      3801    17.06 MB/sec  execute  32 sec   
  30      3802    16.54 MB/sec  execute  33 sec   
  30      3838    16.27 MB/sec  execute  34 sec   
  30      3863    15.94 MB/sec  execute  35 sec   
  30      4080    16.43 MB/sec  execute  36 sec   
  30      4185    16.61 MB/sec  execute  37 sec   
  30      4254    16.47 MB/sec  execute  38 sec   
  30      4254    16.05 MB/sec  execute  39 sec   
  30      4254    15.65 MB/sec  execute  40 sec   
  30      4261    15.41 MB/sec  execute  41 sec   
  30      4266    15.16 MB/sec  execute  42 sec   
  30      6902    19.95 MB/sec  execute  56 sec   
  30      6997    20.12 MB/sec  execute  57 sec   
  30      6998    19.79 MB/sec  execute  58 sec   
  30      7060    19.63 MB/sec  execute  59 sec   
  30      7234    19.72 MB/sec  execute  60 sec   
  30      7580    20.44 MB/sec  execute  61 sec   
  30      7943    21.01 MB/sec  execute  62 sec   
  30      8298    21.60 MB/sec  execute  63 sec   
  30      8659    22.19 MB/sec  execute  64 sec   
  30      8986    22.73 MB/sec  execute  65 sec   
  30      9347    23.33 MB/sec  execute  66 sec   
  30      9713    23.84 MB/sec  execute  67 sec   
  30      9968    24.12 MB/sec  execute  68 sec   
  30      9968    23.77 MB/sec  execute  69 sec   
  30      9968    23.43 MB/sec  execute  70 sec   
  30      9968    23.10 MB/sec  execute  71 sec   
  30      9968    22.77 MB/sec  execute  72 sec   
  30     10001    22.57 MB/sec  execute  73 sec   
  30     10013    22.31 MB/sec  execute  74 sec   
  30     10078    22.19 MB/sec  execute  75 sec   
  30     10109    22.04 MB/sec  execute  76 sec   
  30     10113    21.76 MB/sec  execute  77 sec   
  30     10197    21.58 MB/sec  execute  78 sec   
  30     10198    21.31 MB/sec  execute  79 sec   
  30     10209    21.05 MB/sec  execute  80 sec   
  30     10209    20.78 MB/sec  execute  81 sec   
  30     10510    19.28 MB/sec  execute  91 sec   
  30     10698    19.43 MB/sec  execute  92 sec   
  30     10746    19.31 MB/sec  execute  93 sec   
  30     10944    19.44 MB/sec  execute  94 sec   
  30     11068    19.47 MB/sec  execute  95 sec   
  30     11077    19.27 MB/sec  execute  96 sec   
  30     11166    19.22 MB/sec  execute  97 sec   
  30     11166    19.02 MB/sec  execute  98 sec   
  30     11166    18.83 MB/sec  execute  99 sec   
  30     11166    18.64 MB/sec  execute 100 sec   
  30     11272    18.62 MB/sec  execute 101 sec   
  30     11438    18.71 MB/sec  execute 102 sec   
  30     11633    18.92 MB/sec  execute 103 sec   
  30     11783    18.96 MB/sec  execute 104 sec   
  30     12021    19.17 MB/sec  execute 105 sec   
  30     13712    19.37 MB/sec  execute 117 sec   
  30     13943    19.49 MB/sec  execute 118 sec   
  30     13993    19.37 MB/sec  execute 119 sec   
  30     14021    19.23 MB/sec  execute 120 sec   
  30     14031    19.10 MB/sec  execute 121 sec   
  30     14037    18.95 MB/sec  execute 122 sec   
  30     14070    18.85 MB/sec  execute 123 sec   
  30     14413    19.15 MB/sec  execute 124 sec   
  30     14729    19.37 MB/sec  execute 125 sec   
  30     15089    19.68 MB/sec  execute 126 sec   
  30     15451    20.00 MB/sec  execute 127 sec   
  30     15814    20.30 MB/sec  execute 128 sec   
  30     16178    20.60 MB/sec  execute 129 sec   
  30     16212    20.50 MB/sec  execute 130 sec   
  30     16270    20.42 MB/sec  execute 131 sec   
  30     16278    20.28 MB/sec  execute 132 sec   
  30     16632    20.56 MB/sec  execute 133 sec   
  30     16981    20.85 MB/sec  execute 134 sec   
  30     17315    21.11 MB/sec  execute 135 sec   
  30     17595    21.27 MB/sec  execute 136 sec   
  30     17920    21.57 MB/sec  execute 137 sec   
  30     18231    21.77 MB/sec  execute 138 sec   
  30     18605    22.02 MB/sec  execute 139 sec   
  30     18977    22.24 MB/sec  execute 140 sec   
  30     19181    22.34 MB/sec  execute 141 sec   
  30     19232    22.20 MB/sec  execute 142 sec   
  30     19481    22.34 MB/sec  execute 143 sec   
  30     19724    22.45 MB/sec  execute 144 sec   
  30     19831    22.43 MB/sec  execute 145 sec   
  30     19943    22.38 MB/sec  execute 146 sec   
  30     20025    22.31 MB/sec  execute 147 sec   
  30     20091    22.24 MB/sec  execute 148 sec   
  30     20175    22.18 MB/sec  execute 149 sec   
  30     20252    22.12 MB/sec  execute 150 sec   
  30     20348    22.08 MB/sec  execute 151 sec   
  30     20454    22.03 MB/sec  execute 152 sec   
  30     20543    21.99 MB/sec  execute 153 sec   
  30     20653    21.97 MB/sec  execute 154 sec   
  30     20706    21.88 MB/sec  execute 155 sec   
  30     20758    21.80 MB/sec  execute 156 sec   
  30     20790    21.70 MB/sec  execute 157 sec   
  30     20828    21.61 MB/sec  execute 158 sec   
  30     21111    21.76 MB/sec  execute 159 sec   
  30     21312    21.79 MB/sec  execute 160 sec   
  30     21414    20.90 MB/sec  execute 168 sec   
  30     21414    20.78 MB/sec  execute 169 sec   
  30     21414    20.66 MB/sec  execute 170 sec   
  30     21414    20.54 MB/sec  execute 171 sec   
  30     21414    20.42 MB/sec  execute 172 sec   
  30     21414    20.30 MB/sec  execute 173 sec   
  30     21414    20.18 MB/sec  execute 174 sec   
  30     21414    20.07 MB/sec  execute 175 sec   
  30     21414    19.95 MB/sec  execute 176 sec   
  30     21443    19.89 MB/sec  execute 177 sec   
  30     21539    19.89 MB/sec  execute 178 sec   
  30     21630    19.88 MB/sec  execute 179 sec   
  30     21677    19.81 MB/sec  execute 180 sec   
  30     21906    19.89 MB/sec  execute 181 sec   
  30     22088    19.95 MB/sec  execute 182 sec   
  30     22242    19.98 MB/sec  execute 183 sec   
  30     22461    20.08 MB/sec  execute 184 sec   
  30     22547    20.05 MB/sec  execute 185 sec   
  30     22566    19.95 MB/sec  execute 186 sec   
  30     22572    19.85 MB/sec  execute 187 sec   
  30     22572    19.74 MB/sec  execute 188 sec   
  30     22572    19.64 MB/sec  execute 189 sec   
  30     22573    19.53 MB/sec  execute 190 sec   
  30     22727    19.58 MB/sec  execute 191 sec   
  30     22952    19.66 MB/sec  execute 192 sec   
  30     23302    19.83 MB/sec  execute 193 sec   
  30     23326    19.75 MB/sec  execute 194 sec   
  30     23326    19.65 MB/sec  execute 195 sec   
  30     23639    19.28 MB/sec  execute 201 sec   
  30     23914    19.43 MB/sec  execute 202 sec   
  30     24221    19.58 MB/sec  execute 203 sec   
  30     24323    19.55 MB/sec  execute 204 sec   
  30     24361    19.49 MB/sec  execute 205 sec   
  30     24361    19.39 MB/sec  execute 206 sec   
  30     24361    19.30 MB/sec  execute 207 sec   
  30     24387    19.23 MB/sec  execute 208 sec   
  30     24387    19.14 MB/sec  execute 209 sec   
  30     24387    19.05 MB/sec  execute 210 sec   
  30     24387    18.96 MB/sec  execute 211 sec   
  30     24387    18.87 MB/sec  execute 212 sec   
  30     24387    18.78 MB/sec  execute 213 sec   
  30     24407    18.69 MB/sec  execute 214 sec   
  30     24517    18.69 MB/sec  execute 215 sec   
  30     24647    18.69 MB/sec  execute 216 sec   
  30     24907    18.80 MB/sec  execute 217 sec   
  30     25262    18.97 MB/sec  execute 218 sec   
  30     25610    19.15 MB/sec  execute 219 sec   
  30     25698    19.15 MB/sec  execute 220 sec   
  30     25804    19.10 MB/sec  execute 221 sec   
  30     26017    19.19 MB/sec  execute 222 sec   
  30     26179    19.27 MB/sec  execute 223 sec   
  30     26291    19.26 MB/sec  execute 224 sec   
  30     26448    19.26 MB/sec  execute 225 sec   
  30     26801    19.45 MB/sec  execute 226 sec   
  30     26998    19.51 MB/sec  execute 227 sec   
  30     27023    19.44 MB/sec  execute 228 sec   
  30     27057    19.36 MB/sec  execute 230 sec   
  30     27057    19.28 MB/sec  execute 231 sec   
  30     27057    19.20 MB/sec  execute 232 sec   
  30     27173    19.15 MB/sec  execute 233 sec   
  30     27173    19.07 MB/sec  execute 234 sec   
  30     27182    18.99 MB/sec  execute 235 sec   
  30     27182    18.91 MB/sec  execute 236 sec   
  30     27182    18.83 MB/sec  execute 237 sec   
  30     27182    18.75 MB/sec  execute 238 sec   
  30     27182    18.67 MB/sec  execute 239 sec   
  30     27182    18.60 MB/sec  execute 240 sec   
  30     27182    18.52 MB/sec  execute 241 sec   
  30     27182    18.44 MB/sec  execute 242 sec   
  30     27182    18.37 MB/sec  execute 243 sec   
  30     27182    18.29 MB/sec  execute 244 sec   
  30     27183    18.22 MB/sec  execute 245 sec   
  30     27185    18.14 MB/sec  execute 246 sec   
  30     27185    18.07 MB/sec  execute 247 sec   
  30     27185    18.00 MB/sec  execute 248 sec   
  30     27185    17.93 MB/sec  execute 249 sec   
  30     27185    17.86 MB/sec  execute 250 sec   
  30     27185    17.79 MB/sec  execute 251 sec   
  30     27187    17.72 MB/sec  execute 252 sec   
  30     27187    17.65 MB/sec  execute 253 sec   
  30     27199    17.58 MB/sec  execute 254 sec   
  30     27228    17.53 MB/sec  execute 255 sec   
  30     27261    17.48 MB/sec  execute 256 sec   
  30     27296    17.14 MB/sec  execute 261 sec   
  30     27296    17.07 MB/sec  execute 262 sec   
  30     27296    17.01 MB/sec  execute 263 sec   
  30     27296    16.94 MB/sec  execute 264 sec   
  30     27296    16.88 MB/sec  execute 265 sec   
  30     27300    16.82 MB/sec  execute 266 sec   
  30     27308    16.78 MB/sec  execute 267 sec   
  30     27308    16.71 MB/sec  execute 268 sec   
  30     27308    16.65 MB/sec  execute 269 sec   
  30     27423    16.64 MB/sec  execute 270 sec   
  30     27720    16.77 MB/sec  execute 271 sec   
  30     27951    16.85 MB/sec  execute 272 sec   
  30     28036    16.88 MB/sec  execute 273 sec   
  30     28182    16.91 MB/sec  execute 274 sec   
  30     28259    16.89 MB/sec  execute 275 sec   
  30     28308    16.86 MB/sec  execute 276 sec   
  30     28330    16.81 MB/sec  execute 277 sec   
  30     28330    16.75 MB/sec  execute 278 sec   
  30     28377    16.72 MB/sec  execute 279 sec   
  30     28539    16.74 MB/sec  execute 280 sec   
  30     28762    16.80 MB/sec  execute 281 sec   
  30     29077    16.92 MB/sec  execute 282 sec   
  30     29372    17.03 MB/sec  execute 283 sec   
  30     29642    17.14 MB/sec  execute 284 sec   
  30     29906    17.20 MB/sec  execute 285 sec   
  30     30263    17.34 MB/sec  execute 286 sec   
  30     30622    17.48 MB/sec  execute 287 sec   
  30     30981    17.63 MB/sec  execute 288 sec   
  30     31325    17.75 MB/sec  execute 289 sec   
  30     31685    17.90 MB/sec  execute 290 sec   
  30     32046    18.05 MB/sec  execute 291 sec   
  30     32185    18.06 MB/sec  execute 292 sec   
  30     32267    18.04 MB/sec  execute 293 sec   
  30     32415    18.06 MB/sec  execute 294 sec   
  30     32783    18.19 MB/sec  execute 295 sec   
  30     33145    18.32 MB/sec  execute 296 sec   
  30     33481    18.47 MB/sec  execute 297 sec   
  30     33666    18.53 MB/sec  execute 298 sec   
  30     33667    18.46 MB/sec  execute 299 sec   
  30     33785    18.46 MB/sec  execute 300 sec   
  30     33973    18.49 MB/sec  execute 301 sec   
  30     34219    18.54 MB/sec  execute 302 sec   
  30     34401    18.57 MB/sec  execute 303 sec   
  30     34428    18.53 MB/sec  execute 304 sec   
  30     34547    18.52 MB/sec  execute 305 sec   
  30     34559    18.46 MB/sec  execute 306 sec   
  30     34567    18.40 MB/sec  execute 307 sec   
  30     34579    18.34 MB/sec  execute 308 sec   
  30     34643    18.32 MB/sec  execute 309 sec   
  30     34794    18.33 MB/sec  execute 310 sec   
  30     34794    18.27 MB/sec  execute 311 sec   
  30     34794    18.22 MB/sec  execute 312 sec   
  30     34794    18.16 MB/sec  execute 313 sec   
  30     34854    18.15 MB/sec  execute 314 sec   
  30     35074    18.18 MB/sec  execute 315 sec   
  30     35264    18.22 MB/sec  execute 316 sec   
  30     35560    18.30 MB/sec  execute 317 sec   
  30     35832    18.40 MB/sec  execute 318 sec   
  30     36135    18.47 MB/sec  execute 319 sec   
  30     36442    18.58 MB/sec  execute 320 sec   
  30     36734    18.68 MB/sec  execute 321 sec   
  30     37050    18.76 MB/sec  execute 322 sec   
  30     37130    18.74 MB/sec  execute 323 sec   
  30     37131    18.68 MB/sec  execute 324 sec   
  30     37168    18.65 MB/sec  execute 325 sec   
  30     37261    18.63 MB/sec  execute 326 sec   
  30     37315    18.61 MB/sec  execute 327 sec   
  30     37608    18.53 MB/sec  execute 332 sec   
  30     37945    18.66 MB/sec  execute 333 sec   
  30     38283    18.76 MB/sec  execute 334 sec   
  30     38574    18.84 MB/sec  execute 335 sec   
  30     38649    18.83 MB/sec  execute 336 sec   
  30     38649    18.77 MB/sec  execute 337 sec   
  30     38987    18.67 MB/sec  execute 342 sec   
  30     39347    18.78 MB/sec  execute 343 sec   
  30     39666    18.89 MB/sec  execute 344 sec   
  30     39939    18.96 MB/sec  execute 345 sec   
  30     40084    18.98 MB/sec  execute 346 sec   
  30     40101    18.93 MB/sec  execute 347 sec   
  30     40101    18.88 MB/sec  execute 348 sec   
  30     40101    18.82 MB/sec  execute 349 sec   
  30     40101    18.77 MB/sec  execute 350 sec   
  30     40101    18.71 MB/sec  execute 351 sec   
  30     40103    18.67 MB/sec  execute 352 sec   
  30     40104    18.61 MB/sec  execute 353 sec   
  30     40715    18.12 MB/sec  execute 367 sec   
  30     40740    18.08 MB/sec  execute 368 sec   
  30     40758    18.04 MB/sec  execute 369 sec   
  30     40758    17.99 MB/sec  execute 370 sec   
  30     40758    17.94 MB/sec  execute 371 sec   
  30     40760    17.89 MB/sec  execute 372 sec   
  30     40837    17.88 MB/sec  execute 373 sec   
  30     41050    17.92 MB/sec  execute 374 sec   
  30     41051    17.88 MB/sec  execute 375 sec   
  30     41085    17.84 MB/sec  execute 376 sec   
  30     41428    17.94 MB/sec  execute 377 sec   
  30     41802    18.03 MB/sec  execute 378 sec   
  30     42136    18.15 MB/sec  execute 379 sec   
  30     42520    18.24 MB/sec  execute 380 sec   
  30     42832    18.34 MB/sec  execute 381 sec   
  30     43128    18.40 MB/sec  execute 382 sec   
  30     43397    18.48 MB/sec  execute 383 sec   
  30     43671    18.54 MB/sec  execute 384 sec   
  30     43902    18.59 MB/sec  execute 385 sec   
  30     43943    18.55 MB/sec  execute 386 sec   
  30     44051    18.55 MB/sec  execute 387 sec   
  30     44098    18.53 MB/sec  execute 388 sec   
  30     44099    18.48 MB/sec  execute 389 sec   
  30     44183    18.46 MB/sec  execute 390 sec   
  30     44223    18.43 MB/sec  execute 391 sec   
  30     44223    18.38 MB/sec  execute 392 sec   
  30     44226    18.33 MB/sec  execute 393 sec   
  30     44226    18.29 MB/sec  execute 394 sec   
  30     44226    18.24 MB/sec  execute 395 sec   
  30     44226    18.19 MB/sec  execute 396 sec   
  30     44226    18.15 MB/sec  execute 397 sec   
  30     44228    18.10 MB/sec  execute 398 sec   
  30     44229    18.06 MB/sec  execute 399 sec   
  30     44233    18.02 MB/sec  execute 400 sec   
  30     44461    18.09 MB/sec  execute 401 sec   
  30     44750    18.17 MB/sec  execute 402 sec   
  30     44835    18.17 MB/sec  execute 403 sec   
  30     44890    18.15 MB/sec  execute 404 sec   
  30     45089    18.18 MB/sec  execute 405 sec   
  30     45109    18.14 MB/sec  execute 406 sec   
  30     45322    18.18 MB/sec  execute 407 sec   
  30     45326    18.14 MB/sec  execute 408 sec   
  30     45353    18.12 MB/sec  execute 409 sec   
  30     45718    18.21 MB/sec  execute 410 sec   
  30     45791    18.19 MB/sec  execute 411 sec   
  30     45947    18.04 MB/sec  execute 416 sec   
  30     46023    18.02 MB/sec  execute 417 sec   
  30     46140    18.02 MB/sec  execute 418 sec   
  30     46384    18.07 MB/sec  execute 419 sec   
  30     46621    18.12 MB/sec  execute 420 sec   
  30     46858    18.18 MB/sec  execute 421 sec   
  30     47095    18.23 MB/sec  execute 422 sec   
  30     47334    18.28 MB/sec  execute 423 sec   
  30     47557    18.31 MB/sec  execute 424 sec   
  30     47799    18.36 MB/sec  execute 425 sec   
  30     47934    18.37 MB/sec  execute 426 sec   
  30     47934    18.33 MB/sec  execute 427 sec   
  30     47934    18.29 MB/sec  execute 428 sec   
  30     47934    18.24 MB/sec  execute 429 sec   
  30     47935    18.20 MB/sec  execute 430 sec   
  30     47945    18.17 MB/sec  execute 431 sec   
  30     47974    18.13 MB/sec  execute 432 sec   
  30     47982    18.10 MB/sec  execute 433 sec   
  30     48020    18.07 MB/sec  execute 434 sec   
  30     48362    18.11 MB/sec  execute 436 sec   
  30     48627    18.16 MB/sec  execute 437 sec   
  30     48794    18.19 MB/sec  execute 438 sec   
  30     49137    18.27 MB/sec  execute 439 sec   
  30     49429    18.34 MB/sec  execute 440 sec   
  30     49483    18.31 MB/sec  execute 441 sec   
  30     49564    18.30 MB/sec  execute 442 sec   
  30     49564    18.26 MB/sec  execute 443 sec   
  30     49564    18.22 MB/sec  execute 444 sec   
  30     49745    18.22 MB/sec  execute 445 sec   
  30     49869    18.24 MB/sec  execute 446 sec   
  30     49869    18.19 MB/sec  execute 447 sec   
  30     49869    18.15 MB/sec  execute 448 sec   
  30     49869    18.11 MB/sec  execute 449 sec   
  30     49869    18.07 MB/sec  execute 450 sec   
  30     49869    18.03 MB/sec  execute 451 sec   
  30     49966    18.01 MB/sec  execute 453 sec   
  30     50077    18.01 MB/sec  execute 454 sec   
  30     50436    18.10 MB/sec  execute 455 sec   
  30     50787    18.18 MB/sec  execute 456 sec   
  30     51046    18.23 MB/sec  execute 457 sec   
  30     51046    18.19 MB/sec  execute 458 sec   
  30     51046    18.15 MB/sec  execute 459 sec   
  30     51119    18.14 MB/sec  execute 460 sec   
  30     51262    18.15 MB/sec  execute 461 sec   
  30     51450    18.18 MB/sec  execute 462 sec   
  30     51692    18.21 MB/sec  execute 463 sec   
  30     51743    18.19 MB/sec  execute 464 sec   
  30     52015    18.27 MB/sec  execute 465 sec   
  30     52213    18.29 MB/sec  execute 466 sec   
  30     52388    18.33 MB/sec  execute 467 sec   
  30     52592    18.34 MB/sec  execute 468 sec   
  30     52843    18.38 MB/sec  execute 469 sec   
  30     53055    18.42 MB/sec  execute 470 sec   
  30     53346    18.45 MB/sec  execute 472 sec   
  30     53626    18.52 MB/sec  execute 473 sec   
  30     53893    18.55 MB/sec  execute 474 sec   
  30     54125    18.59 MB/sec  execute 475 sec   
  30     54148    18.55 MB/sec  execute 476 sec   
  30     54148    18.51 MB/sec  execute 477 sec   
  30     54215    18.50 MB/sec  execute 478 sec   
  30     54215    18.46 MB/sec  execute 479 sec   
  30     54215    18.42 MB/sec  execute 480 sec   
  30     54215    18.38 MB/sec  execute 481 sec   
  30     54215    18.34 MB/sec  execute 482 sec   
  30     54215    18.30 MB/sec  execute 483 sec   
  30     54215    18.27 MB/sec  execute 484 sec   
  30     54215    18.23 MB/sec  execute 485 sec   
  30     54215    18.19 MB/sec  execute 486 sec   
  30     54228    18.15 MB/sec  execute 487 sec   
  30     54408    17.36 MB/sec  execute 511 sec   
  30     54460    17.36 MB/sec  execute 512 sec   
  30     54488    17.34 MB/sec  execute 513 sec   
  30     54556    17.33 MB/sec  execute 514 sec   
  30     54608    17.30 MB/sec  execute 515 sec   
  30     54608    17.27 MB/sec  execute 516 sec   
  30     54608    17.24 MB/sec  execute 517 sec   
  30     54616    17.20 MB/sec  execute 518 sec   
  30     54616    17.17 MB/sec  execute 519 sec   
  30     54616    17.14 MB/sec  execute 520 sec   
  30     54616    17.10 MB/sec  execute 521 sec   
  30     54616    17.07 MB/sec  execute 522 sec   
  30     54616    17.04 MB/sec  execute 523 sec   
  30     54616    17.01 MB/sec  execute 524 sec   
  30     54682    17.00 MB/sec  execute 525 sec   
  30     54683    16.97 MB/sec  execute 526 sec   
  30     54683    16.93 MB/sec  execute 527 sec   
  30     54683    16.90 MB/sec  execute 528 sec   
  30     54686    16.87 MB/sec  execute 529 sec   
  30     54687    16.84 MB/sec  execute 530 sec   
  30     54789    16.82 MB/sec  execute 531 sec   
  30     54849    16.81 MB/sec  execute 532 sec   
  30     54930    16.79 MB/sec  execute 533 sec   
  30     55042    16.80 MB/sec  execute 534 sec   
  30     55161    16.80 MB/sec  execute 535 sec   
  30     55244    16.80 MB/sec  execute 536 sec   
  30     55273    16.77 MB/sec  execute 537 sec   
  30     55427    16.76 MB/sec  execute 539 sec   
  30     55443    16.73 MB/sec  execute 540 sec   
  30     55443    16.70 MB/sec  execute 541 sec   
  30     55443    16.67 MB/sec  execute 542 sec   
  30     55591    16.69 MB/sec  execute 543 sec   
  30     55748    16.69 MB/sec  execute 544 sec   
  30     55751    16.66 MB/sec  execute 545 sec   
  30     55751    16.63 MB/sec  execute 546 sec   
  30     55751    16.60 MB/sec  execute 547 sec   
  30     55751    16.57 MB/sec  execute 548 sec   
  30     55751    16.54 MB/sec  execute 549 sec   
  30     55751    16.51 MB/sec  execute 550 sec   
  30     55751    16.48 MB/sec  execute 551 sec   
  30     55751    16.45 MB/sec  execute 552 sec   
  30     55760    16.42 MB/sec  execute 553 sec   
  30     55896    16.43 MB/sec  execute 554 sec   
  30     56111    16.47 MB/sec  execute 555 sec   
  30     56459    16.53 MB/sec  execute 556 sec   
  30     56802    16.61 MB/sec  execute 557 sec   
  30     57107    16.66 MB/sec  execute 558 sec   
  30     57195    16.66 MB/sec  execute 559 sec   
  30     57302    16.66 MB/sec  execute 560 sec   
  30     57313    16.63 MB/sec  execute 561 sec   
  30     57314    16.60 MB/sec  execute 562 sec   
  30     57336    16.58 MB/sec  execute 563 sec   
  30     57346    16.55 MB/sec  execute 564 sec   
  30     57346    16.52 MB/sec  execute 565 sec   
  30     57351    16.50 MB/sec  execute 566 sec   
  30     57373    16.47 MB/sec  execute 567 sec   
  30     57390    16.45 MB/sec  execute 568 sec   
  30     57422    16.43 MB/sec  execute 569 sec   
  30     57451    16.41 MB/sec  execute 570 sec   
  30     57501    16.39 MB/sec  execute 571 sec   
  30     57520    16.38 MB/sec  execute 572 sec   
  30     57538    16.35 MB/sec  execute 573 sec   
  30     57893    16.36 MB/sec  execute 576 sec   
  30     58139    16.42 MB/sec  execute 577 sec   
  30     58157    16.39 MB/sec  execute 578 sec   
  30     58157    16.37 MB/sec  execute 579 sec   
  30     58195    16.36 MB/sec  execute 580 sec   
  30     58317    16.36 MB/sec  execute 581 sec   
  30     58644    16.41 MB/sec  execute 582 sec   
  30     58646    16.39 MB/sec  execute 583 sec   
  30     58661    16.36 MB/sec  execute 584 sec   
  30     58902    16.33 MB/sec  execute 587 sec   
  30     59116    16.36 MB/sec  execute 588 sec   
  30     59333    16.38 MB/sec  execute 589 sec   
  30     59561    16.43 MB/sec  execute 590 sec   
  30     59788    16.47 MB/sec  execute 591 sec   
  30     60014    16.50 MB/sec  execute 592 sec   
  30     60236    16.53 MB/sec  execute 593 sec   
  30     60461    16.57 MB/sec  execute 594 sec   
  30     60625    16.58 MB/sec  execute 595 sec   
  30     60681    16.57 MB/sec  execute 596 sec   
  30     60685    16.54 MB/sec  execute 597 sec   
  30     60713    16.52 MB/sec  execute 598 sec   
  30     60735    16.50 MB/sec  execute 599 sec   
  30     60751    16.48 MB/sec  cleanup 600 sec   
  30     60751    16.45 MB/sec  cleanup 601 sec   
  30     60751    16.42 MB/sec  cleanup 603 sec   
  30     60751    16.39 MB/sec  cleanup 604 sec   
  30     60751    16.37 MB/sec  cleanup 605 sec   
  30     60751    16.34 MB/sec  cleanup 606 sec   
  30     60751    16.31 MB/sec  cleanup 607 sec   
  30     60751    16.29 MB/sec  cleanup 608 sec   
  30     60751    16.26 MB/sec  cleanup 609 sec   
  30     60751    16.23 MB/sec  cleanup 610 sec   
  30     60751    16.21 MB/sec  cleanup 611 sec   
  30     60751    16.18 MB/sec  cleanup 612 sec   
  30     60751    16.15 MB/sec  cleanup 613 sec   
  30     60751    16.13 MB/sec  cleanup 614 sec   
  30     60751    16.10 MB/sec  cleanup 615 sec   
  30     60751    16.08 MB/sec  cleanup 616 sec   
  30     60751    15.99 MB/sec  cleanup 619 sec   
  30     60751    15.96 MB/sec  cleanup 620 sec   
  30     60751    15.94 MB/sec  cleanup 621 sec   
  30     60751    15.93 MB/sec  cleanup 621 sec   

Throughput 16.4806 MB/sec 30 procs

[-- Attachment #6: pdflush-hda3.log --]
[-- Type: text/plain, Size: 27799 bytes --]

dbench version 3.04 - Copyright Andrew Tridgell 1999-2004

Running for 600 seconds with load '/usr/share/dbench/client.txt' and minimum warmup 120 secs
30 clients started
  30        13     2.82 MB/sec  warmup   1 sec   
  30        20     4.27 MB/sec  warmup   3 sec   
  30        20     3.05 MB/sec  warmup   4 sec   
  30        20     2.37 MB/sec  warmup   5 sec   
  30        20     1.94 MB/sec  warmup   6 sec   
  30        51     6.86 MB/sec  warmup   7 sec   
  30        64     7.10 MB/sec  warmup   8 sec   
  30        69     7.26 MB/sec  warmup   9 sec   
  30        74     7.22 MB/sec  warmup  10 sec   
  30        78     7.09 MB/sec  warmup  11 sec   
  30        82     6.71 MB/sec  warmup  12 sec   
  30        93     7.23 MB/sec  warmup  13 sec   
  30       161     7.83 MB/sec  warmup  14 sec   
  30       169     8.17 MB/sec  warmup  15 sec   
  30       169     7.64 MB/sec  warmup  16 sec   
  30       177     7.82 MB/sec  warmup  17 sec   
  30       183     7.81 MB/sec  warmup  18 sec   
  30       188     7.73 MB/sec  warmup  19 sec   
  30       190     7.46 MB/sec  warmup  20 sec   
  30       221     7.60 MB/sec  warmup  21 sec   
  30       225     7.52 MB/sec  warmup  22 sec   
  30       228     7.43 MB/sec  warmup  23 sec   
  30       232     7.29 MB/sec  warmup  24 sec   
  30       232     6.99 MB/sec  warmup  25 sec   
  30       245     7.37 MB/sec  warmup  26 sec   
  30       252     7.34 MB/sec  warmup  27 sec   
  30       256     7.29 MB/sec  warmup  28 sec   
  30       265     7.38 MB/sec  warmup  29 sec   
  30       265     7.13 MB/sec  warmup  30 sec   
  30       282     7.43 MB/sec  warmup  31 sec   
  30       292     7.51 MB/sec  warmup  32 sec   
  30       304     7.51 MB/sec  warmup  33 sec   
  30       306     7.33 MB/sec  warmup  34 sec   
  30       306     7.12 MB/sec  warmup  35 sec   
  30       324     7.40 MB/sec  warmup  36 sec   
  30       336     7.54 MB/sec  warmup  37 sec   
  30       347     7.62 MB/sec  warmup  38 sec   
  30       347     7.43 MB/sec  warmup  39 sec   
  30       359     7.55 MB/sec  warmup  40 sec   
  30       379     7.57 MB/sec  warmup  41 sec   
  30       387     7.59 MB/sec  warmup  42 sec   
  30       400     7.70 MB/sec  warmup  43 sec   
  30       408     7.71 MB/sec  warmup  44 sec   
  30       421     7.79 MB/sec  warmup  45 sec   
  30       433     7.75 MB/sec  warmup  46 sec   
  30       444     7.83 MB/sec  warmup  47 sec   
  30       448     7.75 MB/sec  warmup  48 sec   
  30       456     7.76 MB/sec  warmup  49 sec   
  30       480     7.93 MB/sec  warmup  50 sec   
  30       483     7.84 MB/sec  warmup  51 sec   
  30       498     7.87 MB/sec  warmup  52 sec   
  30       506     7.86 MB/sec  warmup  53 sec   
  30       531     8.06 MB/sec  warmup  54 sec   
  30       544     8.14 MB/sec  warmup  55 sec   
  30       575     8.17 MB/sec  warmup  56 sec   
  30       606     8.15 MB/sec  warmup  57 sec   
  30       613     8.06 MB/sec  warmup  58 sec   
  30       802     8.63 MB/sec  warmup  59 sec   
  30       971     9.04 MB/sec  warmup  60 sec   
  30       988     9.00 MB/sec  warmup  61 sec   
  30       998     9.02 MB/sec  warmup  62 sec   
  30      1082     9.09 MB/sec  warmup  63 sec   
  30      1223     9.30 MB/sec  warmup  64 sec   
  30      1472     9.81 MB/sec  warmup  65 sec   
  30      1629     9.43 MB/sec  warmup  72 sec   
  30      1661     9.45 MB/sec  warmup  73 sec   
  30      1792     9.78 MB/sec  warmup  74 sec   
  30      1926     9.94 MB/sec  warmup  75 sec   
  30      2252    10.47 MB/sec  warmup  76 sec   
  30      2564    11.02 MB/sec  warmup  77 sec   
  30      2881    11.53 MB/sec  warmup  78 sec   
  30      3138    12.11 MB/sec  warmup  79 sec   
  30      3270    12.36 MB/sec  warmup  80 sec   
  30      3360    11.97 MB/sec  warmup  85 sec   
  30      3390    11.46 MB/sec  warmup  89 sec   
  30      3430    10.80 MB/sec  warmup  95 sec   
  30      3435    10.67 MB/sec  warmup  96 sec   
  30      3461    10.62 MB/sec  warmup  97 sec   
  30      3474    10.51 MB/sec  warmup  99 sec   
  30      3485    10.39 MB/sec  warmup 100 sec   
  30      3542    10.35 MB/sec  warmup 101 sec   
  30      3574    10.31 MB/sec  warmup 102 sec   
  30      3789    10.53 MB/sec  warmup 103 sec   
  30      4105    10.90 MB/sec  warmup 104 sec   
  30      4416    11.30 MB/sec  warmup 105 sec   
  30      4736    11.70 MB/sec  warmup 106 sec   
  30      4903    11.82 MB/sec  warmup 107 sec   
  30      4934    11.75 MB/sec  warmup 108 sec   
  30      4984    11.70 MB/sec  warmup 109 sec   
  30      4992    11.63 MB/sec  warmup 110 sec   
  30      4992    11.52 MB/sec  warmup 111 sec   
  30      4992    11.42 MB/sec  warmup 112 sec   
  30      5033    11.38 MB/sec  warmup 113 sec   
  30      5149    11.47 MB/sec  warmup 114 sec   
  30      5284    11.61 MB/sec  warmup 115 sec   
  30      5450    11.76 MB/sec  warmup 116 sec   
  30      5618    11.90 MB/sec  warmup 117 sec   
  30      5783    12.00 MB/sec  warmup 118 sec   
  30      5783    11.90 MB/sec  warmup 119 sec   
  30      5802    11.80 MB/sec  warmup 120 sec   
  30      5948    11.54 MB/sec  execute   1 sec   
  30      5956     7.62 MB/sec  execute   2 sec   
  30      6053     9.94 MB/sec  execute   3 sec   
  30      6218    14.81 MB/sec  execute   4 sec   
  30      6275     5.18 MB/sec  execute  13 sec   
  30      6275     4.81 MB/sec  execute  14 sec   
  30      6275     4.48 MB/sec  execute  15 sec   
  30      6275     4.20 MB/sec  execute  16 sec   
  30      6275     3.95 MB/sec  execute  17 sec   
  30      6275     3.73 MB/sec  execute  18 sec   
  30      6278     3.57 MB/sec  execute  19 sec   
  30      6380     4.11 MB/sec  execute  20 sec   
  30      6473     5.05 MB/sec  execute  21 sec   
  30      6474     4.82 MB/sec  execute  22 sec   
  30      6480     4.85 MB/sec  execute  23 sec   
  30      6586     5.57 MB/sec  execute  24 sec   
  30      6788     6.49 MB/sec  execute  25 sec   
  30      7070     8.13 MB/sec  execute  26 sec   
  30      7355     9.76 MB/sec  execute  27 sec   
  30      7684    10.88 MB/sec  execute  28 sec   
  30      7965    12.52 MB/sec  execute  29 sec   
  30      8027    12.57 MB/sec  execute  30 sec   
  30      8027    12.16 MB/sec  execute  31 sec   
  30      8027    11.78 MB/sec  execute  32 sec   
  30      8027    11.42 MB/sec  execute  33 sec   
  30      8081    11.46 MB/sec  execute  34 sec   
  30      8234    11.74 MB/sec  execute  35 sec   
  30      8402    12.12 MB/sec  execute  36 sec   
  30      8680    13.02 MB/sec  execute  37 sec   
  30      8911    13.77 MB/sec  execute  38 sec   
  30      9069    14.05 MB/sec  execute  39 sec   
  30      9396    15.01 MB/sec  execute  40 sec   
  30      9636    15.58 MB/sec  execute  41 sec   
  30      9775    15.73 MB/sec  execute  42 sec   
  30      9870    15.74 MB/sec  execute  43 sec   
  30      9934    12.59 MB/sec  execute  54 sec   
  30      9934    12.36 MB/sec  execute  55 sec   
  30      9934    12.14 MB/sec  execute  56 sec   
  30      9934    11.93 MB/sec  execute  57 sec   
  30      9934    11.72 MB/sec  execute  58 sec   
  30      9934    11.53 MB/sec  execute  59 sec   
  30      9934    11.34 MB/sec  execute  60 sec   
  30      9934    11.15 MB/sec  execute  61 sec   
  30      9934    10.97 MB/sec  execute  62 sec   
  30      9934    10.80 MB/sec  execute  63 sec   
  30     10006    10.83 MB/sec  execute  64 sec   
  30     10316    11.34 MB/sec  execute  65 sec   
  30     10584    11.96 MB/sec  execute  66 sec   
  30     10871    12.60 MB/sec  execute  67 sec   
  30     11147    13.17 MB/sec  execute  68 sec   
  30     11383    13.43 MB/sec  execute  69 sec   
  30     11676    13.82 MB/sec  execute  70 sec   
  30     11896    14.12 MB/sec  execute  71 sec   
  30     12149    14.53 MB/sec  execute  72 sec   
  30     12234    14.49 MB/sec  execute  73 sec   
  30     12273    14.44 MB/sec  execute  74 sec   
  30     12399    14.55 MB/sec  execute  75 sec   
  30     12410    14.37 MB/sec  execute  76 sec   
  30     12476    14.28 MB/sec  execute  77 sec   
  30     12758    14.73 MB/sec  execute  78 sec   
  30     13081    15.17 MB/sec  execute  79 sec   
  30     13393    15.60 MB/sec  execute  80 sec   
  30     13701    16.05 MB/sec  execute  81 sec   
  30     14015    16.43 MB/sec  execute  82 sec   
  30     14291    16.74 MB/sec  execute  83 sec   
  30     14327    16.61 MB/sec  execute  84 sec   
  30     14469    15.54 MB/sec  execute  92 sec   
  30     14498    15.40 MB/sec  execute  93 sec   
  30     14498    15.24 MB/sec  execute  94 sec   
  30     14498    15.08 MB/sec  execute  95 sec   
  30     14531    14.95 MB/sec  execute  96 sec   
  30     14736    15.17 MB/sec  execute  97 sec   
  30     14880    15.22 MB/sec  execute  98 sec   
  30     15007    15.30 MB/sec  execute  99 sec   
  30     15159    15.39 MB/sec  execute 100 sec   
  30     15237    15.37 MB/sec  execute 101 sec   
  30     15342    15.42 MB/sec  execute 102 sec   
  30     15567    15.56 MB/sec  execute 103 sec   
  30     15879    15.89 MB/sec  execute 104 sec   
  30     16075    16.11 MB/sec  execute 105 sec   
  30     16323    16.31 MB/sec  execute 106 sec   
  30     16556    16.49 MB/sec  execute 107 sec   
  30     16817    16.74 MB/sec  execute 108 sec   
  30     17654    17.09 MB/sec  execute 114 sec   
  30     17766    17.10 MB/sec  execute 115 sec   
  30     17856    17.03 MB/sec  execute 116 sec   
  30     17889    16.93 MB/sec  execute 117 sec   
  30     17959    16.84 MB/sec  execute 118 sec   
  30     18024    16.77 MB/sec  execute 119 sec   
  30     18036    16.66 MB/sec  execute 120 sec   
  30     18036    16.53 MB/sec  execute 121 sec   
  30     18067    16.45 MB/sec  execute 122 sec   
  30     18179    14.64 MB/sec  execute 138 sec   
  30     19236    15.07 MB/sec  execute 145 sec   
  30     19498    15.27 MB/sec  execute 146 sec   
  30     19687    15.41 MB/sec  execute 147 sec   
  30     19928    15.50 MB/sec  execute 148 sec   
  30     20154    15.70 MB/sec  execute 149 sec   
  30     20341    15.79 MB/sec  execute 150 sec   
  30     20577    15.91 MB/sec  execute 151 sec   
  30     20813    16.14 MB/sec  execute 152 sec   
  30     21060    16.32 MB/sec  execute 153 sec   
  30     21308    16.47 MB/sec  execute 154 sec   
  30     21542    16.64 MB/sec  execute 155 sec   
  30     21872    16.63 MB/sec  execute 158 sec   
  30     22137    16.76 MB/sec  execute 159 sec   
  30     22206    16.69 MB/sec  execute 160 sec   
  30     22330    16.68 MB/sec  execute 161 sec   
  30     22472    16.73 MB/sec  execute 162 sec   
  30     22529    16.68 MB/sec  execute 163 sec   
  30     22541    16.60 MB/sec  execute 164 sec   
  30     22541    16.43 MB/sec  execute 166 sec   
  30     22541    16.33 MB/sec  execute 167 sec   
  30     22541    16.23 MB/sec  execute 168 sec   
  30     22541    16.14 MB/sec  execute 169 sec   
  30     22541    16.04 MB/sec  execute 170 sec   
  30     22556    15.97 MB/sec  execute 171 sec   
  30     22689    16.01 MB/sec  execute 172 sec   
  30     23014    16.21 MB/sec  execute 173 sec   
  30     23191    16.30 MB/sec  execute 174 sec   
  30     23423    16.42 MB/sec  execute 175 sec   
  30     23680    16.55 MB/sec  execute 176 sec   
  30     23846    16.60 MB/sec  execute 177 sec   
  30     23965    16.61 MB/sec  execute 178 sec   
  30     24135    16.69 MB/sec  execute 179 sec   
  30     24259    16.71 MB/sec  execute 180 sec   
  30     24481    16.83 MB/sec  execute 181 sec   
  30     24729    16.96 MB/sec  execute 182 sec   
  30     24991    17.11 MB/sec  execute 183 sec   
  30     25281    17.31 MB/sec  execute 184 sec   
  30     25579    17.46 MB/sec  execute 185 sec   
  30     25838    17.62 MB/sec  execute 186 sec   
  30     25903    17.57 MB/sec  execute 187 sec   
  30     26263    17.35 MB/sec  execute 192 sec   
  30     26350    17.33 MB/sec  execute 193 sec   
  30     26371    17.25 MB/sec  execute 194 sec   
  30     26569    17.34 MB/sec  execute 195 sec   
  30     26892    17.50 MB/sec  execute 196 sec   
  30     27050    17.55 MB/sec  execute 197 sec   
  30     27088    17.49 MB/sec  execute 198 sec   
  30     27135    17.43 MB/sec  execute 199 sec   
  30     27151    17.37 MB/sec  execute 200 sec   
  30     27151    17.28 MB/sec  execute 201 sec   
  30     27157    17.21 MB/sec  execute 202 sec   
  30     27300    17.23 MB/sec  execute 203 sec   
  30     27628    17.41 MB/sec  execute 204 sec   
  30     27928    17.55 MB/sec  execute 205 sec   
  30     28244    17.72 MB/sec  execute 206 sec   
  30     28571    17.89 MB/sec  execute 207 sec   
  30     28897    18.05 MB/sec  execute 208 sec   
  30     29208    18.21 MB/sec  execute 209 sec   
  30     29535    18.38 MB/sec  execute 210 sec   
  30     29785    18.48 MB/sec  execute 211 sec   
  30     29998    18.56 MB/sec  execute 212 sec   
  30     30092    18.54 MB/sec  execute 213 sec   
  30     30092    18.45 MB/sec  execute 214 sec   
  30     30092    18.37 MB/sec  execute 215 sec   
  30     30092    18.28 MB/sec  execute 216 sec   
  30     30092    18.20 MB/sec  execute 217 sec   
  30     30169    18.19 MB/sec  execute 218 sec   
  30     30258    18.17 MB/sec  execute 219 sec   
  30     30411    18.20 MB/sec  execute 220 sec   
  30     30595    18.28 MB/sec  execute 221 sec   
  30     30662    18.26 MB/sec  execute 222 sec   
  30     30960    18.09 MB/sec  execute 227 sec   
  30     31247    18.20 MB/sec  execute 228 sec   
  30     31316    18.16 MB/sec  execute 229 sec   
  30     31350    18.10 MB/sec  execute 230 sec   
  30     31427    18.08 MB/sec  execute 231 sec   
  30     31497    18.06 MB/sec  execute 232 sec   
  30     31674    18.09 MB/sec  execute 233 sec   
  30     31936    18.18 MB/sec  execute 234 sec   
  30     32246    18.34 MB/sec  execute 235 sec   
  30     32560    18.46 MB/sec  execute 236 sec   
  30     32897    18.56 MB/sec  execute 237 sec   
  30     33175    18.74 MB/sec  execute 238 sec   
  30     33513    18.82 MB/sec  execute 239 sec   
  30     33784    18.99 MB/sec  execute 240 sec   
  30     34125    19.10 MB/sec  execute 241 sec   
  30     37031    20.11 MB/sec  execute 253 sec   
  30     37316    20.23 MB/sec  execute 254 sec   
  30     37603    20.29 MB/sec  execute 255 sec   
  30     37918    20.41 MB/sec  execute 256 sec   
  30     38232    20.47 MB/sec  execute 257 sec   
  30     38526    20.58 MB/sec  execute 258 sec   
  30     38718    20.64 MB/sec  execute 259 sec   
  30     38718    20.56 MB/sec  execute 260 sec   
  30     38718    20.48 MB/sec  execute 261 sec   
  30     38755    20.42 MB/sec  execute 262 sec   
  30     38824    20.39 MB/sec  execute 263 sec   
  30     38824    20.31 MB/sec  execute 264 sec   
  30     38824    20.23 MB/sec  execute 265 sec   
  30     38824    20.16 MB/sec  execute 266 sec   
  30     38824    20.08 MB/sec  execute 267 sec   
  30     38824    20.01 MB/sec  execute 268 sec   
  30     38824    19.93 MB/sec  execute 269 sec   
  30     38824    19.86 MB/sec  execute 270 sec   
  30     38824    19.78 MB/sec  execute 271 sec   
  30     38824    19.71 MB/sec  execute 272 sec   
  30     38824    19.64 MB/sec  execute 273 sec   
  30     38824    19.57 MB/sec  execute 274 sec   
  30     38824    19.50 MB/sec  execute 275 sec   
  30     38824    19.43 MB/sec  execute 276 sec   
  30     38824    19.36 MB/sec  execute 277 sec   
  30     38824    19.29 MB/sec  execute 278 sec   
  30     38824    19.22 MB/sec  execute 279 sec   
  30     38878    19.19 MB/sec  execute 280 sec   
  30     38962    19.15 MB/sec  execute 281 sec   
  30     39021    19.14 MB/sec  execute 282 sec   
  30     39180    18.29 MB/sec  execute 297 sec   
  30     39329    18.35 MB/sec  execute 298 sec   
  30     39422    18.35 MB/sec  execute 299 sec   
  30     39551    18.34 MB/sec  execute 300 sec   
  30     39623    18.32 MB/sec  execute 301 sec   
  30     39689    18.29 MB/sec  execute 302 sec   
  30     39729    18.24 MB/sec  execute 303 sec   
  30     39763    18.20 MB/sec  execute 304 sec   
  30     39766    18.14 MB/sec  execute 305 sec   
  30     39766    18.08 MB/sec  execute 306 sec   
  30     39784    18.03 MB/sec  execute 307 sec   
  30     39884    18.02 MB/sec  execute 308 sec   
  30     40163    18.11 MB/sec  execute 309 sec   
  30     40489    18.22 MB/sec  execute 310 sec   
  30     40727    18.29 MB/sec  execute 311 sec   
  30     40859    18.30 MB/sec  execute 312 sec   
  30     41000    18.31 MB/sec  execute 313 sec   
  30     41057    18.27 MB/sec  execute 314 sec   
  30     41100    18.25 MB/sec  execute 315 sec   
  30     41140    18.21 MB/sec  execute 316 sec   
  30     41186    18.17 MB/sec  execute 317 sec   
  30     41225    18.14 MB/sec  execute 318 sec   
  30     41292    18.11 MB/sec  execute 319 sec   
  30     41540    18.18 MB/sec  execute 320 sec   
  30     41849    18.28 MB/sec  execute 321 sec   
  30     42124    18.37 MB/sec  execute 322 sec   
  30     42341    18.41 MB/sec  execute 323 sec   
  30     42596    18.49 MB/sec  execute 324 sec   
  30     42915    18.59 MB/sec  execute 325 sec   
  30     43192    18.67 MB/sec  execute 326 sec   
  30     43474    18.72 MB/sec  execute 327 sec   
  30     43655    18.48 MB/sec  execute 332 sec   
  30     43679    18.41 MB/sec  execute 334 sec   
  30     43679    18.36 MB/sec  execute 335 sec   
  30     43679    18.30 MB/sec  execute 336 sec   
  30     43679    18.25 MB/sec  execute 337 sec   
  30     43679    18.19 MB/sec  execute 338 sec   
  30     43679    18.14 MB/sec  execute 339 sec   
  30     43679    18.09 MB/sec  execute 340 sec   
  30     43679    18.03 MB/sec  execute 341 sec   
  30     43679    17.98 MB/sec  execute 342 sec   
  30     43679    17.93 MB/sec  execute 343 sec   
  30     43679    17.88 MB/sec  execute 344 sec   
  30     43679    17.82 MB/sec  execute 345 sec   
  30     43679    17.77 MB/sec  execute 346 sec   
  30     43679    17.72 MB/sec  execute 347 sec   
  30     43679    17.67 MB/sec  execute 348 sec   
  30     44430    17.86 MB/sec  execute 352 sec   
  30     44741    17.93 MB/sec  execute 353 sec   
  30     45043    18.03 MB/sec  execute 354 sec   
  30     45343    18.10 MB/sec  execute 355 sec   
  30     45671    18.21 MB/sec  execute 356 sec   
  30     45932    18.28 MB/sec  execute 357 sec   
  30     46248    18.35 MB/sec  execute 358 sec   
  30     46541    18.43 MB/sec  execute 359 sec   
  30     46711    18.47 MB/sec  execute 360 sec   
  30     46909    18.50 MB/sec  execute 361 sec   
  30     47073    18.54 MB/sec  execute 362 sec   
  30     47345    18.42 MB/sec  execute 367 sec   
  30     47438    18.40 MB/sec  execute 368 sec   
  30     47526    18.38 MB/sec  execute 369 sec   
  30     47744    18.43 MB/sec  execute 370 sec   
  30     47817    18.42 MB/sec  execute 371 sec   
  30     47817    18.37 MB/sec  execute 372 sec   
  30     47817    18.32 MB/sec  execute 373 sec   
  30     47817    18.27 MB/sec  execute 374 sec   
  30     47817    18.22 MB/sec  execute 375 sec   
  30     47817    18.17 MB/sec  execute 376 sec   
  30     47817    18.12 MB/sec  execute 377 sec   
  30     47821    18.08 MB/sec  execute 378 sec   
  30     47882    18.05 MB/sec  execute 379 sec   
  30     47930    18.03 MB/sec  execute 380 sec   
  30     48008    18.01 MB/sec  execute 381 sec   
  30     48115    18.01 MB/sec  execute 382 sec   
  30     48377    18.08 MB/sec  execute 383 sec   
  30     48586    18.11 MB/sec  execute 384 sec   
  30     48801    18.16 MB/sec  execute 385 sec   
  30     49112    18.24 MB/sec  execute 386 sec   
  30     49364    18.30 MB/sec  execute 387 sec   
  30     49606    18.37 MB/sec  execute 388 sec   
  30     49817    18.40 MB/sec  execute 389 sec   
  30     50071    18.45 MB/sec  execute 390 sec   
  30     50392    18.52 MB/sec  execute 391 sec   
  30     50665    18.62 MB/sec  execute 392 sec   
  30     50969    18.69 MB/sec  execute 393 sec   
  30     51267    18.76 MB/sec  execute 394 sec   
  30     51576    18.82 MB/sec  execute 395 sec   
  30     51863    18.90 MB/sec  execute 396 sec   
  30     52156    18.96 MB/sec  execute 397 sec   
  30     52441    18.57 MB/sec  execute 408 sec   
  30     52710    18.63 MB/sec  execute 409 sec   
  30     52913    18.67 MB/sec  execute 410 sec   
  30     52963    18.64 MB/sec  execute 411 sec   
  30     52969    18.59 MB/sec  execute 412 sec   
  30     52969    18.55 MB/sec  execute 413 sec   
  30     52969    18.50 MB/sec  execute 414 sec   
  30     53011    18.48 MB/sec  execute 415 sec   
  30     53101    18.47 MB/sec  execute 416 sec   
  30     53166    18.45 MB/sec  execute 417 sec   
  30     53257    18.44 MB/sec  execute 418 sec   
  30     53360    18.44 MB/sec  execute 419 sec   
  30     53471    18.44 MB/sec  execute 420 sec   
  30     53570    18.43 MB/sec  execute 421 sec   
  30     53719    18.44 MB/sec  execute 422 sec   
  30     54034    18.52 MB/sec  execute 423 sec   
  30     54325    18.58 MB/sec  execute 424 sec   
  30     54634    18.65 MB/sec  execute 425 sec   
  30     54910    18.73 MB/sec  execute 426 sec   
  30     55216    18.79 MB/sec  execute 427 sec   
  30     55486    18.86 MB/sec  execute 428 sec   
  30     55768    18.91 MB/sec  execute 429 sec   
  30     56018    18.88 MB/sec  execute 431 sec   
  30     56327    18.45 MB/sec  execute 444 sec   
  30     56419    18.44 MB/sec  execute 445 sec   
  30     56471    18.41 MB/sec  execute 446 sec   
  30     56520    18.39 MB/sec  execute 447 sec   
  30     56520    18.35 MB/sec  execute 448 sec   
  30     56520    18.31 MB/sec  execute 449 sec   
  30     56520    18.27 MB/sec  execute 450 sec   
  30     56520    18.23 MB/sec  execute 451 sec   
  30     56520    18.19 MB/sec  execute 452 sec   
  30     56520    18.15 MB/sec  execute 453 sec   
  30     56520    18.11 MB/sec  execute 454 sec   
  30     56520    18.07 MB/sec  execute 455 sec   
  30     56520    18.03 MB/sec  execute 456 sec   
  30     56520    17.99 MB/sec  execute 457 sec   
  30     56640    17.99 MB/sec  execute 458 sec   
  30     56751    17.99 MB/sec  execute 459 sec   
  30     57011    18.04 MB/sec  execute 460 sec   
  30     57116    18.03 MB/sec  execute 461 sec   
  30     57206    18.03 MB/sec  execute 462 sec   
  30     57325    18.04 MB/sec  execute 463 sec   
  30     57473    18.04 MB/sec  execute 464 sec   
  30     57552    18.04 MB/sec  execute 465 sec   
  30     57623    18.03 MB/sec  execute 466 sec   
  30     57883    18.07 MB/sec  execute 467 sec   
  30     57914    18.05 MB/sec  execute 468 sec   
  30     57967    18.03 MB/sec  execute 469 sec   
  30     58052    18.03 MB/sec  execute 470 sec   
  30     58147    18.00 MB/sec  execute 471 sec   
  30     58195    17.98 MB/sec  execute 472 sec   
  30     58221    17.95 MB/sec  execute 473 sec   
  30     58225    17.91 MB/sec  execute 474 sec   
  30     58352    17.56 MB/sec  execute 485 sec   
  30     58474    17.58 MB/sec  execute 486 sec   
  30     58630    17.60 MB/sec  execute 487 sec   
  30     58712    17.59 MB/sec  execute 488 sec   
  30     58982    17.64 MB/sec  execute 489 sec   
  30     59303    17.69 MB/sec  execute 490 sec   
  30     59613    17.76 MB/sec  execute 491 sec   
  30     59885    17.82 MB/sec  execute 492 sec   
  30     60158    17.90 MB/sec  execute 493 sec   
  30     60464    17.94 MB/sec  execute 494 sec   
  30     60770    18.00 MB/sec  execute 495 sec   
  30     61069    18.06 MB/sec  execute 496 sec   
  30     61384    18.12 MB/sec  execute 497 sec   
  30     61687    18.19 MB/sec  execute 498 sec   
  30     61905    18.24 MB/sec  execute 499 sec   
  30     62204    18.25 MB/sec  execute 501 sec   
  30     62480    18.32 MB/sec  execute 502 sec   
  30     62776    18.40 MB/sec  execute 503 sec   
  30     63086    18.43 MB/sec  execute 504 sec   
  30     63402    18.48 MB/sec  execute 505 sec   
  30     63412    18.45 MB/sec  execute 506 sec   
  30     63726    18.27 MB/sec  execute 513 sec   
  30     63727    18.24 MB/sec  execute 514 sec   
  30     63900    18.25 MB/sec  execute 515 sec   
  30     64195    18.32 MB/sec  execute 516 sec   
  30     64520    18.38 MB/sec  execute 517 sec   
  30     64827    18.44 MB/sec  execute 518 sec   
  30     65161    18.50 MB/sec  execute 519 sec   
  30     65456    18.57 MB/sec  execute 520 sec   
  30     65696    18.61 MB/sec  execute 521 sec   
  30     65995    18.66 MB/sec  execute 522 sec   
  30     66315    18.72 MB/sec  execute 523 sec   
  30     66594    18.79 MB/sec  execute 524 sec   
  30     66867    18.86 MB/sec  execute 525 sec   
  30     67175    18.90 MB/sec  execute 526 sec   
  30     67401    18.92 MB/sec  execute 527 sec   
  30     67613    18.96 MB/sec  execute 528 sec   
  30     67872    19.00 MB/sec  execute 529 sec   
  30     68048    19.04 MB/sec  execute 530 sec   
  30     68234    19.05 MB/sec  execute 531 sec   
  30     68431    19.08 MB/sec  execute 532 sec   
  30     68685    19.11 MB/sec  execute 533 sec   
  30     68994    19.15 MB/sec  execute 534 sec   
  30     69074    19.14 MB/sec  execute 535 sec   
  30     69134    19.12 MB/sec  execute 536 sec   
  30     69434    19.17 MB/sec  execute 537 sec   
  30     69551    19.16 MB/sec  execute 538 sec   
  30     69551    19.13 MB/sec  execute 539 sec   
  30     69551    19.09 MB/sec  execute 540 sec   
  30     69551    19.06 MB/sec  execute 541 sec   
  30     69551    19.02 MB/sec  execute 542 sec   
  30     69551    18.99 MB/sec  execute 543 sec   
  30     69782    18.61 MB/sec  execute 557 sec   
  30     70043    18.66 MB/sec  execute 558 sec   
  30     70329    18.71 MB/sec  execute 559 sec   
  30     70594    18.74 MB/sec  execute 560 sec   
  30     70873    18.78 MB/sec  execute 561 sec   
  30     71176    18.83 MB/sec  execute 562 sec   
  30     71444    18.87 MB/sec  execute 563 sec   
  30     71722    18.93 MB/sec  execute 564 sec   
  30     72012    18.96 MB/sec  execute 565 sec   
  30     72257    19.01 MB/sec  execute 566 sec   
  30     72509    19.05 MB/sec  execute 567 sec   
  30     72745    19.09 MB/sec  execute 568 sec   
  30     72949    19.12 MB/sec  execute 569 sec   
  30     73154    18.85 MB/sec  execute 580 sec   
  30     73184    18.82 MB/sec  execute 581 sec   
  30     73195    18.79 MB/sec  execute 582 sec   
  30     73506    18.84 MB/sec  execute 583 sec   
  30     73681    18.86 MB/sec  execute 584 sec   
  30     73774    18.86 MB/sec  execute 585 sec   
  30     73905    18.85 MB/sec  execute 586 sec   
  30     74020    18.86 MB/sec  execute 587 sec   
  30     74103    18.85 MB/sec  execute 588 sec   
  30     74206    18.85 MB/sec  execute 589 sec   
  30     74337    18.85 MB/sec  execute 590 sec   
  30     74442    18.85 MB/sec  execute 591 sec   
  30     74581    18.85 MB/sec  execute 592 sec   
  30     74801    18.89 MB/sec  execute 593 sec   
  30     75102    18.93 MB/sec  execute 594 sec   
  30     75360    18.97 MB/sec  execute 595 sec   
  30     75676    19.01 MB/sec  execute 596 sec   
  30     75916    19.06 MB/sec  execute 597 sec   
  30     76208    19.09 MB/sec  execute 598 sec   
  30     76497    19.15 MB/sec  execute 599 sec   
  30     76721    19.18 MB/sec  execute 600 sec   
  30     77016    19.22 MB/sec  cleanup 601 sec   
  30     77016    19.19 MB/sec  cleanup 602 sec   
  30     77016    18.84 MB/sec  cleanup 612 sec   
  30     77016    18.69 MB/sec  cleanup 617 sec   
  30     77016    18.64 MB/sec  cleanup 619 sec   

Throughput 19.219 MB/sec 30 procs
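
A small post-processing sketch, hypothetical and not something posted in the
thread: given dbench logs in the format above, it reports the mean and
standard deviation of the execute-phase rate column, which is the fairness
measure discussed below (build with gcc -o dbench-stat dbench-stat.c -lm,
then e.g. ./dbench-stat < pdflush-hda3.log):

#include <math.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256], phase[32];
	double rate, sum = 0.0, sumsq = 0.0;
	long clients, ops, secs, n = 0;

	while (fgets(line, sizeof(line), stdin)) {
		/* e.g. "  30     54215    18.38 MB/sec  execute 481 sec" */
		if (sscanf(line, " %ld %ld %lf MB/sec %31s %ld sec",
			   &clients, &ops, &rate, phase, &secs) != 5)
			continue;
		if (strcmp(phase, "execute"))
			continue;	/* ignore warmup and cleanup samples */
		sum += rate;
		sumsq += rate * rate;
		n++;
	}
	if (n) {
		double mean = sum / n;
		printf("samples=%ld mean=%.2f MB/sec stddev=%.2f\n",
		       n, mean, sqrt(sumsq / n - mean * mean));
	}
	return 0;
}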

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 15:20 ` Frederic Weisbecker
@ 2009-06-04 19:07   ` Andrew Morton
  2009-06-04 19:13     ` Frederic Weisbecker
  2009-06-05  1:14   ` Zhang, Yanmin
  1 sibling, 1 reply; 66+ messages in thread
From: Andrew Morton @ 2009-06-04 19:07 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:

> I've just tested it on UP with a single disk.

I must say, I'm stunned at the amount of testing which people are
performing on this patchset.  Normally when someone sends out a
patchset it just sort of lands with a dull thud.

I'm not sure what Jens did right to make all this happen, but thanks!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 19:07   ` Andrew Morton
@ 2009-06-04 19:13     ` Frederic Weisbecker
  2009-06-04 19:50       ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-04 19:13 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> 
> > I've just tested it on UP with a single disk.
> 
> I must say, I'm stunned at the amount of testing which people are
> performing on this patchset.  Normally when someone sends out a
> patchset it just sort of lands with a dull thud.
> 
> I'm not sure what Jens did right to make all this happen, but thanks!


I don't know how he did it either. I was reading these patches and *something*
pushed me to my testbox, and then I tested...

Jens, how do you do that?


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 19:13     ` Frederic Weisbecker
@ 2009-06-04 19:50       ` Jens Axboe
  2009-06-04 20:10         ` Jens Axboe
  2009-06-04 21:37         ` Frederic Weisbecker
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-04 19:50 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > 
> > > I've just tested it on UP with a single disk.
> > 
> > I must say, I'm stunned at the amount of testing which people are
> > performing on this patchset.  Normally when someone sends out a
> > patchset it just sort of lands with a dull thud.
> > 
> > I'm not sure what Jens did right to make all this happen, but thanks!
> 
> 
> I don't know how he did it either. I was reading these patches and *something*
> pushed me to my testbox, and then I tested...
> 
> Jens, how do you do that?

Heh, not sure :-)

But indeed, thanks for the testing. It looks quite interesting. I'm
guessing it probably has to do with who ends up doing the balancing and
the fact that the flusher threads block; that may change the picture a
bit. So it may just be that it'll require a few VM tweaks. I'll
definitely look into it and try to reproduce your results.

Did you run it a 2nd time on each drive and check if the results were
(approximately) consistent on the two drives?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 19:50       ` Jens Axboe
@ 2009-06-04 20:10         ` Jens Axboe
  2009-06-04 22:34           ` Frederic Weisbecker
  2009-06-04 21:37         ` Frederic Weisbecker
  1 sibling, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-06-04 20:10 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, Jun 04 2009, Jens Axboe wrote:
> On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > 
> > > > I've just tested it on UP with a single disk.
> > > 
> > > I must say, I'm stunned at the amount of testing which people are
> > > performing on this patchset.  Normally when someone sends out a
> > > patchset it just sort of lands with a dull thud.
> > > 
> > > I'm not sure what Jens did right to make all this happen, but thanks!
> > 
> > 
> > I don't know how he did it either. I was reading these patches and *something*
> > pushed me to my testbox, and then I tested...
> > 
> > Jens, how do you do that?
> 
> Heh, not sure :-)
> 
> But indeed, thanks for the testing. It looks quite interesting. I'm
> guessing it probably has to do with who ends up doing the balancing and
> the fact that the flusher threads block; that may change the picture a
> bit. So it may just be that it'll require a few VM tweaks. I'll
> definitely look into it and try to reproduce your results.
> 
> Did you run it a 2nd time on each drive and check if the results were
> (approximately) consistent on the two drives?

each partition... What IO scheduler did you use on hda?

The main difference with this test case is that before we had two super
blocks, each with lists of dirty inodes. pdflush would attack those. Now
we have both the inodes from the two supers on a single set of lists on
the bdi. So either we have some ordering issue there (which is causing
the unfairness), or something else is.
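
Schematically (an illustration only, not the patchset's literal code; the
field names only approximate it):

/* Before: each super_block had its own dirty lists, walked by pdflush. */
struct super_block_before {
	struct list_head s_dirty;	/* dirty inodes of this sb */
	struct list_head s_io;		/* inodes queued for writeback */
};

/* After: inodes from every super on the device share one set of per-bdi
 * lists, serviced by that bdi's flusher thread, so two partitions of
 * hda now feed the same lists.
 */
struct backing_dev_info_after {
	struct list_head b_dirty;	/* dirty inodes, from any sb */
	struct list_head b_io;		/* parked for writeback */
	struct list_head b_more_io;	/* requeued for more writeback */
	struct task_struct *task;	/* the bdi flusher thread */
};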

So perhaps you can try with noop on hda to see if that changes the
picture?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 19:50       ` Jens Axboe
  2009-06-04 20:10         ` Jens Axboe
@ 2009-06-04 21:37         ` Frederic Weisbecker
  1 sibling, 0 replies; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-04 21:37 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, Jun 04, 2009 at 09:50:13PM +0200, Jens Axboe wrote:
> On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > 
> > > > I've just tested it on UP with a single disk.
> > > 
> > > I must say, I'm stunned at the amount of testing which people are
> > > performing on this patchset.  Normally when someone sends out a
> > > patchset it just sort of lands with a dull thud.
> > > 
> > > I'm not sure what Jens did right to make all this happen, but thanks!
> > 
> > 
> > I don't know how he did it either. I was reading these patches and *something*
> > pushed me to my testbox, and then I tested...
> > 
> > Jens, how do you do that?
> 
> Heh, not sure :-)
> 
> But indeed, thanks for the testing. It looks quite interesting. I'm
> guessing it probably has to do with who ends up doing the balancing and
> the fact that the flusher threads block; that may change the picture a
> bit. So it may just be that it'll require a few VM tweaks. I'll
> definitely look into it and try to reproduce your results.
> 
> Did you run it a 2nd time on each drive and check if the results were
> (approximately) consistent on the two drives?


Another snapshot, only with bdi-writeback this time.

http://kernel.org/pub/linux/kernel/people/frederic/dbench2.pdf

Looks like the same effect, but the difference is smaller this time.

I guess there is a fair amount of noise in there, so it's hard to tell :)
I'll test with the noop scheduler.
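
For reference, switching the elevator at runtime just means writing the
scheduler name into sysfs, i.e. echo noop > /sys/block/hda/queue/scheduler;
a tiny C equivalent (hda assumed, adjust per device):

#include <stdio.h>

int main(void)
{
	/* same effect as: echo noop > /sys/block/hda/queue/scheduler */
	FILE *f = fopen("/sys/block/hda/queue/scheduler", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fprintf(f, "noop\n");
	return fclose(f) ? 1 : 0;
}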


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 20:10         ` Jens Axboe
@ 2009-06-04 22:34           ` Frederic Weisbecker
  2009-06-05 19:15             ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-04 22:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> On Thu, Jun 04 2009, Jens Axboe wrote:
> > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > > 
> > > > > I've just tested it on UP with a single disk.
> > > > 
> > > > I must say, I'm stunned at the amount of testing which people are
> > > > performing on this patchset.  Normally when someone sends out a
> > > > patchset it just sort of lands with a dull thud.
> > > > 
> > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > > 
> > > 
> > > I don't know how he did it either. I was reading these patches and *something*
> > > pushed me to my testbox, and then I tested...
> > > 
> > > Jens, how do you do that?
> > 
> > Heh, not sure :-)
> > 
> > But indeed, thanks for the testing. It looks quite interesting. I'm
> > guessing it probably has to do with who ends up doing the balancing and
> > the fact that the flusher threads block; that may change the picture a
> > bit. So it may just be that it'll require a few VM tweaks. I'll
> > definitely look into it and try to reproduce your results.
> > 
> > Did you run it a 2nd time on each drive and check if the results were
> > (approximately) consistent on the two drives?
> 
> each partition... What IO scheduler did you use on hda?


CFQ.

 
> The main difference with this test case is that before we had two super
> blocks, each with lists of dirty inodes. pdflush would attack those. Now
> we have both the inodes from the two supers on a single set of lists on
> the bdi. So either we have some ordering issue there (which is causing
> the unfairness), or something else is.


Yeah.
But although these flushers are per-bdi, with a single list (well, three)
of dirty inodes, it looks like the writeback is still performed per
superblock: the bdi work carries the superblock concerned, and the bdi
list is iterated in generic_sync_wb_inodes(), which only processes the
inodes belonging to that superblock. So there is a bit of per-superblock
serialization there and....


(Note, the above is mostly written for myself, in the secret hope that I
could understand these patches better by writing out my brainstorming...)


> So perhaps you can try with noop on hda to see if that changes the
> picture?



The result with noop is even more impressive.

See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf

Also a comparison, noop with pdflush against noop with bdi writeback:

http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf


Frederic.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 15:20 ` Frederic Weisbecker
  2009-06-04 19:07   ` Andrew Morton
@ 2009-06-05  1:14   ` Zhang, Yanmin
  2009-06-05 19:16     ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Zhang, Yanmin @ 2009-06-05  1:14 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, akpm, jack, richard, damien.wyart

On Thu, 2009-06-04 at 17:20 +0200, Frederic Weisbecker wrote:
> Hi,
> 
> 
> On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> > Hi,
> > 
> > Here's the 9th version of the writeback patches. Changes since v8:

> I've just tested it on UP with a single disk.
> 
> I've run two parallel dbench tests on two partitions and
> tried it with this patch and without.
I also tested V9 with a multiple-dbench workload, starting multiple
dbench tasks where each task has 4 processes doing I/O on one partition
(file system). Mostly I use JBODs which have 7/11/13 disks.

I didn't find any regression between the vanilla and V9 kernels on this
workload.

> 
> I used 30 procs each for 600 secs.
> 
> You can see the result in attachment.
> And also there:
> 
> http://kernel.org/pub/linux/kernel/people/frederic/dbench.pdf
> http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda1.log
> http://kernel.org/pub/linux/kernel/people/frederic/bdi-writeback-hda3.log
> http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda1.log
> http://kernel.org/pub/linux/kernel/people/frederic/pdflush-hda3.log
> 
> 
> As you can see, bdi writeback is faster than pdflush on hda1 and slower
> on hda3. But, well that's not the point.
> 
> What I can observe here is the difference in the standard deviation
> of the rate between two parallel writers on the same device (but
> two different partitions, hence superblocks).
> 
> With pdflush, the distributed rate is much better balanced than
> with bdi writeback on a single device.
> 
> I'm not sure why. Is there something in these patches that makes
> several bdi flusher threads for the same bdi not well balanced
> between them?
> 
> Frederic.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-04 22:34           ` Frederic Weisbecker
@ 2009-06-05 19:15             ` Jens Axboe
  2009-06-05 21:14               ` Jan Kara
  2009-06-06  0:35               ` Frederic Weisbecker
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-05 19:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> > On Thu, Jun 04 2009, Jens Axboe wrote:
> > > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > > > 
> > > > > > I've just tested it on UP with a single disk.
> > > > > 
> > > > > I must say, I'm stunned at the amount of testing which people are
> > > > > performing on this patchset.  Normally when someone sends out a
> > > > > patchset it just sort of lands with a dull thud.
> > > > > 
> > > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > > > 
> > > > 
> > > > I don't know how he did it either. I was reading these patches and *something*
> > > > pushed me to my testbox, and then I tested...
> > > > 
> > > > Jens, how do you do that?
> > > 
> > > Heh, not sure :-)
> > > 
> > > But indeed, thanks for the testing. It looks quite interesting. I'm
> > > guessing it probably has to do with who ends up doing the balancing and
> > > the fact that the flusher threads block; that may change the picture a
> > > bit. So it may just be that it'll require a few VM tweaks. I'll
> > > definitely look into it and try to reproduce your results.
> > > 
> > > Did you run it a 2nd time on each drive and check if the results were
> > > (approximately) consistent on the two drives?
> > 
> > each partition... What IO scheduler did you use on hda?
> 
> 
> CFQ.
> 
>  
> > The main difference with this test case is that before we had two super
> > blocks, each with lists of dirty inodes. pdflush would attack those. Now
> > we have both the inodes from the two supers on a single set of lists on
> > the bdi. So either we have some ordering issue there (which is causing
> > the unfairness), or something else is.
> 
> 
> Yeah.
> But although these flushers are per-bdi, with a single list (well, three)
> of dirty inodes, it looks like the writeback is still performed per
> > superblock: the bdi work carries the superblock concerned, and the bdi
> > list is iterated in generic_sync_wb_inodes(), which only processes the
> > inodes belonging to that superblock. So there is a bit of per-superblock
> > serialization there and....

But in most cases sb == NULL, which means that the writeback does not
care. It should only pass in a valid sb if someone explicitly wants to
sync that sb.
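
As a sketch of that filtering (illustrative only, simplified from what the
patchset's generic_sync_wb_inodes() really does; write_inode_now() stands
in for the real writeback path):

/* Walk the bdi-wide dirty list; a NULL sb means "any superblock". */
static void sketch_sync_bdi_inodes(struct backing_dev_info *bdi,
				   struct super_block *sb)
{
	struct inode *inode, *tmp;

	list_for_each_entry_safe(inode, tmp, &bdi->b_io, i_list) {
		if (sb && inode->i_sb != sb)
			continue;		/* inode belongs to another super */
		write_inode_now(inode, 0);	/* stand-in for real writeback */
	}
}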

But the way that the lists are organized now does definitely open some
windows of unfairness for a test like yours. It's at the top of the
list to investigate on Monday.

> > So perhaps you can try with noop on hda to see if that changes the
> > picture?
> 
> 
> 
> The result with noop is even more impressive.
> 
> See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> 
> Also a comparison, noop with pdflush against noop with bdi writeback:
> 
> http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf

OK, so things aren't exactly peachy here to begin with. It may not
actually BE an issue, or at least not a new one, but that doesn't mean
that we should not attempt to quantify the impact.

How are you starting these runs? With a test like this, even a small
difference in start time can make a huge difference.
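
One way to take start-time skew out of a comparison like this is to release
both dbench instances from a common barrier; a hypothetical launcher sketch
(the mount points and dbench arguments are assumptions):

#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static void launch(int fds[2], const char *dir)
{
	if (fork() == 0) {
		char c;

		close(fds[1]);		/* keep only the read end */
		read(fds[0], &c, 1);	/* blocks until parent closes write end */
		if (chdir(dir) == 0)
			execlp("dbench", "dbench", "-t", "600", "30",
			       (char *)NULL);
		_exit(1);
	}
}

int main(void)
{
	int fds[2];

	if (pipe(fds))
		return 1;
	launch(fds, "/mnt/hda1");	/* assumed mount points */
	launch(fds, "/mnt/hda3");
	close(fds[1]);			/* releases both children at once */
	close(fds[0]);
	while (wait(NULL) > 0)
		;			/* wait for both runs to finish */
	return 0;
}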

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-05  1:14   ` Zhang, Yanmin
@ 2009-06-05 19:16     ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-05 19:16 UTC (permalink / raw)
  To: Zhang, Yanmin
  Cc: Frederic Weisbecker, linux-kernel, linux-fsdevel, tytso,
	chris.mason, david, hch, akpm, jack, richard, damien.wyart

On Fri, Jun 05 2009, Zhang, Yanmin wrote:
> On Thu, 2009-06-04 at 17:20 +0200, Frederic Weisbecker wrote:
> > Hi,
> >
> >
> > On Thu, May 28, 2009 at 01:46:33PM +0200, Jens Axboe wrote:
> > > Hi,
> > >
> > > Here's the 9th version of the writeback patches. Changes since v8:
> 
> > I've just tested it on UP with a single disk.
> >
> > I've run two parallel dbench tests on two partitions and
> > tried it with this patch and without.
> I also tested V9 with a multiple-dbench workload, starting multiple
> dbench tasks where each task has 4 processes doing I/O on one partition
> (file system). Mostly I use JBODs which have 7/11/13 disks.
> 
> I didn't find any regression between the vanilla and V9 kernels on
> this workload.

Ah that's good, thanks for that result as well :-)

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-05 19:15             ` Jens Axboe
@ 2009-06-05 21:14               ` Jan Kara
  2009-06-06  0:18                 ` Chris Mason
  2009-06-06  1:00                 ` Frederic Weisbecker
  2009-06-06  0:35               ` Frederic Weisbecker
  1 sibling, 2 replies; 66+ messages in thread
From: Jan Kara @ 2009-06-05 21:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Frederic Weisbecker, Andrew Morton, linux-kernel, linux-fsdevel,
	tytso, chris.mason, david, hch, jack, yanmin_zhang, richard,
	damien.wyart

On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > The result with noop is even more impressive.
> > 
> > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > 
> > Also a comparison, noop with pdflush against noop with bdi writeback:
> > 
> > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> 
> OK, so things aren't exactly peachy here to begin with. It may not
> actually BE an issue, or at least not a new one, but that doesn't mean
> that we should not attempt to quantify the impact.
  What looks interesting is also the overall throughput. With pdflush we
get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
So per-bdi seems to be *more* fair but throughput suffers a lot (which
might be inevitable due to incurred seeks).
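  (Taking those graph readings at face value, the aggregates come to roughly
2.5 + 26 = 28.5 MB/s for pdflush against 2.7 + 13 = 15.7 MB/s for per-bdi,
i.e. per-bdi ends up with little more than half the combined throughput.)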
  Frederic, how much does dbench achieve for you just on one partition
(test both consecutively if possible) with as many threads as those two
dbench instances have together? Thanks.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-05 21:14               ` Jan Kara
@ 2009-06-06  0:18                 ` Chris Mason
  2009-06-06  0:23                   ` Jan Kara
  2009-06-06  1:00                 ` Frederic Weisbecker
  1 sibling, 1 reply; 66+ messages in thread
From: Chris Mason @ 2009-06-06  0:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Frederic Weisbecker, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > The result with noop is even more impressive.
> > > 
> > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > 
> > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > 
> > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > 
> > OK, so things aren't exactly peachy here to begin with. It may not
> > actually BE an issue, or at least not a new one, but that doesn't mean
> > that we should not attempt to quantify the impact.
>   What looks interesting is also the overall throughput. With pdflush we
> get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> So per-bdi seems to be *more* fair but throughput suffers a lot (which
> might be inevitable due to incurred seeks).
>   Frederic, how much does dbench achieve for you just on one partition
> (test both consecutively if possible) with as many threads as those two
> dbench instances have together? Thanks.

Is the graph showing us dbench tput or disk tput?  I'm assuming it is
disk tput, so bdi may just be writing less?
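
One way to tell them apart: dbench reports its own throughput, while disk
throughput can be sampled from /proc/diskstats.  A hypothetical helper
(field layout as in Documentation/iostats.txt); sample it twice and divide
the delta, at 512 bytes per sector, by the interval:

#include <stdio.h>
#include <string.h>

/* Return total sectors written to "dev" so far, or -1 on error. */
static long long sectors_written(const char *dev)
{
	char line[256], name[32];
	unsigned int major, minor;
	unsigned long long rd_ios, rd_merges, rd_sec, rd_ticks;
	unsigned long long wr_ios, wr_merges, wr_sec;
	long long ret = -1;
	FILE *f = fopen("/proc/diskstats", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, " %u %u %31s %llu %llu %llu %llu %llu %llu %llu",
			   &major, &minor, name, &rd_ios, &rd_merges,
			   &rd_sec, &rd_ticks, &wr_ios, &wr_merges,
			   &wr_sec) == 10 && strcmp(name, dev) == 0) {
			ret = (long long)wr_sec; /* 7th stat field: sectors written */
			break;
		}
	}
	fclose(f);
	return ret;
}

int main(void)
{
	printf("%lld sectors written to hda so far\n", sectors_written("hda"));
	return 0;
}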

-chris


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-06  0:18                 ` Chris Mason
@ 2009-06-06  0:23                   ` Jan Kara
  2009-06-06  1:06                     ` Frederic Weisbecker
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-06-06  0:23 UTC (permalink / raw)
  To: Chris Mason, Jan Kara, Jens Axboe, Frederic Weisbecker, Andrew Morton

On Fri 05-06-09 20:18:15, Chris Mason wrote:
> On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > The result with noop is even more impressive.
> > > > 
> > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > 
> > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > 
> > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > 
> > > OK, so things aren't exactly peachy here to begin with. It may not
> > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > that we should not attempt to quantify the impact.
> >   What looks interesting is also the overall throughput. With pdflush we
> > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > might be inevitable due to incurred seeks).
> >   Frederic, how much does dbench achieve for you just on one partition
> > (test both consecutively if possible) with as many threads as those two
> > dbench instances have together? Thanks.
> 
> Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> disk tput, so bdi may just be writing less?
  Good question. I was assuming dbench throughput :).

									Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-05 19:15             ` Jens Axboe
  2009-06-05 21:14               ` Jan Kara
@ 2009-06-06  0:35               ` Frederic Weisbecker
  1 sibling, 0 replies; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-06  0:35 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Andrew Morton, linux-kernel, linux-fsdevel, tytso, chris.mason,
	david, hch, jack, yanmin_zhang, richard, damien.wyart

On Fri, Jun 05, 2009 at 09:15:28PM +0200, Jens Axboe wrote:
> On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > On Thu, Jun 04, 2009 at 10:10:12PM +0200, Jens Axboe wrote:
> > > On Thu, Jun 04 2009, Jens Axboe wrote:
> > > > On Thu, Jun 04 2009, Frederic Weisbecker wrote:
> > > > > On Thu, Jun 04, 2009 at 12:07:26PM -0700, Andrew Morton wrote:
> > > > > > On Thu, 4 Jun 2009 17:20:44 +0200 Frederic Weisbecker <fweisbec@gmail.com> wrote:
> > > > > > 
> > > > > > > I've just tested it on UP with a single disk.
> > > > > > 
> > > > > > I must say, I'm stunned at the amount of testing which people are
> > > > > > performing on this patchset.  Normally when someone sends out a
> > > > > > patchset it just sort of lands with a dull thud.
> > > > > > 
> > > > > > I'm not sure what Jens did right to make all this happen, but thanks!
> > > > > 
> > > > > 
> > > > > I don't know how he did it either. I was reading these patches and *something*
> > > > > pushed me to my testbox, and then I tested...
> > > > > 
> > > > > Jens, how do you do that?
> > > > 
> > > > Heh, not sure :-)
> > > > 
> > > > But indeed, thanks for the testing. It looks quite interesting. I'm
> > > > guessing it probably has to do with who ends up doing the balancing and
> > > > the fact that the flusher threads block; that may change the picture a
> > > > bit. So it may just be that it'll require a few VM tweaks. I'll
> > > > definitely look into it and try to reproduce your results.
> > > > 
> > > > Did you run it a 2nd time on each drive and check if the results were
> > > > (approximately) consistent on the two drives?
> > > 
> > > each partition... What IO scheduler did you use on hda?
> > 
> > 
> > CFQ.
> > 
> >  
> > > The main difference with this test case is that before we had two super
> > > blocks, each with lists of dirty inodes. pdflush would attack those. Now
> > > we have both the inodes from the two supers on a single set of lists on
> > > the bdi. So either we have some ordering issue there (which is causing
> > > the unfairness), or something else is.
> > 
> > 
> > Yeah.
> > But although these flushers are per-bdi, with a single list (well, three)
> > of dirty inodes, it looks like the writeback is still performed per
> > superblock: the bdi work names the superblock concerned,
> > and the bdi list is iterated in generic_sync_wb_inodes(), which
> > only processes the inodes for the given superblock. So there is
> > a bit of per-superblock serialization there and....
> 
> But in most cases sb == NULL, which means that the writeback does not
> care. It should only pass in a valid sb if someone explicitly wants to
> sync that sb.
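
In code the convention described above is roughly this (a hypothetical
helper for illustration, not the actual fs-writeback.c code):

#include <linux/fs.h>

static int inode_matches_sb(struct inode *inode, struct super_block *sb)
{
	/* sb == NULL: the caller does not care which superblock this is */
	if (!sb)
		return 1;

	/* explicit sync of one sb: skip inodes belonging to other supers */
	return inode->i_sb == sb;
}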


Ah ok.

 
> But the way that the lists are organized now does definitely open some
> windows of unfairness for a test like yours. It's on the top of the
> investigate list for monday.



I'll stay tuned.



> > > So perhaps you can try with noop on hda to see if that changes the
> > > picture?
> > 
> > 
> > 
> > The result with noop is even more impressive.
> > 
> > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > 
> > Also a comparison, noop with pdflush against noop with bdi writeback:
> > 
> > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> 
> OK, so things aren't exactly peachy here to begin with. It may not
> actually BE an issue, or at least not a new one, but that doesn't mean
> that we should not attempt to quantify the impact.
> 
> How are you starting these runs? With a test like this, even a small
> difference in start time can make a huge difference.


Hmm, in a kind of draft way :)
I pre-write the command on two consoles, each on one of the
partitions concerned, then I hit enter on each one.

So there is always one that is started before the other with
some delay. And it looks like the first often wins the race.
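
For what it's worth, a throwaway helper that forks both instances back to
back would shrink the start-time skew to microseconds (the mount points
and the dbench -D/-t flags are assumptions about the local setup):

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void run_dbench(const char *dir)
{
	pid_t pid = fork();

	if (pid == 0) {
		/* 100 clients for 600 seconds in the given directory */
		execlp("dbench", "dbench", "-D", dir, "-t", "600", "100",
		       (char *)NULL);
		perror("execlp");
		exit(1);
	}
}

int main(void)
{
	run_dbench("/mnt/part1");
	run_dbench("/mnt/part2");

	/* reap both instances */
	while (wait(NULL) > 0)
		;
	return 0;
}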

Frederic.


 
> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-05 21:14               ` Jan Kara
  2009-06-06  0:18                 ` Chris Mason
@ 2009-06-06  1:00                 ` Frederic Weisbecker
  1 sibling, 0 replies; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-06  1:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, Andrew Morton, linux-kernel, linux-fsdevel, tytso,
	chris.mason, david, hch, yanmin_zhang, richard, damien.wyart

On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > The result with noop is even more impressive.
> > > 
> > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > 
> > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > 
> > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > 
> > OK, so things aren't exactly peachy here to begin with. It may not
> > actually BE an issue, or at least not a new one, but that doesn't mean
> > that we should not attempt to quantify the impact.
>   What looks interesting is also the overall throughput. With pdflush we
> get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> So per-bdi seems to be *more* fair but throughput suffers a lot (which
> might be inevitable due to incurred seeks).



Heh, indeed, I was confused by the colors here, but yes, pdflush has
a faster total and higher unfairness with noop, at least in this test.



>   Frederic, how much does dbench achieve for you just on one partition
> (test both consecutively if possible) with as many threads as those
> two dbench instances have together? Thanks.



Good idea, I'll try it out so that there won't be any per-superblock
ordering there, or whatever else it could be.

Thanks.

 
> 									Honza
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-06  0:23                   ` Jan Kara
@ 2009-06-06  1:06                     ` Frederic Weisbecker
  2009-06-08  9:23                       ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-06  1:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Chris Mason, Jens Axboe, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > The result with noop is even more impressive.
> > > > > 
> > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > 
> > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > 
> > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > 
> > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > that we should not attempt to quantify the impact.
> > >   What looks interesting is also the overall throughput. With pdflush we
> > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > might be inevitable due to incurred seeks).
> > >   Frederic, how much does dbench achieve for you just on one partition
> > > (test both consecutively if possible) with as many threads as those
> > > two dbench instances have together? Thanks.
> > 
> > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > disk tput, so bdi may just be writing less?
>   Good question. I was assuming dbench throughput :).
> 
> 									Honza


Yeah it's dbench. Maybe that's not the right tool to measure the writeback
layer, even though dbench results are necessarily influenced by the writeback
behaviour.

Maybe I should use something else?

Note that if you want I can put some surgical trace_printk() calls
in fs/fs-writeback.c
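
For example, something along these lines dropped into the per-bdi
writeback path (the placement and the fields are only an illustration):

#include <linux/kernel.h>
#include <linux/device.h>
#include <linux/backing-dev.h>

static void trace_wb_pass(struct backing_dev_info *bdi,
			  struct writeback_control *wbc)
{
	trace_printk("bdi %s: nr_to_write %ld pages_skipped %ld more_io %d\n",
		     dev_name(bdi->dev), wbc->nr_to_write,
		     wbc->pages_skipped, wbc->more_io);
}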


> 
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-06  1:06                     ` Frederic Weisbecker
@ 2009-06-08  9:23                       ` Jens Axboe
  2009-06-08 12:23                         ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-06-08  9:23 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Jan Kara, Chris Mason, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > The result with noop is even more impressive.
> > > > > > 
> > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > 
> > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > 
> > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > 
> > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > > that we should not attempt to quantify the impact.
> > > >   What looks interesting is also the overall throughput. With pdflush we
> > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > might be inevitable due to incurred seeks).
> > > >   Frederic, how much does dbench achieve for you just on one partition
> > > > (test both consecutively if possible) with as many threads as those
> > > > two dbench instances have together? Thanks.
> > > 
> > > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > > disk tput, so bdi may just be writing less?
> >   Good question. I was assuming dbench throughput :).
> > 
> > 									Honza
> 
> 
> Yeah it's dbench. Maybe that's not the right tool to measure the writeback
> layer, even though dbench results are necessarily influenced by the writeback
> behaviour.
> 
> Maybe I should use something else?
> 
> Note that if you want I can put some surgical trace_printk() calls
> in fs/fs-writeback.c

FWIW, I ran a similar test here just now. CFQ was used, two partitions
on an (otherwise) idle drive. I used 30 clients per dbench and 600s
runtime. Results are nearly identical, both throughout the run and
total:

/dev/sdb1
Throughput 165.738 MB/sec  30 clients  30 procs  max_latency=459.002 ms

/dev/sdb2
Throughput 165.773 MB/sec  30 clients  30 procs  max_latency=607.198 ms

The flusher threads see very little exercise here.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-08  9:23                       ` Jens Axboe
@ 2009-06-08 12:23                         ` Jan Kara
  2009-06-08 12:28                           ` Jens Axboe
  0 siblings, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-06-08 12:23 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Frederic Weisbecker, Jan Kara, Chris Mason, Andrew Morton,
	linux-kernel, linux-fsdevel, tytso, david, hch, yanmin_zhang,
	richard, damien.wyart

On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > The result with noop is even more impressive.
> > > > > > > 
> > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > 
> > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > 
> > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > 
> > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > > > that we should not attempt to quantify the impact.
> > > > >   What looks interesting is also the overall throughput. With pdflush we
> > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > might be inevitable due to incurred seeks).
> > > > >   Frederic, how much does dbench achieve for you just on one partition
> > > > > (test both consecutively if possible) with as many threads as those
> > > > > two dbench instances have together? Thanks.
> > > > 
> > > > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > > > disk tput, so bdi may just be writing less?
> > >   Good question. I was assuming dbench throughput :).
> > > 
> > > 									Honza
> > 
> > 
> > Yeah it's dbench. Maybe that's not the right tool to measure the writeback
> > layer, even though dbench results are necessarily influenced by the writeback
> > behaviour.
> > 
> > Maybe I should use something else?
> > 
> > Note that if you want I can put some surgical trace_printk() calls
> > in fs/fs-writeback.c
> 
> FWIW, I ran a similar test here just now. CFQ was used, two partitions
> on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> runtime. Results are nearly identical, both throughout the run and
> total:
> 
> /dev/sdb1
> Throughput 165.738 MB/sec  30 clients  30 procs  max_latency=459.002 ms
> 
> /dev/sdb2
> Throughput 165.773 MB/sec  30 clients  30 procs  max_latency=607.198 ms
  Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
like quite a lot ;). This usually happens with dbench when the processes
manage to delete / redirty data before the writeback thread gets to them (so
some IO happens in memory only and throughput is bound by the CPU / memory
speed). So I think you are on a different part of the performance curve
than Frederic. Probably you have to run with more threads so that the dbench
threads get throttled because of the total amount of dirty data generated...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-08 12:23                         ` Jan Kara
@ 2009-06-08 12:28                           ` Jens Axboe
  2009-06-08 13:01                             ` Jan Kara
  2009-06-09 18:39                             ` Frederic Weisbecker
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2009-06-08 12:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Frederic Weisbecker, Chris Mason, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Mon, Jun 08 2009, Jan Kara wrote:
> On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > The result with noop is even more impressive.
> > > > > > > > 
> > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > > 
> > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > > 
> > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > > 
> > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > > > > that we should not attempt to quantify the impact.
> > > > > >   What looks interesting is also the overall throughput. With pdflush we
> > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > might be inevitable due to incurred seeks).
> > > > > >   Frederic, how much does dbench achieve for you just on one partition
> > > > > > (test both consecutively if possible) with as many threads as those
> > > > > > two dbench instances have together? Thanks.
> > > > > 
> > > > > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > > > > disk tput, so bdi may just be writing less?
> > > >   Good question. I was assuming dbench throughput :).
> > > > 
> > > > 									Honza
> > > 
> > > 
> > > Yeah it's dbench. Maybe that's not the right tool to measure the writeback
> > > layer, even though dbench results are necessarily influenced by the writeback
> > > behaviour.
> > > 
> > > Maybe I should use something else?
> > > 
> > > Note that if you want I can put some surgical trace_printk() calls
> > > in fs/fs-writeback.c
> > 
> > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > runtime. Results are nearly identical, both throughout the run and
> > total:
> > 
> > /dev/sdb1
> > Throughput 165.738 MB/sec  30 clients  30 procs  max_latency=459.002 ms
> > 
> > /dev/sdb2
> > Throughput 165.773 MB/sec  30 clients  30 procs  max_latency=607.198 ms
>   Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> like quite a lot ;). This usually happens with dbench when the processes
> manage to delete / redirty data before the writeback thread gets to them (so
> some IO happens in memory only and throughput is bound by the CPU / memory
> speed). So I think you are on a different part of the performance curve
> than Frederic. Probably you have to run with more threads so that the dbench
> threads get throttled because of the total amount of dirty data generated...

Certainly, the actual disk data rate was consistently in the
60-70MB/sec region. The issue is likely that the box has 6GB of RAM; if
I boot with less, then 30 clients will do.

But unless the situation changes radically with memory pressure, it
still shows a fair distribution of IO between the two. Since they have
identical results throughout, it should be safe to assume that they have
equal bandwidth distribution at the disk end. A fast dbench run is one
that doesn't touch the disk at all; once you start touching the disk, you
lose :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-08 12:28                           ` Jens Axboe
@ 2009-06-08 13:01                             ` Jan Kara
  2009-06-09 18:39                             ` Frederic Weisbecker
  1 sibling, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-06-08 13:01 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Frederic Weisbecker, Chris Mason, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Mon 08-06-09 14:28:34, Jens Axboe wrote:
> On Mon, Jun 08 2009, Jan Kara wrote:
> > On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > > The result with noop is even more impressive.
> > > > > > > > > 
> > > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > > > 
> > > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > > > 
> > > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > > > 
> > > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > > > > > that we should not attempt to quantify the impact.
> > > > > > >   What looks interesting is also the overall throughput. With pdflush we
> > > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > > might be inevitable due to incurred seeks).
> > > > > > >   Frederic, how much does dbench achieve for you just on one partition
> > > > > > > (test both consecutively if possible) with as many threads as those
> > > > > > > two dbench instances have together? Thanks.
> > > > > > 
> > > > > > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > > > > > disk tput, so bdi may just be writing less?
> > > > >   Good question. I was assuming dbench throughput :).
> > > > > 
> > > > > 									Honza
> > > > 
> > > > 
> > > > Yeah it's dbench. Maybe that's not the right tool to measure the writeback
> > > > layer, even though dbench results are necessarily influenced by the writeback
> > > > behaviour.
> > > > 
> > > > Maybe I should use something else?
> > > > 
> > > > Note that if you want I can put some surgical trace_printk() calls
> > > > in fs/fs-writeback.c
> > > 
> > > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > > runtime. Results are nearly identical, both throughout the run and
> > > total:
> > > 
> > > /dev/sdb1
> > > Throughput 165.738 MB/sec  30 clients  30 procs  max_latency=459.002 ms
> > > 
> > > /dev/sdb2
> > > Throughput 165.773 MB/sec  30 clients  30 procs  max_latency=607.198 ms
> >   Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> > like quite a lot ;). This usually happens with dbench when the processes
> > manage to delete / redirty data before the writeback thread gets to them (so
> > some IO happens in memory only and throughput is bound by the CPU / memory
> > speed). So I think you are on a different part of the performance curve
> > than Frederic. Probably you have to run with more threads so that the dbench
> > threads get throttled because of the total amount of dirty data generated...
> 
> Certainly, the actual disk data rate was consistently in the
> 60-70MB/sec region. The issue is likely that the box has 6GB of RAM; if
> I boot with less, then 30 clients will do.
  Yes, that would do as well.

> But unless the situation changes radically with memory pressure, it
> still shows a fair distribution of IO between the two. Since they have
> identical results throughout, it should be safe to assume that they have
> equal bandwidth distribution at the disk end. A fast dbench run is one
  Yes, I agree. Your previous test indirectly shows fair distribution
on the disk end (with blktrace you could actually confirm it directly).

> that doesn't touch the disk at all, once you start touching disk you
> lose :-)

									Honza  
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 0/11] Per-bdi writeback flusher threads v9
  2009-06-08 12:28                           ` Jens Axboe
  2009-06-08 13:01                             ` Jan Kara
@ 2009-06-09 18:39                             ` Frederic Weisbecker
  1 sibling, 0 replies; 66+ messages in thread
From: Frederic Weisbecker @ 2009-06-09 18:39 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Chris Mason, Andrew Morton, linux-kernel,
	linux-fsdevel, tytso, david, hch, yanmin_zhang, richard,
	damien.wyart

On Mon, Jun 08, 2009 at 02:28:34PM +0200, Jens Axboe wrote:
> On Mon, Jun 08 2009, Jan Kara wrote:
> > On Mon 08-06-09 11:23:38, Jens Axboe wrote:
> > > On Sat, Jun 06 2009, Frederic Weisbecker wrote:
> > > > On Sat, Jun 06, 2009 at 02:23:40AM +0200, Jan Kara wrote:
> > > > > On Fri 05-06-09 20:18:15, Chris Mason wrote:
> > > > > > On Fri, Jun 05, 2009 at 11:14:38PM +0200, Jan Kara wrote:
> > > > > > > On Fri 05-06-09 21:15:28, Jens Axboe wrote:
> > > > > > > > On Fri, Jun 05 2009, Frederic Weisbecker wrote:
> > > > > > > > > The result with noop is even more impressive.
> > > > > > > > > 
> > > > > > > > > See: http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop.pdf
> > > > > > > > > 
> > > > > > > > > Also a comparison, noop with pdflush against noop with bdi writeback:
> > > > > > > > > 
> > > > > > > > > http://kernel.org/pub/linux/kernel/people/frederic/dbench-noop-cmp.pdf
> > > > > > > > 
> > > > > > > > OK, so things aren't exactly peachy here to begin with. It may not
> > > > > > > > actually BE an issue, or at least not a new one, but that doesn't mean
> > > > > > > > that we should not attempt to quantify the impact.
> > > > > > >   What looks interesting is also the overall throughput. With pdflush we
> > > > > > > get to 2.5 MB/s + 26 MB/s while with per-bdi we get to 2.7 MB/s + 13 MB/s.
> > > > > > > So per-bdi seems to be *more* fair but throughput suffers a lot (which
> > > > > > > might be inevitable due to incurred seeks).
> > > > > > >   Frederic, how much does dbench achieve for you just on one partition
> > > > > > > (test both consecutively if possible) with as many threads as those
> > > > > > > two dbench instances have together? Thanks.
> > > > > > 
> > > > > > Is the graph showing us dbench tput or disk tput?  I'm assuming it is
> > > > > > disk tput, so bdi may just be writing less?
> > > > >   Good question. I was assuming dbench throughput :).
> > > > > 
> > > > > 									Honza
> > > > 
> > > > 
> > > > Yeah it's dbench. Maybe that's not the right tool to measure the writeback
> > > > layer, even though dbench results are necessarily influenced by the writeback
> > > > behaviour.
> > > > 
> > > > Maybe I should use something else?
> > > > 
> > > > Note that if you want I can put some surgical trace_printk() calls
> > > > in fs/fs-writeback.c
> > > 
> > > FWIW, I ran a similar test here just now. CFQ was used, two partitions
> > > on an (otherwise) idle drive. I used 30 clients per dbench and 600s
> > > runtime. Results are nearly identical, both throughout the run and
> > > total:
> > > 
> > > /dev/sdb1
> > > Throughput 165.738 MB/sec  30 clients  30 procs  max_latency=459.002 ms
> > > 
> > > /dev/sdb2
> > > Throughput 165.773 MB/sec  30 clients  30 procs  max_latency=607.198 ms
> >   Hmm, interesting. 165 MB/sec (in fact 330 MB/sec for that drive) sounds
> > like quite a lot ;). This usually happens with dbench when the processes
> > manage to delete / redirty data before the writeback thread gets to them (so
> > some IO happens in memory only and throughput is bound by the CPU / memory
> > speed). So I think you are on a different part of the performance curve
> > than Frederic. Probably you have to run with more threads so that the dbench
> > threads get throttled because of the total amount of dirty data generated...
> 
> Certainly, the actual disk data rate was consistently in the
> 60-70MB/sec region. The issue is likely that the box has 6GB of RAM; if
> I boot with less, then 30 clients will do.
> 
> But unless the situation changes radically with memory pressure, it
> still shows a fair distribution of IO between the two. Since they have
> identical results throughout, it should be safe to assume that they have
> equal bandwidth distribution at the disk end. A fast dbench run is one
> that doesn't touch the disk at all, once you start touching disk you
> lose :-)



When I ran my tests, I only had 384 MB of memory, 100 threads and
only one CPU. So I was in constant writeback, which should
be smoother with 6 GB of memory and 30 threads.

Maybe that's why you had such a well-balanced result... Or maybe
there is too much entropy in my testbox :)


 
> -- 
> Jens Axboe
> 


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27 17:50     ` Jens Axboe
@ 2009-05-28 14:45       ` Jan Kara
  0 siblings, 0 replies; 66+ messages in thread
From: Jan Kara @ 2009-05-28 14:45 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
	yanmin_zhang, richard, damien.wyart

On Wed 27-05-09 19:50:19, Jens Axboe wrote:
> > > +static int bdi_forker_task(void *ptr)
> > > +{
> > > +	struct backing_dev_info *me = ptr;
> > > +	DEFINE_WAIT(wait);
> > > +
> > > +	for (;;) {
> > > +		struct backing_dev_info *bdi, *tmp;
> > > +
> > > +		/*
> > > +		 * Do this periodically, like kupdated() did before.
> > > +		 */
> > > +		sync_supers();
> >   Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
> > (and thus not being able to start new threads) in sync_supers() when some
> > fs is busy and another needs to create a flusher thread...
> >   Why not just have a separate thread for this? I know we have lots of
> > kernel threads already, but this one seems like a useful one... Or do you
> > plan on getting rid of this completely sometime in the near future and
> > syncing supers from the per-bdi threads too (which would make a lot of sense to me)?
> 
> It's ugly, and I think this is precisely what Ted hit. He's in umount,
> has ->s_umount sem held and waiting for IO.
  I've looked into this a bit more because it was still nagging in the back
of my mind, and I think there indeed is a race (although your sync writeback
waiting has now hidden it). The problem is the following:
  bdi flusher threads live independently of the filesystem being mounted or
not. So it can happen that bdi_kupdate() or bdi_pdflush() runs in parallel
with umount running in another thread. That must not be allowed to happen,
because either
  1) umount can fail with EBUSY because generic_sync_bdi_inodes() holds
    a reference to an inode, or
  2) we race more subtly and we get to call __writeback_single_inode() after
    the filesystem has been unmounted (put_super() has been called).

  So I believe you simply have to deal with superblock references and the
umount semaphore in your patches...
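
A minimal sketch of what that could look like, borrowing the pinning
pattern from the old writeback_inodes() (the helper name is made up):

static void bdi_sync_one_sb(struct super_block *sb,
			    struct writeback_control *wbc)
{
	spin_lock(&sb_lock);
	sb->s_count++;			/* pin the superblock */
	spin_unlock(&sb_lock);

	/*
	 * If ->s_umount cannot be taken, an umount is in flight and the
	 * sb's inodes must not be touched.
	 */
	if (down_read_trylock(&sb->s_umount)) {
		if (sb->s_root)		/* still mounted? */
			generic_sync_bdi_inodes(sb, wbc);
		up_read(&sb->s_umount);
	}

	spin_lock(&sb_lock);
	__put_super_and_need_restart(sb);
	spin_unlock(&sb_lock);
}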

> So there's definitely trouble brewing there. As a short term solution, a
> separate thread will do. Longer term, the sync_supers_bdi() type setup I
> mentioned earlier would probably be the best. But once we start dealing
> with the super blocks, we have to be more careful with referencing.
> Which we also discussed in a previous mail :-)

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27 15:14   ` Jan Kara
@ 2009-05-27 17:50     ` Jens Axboe
  2009-05-28 14:45       ` Jan Kara
  0 siblings, 1 reply; 66+ messages in thread
From: Jens Axboe @ 2009-05-27 17:50 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm,
	yanmin_zhang, richard, damien.wyart

On Wed, May 27 2009, Jan Kara wrote:
>   The patch set seems easier to read now. Thanks for cleaning it up.

No problem. The issue is mainly that I have to maintain these
intermediate steps, and as code gets added and bugs fixed, things have
to be shuffled back and forth. Now that things are stabilizing more,
it's easier.

> > +void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
> > +{
> > +	struct backing_dev_info *bdi, *tmp;
> > +
> > +	mutex_lock(&bdi_lock);
> > +
> > +	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> > +		if (!bdi_has_dirty_io(bdi))
> > +			continue;
> > +		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
> > +	}
> > +
> > +	mutex_unlock(&bdi_lock);
> > +}
> > +
>   Looking at this function, I've realized that wbc->nr_to_write has a somewhat
> silly meaning here. Each BDI will be kicked to write nr_to_write pages,
> which is not what it originally meant. I don't think it really matters,
> but we should keep this in mind...

Yes, I know about that difference. I don't think it matters a whole lot,
since we typically just use MAX_WRITEBACK_PAGES which is only 4MB of IO.
And in the case of writing back the world, we'll just come up short on each
bdi.

> > @@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
> >  void generic_sync_sb_inodes(struct super_block *sb,
> >  				struct writeback_control *wbc)
> >  {
> > -	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
> > -	struct backing_dev_info *bdi;
> > -
> > -	mutex_lock(&bdi_lock);
> > -	list_for_each_entry(bdi, &bdi_list, bdi_list)
> > -		generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
> > -	mutex_unlock(&bdi_lock);
> > +	if (wbc->bdi)
> > +		generic_sync_bdi_inodes(sb, wbc);
> > +	else
> > +		bdi_writeback_all(sb, wbc);
>   I guess this asynchrony is just transient...

Right, if it bothers you, I can fix that up too :-)

> > +static int bdi_forker_task(void *ptr)
> > +{
> > +	struct backing_dev_info *me = ptr;
> > +	DEFINE_WAIT(wait);
> > +
> > +	for (;;) {
> > +		struct backing_dev_info *bdi, *tmp;
> > +
> > +		/*
> > +		 * Do this periodically, like kupdated() did before.
> > +		 */
> > +		sync_supers();
>   Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
> (and thus not being able to start new threads) in sync_supers() when some
> fs is busy and another needs to create a flusher thread...
>   Why not just have a separate thread for this? I know we have lots of
> kernel threads already, but this one seems like a useful one... Or do you
> plan on getting rid of this completely sometime in the near future and
> syncing supers from the per-bdi threads too (which would make a lot of sense to me)?

It's ugly, and I think this is precisely what Ted hit. He's in umount,
has ->s_umount sem held and waiting for IO.

So there's definitely trouble brewing there. As a short term solution, a
separate thread will do. Longer term, the sync_supers_bdi() type setup I
mentioned earlier would probably be the best. But once we start dealing
with the super blocks, we have to be more careful with referencing.
Which we also discussed in a previous mail :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27  9:41 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
  2009-05-27 11:11   ` Peter Zijlstra
@ 2009-05-27 15:14   ` Jan Kara
  2009-05-27 17:50     ` Jens Axboe
  1 sibling, 1 reply; 66+ messages in thread
From: Jan Kara @ 2009-05-27 15:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
	yanmin_zhang, richard, damien.wyart

  The patch set seems easier to read now. Thanks for cleaning it up.

> +void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
> +{
> +	struct backing_dev_info *bdi, *tmp;
> +
> +	mutex_lock(&bdi_lock);
> +
> +	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> +		if (!bdi_has_dirty_io(bdi))
> +			continue;
> +		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
> +	}
> +
> +	mutex_unlock(&bdi_lock);
> +}
> +
  Looking at this function, I've realized that wbc->nr_to_write has a somewhat
silly meaning here. Each BDI will be kicked to write nr_to_write pages,
which is not what it originally meant. I don't think it really matters,
but we should keep this in mind...
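
To spell the change out, here is the same loop annotated with a
hypothetical counter:

static long bdi_writeback_all_budget(struct super_block *sb,
				     struct writeback_control *wbc)
{
	struct backing_dev_info *bdi, *tmp;
	long total = 0;

	mutex_lock(&bdi_lock);
	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
		if (!bdi_has_dirty_io(bdi))
			continue;
		/* every dirty bdi is handed the full budget... */
		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
		total += wbc->nr_to_write;
	}
	mutex_unlock(&bdi_lock);

	/*
	 * ...so with N dirty bdis up to N * nr_to_write pages get queued,
	 * where nr_to_write used to bound the total across all devices.
	 */
	return total;
}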

> @@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
>  void generic_sync_sb_inodes(struct super_block *sb,
>  				struct writeback_control *wbc)
>  {
> -	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
> -	struct backing_dev_info *bdi;
> -
> -	mutex_lock(&bdi_lock);
> -	list_for_each_entry(bdi, &bdi_list, bdi_list)
> -		generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
> -	mutex_unlock(&bdi_lock);
> +	if (wbc->bdi)
> +		generic_sync_bdi_inodes(sb, wbc);
> +	else
> +		bdi_writeback_all(sb, wbc);
  I guess this asynchrony is just transient...

> +static int bdi_forker_task(void *ptr)
> +{
> +	struct backing_dev_info *me = ptr;
> +	DEFINE_WAIT(wait);
> +
> +	for (;;) {
> +		struct backing_dev_info *bdi, *tmp;
> +
> +		/*
> +		 * Do this periodically, like kupdated() did before.
> +		 */
> +		sync_supers();
  Ugh, this looks nasty. Moreover I'm afraid of forker_task() getting stuck
(and thus not being able to start new threads) in sync_supers() when some
fs is busy and another needs to create a flusher thread...
  Why not just have a separate thread for this? I know we have lots of
kernel threads already, but this one seems like a useful one... Or do you
plan on getting rid of this completely sometime in the near future and
syncing supers from the per-bdi threads too (which would make a lot of sense to me)?
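
One possible shape for such a dedicated thread, as a rough sketch (the
thread function name and the reuse of dirty_writeback_interval for the
wakeup period are just placeholders):

#include <linux/freezer.h>
#include <linux/kthread.h>

static int sync_supers_thread(void *unused)
{
	set_freezable();

	while (!kthread_should_stop()) {
		/* write back dirty superblocks, nothing else */
		sync_supers();

		schedule_timeout_interruptible(
			msecs_to_jiffies(dirty_writeback_interval * 10));
		try_to_freeze();
	}

	return 0;
}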

> +
> +		/*
> +		 * Temporary measure, we want to make sure we don't see
> +		 * dirty data on the default backing_dev_info
> +		 */
> +		if (bdi_has_dirty_io(me))
> +			bdi_flush_io(me);
> +
> +		prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
> +
> +		mutex_lock(&bdi_lock);
> +
> +		/*
> +		 * Check if any existing bdi's have dirty data without
> +		 * a thread registered. If so, set that up.
> +		 */
> +		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
> +			if (bdi->task || !bdi_has_dirty_io(bdi))
> +				continue;
> +
> +			bdi_add_default_flusher_task(bdi);
> +		}
> +
> +		if (list_empty(&bdi_pending_list)) {
> +			unsigned long wait;
> +
> +			mutex_unlock(&bdi_lock);
> +			wait = msecs_to_jiffies(dirty_writeback_interval * 10);
> +			schedule_timeout(wait);
> +			try_to_freeze();
> +			continue;
> +		}
> +
> +		/*
> +		 * This is our real job - check for pending entries in
> +		 * bdi_pending_list, and create the tasks that got added
> +		 */
> +		bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
> +				 bdi_list);
> +		list_del_init(&bdi->bdi_list);
> +		mutex_unlock(&bdi_lock);
> +
> +		BUG_ON(bdi->task);
> +
> +		bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
> +					dev_name(bdi->dev));
> +		/*
> +		 * If task creation fails, then readd the bdi to
> +		 * the pending list and force writeout of the bdi
> +		 * from this forker thread. That will free some memory
> +		 * and we can try again.
> +		 */
> +		if (!bdi->task) {
> +			/*
> +			 * Add this 'bdi' to the back, so we get
> +			 * a chance to flush other bdi's to free
> +			 * memory.
> +			 */
> +			mutex_lock(&bdi_lock);
> +			list_add_tail(&bdi->bdi_list, &bdi_pending_list);
> +			mutex_unlock(&bdi_lock);
> +
> +			bdi_flush_io(bdi);
> +		}
> +	}
> +
> +	finish_wait(&me->wait, &wait);
> +	return 0;
> +}

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27 11:11   ` Peter Zijlstra
@ 2009-05-27 11:24     ` Jens Axboe
  0 siblings, 0 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-27 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
	yanmin_zhang, richard, damien.wyart

On Wed, May 27 2009, Peter Zijlstra wrote:
> On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:
> 
> > +	if (writeback_acquire(bdi)) {
> > +		bdi->wb_arg.nr_pages = nr_pages;
> > +		bdi->wb_arg.sb = sb;
> > +		bdi->wb_arg.sync_mode = sync_mode;
> > +		/*
> > +		 * make above store seen before the task is woken
> > +		 */
> > +		smp_mb();
> > +		wake_up(&bdi->wait);
> > +	}
> 
> wake_up() implies a wmb() when we indeed do a wakeup, is that
> sufficient?

That is sufficient. I'll kill it in the next revision, seeing as this is
just an intermediate step, no harm done.

> > +int bdi_writeback_task(struct backing_dev_info *bdi)
> > +{
> > +	while (!kthread_should_stop()) {
> > +		unsigned long wait_jiffies;
> > +		DEFINE_WAIT(wait);
> > +
> > +		prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
> > +		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> > +		schedule_timeout(wait_jiffies);
> > +		try_to_freeze();
> > +
> > +		/*
> > +		 * We get here in two cases:
> > +		 *
> > +		 *  schedule_timeout() returned because the dirty writeback
> > +		 *  interval has elapsed. If that happens, we will be able
> > +		 *  to acquire the writeback lock and will proceed to do
> > +		 *  kupdated style writeout.
> > +		 *
> > +		 *  Someone called bdi_start_writeback(), which will acquire
> > +		 *  the writeback lock. This means our writeback_acquire()
> > +		 *  below will fail and we call into bdi_pdflush() for
> > +		 *  pdflush style writeout.
> > +		 *
> > +		 */
> > +		if (writeback_acquire(bdi))
> > +			bdi_kupdated(bdi);
> > +		else
> > +			bdi_pdflush(bdi);
> > +
> > +		writeback_release(bdi);
> > +		finish_wait(&bdi->wait, &wait);
> > +	}
> > +
> > +	return 0;
> > +}
> 
> the unpaired writeback_release() wrt writeback_acquire() looks odd.

Did you read the comment? :-)

> Also the prepare/finish wait bits seem oddly out of place. Are there
> really multiple waiters on bdi->wait? The above wake_up() seems to
> suggest not, since it directly modifies bdi state instead of queueing
> work.

Intermediate step; further along it should be clearer.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 66+ messages in thread

* Re: [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27  9:41 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
@ 2009-05-27 11:11   ` Peter Zijlstra
  2009-05-27 11:24     ` Jens Axboe
  2009-05-27 15:14   ` Jan Kara
  1 sibling, 1 reply; 66+ messages in thread
From: Peter Zijlstra @ 2009-05-27 11:11 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, david, hch, akpm, jack,
	yanmin_zhang, richard, damien.wyart

On Wed, 2009-05-27 at 11:41 +0200, Jens Axboe wrote:

> +	if (writeback_acquire(bdi)) {
> +		bdi->wb_arg.nr_pages = nr_pages;
> +		bdi->wb_arg.sb = sb;
> +		bdi->wb_arg.sync_mode = sync_mode;
> +		/*
> +		 * make above store seen before the task is woken
> +		 */
> +		smp_mb();
> +		wake_up(&bdi->wait);
> +	}

wake_up() implies a wmb() when we indeed do a wakeup, is that
sufficient?

> +int bdi_writeback_task(struct backing_dev_info *bdi)
> +{
> +	while (!kthread_should_stop()) {
> +		unsigned long wait_jiffies;
> +		DEFINE_WAIT(wait);
> +
> +		prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
> +		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
> +		schedule_timeout(wait_jiffies);
> +		try_to_freeze();
> +
> +		/*
> +		 * We get here in two cases:
> +		 *
> +		 *  schedule_timeout() returned because the dirty writeback
> +		 *  interval has elapsed. If that happens, we will be able
> +		 *  to acquire the writeback lock and will proceed to do
> +		 *  kupdated style writeout.
> +		 *
> +		 *  Someone called bdi_start_writeback(), which will acquire
> +		 *  the writeback lock. This means our writeback_acquire()
> +		 *  below will fail and we call into bdi_pdflush() for
> +		 *  pdflush style writeout.
> +		 *
> +		 */
> +		if (writeback_acquire(bdi))
> +			bdi_kupdated(bdi);
> +		else
> +			bdi_pdflush(bdi);
> +
> +		writeback_release(bdi);
> +		finish_wait(&bdi->wait, &wait);
> +	}
> +
> +	return 0;
> +}

the unpaired writeback_release() wrt writeback_acquire() looks odd.

Also the prepare/finish wait bits seem oddly out of place. Are there
really multiple waiters on bdi->wait? The above wake_up() seems to
suggest not, since it directly modifies bdi state instead of queueing
work.
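
If the answer is that only the one flusher task ever sleeps there, the
wait queue could be dropped in favour of waking that task directly,
roughly like this (a sketch of the alternative, not part of the patch):

static void bdi_kick(struct backing_dev_info *bdi)
{
	/* a single known consumer: no wait queue needed */
	if (bdi->task)
		wake_up_process(bdi->task);
}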



^ permalink raw reply	[flat|nested] 66+ messages in thread

* [PATCH 04/11] writeback: switch to per-bdi threads for flushing data
  2009-05-27  9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
@ 2009-05-27  9:41 ` Jens Axboe
  2009-05-27 11:11   ` Peter Zijlstra
  2009-05-27 15:14   ` Jan Kara
  0 siblings, 2 replies; 66+ messages in thread
From: Jens Axboe @ 2009-05-27  9:41 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, david, hch, akpm, jack, yanmin_zhang, richard,
	damien.wyart, Jens Axboe

This gets rid of pdflush for bdi writeout and kupdated style cleaning.
This is an experiment to see if we get better writeout behaviour with
per-bdi flushing. Some initial tests look pretty encouraging. A sample
ffsb workload that does random writes to files is about 8% faster here
on a simple SATA drive during the benchmark phase. File layout also seems
a LOT smoother in vmstat:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1      0 608848   2652 375372    0    0     0 71024  604    24  1 10 48 42
 0  1      0 549644   2712 433736    0    0     0 60692  505    27  1  8 48 44
 1  0      0 476928   2784 505192    0    0     4 29540  553    24  0  9 53 37
 0  1      0 457972   2808 524008    0    0     0 54876  331    16  0  4 38 58
 0  1      0 366128   2928 614284    0    0     4 92168  710    58  0 13 53 34
 0  1      0 295092   3000 684140    0    0     0 62924  572    23  0  9 53 37
 0  1      0 236592   3064 741704    0    0     4 58256  523    17  0  8 48 44
 0  1      0 165608   3132 811464    0    0     0 57460  560    21  0  8 54 38
 0  1      0 102952   3200 873164    0    0     4 74748  540    29  1 10 48 41
 0  1      0  48604   3252 926472    0    0     0 53248  469    29  0  7 47 45

where vanilla tends to fluctuate a lot in the creation phase:

 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  1      0 678716   5792 303380    0    0     0 74064  565    50  1 11 52 36
 1  0      0 662488   5864 319396    0    0     4   352  302   329  0  2 47 51
 0  1      0 599312   5924 381468    0    0     0 78164  516    55  0  9 51 40
 0  1      0 519952   6008 459516    0    0     4 78156  622    56  1 11 52 37
 1  1      0 436640   6092 541632    0    0     0 82244  622    54  0 11 48 41
 0  1      0 436640   6092 541660    0    0     0     8  152    39  0  0 51 49
 0  1      0 332224   6200 644252    0    0     4 102800  728    46  1 13 49 36
 1  0      0 274492   6260 701056    0    0     4 12328  459    49  0  7 50 43
 0  1      0 211220   6324 763356    0    0     0 106940  515    37  1 10 51 39
 1  0      0 160412   6376 813468    0    0     0  8224  415    43  0  6 49 45
 1  1      0  85980   6452 886556    0    0     4 113516  575    39  1 11 54 34
 0  2      0  85968   6452 886620    0    0     0  1640  158   211  0  0 46 54

So apart from seemingly behaving better for buffered writeout, this also
allows us to potentially have more than one bdi thread flushing out data.
This may be useful for NUMA type setups.

A 10 disk test with btrfs performs 26% faster with per-bdi flushing. Other
tests pending.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/buffer.c                 |    2 +-
 fs/fs-writeback.c           |  313 ++++++++++++++++++++++++++-----------------
 fs/sync.c                   |    2 +-
 include/linux/backing-dev.h |   29 ++++
 include/linux/fs.h          |    3 +-
 include/linux/writeback.h   |    2 +-
 mm/backing-dev.c            |  189 +++++++++++++++++++++++++-
 mm/page-writeback.c         |  140 +------------------
 mm/vmscan.c                 |    2 +-
 9 files changed, 415 insertions(+), 267 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index aed2977..14f0802 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -281,7 +281,7 @@ static void free_more_memory(void)
 	struct zone *zone;
 	int nid;
 
-	wakeup_pdflush(1024);
+	wakeup_flusher_threads(1024);
 	yield();
 
 	for_each_online_node(nid) {
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1137408..5d99b12 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -19,6 +19,8 @@
 #include <linux/sched.h>
 #include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/writeback.h>
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
@@ -61,10 +63,190 @@ int writeback_in_progress(struct backing_dev_info *bdi)
  */
 static void writeback_release(struct backing_dev_info *bdi)
 {
-	BUG_ON(!writeback_in_progress(bdi));
+	WARN_ON_ONCE(!writeback_in_progress(bdi));
+	bdi->wb_arg.nr_pages = 0;
+	bdi->wb_arg.sb = NULL;
 	clear_bit(BDI_pdflush, &bdi->state);
 }
 
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+			 long nr_pages, enum writeback_sync_modes sync_mode)
+{
+	/*
+	 * This only happens the first time someone kicks this bdi, so put
+	 * it out-of-line.
+	 */
+	if (unlikely(!bdi->task)) {
+		bdi_add_default_flusher_task(bdi);
+		return 1;
+	}
+
+	if (writeback_acquire(bdi)) {
+		bdi->wb_arg.nr_pages = nr_pages;
+		bdi->wb_arg.sb = sb;
+		bdi->wb_arg.sync_mode = sync_mode;
+		/*
+		 * make above store seen before the task is woken
+		 */
+		smp_mb();
+		wake_up(&bdi->wait);
+	}
+
+	return 0;
+}
+
+/*
+ * The maximum number of pages to writeout in a single bdi flush/kupdate
+ * operation.  We do this so we don't hold I_SYNC against an inode for
+ * enormous amounts of time, which would block a userspace task which has
+ * been forced to throttle against that inode.  Also, the code reevaluates
+ * the dirty each time it has written this many pages.
+ */
+#define MAX_WRITEBACK_PAGES     1024
+
+/*
+ * Periodic writeback of "old" data.
+ *
+ * Define "old": the first time one of an inode's pages is dirtied, we mark the
+ * dirtying-time in the inode's address_space.  So this periodic writeback code
+ * just walks the superblock inode list, writing back any inodes which are
+ * older than a specific point in time.
+ *
+ * Try to run once per dirty_writeback_interval.  But if a writeback event
+ * takes longer than a dirty_writeback_interval interval, then leave a
+ * one-second gap.
+ *
+ * older_than_this takes precedence over nr_to_write.  So we'll only write back
+ * all dirty pages if they are all attached to "old" mappings.
+ */
+static void bdi_kupdated(struct backing_dev_info *bdi)
+{
+	unsigned long oldest_jif;
+	long nr_to_write;
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= &oldest_jif,
+		.nr_to_write		= 0,
+		.for_kupdate		= 1,
+		.range_cyclic		= 1,
+	};
+
+	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
+
+	nr_to_write = global_page_state(NR_FILE_DIRTY) +
+			global_page_state(NR_UNSTABLE_NFS) +
+			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
+
+	while (nr_to_write > 0) {
+		wbc.more_io = 0;
+		wbc.encountered_congestion = 0;
+		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		generic_sync_bdi_inodes(NULL, &wbc);
+		if (wbc.nr_to_write > 0)
+			break;	/* All the old data is written */
+		nr_to_write -= MAX_WRITEBACK_PAGES;
+	}
+}
+
+static inline bool over_bground_thresh(void)
+{
+	unsigned long background_thresh, dirty_thresh;
+
+	get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
+
+	return (global_page_state(NR_FILE_DIRTY) +
+		global_page_state(NR_UNSTABLE_NFS) >= background_thresh);
+}
+
+static void bdi_pdflush(struct backing_dev_info *bdi)
+{
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= bdi->wb_arg.sync_mode,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+	};
+	long nr_pages = bdi->wb_arg.nr_pages;
+
+	for (;;) {
+		if (wbc.sync_mode == WB_SYNC_NONE && nr_pages <= 0 &&
+		    !over_bground_thresh())
+			break;
+
+		wbc.more_io = 0;
+		wbc.encountered_congestion = 0;
+		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
+		wbc.pages_skipped = 0;
+		generic_sync_bdi_inodes(bdi->wb_arg.sb, &wbc);
+		nr_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
+		/*
+		 * If we ran out of stuff to write, bail unless more_io got set
+		 */
+		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
+			if (wbc.more_io)
+				continue;
+			break;
+		}
+	}
+}
+
+/*
+ * Handle writeback of dirty data for the device backed by this bdi. Also
+ * wakes up periodically and does kupdated style flushing.
+ */
+int bdi_writeback_task(struct backing_dev_info *bdi)
+{
+	while (!kthread_should_stop()) {
+		unsigned long wait_jiffies;
+		DEFINE_WAIT(wait);
+
+		prepare_to_wait(&bdi->wait, &wait, TASK_INTERRUPTIBLE);
+		wait_jiffies = msecs_to_jiffies(dirty_writeback_interval * 10);
+		schedule_timeout(wait_jiffies);
+		try_to_freeze();
+
+		/*
+		 * We get here in two cases:
+		 *
+		 *  schedule_timeout() returned because the dirty writeback
+		 *  interval has elapsed. If that happens, we will be able
+		 *  to acquire the writeback lock and will proceed to do
+		 *  kupdated style writeout.
+		 *
+		 *  Someone called bdi_start_writeback(), which will acquire
+		 *  the writeback lock. This means our writeback_acquire()
+		 *  below will fail and we call into bdi_pdflush() for
+		 *  pdflush style writeout.
+		 *
+		 */
+		if (writeback_acquire(bdi))
+			bdi_kupdated(bdi);
+		else
+			bdi_pdflush(bdi);
+
+		writeback_release(bdi);
+		finish_wait(&bdi->wait, &wait);
+	}
+
+	return 0;
+}
+
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc)
+{
+	struct backing_dev_info *bdi, *tmp;
+
+	mutex_lock(&bdi_lock);
+
+	list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+		if (!bdi_has_dirty_io(bdi))
+			continue;
+		bdi_start_writeback(bdi, sb, wbc->nr_to_write, wbc->sync_mode);
+	}
+
+	mutex_unlock(&bdi_lock);
+}
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -263,46 +445,6 @@ static void queue_io(struct backing_dev_info *bdi,
 	move_expired_inodes(&bdi->b_dirty, &bdi->b_io, older_than_this);
 }
 
-static int sb_on_inode_list(struct super_block *sb, struct list_head *list)
-{
-	struct inode *inode;
-	int ret = 0;
-
-	spin_lock(&inode_lock);
-	list_for_each_entry(inode, list, i_list) {
-		if (inode->i_sb == sb) {
-			ret = 1;
-			break;
-		}
-	}
-	spin_unlock(&inode_lock);
-	return ret;
-}
-
-int sb_has_dirty_inodes(struct super_block *sb)
-{
-	struct backing_dev_info *bdi;
-	int ret = 0;
-
-	/*
-	 * This is REALLY expensive right now, but it'll go away
-	 * when the bdi writeback is introduced
-	 */
-	mutex_lock(&bdi_lock);
-	list_for_each_entry(bdi, &bdi_list, bdi_list) {
-		if (sb_on_inode_list(sb, &bdi->b_dirty) ||
-		    sb_on_inode_list(sb, &bdi->b_io) ||
-		    sb_on_inode_list(sb, &bdi->b_more_io)) {
-			ret = 1;
-			break;
-		}
-	}
-	mutex_unlock(&bdi_lock);
-
-	return ret;
-}
-EXPORT_SYMBOL(sb_has_dirty_inodes);
-
 /*
  * Write a single inode's dirty pages and inode data out to disk.
  * If `wait' is set, wait on the writeout.
@@ -461,11 +603,11 @@ __writeback_single_inode(struct inode *inode, struct writeback_control *wbc)
 	return __sync_single_inode(inode, wbc);
 }
 
-static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
-				    struct writeback_control *wbc,
-				    struct super_block *sb,
-				    int is_blkdev_sb)
+void generic_sync_bdi_inodes(struct super_block *sb,
+			     struct writeback_control *wbc)
 {
+	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
+	struct backing_dev_info *bdi = wbc->bdi;
 	const unsigned long start = jiffies;	/* livelock avoidance */
 
 	spin_lock(&inode_lock);
@@ -516,13 +658,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 			continue;		/* Skip a congested blockdev */
 		}
 
-		if (wbc->bdi && bdi != wbc->bdi) {
-			if (!is_blkdev_sb)
-				break;		/* fs has the wrong queue */
-			requeue_io(inode);
-			continue;		/* blockdev has wrong queue */
-		}
-
 		/*
 		 * Was this inode dirtied after sync_sb_inodes was called?
 		 * This keeps sync from extra jobs and livelock.
@@ -530,16 +665,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 		if (inode_dirtied_after(inode, start))
 			break;
 
-		/* Is another pdflush already flushing this queue? */
-		if (current_is_pdflush() && !writeback_acquire(bdi))
-			break;
-
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);
-		if (current_is_pdflush())
-			writeback_release(bdi);
 		if (wbc->pages_skipped != pages_skipped) {
 			/*
 			 * writeback is not making progress due to locked
@@ -578,11 +707,6 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
  * a variety of queues, so all inodes are searched.  For other superblocks,
  * assume that all inodes are backed by the same queue.
  *
- * FIXME: this linear search could get expensive with many fileystems.  But
- * how to fix?  We need to go from an address_space to all inodes which share
- * a queue with that address_space.  (Easy: have a global "dirty superblocks"
- * list).
- *
  * The inodes to be written are parked on bdi->b_io.  They are moved back onto
  * bdi->b_dirty as they are selected for writing.  This way, none can be missed
  * on the writer throttling path, and we get decent balancing between many
@@ -591,13 +715,10 @@ static void generic_sync_bdi_inodes(struct backing_dev_info *bdi,
 void generic_sync_sb_inodes(struct super_block *sb,
 				struct writeback_control *wbc)
 {
-	const int is_blkdev_sb = sb_is_blkdev_sb(sb);
-	struct backing_dev_info *bdi;
-
-	mutex_lock(&bdi_lock);
-	list_for_each_entry(bdi, &bdi_list, bdi_list)
-		generic_sync_bdi_inodes(bdi, wbc, sb, is_blkdev_sb);
-	mutex_unlock(&bdi_lock);
+	if (wbc->bdi)
+		generic_sync_bdi_inodes(sb, wbc);
+	else
+		bdi_writeback_all(sb, wbc);
 
 	if (wbc->sync_mode == WB_SYNC_ALL) {
 		struct inode *inode, *old_inode = NULL;
@@ -653,58 +774,6 @@ static void sync_sb_inodes(struct super_block *sb,
 }
 
 /*
- * Start writeback of dirty pagecache data against all unlocked inodes.
- *
- * Note:
- * We don't need to grab a reference to superblock here. If it has non-empty
- * ->b_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->b_dirty/b_io/b_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
- * inode from superblock lists we are OK.
- *
- * If `older_than_this' is non-zero then only flush inodes which have a
- * flushtime older than *older_than_this.
- *
- * If `bdi' is non-zero then we will scan the first inode against each
- * superblock until we find the matching ones.  One group will be the dirty
- * inodes against a filesystem.  Then when we hit the dummy blockdev superblock,
- * sync_sb_inodes will seekout the blockdev which matches `bdi'.  Maybe not
- * super-efficient but we're about to do a ton of I/O...
- */
-void
-writeback_inodes(struct writeback_control *wbc)
-{
-	struct super_block *sb;
-
-	might_sleep();
-	spin_lock(&sb_lock);
-restart:
-	list_for_each_entry_reverse(sb, &super_blocks, s_list) {
-		if (sb_has_dirty_inodes(sb)) {
-			/* we're making our own get_super here */
-			sb->s_count++;
-			spin_unlock(&sb_lock);
-			/*
-			 * If we can't get the readlock, there's no sense in
-			 * waiting around, most of the time the FS is going to
-			 * be unmounted by the time it is released.
-			 */
-			if (down_read_trylock(&sb->s_umount)) {
-				if (sb->s_root)
-					sync_sb_inodes(sb, wbc);
-				up_read(&sb->s_umount);
-			}
-			spin_lock(&sb_lock);
-			if (__put_super_and_need_restart(sb))
-				goto restart;
-		}
-		if (wbc->nr_to_write <= 0)
-			break;
-	}
-	spin_unlock(&sb_lock);
-}
-
-/*
  * writeback and wait upon the filesystem's dirty inodes.  The caller will
  * do this in two passes - one to write, and one to wait.
  *
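
The fs-writeback.c changes above boil the per-bdi scan down to a single
pass over bdi->b_io, guarded by the `start' timestamp against livelock.
A condensed sketch of that loop (simplified: the inode_lock juggling,
pages_skipped accounting and congestion handling in the real code are
omitted):

	while (!list_empty(&bdi->b_io)) {
		struct inode *inode = list_entry(bdi->b_io.prev,
						 struct inode, i_list);

		/* livelock avoidance: skip inodes dirtied after we started */
		if (inode_dirtied_after(inode, start))
			break;

		__iget(inode);
		__writeback_single_inode(inode, wbc);
		iput(inode);

		if (wbc->nr_to_write <= 0)
			break;
	}

Inodes that could not be fully written are parked on b_more_io and
requeued onto b_io on a later pass, so nothing is lost between
invocations.
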
diff --git a/fs/sync.c b/fs/sync.c
index 7abc65f..3887f10 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -23,7 +23,7 @@
  */
 static void do_sync(unsigned long wait)
 {
-	wakeup_pdflush(0);
+	wakeup_flusher_threads(0);
 	sync_inodes(0);		/* All mappings, inodes and their blockdevs */
 	vfs_dq_sync(NULL);
 	sync_supers();		/* Write the superblocks */
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 8719c87..9f040a9 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -13,6 +13,7 @@
 #include <linux/proportions.h>
 #include <linux/kernel.h>
 #include <linux/fs.h>
+#include <linux/writeback.h>
 #include <asm/atomic.h>
 
 struct page;
@@ -24,6 +25,7 @@ struct dentry;
  */
 enum bdi_state {
 	BDI_pdflush,		/* A pdflush thread is working this device */
+	BDI_pending,		/* On its way to being activated */
 	BDI_async_congested,	/* The async (write) queue is getting full */
 	BDI_sync_congested,	/* The sync queue is getting full */
 	BDI_unused,		/* Available bits start here */
@@ -39,6 +41,12 @@ enum bdi_stat_item {
 
 #define BDI_STAT_BATCH (8*(1+ilog2(nr_cpu_ids)))
 
+struct bdi_writeback_arg {
+	unsigned long nr_pages;
+	struct super_block *sb;
+	enum writeback_sync_modes sync_mode;
+};
+
 struct backing_dev_info {
 	struct list_head bdi_list;
 
@@ -60,6 +68,9 @@ struct backing_dev_info {
 
 	struct device *dev;
 
+	struct task_struct	*task;		/* writeback task */
+	wait_queue_head_t	wait;
+	struct bdi_writeback_arg wb_arg;	/* protected by BDI_pdflush */
 	struct list_head	b_dirty;	/* dirty inodes */
 	struct list_head	b_io;		/* parked for writeback */
 	struct list_head	b_more_io;	/* parked for more writeback */
@@ -77,10 +88,22 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...);
 int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev);
 void bdi_unregister(struct backing_dev_info *bdi);
+int bdi_start_writeback(struct backing_dev_info *bdi, struct super_block *sb,
+			 long nr_pages, enum writeback_sync_modes sync_mode);
+int bdi_writeback_task(struct backing_dev_info *bdi);
+void bdi_writeback_all(struct super_block *sb, struct writeback_control *wbc);
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi);
 
 extern struct mutex bdi_lock;
 extern struct list_head bdi_list;
 
+static inline int bdi_has_dirty_io(struct backing_dev_info *bdi)
+{
+	return !list_empty(&bdi->b_dirty) ||
+	       !list_empty(&bdi->b_io) ||
+	       !list_empty(&bdi->b_more_io);
+}
+
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
 		enum bdi_stat_item item, s64 amount)
 {
@@ -196,6 +219,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
 #define BDI_CAP_EXEC_MAP	0x00000040
 #define BDI_CAP_NO_ACCT_WB	0x00000080
 #define BDI_CAP_SWAP_BACKED	0x00000100
+#define BDI_CAP_FLUSH_FORKER	0x00000200
 
 #define BDI_CAP_VMFLAGS \
 	(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -265,6 +289,11 @@ static inline bool bdi_cap_swap_backed(struct backing_dev_info *bdi)
 	return bdi->capabilities & BDI_CAP_SWAP_BACKED;
 }
 
+static inline bool bdi_cap_flush_forker(struct backing_dev_info *bdi)
+{
+	return bdi->capabilities & BDI_CAP_FLUSH_FORKER;
+}
+
 static inline bool mapping_cap_writeback_dirty(struct address_space *mapping)
 {
 	return bdi_cap_writeback_dirty(mapping->backing_dev_info);
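
Two of the new helpers above are meant for exactly this kind of
opportunistic check; note that bdi_has_dirty_io() inspects the three
lists without any locking, so its answer is a hint rather than a
guarantee. A hedged usage sketch (maybe_kick_flusher() is a
hypothetical caller, not part of the patch; the bdi_start_writeback()
signature is taken from the prototypes above):

	static void maybe_kick_flusher(struct backing_dev_info *bdi)
	{
		/* only poke a real flusher thread, never the forker itself */
		if (bdi_has_dirty_io(bdi) && !bdi_cap_flush_forker(bdi))
			bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
	}
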
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6b475d4..ecdc544 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2063,6 +2063,8 @@ extern int invalidate_inode_pages2_range(struct address_space *mapping,
 					 pgoff_t start, pgoff_t end);
 extern void generic_sync_sb_inodes(struct super_block *sb,
 				struct writeback_control *wbc);
+extern void generic_sync_bdi_inodes(struct super_block *sb,
+				struct writeback_control *);
 extern int write_inode_now(struct inode *, int);
 extern int filemap_fdatawrite(struct address_space *);
 extern int filemap_flush(struct address_space *);
@@ -2180,7 +2182,6 @@ extern int bdev_read_only(struct block_device *);
 extern int set_blocksize(struct block_device *, int);
 extern int sb_set_blocksize(struct super_block *, int);
 extern int sb_min_blocksize(struct super_block *, int);
-extern int sb_has_dirty_inodes(struct super_block *);
 
 extern int generic_file_mmap(struct file *, struct vm_area_struct *);
 extern int generic_file_readonly_mmap(struct file *, struct vm_area_struct *);
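
generic_sync_bdi_inodes() is now part of the public interface; the bdi
to flush travels in wbc->bdi, and judging by the patch's own
bdi_flush_io() a NULL super_block means "don't filter on the super
block at all". A minimal calling sketch under that assumption:

	struct writeback_control wbc = {
		.bdi		 = bdi,			/* device to flush */
		.sync_mode	 = WB_SYNC_NONE,
		.older_than_this = NULL,
		.range_cyclic	 = 1,
		.nr_to_write	 = 1024,
	};

	generic_sync_bdi_inodes(NULL, &wbc);	/* NULL sb: any super block */
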
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 9344547..a8e9f78 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -99,7 +99,7 @@ static inline void inode_sync_wait(struct inode *inode)
 /*
  * mm/page-writeback.c
  */
-int wakeup_pdflush(long nr_pages);
+void wakeup_flusher_threads(long nr_pages);
 void laptop_io_completion(void);
 void laptop_sync_completion(void);
 void throttle_vm_writeout(gfp_t gfp_mask);
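
Note that the old int return of wakeup_pdflush() ("0 if a pdflush
thread was dispatched, -1 if all were busy") has no equivalent here:
per-bdi threads always accept the work, so the new function returns
void. A hypothetical out-of-tree caller would migrate along these
lines (a sketch, shown diff-style):

	-	if (wakeup_pdflush(nr_pages) < 0)
	-		congestion_wait(WRITE, HZ/10);	/* all pdflushers busy */
	+	wakeup_flusher_threads(nr_pages);	/* always queues work */
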
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index de0bbfe..0df8079 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -1,8 +1,11 @@
 
 #include <linux/wait.h>
 #include <linux/backing-dev.h>
+#include <linux/kthread.h>
+#include <linux/freezer.h>
 #include <linux/fs.h>
 #include <linux/pagemap.h>
+#include <linux/mm.h>
 #include <linux/sched.h>
 #include <linux/module.h>
 #include <linux/writeback.h>
@@ -16,7 +19,7 @@ EXPORT_SYMBOL(default_unplug_io_fn);
 struct backing_dev_info default_backing_dev_info = {
 	.ra_pages	= VM_MAX_READAHEAD * 1024 / PAGE_CACHE_SIZE,
 	.state		= 0,
-	.capabilities	= BDI_CAP_MAP_COPY,
+	.capabilities	= BDI_CAP_MAP_COPY | BDI_CAP_FLUSH_FORKER,
 	.unplug_io_fn	= default_unplug_io_fn,
 };
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
@@ -24,6 +27,7 @@ EXPORT_SYMBOL_GPL(default_backing_dev_info);
 static struct class *bdi_class;
 DEFINE_MUTEX(bdi_lock);
 LIST_HEAD(bdi_list);
+LIST_HEAD(bdi_pending_list);
 
 #ifdef CONFIG_DEBUG_FS
 #include <linux/debugfs.h>
@@ -195,6 +199,143 @@ static int __init default_bdi_init(void)
 }
 subsys_initcall(default_bdi_init);
 
+static int bdi_start_fn(void *ptr)
+{
+	struct backing_dev_info *bdi = ptr;
+	struct task_struct *tsk = current;
+
+	/*
+	 * Add us to the active bdi_list
+	 */
+	mutex_lock(&bdi_lock);
+	list_add(&bdi->bdi_list, &bdi_list);
+	mutex_unlock(&bdi_lock);
+
+	tsk->flags |= PF_FLUSHER | PF_SWAPWRITE;
+	set_freezable();
+
+	/*
+	 * Our parent may run at a different priority, just set us to normal
+	 */
+	set_user_nice(tsk, 0);
+
+	/*
+	 * Clear pending bit and wakeup anybody waiting to tear us down
+	 */
+	clear_bit(BDI_pending, &bdi->state);
+	smp_mb__after_clear_bit();
+	wake_up_bit(&bdi->state, BDI_pending);
+
+	return bdi_writeback_task(bdi);
+}
+
+static void bdi_flush_io(struct backing_dev_info *bdi)
+{
+	struct writeback_control wbc = {
+		.bdi			= bdi,
+		.sync_mode		= WB_SYNC_NONE,
+		.older_than_this	= NULL,
+		.range_cyclic		= 1,
+		.nr_to_write		= 1024,
+	};
+
+	generic_sync_bdi_inodes(NULL, &wbc);
+}
+
+static int bdi_forker_task(void *ptr)
+{
+	struct backing_dev_info *me = ptr;
+	DEFINE_WAIT(wait);
+
+	for (;;) {
+		struct backing_dev_info *bdi, *tmp;
+
+		/*
+		 * Do this periodically, like kupdated() did before.
+		 */
+		sync_supers();
+
+		/*
+		 * Temporary measure, we want to make sure we don't see
+		 * dirty data on the default backing_dev_info
+		 */
+		if (bdi_has_dirty_io(me))
+			bdi_flush_io(me);
+
+		prepare_to_wait(&me->wait, &wait, TASK_INTERRUPTIBLE);
+
+		mutex_lock(&bdi_lock);
+
+		/*
+		 * Check if any existing bdi's have dirty data without
+		 * a thread registered. If so, set that up.
+		 */
+		list_for_each_entry_safe(bdi, tmp, &bdi_list, bdi_list) {
+			if (bdi->task || !bdi_has_dirty_io(bdi))
+				continue;
+
+			bdi_add_default_flusher_task(bdi);
+		}
+
+		if (list_empty(&bdi_pending_list)) {
+			unsigned long wait;
+
+			mutex_unlock(&bdi_lock);
+			wait = msecs_to_jiffies(dirty_writeback_interval * 10);
+			schedule_timeout(wait);
+			try_to_freeze();
+			continue;
+		}
+
+		/*
+		 * This is our real job - check for pending entries in
+		 * bdi_pending_list, and create the tasks that got added
+		 */
+		bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
+				 bdi_list);
+		list_del_init(&bdi->bdi_list);
+		mutex_unlock(&bdi_lock);
+
+		BUG_ON(bdi->task);
+
+		bdi->task = kthread_run(bdi_start_fn, bdi, "bdi-%s",
+					dev_name(bdi->dev));
+		/*
+		 * If task creation fails, then readd the bdi to
+		 * the pending list and force writeout of the bdi
+		 * from this forker thread. That will free some memory
+		 * and we can try again.
+		 */
+		if (!bdi->task) {
+			/*
+			 * Add this 'bdi' to the back, so we get
+			 * a chance to flush other bdi's to free
+			 * memory.
+			 */
+			mutex_lock(&bdi_lock);
+			list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+			mutex_unlock(&bdi_lock);
+
+			bdi_flush_io(bdi);
+		}
+	}
+
+	finish_wait(&me->wait, &wait);
+	return 0;
+}
+
+void bdi_add_default_flusher_task(struct backing_dev_info *bdi)
+{
+	if (test_and_set_bit(BDI_pending, &bdi->state))
+		return;
+
+	mutex_lock(&bdi_lock);
+	list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+	mutex_unlock(&bdi_lock);
+
+	wake_up(&default_backing_dev_info.wait);
+}
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...)
 {
@@ -218,8 +359,25 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 	mutex_unlock(&bdi_lock);
 
 	bdi->dev = dev;
-	bdi_debug_register(bdi, dev_name(dev));
 
+	/*
+	 * Just start the forker thread for our default backing_dev_info,
+	 * and add other bdi's to the list. They will get a thread created
+	 * on-demand when they need it.
+	 */
+	if (bdi_cap_flush_forker(bdi)) {
+		bdi->task = kthread_run(bdi_forker_task, bdi, "bdi-%s",
+						dev_name(dev));
+		if (!bdi->task) {
+			mutex_lock(&bdi_lock);
+			list_del(&bdi->bdi_list);
+			mutex_unlock(&bdi_lock);
+			ret = -ENOMEM;
+			goto exit;
+		}
+	}
+
+	bdi_debug_register(bdi, dev_name(dev));
 exit:
 	return ret;
 }
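
One caveat on the kthread_run() call sites above (this one and the one
in bdi_forker_task()): kthread_run() reports failure with an
ERR_PTR()-encoded pointer, never NULL, so an `if (!bdi->task)' test
cannot actually see a failed thread creation. A more defensive version
of the forker's fallback path might look like this (a sketch, not the
posted code):

	struct task_struct *task;

	task = kthread_run(bdi_start_fn, bdi, "bdi-%s", dev_name(bdi->dev));
	if (IS_ERR(task)) {
		/*
		 * Creation failed: leave no ERR_PTR behind, requeue the
		 * bdi and write out from the forker thread itself, which
		 * may free enough memory for the next attempt.
		 */
		bdi->task = NULL;
		mutex_lock(&bdi_lock);
		list_add_tail(&bdi->bdi_list, &bdi_pending_list);
		mutex_unlock(&bdi_lock);
		bdi_flush_io(bdi);
	} else
		bdi->task = task;
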
@@ -231,8 +389,19 @@ int bdi_register_dev(struct backing_dev_info *bdi, dev_t dev)
 }
 EXPORT_SYMBOL(bdi_register_dev);
 
-static void bdi_remove_from_list(struct backing_dev_info *bdi)
+static int sched_wait(void *word)
+{
+	schedule();
+	return 0;
+}
+
+static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 {
+	/*
+	 * If setup is pending, wait for that to complete first
+	 */
+	wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
+
 	mutex_lock(&bdi_lock);
 	list_del(&bdi->bdi_list);
 	mutex_unlock(&bdi_lock);
@@ -241,7 +410,13 @@ static void bdi_remove_from_list(struct backing_dev_info *bdi)
 void bdi_unregister(struct backing_dev_info *bdi)
 {
 	if (bdi->dev) {
-		bdi_remove_from_list(bdi);
+		if (!bdi_cap_flush_forker(bdi)) {
+			bdi_wb_shutdown(bdi);
+			if (bdi->task) {
+				kthread_stop(bdi->task);
+				bdi->task = NULL;
+			}
+		}
 		bdi_debug_unregister(bdi);
 		device_unregister(bdi->dev);
 		bdi->dev = NULL;
@@ -251,14 +426,14 @@ EXPORT_SYMBOL(bdi_unregister);
 
 int bdi_init(struct backing_dev_info *bdi)
 {
-	int i;
-	int err;
+	int i, err;
 
 	bdi->dev = NULL;
 
 	bdi->min_ratio = 0;
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
+	init_waitqueue_head(&bdi->wait);
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->b_io);
 	INIT_LIST_HEAD(&bdi->b_dirty);
@@ -277,8 +452,6 @@ int bdi_init(struct backing_dev_info *bdi)
 err:
 		while (i--)
 			percpu_counter_destroy(&bdi->bdi_stat[i]);
-
-		bdi_remove_from_list(bdi);
 	}
 
 	return err;
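
The BDI_pending bit above implements a small, self-contained handshake
across three functions: bdi_add_default_flusher_task() claims the bdi
(test_and_set_bit() also collapses duplicate requests), the freshly
started thread publishes "setup done", and bdi_wb_shutdown() waits so
that teardown never races with a half-constructed thread. Reduced to
the bare pattern (condensed from the hunks above):

	/* requester: claim the bdi; a concurrent caller bails out */
	if (test_and_set_bit(BDI_pending, &bdi->state))
		return;

	/* new flusher thread: publish completion */
	clear_bit(BDI_pending, &bdi->state);
	smp_mb__after_clear_bit();	/* order the clear before the wakeup */
	wake_up_bit(&bdi->state, BDI_pending);

	/* teardown: wait for any in-flight setup to finish */
	wait_on_bit(&bdi->state, BDI_pending, sched_wait, TASK_UNINTERRUPTIBLE);
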
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7c44314..46c62b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -36,15 +36,6 @@
 #include <linux/pagevec.h>
 
 /*
- * The maximum number of pages to writeout in a single bdflush/kupdate
- * operation.  We do this so we don't hold I_SYNC against an inode for
- * enormous amounts of time, which would block a userspace task which has
- * been forced to throttle against that inode.  Also, the code reevaluates
- * the dirty each time it has written this many pages.
- */
-#define MAX_WRITEBACK_PAGES	1024
-
-/*
  * After a CPU has dirtied this many pages, balance_dirty_pages_ratelimited
  * will look to see if it needs to force writeback or throttling.
  */
@@ -117,8 +108,6 @@ EXPORT_SYMBOL(laptop_mode);
 /* End of sysctl-exported parameters */
 
 
-static void background_writeout(unsigned long _min_pages);
-
 /*
  * Scale the writeback cache size proportional to the relative writeout speeds.
  *
@@ -539,7 +528,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 		 * been flushed to permanent storage.
 		 */
 		if (bdi_nr_reclaimable) {
-			writeback_inodes(&wbc);
+			generic_sync_bdi_inodes(NULL, &wbc);
 			pages_written += write_chunk - wbc.nr_to_write;
 			get_dirty_limits(&background_thresh, &dirty_thresh,
 				       &bdi_thresh, bdi);
@@ -590,7 +579,7 @@ static void balance_dirty_pages(struct address_space *mapping)
 			(!laptop_mode && (global_page_state(NR_FILE_DIRTY)
 					  + global_page_state(NR_UNSTABLE_NFS)
 					  > background_thresh)))
-		pdflush_operation(background_writeout, 0);
+		bdi_start_writeback(bdi, NULL, 0, WB_SYNC_NONE);
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -675,152 +664,41 @@ void throttle_vm_writeout(gfp_t gfp_mask)
 }
 
 /*
- * writeback at least _min_pages, and keep writing until the amount of dirty
- * memory is less than the background threshold, or until we're all clean.
+ * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
+ * the whole world.
  */
-static void background_writeout(unsigned long _min_pages)
+void wakeup_flusher_threads(long nr_pages)
 {
-	long min_pages = _min_pages;
 	struct writeback_control wbc = {
-		.bdi		= NULL,
 		.sync_mode	= WB_SYNC_NONE,
 		.older_than_this = NULL,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
 		.range_cyclic	= 1,
 	};
 
-	for ( ; ; ) {
-		unsigned long background_thresh;
-		unsigned long dirty_thresh;
-
-		get_dirty_limits(&background_thresh, &dirty_thresh, NULL, NULL);
-		if (global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) < background_thresh
-				&& min_pages <= 0)
-			break;
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		wbc.pages_skipped = 0;
-		writeback_inodes(&wbc);
-		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
-			/* Wrote less than expected */
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(WRITE, HZ/10);
-			else
-				break;
-		}
-	}
-}
-
-/*
- * Start writeback of `nr_pages' pages.  If `nr_pages' is zero, write back
- * the whole world.  Returns 0 if a pdflush thread was dispatched.  Returns
- * -1 if all pdflush threads were busy.
- */
-int wakeup_pdflush(long nr_pages)
-{
 	if (nr_pages == 0)
 		nr_pages = global_page_state(NR_FILE_DIRTY) +
 				global_page_state(NR_UNSTABLE_NFS);
-	return pdflush_operation(background_writeout, nr_pages);
+	wbc.nr_to_write = nr_pages;
+	bdi_writeback_all(NULL, &wbc);
 }
 
-static void wb_timer_fn(unsigned long unused);
 static void laptop_timer_fn(unsigned long unused);
 
-static DEFINE_TIMER(wb_timer, wb_timer_fn, 0, 0);
 static DEFINE_TIMER(laptop_mode_wb_timer, laptop_timer_fn, 0, 0);
 
 /*
- * Periodic writeback of "old" data.
- *
- * Define "old": the first time one of an inode's pages is dirtied, we mark the
- * dirtying-time in the inode's address_space.  So this periodic writeback code
- * just walks the superblock inode list, writing back any inodes which are
- * older than a specific point in time.
- *
- * Try to run once per dirty_writeback_interval.  But if a writeback event
- * takes longer than a dirty_writeback_interval interval, then leave a
- * one-second gap.
- *
- * older_than_this takes precedence over nr_to_write.  So we'll only write back
- * all dirty pages if they are all attached to "old" mappings.
- */
-static void wb_kupdate(unsigned long arg)
-{
-	unsigned long oldest_jif;
-	unsigned long start_jif;
-	unsigned long next_jif;
-	long nr_to_write;
-	struct writeback_control wbc = {
-		.bdi		= NULL,
-		.sync_mode	= WB_SYNC_NONE,
-		.older_than_this = &oldest_jif,
-		.nr_to_write	= 0,
-		.nonblocking	= 1,
-		.for_kupdate	= 1,
-		.range_cyclic	= 1,
-	};
-
-	sync_supers();
-
-	oldest_jif = jiffies - msecs_to_jiffies(dirty_expire_interval * 10);
-	start_jif = jiffies;
-	next_jif = start_jif + msecs_to_jiffies(dirty_writeback_interval * 10);
-	nr_to_write = global_page_state(NR_FILE_DIRTY) +
-			global_page_state(NR_UNSTABLE_NFS) +
-			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
-	while (nr_to_write > 0) {
-		wbc.more_io = 0;
-		wbc.encountered_congestion = 0;
-		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
-		writeback_inodes(&wbc);
-		if (wbc.nr_to_write > 0) {
-			if (wbc.encountered_congestion || wbc.more_io)
-				congestion_wait(WRITE, HZ/10);
-			else
-				break;	/* All the old data is written */
-		}
-		nr_to_write -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
-	}
-	if (time_before(next_jif, jiffies + HZ))
-		next_jif = jiffies + HZ;
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, next_jif);
-}
-
-/*
  * sysctl handler for /proc/sys/vm/dirty_writeback_centisecs
  */
 int dirty_writeback_centisecs_handler(ctl_table *table, int write,
 	struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
 {
 	proc_dointvec(table, write, file, buffer, length, ppos);
-	if (dirty_writeback_interval)
-		mod_timer(&wb_timer, jiffies +
-			msecs_to_jiffies(dirty_writeback_interval * 10));
-	else
-		del_timer(&wb_timer);
 	return 0;
 }
 
-static void wb_timer_fn(unsigned long unused)
-{
-	if (pdflush_operation(wb_kupdate, 0) < 0)
-		mod_timer(&wb_timer, jiffies + HZ); /* delay 1 second */
-}
-
-static void laptop_flush(unsigned long unused)
-{
-	sys_sync();
-}
-
 static void laptop_timer_fn(unsigned long unused)
 {
-	pdflush_operation(laptop_flush, 0);
+	wakeup_flusher_threads(0);
 }
 
 /*
@@ -903,8 +781,6 @@ void __init page_writeback_init(void)
 {
 	int shift;
 
-	mod_timer(&wb_timer,
-		  jiffies + msecs_to_jiffies(dirty_writeback_interval * 10));
 	writeback_set_ratelimit();
 	register_cpu_notifier(&ratelimit_nb);
 
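wakeup_flusher_threads() keeps wakeup_pdflush()'s convention that
nr_pages == 0 means "everything currently dirty": the count is expanded
to NR_FILE_DIRTY plus NR_UNSTABLE_NFS before bdi_writeback_all() fans
the work out. The two call sites in this series use it accordingly:

	wakeup_flusher_threads(0);	/* sys_sync(): flush the world */

	/* vmscan: flush roughly as much as was scanned, or everything
	 * in laptop mode so the disk can spin down again sooner */
	wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
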
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5fa3eda..e37fd38 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1654,7 +1654,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 */
 		if (total_scanned > sc->swap_cluster_max +
 					sc->swap_cluster_max / 2) {
-			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
+			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
 			sc->may_writepage = 1;
 		}
 
-- 
1.6.3.rc0.1.gf800



End of thread (newest message: 2009-06-09 18:39 UTC)

Thread overview: 66+ messages
2009-05-28 11:46 [PATCH 0/11] Per-bdi writeback flusher threads v9 Jens Axboe
2009-05-28 11:46 ` [PATCH 01/11] ntfs: remove old debug check for dirty data in ntfs_put_super() Jens Axboe
2009-05-28 11:46 ` [PATCH 02/11] btrfs: properly register fs backing device Jens Axboe
2009-05-28 11:46 ` [PATCH 03/11] writeback: move dirty inodes from super_block to backing_dev_info Jens Axboe
2009-05-28 11:46 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
2009-05-28 14:13   ` Artem Bityutskiy
2009-05-28 22:28     ` Jens Axboe
2009-05-28 11:46 ` [PATCH 05/11] writeback: get rid of pdflush completely Jens Axboe
2009-05-28 11:46 ` [PATCH 06/11] writeback: separate the flushing state/task from the bdi Jens Axboe
2009-05-28 11:46 ` [PATCH 07/11] writeback: support > 1 flusher thread per bdi Jens Axboe
2009-05-28 11:46 ` [PATCH 08/11] writeback: allow sleepy exit of default writeback task Jens Axboe
2009-05-28 11:46 ` [PATCH 09/11] writeback: add some debug inode list counters to bdi stats Jens Axboe
2009-05-28 11:46 ` [PATCH 10/11] writeback: add name to backing_dev_info Jens Axboe
2009-05-28 11:46 ` [PATCH 11/11] writeback: check for registered bdi in flusher add and inode dirty Jens Axboe
2009-05-28 13:56 ` [PATCH 0/11] Per-bdi writeback flusher threads v9 Peter Zijlstra
2009-05-28 22:28   ` Jens Axboe
2009-05-28 14:17 ` Artem Bityutskiy
2009-05-28 14:19   ` Artem Bityutskiy
2009-05-28 20:35     ` Peter Zijlstra
2009-05-28 22:27       ` Jens Axboe
2009-05-29 15:37       ` Artem Bityutskiy
2009-05-29 15:50         ` Jens Axboe
2009-05-29 16:02           ` Artem Bityutskiy
2009-05-29 17:07             ` Jens Axboe
2009-06-03  7:39               ` Artem Bityutskiy
2009-06-03  7:44                 ` Jens Axboe
2009-06-03  7:46                   ` Artem Bityutskiy
2009-06-03  7:50                     ` Jens Axboe
2009-06-03  7:54                       ` Artem Bityutskiy
2009-06-03  7:59                   ` Artem Bityutskiy
2009-06-03  8:07                     ` Jens Axboe
2009-05-28 14:41 ` Theodore Tso
2009-05-29 16:07 ` Artem Bityutskiy
2009-05-29 16:20   ` Artem Bityutskiy
2009-05-29 17:09     ` Jens Axboe
2009-06-03  8:11       ` Artem Bityutskiy
2009-05-29 17:08   ` Jens Axboe
2009-06-03 11:12 ` Artem Bityutskiy
2009-06-03 11:42   ` Jens Axboe
2009-06-04 15:20 ` Frederic Weisbecker
2009-06-04 19:07   ` Andrew Morton
2009-06-04 19:13     ` Frederic Weisbecker
2009-06-04 19:50       ` Jens Axboe
2009-06-04 20:10         ` Jens Axboe
2009-06-04 22:34           ` Frederic Weisbecker
2009-06-05 19:15             ` Jens Axboe
2009-06-05 21:14               ` Jan Kara
2009-06-06  0:18                 ` Chris Mason
2009-06-06  0:23                   ` Jan Kara
2009-06-06  1:06                     ` Frederic Weisbecker
2009-06-08  9:23                       ` Jens Axboe
2009-06-08 12:23                         ` Jan Kara
2009-06-08 12:28                           ` Jens Axboe
2009-06-08 13:01                             ` Jan Kara
2009-06-09 18:39                             ` Frederic Weisbecker
2009-06-06  1:00                 ` Frederic Weisbecker
2009-06-06  0:35               ` Frederic Weisbecker
2009-06-04 21:37         ` Frederic Weisbecker
2009-06-05  1:14   ` Zhang, Yanmin
2009-06-05 19:16     ` Jens Axboe
  -- strict thread matches above, loose matches on Subject: below --
2009-05-27  9:41 [PATCH 0/11] Per-bdi writeback flusher threads v8 Jens Axboe
2009-05-27  9:41 ` [PATCH 04/11] writeback: switch to per-bdi threads for flushing data Jens Axboe
2009-05-27 11:11   ` Peter Zijlstra
2009-05-27 11:24     ` Jens Axboe
2009-05-27 15:14   ` Jan Kara
2009-05-27 17:50     ` Jens Axboe
2009-05-28 14:45       ` Jan Kara
