All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7] Post merge per-bdi writeback patches v2
@ 2009-09-14  9:36 Jens Axboe
  2009-09-14  9:36 ` [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE Jens Axboe
                   ` (6 more replies)
  0 siblings, 7 replies; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust

Hi,

Since the writeback patches are now (at last!) in mainline, I renamed
this branch from writeback-postmerge to just plain writeback. The first
four patches are identical to the ones from friday, and then I added
three more patches. Two of them are cleanups to better separate the
WB_SYNC_NONE and WB_SYNC_ALL writeback paths, the latter is a fix for
when a bdi is destroyed whilst inodes are still attached. I don't see
a much better way to fix this at the moment, if inodes get requeued
at sync time, then it would complicate exit quite a lot. So just
move those entries to the default_backing_dev_info, which can handle
them at any time.

Patches are also available at:

  git://git.kernel.dk/linux-2.6-block.git writeback

 fs/btrfs/disk-io.c          |    1 
 fs/fs-writeback.c           |  181 ++++++++++++++----------------------
 fs/fuse/inode.c             |    2 
 fs/super.c                  |    6 +
 fs/sync.c                   |    9 +
 fs/ubifs/super.c            |    1 
 include/linux/backing-dev.h |    1 
 include/linux/fs.h          |    1 
 mm/backing-dev.c            |   88 +++++++++++++----
 mm/page-writeback.c         |    9 -
 10 files changed, 161 insertions(+), 138 deletions(-)

-- 
Jens Axboe



^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 12:34   ` Jan Kara
  2009-09-14  9:36 ` [PATCH 2/7] Assign bdi in super_block Jens Axboe
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

From: Christoph Hellwig <hch@infradead.org>

Since it's an opportunistic writeback and not a data integrity action,
don't punt to blocking writeback. Just wakeup the thread and it will
flush old data.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c |   46 ++++++++++++++--------------------------------
 1 files changed, 14 insertions(+), 32 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index da86ef5..1873fd0 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -75,13 +75,6 @@ static inline void bdi_work_init(struct bdi_work *work,
 	work->state = WS_USED;
 }
 
-static inline void bdi_work_init_on_stack(struct bdi_work *work,
-					  struct writeback_control *wbc)
-{
-	bdi_work_init(work, wbc);
-	work->state |= WS_ONSTACK;
-}
-
 /**
  * writeback_in_progress - determine whether there is writeback in progress
  * @bdi: the device's backing_dev_info structure.
@@ -207,34 +200,23 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
 
 void bdi_start_writeback(struct writeback_control *wbc)
 {
-	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
-	struct bdi_work work_stack, *work = NULL;
-
-	if (!must_wait)
-		work = bdi_alloc_work(wbc);
+	/*
+	 * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
+	 * bdi_queue_work() will wake up the thread and flush old data. This
+	 * should ensure some amount of progress in freeing memory.
+	 */
+	if (wbc->sync_mode != WB_SYNC_ALL) {
+		struct bdi_work *w = bdi_alloc_work(wbc);
 
-	if (!work) {
-		work = &work_stack;
-		bdi_work_init_on_stack(work, wbc);
-	}
+		bdi_queue_work(wbc->bdi, w);
+	} else {
+		struct bdi_work work;
 
-	bdi_queue_work(wbc->bdi, work);
+		bdi_work_init(&work, wbc);
+		work.state |= WS_ONSTACK;
 
-	/*
-	 * If the sync mode is WB_SYNC_ALL, block waiting for the work to
-	 * complete. If not, we only need to wait for the work to be started,
-	 * if we allocated it on-stack. We use the same mechanism, if the
-	 * wait bit is set in the bdi_work struct, then threads will not
-	 * clear pending until after they are done.
-	 *
-	 * Note that work == &work_stack if must_wait is true, so we don't
-	 * need to do call_rcu() here ever, since the completion path will
-	 * have done that for us.
-	 */
-	if (must_wait || work == &work_stack) {
-		bdi_wait_on_work_clear(work);
-		if (work != &work_stack)
-			call_rcu(&work->rcu_head, bdi_work_free);
+		bdi_queue_work(wbc->bdi, &work);
+		bdi_wait_on_work_clear(&work);
 	}
 }
 
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/7] Assign bdi in super_block
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
  2009-09-14  9:36 ` [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 13:02   ` Jan Kara
  2009-09-14  9:36 ` [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout Jens Axboe
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

We do this automatically in get_sb_bdev() from the set_bdev_super()
callback. Filesystems that have their own private backing_dev_info
must assign that in ->fill_super().

Note that ->s_bdi assignment is required for proper writeback!

Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/btrfs/disk-io.c |    1 +
 fs/fuse/inode.c    |    2 ++
 fs/super.c         |    6 ++++++
 fs/sync.c          |    9 ++++++++-
 fs/ubifs/super.c   |    1 +
 include/linux/fs.h |    1 +
 6 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 15831d5..8b81927 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1600,6 +1600,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
 
 	sb->s_blocksize = 4096;
 	sb->s_blocksize_bits = blksize_bits(4096);
+	sb->s_bdi = &fs_info->bdi;
 
 	/*
 	 * we set the i_size on the btree inode to the max possible int.
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4567db6..e5dbecd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -894,6 +894,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
 	if (err)
 		goto err_put_conn;
 
+	sb->s_bdi = &fc->bdi;
+
 	/* Handle umasking inside the fuse code */
 	if (sb->s_flags & MS_POSIXACL)
 		fc->dont_mask = 1;
diff --git a/fs/super.c b/fs/super.c
index 9cda337..b03fea8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -707,6 +707,12 @@ static int set_bdev_super(struct super_block *s, void *data)
 {
 	s->s_bdev = data;
 	s->s_dev = s->s_bdev->bd_dev;
+
+	/*
+	 * We set the bdi here to the queue backing, file systems can
+	 * overwrite this in ->fill_super()
+	 */
+	s->s_bdi = &bdev_get_queue(s->s_bdev)->backing_dev_info;
 	return 0;
 }
 
diff --git a/fs/sync.c b/fs/sync.c
index 103cc7f..8582c34 100644
--- a/fs/sync.c
+++ b/fs/sync.c
@@ -27,6 +27,13 @@
  */
 static int __sync_filesystem(struct super_block *sb, int wait)
 {
+	/*
+	 * This should be safe, as we require bdi backing to actually
+	 * write out data in the first place
+	 */
+	if (!sb->s_bdi)
+		return 0;
+
 	/* Avoid doing twice syncing and cache pruning for quota sync */
 	if (!wait) {
 		writeout_quota_sb(sb, -1);
@@ -101,7 +108,7 @@ restart:
 		spin_unlock(&sb_lock);
 
 		down_read(&sb->s_umount);
-		if (!(sb->s_flags & MS_RDONLY) && sb->s_root)
+		if (!(sb->s_flags & MS_RDONLY) && sb->s_root && sb->s_bdi)
 			__sync_filesystem(sb, wait);
 		up_read(&sb->s_umount);
 
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 51763aa..c4af069 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -1980,6 +1980,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
 	if (err)
 		goto out_bdi;
 
+	sb->s_bdi = &c->bdi;
 	sb->s_fs_info = c;
 	sb->s_magic = UBIFS_SUPER_MAGIC;
 	sb->s_blocksize = UBIFS_BLOCK_SIZE;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index a79f483..f998adc 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1343,6 +1343,7 @@ struct super_block {
 	int			s_nr_dentry_unused;	/* # of dentry on lru */
 
 	struct block_device	*s_bdev;
+	struct backing_dev_info *s_bdi;
 	struct mtd_info		*s_mtd;
 	struct list_head	s_instances;
 	struct quota_info	s_dquot;	/* Diskquota specific options */
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
  2009-09-14  9:36 ` [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE Jens Axboe
  2009-09-14  9:36 ` [PATCH 2/7] Assign bdi in super_block Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 13:12   ` Jan Kara
  2009-09-14  9:36 ` [PATCH 4/7] writeback: use RCU to protect bdi_list Jens Axboe
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

Data integrity writeback must use bdi_start_writeback() and ensure
that wbc->sb and wbc->bdi are set.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c |   69 ++++++++++------------------------------------------
 1 files changed, 14 insertions(+), 55 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 1873fd0..5d4bd1c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -187,7 +187,8 @@ static void bdi_wait_on_work_clear(struct bdi_work *work)
 		    TASK_UNINTERRUPTIBLE);
 }
 
-static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
+static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
+				 struct writeback_control *wbc)
 {
 	struct bdi_work *work;
 
@@ -195,7 +196,7 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
 	if (work)
 		bdi_work_init(work, wbc);
 
-	return work;
+	bdi_queue_work(bdi, work);
 }
 
 void bdi_start_writeback(struct writeback_control *wbc)
@@ -205,11 +206,9 @@ void bdi_start_writeback(struct writeback_control *wbc)
 	 * bdi_queue_work() will wake up the thread and flush old data. This
 	 * should ensure some amount of progress in freeing memory.
 	 */
-	if (wbc->sync_mode != WB_SYNC_ALL) {
-		struct bdi_work *w = bdi_alloc_work(wbc);
-
-		bdi_queue_work(wbc->bdi, w);
-	} else {
+	if (wbc->sync_mode != WB_SYNC_ALL)
+		bdi_alloc_queue_work(wbc->bdi, wbc);
+	else {
 		struct bdi_work work;
 
 		bdi_work_init(&work, wbc);
@@ -840,67 +839,26 @@ int bdi_writeback_task(struct bdi_writeback *wb)
 }
 
 /*
- * Schedule writeback for all backing devices. Expensive! If this is a data
- * integrity operation, writeback will be complete when this returns. If
- * we are simply called for WB_SYNC_NONE, then writeback will merely be
- * scheduled to run.
+ * Schedule writeback for all backing devices. Can only be used for
+ * WB_SYNC_NONE writeback, WB_SYNC_ALL should use bdi_start_writeback()
+ * and pass in the superblock.
  */
 static void bdi_writeback_all(struct writeback_control *wbc)
 {
-	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
 	struct backing_dev_info *bdi;
-	struct bdi_work *work;
-	LIST_HEAD(list);
 
-restart:
+	WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
+
 	spin_lock(&bdi_lock);
 
 	list_for_each_entry(bdi, &bdi_list, bdi_list) {
-		struct bdi_work *work;
-
 		if (!bdi_has_dirty_io(bdi))
 			continue;
 
-		/*
-		 * If work allocation fails, do the writes inline. We drop
-		 * the lock and restart the list writeout. This should be OK,
-		 * since this happens rarely and because the writeout should
-		 * eventually make more free memory available.
-		 */
-		work = bdi_alloc_work(wbc);
-		if (!work) {
-			struct writeback_control __wbc;
-
-			/*
-			 * Not a data integrity writeout, just continue
-			 */
-			if (!must_wait)
-				continue;
-
-			spin_unlock(&bdi_lock);
-			__wbc = *wbc;
-			__wbc.bdi = bdi;
-			writeback_inodes_wbc(&__wbc);
-			goto restart;
-		}
-		if (must_wait)
-			list_add_tail(&work->wait_list, &list);
-
-		bdi_queue_work(bdi, work);
+		bdi_alloc_queue_work(bdi, wbc);
 	}
 
 	spin_unlock(&bdi_lock);
-
-	/*
-	 * If this is for WB_SYNC_ALL, wait for pending work to complete
-	 * before returning.
-	 */
-	while (!list_empty(&list)) {
-		work = list_entry(list.next, struct bdi_work, wait_list);
-		list_del(&work->wait_list);
-		bdi_wait_on_work_clear(work);
-		call_rcu(&work->rcu_head, bdi_work_free);
-	}
 }
 
 /*
@@ -1157,6 +1115,7 @@ long sync_inodes_sb(struct super_block *sb)
 {
 	struct writeback_control wbc = {
 		.sb		= sb,
+		.bdi		= sb->s_bdi,
 		.sync_mode	= WB_SYNC_ALL,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
@@ -1164,7 +1123,7 @@ long sync_inodes_sb(struct super_block *sb)
 	long nr_to_write = LONG_MAX; /* doesn't actually matter */
 
 	wbc.nr_to_write = nr_to_write;
-	bdi_writeback_all(&wbc);
+	bdi_start_writeback(&wbc);
 	wait_sb_inodes(&wbc);
 	return nr_to_write - wbc.nr_to_write;
 }
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/7] writeback: use RCU to protect bdi_list
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
                   ` (2 preceding siblings ...)
  2009-09-14  9:36 ` [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 11:10   ` Minchan Kim
  2009-09-14  9:36 ` [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work() Jens Axboe
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

Now that bdi_writeback_all() no longer handles integrity writeback,
it doesn't have to block anymore. This means that we can switch
bdi_list reader side protection to RCU.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c           |    4 +-
 include/linux/backing-dev.h |    1 +
 mm/backing-dev.c            |   76 +++++++++++++++++++++++++++++++------------
 mm/page-writeback.c         |    8 ++--
 4 files changed, 62 insertions(+), 27 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5d4bd1c..d7592c7 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
 
 	WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
 
-	spin_lock(&bdi_lock);
+	rcu_read_lock();
 
 	list_for_each_entry(bdi, &bdi_list, bdi_list) {
 		if (!bdi_has_dirty_io(bdi))
@@ -858,7 +858,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
 		bdi_alloc_queue_work(bdi, wbc);
 	}
 
-	spin_unlock(&bdi_lock);
+	rcu_read_unlock();
 }
 
 /*
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index f169bcb..859e797 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -59,6 +59,7 @@ struct bdi_writeback {
 
 struct backing_dev_info {
 	struct list_head bdi_list;
+	struct rcu_head rcu_head;
 	unsigned long ra_pages;	/* max readahead in PAGE_CACHE_SIZE units */
 	unsigned long state;	/* Always use atomic bitops on this */
 	unsigned int capabilities; /* Device capabilities */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca0da..fd93566 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -26,6 +26,12 @@ struct backing_dev_info default_backing_dev_info = {
 EXPORT_SYMBOL_GPL(default_backing_dev_info);
 
 static struct class *bdi_class;
+
+/*
+ * bdi_lock protects updates to bdi_list and bdi_pending_list, as well as
+ * reader side protection for bdi_pending_list. bdi_list has RCU reader side
+ * locking.
+ */
 DEFINE_SPINLOCK(bdi_lock);
 LIST_HEAD(bdi_list);
 LIST_HEAD(bdi_pending_list);
@@ -284,9 +290,9 @@ static int bdi_start_fn(void *ptr)
 	/*
 	 * Add us to the active bdi_list
 	 */
-	spin_lock(&bdi_lock);
-	list_add(&bdi->bdi_list, &bdi_list);
-	spin_unlock(&bdi_lock);
+	spin_lock_bh(&bdi_lock);
+	list_add_rcu(&bdi->bdi_list, &bdi_list);
+	spin_unlock_bh(&bdi_lock);
 
 	bdi_task_init(bdi, wb);
 
@@ -389,7 +395,7 @@ static int bdi_forker_task(void *ptr)
 		if (wb_has_dirty_io(me) || !list_empty(&me->bdi->work_list))
 			wb_do_writeback(me, 0);
 
-		spin_lock(&bdi_lock);
+		spin_lock_bh(&bdi_lock);
 
 		/*
 		 * Check if any existing bdi's have dirty data without
@@ -410,7 +416,7 @@ static int bdi_forker_task(void *ptr)
 		if (list_empty(&bdi_pending_list)) {
 			unsigned long wait;
 
-			spin_unlock(&bdi_lock);
+			spin_unlock_bh(&bdi_lock);
 			wait = msecs_to_jiffies(dirty_writeback_interval * 10);
 			schedule_timeout(wait);
 			try_to_freeze();
@@ -426,7 +432,7 @@ static int bdi_forker_task(void *ptr)
 		bdi = list_entry(bdi_pending_list.next, struct backing_dev_info,
 				 bdi_list);
 		list_del_init(&bdi->bdi_list);
-		spin_unlock(&bdi_lock);
+		spin_unlock_bh(&bdi_lock);
 
 		wb = &bdi->wb;
 		wb->task = kthread_run(bdi_start_fn, wb, "flush-%s",
@@ -445,9 +451,9 @@ static int bdi_forker_task(void *ptr)
 			 * a chance to flush other bdi's to free
 			 * memory.
 			 */
-			spin_lock(&bdi_lock);
+			spin_lock_bh(&bdi_lock);
 			list_add_tail(&bdi->bdi_list, &bdi_pending_list);
-			spin_unlock(&bdi_lock);
+			spin_unlock_bh(&bdi_lock);
 
 			bdi_flush_io(bdi);
 		}
@@ -456,6 +462,24 @@ static int bdi_forker_task(void *ptr)
 	return 0;
 }
 
+static void bdi_add_to_pending(struct rcu_head *head)
+{
+	struct backing_dev_info *bdi;
+
+	bdi = container_of(head, struct backing_dev_info, rcu_head);
+	INIT_LIST_HEAD(&bdi->bdi_list);
+
+	spin_lock(&bdi_lock);
+	list_add_tail(&bdi->bdi_list, &bdi_pending_list);
+	spin_unlock(&bdi_lock);
+
+	/*
+	 * We are now on the pending list, wake up bdi_forker_task()
+	 * to finish the job and add us back to the active bdi_list
+	 */
+	wake_up_process(default_backing_dev_info.wb.task);
+}
+
 /*
  * Add the default flusher task that gets created for any bdi
  * that has dirty data pending writeout
@@ -478,16 +502,29 @@ void static bdi_add_default_flusher_task(struct backing_dev_info *bdi)
 	 * waiting for previous additions to finish.
 	 */
 	if (!test_and_set_bit(BDI_pending, &bdi->state)) {
-		list_move_tail(&bdi->bdi_list, &bdi_pending_list);
+		list_del_rcu(&bdi->bdi_list);
 
 		/*
-		 * We are now on the pending list, wake up bdi_forker_task()
-		 * to finish the job and add us back to the active bdi_list
+		 * We must wait for the current RCU period to end before
+		 * moving to the pending list. So schedule that operation
+		 * from an RCU callback.
 		 */
-		wake_up_process(default_backing_dev_info.wb.task);
+		call_rcu(&bdi->rcu_head, bdi_add_to_pending);
 	}
 }
 
+/*
+ * Remove bdi from bdi_list, and ensure that it is no longer visible
+ */
+static void bdi_remove_from_list(struct backing_dev_info *bdi)
+{
+	spin_lock_bh(&bdi_lock);
+	list_del_rcu(&bdi->bdi_list);
+	spin_unlock_bh(&bdi_lock);
+
+	synchronize_rcu();
+}
+
 int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		const char *fmt, ...)
 {
@@ -506,9 +543,9 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 		goto exit;
 	}
 
-	spin_lock(&bdi_lock);
-	list_add_tail(&bdi->bdi_list, &bdi_list);
-	spin_unlock(&bdi_lock);
+	spin_lock_bh(&bdi_lock);
+	list_add_tail_rcu(&bdi->bdi_list, &bdi_list);
+	spin_unlock_bh(&bdi_lock);
 
 	bdi->dev = dev;
 
@@ -526,9 +563,7 @@ int bdi_register(struct backing_dev_info *bdi, struct device *parent,
 			wb->task = NULL;
 			ret = -ENOMEM;
 
-			spin_lock(&bdi_lock);
-			list_del(&bdi->bdi_list);
-			spin_unlock(&bdi_lock);
+			bdi_remove_from_list(bdi);
 			goto exit;
 		}
 	}
@@ -565,9 +600,7 @@ static void bdi_wb_shutdown(struct backing_dev_info *bdi)
 	/*
 	 * Make sure nobody finds us on the bdi_list anymore
 	 */
-	spin_lock(&bdi_lock);
-	list_del(&bdi->bdi_list);
-	spin_unlock(&bdi_lock);
+	bdi_remove_from_list(bdi);
 
 	/*
 	 * Finally, kill the kernel threads. We don't need to be RCU
@@ -599,6 +632,7 @@ int bdi_init(struct backing_dev_info *bdi)
 	bdi->max_ratio = 100;
 	bdi->max_prop_frac = PROP_FRAC_BASE;
 	spin_lock_init(&bdi->wb_lock);
+	INIT_RCU_HEAD(&bdi->rcu_head);
 	INIT_LIST_HEAD(&bdi->bdi_list);
 	INIT_LIST_HEAD(&bdi->wb_list);
 	INIT_LIST_HEAD(&bdi->work_list);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 25e7770..a5f0f76 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -315,7 +315,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 {
 	int ret = 0;
 
-	spin_lock(&bdi_lock);
+	spin_lock_bh(&bdi_lock);
 	if (min_ratio > bdi->max_ratio) {
 		ret = -EINVAL;
 	} else {
@@ -327,7 +327,7 @@ int bdi_set_min_ratio(struct backing_dev_info *bdi, unsigned int min_ratio)
 			ret = -EINVAL;
 		}
 	}
-	spin_unlock(&bdi_lock);
+	spin_unlock_bh(&bdi_lock);
 
 	return ret;
 }
@@ -339,14 +339,14 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned max_ratio)
 	if (max_ratio > 100)
 		return -EINVAL;
 
-	spin_lock(&bdi_lock);
+	spin_lock_bh(&bdi_lock);
 	if (bdi->min_ratio > max_ratio) {
 		ret = -EINVAL;
 	} else {
 		bdi->max_ratio = max_ratio;
 		bdi->max_prop_frac = (PROP_FRAC_BASE * max_ratio) / 100;
 	}
-	spin_unlock(&bdi_lock);
+	spin_unlock_bh(&bdi_lock);
 
 	return ret;
 }
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work()
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
                   ` (3 preceding siblings ...)
  2009-09-14  9:36 ` [PATCH 4/7] writeback: use RCU to protect bdi_list Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 13:13   ` Jan Kara
  2009-09-14  9:36 ` [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback Jens Axboe
  2009-09-14  9:36 ` [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy() Jens Axboe
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

This gets rid of work == NULL in bdi_queue_work() and puts the
OOM handling where it belongs.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c |   55 +++++++++++++++++++++++++++-------------------------
 1 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index d7592c7..5cd8b3b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -139,21 +139,19 @@ static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
 
 static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
 {
-	if (work) {
-		work->seen = bdi->wb_mask;
-		BUG_ON(!work->seen);
-		atomic_set(&work->pending, bdi->wb_cnt);
-		BUG_ON(!bdi->wb_cnt);
+	work->seen = bdi->wb_mask;
+	BUG_ON(!work->seen);
+	atomic_set(&work->pending, bdi->wb_cnt);
+	BUG_ON(!bdi->wb_cnt);
 
-		/*
-		 * Make sure stores are seen before it appears on the list
-		 */
-		smp_mb();
+	/*
+	 * Make sure stores are seen before it appears on the list
+	 */
+	smp_mb();
 
-		spin_lock(&bdi->wb_lock);
-		list_add_tail_rcu(&work->list, &bdi->work_list);
-		spin_unlock(&bdi->wb_lock);
-	}
+	spin_lock(&bdi->wb_lock);
+	list_add_tail_rcu(&work->list, &bdi->work_list);
+	spin_unlock(&bdi->wb_lock);
 
 	/*
 	 * If the default thread isn't there, make sure we add it. When
@@ -165,14 +163,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
 		struct bdi_writeback *wb = &bdi->wb;
 
 		/*
-		 * If we failed allocating the bdi work item, wake up the wb
-		 * thread always. As a safety precaution, it'll flush out
-		 * everything
+		 * End work now if this wb has no dirty IO pending. Otherwise
+		 * wakeup the handling thread
 		 */
-		if (!wb_has_dirty_io(wb)) {
-			if (work)
-				wb_clear_pending(wb, work);
-		} else if (wb->task)
+		if (!wb_has_dirty_io(wb))
+			wb_clear_pending(wb, work);
+		else if (wb->task)
 			wake_up_process(wb->task);
 	}
 }
@@ -192,11 +188,20 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
 {
 	struct bdi_work *work;
 
+	/*
+	 * This is WB_SYNC_NONE writeback, so if allocation fails just
+	 * wakeup the thread for old dirty data writeback
+	 */
 	work = kmalloc(sizeof(*work), GFP_ATOMIC);
-	if (work)
+	if (work) {
 		bdi_work_init(work, wbc);
+		bdi_queue_work(bdi, work);
+	} else {
+		struct bdi_writeback *wb = &bdi->wb;
 
-	bdi_queue_work(bdi, work);
+		if (wb->task)
+			wake_up_process(wb->task);
+	}
 }
 
 void bdi_start_writeback(struct writeback_control *wbc)
@@ -852,10 +857,8 @@ static void bdi_writeback_all(struct writeback_control *wbc)
 	rcu_read_lock();
 
 	list_for_each_entry(bdi, &bdi_list, bdi_list) {
-		if (!bdi_has_dirty_io(bdi))
-			continue;
-
-		bdi_alloc_queue_work(bdi, wbc);
+		if (bdi_has_dirty_io(bdi))
+			bdi_alloc_queue_work(bdi, wbc);
 	}
 
 	rcu_read_unlock();
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
                   ` (4 preceding siblings ...)
  2009-09-14  9:36 ` [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work() Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 13:33   ` Jan Kara
  2009-09-14  9:36 ` [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy() Jens Axboe
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

bdi_start_writeback() is currently split into two paths, one for
WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
only WB_SYNC_NONE.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 fs/fs-writeback.c   |   51 ++++++++++++++++++++++++++++++++++-----------------
 mm/page-writeback.c |    1 -
 2 files changed, 34 insertions(+), 18 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5cd8b3b..64ca471 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -204,24 +204,42 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
 	}
 }
 
-void bdi_start_writeback(struct writeback_control *wbc)
+/**
+ * bdi_sync_writeback - start and wait for writeback
+ * @wbc: writeback parameters
+ *
+ * Description:
+ *   This does WB_SYNC_ALL data integrity writeback and waits for the
+ *   IO to complete. Callers must hold the sb s_umount semaphore for
+ *   reading, to avoid having the super disappear before we are done.
+ */
+static void bdi_sync_writeback(struct writeback_control *wbc)
 {
-	/*
-	 * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
-	 * bdi_queue_work() will wake up the thread and flush old data. This
-	 * should ensure some amount of progress in freeing memory.
-	 */
-	if (wbc->sync_mode != WB_SYNC_ALL)
-		bdi_alloc_queue_work(wbc->bdi, wbc);
-	else {
-		struct bdi_work work;
+	struct bdi_work work;
 
-		bdi_work_init(&work, wbc);
-		work.state |= WS_ONSTACK;
+	wbc->sync_mode = WB_SYNC_ALL;
 
-		bdi_queue_work(wbc->bdi, &work);
-		bdi_wait_on_work_clear(&work);
-	}
+	bdi_work_init(&work, wbc);
+	work.state |= WS_ONSTACK;
+
+	bdi_queue_work(wbc->bdi, &work);
+	bdi_wait_on_work_clear(&work);
+}
+
+/**
+ * bdi_start_writeback - start writeback
+ * @wbc: writeback parameters
+ *
+ * Description:
+ *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
+ *   started when this function returns, we make no guarentees on
+ *   completion. Caller need not hold sb s_umount semaphore.
+ *
+ */
+void bdi_start_writeback(struct writeback_control *wbc)
+{
+	wbc->sync_mode = WB_SYNC_NONE;
+	bdi_alloc_queue_work(wbc->bdi, wbc);
 }
 
 /*
@@ -1119,14 +1137,13 @@ long sync_inodes_sb(struct super_block *sb)
 	struct writeback_control wbc = {
 		.sb		= sb,
 		.bdi		= sb->s_bdi,
-		.sync_mode	= WB_SYNC_ALL,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
 	long nr_to_write = LONG_MAX; /* doesn't actually matter */
 
 	wbc.nr_to_write = nr_to_write;
-	bdi_start_writeback(&wbc);
+	bdi_sync_writeback(&wbc);
 	wait_sb_inodes(&wbc);
 	return nr_to_write - wbc.nr_to_write;
 }
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index a5f0f76..f61f0cc 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,7 +585,6 @@ static void balance_dirty_pages(struct address_space *mapping)
 					  > background_thresh))) {
 		struct writeback_control wbc = {
 			.bdi		= bdi,
-			.sync_mode	= WB_SYNC_NONE,
 			.nr_to_write	= nr_writeback,
 		};
 
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy()
  2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
                   ` (5 preceding siblings ...)
  2009-09-14  9:36 ` [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback Jens Axboe
@ 2009-09-14  9:36 ` Jens Axboe
  2009-09-14 10:56   ` Jens Axboe
  6 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14  9:36 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust, Jens Axboe

We cannot safely ensure that the inodes are all gone at this point
in time, and we must not destroy this bdi with inodes having off it.
So just splice our entries to the default bdi since that one will
always persist.

Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
 mm/backing-dev.c |   12 +++++++++++-
 1 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index fd93566..27f82dc 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -668,7 +668,17 @@ void bdi_destroy(struct backing_dev_info *bdi)
 {
 	int i;
 
-	WARN_ON(bdi_has_dirty_io(bdi));
+	/*
+	 * Splice our entries to the default_backing_dev_info, if this
+	 * bdi disappears
+	 */
+	if (bdi_has_dirty_io(bdi)) {
+		struct bdi_writeback *dst = &default_backing_dev_info.wb;
+
+		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
+		list_splice(&bdi->wb.b_io, &dst->b_io);
+		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+	}
 
 	bdi_unregister(bdi);
 
-- 
1.6.4.1.207.g68ea


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy()
  2009-09-14  9:36 ` [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy() Jens Axboe
@ 2009-09-14 10:56   ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 10:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel
  Cc: chris.mason, hch, tytso, akpm, jack, trond.myklebust

On Mon, Sep 14 2009, Jens Axboe wrote:
> We cannot safely ensure that the inodes are all gone at this point
> in time, and we must not destroy this bdi with inodes having off it.
> So just splice our entries to the default bdi since that one will
> always persist.
> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  mm/backing-dev.c |   12 +++++++++++-
>  1 files changed, 11 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index fd93566..27f82dc 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -668,7 +668,17 @@ void bdi_destroy(struct backing_dev_info *bdi)
>  {
>  	int i;
>  
> -	WARN_ON(bdi_has_dirty_io(bdi));
> +	/*
> +	 * Splice our entries to the default_backing_dev_info, if this
> +	 * bdi disappears
> +	 */
> +	if (bdi_has_dirty_io(bdi)) {
> +		struct bdi_writeback *dst = &default_backing_dev_info.wb;
> +
> +		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
> +		list_splice(&bdi->wb.b_io, &dst->b_io);
> +		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> +	}

This needs locking, of course...

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/7] writeback: use RCU to protect bdi_list
  2009-09-14  9:36 ` [PATCH 4/7] writeback: use RCU to protect bdi_list Jens Axboe
@ 2009-09-14 11:10   ` Minchan Kim
  2009-09-14 11:11     ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Minchan Kim @ 2009-09-14 11:10 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon, 14 Sep 2009 11:36:31 +0200
Jens Axboe <jens.axboe@oracle.com> wrote:

> Now that bdi_writeback_all() no longer handles integrity writeback,
> it doesn't have to block anymore. This means that we can switch
> bdi_list reader side protection to RCU.
> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  fs/fs-writeback.c           |    4 +-
>  include/linux/backing-dev.h |    1 +
>  mm/backing-dev.c            |   76 +++++++++++++++++++++++++++++++------------
>  mm/page-writeback.c         |    8 ++--
>  4 files changed, 62 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5d4bd1c..d7592c7 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
>  
>  	WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
>  
> -	spin_lock(&bdi_lock);
> +	rcu_read_lock();
>  
>  	list_for_each_entry(bdi, &bdi_list, bdi_list) {

list_for_each_entry_rcu?

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 4/7] writeback: use RCU to protect bdi_list
  2009-09-14 11:10   ` Minchan Kim
@ 2009-09-14 11:11     ` Jens Axboe
  0 siblings, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 11:11 UTC (permalink / raw)
  To: Minchan Kim
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon, Sep 14 2009, Minchan Kim wrote:
> On Mon, 14 Sep 2009 11:36:31 +0200
> Jens Axboe <jens.axboe@oracle.com> wrote:
> 
> > Now that bdi_writeback_all() no longer handles integrity writeback,
> > it doesn't have to block anymore. This means that we can switch
> > bdi_list reader side protection to RCU.
> > 
> > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> > ---
> >  fs/fs-writeback.c           |    4 +-
> >  include/linux/backing-dev.h |    1 +
> >  mm/backing-dev.c            |   76 +++++++++++++++++++++++++++++++------------
> >  mm/page-writeback.c         |    8 ++--
> >  4 files changed, 62 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 5d4bd1c..d7592c7 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -849,7 +849,7 @@ static void bdi_writeback_all(struct writeback_control *wbc)
> >  
> >  	WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
> >  
> > -	spin_lock(&bdi_lock);
> > +	rcu_read_lock();
> >  
> >  	list_for_each_entry(bdi, &bdi_list, bdi_list) {
> 
> list_for_each_entry_rcu?

Indeed, pretty silly. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE
  2009-09-14  9:36 ` [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE Jens Axboe
@ 2009-09-14 12:34   ` Jan Kara
  2009-09-14 12:39     ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-14 12:34 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon 14-09-09 11:36:28, Jens Axboe wrote:
> From: Christoph Hellwig <hch@infradead.org>
> 
> Since it's an opportunistic writeback and not a data integrity action,
> don't punt to blocking writeback. Just wakeup the thread and it will
> flush old data.
> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  Looks good. Acked-by: Jan Kara <jack@suse.cz>
 BTW, don't we miss Christoph's Signed-off-by?

								Honza

> ---
>  fs/fs-writeback.c |   46 ++++++++++++++--------------------------------
>  1 files changed, 14 insertions(+), 32 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index da86ef5..1873fd0 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -75,13 +75,6 @@ static inline void bdi_work_init(struct bdi_work *work,
>  	work->state = WS_USED;
>  }
>  
> -static inline void bdi_work_init_on_stack(struct bdi_work *work,
> -					  struct writeback_control *wbc)
> -{
> -	bdi_work_init(work, wbc);
> -	work->state |= WS_ONSTACK;
> -}
> -
>  /**
>   * writeback_in_progress - determine whether there is writeback in progress
>   * @bdi: the device's backing_dev_info structure.
> @@ -207,34 +200,23 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
>  
>  void bdi_start_writeback(struct writeback_control *wbc)
>  {
> -	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
> -	struct bdi_work work_stack, *work = NULL;
> -
> -	if (!must_wait)
> -		work = bdi_alloc_work(wbc);
> +	/*
> +	 * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
> +	 * bdi_queue_work() will wake up the thread and flush old data. This
> +	 * should ensure some amount of progress in freeing memory.
> +	 */
> +	if (wbc->sync_mode != WB_SYNC_ALL) {
> +		struct bdi_work *w = bdi_alloc_work(wbc);
>  
> -	if (!work) {
> -		work = &work_stack;
> -		bdi_work_init_on_stack(work, wbc);
> -	}
> +		bdi_queue_work(wbc->bdi, w);
> +	} else {
> +		struct bdi_work work;
>  
> -	bdi_queue_work(wbc->bdi, work);
> +		bdi_work_init(&work, wbc);
> +		work.state |= WS_ONSTACK;
>  
> -	/*
> -	 * If the sync mode is WB_SYNC_ALL, block waiting for the work to
> -	 * complete. If not, we only need to wait for the work to be started,
> -	 * if we allocated it on-stack. We use the same mechanism, if the
> -	 * wait bit is set in the bdi_work struct, then threads will not
> -	 * clear pending until after they are done.
> -	 *
> -	 * Note that work == &work_stack if must_wait is true, so we don't
> -	 * need to do call_rcu() here ever, since the completion path will
> -	 * have done that for us.
> -	 */
> -	if (must_wait || work == &work_stack) {
> -		bdi_wait_on_work_clear(work);
> -		if (work != &work_stack)
> -			call_rcu(&work->rcu_head, bdi_work_free);
> +		bdi_queue_work(wbc->bdi, &work);
> +		bdi_wait_on_work_clear(&work);
>  	}
>  }
>  
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE
  2009-09-14 12:34   ` Jan Kara
@ 2009-09-14 12:39     ` Jens Axboe
  2009-09-14 13:19       ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 12:39 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm,
	trond.myklebust

On Mon, Sep 14 2009, Jan Kara wrote:
> On Mon 14-09-09 11:36:28, Jens Axboe wrote:
> > From: Christoph Hellwig <hch@infradead.org>
> > 
> > Since it's an opportunistic writeback and not a data integrity action,
> > don't punt to blocking writeback. Just wakeup the thread and it will
> > flush old data.
> > 
> > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
>   Looks good. Acked-by: Jan Kara <jack@suse.cz>

Thanks!

>  BTW, don't we miss Christoph's Signed-off-by?

We do, he didn't include one.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-14  9:36 ` [PATCH 2/7] Assign bdi in super_block Jens Axboe
@ 2009-09-14 13:02   ` Jan Kara
  2009-09-14 18:25     ` Trond Myklebust
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-14 13:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> We do this automatically in get_sb_bdev() from the set_bdev_super()
> callback. Filesystems that have their own private backing_dev_info
> must assign that in ->fill_super().
> 
> Note that ->s_bdi assignment is required for proper writeback!
> 
> Acked-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  Hmm, looking at this again, I'm not sure this will work for NFS. It seems
to set mapping->backing_dev_info to its private backing dev info for
regular files while it leaves it intact for other inodes (e.g.
directories). I'm not sure why it does so but it seems its inodes end up on
two different BDI lists and thus they wouldn't be synced properly. Trond,
do I read the code properly?
  Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
won't work for it.

									Honza
> ---
>  fs/btrfs/disk-io.c |    1 +
>  fs/fuse/inode.c    |    2 ++
>  fs/super.c         |    6 ++++++
>  fs/sync.c          |    9 ++++++++-
>  fs/ubifs/super.c   |    1 +
>  include/linux/fs.h |    1 +
>  6 files changed, 19 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 15831d5..8b81927 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -1600,6 +1600,7 @@ struct btrfs_root *open_ctree(struct super_block *sb,
>  
>  	sb->s_blocksize = 4096;
>  	sb->s_blocksize_bits = blksize_bits(4096);
> +	sb->s_bdi = &fs_info->bdi;
>  
>  	/*
>  	 * we set the i_size on the btree inode to the max possible int.
> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 4567db6..e5dbecd 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -894,6 +894,8 @@ static int fuse_fill_super(struct super_block *sb, void *data, int silent)
>  	if (err)
>  		goto err_put_conn;
>  
> +	sb->s_bdi = &fc->bdi;
> +
>  	/* Handle umasking inside the fuse code */
>  	if (sb->s_flags & MS_POSIXACL)
>  		fc->dont_mask = 1;
> diff --git a/fs/super.c b/fs/super.c
> index 9cda337..b03fea8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -707,6 +707,12 @@ static int set_bdev_super(struct super_block *s, void *data)
>  {
>  	s->s_bdev = data;
>  	s->s_dev = s->s_bdev->bd_dev;
> +
> +	/*
> +	 * We set the bdi here to the queue backing, file systems can
> +	 * overwrite this in ->fill_super()
> +	 */
> +	s->s_bdi = &bdev_get_queue(s->s_bdev)->backing_dev_info;
>  	return 0;
>  }
>  
> diff --git a/fs/sync.c b/fs/sync.c
> index 103cc7f..8582c34 100644
> --- a/fs/sync.c
> +++ b/fs/sync.c
> @@ -27,6 +27,13 @@
>   */
>  static int __sync_filesystem(struct super_block *sb, int wait)
>  {
> +	/*
> +	 * This should be safe, as we require bdi backing to actually
> +	 * write out data in the first place
> +	 */
> +	if (!sb->s_bdi)
> +		return 0;
> +
>  	/* Avoid doing twice syncing and cache pruning for quota sync */
>  	if (!wait) {
>  		writeout_quota_sb(sb, -1);
> @@ -101,7 +108,7 @@ restart:
>  		spin_unlock(&sb_lock);
>  
>  		down_read(&sb->s_umount);
> -		if (!(sb->s_flags & MS_RDONLY) && sb->s_root)
> +		if (!(sb->s_flags & MS_RDONLY) && sb->s_root && sb->s_bdi)
>  			__sync_filesystem(sb, wait);
>  		up_read(&sb->s_umount);
>  
> diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
> index 51763aa..c4af069 100644
> --- a/fs/ubifs/super.c
> +++ b/fs/ubifs/super.c
> @@ -1980,6 +1980,7 @@ static int ubifs_fill_super(struct super_block *sb, void *data, int silent)
>  	if (err)
>  		goto out_bdi;
>  
> +	sb->s_bdi = &c->bdi;
>  	sb->s_fs_info = c;
>  	sb->s_magic = UBIFS_SUPER_MAGIC;
>  	sb->s_blocksize = UBIFS_BLOCK_SIZE;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index a79f483..f998adc 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1343,6 +1343,7 @@ struct super_block {
>  	int			s_nr_dentry_unused;	/* # of dentry on lru */
>  
>  	struct block_device	*s_bdev;
> +	struct backing_dev_info *s_bdi;
>  	struct mtd_info		*s_mtd;
>  	struct list_head	s_instances;
>  	struct quota_info	s_dquot;	/* Diskquota specific options */
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout
  2009-09-14  9:36 ` [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout Jens Axboe
@ 2009-09-14 13:12   ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2009-09-14 13:12 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon 14-09-09 11:36:30, Jens Axboe wrote:
> Data integrity writeback must use bdi_start_writeback() and ensure
> that wbc->sb and wbc->bdi are set.
  This patch looks good.
Acked-by: Jan Kara <jack@suse.cz>

								Honza

> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  fs/fs-writeback.c |   69 ++++++++++------------------------------------------
>  1 files changed, 14 insertions(+), 55 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 1873fd0..5d4bd1c 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -187,7 +187,8 @@ static void bdi_wait_on_work_clear(struct bdi_work *work)
>  		    TASK_UNINTERRUPTIBLE);
>  }
>  
> -static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
> +static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
> +				 struct writeback_control *wbc)
>  {
>  	struct bdi_work *work;
>  
> @@ -195,7 +196,7 @@ static struct bdi_work *bdi_alloc_work(struct writeback_control *wbc)
>  	if (work)
>  		bdi_work_init(work, wbc);
>  
> -	return work;
> +	bdi_queue_work(bdi, work);
>  }
>  
>  void bdi_start_writeback(struct writeback_control *wbc)
> @@ -205,11 +206,9 @@ void bdi_start_writeback(struct writeback_control *wbc)
>  	 * bdi_queue_work() will wake up the thread and flush old data. This
>  	 * should ensure some amount of progress in freeing memory.
>  	 */
> -	if (wbc->sync_mode != WB_SYNC_ALL) {
> -		struct bdi_work *w = bdi_alloc_work(wbc);
> -
> -		bdi_queue_work(wbc->bdi, w);
> -	} else {
> +	if (wbc->sync_mode != WB_SYNC_ALL)
> +		bdi_alloc_queue_work(wbc->bdi, wbc);
> +	else {
>  		struct bdi_work work;
>  
>  		bdi_work_init(&work, wbc);
> @@ -840,67 +839,26 @@ int bdi_writeback_task(struct bdi_writeback *wb)
>  }
>  
>  /*
> - * Schedule writeback for all backing devices. Expensive! If this is a data
> - * integrity operation, writeback will be complete when this returns. If
> - * we are simply called for WB_SYNC_NONE, then writeback will merely be
> - * scheduled to run.
> + * Schedule writeback for all backing devices. Can only be used for
> + * WB_SYNC_NONE writeback, WB_SYNC_ALL should use bdi_start_writeback()
> + * and pass in the superblock.
>   */
>  static void bdi_writeback_all(struct writeback_control *wbc)
>  {
> -	const bool must_wait = wbc->sync_mode == WB_SYNC_ALL;
>  	struct backing_dev_info *bdi;
> -	struct bdi_work *work;
> -	LIST_HEAD(list);
>  
> -restart:
> +	WARN_ON(wbc->sync_mode == WB_SYNC_ALL);
> +
>  	spin_lock(&bdi_lock);
>  
>  	list_for_each_entry(bdi, &bdi_list, bdi_list) {
> -		struct bdi_work *work;
> -
>  		if (!bdi_has_dirty_io(bdi))
>  			continue;
>  
> -		/*
> -		 * If work allocation fails, do the writes inline. We drop
> -		 * the lock and restart the list writeout. This should be OK,
> -		 * since this happens rarely and because the writeout should
> -		 * eventually make more free memory available.
> -		 */
> -		work = bdi_alloc_work(wbc);
> -		if (!work) {
> -			struct writeback_control __wbc;
> -
> -			/*
> -			 * Not a data integrity writeout, just continue
> -			 */
> -			if (!must_wait)
> -				continue;
> -
> -			spin_unlock(&bdi_lock);
> -			__wbc = *wbc;
> -			__wbc.bdi = bdi;
> -			writeback_inodes_wbc(&__wbc);
> -			goto restart;
> -		}
> -		if (must_wait)
> -			list_add_tail(&work->wait_list, &list);
> -
> -		bdi_queue_work(bdi, work);
> +		bdi_alloc_queue_work(bdi, wbc);
>  	}
>  
>  	spin_unlock(&bdi_lock);
> -
> -	/*
> -	 * If this is for WB_SYNC_ALL, wait for pending work to complete
> -	 * before returning.
> -	 */
> -	while (!list_empty(&list)) {
> -		work = list_entry(list.next, struct bdi_work, wait_list);
> -		list_del(&work->wait_list);
> -		bdi_wait_on_work_clear(work);
> -		call_rcu(&work->rcu_head, bdi_work_free);
> -	}
>  }
>  
>  /*
> @@ -1157,6 +1115,7 @@ long sync_inodes_sb(struct super_block *sb)
>  {
>  	struct writeback_control wbc = {
>  		.sb		= sb,
> +		.bdi		= sb->s_bdi,
>  		.sync_mode	= WB_SYNC_ALL,
>  		.range_start	= 0,
>  		.range_end	= LLONG_MAX,
> @@ -1164,7 +1123,7 @@ long sync_inodes_sb(struct super_block *sb)
>  	long nr_to_write = LONG_MAX; /* doesn't actually matter */
>  
>  	wbc.nr_to_write = nr_to_write;
> -	bdi_writeback_all(&wbc);
> +	bdi_start_writeback(&wbc);
>  	wait_sb_inodes(&wbc);
>  	return nr_to_write - wbc.nr_to_write;
>  }
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work()
  2009-09-14  9:36 ` [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work() Jens Axboe
@ 2009-09-14 13:13   ` Jan Kara
  0 siblings, 0 replies; 35+ messages in thread
From: Jan Kara @ 2009-09-14 13:13 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon 14-09-09 11:36:32, Jens Axboe wrote:
> This gets rid of work == NULL in bdi_queue_work() and puts the
> OOM handling where it belongs.
  Looks good.

Acked-by: Jan Kara <jack@suse.cz>

								Honza
> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  fs/fs-writeback.c |   55 +++++++++++++++++++++++++++-------------------------
>  1 files changed, 29 insertions(+), 26 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index d7592c7..5cd8b3b 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -139,21 +139,19 @@ static void wb_clear_pending(struct bdi_writeback *wb, struct bdi_work *work)
>  
>  static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
>  {
> -	if (work) {
> -		work->seen = bdi->wb_mask;
> -		BUG_ON(!work->seen);
> -		atomic_set(&work->pending, bdi->wb_cnt);
> -		BUG_ON(!bdi->wb_cnt);
> +	work->seen = bdi->wb_mask;
> +	BUG_ON(!work->seen);
> +	atomic_set(&work->pending, bdi->wb_cnt);
> +	BUG_ON(!bdi->wb_cnt);
>  
> -		/*
> -		 * Make sure stores are seen before it appears on the list
> -		 */
> -		smp_mb();
> +	/*
> +	 * Make sure stores are seen before it appears on the list
> +	 */
> +	smp_mb();
>  
> -		spin_lock(&bdi->wb_lock);
> -		list_add_tail_rcu(&work->list, &bdi->work_list);
> -		spin_unlock(&bdi->wb_lock);
> -	}
> +	spin_lock(&bdi->wb_lock);
> +	list_add_tail_rcu(&work->list, &bdi->work_list);
> +	spin_unlock(&bdi->wb_lock);
>  
>  	/*
>  	 * If the default thread isn't there, make sure we add it. When
> @@ -165,14 +163,12 @@ static void bdi_queue_work(struct backing_dev_info *bdi, struct bdi_work *work)
>  		struct bdi_writeback *wb = &bdi->wb;
>  
>  		/*
> -		 * If we failed allocating the bdi work item, wake up the wb
> -		 * thread always. As a safety precaution, it'll flush out
> -		 * everything
> +		 * End work now if this wb has no dirty IO pending. Otherwise
> +		 * wakeup the handling thread
>  		 */
> -		if (!wb_has_dirty_io(wb)) {
> -			if (work)
> -				wb_clear_pending(wb, work);
> -		} else if (wb->task)
> +		if (!wb_has_dirty_io(wb))
> +			wb_clear_pending(wb, work);
> +		else if (wb->task)
>  			wake_up_process(wb->task);
>  	}
>  }
> @@ -192,11 +188,20 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
>  {
>  	struct bdi_work *work;
>  
> +	/*
> +	 * This is WB_SYNC_NONE writeback, so if allocation fails just
> +	 * wakeup the thread for old dirty data writeback
> +	 */
>  	work = kmalloc(sizeof(*work), GFP_ATOMIC);
> -	if (work)
> +	if (work) {
>  		bdi_work_init(work, wbc);
> +		bdi_queue_work(bdi, work);
> +	} else {
> +		struct bdi_writeback *wb = &bdi->wb;
>  
> -	bdi_queue_work(bdi, work);
> +		if (wb->task)
> +			wake_up_process(wb->task);
> +	}
>  }
>  
>  void bdi_start_writeback(struct writeback_control *wbc)
> @@ -852,10 +857,8 @@ static void bdi_writeback_all(struct writeback_control *wbc)
>  	rcu_read_lock();
>  
>  	list_for_each_entry(bdi, &bdi_list, bdi_list) {
> -		if (!bdi_has_dirty_io(bdi))
> -			continue;
> -
> -		bdi_alloc_queue_work(bdi, wbc);
> +		if (bdi_has_dirty_io(bdi))
> +			bdi_alloc_queue_work(bdi, wbc);
>  	}
>  
>  	rcu_read_unlock();
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE
  2009-09-14 12:39     ` Jens Axboe
@ 2009-09-14 13:19       ` Christoph Hellwig
  0 siblings, 0 replies; 35+ messages in thread
From: Christoph Hellwig @ 2009-09-14 13:19 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, hch, tytso,
	akpm, trond.myklebust

On Mon, Sep 14, 2009 at 02:39:29PM +0200, Jens Axboe wrote:
> 
> >  BTW, don't we miss Christoph's Signed-off-by?
> 
> We do, he didn't include one.


Signed-off-by: Christoph Hellwig <hch@lst.de>

also for Jens' slightly improved version.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14  9:36 ` [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback Jens Axboe
@ 2009-09-14 13:33   ` Jan Kara
  2009-09-14 13:42     ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-14 13:33 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm, jack,
	trond.myklebust

On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> bdi_start_writeback() is currently split into two paths, one for
> WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> only WB_SYNC_NONE.
  What I don't like about this patch is that if somebody sets up
writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
bdi_start_writeback() it will just silently convert his writeback to an
asynchronous one.
  So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
it does not match the purpose of the function...

									Honza

> 
> Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> ---
>  fs/fs-writeback.c   |   51 ++++++++++++++++++++++++++++++++++-----------------
>  mm/page-writeback.c |    1 -
>  2 files changed, 34 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5cd8b3b..64ca471 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -204,24 +204,42 @@ static void bdi_alloc_queue_work(struct backing_dev_info *bdi,
>  	}
>  }
>  
> -void bdi_start_writeback(struct writeback_control *wbc)
> +/**
> + * bdi_sync_writeback - start and wait for writeback
> + * @wbc: writeback parameters
> + *
> + * Description:
> + *   This does WB_SYNC_ALL data integrity writeback and waits for the
> + *   IO to complete. Callers must hold the sb s_umount semaphore for
> + *   reading, to avoid having the super disappear before we are done.
> + */
> +static void bdi_sync_writeback(struct writeback_control *wbc)
>  {
> -	/*
> -	 * WB_SYNC_NONE is opportunistic writeback. If this allocation fails,
> -	 * bdi_queue_work() will wake up the thread and flush old data. This
> -	 * should ensure some amount of progress in freeing memory.
> -	 */
> -	if (wbc->sync_mode != WB_SYNC_ALL)
> -		bdi_alloc_queue_work(wbc->bdi, wbc);
> -	else {
> -		struct bdi_work work;
> +	struct bdi_work work;
>  
> -		bdi_work_init(&work, wbc);
> -		work.state |= WS_ONSTACK;
> +	wbc->sync_mode = WB_SYNC_ALL;
>  
> -		bdi_queue_work(wbc->bdi, &work);
> -		bdi_wait_on_work_clear(&work);
> -	}
> +	bdi_work_init(&work, wbc);
> +	work.state |= WS_ONSTACK;
> +
> +	bdi_queue_work(wbc->bdi, &work);
> +	bdi_wait_on_work_clear(&work);
> +}
> +
> +/**
> + * bdi_start_writeback - start writeback
> + * @wbc: writeback parameters
> + *
> + * Description:
> + *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
> + *   started when this function returns, we make no guarentees on
> + *   completion. Caller need not hold sb s_umount semaphore.
> + *
> + */
> +void bdi_start_writeback(struct writeback_control *wbc)
> +{
> +	wbc->sync_mode = WB_SYNC_NONE;
> +	bdi_alloc_queue_work(wbc->bdi, wbc);
>  }
>  
>  /*
> @@ -1119,14 +1137,13 @@ long sync_inodes_sb(struct super_block *sb)
>  	struct writeback_control wbc = {
>  		.sb		= sb,
>  		.bdi		= sb->s_bdi,
> -		.sync_mode	= WB_SYNC_ALL,
>  		.range_start	= 0,
>  		.range_end	= LLONG_MAX,
>  	};
>  	long nr_to_write = LONG_MAX; /* doesn't actually matter */
>  
>  	wbc.nr_to_write = nr_to_write;
> -	bdi_start_writeback(&wbc);
> +	bdi_sync_writeback(&wbc);
>  	wait_sb_inodes(&wbc);
>  	return nr_to_write - wbc.nr_to_write;
>  }
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index a5f0f76..f61f0cc 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -585,7 +585,6 @@ static void balance_dirty_pages(struct address_space *mapping)
>  					  > background_thresh))) {
>  		struct writeback_control wbc = {
>  			.bdi		= bdi,
> -			.sync_mode	= WB_SYNC_NONE,
>  			.nr_to_write	= nr_writeback,
>  		};
>  
> -- 
> 1.6.4.1.207.g68ea
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14 13:33   ` Jan Kara
@ 2009-09-14 13:42     ` Christoph Hellwig
  2009-09-14 19:28       ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Christoph Hellwig @ 2009-09-14 13:42 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, hch, tytso,
	akpm, trond.myklebust

On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > bdi_start_writeback() is currently split into two paths, one for
> > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > only WB_SYNC_NONE.
>   What I don't like about this patch is that if somebody sets up
> writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> bdi_start_writeback() it will just silently convert his writeback to an
> asynchronous one.
>   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> it does not match the purpose of the function...

Or initialize the wb entirely inside these functions.  For the sync case
we really only need a superblock as argument, and for writeback it's
bdi + nr_pages.  And also make sure they consistenly return void as
no one cares about the return value.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-14 13:02   ` Jan Kara
@ 2009-09-14 18:25     ` Trond Myklebust
  2009-09-14 18:36       ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Trond Myklebust @ 2009-09-14 18:25 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm

On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > callback. Filesystems that have their own private backing_dev_info
> > must assign that in ->fill_super().
> > 
> > Note that ->s_bdi assignment is required for proper writeback!
> > 
> > Acked-by: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
>   Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> to set mapping->backing_dev_info to its private backing dev info for
> regular files while it leaves it intact for other inodes (e.g.
> directories). I'm not sure why it does so but it seems its inodes end up on
> two different BDI lists and thus they wouldn't be synced properly. Trond,
> do I read the code properly?
>   Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> won't work for it.

There hasn't really been a need for a bdi in NFS other than for the
regular file read and writeback code. The main reason for making it
private was to ensure that we could set a per-superblock readahead limit
that was a decent multiple of the server's preferred read block size.

Is there any reason why we couldn't set sb->s_bdi to point to that
private bdi?

Cheers
  Trond


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-14 18:25     ` Trond Myklebust
@ 2009-09-14 18:36       ` Jens Axboe
  2009-09-15 10:14         ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 18:36 UTC (permalink / raw)
  To: Trond Myklebust
  Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm

On Mon, Sep 14 2009, Trond Myklebust wrote:
> On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > callback. Filesystems that have their own private backing_dev_info
> > > must assign that in ->fill_super().
> > > 
> > > Note that ->s_bdi assignment is required for proper writeback!
> > > 
> > > Acked-by: Christoph Hellwig <hch@infradead.org>
> > > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> >   Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > to set mapping->backing_dev_info to its private backing dev info for
> > regular files while it leaves it intact for other inodes (e.g.
> > directories). I'm not sure why it does so but it seems its inodes end up on
> > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > do I read the code properly?
> >   Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > won't work for it.
> 
> There hasn't really been a need for a bdi in NFS other than for the
> regular file read and writeback code. The main reason for making it
> private was to ensure that we could set a per-superblock readahead limit
> that was a decent multiple of the server's preferred read block size.
> 
> Is there any reason why we couldn't set sb->s_bdi to point to that
> private bdi?

No, that should work fine. NFS already works fine with the bdi flusher
threads, so you should just point it at that bdi.

If you could take a look at the parent patch and give some input (or an
addition for NFS, even better), then I would much appreciate it.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14 13:42     ` Christoph Hellwig
@ 2009-09-14 19:28       ` Jens Axboe
  2009-09-14 19:42         ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 19:28 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, tytso, akpm,
	trond.myklebust

On Mon, Sep 14 2009, Christoph Hellwig wrote:
> On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > bdi_start_writeback() is currently split into two paths, one for
> > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > only WB_SYNC_NONE.
> >   What I don't like about this patch is that if somebody sets up
> > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > bdi_start_writeback() it will just silently convert his writeback to an
> > asynchronous one.
> >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > it does not match the purpose of the function...
> 
> Or initialize the wb entirely inside these functions.  For the sync case
> we really only need a superblock as argument, and for writeback it's
> bdi + nr_pages.  And also make sure they consistenly return void as
> no one cares about the return value.

Yes, I thought about doing that and like that better than the warning.
Just pass in the needed args and allocate+fill the wbc on stack. I'll
make that change.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14 19:28       ` Jens Axboe
@ 2009-09-14 19:42         ` Jens Axboe
  2009-09-15  9:08           ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-14 19:42 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, tytso, akpm,
	trond.myklebust

On Mon, Sep 14 2009, Jens Axboe wrote:
> On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > bdi_start_writeback() is currently split into two paths, one for
> > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > only WB_SYNC_NONE.
> > >   What I don't like about this patch is that if somebody sets up
> > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > bdi_start_writeback() it will just silently convert his writeback to an
> > > asynchronous one.
> > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > it does not match the purpose of the function...
> > 
> > Or initialize the wb entirely inside these functions.  For the sync case
> > we really only need a superblock as argument, and for writeback it's
> > bdi + nr_pages.  And also make sure they consistenly return void as
> > no one cares about the return value.
> 
> Yes, I thought about doing that and like that better than the warning.
> Just pass in the needed args and allocate+fill the wbc on stack. I'll
> make that change.

That works out much better, imho:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-14 19:42         ` Jens Axboe
@ 2009-09-15  9:08           ` Jan Kara
  2009-09-15  9:14             ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-15  9:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Christoph Hellwig, Jan Kara, linux-kernel, linux-fsdevel,
	chris.mason, tytso, akpm, trond.myklebust

On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> On Mon, Sep 14 2009, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > only WB_SYNC_NONE.
> > > >   What I don't like about this patch is that if somebody sets up
> > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > asynchronous one.
> > > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > it does not match the purpose of the function...
> > > 
> > > Or initialize the wb entirely inside these functions.  For the sync case
> > > we really only need a superblock as argument, and for writeback it's
> > > bdi + nr_pages.  And also make sure they consistenly return void as
> > > no one cares about the return value.
> > 
> > Yes, I thought about doing that and like that better than the warning.
> > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > make that change.
> 
> That works out much better, imho:
> 
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
  Yeah, the code looks better. BTW, how about converting also
bdi_writeback_all() to get superblock and nr_pages as an argument?
Currently it seems to be the only place "above" flusher thread which uses
wbc and it's just constructed in the callers of bdi_writeback_all() and
then disassembled inside the function...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15  9:08           ` Jan Kara
@ 2009-09-15  9:14             ` Jens Axboe
  2009-09-15 11:44               ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-15  9:14 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, chris.mason,
	tytso, akpm, trond.myklebust

On Tue, Sep 15 2009, Jan Kara wrote:
> On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > only WB_SYNC_NONE.
> > > > >   What I don't like about this patch is that if somebody sets up
> > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > asynchronous one.
> > > > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > it does not match the purpose of the function...
> > > > 
> > > > Or initialize the wb entirely inside these functions.  For the sync case
> > > > we really only need a superblock as argument, and for writeback it's
> > > > bdi + nr_pages.  And also make sure they consistenly return void as
> > > > no one cares about the return value.
> > > 
> > > Yes, I thought about doing that and like that better than the warning.
> > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > make that change.
> > 
> > That works out much better, imho:
> > 
> > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
>   Yeah, the code looks better. BTW, how about converting also
> bdi_writeback_all() to get superblock and nr_pages as an argument?
> Currently it seems to be the only place "above" flusher thread which uses
> wbc and it's just constructed in the callers of bdi_writeback_all() and
> then disassembled inside the function...

Yes good point, I'll include that too. Thanks!

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-14 18:36       ` Jens Axboe
@ 2009-09-15 10:14         ` Jan Kara
  2009-09-15 10:22           ` Jens Axboe
  2009-09-15 13:16           ` Trond Myklebust
  0 siblings, 2 replies; 35+ messages in thread
From: Jan Kara @ 2009-09-15 10:14 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Trond Myklebust, Jan Kara, linux-kernel, linux-fsdevel,
	chris.mason, hch, tytso, akpm

On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> On Mon, Sep 14 2009, Trond Myklebust wrote:
> > On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > > callback. Filesystems that have their own private backing_dev_info
> > > > must assign that in ->fill_super().
> > > > 
> > > > Note that ->s_bdi assignment is required for proper writeback!
> > > > 
> > > > Acked-by: Christoph Hellwig <hch@infradead.org>
> > > > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> > >   Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > > to set mapping->backing_dev_info to its private backing dev info for
> > > regular files while it leaves it intact for other inodes (e.g.
> > > directories). I'm not sure why it does so but it seems its inodes end up on
> > > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > > do I read the code properly?
> > >   Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > > won't work for it.
> > 
> > There hasn't really been a need for a bdi in NFS other than for the
> > regular file read and writeback code. The main reason for making it
> > private was to ensure that we could set a per-superblock readahead limit
> > that was a decent multiple of the server's preferred read block size.
> > 
> > Is there any reason why we couldn't set sb->s_bdi to point to that
> > private bdi?
> 
> No, that should work fine. NFS already works fine with the bdi flusher
> threads, so you should just point it at that bdi.
  But will it really work well? I mean if we sync the superblock on the
client, it will sync only the private BDI. So it won't sync any directory
inodes because they are on the default_backing_dev_info (NFS leaves
sb->s_bdev at NULL).

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-15 10:14         ` Jan Kara
@ 2009-09-15 10:22           ` Jens Axboe
  2009-09-15 13:16           ` Trond Myklebust
  1 sibling, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2009-09-15 10:22 UTC (permalink / raw)
  To: Jan Kara
  Cc: Trond Myklebust, linux-kernel, linux-fsdevel, chris.mason, hch,
	tytso, akpm

On Tue, Sep 15 2009, Jan Kara wrote:
> On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> > On Mon, Sep 14 2009, Trond Myklebust wrote:
> > > On Mon, 2009-09-14 at 15:02 +0200, Jan Kara wrote:
> > > > On Mon 14-09-09 11:36:29, Jens Axboe wrote:
> > > > > We do this automatically in get_sb_bdev() from the set_bdev_super()
> > > > > callback. Filesystems that have their own private backing_dev_info
> > > > > must assign that in ->fill_super().
> > > > > 
> > > > > Note that ->s_bdi assignment is required for proper writeback!
> > > > > 
> > > > > Acked-by: Christoph Hellwig <hch@infradead.org>
> > > > > Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
> > > >   Hmm, looking at this again, I'm not sure this will work for NFS. It seems
> > > > to set mapping->backing_dev_info to its private backing dev info for
> > > > regular files while it leaves it intact for other inodes (e.g.
> > > > directories). I'm not sure why it does so but it seems its inodes end up on
> > > > two different BDI lists and thus they wouldn't be synced properly. Trond,
> > > > do I read the code properly?
> > > >   Also we definitely need to set *some* bdi in nfs_get_sb as otherwise sync
> > > > won't work for it.
> > > 
> > > There hasn't really been a need for a bdi in NFS other than for the
> > > regular file read and writeback code. The main reason for making it
> > > private was to ensure that we could set a per-superblock readahead limit
> > > that was a decent multiple of the server's preferred read block size.
> > > 
> > > Is there any reason why we couldn't set sb->s_bdi to point to that
> > > private bdi?
> > 
> > No, that should work fine. NFS already works fine with the bdi flusher
> > threads, so you should just point it at that bdi.
>   But will it really work well? I mean if we sync the superblock on the
> client, it will sync only the private BDI. So it won't sync any directory
> inodes because they are on the default_backing_dev_info (NFS leaves
> sb->s_bdev at NULL).

No you are right, it wont be complete. I talked to Chris about this
yesterday, and we agree that the best option is to give NFS the 'btrfs
treatment' bdi wise.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15  9:14             ` Jens Axboe
@ 2009-09-15 11:44               ` Jens Axboe
  2009-09-15 12:58                 ` Jan Kara
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-15 11:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, chris.mason,
	tytso, akpm, trond.myklebust

On Tue, Sep 15 2009, Jens Axboe wrote:
> On Tue, Sep 15 2009, Jan Kara wrote:
> > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > only WB_SYNC_NONE.
> > > > > >   What I don't like about this patch is that if somebody sets up
> > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > asynchronous one.
> > > > > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > it does not match the purpose of the function...
> > > > > 
> > > > > Or initialize the wb entirely inside these functions.  For the sync case
> > > > > we really only need a superblock as argument, and for writeback it's
> > > > > bdi + nr_pages.  And also make sure they consistenly return void as
> > > > > no one cares about the return value.
> > > > 
> > > > Yes, I thought about doing that and like that better than the warning.
> > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > make that change.
> > > 
> > > That works out much better, imho:
> > > 
> > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> >   Yeah, the code looks better. BTW, how about converting also
> > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > Currently it seems to be the only place "above" flusher thread which uses
> > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > then disassembled inside the function...
> 
> Yes good point, I'll include that too. Thanks!

One small problem there, though... Currently all queued writeback is
range cyclic, however with this change then we drop that bit from
sync_inodes_sb() which isn't range_cyclic and instead just specifies the
whole range.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 11:44               ` Jens Axboe
@ 2009-09-15 12:58                 ` Jan Kara
  2009-09-15 13:04                   ` Jens Axboe
  0 siblings, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-15 12:58 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Christoph Hellwig, linux-kernel, linux-fsdevel,
	chris.mason, tytso, akpm, trond.myklebust

On Tue 15-09-09 13:44:26, Jens Axboe wrote:
> On Tue, Sep 15 2009, Jens Axboe wrote:
> > On Tue, Sep 15 2009, Jan Kara wrote:
> > > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > > only WB_SYNC_NONE.
> > > > > > >   What I don't like about this patch is that if somebody sets up
> > > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > > asynchronous one.
> > > > > > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > > it does not match the purpose of the function...
> > > > > > 
> > > > > > Or initialize the wb entirely inside these functions.  For the sync case
> > > > > > we really only need a superblock as argument, and for writeback it's
> > > > > > bdi + nr_pages.  And also make sure they consistenly return void as
> > > > > > no one cares about the return value.
> > > > > 
> > > > > Yes, I thought about doing that and like that better than the warning.
> > > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > > make that change.
> > > > 
> > > > That works out much better, imho:
> > > > 
> > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> > >   Yeah, the code looks better. BTW, how about converting also
> > > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > > Currently it seems to be the only place "above" flusher thread which uses
> > > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > > then disassembled inside the function...
> > 
> > Yes good point, I'll include that too. Thanks!
> 
> One small problem there, though... Currently all queued writeback is
> range cyclic, however with this change then we drop that bit from
> sync_inodes_sb() which isn't range_cyclic and instead just specifies the
> whole range.
  I'm not sure I understand your comment but I see a problem that even
writeback queued from sync_inodes_sb() is processed by wb_writeback() which
sets range_cyclic. That's a bug even in your old patchset.
  Let's have a look at the flags in wbc:
  nonblocking - Currently only set by direct callers of ->writepage() BUT
                originally wb_kupdate() and background_writeout() also
                set this flag. Since filesystems and write_cache_pages()
		use the flag we should set it for equivalent writeouts as
                well. This should be fixed...
  encountered_congestion - Checked only by AFS, probably we can get rid of
                it now.
  for_kupdate - Used, we set it properly in wb_writeback() so that is fine.
  for_reclaim - Used by direct callers of ->writepage(). OK.
  for_writepages - Only set. Get rid of it.
  range_cyclic - Used. Set also when a caller didn't want it - should be
                 fixed.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 12:58                 ` Jan Kara
@ 2009-09-15 13:04                   ` Jens Axboe
  2009-09-15 13:08                     ` Christoph Hellwig
  0 siblings, 1 reply; 35+ messages in thread
From: Jens Axboe @ 2009-09-15 13:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, chris.mason,
	tytso, akpm, trond.myklebust

On Tue, Sep 15 2009, Jan Kara wrote:
> On Tue 15-09-09 13:44:26, Jens Axboe wrote:
> > On Tue, Sep 15 2009, Jens Axboe wrote:
> > > On Tue, Sep 15 2009, Jan Kara wrote:
> > > > On Mon 14-09-09 21:42:43, Jens Axboe wrote:
> > > > > On Mon, Sep 14 2009, Jens Axboe wrote:
> > > > > > On Mon, Sep 14 2009, Christoph Hellwig wrote:
> > > > > > > On Mon, Sep 14, 2009 at 03:33:07PM +0200, Jan Kara wrote:
> > > > > > > > On Mon 14-09-09 11:36:33, Jens Axboe wrote:
> > > > > > > > > bdi_start_writeback() is currently split into two paths, one for
> > > > > > > > > WB_SYNC_NONE and one for WB_SYNC_ALL. Add bdi_sync_writeback()
> > > > > > > > > for WB_SYNC_ALL writeback and let bdi_start_writeback() handle
> > > > > > > > > only WB_SYNC_NONE.
> > > > > > > >   What I don't like about this patch is that if somebody sets up
> > > > > > > > writeback_control with WB_SYNC_ALL mode set and then submits it to disk via
> > > > > > > > bdi_start_writeback() it will just silently convert his writeback to an
> > > > > > > > asynchronous one.
> > > > > > > >   So I'd maybe leave setting of sync_mode to the caller and just WARN_ON if
> > > > > > > > it does not match the purpose of the function...
> > > > > > > 
> > > > > > > Or initialize the wb entirely inside these functions.  For the sync case
> > > > > > > we really only need a superblock as argument, and for writeback it's
> > > > > > > bdi + nr_pages.  And also make sure they consistenly return void as
> > > > > > > no one cares about the return value.
> > > > > > 
> > > > > > Yes, I thought about doing that and like that better than the warning.
> > > > > > Just pass in the needed args and allocate+fill the wbc on stack. I'll
> > > > > > make that change.
> > > > > 
> > > > > That works out much better, imho:
> > > > > 
> > > > > http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=270c12655d7d11e234d335a8ab0540c02c034b66
> > > >   Yeah, the code looks better. BTW, how about converting also
> > > > bdi_writeback_all() to get superblock and nr_pages as an argument?
> > > > Currently it seems to be the only place "above" flusher thread which uses
> > > > wbc and it's just constructed in the callers of bdi_writeback_all() and
> > > > then disassembled inside the function...
> > > 
> > > Yes good point, I'll include that too. Thanks!
> > 
> > One small problem there, though... Currently all queued writeback is
> > range cyclic, however with this change then we drop that bit from
> > sync_inodes_sb() which isn't range_cyclic and instead just specifies the
> > whole range.
>   I'm not sure I understand your comment but I see a problem that even
> writeback queued from sync_inodes_sb() is processed by wb_writeback() which
> sets range_cyclic. That's a bug even in your old patchset.

Indeed, looks like I have to double check all that again.

>   Let's have a look at the flags in wbc:
>   nonblocking - Currently only set by direct callers of ->writepage() BUT
>                 originally wb_kupdate() and background_writeout() also
>                 set this flag. Since filesystems and write_cache_pages()
> 		use the flag we should set it for equivalent writeouts as
>                 well. This should be fixed...

Since this is all handled by the dedicated thread now, dropping the
nonblocking bit was on purpose. What would the point be, except for
stopping pdflush being blocked on request allocation?

>   encountered_congestion - Checked only by AFS, probably we can get rid of
>                 it now.
>   for_kupdate - Used, we set it properly in wb_writeback() so that is fine.
>   for_reclaim - Used by direct callers of ->writepage(). OK.
>   for_writepages - Only set. Get rid of it.

Hmm indeed, that predates the patchset though. But I can queue up a
removal.

>   range_cyclic - Used. Set also when a caller didn't want it - should be
>                  fixed.

Yes, that one wants fixing. Will probably have to pass that in through
the 'work' structure.

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 13:04                   ` Jens Axboe
@ 2009-09-15 13:08                     ` Christoph Hellwig
  2009-09-15 13:17                       ` Jens Axboe
  2009-09-15 14:01                       ` Jan Kara
  0 siblings, 2 replies; 35+ messages in thread
From: Christoph Hellwig @ 2009-09-15 13:08 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Jan Kara, Christoph Hellwig, linux-kernel, linux-fsdevel,
	chris.mason, tytso, akpm, trond.myklebust

On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> 
> >   Let's have a look at the flags in wbc:
> >   nonblocking - Currently only set by direct callers of ->writepage() BUT
> >                 originally wb_kupdate() and background_writeout() also
> >                 set this flag. Since filesystems and write_cache_pages()
> > 		use the flag we should set it for equivalent writeouts as
> >                 well. This should be fixed...
> 
> Since this is all handled by the dedicated thread now, dropping the
> nonblocking bit was on purpose. What would the point be, except for
> stopping pdflush being blocked on request allocation?

Note that this flag just caused utter mess traditionally.  btrfs decided
to ignore it completely and ext4 partially.  Removing this check in
XFS increases large bufferd write loads massively.

Just half-removing it is a bad idea, though - if you don't set it
anymore please kill it entirely.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 2/7] Assign bdi in super_block
  2009-09-15 10:14         ` Jan Kara
  2009-09-15 10:22           ` Jens Axboe
@ 2009-09-15 13:16           ` Trond Myklebust
  1 sibling, 0 replies; 35+ messages in thread
From: Trond Myklebust @ 2009-09-15 13:16 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jens Axboe, linux-kernel, linux-fsdevel, chris.mason, hch, tytso, akpm

On Tue, 2009-09-15 at 12:14 +0200, Jan Kara wrote:
> On Mon 14-09-09 20:36:54, Jens Axboe wrote:
> > No, that should work fine. NFS already works fine with the bdi flusher
> > threads, so you should just point it at that bdi.
>   But will it really work well? I mean if we sync the superblock on the
> client, it will sync only the private BDI. So it won't sync any directory
> inodes because they are on the default_backing_dev_info (NFS leaves
> sb->s_bdev at NULL).

All directory related operations (link, rename, create, ...) are fully
synchronous in NFS. There should be no need to set up anything to
synchronise directory inodes.

Cheers
  Trond


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 13:08                     ` Christoph Hellwig
@ 2009-09-15 13:17                       ` Jens Axboe
  2009-09-15 14:01                       ` Jan Kara
  1 sibling, 0 replies; 35+ messages in thread
From: Jens Axboe @ 2009-09-15 13:17 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jan Kara, linux-kernel, linux-fsdevel, chris.mason, tytso, akpm,
	trond.myklebust

On Tue, Sep 15 2009, Christoph Hellwig wrote:
> On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> > 
> > >   Let's have a look at the flags in wbc:
> > >   nonblocking - Currently only set by direct callers of ->writepage() BUT
> > >                 originally wb_kupdate() and background_writeout() also
> > >                 set this flag. Since filesystems and write_cache_pages()
> > > 		use the flag we should set it for equivalent writeouts as
> > >                 well. This should be fixed...
> > 
> > Since this is all handled by the dedicated thread now, dropping the
> > nonblocking bit was on purpose. What would the point be, except for
> > stopping pdflush being blocked on request allocation?
> 
> Note that this flag just caused utter mess traditionally.  btrfs decided
> to ignore it completely and ext4 partially.  Removing this check in
> XFS increases large bufferd write loads massively.
> 
> Just half-removing it is a bad idea, though - if you don't set it
> anymore please kill it entirely.

I haven't touched it, except removing it from the caller where it
doesn't make sense anymore. If you think we should kill it completely,
then lets look at that in a few days. I've got more than enough stuff
queued up for inclusion now that I need to test and verify before doing
even more cleanups/changes :-)

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 13:08                     ` Christoph Hellwig
  2009-09-15 13:17                       ` Jens Axboe
@ 2009-09-15 14:01                       ` Jan Kara
  2009-09-15 14:09                         ` Chris Mason
  1 sibling, 1 reply; 35+ messages in thread
From: Jan Kara @ 2009-09-15 14:01 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Jan Kara, linux-kernel, linux-fsdevel, chris.mason,
	tytso, akpm, trond.myklebust

On Tue 15-09-09 09:08:29, Christoph Hellwig wrote:
> On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> > 
> > >   Let's have a look at the flags in wbc:
> > >   nonblocking - Currently only set by direct callers of ->writepage() BUT
> > >                 originally wb_kupdate() and background_writeout() also
> > >                 set this flag. Since filesystems and write_cache_pages()
> > > 		use the flag we should set it for equivalent writeouts as
> > >                 well. This should be fixed...
> > 
> > Since this is all handled by the dedicated thread now, dropping the
> > nonblocking bit was on purpose. What would the point be, except for
> > stopping pdflush being blocked on request allocation?
> 
> Note that this flag just caused utter mess traditionally.  btrfs decided
> to ignore it completely and ext4 partially.  Removing this check in
> XFS increases large bufferd write loads massively.
> 
> Just half-removing it is a bad idea, though - if you don't set it
> anymore please kill it entirely.
  The nonblocking flag is still set for writeback done for memory reclaim.
OTOH the only real consumer of this flag now seems to be
__block_write_full_page() which does trylock_buffer() in case of
nonblocking writeback. I'm undecided whether it makes sence or not.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback
  2009-09-15 14:01                       ` Jan Kara
@ 2009-09-15 14:09                         ` Chris Mason
  0 siblings, 0 replies; 35+ messages in thread
From: Chris Mason @ 2009-09-15 14:09 UTC (permalink / raw)
  To: Jan Kara
  Cc: Christoph Hellwig, Jens Axboe, linux-kernel, linux-fsdevel,
	tytso, akpm, trond.myklebust

On Tue, Sep 15, 2009 at 04:01:45PM +0200, Jan Kara wrote:
> On Tue 15-09-09 09:08:29, Christoph Hellwig wrote:
> > On Tue, Sep 15, 2009 at 03:04:19PM +0200, Jens Axboe wrote:
> > > 
> > > >   Let's have a look at the flags in wbc:
> > > >   nonblocking - Currently only set by direct callers of ->writepage() BUT
> > > >                 originally wb_kupdate() and background_writeout() also
> > > >                 set this flag. Since filesystems and write_cache_pages()
> > > > 		use the flag we should set it for equivalent writeouts as
> > > >                 well. This should be fixed...
> > > 
> > > Since this is all handled by the dedicated thread now, dropping the
> > > nonblocking bit was on purpose. What would the point be, except for
> > > stopping pdflush being blocked on request allocation?
> > 
> > Note that this flag just caused utter mess traditionally.  btrfs decided
> > to ignore it completely and ext4 partially.  Removing this check in
> > XFS increases large bufferd write loads massively.
> > 
> > Just half-removing it is a bad idea, though - if you don't set it
> > anymore please kill it entirely.
>   The nonblocking flag is still set for writeback done for memory reclaim.
> OTOH the only real consumer of this flag now seems to be
> __block_write_full_page() which does trylock_buffer() in case of
> nonblocking writeback. I'm undecided whether it makes sence or not.

Ugh, making sense is tricky to say.  If __block_write_full_page
does a lock_buffer() instead of a trylock_buffer(), and ext3 is mounted in
data=ordered mode then it is very possible that we'll end up with a
dirty page with locked buffers.

The buffers will have been locked by ext3 data=ordered writeback and
they won't unlock until the IO is done.

We probably don't want kswapd waiting on that writeback.

-chris




^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2009-09-15 14:10 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-09-14  9:36 [PATCH 0/7] Post merge per-bdi writeback patches v2 Jens Axboe
2009-09-14  9:36 ` [PATCH 1/7] writeback: merely wakeup flusher thread if work allocation fails for WB_SYNC_NONE Jens Axboe
2009-09-14 12:34   ` Jan Kara
2009-09-14 12:39     ` Jens Axboe
2009-09-14 13:19       ` Christoph Hellwig
2009-09-14  9:36 ` [PATCH 2/7] Assign bdi in super_block Jens Axboe
2009-09-14 13:02   ` Jan Kara
2009-09-14 18:25     ` Trond Myklebust
2009-09-14 18:36       ` Jens Axboe
2009-09-15 10:14         ` Jan Kara
2009-09-15 10:22           ` Jens Axboe
2009-09-15 13:16           ` Trond Myklebust
2009-09-14  9:36 ` [PATCH 3/7] writeback: only use bdi_writeback_all() for WB_SYNC_NONE writeout Jens Axboe
2009-09-14 13:12   ` Jan Kara
2009-09-14  9:36 ` [PATCH 4/7] writeback: use RCU to protect bdi_list Jens Axboe
2009-09-14 11:10   ` Minchan Kim
2009-09-14 11:11     ` Jens Axboe
2009-09-14  9:36 ` [PATCH 5/7] writeback: inline allocation failure handling in bdi_alloc_queue_work() Jens Axboe
2009-09-14 13:13   ` Jan Kara
2009-09-14  9:36 ` [PATCH 6/7] writeback: separate starting of sync vs opportunistic writeback Jens Axboe
2009-09-14 13:33   ` Jan Kara
2009-09-14 13:42     ` Christoph Hellwig
2009-09-14 19:28       ` Jens Axboe
2009-09-14 19:42         ` Jens Axboe
2009-09-15  9:08           ` Jan Kara
2009-09-15  9:14             ` Jens Axboe
2009-09-15 11:44               ` Jens Axboe
2009-09-15 12:58                 ` Jan Kara
2009-09-15 13:04                   ` Jens Axboe
2009-09-15 13:08                     ` Christoph Hellwig
2009-09-15 13:17                       ` Jens Axboe
2009-09-15 14:01                       ` Jan Kara
2009-09-15 14:09                         ` Chris Mason
2009-09-14  9:36 ` [PATCH 7/7] writeback: splice dirty inode entries to default bdi on bdi_destroy() Jens Axboe
2009-09-14 10:56   ` Jens Axboe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.