* [GIT PULL] writeback changes for 3.5-rc1
@ 2012-05-28 11:41 Fengguang Wu
  2012-05-28 17:09 ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Fengguang Wu @ 2012-05-28 11:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: LKML

Hi Linus,

Please pull the writeback changes, mainly from Jan Kara, which avoid
calling iput() from the flusher threads.

Thanks,
Fengguang
---

The following changes since commit 668ce0ac707719d866af7e432e518af7b4c575ad:

  Merge branch 'systemh-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux (2012-04-13 19:44:36 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux.git tags/writeback

for you to fetch changes up to 169ebd90131b2ffca74bb2dbe7eeacd39fb83714:

  writeback: Avoid iput() from flusher thread (2012-05-06 13:43:41 +0800)

----------------------------------------------------------------
avoid iput() from flusher thread

----------------------------------------------------------------
Fengguang Wu (1):
      writeback: initialize global_dirty_limit

H Hartley Sweeten (1):
      mm: page-writeback.c: local functions should not be exposed globally

Jan Kara (9):
      writeback: Move clearing of I_SYNC into inode_sync_complete()
      writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
      writeback: Move I_DIRTY_PAGES handling
      writeback: Separate inode requeueing after writeback
      writeback: Remove wb->list_lock from writeback_single_inode()
      writeback: Refactor writeback_single_inode()
      vfs: Move waiting for inode writeback from end_writeback() to evict_inode()
      vfs: Rename end_writeback() to clear_inode()
      writeback: Avoid iput() from flusher thread

Richard Kennedy (1):
      fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds

 Documentation/filesystems/porting         |   16 +-
 arch/powerpc/platforms/cell/spufs/inode.c |    2 +-
 arch/s390/hypfs/inode.c                   |    2 +-
 fs/9p/vfs_inode.c                         |    2 +-
 fs/affs/inode.c                           |    2 +-
 fs/afs/inode.c                            |    2 +-
 fs/autofs4/inode.c                        |    2 +-
 fs/bfs/inode.c                            |    2 +-
 fs/binfmt_misc.c                          |    2 +-
 fs/block_dev.c                            |    2 +-
 fs/btrfs/inode.c                          |    2 +-
 fs/cifs/cifsfs.c                          |    2 +-
 fs/coda/inode.c                           |    2 +-
 fs/ecryptfs/super.c                       |    2 +-
 fs/exofs/inode.c                          |    4 +-
 fs/ext2/inode.c                           |    2 +-
 fs/ext3/inode.c                           |    6 +-
 fs/ext4/super.c                           |    2 +-
 fs/fat/inode.c                            |    2 +-
 fs/freevxfs/vxfs_inode.c                  |    2 +-
 fs/fs-writeback.c                         |  336 ++++++++++++++++++-----------
 fs/fuse/inode.c                           |    2 +-
 fs/gfs2/super.c                           |    2 +-
 fs/hfs/inode.c                            |    2 +-
 fs/hfsplus/super.c                        |    2 +-
 fs/hostfs/hostfs_kern.c                   |    2 +-
 fs/hpfs/inode.c                           |    2 +-
 fs/hppfs/hppfs.c                          |    2 +-
 fs/hugetlbfs/inode.c                      |    2 +-
 fs/inode.c                                |   15 +-
 fs/jffs2/fs.c                             |    2 +-
 fs/jfs/inode.c                            |    2 +-
 fs/logfs/readwrite.c                      |    2 +-
 fs/minix/inode.c                          |    2 +-
 fs/ncpfs/inode.c                          |    2 +-
 fs/nfs/inode.c                            |    4 +-
 fs/nilfs2/inode.c                         |    4 +-
 fs/ntfs/inode.c                           |    2 +-
 fs/ocfs2/dlmfs/dlmfs.c                    |    2 +-
 fs/ocfs2/inode.c                          |    2 +-
 fs/omfs/inode.c                           |    2 +-
 fs/proc/inode.c                           |    2 +-
 fs/pstore/inode.c                         |    2 +-
 fs/reiserfs/inode.c                       |    4 +-
 fs/sysfs/inode.c                          |    2 +-
 fs/sysv/inode.c                           |    2 +-
 fs/ubifs/super.c                          |    2 +-
 fs/udf/inode.c                            |    2 +-
 fs/ufs/inode.c                            |    2 +-
 fs/xfs/xfs_super.c                        |    2 +-
 include/linux/fs.h                        |   13 +-
 include/linux/writeback.h                 |   10 +-
 include/trace/events/writeback.h          |   36 +++-
 ipc/mqueue.c                              |    2 +-
 mm/page-writeback.c                       |    3 +-
 mm/shmem.c                                |    2 +-
 56 files changed, 319 insertions(+), 220 deletions(-)

diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting
index 74acd96..8c91d10 100644
--- a/Documentation/filesystems/porting
+++ b/Documentation/filesystems/porting
@@ -297,7 +297,8 @@ in the beginning of ->setattr unconditionally.
 be used instead.  It gets called whenever the inode is evicted, whether it has
 remaining links or not.  Caller does *not* evict the pagecache or inode-associated
 metadata buffers; getting rid of those is responsibility of method, as it had
-been for ->delete_inode().
+been for ->delete_inode(). Caller makes sure async writeback cannot be running
+for the inode while (or after) ->evict_inode() is called.
 
 	->drop_inode() returns int now; it's called on final iput() with
 inode->i_lock held and it returns true if filesystems wants the inode to be
@@ -306,14 +307,11 @@ updated appropriately.  generic_delete_inode() is also alive and it consists
 simply of return 1.  Note that all actual eviction work is done by caller after
 ->drop_inode() returns.
 
-	clear_inode() is gone; use end_writeback() instead.  As before, it must
-be called exactly once on each call of ->evict_inode() (as it used to be for
-each call of ->delete_inode()).  Unlike before, if you are using inode-associated
-metadata buffers (i.e. mark_buffer_dirty_inode()), it's your responsibility to
-call invalidate_inode_buffers() before end_writeback().
-	No async writeback (and thus no calls of ->write_inode()) will happen
-after end_writeback() returns, so actions that should not overlap with ->write_inode()
-(e.g. freeing on-disk inode if i_nlink is 0) ought to be done after that call.
+	As before, clear_inode() must be called exactly once on each call of
+->evict_inode() (as it used to be for each call of ->delete_inode()).  Unlike
+before, if you are using inode-associated metadata buffers (i.e.
+mark_buffer_dirty_inode()), it's your responsibility to call
+invalidate_inode_buffers() before clear_inode().
 
 	NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out
 if it's zero is not *and* *never* *had* *been* enough.  Final unlink() and iput()
diff --git a/arch/powerpc/platforms/cell/spufs/inode.c b/arch/powerpc/platforms/cell/spufs/inode.c
index 1d75c92..66519d2 100644
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -151,7 +151,7 @@ static void
 spufs_evict_inode(struct inode *inode)
 {
 	struct spufs_inode_info *ei = SPUFS_I(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (ei->i_ctx)
 		put_spu_context(ei->i_ctx);
 	if (ei->i_gang)
diff --git a/arch/s390/hypfs/inode.c b/arch/s390/hypfs/inode.c
index 6a2cb56..73dae8b 100644
--- a/arch/s390/hypfs/inode.c
+++ b/arch/s390/hypfs/inode.c
@@ -115,7 +115,7 @@ static struct inode *hypfs_make_inode(struct super_block *sb, umode_t mode)
 
 static void hypfs_evict_inode(struct inode *inode)
 {
-	end_writeback(inode);
+	clear_inode(inode);
 	kfree(inode->i_private);
 }
 
diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c
index 014c8dd..57ccb75 100644
--- a/fs/9p/vfs_inode.c
+++ b/fs/9p/vfs_inode.c
@@ -448,7 +448,7 @@ void v9fs_evict_inode(struct inode *inode)
 	struct v9fs_inode *v9inode = V9FS_I(inode);
 
 	truncate_inode_pages(inode->i_mapping, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	filemap_fdatawrite(inode->i_mapping);
 
 #ifdef CONFIG_9P_FSCACHE
diff --git a/fs/affs/inode.c b/fs/affs/inode.c
index 88a4b0b..8bc4a59 100644
--- a/fs/affs/inode.c
+++ b/fs/affs/inode.c
@@ -264,7 +264,7 @@ affs_evict_inode(struct inode *inode)
 	}
 
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	affs_free_prealloc(inode);
 	cache_page = (unsigned long)AFFS_I(inode)->i_lc;
 	if (cache_page) {
diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index d890ae3..95cffd3 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -423,7 +423,7 @@ void afs_evict_inode(struct inode *inode)
 	ASSERTCMP(inode->i_ino, ==, vnode->fid.vnode);
 
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	afs_give_up_callback(vnode);
 
diff --git a/fs/autofs4/inode.c b/fs/autofs4/inode.c
index d8dc002..df31ddb 100644
--- a/fs/autofs4/inode.c
+++ b/fs/autofs4/inode.c
@@ -101,7 +101,7 @@ static int autofs4_show_options(struct seq_file *m, struct dentry *root)
 
 static void autofs4_evict_inode(struct inode *inode)
 {
-	end_writeback(inode);
+	clear_inode(inode);
 	kfree(inode->i_private);
 }
 
diff --git a/fs/bfs/inode.c b/fs/bfs/inode.c
index e23dc7c..9870417 100644
--- a/fs/bfs/inode.c
+++ b/fs/bfs/inode.c
@@ -174,7 +174,7 @@ static void bfs_evict_inode(struct inode *inode)
 
 	truncate_inode_pages(&inode->i_data, 0);
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	if (inode->i_nlink)
 		return;
diff --git a/fs/binfmt_misc.c b/fs/binfmt_misc.c
index 613aa06..790b3cd 100644
--- a/fs/binfmt_misc.c
+++ b/fs/binfmt_misc.c
@@ -505,7 +505,7 @@ static struct inode *bm_get_inode(struct super_block *sb, int mode)
 
 static void bm_evict_inode(struct inode *inode)
 {
-	end_writeback(inode);
+	clear_inode(inode);
 	kfree(inode->i_private);
 }
 
diff --git a/fs/block_dev.c b/fs/block_dev.c
index e08f6a20..d8a7959 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -487,7 +487,7 @@ static void bdev_evict_inode(struct inode *inode)
 	struct list_head *p;
 	truncate_inode_pages(&inode->i_data, 0);
 	invalidate_inode_buffers(inode); /* is it needed here? */
-	end_writeback(inode);
+	clear_inode(inode);
 	spin_lock(&bdev_lock);
 	while ( (p = bdev->bd_inodes.next) != &bdev->bd_inodes ) {
 		__bd_forget(list_entry(p, struct inode, i_devices));
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 115bc05..5c058c4 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3756,7 +3756,7 @@ void btrfs_evict_inode(struct inode *inode)
 	btrfs_end_transaction(trans, root);
 	btrfs_btree_balance_dirty(root, nr);
 no_delete:
-	end_writeback(inode);
+	clear_inode(inode);
 	return;
 }
 
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index d342128..acb138f 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -272,7 +272,7 @@ static void
 cifs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	cifs_fscache_release_inode_cookie(inode);
 }
 
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index 2870597..f181312 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -244,7 +244,7 @@ static void coda_put_super(struct super_block *sb)
 static void coda_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	coda_cache_clear_inode(inode);
 }
 
diff --git a/fs/ecryptfs/super.c b/fs/ecryptfs/super.c
index 2dd946b..e879cf8 100644
--- a/fs/ecryptfs/super.c
+++ b/fs/ecryptfs/super.c
@@ -133,7 +133,7 @@ static int ecryptfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 static void ecryptfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	iput(ecryptfs_inode_to_lower(inode));
 }
 
diff --git a/fs/exofs/inode.c b/fs/exofs/inode.c
index ea5e1f9..5badb0c 100644
--- a/fs/exofs/inode.c
+++ b/fs/exofs/inode.c
@@ -1473,7 +1473,7 @@ void exofs_evict_inode(struct inode *inode)
 		goto no_delete;
 
 	inode->i_size = 0;
-	end_writeback(inode);
+	clear_inode(inode);
 
 	/* if we are deleting an obj that hasn't been created yet, wait.
 	 * This also makes sure that create_done cannot be called with an
@@ -1503,5 +1503,5 @@ void exofs_evict_inode(struct inode *inode)
 	return;
 
 no_delete:
-	end_writeback(inode);
+	clear_inode(inode);
 }
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 740cad8..37b8bf6 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -90,7 +90,7 @@ void ext2_evict_inode(struct inode * inode)
 	}
 
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	ext2_discard_reservation(inode);
 	rsv = EXT2_I(inode)->i_block_alloc_info;
diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index 10d7812..ca5eb61 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -272,18 +272,18 @@ void ext3_evict_inode (struct inode *inode)
 	if (ext3_mark_inode_dirty(handle, inode)) {
 		/* If that failed, just dquot_drop() and be done with that */
 		dquot_drop(inode);
-		end_writeback(inode);
+		clear_inode(inode);
 	} else {
 		ext3_xattr_delete_inode(handle, inode);
 		dquot_free_inode(inode);
 		dquot_drop(inode);
-		end_writeback(inode);
+		clear_inode(inode);
 		ext3_free_inode(handle, inode);
 	}
 	ext3_journal_stop(handle);
 	return;
 no_delete:
-	end_writeback(inode);
+	clear_inode(inode);
 	dquot_drop(inode);
 }
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index ceebaf8..2484f56 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1007,7 +1007,7 @@ static void destroy_inodecache(void)
 void ext4_clear_inode(struct inode *inode)
 {
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	dquot_drop(inode);
 	ext4_discard_preallocations(inode);
 	if (EXT4_I(inode)->jinode) {
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 21687e3..b3d290c 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -454,7 +454,7 @@ static void fat_evict_inode(struct inode *inode)
 		fat_truncate_blocks(inode, 0);
 	}
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	fat_cache_inval_inode(inode);
 	fat_detach(inode);
 }
diff --git a/fs/freevxfs/vxfs_inode.c b/fs/freevxfs/vxfs_inode.c
index cf9ef91..ef67c95 100644
--- a/fs/freevxfs/vxfs_inode.c
+++ b/fs/freevxfs/vxfs_inode.c
@@ -355,6 +355,6 @@ void
 vxfs_evict_inode(struct inode *ip)
 {
 	truncate_inode_pages(&ip->i_data, 0);
-	end_writeback(ip);
+	clear_inode(ip);
 	call_rcu(&ip->i_rcu, vxfs_i_callback);
 }
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 539f36c..8d2fb8c 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -231,11 +231,8 @@ static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 
 static void inode_sync_complete(struct inode *inode)
 {
-	/*
-	 * Prevent speculative execution through
-	 * spin_unlock(&wb->list_lock);
-	 */
-
+	inode->i_state &= ~I_SYNC;
+	/* Waiters must see I_SYNC cleared before being woken up */
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
 }
@@ -329,10 +326,12 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 }
 
 /*
- * Wait for writeback on an inode to complete.
+ * Wait for writeback on an inode to complete. Called with i_lock held.
+ * Caller must make sure inode cannot go away when we drop i_lock.
  */
-static void inode_wait_for_writeback(struct inode *inode,
-				     struct bdi_writeback *wb)
+static void __inode_wait_for_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+	__acquires(inode->i_lock)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -340,70 +339,119 @@ static void inode_wait_for_writeback(struct inode *inode,
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
 
 /*
- * Write out an inode's dirty pages.  Called under wb->list_lock and
- * inode->i_lock.  Either the caller has an active reference on the inode or
- * the inode has I_WILL_FREE set.
- *
- * If `wait' is set, wait on the writeout.
- *
- * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
- * livelocks, etc.
+ * Wait for writeback on an inode to complete. Caller must have inode pinned.
  */
-static int
-writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+void inode_wait_for_writeback(struct inode *inode)
 {
-	struct address_space *mapping = inode->i_mapping;
-	long nr_to_write = wbc->nr_to_write;
-	unsigned dirty;
-	int ret;
+	spin_lock(&inode->i_lock);
+	__inode_wait_for_writeback(inode);
+	spin_unlock(&inode->i_lock);
+}
 
-	assert_spin_locked(&wb->list_lock);
-	assert_spin_locked(&inode->i_lock);
+/*
+ * Sleep until I_SYNC is cleared. This function must be called with i_lock
+ * held and drops it. It is aimed for callers not holding any inode reference
+ * so once i_lock is dropped, inode can go away.
+ */
+static void inode_sleep_on_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	int sleep;
 
-	if (!atomic_read(&inode->i_count))
-		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
-	else
-		WARN_ON(inode->i_state & I_WILL_FREE);
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	sleep = inode->i_state & I_SYNC;
+	spin_unlock(&inode->i_lock);
+	if (sleep)
+		schedule();
+	finish_wait(wqh, &wait);
+}
 
-	if (inode->i_state & I_SYNC) {
+/*
+ * Find proper writeback list for the inode depending on its current state and
+ * possibly also change of its state while we were doing writeback.  Here we
+ * handle things such as livelock prevention or fairness of writeback among
+ * inodes. This function can be called only by flusher thread - noone else
+ * processes all inodes in writeback lists and requeueing inodes behind flusher
+ * thread's back can have unexpected consequences.
+ */
+static void requeue_inode(struct inode *inode, struct bdi_writeback *wb,
+			  struct writeback_control *wbc)
+{
+	if (inode->i_state & I_FREEING)
+		return;
+
+	/*
+	 * Sync livelock prevention. Each inode is tagged and synced in one
+	 * shot. If still dirty, it will be redirty_tail()'ed below.  Update
+	 * the dirty time to prevent enqueue and sync it again.
+	 */
+	if ((inode->i_state & I_DIRTY) &&
+	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		inode->dirtied_when = jiffies;
+
+	if (wbc->pages_skipped) {
 		/*
-		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to b_more_io so that
-		 * writeback can proceed with the other inodes on s_io.
-		 *
-		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of b_io.
+		 * writeback is not making progress due to locked
+		 * buffers. Skip this inode for now.
 		 */
-		if (wbc->sync_mode != WB_SYNC_ALL) {
+		redirty_tail(inode, wb);
+		return;
+	}
+
+	if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+		/*
+		 * We didn't write back all the pages.  nfs_writepages()
+		 * sometimes bales out without doing anything.
+		 */
+		if (wbc->nr_to_write <= 0) {
+			/* Slice used up. Queue for next turn. */
 			requeue_io(inode, wb);
-			trace_writeback_single_inode_requeue(inode, wbc,
-							     nr_to_write);
-			return 0;
+		} else {
+			/*
+			 * Writeback blocked by something other than
+			 * congestion. Delay the inode for some time to
+			 * avoid spinning on the CPU (100% iowait)
+			 * retrying writeback of the dirty page/inode
+			 * that cannot be performed immediately.
+			 */
+			redirty_tail(inode, wb);
 		}
-
+	} else if (inode->i_state & I_DIRTY) {
 		/*
-		 * It's a data-integrity sync.  We must wait.
+		 * Filesystems can dirty the inode during writeback operations,
+		 * such as delayed allocation during submission or metadata
+		 * updates after data IO completion.
 		 */
-		inode_wait_for_writeback(inode, wb);
+		redirty_tail(inode, wb);
+	} else {
+		/* The inode is clean. Remove from writeback lists. */
+		list_del_init(&inode->i_wb_list);
 	}
+}
 
-	BUG_ON(inode->i_state & I_SYNC);
+/*
+ * Write out an inode and its dirty pages. Do not update the writeback list
+ * linkage. That is left to the caller. The caller is also responsible for
+ * setting I_SYNC flag and calling inode_sync_complete() to clear it.
+ */
+static int
+__writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+			 struct writeback_control *wbc)
+{
+	struct address_space *mapping = inode->i_mapping;
+	long nr_to_write = wbc->nr_to_write;
+	unsigned dirty;
+	int ret;
 
-	/* Set I_SYNC, reset I_DIRTY_PAGES */
-	inode->i_state |= I_SYNC;
-	inode->i_state &= ~I_DIRTY_PAGES;
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
+	WARN_ON(!(inode->i_state & I_SYNC));
 
 	ret = do_writepages(mapping, wbc);
 
@@ -424,6 +472,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * write_inode()
 	 */
 	spin_lock(&inode->i_lock);
+	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
@@ -433,60 +484,67 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		if (ret == 0)
 			ret = err;
 	}
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
+	return ret;
+}
+
+/*
+ * Write out an inode's dirty pages. Either the caller has an active reference
+ * on the inode or the inode has I_WILL_FREE set.
+ *
+ * This function is designed to be called for writing back one inode which
+ * we go e.g. from filesystem. Flusher thread uses __writeback_single_inode()
+ * and does more profound writeback list handling in writeback_sb_inodes().
+ */
+static int
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
+{
+	int ret = 0;
 
-	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
+	if (!atomic_read(&inode->i_count))
+		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
+	else
+		WARN_ON(inode->i_state & I_WILL_FREE);
+
+	if (inode->i_state & I_SYNC) {
+		if (wbc->sync_mode != WB_SYNC_ALL)
+			goto out;
 		/*
-		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * It's a data-integrity sync. We must wait. Since callers hold
+		 * inode reference or inode has I_WILL_FREE set, it cannot go
+		 * away under us.
 		 */
-		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
-			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
-			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
-			 */
-			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Filesystems can dirty the inode during writeback
-			 * operations, such as delayed allocation during
-			 * submission or metadata updates after data IO
-			 * completion.
-			 */
-			redirty_tail(inode, wb);
-		} else {
-			/*
-			 * The inode is clean.  At this point we either have
-			 * a reference to the inode or it's on it's way out.
-			 * No need to add it back to the LRU.
-			 */
-			list_del_init(&inode->i_wb_list);
-		}
+		__inode_wait_for_writeback(inode);
 	}
+	WARN_ON(inode->i_state & I_SYNC);
+	/*
+	 * Skip inode if it is clean. We don't want to mess with writeback
+	 * lists in this function since flusher thread may be doing for example
+	 * sync in parallel and if we move the inode, it could get skipped. So
+	 * here we make sure inode is on some writeback list and leave it there
+	 * unless we have completely cleaned the inode.
+	 */
+	if (!(inode->i_state & I_DIRTY))
+		goto out;
+	inode->i_state |= I_SYNC;
+	spin_unlock(&inode->i_lock);
+
+	ret = __writeback_single_inode(inode, wb, wbc);
+
+	spin_lock(&wb->list_lock);
+	spin_lock(&inode->i_lock);
+	/*
+	 * If inode is clean, remove it from writeback lists. Otherwise don't
+	 * touch it. See comment above for explanation.
+	 */
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
-	trace_writeback_single_inode(inode, wbc, nr_to_write);
+out:
+	spin_unlock(&inode->i_lock);
 	return ret;
 }
 
@@ -580,29 +638,57 @@ static long writeback_sb_inodes(struct super_block *sb,
 			redirty_tail(inode, wb);
 			continue;
 		}
-		__iget(inode);
+		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+			/*
+			 * If this inode is locked for writeback and we are not
+			 * doing writeback-for-data-integrity, move it to
+			 * b_more_io so that writeback can proceed with the
+			 * other inodes on s_io.
+			 *
+			 * We'll have another go at writing back this inode
+			 * when we completed a full scan of b_io.
+			 */
+			spin_unlock(&inode->i_lock);
+			requeue_io(inode, wb);
+			trace_writeback_sb_inodes_requeue(inode);
+			continue;
+		}
+		spin_unlock(&wb->list_lock);
+
+		/*
+		 * We already requeued the inode if it had I_SYNC set and we
+		 * are doing WB_SYNC_NONE writeback. So this catches only the
+		 * WB_SYNC_ALL case.
+		 */
+		if (inode->i_state & I_SYNC) {
+			/* Wait for I_SYNC. This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
+			/* Inode may be gone, start again */
+			continue;
+		}
+		inode->i_state |= I_SYNC;
+		spin_unlock(&inode->i_lock);
+
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
-		writeback_single_inode(inode, wb, &wbc);
+		/*
+		 * We use I_SYNC to pin the inode in memory. While it is set
+		 * evict_inode() will wait so the inode cannot be freed.
+		 */
+		__writeback_single_inode(inode, wb, &wbc);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+		spin_lock(&wb->list_lock);
+		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode, wb);
-		}
+		requeue_inode(inode, wb, &wbc);
+		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&wb->list_lock);
-		iput(inode);
-		cond_resched();
-		spin_lock(&wb->list_lock);
+		cond_resched_lock(&wb->list_lock);
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
@@ -796,8 +882,10 @@ static long wb_writeback(struct bdi_writeback *wb,
 			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
+			spin_unlock(&wb->list_lock);
+			/* This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
+			spin_lock(&wb->list_lock);
 		}
 	}
 	spin_unlock(&wb->list_lock);
@@ -1331,7 +1419,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
 int write_inode_now(struct inode *inode, int sync)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -1343,12 +1430,7 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&wb->list_lock);
-	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, &wbc);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
-	return ret;
+	return writeback_single_inode(inode, wb, &wbc);
 }
 EXPORT_SYMBOL(write_inode_now);
 
@@ -1365,15 +1447,7 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-	int ret;
-
-	spin_lock(&wb->list_lock);
-	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, wbc);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
-	return ret;
+	return writeback_single_inode(inode, &inode_to_bdi(inode)->wb, wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4aec599..87e6115 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -122,7 +122,7 @@ static void fuse_destroy_inode(struct inode *inode)
 static void fuse_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (inode->i_sb->s_flags & MS_ACTIVE) {
 		struct fuse_conn *fc = get_fuse_conn(inode);
 		struct fuse_inode *fi = get_fuse_inode(inode);
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 6172fa7..713e621 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -1554,7 +1554,7 @@ out_unlock:
 out:
 	/* Case 3 starts here */
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	gfs2_dir_hash_inval(ip);
 	ip->i_gl->gl_object = NULL;
 	flush_delayed_work_sync(&ip->i_gl->gl_work);
diff --git a/fs/hfs/inode.c b/fs/hfs/inode.c
index 737dbeb..761ec06 100644
--- a/fs/hfs/inode.c
+++ b/fs/hfs/inode.c
@@ -532,7 +532,7 @@ out:
 void hfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (HFS_IS_RSRC(inode) && HFS_I(inode)->rsrc_inode) {
 		HFS_I(HFS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
 		iput(HFS_I(inode)->rsrc_inode);
diff --git a/fs/hfsplus/super.c b/fs/hfsplus/super.c
index ceb1c28..a9bca4b 100644
--- a/fs/hfsplus/super.c
+++ b/fs/hfsplus/super.c
@@ -154,7 +154,7 @@ static void hfsplus_evict_inode(struct inode *inode)
 {
 	dprint(DBG_INODE, "hfsplus_evict_inode: %lu\n", inode->i_ino);
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (HFSPLUS_IS_RSRC(inode)) {
 		HFSPLUS_I(HFSPLUS_I(inode)->rsrc_inode)->rsrc_inode = NULL;
 		iput(HFSPLUS_I(inode)->rsrc_inode);
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index 07c516b..2afa5bb 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -240,7 +240,7 @@ static struct inode *hostfs_alloc_inode(struct super_block *sb)
 static void hostfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (HOSTFS_I(inode)->fd != -1) {
 		close_file(&HOSTFS_I(inode)->fd);
 		HOSTFS_I(inode)->fd = -1;
diff --git a/fs/hpfs/inode.c b/fs/hpfs/inode.c
index 3b2cec2..b43066c 100644
--- a/fs/hpfs/inode.c
+++ b/fs/hpfs/inode.c
@@ -299,7 +299,7 @@ void hpfs_write_if_changed(struct inode *inode)
 void hpfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (!inode->i_nlink) {
 		hpfs_lock(inode->i_sb);
 		hpfs_remove_fnode(inode->i_sb, inode->i_ino);
diff --git a/fs/hppfs/hppfs.c b/fs/hppfs/hppfs.c
index a80e45a..d4f93b5 100644
--- a/fs/hppfs/hppfs.c
+++ b/fs/hppfs/hppfs.c
@@ -614,7 +614,7 @@ static struct inode *hppfs_alloc_inode(struct super_block *sb)
 
 void hppfs_evict_inode(struct inode *ino)
 {
-	end_writeback(ino);
+	clear_inode(ino);
 	dput(HPPFS_I(ino)->proc_dentry);
 	mntput(ino->i_sb->s_fs_info);
 }
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 28cf06e..568193d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -393,7 +393,7 @@ static void truncate_hugepages(struct inode *inode, loff_t lstart)
 static void hugetlbfs_evict_inode(struct inode *inode)
 {
 	truncate_hugepages(inode, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 }
 
 static inline void
diff --git a/fs/inode.c b/fs/inode.c
index 9f4f5fe..f4e1450 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -486,7 +486,7 @@ void __remove_inode_hash(struct inode *inode)
 }
 EXPORT_SYMBOL(__remove_inode_hash);
 
-void end_writeback(struct inode *inode)
+void clear_inode(struct inode *inode)
 {
 	might_sleep();
 	/*
@@ -500,11 +500,10 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
-	inode_sync_wait(inode);
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
-EXPORT_SYMBOL(end_writeback);
+EXPORT_SYMBOL(clear_inode);
 
 /*
  * Free the inode passed in, removing it from the lists it is still connected
@@ -531,12 +530,20 @@ static void evict(struct inode *inode)
 
 	inode_sb_list_del(inode);
 
+	/*
+	 * Wait for flusher thread to be done with the inode so that filesystem
+	 * does not start destroying it while writeback is still running. Since
+	 * the inode has I_FREEING set, flusher thread won't start new work on
+	 * the inode.  We just have to wait for running writeback to finish.
+	 */
+	inode_wait_for_writeback(inode);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
 		if (inode->i_data.nrpages)
 			truncate_inode_pages(&inode->i_data, 0);
-		end_writeback(inode);
+		clear_inode(inode);
 	}
 	if (S_ISBLK(inode->i_mode) && inode->i_bdev)
 		bd_forget(inode);
diff --git a/fs/jffs2/fs.c b/fs/jffs2/fs.c
index bb6f993..3d3092e 100644
--- a/fs/jffs2/fs.c
+++ b/fs/jffs2/fs.c
@@ -240,7 +240,7 @@ void jffs2_evict_inode (struct inode *inode)
 	jffs2_dbg(1, "%s(): ino #%lu mode %o\n",
 		  __func__, inode->i_ino, inode->i_mode);
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	jffs2_do_clear_inode(c, f);
 }
 
diff --git a/fs/jfs/inode.c b/fs/jfs/inode.c
index 77b69b2..4692bf3 100644
--- a/fs/jfs/inode.c
+++ b/fs/jfs/inode.c
@@ -169,7 +169,7 @@ void jfs_evict_inode(struct inode *inode)
 	} else {
 		truncate_inode_pages(&inode->i_data, 0);
 	}
-	end_writeback(inode);
+	clear_inode(inode);
 	dquot_drop(inode);
 }
 
diff --git a/fs/logfs/readwrite.c b/fs/logfs/readwrite.c
index e3ab5e5..f1cb512 100644
--- a/fs/logfs/readwrite.c
+++ b/fs/logfs/readwrite.c
@@ -2175,7 +2175,7 @@ void logfs_evict_inode(struct inode *inode)
 		}
 	}
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	/* Cheaper version of write_inode.  All changes are concealed in
 	 * aliases, which are moved back.  No write to the medium happens.
diff --git a/fs/minix/inode.c b/fs/minix/inode.c
index fcb05d2..2a503ad 100644
--- a/fs/minix/inode.c
+++ b/fs/minix/inode.c
@@ -32,7 +32,7 @@ static void minix_evict_inode(struct inode *inode)
 		minix_truncate(inode);
 	}
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (!inode->i_nlink)
 		minix_free_inode(inode);
 }
diff --git a/fs/ncpfs/inode.c b/fs/ncpfs/inode.c
index 87484fb..333df07 100644
--- a/fs/ncpfs/inode.c
+++ b/fs/ncpfs/inode.c
@@ -292,7 +292,7 @@ static void
 ncp_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	if (S_ISDIR(inode->i_mode)) {
 		DDPRINTK("ncp_evict_inode: put directory %ld\n", inode->i_ino);
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index e8bbfa5..c607313 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -121,7 +121,7 @@ static void nfs_clear_inode(struct inode *inode)
 void nfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	nfs_clear_inode(inode);
 }
 
@@ -1500,7 +1500,7 @@ static int nfs_update_inode(struct inode *inode, struct nfs_fattr *fattr)
 void nfs4_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	pnfs_return_layout(inode);
 	pnfs_destroy_layout(NFS_I(inode));
 	/* If we are holding a delegation, return it! */
diff --git a/fs/nilfs2/inode.c b/fs/nilfs2/inode.c
index 8f7b95a..7cc6446 100644
--- a/fs/nilfs2/inode.c
+++ b/fs/nilfs2/inode.c
@@ -734,7 +734,7 @@ void nilfs_evict_inode(struct inode *inode)
 	if (inode->i_nlink || !ii->i_root || unlikely(is_bad_inode(inode))) {
 		if (inode->i_data.nrpages)
 			truncate_inode_pages(&inode->i_data, 0);
-		end_writeback(inode);
+		clear_inode(inode);
 		nilfs_clear_inode(inode);
 		return;
 	}
@@ -746,7 +746,7 @@ void nilfs_evict_inode(struct inode *inode)
 	/* TODO: some of the following operations may fail.  */
 	nilfs_truncate_bmap(ii, 0);
 	nilfs_mark_inode_dirty(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	ret = nilfs_ifile_delete_inode(ii->i_root->ifile, inode->i_ino);
 	if (!ret)
diff --git a/fs/ntfs/inode.c b/fs/ntfs/inode.c
index 2eaa666..c6dbd3d 100644
--- a/fs/ntfs/inode.c
+++ b/fs/ntfs/inode.c
@@ -2258,7 +2258,7 @@ void ntfs_evict_big_inode(struct inode *vi)
 	ntfs_inode *ni = NTFS_I(vi);
 
 	truncate_inode_pages(&vi->i_data, 0);
-	end_writeback(vi);
+	clear_inode(vi);
 
 #ifdef NTFS_RW
 	if (NInoDirty(ni)) {
diff --git a/fs/ocfs2/dlmfs/dlmfs.c b/fs/ocfs2/dlmfs/dlmfs.c
index 3b5825e..e31d6ae 100644
--- a/fs/ocfs2/dlmfs/dlmfs.c
+++ b/fs/ocfs2/dlmfs/dlmfs.c
@@ -367,7 +367,7 @@ static void dlmfs_evict_inode(struct inode *inode)
 	int status;
 	struct dlmfs_inode_private *ip;
 
-	end_writeback(inode);
+	clear_inode(inode);
 
 	mlog(0, "inode %lu\n", inode->i_ino);
 
diff --git a/fs/ocfs2/inode.c b/fs/ocfs2/inode.c
index 17454a9..735514c 100644
--- a/fs/ocfs2/inode.c
+++ b/fs/ocfs2/inode.c
@@ -1069,7 +1069,7 @@ static void ocfs2_clear_inode(struct inode *inode)
 	int status;
 	struct ocfs2_inode_info *oi = OCFS2_I(inode);
 
-	end_writeback(inode);
+	clear_inode(inode);
 	trace_ocfs2_clear_inode((unsigned long long)oi->ip_blkno,
 				inode->i_nlink);
 
diff --git a/fs/omfs/inode.c b/fs/omfs/inode.c
index dbc8422..e6213b3 100644
--- a/fs/omfs/inode.c
+++ b/fs/omfs/inode.c
@@ -184,7 +184,7 @@ int omfs_sync_inode(struct inode *inode)
 static void omfs_evict_inode(struct inode *inode)
 {
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	if (inode->i_nlink)
 		return;
diff --git a/fs/proc/inode.c b/fs/proc/inode.c
index 205c922..29ab406 100644
--- a/fs/proc/inode.c
+++ b/fs/proc/inode.c
@@ -33,7 +33,7 @@ static void proc_evict_inode(struct inode *inode)
 	const struct proc_ns_operations *ns_ops;
 
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	/* Stop tracking associated processes */
 	put_pid(PROC_I(inode)->pid);
diff --git a/fs/pstore/inode.c b/fs/pstore/inode.c
index 1950788..aeb19e6 100644
--- a/fs/pstore/inode.c
+++ b/fs/pstore/inode.c
@@ -85,7 +85,7 @@ static void pstore_evict_inode(struct inode *inode)
 	struct pstore_private	*p = inode->i_private;
 	unsigned long		flags;
 
-	end_writeback(inode);
+	clear_inode(inode);
 	if (p) {
 		spin_lock_irqsave(&allpstore_lock, flags);
 		list_del(&p->list);
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 494c315..59d0687 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -76,14 +76,14 @@ void reiserfs_evict_inode(struct inode *inode)
 		;
 	}
       out:
-	end_writeback(inode);	/* note this must go after the journal_end to prevent deadlock */
+	clear_inode(inode);	/* note this must go after the journal_end to prevent deadlock */
 	dquot_drop(inode);
 	inode->i_blocks = 0;
 	reiserfs_write_unlock_once(inode->i_sb, depth);
 	return;
 
 no_delete:
-	end_writeback(inode);
+	clear_inode(inode);
 	dquot_drop(inode);
 }
 
diff --git a/fs/sysfs/inode.c b/fs/sysfs/inode.c
index feb2d69..b8ce6a9 100644
--- a/fs/sysfs/inode.c
+++ b/fs/sysfs/inode.c
@@ -310,7 +310,7 @@ void sysfs_evict_inode(struct inode *inode)
 	struct sysfs_dirent *sd  = inode->i_private;
 
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	sysfs_put(sd);
 }
 
diff --git a/fs/sysv/inode.c b/fs/sysv/inode.c
index 3da5ce2..08d0b25 100644
--- a/fs/sysv/inode.c
+++ b/fs/sysv/inode.c
@@ -316,7 +316,7 @@ static void sysv_evict_inode(struct inode *inode)
 		sysv_truncate(inode);
 	}
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (!inode->i_nlink)
 		sysv_free_inode(inode);
 }
diff --git a/fs/ubifs/super.c b/fs/ubifs/super.c
index 76e4e05..7bf60ae 100644
--- a/fs/ubifs/super.c
+++ b/fs/ubifs/super.c
@@ -378,7 +378,7 @@ out:
 		smp_wmb();
 	}
 done:
-	end_writeback(inode);
+	clear_inode(inode);
 }
 
 static void ubifs_dirty_inode(struct inode *inode, int flags)
diff --git a/fs/udf/inode.c b/fs/udf/inode.c
index 7d75280..873e1ba 100644
--- a/fs/udf/inode.c
+++ b/fs/udf/inode.c
@@ -80,7 +80,7 @@ void udf_evict_inode(struct inode *inode)
 	} else
 		truncate_inode_pages(&inode->i_data, 0);
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 	if (iinfo->i_alloc_type != ICBTAG_FLAG_AD_IN_ICB &&
 	    inode->i_size != iinfo->i_lenExtents) {
 		udf_warn(inode->i_sb, "Inode %lu (mode %o) has inode size %llu different from extent length %llu. Filesystem need not be standards compliant.\n",
diff --git a/fs/ufs/inode.c b/fs/ufs/inode.c
index 7cdd395..dd7c89d 100644
--- a/fs/ufs/inode.c
+++ b/fs/ufs/inode.c
@@ -895,7 +895,7 @@ void ufs_evict_inode(struct inode * inode)
 	}
 
 	invalidate_inode_buffers(inode);
-	end_writeback(inode);
+	clear_inode(inode);
 
 	if (want_delete) {
 		lock_ufs(inode->i_sb);
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index dab9a5f..5b806f2 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -926,7 +926,7 @@ xfs_fs_evict_inode(
 	trace_xfs_evict_inode(ip);
 
 	truncate_inode_pages(&inode->i_data, 0);
-	end_writeback(inode);
+	clear_inode(inode);
 	XFS_STATS_INC(vn_rele);
 	XFS_STATS_INC(vn_remove);
 	XFS_STATS_DEC(vn_active);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 8de6755..1c71e7f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1744,8 +1744,8 @@ struct super_operations {
  * I_FREEING		Set when inode is about to be freed but still has dirty
  *			pages or buffers attached or the inode itself is still
  *			dirty.
- * I_CLEAR		Added by end_writeback().  In this state the inode is clean
- *			and can be destroyed.  Inode keeps I_FREEING.
+ * I_CLEAR		Added by clear_inode().  In this state the inode is
+ *			clean and can be destroyed.  Inode keeps I_FREEING.
  *
  *			Inodes that are I_WILL_FREE, I_FREEING or I_CLEAR are
  *			prohibited for many purposes.  iget() must wait for
@@ -1753,9 +1753,10 @@ struct super_operations {
  *			anew.  Other functions will just ignore such inodes,
  *			if appropriate.  I_NEW is used for waiting.
  *
- * I_SYNC		Synchonized write of dirty inode data.  The bits is
- *			set during data writeback, and cleared with a wakeup
- *			on the bit address once it is done.
+ * I_SYNC		Writeback of inode is running. The bit is set during
+ *			data writeback, and cleared with a wakeup on the bit
+ *			address once it is done. The bit is also used to pin
+ *			the inode in memory for flusher thread.
  *
  * I_REFERENCED		Marks the inode as recently references on the LRU list.
  *
@@ -2328,7 +2329,7 @@ extern unsigned int get_next_ino(void);
 
 extern void __iget(struct inode * inode);
 extern void iget_failed(struct inode *);
-extern void end_writeback(struct inode *);
+extern void clear_inode(struct inode *);
 extern void __destroy_inode(struct inode *);
 extern struct inode *new_inode_pseudo(struct super_block *sb);
 extern struct inode *new_inode(struct super_block *sb);
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a2b84f5..6d0a0fc 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -58,7 +58,6 @@ extern const char *wb_reason_name[];
  * in a manner such that unspecified fields are set to zero.
  */
 struct writeback_control {
-	enum writeback_sync_modes sync_mode;
 	long nr_to_write;		/* Write this many pages, and decrement
 					   this for each page written */
 	long pages_skipped;		/* Pages which were not written */
@@ -71,6 +70,8 @@ struct writeback_control {
 	loff_t range_start;
 	loff_t range_end;
 
+	enum writeback_sync_modes sync_mode;
+
 	unsigned for_kupdate:1;		/* A kupdate writeback */
 	unsigned for_background:1;	/* A background writeback */
 	unsigned tagged_writepages:1;	/* tag-and-write to avoid livelock */
@@ -94,6 +95,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void inode_wait_for_writeback(struct inode *inode);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
@@ -101,12 +103,6 @@ static inline void wait_on_inode(struct inode *inode)
 	might_sleep();
 	wait_on_bit(&inode->i_state, __I_NEW, inode_wait, TASK_UNINTERRUPTIBLE);
 }
-static inline void inode_sync_wait(struct inode *inode)
-{
-	might_sleep();
-	wait_on_bit(&inode->i_state, __I_SYNC, inode_wait,
-							TASK_UNINTERRUPTIBLE);
-}
 
 
 /*
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 7b81887..b453d92 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -372,6 +372,35 @@ TRACE_EVENT(balance_dirty_pages,
 	  )
 );
 
+TRACE_EVENT(writeback_sb_inodes_requeue,
+
+	TP_PROTO(struct inode *inode),
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, dirtied_when)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+		        dev_name(inode_to_bdi(inode)->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->dirtied_when	= inode->dirtied_when;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->dirtied_when,
+		  (jiffies - __entry->dirtied_when) / HZ
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
@@ -450,13 +479,6 @@ DECLARE_EVENT_CLASS(writeback_single_inode_template,
 	)
 );
 
-DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
-	TP_PROTO(struct inode *inode,
-		 struct writeback_control *wbc,
-		 unsigned long nr_to_write),
-	TP_ARGS(inode, wbc, nr_to_write)
-);
-
 DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
 	TP_PROTO(struct inode *inode,
 		 struct writeback_control *wbc,
diff --git a/ipc/mqueue.c b/ipc/mqueue.c
index 28bd64d..0032d9c 100644
--- a/ipc/mqueue.c
+++ b/ipc/mqueue.c
@@ -249,7 +249,7 @@ static void mqueue_evict_inode(struct inode *inode)
 	int i;
 	struct ipc_namespace *ipc_ns;
 
-	end_writeback(inode);
+	clear_inode(inode);
 
 	if (S_ISDIR(inode->i_mode))
 		return;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 26adea8..93d8d2f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -204,7 +204,7 @@ static unsigned long highmem_dirtyable_memory(unsigned long total)
  * Returns the global number of pages potentially available for dirty
  * page cache.  This is the base value for the global dirty limits.
  */
-unsigned long global_dirtyable_memory(void)
+static unsigned long global_dirtyable_memory(void)
 {
 	unsigned long x;
 
@@ -1568,6 +1568,7 @@ void writeback_set_ratelimit(void)
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	global_dirty_limits(&background_thresh, &dirty_thresh);
+	global_dirty_limit = dirty_thresh;
 	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
 	if (ratelimit_pages < 16)
 		ratelimit_pages = 16;
diff --git a/mm/shmem.c b/mm/shmem.c
index f99ff3e..68412fa 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -597,7 +597,7 @@ static void shmem_evict_inode(struct inode *inode)
 	}
 	BUG_ON(inode->i_blocks);
 	shmem_free_inode(inode->i_sb);
-	end_writeback(inode);
+	clear_inode(inode);
 }
 
 /*

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [GIT PULL] writeback changes for 3.5-rc1
  2012-05-28 11:41 [GIT PULL] writeback changes for 3.5-rc1 Fengguang Wu
@ 2012-05-28 17:09 ` Linus Torvalds
  2012-05-29 15:57     ` Fengguang Wu
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2012-05-28 17:09 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: LKML

Ok, pulled.

However, I have an independent question for you - have you looked at
any kind of per-file write-behind kind of logic?

The reason I ask is that pretty much every time I write some big file
(usually when over-writing a harddisk), I tend to use my own hackish
model, which looks like this:

#define BUFSIZE (8*1024*1024ul)

        ...
        for (..) {
                ...
                if (write(fd, buffer, BUFSIZE) != BUFSIZE)
                        break;
                sync_file_range(fd, index*BUFSIZE, BUFSIZE,
                                SYNC_FILE_RANGE_WRITE);
                if (index)
                        sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
                                        SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
                ....

and it tends to be *beautiful* for both disk IO performance and for
system responsiveness while the big write is in progress.

And I'm wondering if we couldn't expose this kind of write-behind
logic from the kernel. Sure, it only works for the "contiguous write
of a single large file" model, but that model isn't actually all
*that* unusual.

Right now all the write-back logic is based on the
balance_dirty_pages() model, which is more of a global dirty model.
Which obviously is needed too - this isn't an "either or" kind of
thing, it's more of a "maybe we could have a streaming detector *and*
the 'random writes' code". So I was wondering if anybody had ever been
looking more at an explicit write-behind model that uses the same kind
of "per-file window" that the read-ahead code does.

(The above code only works well for known streaming writes, but the
*model* of saying "ok, let's start writeout for the previous streaming
block, and then wait for the writeout of the streaming block before
that" really does tend to result in very smooth IO and minimal
disruption of other processes..)

                      Linus

On Mon, May 28, 2012 at 4:41 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Please pull the writeback changes, mainly from Jan Kara to avoid
> iput() in the flusher threads.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* write-behind on streaming writes
  2012-05-28 17:09 ` Linus Torvalds
@ 2012-05-29 15:57     ` Fengguang Wu
  0 siblings, 0 replies; 38+ messages in thread
From: Fengguang Wu @ 2012-05-29 15:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

Hi Linus,

On Mon, May 28, 2012 at 10:09:56AM -0700, Linus Torvalds wrote:
> Ok, pulled.
> 
> However, I have an independent question for you - have you looked at
> any kind of per-file write-behind kind of logic?

Yes, definitely.  It is especially beneficial for NFS to keep each
file's dirty page count low, because in NFS a simple stat() requires
flushing all of the file's dirty pages before it can proceed.

However, in general there have been no strong user requests for this
feature. I guess that is mainly because users still have the choice of
O_SYNC or O_DIRECT.

Actually, O_SYNC comes pretty close to the code below for the purpose
of limiting dirty and writeback pages, except that it's not on by
default and hence means nothing for normal users.

> The reason I ask is that pretty much every time I write some big file
> (usually when over-writing a harddisk), I tend to use my own hackish
> model, which looks like this:
> 
> #define BUFSIZE (8*1024*1024ul)
> 
>         ...
>         for (..) {
>                 ...
>                 if (write(fd, buffer, BUFSIZE) != BUFSIZE)
>                         break;
>                 sync_file_range(fd, index*BUFSIZE, BUFSIZE,
>                                 SYNC_FILE_RANGE_WRITE);
>                 if (index)
>                         sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE,
>                                         SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
>                 ....
> 
> and it tends to be *beautiful* for both disk IO performance and for
> system responsiveness while the big write is in progress.

It seems to me to be all about optimizing the 1-dd case for desktop
users, and the most beautiful thing about per-file write-behind is
that it keeps both the number of dirty and writeback pages in the
system low when there are only one or two sequential dirtier tasks,
which is good for responsiveness.

Note that the above user space code won't work well when there are 10+
dirtier tasks. It effectively creates 10+ IO submitters working on
different regions of the disk and thus creates lots of seeks. When
there are 10+ dirtier tasks, it's not only desirable to have one
single flusher thread submit all the IO, but also for the flusher to
work on the inodes with a large write chunk size.

I happen to have some numbers comparing the current adaptive
(write_bandwidth/2=50MB) and the old fixed 4MB write chunk sizes on
XFS (ext4 was not chosen because it internally enforces a >=128MB
chunk size).  It's basically a 4% performance drop in the 1-dd case
and up to 20% in the 100-dd case.

  3.4.0-rc2             3.4.0-rc2-4M+
-----------  ------------------------  
     114.02        -4.2%       109.23  snb/thresh=8G/xfs-1dd-1-3.4.0-rc2
     102.25       -11.7%        90.24  snb/thresh=8G/xfs-10dd-1-3.4.0-rc2
     104.17       -17.5%        85.91  snb/thresh=8G/xfs-20dd-1-3.4.0-rc2
     104.94       -18.7%        85.28  snb/thresh=8G/xfs-30dd-1-3.4.0-rc2
     104.76       -21.9%        81.82  snb/thresh=8G/xfs-100dd-1-3.4.0-rc2

So we probably still want to keep the chunk size at 0.5s worth of data.

> And I'm wondering if we couldn't expose this kind of write-behind
> logic from the kernel. Sure, it only works for the "contiguous write
> of a single large file" model, but that model isn't actually all
> *that* unusual.
> 
> Right now all the write-back logic is based on the
> balance_dirty_pages() model, which is more of a global dirty model.
> Which obviously is needed too - this isn't an "either or" kind of
> thing, it's more of a "maybe we could have a streaming detector *and*
> the 'random writes' code". So I was wondering if anybody had ever been
> looking more at an explicit write-behind model that uses the same kind
> of "per-file window" that the read-ahead code does.

I can imagine it being implemented in the kernel this way:

streaming write detector in balance_dirty_pages():

        if (not globally throttled &&
            is streaming writer &&
            it's crossing the N+1 boundary) {
                queue writeback work for chunk N to the flusher
                wait for work completion
        }

The good thing is that it does not look like a complex addition.
However, the potential problem is that the "wait for work completion"
part has no guaranteed completion time, especially when there are
multiple dd tasks. This could result in uncontrollable delays in the
write() syscall. So we may do this instead:

-               wait for work completion
+               sleep for (chunk_size/write_bandwidth)

To avoid long write() delays, we might further split the one big 0.5s
sleep into smaller sleeps.

> (The above code only works well for known streaming writes, but the
> *model* of saying "ok, let's start writeout for the previous streaming
> block, and then wait for the writeout of the streaming block before
> that" really does tend to result in very smooth IO and minimal
> disruption of other processes..)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-05-29 15:57     ` Fengguang Wu
@ 2012-05-29 17:35       ` Linus Torvalds
  -1 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2012-05-29 17:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: LKML, Myklebust, Trond, linux-fsdevel, Linux Memory Management List

On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
>
> Actually O_SYNC is pretty close to the below code for the purpose of
> limiting the dirty and writeback pages, except that it's not on by
> default, hence means nothing for normal users.

Absolutely not.

O_SYNC syncs the *current* write, syncs your metadata, and just
generally makes your writer synchronous. It's just a f*cking moronic
idea. Nobody sane ever uses it, since you are much better off just
using fsync() if you want that kind of behavior. That's one of those
"stupid legacy flags" things that have no sane use.

The whole point is that doing that is never the right thing to do. You
want to sync *past* writes, and you never ever want to wait on them
unless you just sent more (newer) writes to the disk that you are
*not* waiting on - so that you always have more IO pending.

O_SYNC is the absolutely anti-thesis of that kind of "multiple levels
of overlapping IO". Because it requires that the IO is _done_ by the
time you start more, which is against the whole point.

> It seems to me all about optimizing the 1-dd case for desktop users,
> and the most beautiful thing about per-file write behind is, it keeps
> both the number of dirty and writeback pages low in the system when
> there are only one or two sequential dirtier tasks. Which is good for
> responsiveness.

Yes, but I don't think it's about a single-dd case - it's about just
trying to handle one common case (streaming writes) efficiently and
naturally. Try to get those out of the system so that you can then
worry about the *other* cases knowing that they don't have that kind
of big streaming behavior.

For example, right now our main top-level writeback logic is *not*
about streaming writes (just dirty counts), but then we try to "find"
the locality by making the lower-level writeback do the whole "write
back by chunking inodes" without really having any higher-level
information.

I just suspect that we'd be better off teaching upper levels about the
streaming. I know for a fact that if I do it by hand, system
responsiveness was *much* better, and IO throughput didn't go down at
all.

> Note that the above user space code won't work well when there are 10+
> dirtier tasks. It effectively creates 10+ IO submitters on different
> regions of the disk and thus create lots of seeks.

Not really much more than our current writeback code does. It
*schedules* data for writing, but doesn't wait for it until much
later.

You seem to think it was synchronous. It's not. Look at the second
sync_file_range() thing, and the important part is the "index-1". The
fact that you confused this with O_SYNC seems to be the same thing.
This has absolutely *nothing* to do with O_SYNC.
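The two-call pattern being described can be sketched in userspace like
this. This is a reconstruction of the idea, not the exact code posted
earlier in the thread; the 8MB chunk size is the value suggested below,
and the helper name is made up for illustration:

```c
/*
 * Sketch of the write-behind pattern discussed in this thread
 * (not the exact code posted earlier): kick off async writeback
 * of the chunk just written, and only wait on the chunk *before*
 * it, so there is always newer IO in flight that we are not
 * waiting on.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8UL << 20)   /* 8MB write-behind chunk */

/* Call after each CHUNK of data has been written at chunk 'index'. */
static int write_behind(int fd, unsigned long index)
{
	/* Start async writeback of the chunk we just wrote ... */
	if (sync_file_range(fd, index * CHUNK, CHUNK,
			    SYNC_FILE_RANGE_WRITE) < 0)
		return -1;
	/* ... and wait only for the *previous* chunk ("index-1"),
	 * which was submitted on the last call and is likely done
	 * by now, so this rarely blocks for long. */
	if (index > 0 &&
	    sync_file_range(fd, (index - 1) * CHUNK, CHUNK,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
		return -1;
	return 0;
}
```

The key property is the pipelining: at any moment one chunk is being
written to disk while the next is being dirtied, so the disk never idles
and the writer never waits on its newest data.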

The other important part is that the chunk size is fairly large. We do
read-ahead in 64k kind of chunks; for the write-behind chunking to
make sense it needs to be in "multiple megabytes".  8MB is probably
the minimum size that makes sense.

The write-behind would be for things like people writing disk images
and video files. Not for random IO in smaller chunks.

                       Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-05-29 17:35       ` Linus Torvalds
@ 2012-05-30  3:21         ` Fengguang Wu
  -1 siblings, 0 replies; 38+ messages in thread
From: Fengguang Wu @ 2012-05-30  3:21 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

Linus,

On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> >
> > Actually O_SYNC is pretty close to the below code for the purpose of
> > limiting the dirty and writeback pages, except that it's not on by
> > default, hence means nothing for normal users.
> 
> Absolutely not.
> 
> O_SYNC syncs the *current* write, syncs your metadata, and just
> generally makes your writer synchronous. It's just a f*cking moronic
> idea. Nobody sane ever uses it, since you are much better off just
> using fsync() if you want that kind of behavior. That's one of those
> "stupid legacy flags" things that have no sane use.
> 
> The whole point is that doing that is never the right thing to do. You
> want to sync *past* writes, and you never ever want to wait on them
> unless you just sent more (newer) writes to the disk that you are
> *not* waiting on - so that you always have more IO pending.
> 
> O_SYNC is the absolute antithesis of that kind of "multiple levels
> of overlapping IO", because it requires that the IO is _done_ by the
> time you start more, which is against the whole point.

Yeah, O_SYNC is not really the sane thing to use.  Thanks for teaching
me this in such great detail!

> > It seems to me all about optimizing the 1-dd case for desktop users,
> > and the most beautiful thing about per-file write behind is, it keeps
> > both the number of dirty and writeback pages low in the system when
> > there are only one or two sequential dirtier tasks. Which is good for
> > responsiveness.
> 
> Yes, but I don't think it's about a single-dd case - it's about just
> trying to handle one common case (streaming writes) efficiently and
> naturally. Try to get those out of the system so that you can then
> worry about the *other* cases knowing that they don't have that kind
> of big streaming behavior.
> 
> For example, right now our main top-level writeback logic is *not*
> about streaming writes (just dirty counts), but then we try to "find"
> the locality by making the lower-level writeback do the whole "write
> back by chunking inodes" without really having any higher-level
> information.

Agreed. Streaming writes can be reliably detected in the same way as
readahead. And doing explicit write-behind for them may help make the
writeback more targeted and better behaved.

For example, consider file A being sequentially written to by dd, and
another mmapped file B being randomly written to. In the current
global writeback, the two files will likely get a 1:1 share of the
dirty pages. With write-behind, we'll effectively limit file A's dirty
footprint to two chunk sizes, leaving much more room for file B and
increasing the chances that it accumulates adjacent dirty pages by
writeback time.

> I just suspect that we'd be better off teaching upper levels about the
> streaming. I know for a fact that if I do it by hand, system
> responsiveness was *much* better, and IO throughput didn't go down at
> all.

Your observation of better responsiveness may well stem from two
aspects:

1) lower dirty/writeback pages
2) the async write IO queue being drained constantly

(1) is obvious. For a mem=4G desktop, the default dirty limit can be
up to (4096MB * 20% = 819MB), while your smart writer effectively
limits dirty/writeback pages to a dramatically lower 16MB.

(2) comes from the use of _WAIT_ flags in

        sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);

Each sync_file_range() syscall will submit 8MB of write IO and wait
for completion. That means the async write IO queue constantly swings
between 0 and 8MB of fill, at a frequency of (100MBps / 8MB = 12.5Hz).
So every 80ms the async IO queue runs empty, which gives any
pending read IO (from firefox etc.) a chance to be serviced. Nice
and sweet breaks!
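The cycle arithmetic can be checked in a couple of lines; the 100MB/s
disk bandwidth is the assumed round figure used above, not a
measurement:

```c
/*
 * Queue-cycle arithmetic for the sync_file_range() writer above:
 * each call submits one chunk and waits for it, so the async queue
 * drains once per chunk. At the assumed 100MB/s and 8MB chunks that
 * is 12.5 drains per second, i.e. one idle window every 80ms in
 * which pending reads can be serviced.
 */
#include <assert.h>

/* Queue-empty events per second. */
static double drain_freq_hz(double chunk_mb, double bw_mbps)
{
	return bw_mbps / chunk_mb;
}

/* Milliseconds between successive queue drains. */
static double drain_period_ms(double chunk_mb, double bw_mbps)
{
	return 1000.0 * chunk_mb / bw_mbps;
}
```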

I suspect (2) contributes *much more* than (1) to desktop responsiveness.

Because in a desktop with heavy sequential writes and sporadic reads,
the 20% dirty/writeback pages can hardly reach the end of LRU lists to
trigger waits in direct page reclaim.

On the other hand, it's a known problem that our IO scheduler is still
not well behaved enough to provide good read latency when the flusher
rightfully manages to keep the async IO queue 100% full all the
time.

The IO scheduler will be the right place to solve this issue. There's
nothing wrong with the flusher blindly filling the async IO queue:
it's the flusher's duty to avoid underrun of the async IO queue and
the IO scheduler's duty to select the right queue to service (or to
idle). The IO scheduler *in theory* has all the information to make
the right decision to _not service_ requests from the flusher when
reads have been observed recently...

> > Note that the above user space code won't work well when there are 10+
> > dirtier tasks. It effectively creates 10+ IO submitters on different
> > regions of the disk and thus create lots of seeks.
> 
> Not really much more than our current writeback code does. It
> *schedules* data for writing, but doesn't wait for it until much
> later.
>
> You seem to think it was synchronous. It's not. Look at the second
> sync_file_range() thing, and the important part is the "index-1". The
> fact that you confused this with O_SYNC seems to be the same thing.
> This has absolutely *nothing* to do with O_SYNC.
 
Hmm, we should be sharing the same view here: it's not waiting for
"index", but it does wait for "index-1" to clear PG_writeback via
SYNC_FILE_RANGE_WAIT_AFTER.

Or when there are 10+ writers running, each submitting 8MB of data to
the async IO queue, they may well overrun the max IO queue size and
get blocked earlier, in get_request_wait().

> The other important part is that the chunk size is fairly large. We do
> read-ahead in 64k kind of chunks; for the write-behind chunking to
> make sense it needs to be in "multiple megabytes".  8MB is probably
> the minimum size that makes sense.

Yup. And we also need to make sure it's not 10 tasks each scheduling
50MB write IOs *concurrently*. sync_file_range() is unfortunately
doing it this way by sending IO requests to the async IO queue on its
own, rather than delegating the work to the flusher and letting one
single flusher submit IOs for them one after the other.

Imagine the async IO queue can hold exactly 50MB of writeback pages.
You can see the obvious difference in the graph below: the IO queue
will be filled with dirty pages from (a) one single inode or (b) 10
different inodes. In the latter case, the IO scheduler will switch
between the inodes much more frequently and create many more seeks.

A theoretical view of the async IO queue:

    +----------------+               +----------------+
    |                |               |    inode 1     |
    |                |               +----------------+
    |                |               |    inode 2     |
    |                |               +----------------+
    |                |               |    inode 3     |
    |                |               +----------------+
    |                |               |    inode 4     |
    |                |               +----------------+
    |   inode 1      |               |    inode 5     |
    |                |               +----------------+
    |                |               |    inode 6     |
    |                |               +----------------+
    |                |               |    inode 7     |
    |                |               +----------------+
    |                |               |    inode 8     |
    |                |               +----------------+
    |                |               |    inode 9     |
    |                |               +----------------+
    |                |               |    inode 10    |
    +----------------+               +----------------+
    (a) one single flusher           (b) 10 sync_file_range()
        submitting 50MB IO               submitting 50MB IO
        for each inode *in turn*         for each inode *in parallel*

So if parallel file syncs are a common use case, we'll need to make
them IO-less, too.
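The difference between the two layouts can be quantified with a toy
model (the 50-slot queue and 10-inode split are purely illustrative):
count how often consecutive requests in the queue belong to different
inodes, since each such transition is a potential seek.

```c
/*
 * Toy model of the two async-queue layouts in the diagram: layout
 * (a) holds one inode's pages contiguously; layout (b) holds ten
 * inodes' pages in separate blocks. Servicing the queue in order,
 * every transition between inodes is a potential seek.
 */
#include <assert.h>
#include <stddef.h>

/* Count inode-to-inode transitions while draining the queue in order. */
static int count_switches(const int *owner, size_t n)
{
	int switches = 0;
	for (size_t i = 1; i < n; i++)
		if (owner[i] != owner[i - 1])
			switches++;
	return switches;
}

/* Layout (a): every slot belongs to one single inode. */
static void fill_single(int *owner, size_t n)
{
	for (size_t i = 0; i < n; i++)
		owner[i] = 1;
}

/* Layout (b): n/ninodes contiguous slots per inode, as when
 * 'ninodes' writers each submit their own block of IO. */
static void fill_blocked(int *owner, size_t n, int ninodes)
{
	size_t per = n / (size_t)ninodes;
	for (size_t i = 0; i < n; i++)
		owner[i] = (int)(i / per) + 1;
}
```

With 50 slots, layout (a) yields zero switches while layout (b) with 10
inodes yields 9, and the gap only grows as the queue refills and the
scheduler revisits the inodes.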

> The write-behind would be for things like people writing disk images
> and video files. Not for random IO in smaller chunks.

Yup.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-05-30  3:21         ` Fengguang Wu
@ 2012-06-05  1:01           ` Dave Chinner
  -1 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2012-06-05  1:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> Linus,
> 
> On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> > On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> > I just suspect that we'd be better off teaching upper levels about the
> > streaming. I know for a fact that if I do it by hand, system
> > responsiveness was *much* better, and IO throughput didn't go down at
> > all.
> 
> Your observation of better responsiveness may well be stemmed from
> these two aspects:
> 
> 1) lower dirty/writeback pages
> 2) the async write IO queue being drained constantly
> 
> (1) is obvious. For a mem=4G desktop, the default dirty limit can be
> up to (4096 * 20% = 819MB). While your smart writer effectively limits
> dirty/writeback pages to a dramatically lower 16MB.
> 
> (2) comes from the use of _WAIT_ flags in
> 
>         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> 
> Each sync_file_range() syscall will submit 8MB of write IO and wait
> for completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, at a frequency of (100MBps / 8MB = 12.5Hz).
> So every 80ms the async IO queue runs empty, which gives any
> pending read IO (from firefox etc.) a chance to be serviced. Nice
> and sweet breaks!
> 
> I suspect (2) contributes *much more* than (1) to desktop responsiveness.

Almost certainly, especially with NCQ devices where even if the IO
scheduler preempts the write queue immediately, the device might
complete the outstanding 31 writes before servicing the read which
is issued as the 32nd command....

So NCQ depth is going to play a part here as well.
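A rough worst-case number for that effect; the 512KB request size and
100MB/s bandwidth are assumed round figures, not measurements:

```c
/*
 * Worst case for the NCQ effect described above: a read issued as
 * the 32nd command may have to wait for 31 already-queued writes
 * to complete first, since the device (not the IO scheduler) picks
 * the service order within its queue.
 */
#include <assert.h>

/* Extra read latency if (depth - 1) writes of req_kb each must
 * complete first, at bw_mbps of sequential write bandwidth. */
static double ncq_read_delay_ms(int depth, double req_kb, double bw_mbps)
{
	double pending_mb = (double)(depth - 1) * req_kb / 1024.0;
	return 1000.0 * pending_mb / bw_mbps;
}
```

At queue depth 32 with 512KB writes and 100MB/s, that is 15.5MB of
pending writes, or about 155ms of added read latency, which is why
queue depth matters as much as scheduler preemption here.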

> Because in a desktop with heavy sequential writes and sporadic reads,
> the 20% dirty/writeback pages can hardly reach the end of LRU lists to
> trigger waits in direct page reclaim.
> 
> On the other hand, it's a known problem that our IO scheduler is still
> not well behaved enough to provide good read latency when the flusher
> rightfully manages to keep the async IO queue 100% full all the
> time.

Deep queues are the antithesis of low latency. If you want good IO
interactivity (i.e. low access latency) you cannot keep deep async
IO queues. If you want good throughput, you need deep queues to
allow as wide a scheduling window as possible and to keep the IO
device as busy as possible.

> The IO scheduler will be the right place to solve this issue. There's
> nothing wrong for the flusher to blindly fill the async IO queue. It's
> the flusher's duty to avoid underrun of the async IO queue and the IO
> scheduler's duty to select the right queue to service (or to idle).
> The IO scheduler *in theory* has all the information to do the right
> decisions to _not service_ requests from the flusher when there are
> reads observed recently...

That's my take on the issue, too. Even if we decide that streaming
writes should be sync'd immediately, where should we draw the line?

I often write temporary files that would qualify as large streaming
writes (e.g. 1GB) and then immediately remove them. I rely on the
fact they don't hit the disk for performance (i.e. <1s to create,
wait 2s, <1s to read, <1s to unlink). If these are forced to disk
rather than sitting in memory for a short while, the create will now
take ~10s per file and I won't be able to create 10 of them
concurrently and have them all take <1s to create....
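That workload can be sketched like this; the 4MB size stands in for the
1GB files of the example, and the helper name is made up for
illustration. The point is that nothing in it ever forces the data to
disk, so with delayed writeback the whole create/read/unlink round trip
can complete at memory speed:

```c
/*
 * Sketch of the temp-file workload described above: create, write,
 * read back, and unlink a scratch file without ever syncing it.
 * With delayed writeback the data may never touch the disk at all;
 * forcing early writeback would slow this to disk speed.
 */
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TMPSIZE (4UL << 20)   /* 4MB stand-in for the 1GB files */

/* Returns 0 on success; the file never survives the call. */
static int scratch_roundtrip(const char *dir)
{
	char path[256];
	snprintf(path, sizeof(path), "%s/scratchXXXXXX", dir);
	int fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);          /* unlinked up front; fd keeps it alive */

	char buf[1 << 16];
	memset(buf, 's', sizeof(buf));
	for (size_t done = 0; done < TMPSIZE; done += sizeof(buf))
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			close(fd);
			return -1;
		}

	/* Read back through the page cache; note: no fsync anywhere. */
	if (lseek(fd, 0, SEEK_SET) < 0) {
		close(fd);
		return -1;
	}
	for (size_t done = 0; done < TMPSIZE; done += sizeof(buf))
		if (read(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
			close(fd);
			return -1;
		}
	close(fd);
	return 0;
}
```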

IOWs, what might seem like an interactivity optimisation for
one workload will quite badly affect the performance of a different
workload. Optimising read latency vs write bandwidth is exactly what
we have IO schedulers for....

> Or when there are 10+ writers running, each submitting 8MB data to the
> async IO queue, they may well overrun the max IO queue size and get
> blocked in the earlier stage of get_request_wait().

Yup, as soon as you have multiple IO submitters, we get back to the
old problem of thrashing the disks. This is *exactly* the throughput
problem we solved by moving to IO-less throttling. That is, having N
IO submitters is far less efficient than having a single, well
controlled IO submitter. That's exactly what we want to avoid...

> > The other important part is that the chunk size is fairly large. We do
> > read-ahead in 64k kind of chunks; for the write-behind chunking to
> > make sense it needs to be in "multiple megabytes".  8MB is probably
> > the minimum size that makes sense.
> 
> Yup. And we also need to make sure it's not 10 tasks each scheduling
> 50MB write IOs *concurrently*. sync_file_range() is unfortunately
> doing it this way by sending IO requests to the async IO queue on its
> own, rather than delegating the work to the flusher and letting one
> single flusher submit IOs for them one after the other.

Yup, that's the thrashing we need to avoid ;)

> So if parallel file syncs are a common usage, we'll need to make them
> IO-less, too.

Or just tell people "don't do that"

> > The write-behind would be for things like people writing disk images
> > and video files. Not for random IO in smaller chunks.

Or you could just use async direct IO to achieve exactly the same
thing without modifying the kernel at all ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
@ 2012-06-05  1:01           ` Dave Chinner
  0 siblings, 0 replies; 38+ messages in thread
From: Dave Chinner @ 2012-06-05  1:01 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Vivek Goyal

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> Linus,
> 
> On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> > On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> > I just suspect that we'd be better off teaching upper levels about the
> > streaming. I know for a fact that if I do it by hand, system
> > responsiveness was *much* better, and IO throughput didn't go down at
> > all.
> 
> Your observation of better responsiveness may well be stemmed from
> these two aspects:
> 
> 1) lower dirty/writeback pages
> 2) the async write IO queue being drained constantly
> 
> (1) is obvious. For a mem=4G desktop, the default dirty limit can be
> up to (4096 * 20% = 819MB). While your smart writer effectively limits
> dirty/writeback pages to a dramatically lower 16MB.
> 
> (2) comes from the use of _WAIT_ flags in
> 
>         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> 
> Each sync_file_range() syscall will submit 8MB write IO and wait for
> completion. That means the async write IO queue constantly swing
> between 0 and 8MB fillness at the frequency (100MBps / 8MB = 12.5ms).
> So on every 12.5ms, the async IO queue runs empty, which gives any
> pending read IO (from firefox etc.) a chance to be serviced. Nice
> and sweet breaks!
> 
> I suspect (2) contributes *much more* than (1) to desktop responsiveness.

Almost certainly, especially with NCQ devices where even if the IO
scheduler preempts the write queue immediately, the device might
complete the outstanding 31 writes before servicing the read which
is issued as the 32nd command....

So NCQ depth is going to play a part here as well.

> Because in a desktop with heavy sequential writes and sporadic reads,
> the 20% dirty/writeback pages can hardly reach the end of LRU lists to
> trigger waits in direct page reclaim.
> 
> On the other hand, it's a known problem that our IO scheculer is still
> not that well behaved to provide good read latency when the flusher
> rightfully manages to keep 100% fillness of the async IO queue all the
> time.

Deep queues are the antithesis of low latency. If you want good IO
interactivity (i.e. low access latency) you cannot keep deep async
IO queues. If you want good throughput, you need deep queues to
allow the best scheduling window as possible and to keep the IO
device as busy as possible.

> The IO scheduler will be the right place to solve this issue. There's
> nothing wrong with the flusher blindly filling the async IO queue. It's
> the flusher's duty to avoid underrun of the async IO queue and the IO
> scheduler's duty to select the right queue to service (or to idle).
> The IO scheduler *in theory* has all the information to make the right
> decision to _not service_ requests from the flusher when there are
> reads observed recently...

That's my take on the issue, too. Even if we decide that streaming
writes should be sync'd immediately, where should we draw the limit?

I often write temporary files that would qualify as large streaming
writes (e.g. 1GB) and then immediately remove them. I rely on the
fact they don't hit the disk for performance (i.e. <1s to create,
wait 2s, <1s to read, <1s to unlink). If these are forced to disk
rather than sitting in memory for a short while, the create will now
take ~10s per file and I won't be able to create 10 of them
concurrently and have them all take <1s to create....

IOWs, what might seem like an interactivity optimisation for
one workload will quite badly affect the performance of a different
workload. Optimising read latency vs write bandwidth is exactly what
we have IO schedulers for....

> Or when there are 10+ writers running, each submitting 8MB data to the
> async IO queue, they may well overrun the max IO queue size and get
> blocked in the earlier stage of get_request_wait().

Yup, as soon as you have multiple IO submitters, we get back to the
old problem of thrashing the disks. This is *exactly* the throughput
problem we solved by moving to IO-less throttling. That is, having N
IO submitters is far less efficient than having a single, well
controlled IO submitter. That's exactly what we want to avoid...

> > The other important part is that the chunk size is fairly large. We do
> > read-ahead in 64k kind of things, to make sense the write-behind
> > chunking needs to be in "multiple megabytes".  8MB is probably the
> > minimum size it makes sense.
> 
> Yup. And we also need to make sure it's not 10 tasks each scheduling
> 50MB write IOs *concurrently*. sync_file_range() is unfortunately
> doing it this way, sending IO requests to the async IO queue on its
> own rather than delegating the work to the flusher and letting one
> single flusher submit IOs for them one after the other.

Yup, that's the thrashing we need to avoid ;)

> So if parallel file syncs are a common usage, we'll need to make them
> IO-less, too.

Or just tell people "don't do that"

> > The write-behind would be for things like people writing disk images
> > and video files. Not for random IO in smaller chunks.

Or you could just use async direct IO to achieve exactly the same
thing without modifying the kernel at all ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-05  1:01           ` Dave Chinner
@ 2012-06-05 17:18             ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
	linux-fsdevel, Linux Memory Management List

On Tue, Jun 05, 2012 at 11:01:48AM +1000, Dave Chinner wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > Linus,
> > 
> > On Tue, May 29, 2012 at 10:35:46AM -0700, Linus Torvalds wrote:
> > > On Tue, May 29, 2012 at 8:57 AM, Fengguang Wu <fengguang.wu@intel.com> wrote:
> > > I just suspect that we'd be better off teaching upper levels about the
> > > streaming. I know for a fact that if I do it by hand, system
> > > responsiveness was *much* better, and IO throughput didn't go down at
> > > all.
> > 
> > Your observation of better responsiveness may well stem from
> > these two aspects:
> > 
> > 1) lower dirty/writeback pages
> > 2) the async write IO queue being drained constantly
> > 
> > (1) is obvious. For a mem=4G desktop, the default dirty limit can be
> > up to 4096MB * 20% = 819MB, while your smart writer effectively limits
> > dirty/writeback pages to a dramatically lower 16MB.
> > 
> > (2) comes from the use of _WAIT_ flags in
> > 
> >         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> > 
> > Each sync_file_range() syscall will submit 8MB of write IO and wait for
> > completion. That means the async write IO queue constantly swings
> > between 0 and 8MB of fill, cycling at (100MB/s / 8MB) = 12.5 times per
> > second, i.e. once every 80ms. So every 80ms the async IO queue runs
> > empty, which gives any pending read IO (from firefox etc.) a chance to
> > be serviced. Nice and sweet breaks!
> > 
> > I suspect (2) contributes *much more* than (1) to desktop responsiveness.
> 
> Almost certainly, especially with NCQ devices where even if the IO
> scheduler preempts the write queue immediately, the device might
> complete the outstanding 31 writes before servicing the read which
> is issued as the 32nd command....

CFQ does preempt async IO once sync IO gets queued.

> 
> So NCQ depth is going to play a part here as well.

Yes, NCQ depth does contribute, primarily to READ latencies in the
presence of async IO. I think disk drivers and disk firmware should also
participate in prioritizing READs over pending WRITEs to improve the
situation.

The IO scheduler can only do so much. CFQ already tries hard to keep the
pending async queue depth low, and that results in lower throughput many
times (compared to deadline).

In fact CFQ tries so hard to prioritize SYNC IO over async IO that I have
often heard of cases of WRITEs being starved and people facing "task
blocked for 120 seconds" warnings.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-05-30  3:21         ` Fengguang Wu
@ 2012-06-05 17:23           ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:23 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:

[..]
> (2) comes from the use of _WAIT_ flags in
> 
>         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> 
> Each sync_file_range() syscall will submit 8MB of write IO and wait for
> completion. That means the async write IO queue constantly swings
> between 0 and 8MB of fill, cycling at (100MB/s / 8MB) = 12.5 times per
> second, i.e. once every 80ms. So every 80ms the async IO queue runs
> empty, which gives any pending read IO (from firefox etc.) a chance to
> be serviced. Nice and sweet breaks!

I doubt that the async IO queue is empty for that long. We wait for the
previous range (index-1) to finish but have already started IO on the
next 8MB of pages. So effectively that should keep 8MB of async IO in
the queue (unless there are delays on the user space side). So the
reason for the latency improvement might be something else, and not that
the async IO queue is empty for some time.
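The pipelining described here can be sketched as follows (a minimal, Linux-specific sketch; the chunk size, path, and helper name are assumptions, and offsets are computed in 64-bit arithmetic). Chunk `index` is submitted asynchronously before the loop blocks on chunk `index-1`, so one 8MB window is normally still queued while we wait:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (8u << 20)	/* 8MB, matching the figures in the thread */

/* Write nr_chunks * 8MB, starting each chunk's write IO immediately and
 * only then waiting for the *previous* chunk.  While we block on chunk
 * index-1, chunk index is already queued, so the async queue should
 * rarely be empty.  Returns 0 on success. */
int streaming_write(const char *path, unsigned int nr_chunks)
{
	static char buf[CHUNK];
	unsigned int index;
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

	if (fd < 0)
		return -1;
	memset(buf, 'a', sizeof(buf));
	for (index = 0; index < nr_chunks; index++) {
		if (write(fd, buf, CHUNK) != (ssize_t)CHUNK)
			break;
		/* start IO on this chunk, don't wait */
		sync_file_range(fd, (off_t)index * CHUNK, CHUNK,
				SYNC_FILE_RANGE_WRITE);
		if (index)
			/* now wait for the chunk submitted last round */
			sync_file_range(fd, (off_t)(index - 1) * CHUNK, CHUNK,
					SYNC_FILE_RANGE_WAIT_BEFORE |
					SYNC_FILE_RANGE_WRITE |
					SYNC_FILE_RANGE_WAIT_AFTER);
	}
	close(fd);
	unlink(path);
	return index == nr_chunks ? 0 : -1;
}
```

With this overlap, the queue only drains when the previous chunk completes before the current chunk's submission catches up.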

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-05 17:23           ` Vivek Goyal
@ 2012-06-05 17:41             ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-05 17:41 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:23:02PM -0400, Vivek Goyal wrote:
> On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> 
> [..]
> > (2) comes from the use of _WAIT_ flags in
> > 
> >         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> > 
> > Each sync_file_range() syscall will submit 8MB of write IO and wait for
> > completion. That means the async write IO queue constantly swings
> > between 0 and 8MB of fill, cycling at (100MB/s / 8MB) = 12.5 times per
> > second, i.e. once every 80ms. So every 80ms the async IO queue runs
> > empty, which gives any pending read IO (from firefox etc.) a chance to
> > be serviced. Nice and sweet breaks!
> 
> I doubt that the async IO queue is empty for that long. We wait for the
> previous range (index-1) to finish but have already started IO on the
> next 8MB of pages. So effectively that should keep 8MB of async IO in
> the queue (unless there are delays on the user space side). So the
> reason for the latency improvement might be something else, and not that
> the async IO queue is empty for some time.

With the sync_file_range() test, we can have 8MB of IO in flight. Without
it, I think we can have more at times, and that might be the reason for
the latency improvement.

I see that CFQ has code to allow a deeper NCQ depth if there is only a
single writer. So once a reader comes along it might find tons of async
IO already in flight. sync_file_range() will limit that in-flight IO,
hence the latency improvement. So if we have multiple dd's doing
sync_file_range(), this latency improvement should probably go away.

I will run some tests to verify if my understanding about deeper queue
depths in case of single writer is correct or not.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-05 17:41             ` Vivek Goyal
@ 2012-06-05 18:48               ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-05 18:48 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List

On Tue, Jun 05, 2012 at 01:41:57PM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 01:23:02PM -0400, Vivek Goyal wrote:
> > On Wed, May 30, 2012 at 11:21:29AM +0800, Fengguang Wu wrote:
> > 
> > [..]
> > > (2) comes from the use of _WAIT_ flags in
> > > 
> > >         sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
> > > 
> > > Each sync_file_range() syscall will submit 8MB of write IO and wait for
> > > completion. That means the async write IO queue constantly swings
> > > between 0 and 8MB of fill, cycling at (100MB/s / 8MB) = 12.5 times per
> > > second, i.e. once every 80ms. So every 80ms the async IO queue runs
> > > empty, which gives any pending read IO (from firefox etc.) a chance to
> > > be serviced. Nice and sweet breaks!
> > 
> > I doubt that the async IO queue is empty for that long. We wait for the
> > previous range (index-1) to finish but have already started IO on the
> > next 8MB of pages. So effectively that should keep 8MB of async IO in
> > the queue (unless there are delays on the user space side). So the
> > reason for the latency improvement might be something else, and not that
> > the async IO queue is empty for some time.
> 
> With the sync_file_range() test, we can have 8MB of IO in flight. Without
> it, I think we can have more at times, and that might be the reason for
> the latency improvement.
> 
> I see that CFQ has code to allow a deeper NCQ depth if there is only a
> single writer. So once a reader comes along it might find tons of async
> IO already in flight. sync_file_range() will limit that in-flight IO,
> hence the latency improvement. So if we have multiple dd's doing
> sync_file_range(), this latency improvement should probably go away.
> 
> I will run some tests to verify if my understanding about deeper queue
> depths in case of single writer is correct or not.

So I did run some tests and can confirm that on average there seem to
be more in-flight requests *without* sync_file_range(), and that's
probably the reason why the sync_file_range() test shows better latency.

I can see that with "dd if=/dev/zero of=zerofile bs=1M count=1024", we
are driving deeper queue depths (up to 32), and in the later stages the
number of in-flight requests is constantly high.

With sync_file_range(), the number of in-flight requests fluctuates a
lot between 1 and 32. Often it is just 1 or up to 16, and a few times it
went up to 32.

So the sync_file_range() test keeps fewer requests in flight on
average, hence the better latencies. It might not produce a throughput
drop on SATA disks but might have some effect on storage array LUNs.
Will give it a try.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-05 18:48               ` Vivek Goyal
@ 2012-06-05 20:10                 ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-05 20:10 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:

[..]
> So the sync_file_range() test keeps fewer requests in flight on
> average, hence the better latencies. It might not produce a throughput
> drop on SATA disks but might have some effect on storage array LUNs.
> Will give it a try.

Well, I ran the dd and sync_file_range tests on a storage array LUN. I
wrote a file of size 4GB on ext4 and got about 300MB/s write speed. In
fact, when I measured time using "time", the sync_file_range test
finished a little faster.

Then I started looking at the blktrace output. The sync_file_range()
test initially (for about 8 seconds) drives a shallow queue depth (about
16), but after 8 seconds somehow the flusher gets involved and starts
submitting lots of requests, and we start driving a much higher queue
depth (more than 100). Not sure why the flusher should get involved. Is
everything working as expected? I thought that since we wait for the
last 8MB of IO to finish before starting a new one, we should have at
most 16MB of IO in flight. Fengguang?

Anyway, this speed comparison test is invalid, as the flusher gets
involved after some time and we start driving more in-flight requests.
I guess I should hard-code the maximum number of requests in flight
to see the effect of request queue depth on throughput.

I am also attaching the sync_file_range() test Linus mentioned. Did I
write it right?

Thanks
Vivek

#define _GNU_SOURCE
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <time.h>
#include <fcntl.h>
#include <string.h>


#define BUFSIZE (8*1024*1024)
char buf[BUFSIZE];

int main()
{
	int fd, index = 0;

	fd = open("sync-file-range-tester.tst-file", O_WRONLY|O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		exit(1);
	}

	memset(buf, 'a', BUFSIZE);

	while (1) {
		if (write(fd, buf, BUFSIZE) != BUFSIZE)
			break;
		sync_file_range(fd, index*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WRITE);
		if (index) {
			sync_file_range(fd, (index-1)*BUFSIZE, BUFSIZE, SYNC_FILE_RANGE_WAIT_BEFORE|SYNC_FILE_RANGE_WRITE|SYNC_FILE_RANGE_WAIT_AFTER);
		}
		index++;
		if (index >= 512)
			break;
	}
	return 0;
}

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-05 20:10                 ` Vivek Goyal
@ 2012-06-06  2:57                   ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-06  2:57 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:
> 
> [..]
> > So the sync_file_range() test keeps fewer requests in flight on
> > average, hence the better latencies. It might not produce a throughput
> > drop on SATA disks but might have some effect on storage array LUNs.
> > Will give it a try.
> 
> Well, I ran the dd and sync_file_range tests on a storage array LUN. I
> wrote a file of size 4GB on ext4 and got about 300MB/s write speed. In
> fact, when I measured time using "time", the sync_file_range test
> finished a little faster.
> 
> Then I started looking at the blktrace output. The sync_file_range()
> test initially (for about 8 seconds) drives a shallow queue depth (about
> 16), but after 8 seconds somehow the flusher gets involved and starts
> submitting lots of requests, and we start driving a much higher queue
> depth (more than 100). Not sure why the flusher should get involved. Is
> everything working as expected? I thought that since we wait for the
> last 8MB of IO to finish before starting a new one, we should have at
> most 16MB of IO in flight. Fengguang?

Ok, found it. I am using an "int index", which causes (index*BUFSIZE) to
be computed in 32-bit signed arithmetic. Once "index" crosses 255,
integer overflow happens, the result is sign-extended into the 64-bit
offset, and the offsets are screwed. So past a 2GB file size,
sync_file_range() effectively stops working, leaving dirty pages which
are cleaned up by the flusher. That explains why the flusher was kicking
in during my tests. Changing "int" to "unsigned int" fixes the problem.

Now I ran the sync_file_range() test and another program which writes a
4GB file and does an fdatasync() at the end, and compared total
execution times. The first one takes around 12.5 seconds while the
latter takes around 12.0 seconds. So sync_file_range() is just a little
slower on this SAN LUN.

I had expected a bigger difference, as sync_file_range() drives a max
queue depth of 32 (16MB of IO in flight in total), while the flushers
drive queue depths up to 140 or so. So in this particular test, driving
much deeper queue depths is not really helping. (I have seen higher
throughputs with higher queue depths in the past; not sure why we don't
see that here.)

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06  2:57                   ` Vivek Goyal
@ 2012-06-06  3:14                     ` Linus Torvalds
  -1 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2012-06-06  3:14 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
>
> I had expected a bigger difference as sync_file_range() is just driving
> max queue depth of 32 (total 16MB IO in flight), while flushers are
> driving queue depths up to 140 or so. So in this paritcular test, driving
> much deeper queue depths is not really helping much. (I have seen higher
> throughputs with higher queue depths in the past. Now sure why don't we
> see it here).

How did interactivity feel?

Because quite frankly, if the throughput difference is 12.5 vs 12
seconds, I suspect the interactivity thing is what dominates.

And from my memory, the interactivity difference was absolutely
*huge*. Even back when I used rotational media, I basically couldn't
even notice the background write with the sync_file_range() approach.
While the regular writeback without the writebehind had absolutely
*huge* pauses if you used something like firefox that uses fsync()
etc. And starting new applications that weren't cached was noticeably
worse too - and then with sync_file_range it wasn't even all that
noticeable.

NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
the writeback on the same filesystem (and obviously disk) as your home
directory is. If the big write is to another filesystem and another
disk, I think you won't see the same issues.

Admittedly, I have not really touched anything with a rotational disk
for the last few years, nor do I ever want to see those rotating
pieces of high-tech rust ever again. And maybe your SAN has such good
latency even under load that it doesn't really matter. I remember it
mattering a lot back when.

Of course, back when I did that testing and had rotational media, we
didn't have the per-bdi writeback logic with the smart speed-dependent
depths etc, so it may be that we're just so much better at writeback
these days that it's not nearly as noticeable any more.

                        Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06  3:14                     ` Linus Torvalds
@ 2012-06-06 12:14                       ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-06 12:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> >
> > I had expected a bigger difference as sync_file_range() is just driving
> > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > driving queue depths up to 140 or so. So in this paritcular test, driving
> > much deeper queue depths is not really helping much. (I have seen higher
> > throughputs with higher queue depths in the past. Now sure why don't we
> > see it here).
> 
> How did interactivity feel?
> 
> Because quite frankly, if the throughput difference is 12.5 vs 12
> seconds, I suspect the interactivity thing is what dominates.
> 
> And from my memory of the interactivity different was absolutely
> *huge*. Even back when I used rotational media, I basically couldn't
> even notice the background write with the sync_file_range() approach.
> While the regular writeback without the writebehind had absolutely
> *huge* pauses if you used something like firefox that uses fsync()
> etc. And starting new applications that weren't cached was noticeably
> worse too - and then with sync_file_range it wasn't even all that
> noticeable.
> 
> NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> the writeback on the same filesystem (and obviously disk) as your home
> directory is. If the big write is to another filesystem and another
> disk, I think you won't see the same issues.

Ok, I did the following test on my single SATA disk; my root filesystem
is on this disk.

I dropped caches and launched firefox and monitored the time it takes
for firefox to start. (cache cold).

And my results are the reverse of what you have been seeing. With
sync_file_range() running, firefox takes roughly 30 seconds to start, and
with the flusher in operation, it takes roughly 20 seconds to start. (I
have approximated the average of 3 runs for simplicity.)

I think this is happening because sync_file_range() will send all the
writes as SYNC and they will compete with firefox's IO. On the other
hand, the flusher's IO will show up as ASYNC, CFQ will penalize it
heavily, and firefox's IO will be prioritized. And this effect should
just get worse as more processes do sync_file_range().

So write-behind should provide better interactivity if writes submitted
are ASYNC and not SYNC.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06 12:14                       ` Vivek Goyal
@ 2012-06-06 14:00                         ` Fengguang Wu
  -1 siblings, 0 replies; 38+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:00 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > >
> > > I had expected a bigger difference as sync_file_range() is just driving
> > > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > > driving queue depths up to 140 or so. So in this paritcular test, driving
> > > much deeper queue depths is not really helping much. (I have seen higher
> > > throughputs with higher queue depths in the past. Now sure why don't we
> > > see it here).
> > 
> > How did interactivity feel?
> > 
> > Because quite frankly, if the throughput difference is 12.5 vs 12
> > seconds, I suspect the interactivity thing is what dominates.
> > 
> > And from my memory of the interactivity different was absolutely
> > *huge*. Even back when I used rotational media, I basically couldn't
> > even notice the background write with the sync_file_range() approach.
> > While the regular writeback without the writebehind had absolutely
> > *huge* pauses if you used something like firefox that uses fsync()
> > etc. And starting new applications that weren't cached was noticeably
> > worse too - and then with sync_file_range it wasn't even all that
> > noticeable.
> > 
> > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> > the writeback on the same filesystem (and obviously disk) as your home
> > directory is. If the big write is to another filesystem and another
> > disk, I think you won't see the same issues.
> 
> Ok, I did following test on my single SATA disk and my root filesystem
> is on this disk.
> 
> I dropped caches and launched firefox and monitored the time it takes
> for firefox to start. (cache cold).
> 
> And my results are reverse of what you have been seeing. With
> sync_file_range() running, firefox takes roughly 30 seconds to start and
> with flusher in operation, it takes roughly 20 seconds to start. (I have
> approximated the average of 3 runs for simplicity).
> 
> I think it is happening because sync_file_range() will send all
> the writes as SYNC and it will compete with firefox IO. On the other
> hand, flusher's IO will show up as ASYNC and CFQ  will be penalize it
> heavily and firefox's IO will be prioritized. And this effect should
> just get worse as more processes do sync_file_range().
> 
> So write-behind should provide better interactivity if writes submitted
> are ASYNC and not SYNC.

Hi Vivek, thanks for testing all of these out! The result is
definitely interesting and a surprise: we overlooked the SYNC nature
of sync_file_range().

I'd suggest using these calls to achieve the write-and-drop-behind
behavior, *with* WB_SYNC_NONE:

        posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
        sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);

The caveat is, the below bdi_write_congested() will never evaluate to
true since we are only filling the request queue with 8MB of data.

SYSCALL_DEFINE(fadvise64_64):

        case POSIX_FADV_DONTNEED:
                if (!bdi_write_congested(mapping->backing_dev_info))
                        __filemap_fdatawrite_range(mapping, offset, endbyte,
                                                   WB_SYNC_NONE);

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06  2:57                   ` Vivek Goyal
@ 2012-06-06 14:08                     ` Fengguang Wu
  -1 siblings, 0 replies; 38+ messages in thread
From: Fengguang Wu @ 2012-06-06 14:08 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Tue, Jun 05, 2012 at 10:57:30PM -0400, Vivek Goyal wrote:
> On Tue, Jun 05, 2012 at 04:10:45PM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 02:48:53PM -0400, Vivek Goyal wrote:
> > 
> > [..]
> > > So sync_file_range() test keeps less in flight requests on on average
> > > hence better latencies. It might not produce throughput drop on SATA
> > > disks but might have some effect on storage array luns. Will give it
> > > a try.
> > 
> > Well, I ran dd and syn_file_range test on a storage array Lun. Wrote a
> > file of size 4G on ext4. Got about 300MB/s write speed. In fact when I
> > measured time using "time", sync_file_range test finished little faster.
> > 
> > Then I started looking at blktrace output. sync_file_range() test
> > initially (for about 8 seconds), drives shallow queue depth (about 16),
> > but after 8 seconds somehow flusher gets involved and starts submitting
> > lots of requests and we start driving much higher queue depth (upto more than
> > 100). Not sure why flusher should get involved. Is everything working as
> > expected. I thought that as we wait for last 8MB IO to finish before we
> > start new one, we should have at max 16MB of IO in flight. Fengguang?
> 
> Ok, found it. I am using "int index" which in turn caused signed integer
> extension of (i*BUFSIZE). Once "i" crosses 255, integer overflow happens
> and 64bit offset is sign extended and offsets are screwed. So after 2G
> file size, sync_file_range() effectively stops working leaving dirty
> pages which are cleaned up by flusher. So that explains why flusher
> was kicking during my tests. Change "int" to "unsigned int" and problem
> if fixed.

Good catch! Besides that, I do see a small chance for the flusher thread
to kick in: at the time the inode's dirty timestamp expires after 30s.
Just a kind reminder, since I don't see how it could impact this workload
in any noticeable way.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06 12:14                       ` Vivek Goyal
@ 2012-06-06 16:15                         ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-06 16:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Fengguang Wu, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:

[..]
> I think it is happening because sync_file_range() will send all
> the writes as SYNC and it will compete with firefox IO. On the other
> hand, flusher's IO will show up as ASYNC and CFQ  will be penalize it
> heavily and firefox's IO will be prioritized. And this effect should
> just get worse as more processes do sync_file_range().

Ok, this time I tried the same test again but with 4 processes doing
writes in parallel on 4 different files.

And with sync_file_range() things turned ugly. Interactivity was very poor. 

The firefox launch test took around 1m45s with sync_file_range(), while
it took only about 35 seconds with the regular flusher threads.

So sending writeback IO synchronously wreaks havoc.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06 14:00                         ` Fengguang Wu
@ 2012-06-06 17:04                           ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-06 17:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > I had expected a bigger difference as sync_file_range() is just driving
> > > > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > > > driving queue depths up to 140 or so. So in this paritcular test, driving
> > > > much deeper queue depths is not really helping much. (I have seen higher
> > > > throughputs with higher queue depths in the past. Now sure why don't we
> > > > see it here).
> > > 
> > > How did interactivity feel?
> > > 
> > > Because quite frankly, if the throughput difference is 12.5 vs 12
> > > seconds, I suspect the interactivity thing is what dominates.
> > > 
> > > And from my memory of the interactivity different was absolutely
> > > *huge*. Even back when I used rotational media, I basically couldn't
> > > even notice the background write with the sync_file_range() approach.
> > > While the regular writeback without the writebehind had absolutely
> > > *huge* pauses if you used something like firefox that uses fsync()
> > > etc. And starting new applications that weren't cached was noticeably
> > > worse too - and then with sync_file_range it wasn't even all that
> > > noticeable.
> > > 
* Re: write-behind on streaming writes
@ 2012-06-06 17:04                           ` Vivek Goyal
  0 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-06 17:04 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Linus Torvalds, LKML, Myklebust, Trond, linux-fsdevel,
	Linux Memory Management List, Jens Axboe

On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > >
> > > > I had expected a bigger difference as sync_file_range() is just driving
> > > > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > > > driving queue depths up to 140 or so. So in this particular test, driving
> > > > much deeper queue depths is not really helping much. (I have seen higher
> > > > throughputs with higher queue depths in the past. Not sure why we don't
> > > > see it here.)
> > > 
> > > How did interactivity feel?
> > > 
> > > Because quite frankly, if the throughput difference is 12.5 vs 12
> > > seconds, I suspect the interactivity thing is what dominates.
> > > 
> > > And from my memory the interactivity difference was absolutely
> > > *huge*. Even back when I used rotational media, I basically couldn't
> > > even notice the background write with the sync_file_range() approach.
> > > While the regular writeback without the writebehind had absolutely
> > > *huge* pauses if you used something like firefox that uses fsync()
> > > etc. And starting new applications that weren't cached was noticeably
> > > worse too - and then with sync_file_range it wasn't even all that
> > > noticeable.
> > > 
> > > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> > > the writeback on the same filesystem (and obviously disk) as your home
> > > directory is. If the big write is to another filesystem and another
> > > disk, I think you won't see the same issues.
> > 
> > Ok, I did the following test on my single SATA disk, and my root
> > filesystem is on this disk.
> > 
> > I dropped caches and launched firefox and monitored the time it takes
> > for firefox to start. (cache cold).
> > 
> > And my results are reverse of what you have been seeing. With
> > sync_file_range() running, firefox takes roughly 30 seconds to start and
> > with flusher in operation, it takes roughly 20 seconds to start. (I have
> > approximated the average of 3 runs for simplicity).
> > 
> > I think it is happening because sync_file_range() will send all
> > the writes as SYNC and it will compete with firefox IO. On the other
> > hand, the flusher's IO will show up as ASYNC and CFQ will penalize it
> > heavily and firefox's IO will be prioritized. And this effect should
> > just get worse as more processes do sync_file_range().
> > 
> > So write-behind should provide better interactivity if writes submitted
> > are ASYNC and not SYNC.
> 
> Hi Vivek, thanks for testing all of these out! The result is
> definitely interesting and a surprise: we overlooked the SYNC nature
> of sync_file_range().
> 
> I'd suggest using these calls to achieve the write-and-drop-behind
> behavior, *with* WB_SYNC_NONE:
> 
>         posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
>         sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
> 
> The caveat is, the below bdi_write_congested() will never evaluate to
> true since we are only filling the request queue with 8MB of data.
> 
> SYSCALL_DEFINE(fadvise64_64):
> 
>         case POSIX_FADV_DONTNEED:
>                 if (!bdi_write_congested(mapping->backing_dev_info))
>                         __filemap_fdatawrite_range(mapping, offset, endbyte,
>                                                    WB_SYNC_NONE);

Hi Fengguang,

Instead of the above, I modified sync_file_range() to call
__filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
showing up at the elevator.

With 4 processes doing sync_file_range() now, the firefox start time test
clocks in around 18-19 seconds, which is better than the 30-35 seconds with 4
processes doing buffered writes. And the system looks pretty good from an
interactivity point of view.

Thanks
Vivek

Following is the patch I applied to test.

---
 fs/sync.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6/fs/sync.c
===================================================================
--- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
+++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
@@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd, 
 	}
 
 	if (flags & SYNC_FILE_RANGE_WRITE) {
-		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
+		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
 		if (ret < 0)
 			goto out_put;
 	}

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-06 17:04                           ` Vivek Goyal
@ 2012-06-07  9:45                             ` Jan Kara
  -1 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2012-06-07  9:45 UTC (permalink / raw)
  To: Vivek Goyal
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
	linux-fsdevel, Linux Memory Management List, Jens Axboe

On Wed 06-06-12 13:04:28, Vivek Goyal wrote:
> On Wed, Jun 06, 2012 at 10:00:58PM +0800, Fengguang Wu wrote:
> > On Wed, Jun 06, 2012 at 08:14:08AM -0400, Vivek Goyal wrote:
> > > On Tue, Jun 05, 2012 at 08:14:08PM -0700, Linus Torvalds wrote:
> > > > On Tue, Jun 5, 2012 at 7:57 PM, Vivek Goyal <vgoyal@redhat.com> wrote:
> > > > >
> > > > > I had expected a bigger difference as sync_file_range() is just driving
> > > > > max queue depth of 32 (total 16MB IO in flight), while flushers are
> > > > > driving queue depths up to 140 or so. So in this particular test, driving
> > > > > much deeper queue depths is not really helping much. (I have seen higher
> > > > > throughputs with higher queue depths in the past. Not sure why we don't
> > > > > see it here.)
> > > > 
> > > > How did interactivity feel?
> > > > 
> > > > Because quite frankly, if the throughput difference is 12.5 vs 12
> > > > seconds, I suspect the interactivity thing is what dominates.
> > > > 
> > > > And from my memory the interactivity difference was absolutely
> > > > *huge*. Even back when I used rotational media, I basically couldn't
> > > > even notice the background write with the sync_file_range() approach.
> > > > While the regular writeback without the writebehind had absolutely
> > > > *huge* pauses if you used something like firefox that uses fsync()
> > > > etc. And starting new applications that weren't cached was noticeably
> > > > worse too - and then with sync_file_range it wasn't even all that
> > > > noticeable.
> > > > 
> > > > NOTE! For the real "firefox + fsync" test, I suspect you'd need to do
> > > > the writeback on the same filesystem (and obviously disk) as your home
> > > > directory is. If the big write is to another filesystem and another
> > > > disk, I think you won't see the same issues.
> > > 
> > > Ok, I did the following test on my single SATA disk, and my root
> > > filesystem is on this disk.
> > > 
> > > I dropped caches and launched firefox and monitored the time it takes
> > > for firefox to start. (cache cold).
> > > 
> > > And my results are reverse of what you have been seeing. With
> > > sync_file_range() running, firefox takes roughly 30 seconds to start and
> > > with flusher in operation, it takes roughly 20 seconds to start. (I have
> > > approximated the average of 3 runs for simplicity).
> > > 
> > > I think it is happening because sync_file_range() will send all
> > > the writes as SYNC and it will compete with firefox IO. On the other
> > > hand, the flusher's IO will show up as ASYNC and CFQ will penalize it
> > > heavily and firefox's IO will be prioritized. And this effect should
> > > just get worse as more processes do sync_file_range().
> > > 
> > > So write-behind should provide better interactivity if writes submitted
> > > are ASYNC and not SYNC.
> > 
> > Hi Vivek, thanks for testing all of these out! The result is
> > definitely interesting and a surprise: we overlooked the SYNC nature
> > of sync_file_range().
> > 
> > I'd suggest using these calls to achieve the write-and-drop-behind
> > behavior, *with* WB_SYNC_NONE:
> > 
> >         posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
> >         sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_AFTER);
> > 
> > The caveat is, the below bdi_write_congested() will never evaluate to
> > true since we are only filling the request queue with 8MB of data.
> > 
> > SYSCALL_DEFINE(fadvise64_64):
> > 
> >         case POSIX_FADV_DONTNEED:
> >                 if (!bdi_write_congested(mapping->backing_dev_info))
> >                         __filemap_fdatawrite_range(mapping, offset, endbyte,
> >                                                    WB_SYNC_NONE);
> 
> Hi Fengguang,
> 
> Instead of the above, I modified sync_file_range() to call
> __filemap_fdatawrite_range(WB_SYNC_NONE) and I now see ASYNC writes
> showing up at the elevator.
> 
> With 4 processes doing sync_file_range() now, the firefox start time test
> clocks in around 18-19 seconds, which is better than the 30-35 seconds with 4
> processes doing buffered writes. And the system looks pretty good from an
> interactivity point of view.
  So do you have any idea why that is? Do we drive shallower queues? Also,
how does the speed of the writers compare to the speed with normal buffered
writes + fsync? (You'd need fsync for the sync_file_range writers as well to
make the comparison fair.)
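
One generic way to answer the queue-depth question (an observation recipe for illustration, not something from this thread; "sda" is an example device name) is to sample the block device's in-flight counters while each workload runs:

```shell
# Sample in-flight requests once a second for 10 seconds.
# The inflight file holds two counters: reads and writes
# currently sitting in the device queue or being serviced.
for i in $(seq 10); do
        cat /sys/block/sda/inflight
        sleep 1
done
```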

								Honza
> ---
>  fs/sync.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> Index: linux-2.6/fs/sync.c
> ===================================================================
> --- linux-2.6.orig/fs/sync.c	2012-06-06 00:12:33.000000000 -0400
> +++ linux-2.6/fs/sync.c	2012-06-06 23:11:17.050691776 -0400
> @@ -342,7 +342,7 @@ SYSCALL_DEFINE(sync_file_range)(int fd, 
>  	}
>  
>  	if (flags & SYNC_FILE_RANGE_WRITE) {
> -		ret = filemap_fdatawrite_range(mapping, offset, endbyte);
> +		ret = __filemap_fdatawrite_range(mapping, offset, endbyte, WB_SYNC_NONE);
>  		if (ret < 0)
>  			goto out_put;
>  	}
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: write-behind on streaming writes
  2012-06-07  9:45                             ` Jan Kara
@ 2012-06-07 19:06                               ` Vivek Goyal
  -1 siblings, 0 replies; 38+ messages in thread
From: Vivek Goyal @ 2012-06-07 19:06 UTC (permalink / raw)
  To: Jan Kara
  Cc: Fengguang Wu, Linus Torvalds, LKML, Myklebust, Trond,
	linux-fsdevel, Linux Memory Management List, Jens Axboe

On Thu, Jun 07, 2012 at 11:45:04AM +0200, Jan Kara wrote:
[..]
> > Instead of above, I modified sync_file_range() to call
> > __filemap_fdatawrite_range(WB_SYNC_NONE) and I do see now ASYNC writes
> > showing up at elevator.
> > 
> > With 4 processes doing sync_file_range() now, firefox start time test
> > clocks around 18-19 seconds which is better than 30-35 seconds of 4
> > processes doing buffered writes. And system looks pretty good from
> > interactivity point of view.
>   So do you have any idea why is that? Do we drive shallower queues? Also
> how does speed of the writers compare to the speed with normal buffered
> writes + fsync (you'd need fsync for sync_file_range writers as well to
> make comparison fair)?

Ok, I did more tests, and I noticed a few odd things.

- Results are varying a lot. Sometimes firefox also launched fast with the
  write+flush workload, so now it is hard to draw conclusions.

- For some reason I had nr_requests set to 16K on my root drive. I have no
  idea who is setting it. Once I set it to 128, firefox with the
  write+flush workload performs much better and launch times are similar
  to sync_file_range().

- I tried to open new windows in firefox and browse the web, loading new
  websites. I would say sync_file_range() feels a little better, but
  I don't have any logical explanation and can't conclude anything yet
  from looking at traces. I am continuing to stare, though.
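
For anyone reproducing this, the queue size mentioned in the second point lives in sysfs. A hedged sketch ("sda" is an example device name; 128 is the usual default, and writing the file needs root):

```shell
# Show the current request queue size for the device.
cat /sys/block/sda/queue/nr_requests
# Restore the default of 128.
echo 128 > /sys/block/sda/queue/nr_requests
```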

So in summary, at this point in time I really can't conclude that
using sync_file_range() with ASYNC requests provides better latencies
in my setup.

I will keep at it though, and if I notice something new, I will write back.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2012-06-07 19:06 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-05-28 11:41 [GIT PULL] writeback changes for 3.5-rc1 Fengguang Wu
2012-05-28 17:09 ` Linus Torvalds
2012-05-29 15:57   ` write-behind on streaming writes Fengguang Wu
2012-05-29 15:57     ` Fengguang Wu
2012-05-29 17:35     ` Linus Torvalds
2012-05-29 17:35       ` Linus Torvalds
2012-05-30  3:21       ` Fengguang Wu
2012-05-30  3:21         ` Fengguang Wu
2012-06-05  1:01         ` Dave Chinner
2012-06-05  1:01           ` Dave Chinner
2012-06-05 17:18           ` Vivek Goyal
2012-06-05 17:18             ` Vivek Goyal
2012-06-05 17:23         ` Vivek Goyal
2012-06-05 17:23           ` Vivek Goyal
2012-06-05 17:41           ` Vivek Goyal
2012-06-05 17:41             ` Vivek Goyal
2012-06-05 18:48             ` Vivek Goyal
2012-06-05 18:48               ` Vivek Goyal
2012-06-05 20:10               ` Vivek Goyal
2012-06-05 20:10                 ` Vivek Goyal
2012-06-06  2:57                 ` Vivek Goyal
2012-06-06  2:57                   ` Vivek Goyal
2012-06-06  3:14                   ` Linus Torvalds
2012-06-06  3:14                     ` Linus Torvalds
2012-06-06 12:14                     ` Vivek Goyal
2012-06-06 12:14                       ` Vivek Goyal
2012-06-06 14:00                       ` Fengguang Wu
2012-06-06 14:00                         ` Fengguang Wu
2012-06-06 17:04                         ` Vivek Goyal
2012-06-06 17:04                           ` Vivek Goyal
2012-06-07  9:45                           ` Jan Kara
2012-06-07  9:45                             ` Jan Kara
2012-06-07 19:06                             ` Vivek Goyal
2012-06-07 19:06                               ` Vivek Goyal
2012-06-06 16:15                       ` Vivek Goyal
2012-06-06 16:15                         ` Vivek Goyal
2012-06-06 14:08                   ` Fengguang Wu
2012-06-06 14:08                     ` Fengguang Wu
