linux-kernel.vger.kernel.org archive mirror
* [patch] Converting writeback linked lists to a tree based data structure
@ 2008-01-15  8:09 Michael Rubin
  2008-01-15  8:46 ` Peter Zijlstra
       [not found] ` <E1JFRFm-00011Q-0q@localhost.localdomain>
  0 siblings, 2 replies; 34+ messages in thread
From: Michael Rubin @ 2008-01-15  8:09 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, linux-kernel, linux-mm, mrubin, wfg

From: Michael Rubin <mrubin@google.com>

For those of you who have waited so long: this is the third submission
of the first attempt at this patch. It is a trilogy.

This patch contains two changes. They are dependent on each other.

In addition we get an unintended performance improvement: syncing
1,000,000 inodes, each with 4KB of dirty data, takes the original kernel
83 seconds; with the change it takes 77 seconds.

1) Adding a data structure to guarantee fairness when writing
   inodes to disk and to simplify the code base.

   When inodes are parked for writeback they are parked in the
   flush_tree. The flush tree is a data structure based on an rb tree.

   Duplicate keys are handled by making a list in the tree for each key
   value. The order in which we choose the next inode to flush is decided
   by two fields: first, the earliest dirtied_when value; if there are
   duplicate dirtied_when values, the earliest i_flush_gen value
   determines which inode gets flushed next.

   The flush tree organizes the dirtied_when keys with the rb_tree. Any
   inodes with a duplicate dirtied_when value are linked together in a
   list. This list is sorted by the inode's i_flush_gen. When both the
   dirtied_when and the i_flush_gen are identical, the order in the
   list determines the order in which we flush the inodes. (A short
   illustrative sketch of this ordering rule appears after the list of
   changes below.)

2) Adding an inode flag that allows inodes to be marked so that they
   are never written back to disk.

   The motivation behind this change is severalfold. The first is
   to ensure fairness in the writeback algorithm. The second is to
   deal with a bug where writing to large files concurrently with
   smaller ones creates a situation where writeback cannot keep up
   with the traffic and memory balloons until we hit the threshold
   watermark. This can result in surprisingly long latencies with
   respect to disk traffic; the latency can run to minutes. The
   flush tree fixes this issue and also fixes several other minor
   fairness issues.
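
For illustration only (this is not part of the patch below, and
flush_before() is a hypothetical helper name), a minimal sketch of the
ordering rule the flush tree implements, using the dirtied_when and
i_flush_gen fields added by the patch:

    /*
     * True if inode "a" should be flushed before inode "b": the
     * earliest dirtied_when wins; ties are broken by the smallest
     * i_flush_gen; any remaining ties keep the order in which the
     * inodes were chained onto the duplicate list.
     */
    static int flush_before(struct inode *a, struct inode *b)
    {
            if (a->dirtied_when != b->dirtied_when)
                    return time_before(a->dirtied_when, b->dirtied_when);
            return a->i_flush_gen < b->i_flush_gen;
    }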

Signed-off-by: Michael Rubin <mrubin@google.com>
---

Index: 2624rc7_wb/fs/anon_inodes.c
===================================================================
--- 2624rc7_wb.orig/fs/anon_inodes.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/anon_inodes.c	2008-01-08 13:46:59.000000000 -0800
@@ -154,13 +154,7 @@ static struct inode *anon_inode_mkinode(
 
 	inode->i_fop = &anon_inode_fops;
 
-	/*
-	 * Mark the inode dirty from the very beginning,
-	 * that way it will never be moved to the dirty
-	 * list because mark_inode_dirty() will think
-	 * that it already _is_ on the dirty list.
-	 */
-	inode->i_state = I_DIRTY;
+	inode->i_state = I_WRITEBACK_NEVER;
 	inode->i_mode = S_IRUSR | S_IWUSR;
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
Index: 2624rc7_wb/fs/fs-writeback.c
===================================================================
--- 2624rc7_wb.orig/fs/fs-writeback.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/fs-writeback.c	2008-01-14 18:53:54.000000000 -0800
@@ -23,8 +23,185 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/rbtree.h>
 #include "internal.h"
 
+#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node)
+
+/*
+ * When inodes are parked for writeback they are parked in the
+ * flush_tree. The flush tree is a data structure based on an rb tree.
+ *
+ * Duplicate keys are handled by making a list in the tree for each key
+ * value. The order in which we choose the next inode to flush is decided
+ * by two fields: first, the earliest dirtied_when value; if there are
+ * duplicate dirtied_when values, the earliest i_flush_gen value
+ * determines which inode gets flushed next.
+ *
+ * The flush tree organizes the dirtied_when keys with the rb_tree. Any
+ * inodes with a duplicate dirtied_when value are linked together in a
+ * list. This list is sorted by the inode's i_flush_gen. When both the
+ * dirtied_when and the i_flush_gen are identical, the order in the
+ * list determines the order in which we flush the inodes.
+ */
+
+/*
+ * Find a rb_node matching the key in the flush tree. There are no duplicate
+ * rb_nodes in the tree. Instead they are chained off the first node.
+ */
+static struct inode *flush_tree_search(struct super_block *sb,
+				       unsigned long ts)
+{
+	struct rb_node *n = sb->s_flush_root.rb_node;
+	assert_spin_locked(&inode_lock);
+	while (n) {
+		struct inode *inode = rb_to_inode(n);
+		if (time_before(ts, inode->dirtied_when)) {
+			n = n->rb_left;
+		} else if (time_after(ts, inode->dirtied_when)) {
+			n = n->rb_right;
+		} else {
+			return inode;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * Inserting an inode into the flush tree. The tree is keyed by the
+ * dirtied_when member.
+ *
+ * If there is already a duplicate key in the tree, the new inode is put
+ * on the tail of the list hanging off that rb_node.
+ * All inserted inodes must have one of the I_DIRTY flags set.
+ */
+static void flush_tree_insert(struct super_block *sb, struct inode *inode)
+{
+	struct rb_node **new = &(sb->s_flush_root.rb_node);
+	struct rb_node *parent = NULL;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON((inode->i_state & I_DIRTY) == 0);
+	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR));
+	BUG_ON(RB_LINKED_NODE(&inode->i_flush_node));
+
+	sb->s_flush_count++;
+	inode->i_flush_gen = sb->s_flush_gen;
+
+	list_del_init(&inode->i_list);
+	while (*new) {
+		struct inode *this = rb_to_inode(*new);
+		parent = *new;
+		if (time_before(inode->dirtied_when, this->dirtied_when))
+			new = &((*new)->rb_left);
+		else if (time_after(inode->dirtied_when, this->dirtied_when))
+			new = &((*new)->rb_right);
+		else {
+			list_add_tail(&inode->i_list, &this->i_list);
+			return;
+		}
+	}
+
+	/* Add in the new node and re-balance the tree */
+	rb_link_node(&inode->i_flush_node, parent, new);
+	rb_insert_color(&inode->i_flush_node, &sb->s_flush_root);
+}
+
+/*
+ * Here we return the inode that has the smallest key in the flush tree
+ * that is greater than the parameter "prev_time".
+ */
+static struct inode *flush_tree_min_greater(struct super_block *sb,
+					    unsigned long prev_time)
+{
+	struct rb_node *node = sb->s_flush_root.rb_node;
+	struct inode *bsf = NULL;
+	/* best so far */
+	assert_spin_locked(&inode_lock);
+	while (node) {
+		struct inode *data = rb_to_inode(node);
+		/* Just trying to get lucky */
+		if ((prev_time + 1) == data->dirtied_when)
+			return data;
+		/*
+		 * If this value is greater than our prev_time and is
+		 * less than the best so far, this is our new best so far.
+		 */
+		if (time_after(data->dirtied_when, prev_time) &&
+		    (bsf ? time_after(bsf->dirtied_when, data->dirtied_when):1))
+			bsf = data;
+
+		/* Search all the way down to the bottom of the tree */
+		if (time_before(prev_time, data->dirtied_when)) {
+			node = node->rb_left;
+		} else if (time_after_eq(prev_time, data->dirtied_when)) {
+			node = node->rb_right;
+		}
+	}
+	return bsf;
+}
+
+/*
+ * Here is where we iterate to find the next inode to process. The
+ * strategy is to first look for any other inodes with the same dirtied_when
+ * value. If we have already processed that node then we need to find
+ * the next highest dirtied_when value in the tree.
+ */
+static struct inode *flush_tree_next(struct super_block *sb,
+				     struct inode *prev_inode,
+				     unsigned long prev_dirtied,
+				     unsigned long long flush_gen)
+{
+	struct inode *inode;
+	assert_spin_locked(&inode_lock);
+
+	/*
+	 * First we look to see if there is an inode with the same
+	 * dirtied_when value as our previously processed inode.
+	 */
+	inode = flush_tree_search(sb, prev_dirtied);
+
+	/*
+	 * If there is one and it has not yet been processed in this
+	 * pass (its i_flush_gen is earlier than flush_gen), return it.
+	 */
+	if (inode && (inode->i_flush_gen < flush_gen))
+		return inode;
+
+	/* If not we find the next inode that has been dirtied after this one */
+	return flush_tree_min_greater(sb, prev_dirtied);
+}
+
+/* Removing a node from the flushtree. */
+void flush_tree_remove(struct super_block *sb, struct inode *inode)
+{
+	struct rb_node *rb_node = &inode->i_flush_node;
+	struct rb_root *rb_root = &sb->s_flush_root;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON((inode->i_state & I_DIRTY) == 0);
+
+	sb->s_flush_count--;
+
+	/* There is no chain on this inode. Just remove it from the tree */
+	if (list_empty(&inode->i_list)) {
+		BUG_ON(!RB_LINKED_NODE(rb_node));
+		rb_erase(rb_node, rb_root);
+		memset(rb_node, 0, sizeof(*rb_node));
+		return;
+	}
+
+	/* This node is on a chain AND is in the rb_tree */
+	if (RB_LINKED_NODE(rb_node)) {
+		struct inode *new = list_entry(inode->i_list.next,
+					       struct inode, i_list);
+		rb_replace_node(rb_node, &new->i_flush_node, rb_root);
+		memset(rb_node, 0, sizeof(*rb_node));
+	}
+	/* Take it off the list */
+	list_del_init(&inode->i_list);
+}
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -32,11 +209,11 @@
  *	Mark an inode as dirty. Callers should use mark_inode_dirty or
  *  	mark_inode_dirty_sync.
  *
- * Put the inode on the super block's dirty list.
+ * Put the inode on the super block's flush_tree.
  *
  * CAREFUL! We mark it dirty unconditionally, but move it onto the
  * dirty list only if it is hashed or if it refers to a blockdev.
- * If it was not hashed, it will never be added to the dirty list
+ * If it was not hashed, it will never be added to the flush_tree
  * even if it is later hashed, as it will have been marked dirty already.
  *
  * In short, make sure you hash any inodes _before_ you start marking
@@ -75,6 +252,10 @@ void __mark_inode_dirty(struct inode *in
 	if ((inode->i_state & flags) == flags)
 		return;
 
+	if (inode->i_state & I_WRITEBACK_NEVER)
+		return;
+
+
 	if (unlikely(block_dump)) {
 		struct dentry *dentry = NULL;
 		const char *name = "?";
@@ -97,34 +278,36 @@ void __mark_inode_dirty(struct inode *in
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
-		inode->i_state |= flags;
-
-		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
-		 */
-		if (inode->i_state & I_SYNC)
+		/* If we are freeing this inode we cannot mark it dirty. */
+		if (inode->i_state & (I_FREEING|I_CLEAR))
 			goto out;
 
 		/*
-		 * Only add valid (hashed) inodes to the superblock's
-		 * dirty list.  Add blockdev inodes as well.
+		 * Only add valid (hashed) inodes to the flush_tree
+		 * Add blockdev inodes as well.
 		 */
 		if (!S_ISBLK(inode->i_mode)) {
 			if (hlist_unhashed(&inode->i_hash))
 				goto out;
 		}
-		if (inode->i_state & (I_FREEING|I_CLEAR))
+
+		inode->i_state |= flags;
+
+		/*
+		 * If the inode is locked, just update its dirty state.
+		 * The unlocker will place the inode into the flush_tree
+		 * based upon its state.
+		 */
+		if (inode->i_state & I_SYNC)
 			goto out;
 
 		/*
-		 * If the inode was already on s_dirty/s_io/s_more_io, don't
-		 * reposition it (that would break s_dirty time-ordering).
+		 * If the inode was already in the flush_tree, don't
+		 * reposition it (that would break flush_tree time-ordering).
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
+			flush_tree_insert(sb, inode);
 		}
 	}
 out:
@@ -140,38 +323,6 @@ static int write_inode(struct inode *ino
 	return 0;
 }
 
-/*
- * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
- * furthest end of its superblock's dirty-inode list.
- *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list.  If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
- */
-static void redirty_tail(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
-
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				tail_inode->dirtied_when))
-			inode->dirtied_when = jiffies;
-	}
-	list_move(&inode->i_list, &sb->s_dirty);
-}
-
-/*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
- */
-static void requeue_io(struct inode *inode)
-{
-	list_move(&inode->i_list, &inode->i_sb->s_more_io);
-}
-
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -181,38 +332,9 @@ static void inode_sync_complete(struct i
 	wake_up_bit(&inode->i_state, __I_SYNC);
 }
 
-/*
- * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
- */
-static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
-{
-	while (!list_empty(delaying_queue)) {
-		struct inode *inode = list_entry(delaying_queue->prev,
-						struct inode, i_list);
-		if (older_than_this &&
-			time_after(inode->dirtied_when, *older_than_this))
-			break;
-		list_move(&inode->i_list, dispatch_queue);
-	}
-}
-
-/*
- * Queue all expired dirty inodes for io, eldest first.
- */
-static void queue_io(struct super_block *sb,
-				unsigned long *older_than_this)
-{
-	list_splice_init(&sb->s_more_io, sb->s_io.prev);
-	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
-}
-
 int sb_has_dirty_inodes(struct super_block *sb)
 {
-	return !list_empty(&sb->s_dirty) ||
-	       !list_empty(&sb->s_io) ||
-	       !list_empty(&sb->s_more_io);
+	return !RB_EMPTY_ROOT(&sb->s_flush_root);
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);
 
@@ -221,7 +343,7 @@ EXPORT_SYMBOL(sb_has_dirty_inodes);
  * If `wait' is set, wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
+ * starvation of particular inodes when others are being re-dirtied, prevent
  * livelocks, etc.
  *
  * Called under inode_lock.
@@ -231,10 +353,13 @@ __sync_single_inode(struct inode *inode,
 {
 	unsigned dirty;
 	struct address_space *mapping = inode->i_mapping;
+	struct super_block *sb = inode->i_sb;
 	int wait = wbc->sync_mode == WB_SYNC_ALL;
 	int ret;
+	long pages_skipped = wbc->pages_skipped;
 
 	BUG_ON(inode->i_state & I_SYNC);
+	BUG_ON(RB_LINKED_NODE(&inode->i_flush_node));
 
 	/* Set I_SYNC, reset I_DIRTY */
 	dirty = inode->i_state & I_DIRTY;
@@ -260,60 +385,63 @@ __sync_single_inode(struct inode *inode,
 
 	spin_lock(&inode_lock);
 	inode->i_state &= ~I_SYNC;
-	if (!(inode->i_state & I_FREEING)) {
-		if (!(inode->i_state & I_DIRTY) &&
+	if (inode->i_state & I_FREEING)
+		goto out;
+
+	if (wbc->pages_skipped != pages_skipped) {
+		/*
+		 * writeback is not making progress due to locked
+		 * buffers.  Skip this inode for now.
+		 */
+		inode->i_state |= I_DIRTY_PAGES;
+		flush_tree_insert(sb, inode);
+	} else if (!(inode->i_state & I_DIRTY) &&
 		    mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+		/*
+		 * We didn't write back all the pages.  nfs_writepages()
+		 * sometimes bales out without doing anything. Redirty
+		 * the inode; then put it into the flush_tree.
+		 */
+		if (wbc->for_kupdate) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from s_io onto s_more_io/s_dirty.
-			 */
-			/*
-			 * akpm: if the caller was the kupdate function we put
-			 * this inode at the head of s_dirty so it gets first
-			 * consideration.  Otherwise, move it to the tail, for
-			 * the reasons described there.  I'm not really sure
-			 * how much sense this makes.  Presumably I had a good
-			 * reasons for doing it this way, and I'd rather not
-			 * muck with it at present.
-			 */
-			if (wbc->for_kupdate) {
-				/*
-				 * For the kupdate function we move the inode
-				 * to s_more_io so it will get more writeout as
-				 * soon as the queue becomes uncongested.
-				 */
-				inode->i_state |= I_DIRTY_PAGES;
-				requeue_io(inode);
-			} else {
-				/*
-				 * Otherwise fully redirty the inode so that
-				 * other inodes on this superblock will get some
-				 * writeout.  Otherwise heavy writing to one
-				 * file would indefinitely suspend writeout of
-				 * all the other files.
-				 */
-				inode->i_state |= I_DIRTY_PAGES;
-				redirty_tail(inode);
-			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Someone redirtied the inode while were writing back
-			 * the pages.
-			 */
-			redirty_tail(inode);
-		} else if (atomic_read(&inode->i_count)) {
-			/*
-			 * The inode is clean, inuse
+			 * For the kupdate function we leave
+			 * dirtied_when field untouched and return
+			 * it to the flush_tree. The next iteration
+			 * of kupdate will flush more pages when
+			 * the queue is no longer congested.
 			 */
-			list_move(&inode->i_list, &inode_in_use);
+			inode->i_state |= I_DIRTY_PAGES;
+			flush_tree_insert(sb, inode);
 		} else {
 			/*
-			 * The inode is clean, unused
+			 * Otherwise fully redirty the inode so that
+			 * other inodes on this superblock will get some
+			 * writeout.  Otherwise heavy writing to one
+			 * file would indefinitely suspend writeout of
+			 * all the other files.
 			 */
-			list_move(&inode->i_list, &inode_unused);
+			inode->i_state |= I_DIRTY_PAGES;
+			inode->dirtied_when = jiffies;
+			flush_tree_insert(sb, inode);
 		}
+	} else if (inode->i_state & I_DIRTY) {
+		/*
+		 * Someone redirtied the inode while we were writing back
+		 * the pages.
+		 */
+		flush_tree_insert(inode->i_sb, inode);
+	} else if (atomic_read(&inode->i_count)) {
+		/*
+		 * The inode is clean, inuse
+		 */
+		list_move(&inode->i_list, &inode_in_use);
+	} else {
+		/*
+		 * The inode is clean, unused
+		 */
+		list_move(&inode->i_list, &inode_unused);
 	}
+out:
 	inode_sync_complete(inode);
 	return ret;
 }
@@ -333,27 +461,14 @@ __writeback_single_inode(struct inode *i
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
 
+	/*
+	 * If the inode is locked and we are not going to wait for it
+	 * to be unlocked then we can just exit the routine. Since the
+	 * inode is marked I_DIRTY it will be inserted into the flush
+	 * tree by sync_single_inode when the I_SYNC is released.
+	 */
 	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
-		struct address_space *mapping = inode->i_mapping;
-		int ret;
-
-		/*
-		 * We're skipping this inode because it's locked, and we're not
-		 * doing writeback-for-data-integrity.  Move it to s_more_io so
-		 * that writeback can proceed with the other inodes on s_io.
-		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of s_io.
-		 */
-		requeue_io(inode);
-
-		/*
-		 * Even if we don't actually write the inode itself here,
-		 * we can at least start some of the data writeout..
-		 */
-		spin_unlock(&inode_lock);
-		ret = do_writepages(mapping, wbc);
-		spin_lock(&inode_lock);
-		return ret;
+		return 0;
 	}
 
 	/*
@@ -380,12 +495,9 @@ __writeback_single_inode(struct inode *i
  * If older_than_this is non-NULL, then only write out inodes which
  * had their first dirtying at a time earlier than *older_than_this.
  *
- * If we're a pdlfush thread, then implement pdflush collision avoidance
+ * If we're a pdflush thread, then implement pdflush collision avoidance
  * against the entire list.
  *
- * WB_SYNC_HOLD is a hack for sys_sync(): reattach the inode to sb->s_dirty so
- * that it can be located for waiting on in __writeback_single_inode().
- *
  * Called under inode_lock.
  *
  * If `bdi' is non-zero then we're being asked to writeback a specific queue.
@@ -398,33 +510,35 @@ __writeback_single_inode(struct inode *i
  * a queue with that address_space.  (Easy: have a global "dirty superblocks"
  * list).
  *
- * The inodes to be written are parked on sb->s_io.  They are moved back onto
- * sb->s_dirty as they are selected for writing.  This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
+ * The inodes to be written are inserted into the flush_tree.
  */
 static void
 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
+	struct inode *inode = NULL;
+	unsigned long prev_dirtied = 0;
+	unsigned long long flush_gen;
+
+	spin_lock(&inode_lock);
 
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
-		queue_io(sb, wbc->older_than_this);
+	flush_gen = ++sb->s_flush_gen;
 
-	while (!list_empty(&sb->s_io)) {
-		struct inode *inode = list_entry(sb->s_io.prev,
-						struct inode, i_list);
+	while ((inode = flush_tree_next(sb, inode,
+		prev_dirtied, flush_gen)) != NULL) {
 		struct address_space *mapping = inode->i_mapping;
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
-		long pages_skipped;
+
+		flush_tree_remove(sb, inode);
+		prev_dirtied = inode->dirtied_when;
 
 		if (!bdi_cap_writeback_dirty(bdi)) {
-			redirty_tail(inode);
 			if (sb_is_blkdev_sb(sb)) {
 				/*
 				 * Dirty memory-backed blockdev: the ramdisk
 				 * driver does this.  Skip just this inode
 				 */
+				flush_tree_insert(sb, inode);
 				continue;
 			}
 			/*
@@ -439,14 +553,14 @@ sync_sb_inodes(struct super_block *sb, s
 			wbc->encountered_congestion = 1;
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* Skip a congested fs */
-			requeue_io(inode);
+			flush_tree_insert(sb, inode);
 			continue;		/* Skip a congested blockdev */
 		}
 
 		if (wbc->bdi && bdi != wbc->bdi) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
-			requeue_io(inode);
+			flush_tree_insert(sb, inode);
 			continue;		/* blockdev has wrong queue */
 		}
 
@@ -454,36 +568,33 @@ sync_sb_inodes(struct super_block *sb, s
 		if (time_after(inode->dirtied_when, start))
 			break;
 
+		/* Was this inode dirtied too recently? */
+		if (wbc->older_than_this && time_after(inode->dirtied_when,
+						       *wbc->older_than_this))
+			break;
+
+
 		/* Is another pdflush already flushing this queue? */
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
 
 		BUG_ON(inode->i_state & I_FREEING);
 		__iget(inode);
-		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);
-		if (wbc->sync_mode == WB_SYNC_HOLD) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
-		}
 		if (current_is_pdflush())
 			writeback_release(bdi);
-		if (wbc->pages_skipped != pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode);
-		}
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
 		spin_lock(&inode_lock);
-		if (wbc->nr_to_write <= 0)
+		if (wbc->nr_to_write <= 0) {
+			inode = NULL;
 			break;
+		}
 	}
-	if (!list_empty(&sb->s_more_io))
-		wbc->more_io = 1;
+	if (inode)
+		flush_tree_insert(sb, inode);
+	spin_unlock(&inode_lock);
 	return;		/* Leave any unwritten inodes on s_io */
 }
 
@@ -492,9 +603,9 @@ sync_sb_inodes(struct super_block *sb, s
  *
  * Note:
  * We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
+ * flush_tree it hadn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the flush_tree is empty. Since
+ * __sync_single_inode() regains inode_lock before it finally moves
  * inode from superblock lists we are OK.
  *
  * If `older_than_this' is non-zero then only flush inodes which have a
@@ -527,9 +638,7 @@ restart:
 			 */
 			if (down_read_trylock(&sb->s_umount)) {
 				if (sb->s_root) {
-					spin_lock(&inode_lock);
 					sync_sb_inodes(sb, wbc);
-					spin_unlock(&inode_lock);
 				}
 				up_read(&sb->s_umount);
 			}
@@ -545,8 +654,7 @@ restart:
 
 /*
  * writeback and wait upon the filesystem's dirty inodes.  The caller will
- * do this in two passes - one to write, and one to wait.  WB_SYNC_HOLD is
- * used to park the written inodes on sb->s_dirty for the wait pass.
+ * do this in two passes - one to write, and one to wait.
  *
  * A finite limit is set on the number of pages which will be written.
  * To prevent infinite livelock of sys_sync().
@@ -557,7 +665,7 @@ restart:
 void sync_inodes_sb(struct super_block *sb, int wait)
 {
 	struct writeback_control wbc = {
-		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_HOLD,
+		.sync_mode	= wait ? WB_SYNC_ALL : WB_SYNC_NONE,
 		.range_start	= 0,
 		.range_end	= LLONG_MAX,
 	};
@@ -568,9 +676,7 @@ void sync_inodes_sb(struct super_block *
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
-	spin_lock(&inode_lock);
 	sync_sb_inodes(sb, &wbc);
-	spin_unlock(&inode_lock);
 }
 
 /*
@@ -654,7 +760,7 @@ void sync_inodes(int wait)
  */
 int write_inode_now(struct inode *inode, int sync)
 {
-	int ret;
+	int ret = 0;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = WB_SYNC_ALL,
@@ -666,9 +772,7 @@ int write_inode_now(struct inode *inode,
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode_lock);
-	ret = __writeback_single_inode(inode, &wbc);
-	spin_unlock(&inode_lock);
+	ret = sync_inode(inode, &wbc);
 	if (sync)
 		inode_sync_wait(inode);
 	return ret;
@@ -688,10 +792,13 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	int ret;
+	int ret = 0;
 
 	spin_lock(&inode_lock);
-	ret = __writeback_single_inode(inode, wbc);
+	if (inode->i_state & I_DIRTY) {
+		flush_tree_remove(inode->i_sb, inode);
+		ret = __writeback_single_inode(inode, wbc);
+	}
 	spin_unlock(&inode_lock);
 	return ret;
 }
Index: 2624rc7_wb/fs/hugetlbfs/inode.c
===================================================================
--- 2624rc7_wb.orig/fs/hugetlbfs/inode.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/hugetlbfs/inode.c	2008-01-08 13:47:38.000000000 -0800
@@ -383,7 +383,7 @@ static void hugetlbfs_forget_inode(struc
 	struct super_block *sb = inode->i_sb;
 
 	if (!hlist_unhashed(&inode->i_hash)) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+		if (!(inode->i_state & (I_DIRTY|I_SYNC|I_WRITEBACK_NEVER)))
 			list_move(&inode->i_list, &inode_unused);
 		inodes_stat.nr_unused++;
 		if (!sb || (sb->s_flags & MS_ACTIVE)) {
Index: 2624rc7_wb/fs/inode.c
===================================================================
--- 2624rc7_wb.orig/fs/inode.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/inode.c	2008-01-08 13:54:07.000000000 -0800
@@ -143,6 +143,7 @@ static struct inode *alloc_inode(struct 
 		inode->i_cdev = NULL;
 		inode->i_rdev = 0;
 		inode->dirtied_when = 0;
+		memset(&inode->i_flush_node, 0, sizeof(inode->i_flush_node));
 		if (security_inode_alloc(inode)) {
 			if (inode->i_sb->s_op->destroy_inode)
 				inode->i_sb->s_op->destroy_inode(inode);
@@ -337,6 +338,10 @@ static int invalidate_list(struct list_h
 		inode = list_entry(tmp, struct inode, i_sb_list);
 		invalidate_inode_buffers(inode);
 		if (!atomic_read(&inode->i_count)) {
+			if (inode->i_state & I_DIRTY) {
+				flush_tree_remove(inode->i_sb, inode);
+				inode->i_state &= ~I_DIRTY;
+			}
 			list_move(&inode->i_list, dispose);
 			inode->i_state |= I_FREEING;
 			count++;
@@ -1043,7 +1048,10 @@ EXPORT_SYMBOL(remove_inode_hash);
 void generic_delete_inode(struct inode *inode)
 {
 	const struct super_operations *op = inode->i_sb->s_op;
-
+	if ((inode->i_state & I_DIRTY)) {
+		flush_tree_remove(inode->i_sb, inode);
+		inode->i_state &= ~I_DIRTY;
+	}
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
@@ -1080,7 +1088,7 @@ static void generic_forget_inode(struct 
 	struct super_block *sb = inode->i_sb;
 
 	if (!hlist_unhashed(&inode->i_hash)) {
-		if (!(inode->i_state & (I_DIRTY|I_SYNC)))
+		if (!(inode->i_state & (I_DIRTY|I_SYNC|I_WRITEBACK_NEVER)))
 			list_move(&inode->i_list, &inode_unused);
 		inodes_stat.nr_unused++;
 		if (sb->s_flags & MS_ACTIVE) {
Index: 2624rc7_wb/fs/pipe.c
===================================================================
--- 2624rc7_wb.orig/fs/pipe.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/pipe.c	2008-01-08 13:55:15.000000000 -0800
@@ -930,13 +930,7 @@ static struct inode * get_pipe_inode(voi
 	pipe->readers = pipe->writers = 1;
 	inode->i_fop = &rdwr_pipe_fops;
 
-	/*
-	 * Mark the inode dirty from the very beginning,
-	 * that way it will never be moved to the dirty
-	 * list because "mark_inode_dirty()" will think
-	 * that it already _is_ on the dirty list.
-	 */
-	inode->i_state = I_DIRTY;
+	inode->i_state = I_WRITEBACK_NEVER;
 	inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
Index: 2624rc7_wb/fs/super.c
===================================================================
--- 2624rc7_wb.orig/fs/super.c	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/fs/super.c	2008-01-08 15:04:11.000000000 -0800
@@ -61,9 +61,7 @@ static struct super_block *alloc_super(s
 			s = NULL;
 			goto out;
 		}
-		INIT_LIST_HEAD(&s->s_dirty);
-		INIT_LIST_HEAD(&s->s_io);
-		INIT_LIST_HEAD(&s->s_more_io);
+		s->s_flush_root = RB_ROOT;
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
@@ -103,6 +101,7 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+	mutex_destroy(&s->s_flush_lock);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s);
Index: 2624rc7_wb/include/linux/fs.h
===================================================================
--- 2624rc7_wb.orig/include/linux/fs.h	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/fs.h	2008-01-08 14:57:42.000000000 -0800
@@ -280,6 +280,7 @@ extern int dir_notify_enable;
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/radix-tree.h>
+#include <linux/rbtree.h>
 #include <linux/prio_tree.h>
 #include <linux/init.h>
 #include <linux/pid.h>
@@ -592,6 +593,8 @@ struct inode {
 	struct hlist_node	i_hash;
 	struct list_head	i_list;
 	struct list_head	i_sb_list;
+	struct rb_node		i_flush_node;
+	unsigned long long	i_flush_gen;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
 	atomic_t		i_count;
@@ -1003,9 +1006,10 @@ struct super_block {
 	struct xattr_handler	**s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
-	struct list_head	s_dirty;	/* dirty inodes */
-	struct list_head	s_io;		/* parked for writeback */
-	struct list_head	s_more_io;	/* parked for more writeback */
+	struct rb_root		s_flush_root;
+	unsigned long		s_flush_count;
+	unsigned long long	s_flush_gen;
+
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
 
@@ -1308,6 +1312,8 @@ struct super_operations {
  *			of inode dirty data.  Having a seperate lock for this
  *			purpose reduces latency and prevents some filesystem-
  *			specific deadlocks.
+ *I_WRITEBACK_NEVER	For inodes that we may dirty data to but never
+ *			want written back.
  *
  * Q: Why does I_DIRTY_DATASYNC exist?  It appears as if it could be replaced
  *    by (I_DIRTY_SYNC|I_DIRTY_PAGES).
@@ -1326,6 +1332,7 @@ struct super_operations {
 #define I_LOCK			(1 << __I_LOCK)
 #define __I_SYNC		8
 #define I_SYNC			(1 << __I_SYNC)
+#define I_WRITEBACK_NEVER	512
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
Index: 2624rc7_wb/include/linux/rbtree.h
===================================================================
--- 2624rc7_wb.orig/include/linux/rbtree.h	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/rbtree.h	2008-01-09 01:41:30.000000000 -0800
@@ -135,6 +135,9 @@ static inline void rb_set_color(struct r
 #define RB_EMPTY_ROOT(root)	((root)->rb_node == NULL)
 #define RB_EMPTY_NODE(node)	(rb_parent(node) == node)
 #define RB_CLEAR_NODE(node)	(rb_set_parent(node, node))
+#define RB_LINKED_NODE(node)	((node)->rb_parent_color || \
+				 (node)->rb_left || (node)->rb_right)
+
 
 extern void rb_insert_color(struct rb_node *, struct rb_root *);
 extern void rb_erase(struct rb_node *, struct rb_root *);
Index: 2624rc7_wb/include/linux/writeback.h
===================================================================
--- 2624rc7_wb.orig/include/linux/writeback.h	2008-01-06 13:45:38.000000000 -0800
+++ 2624rc7_wb/include/linux/writeback.h	2008-01-08 14:04:38.000000000 -0800
@@ -30,7 +30,6 @@ static inline int task_is_pdflush(struct
 enum writeback_sync_modes {
 	WB_SYNC_NONE,	/* Don't wait on anything */
 	WB_SYNC_ALL,	/* Wait on every mapping */
-	WB_SYNC_HOLD,	/* Hold the inode on sb_dirty for sys_sync() */
 };
 
 /*
@@ -72,6 +71,8 @@ void writeback_inodes(struct writeback_c
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
 void sync_inodes(int wait);
+void flush_tree_remove(struct super_block *sb, struct inode *inode);
+
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-15  8:09 [patch] Converting writeback linked lists to a tree based data structure Michael Rubin
@ 2008-01-15  8:46 ` Peter Zijlstra
  2008-01-15 17:53   ` Michael Rubin
       [not found] ` <E1JFRFm-00011Q-0q@localhost.localdomain>
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Zijlstra @ 2008-01-15  8:46 UTC (permalink / raw)
  To: Michael Rubin; +Cc: akpm, linux-kernel, linux-mm, wfg


On Tue, 2008-01-15 at 00:09 -0800, Michael Rubin wrote:
> From: Michael Rubin <mrubin@google.com>
> 
> For those of you who have waited so long. This is the third submission
> of the first attempt at this patch. It is a trilogy.

Just a quick question, how does this interact/depend upon etc. with
Fengguang's patches I still have in my mailbox? (Those from Dec 28th)






^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-15  8:46 ` Peter Zijlstra
@ 2008-01-15 17:53   ` Michael Rubin
       [not found]     ` <E1JEyWa-0001Ys-F9@localhost.localdomain>
  0 siblings, 1 reply; 34+ messages in thread
From: Michael Rubin @ 2008-01-15 17:53 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: akpm, linux-kernel, linux-mm, wfg

On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> Just a quick question, how does this interact/depend-uppon etc.. with
> Fengguangs patches I still have in my mailbox? (Those from Dec 28th)

They don't. They apply to a 2.6.24rc7 tree. This is a candidate for 2.6.25.

This work was done before Fengguang's patches. I am trying to test
Fengguang's for comparison but am having problems with getting mm1 to
boot on my systems.

mrubin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]     ` <E1JEyWa-0001Ys-F9@localhost.localdomain>
@ 2008-01-16  3:01       ` Fengguang Wu
  2008-01-16  3:44         ` Andrew Morton
  2008-01-16 18:55         ` Michael Rubin
  0 siblings, 2 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-16  3:01 UTC (permalink / raw)
  To: Michael Rubin; +Cc: Peter Zijlstra, akpm, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > Just a quick question, how does this interact/depend-uppon etc.. with
> > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> 
> They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> 
> This work was done before Fengguang's patches. I am trying to test
> Fengguang's for comparison but am having problems with getting mm1 to
> boot on my systems.

Yeah, they are independent ones. The initial motivation is to fix the
bug "sluggish writeback on small+large files". Michael introduced
a new rbtree, and I introduced a new list (s_more_io_wait).

Basically I think an rbtree is overkill for time-based ordering.
Sorry, Michael, but s_dirty would be enough for that. Plus, s_more_io
provides fair queuing between small/large files, and s_more_io_wait
provides a waiting mechanism for blocked inodes.

The time-ordered rbtree may delay io for a blocked inode simply by
modifying its dirtied_when and reinserting it. But it would no longer
be that easy if it were ordered by location.
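
(Illustration only, not code from either patchset: with the time-ordered
tree that retry could look roughly like

    flush_tree_remove(sb, inode);
    inode->dirtied_when = jiffies + HZ;  /* arbitrary one-second delay */
    flush_tree_insert(sb, inode);

whereas a location-keyed tree offers no such knob to turn.)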

If we are going to do location based ordering in the future, the lists
will continue to be useful. It would simply be a matter of switching
from s_dirty (ordered by time) to some rbtree or radix tree (ordered
by location).

We can even provide both orderings at the same time to different
fs/inodes, configurable by the user, because the s_dirty list and/or
rbtree would provide _only_ ordering (not fairness or waiting) and
hence are interchangeable.

This patchset could be a good reference. It does location based
ordering with radix tree:

[RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>

Thank you,
Fengguang


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  3:01       ` Fengguang Wu
@ 2008-01-16  3:44         ` Andrew Morton
       [not found]           ` <E1JEzqb-0003YX-Rg@localhost.localdomain>
  2008-01-16  7:55           ` David Chinner
  2008-01-16 18:55         ` Michael Rubin
  1 sibling, 2 replies; 34+ messages in thread
From: Andrew Morton @ 2008-01-16  3:44 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > 
> > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > 
> > This work was done before Fengguang's patches. I am trying to test
> > Fengguang's for comparison but am having problems with getting mm1 to
> > boot on my systems.
> 
> Yeah, they are independent ones. The initial motivation is to fix the
> bug "sluggish writeback on small+large files". Michael introduced
> a new rbtree, and me introduced a new list(s_more_io_wait).
> 
> Basically I think rbtree is an overkill to do time based ordering.
> Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> provides fair queuing between small/large files, and s_more_io_wait
> provides waiting mechanism for blocked inodes.
> 
> The time ordered rbtree may delay io for a blocked inode simply by
> modifying its dirtied_when and reinsert it. But it would no longer be
> that easy if it is to be ordered by location.

What does the term "ordered by location" mean?  Attempting to sort inodes by
physical disk address?  By using their i_ino as a key?

That sounds optimistic.

> If we are going to do location based ordering in the future, the lists
> will continue to be useful. It would simply be a matter of switching
> from the s_dirty(order by time) to some rbtree or radix tree(order by
> location).
> 
> We can even provide both ordering at the same time to different
> fs/inodes which is configurable by the user. Because the s_dirty
> and/or rbtree would provide _only_ ordering(not faireness or waiting)
> and hence is interchangeable.
> 
> This patchset could be a good reference. It does location based
> ordering with radix tree:
> 
> [RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>

list_heads are just the wrong data structure for this function.  Especially
list_heads which are protected by a non-sleeping lock.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]           ` <E1JEzqb-0003YX-Rg@localhost.localdomain>
@ 2008-01-16  4:25             ` Fengguang Wu
  2008-01-16  4:42               ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2008-01-16  4:25 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > 
> > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > 
> > > This work was done before Fengguang's patches. I am trying to test
> > > Fengguang's for comparison but am having problems with getting mm1 to
> > > boot on my systems.
> > 
> > Yeah, they are independent ones. The initial motivation is to fix the
> > bug "sluggish writeback on small+large files". Michael introduced
> > a new rbtree, and me introduced a new list(s_more_io_wait).
> > 
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> > 
> > The time ordered rbtree may delay io for a blocked inode simply by
> > modifying its dirtied_when and reinsert it. But it would no longer be
> > that easy if it is to be ordered by location.
> 
> What does the term "ordered by location" mean?  Attemting to sort inodes by
> physical disk address?  By using their i_ino as a key?
> 
> That sounds optimistic.

Yes, exactly. Think about email servers with lots of dirty files.

> > If we are going to do location based ordering in the future, the lists
> > will continue to be useful. It would simply be a matter of switching
> > from the s_dirty(order by time) to some rbtree or radix tree(order by
> > location).
> > 
> > We can even provide both ordering at the same time to different
> > fs/inodes which is configurable by the user. Because the s_dirty
> > and/or rbtree would provide _only_ ordering(not faireness or waiting)
> > and hence is interchangeable.
> > 
> > This patchset could be a good reference. It does location based
> > ordering with radix tree:
> > 
> > [RFC][PATCH] clustered writeback <http://lkml.org/lkml/2007/8/27/45>
> 
> list_heads are just the wrong data structure for this function.  Especially
> list_heads which are protected by a non-sleeping lock.

list_heads are OK if we use them for one and only one function. We have
been trying to jam too much into s_dirty in the past.  Grabbing a
refcount could be better than locking - anyway if we split the
functions today, it would be easy to replace the list_heads one by
one in the future.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  4:25             ` Fengguang Wu
@ 2008-01-16  4:42               ` Andrew Morton
       [not found]                 ` <E1JF0It-0000yD-Mi@localhost.localdomain>
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2008-01-16  4:42 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> list_heads are OK if we use them for one and only function.

Not really.  They're inappropriate when you wish to remember your
position in the list while you dropped the lock (as we must do in
writeback).

A data structure which permits us to iterate across the search key rather
than across the actual storage locations is more appropriate.
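
(For illustration, this is roughly what the flush tree in Michael's patch
does: the walker remembers only the last dirtied_when key it handled and
looks it up again after retaking inode_lock, along the lines of

    prev_dirtied = inode->dirtied_when;
    spin_unlock(&inode_lock);
    /* write the inode out, possibly sleeping */
    spin_lock(&inode_lock);
    inode = flush_tree_next(sb, inode, prev_dirtied, flush_gen);

so the scan survives the previous node being removed in the meantime.)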

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]                 ` <E1JF0It-0000yD-Mi@localhost.localdomain>
@ 2008-01-16  4:55                   ` Fengguang Wu
  2008-01-16  5:51                     ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2008-01-16  4:55 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > list_heads are OK if we use them for one and only function.
> 
> Not really.  They're inappropriate when you wish to remember your
> position in the list while you dropped the lock (as we must do in
> writeback).
> 
> A data structure which permits us to interate across the search key rather
> than across the actual storage locations is more appropriate.

I totally agree with you. What I mean is to first do the split of
functions - into three: ordering, starvation prevention, and blockade
waiting. Then to do better ordering by adopting a radix tree (or an
rbtree if a radix tree is not enough), and lastly get rid of the
list_heads to avoid locking. Does that sound like a good path?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  4:55                   ` Fengguang Wu
@ 2008-01-16  5:51                     ` Andrew Morton
       [not found]                       ` <E1JF4Ey-0000x4-5p@localhost.localdomain>
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2008-01-16  5:51 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

> On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> > On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > 
> > > list_heads are OK if we use them for one and only function.
> > 
> > Not really.  They're inappropriate when you wish to remember your
> > position in the list while you dropped the lock (as we must do in
> > writeback).
> > 
> > A data structure which permits us to interate across the search key rather
> > than across the actual storage locations is more appropriate.
> 
> I totally agree with you. What I mean is to first do the split of
> functions - into three: ordering, starvation prevention, and blockade
> waiting.

Does "ordering" here refer to ordering bt time-of-first-dirty?

What is "blockade waiting"?

> Then to do better ordering by adopting radix tree(or rbtree
> if radix tree is not enough),

ordering of what?

> and lastly get rid of the list_heads to
> avoid locking. Does it sound like a good path?

I'd have thought that replacing list_heads with another data structure
would be a single commit.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  3:44         ` Andrew Morton
       [not found]           ` <E1JEzqb-0003YX-Rg@localhost.localdomain>
@ 2008-01-16  7:55           ` David Chinner
  2008-01-16  8:13             ` Andrew Morton
  1 sibling, 1 reply; 34+ messages in thread
From: David Chinner @ 2008-01-16  7:55 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Fengguang Wu, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > 
> > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > 
> > > This work was done before Fengguang's patches. I am trying to test
> > > Fengguang's for comparison but am having problems with getting mm1 to
> > > boot on my systems.
> > 
> > Yeah, they are independent ones. The initial motivation is to fix the
> > bug "sluggish writeback on small+large files". Michael introduced
> > a new rbtree, and me introduced a new list(s_more_io_wait).
> > 
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> > 
> > The time ordered rbtree may delay io for a blocked inode simply by
> > modifying its dirtied_when and reinsert it. But it would no longer be
> > that easy if it is to be ordered by location.
> 
> What does the term "ordered by location" mean?  Attemting to sort inodes by
> physical disk address?  By using their i_ino as a key?
> 
> That sounds optimistic.

In XFS, the inode number is an encoding of its location on disk, so
ordering inode writeback by inode number *does* make sense.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  7:55           ` David Chinner
@ 2008-01-16  8:13             ` Andrew Morton
       [not found]               ` <E1JF7yp-0006l8-5P@localhost.localdomain>
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2008-01-16  8:13 UTC (permalink / raw)
  To: David Chinner
  Cc: Fengguang Wu, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, 16 Jan 2008 18:55:38 +1100 David Chinner <dgc@sgi.com> wrote:

> On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > 
> > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > > 
> > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > > 
> > > > This work was done before Fengguang's patches. I am trying to test
> > > > Fengguang's for comparison but am having problems with getting mm1 to
> > > > boot on my systems.
> > > 
> > > Yeah, they are independent ones. The initial motivation is to fix the
> > > bug "sluggish writeback on small+large files". Michael introduced
> > > a new rbtree, and me introduced a new list(s_more_io_wait).
> > > 
> > > Basically I think rbtree is an overkill to do time based ordering.
> > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > > provides fair queuing between small/large files, and s_more_io_wait
> > > provides waiting mechanism for blocked inodes.
> > > 
> > > The time ordered rbtree may delay io for a blocked inode simply by
> > > modifying its dirtied_when and reinsert it. But it would no longer be
> > > that easy if it is to be ordered by location.
> > 
> > What does the term "ordered by location" mean?  Attemting to sort inodes by
> > physical disk address?  By using their i_ino as a key?
> > 
> > That sounds optimistic.
> 
> In XFS, inode number is an encoding of it's location on disk, so
> ordering inode writeback by inode number *does* make sense.

This code is mainly concerned with writing pagecache data, not inodes.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]                       ` <E1JF4Ey-0000x4-5p@localhost.localdomain>
@ 2008-01-16  9:07                         ` Fengguang Wu
  2008-01-18  7:36                           ` Mike Waychison
  2008-01-16 22:35                         ` David Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2008-01-16  9:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> 
> > On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > 
> > > > list_heads are OK if we use them for one and only function.
> > > 
> > > Not really.  They're inappropriate when you wish to remember your
> > > position in the list while you dropped the lock (as we must do in
> > > writeback).
> > > 
> > > A data structure which permits us to interate across the search key rather
> > > than across the actual storage locations is more appropriate.
> > 
> > I totally agree with you. What I mean is to first do the split of
> > functions - into three: ordering, starvation prevention, and blockade
> > waiting.
> 
> Does "ordering" here refer to ordering bt time-of-first-dirty?

Ordering by dirtied_when or i_ino, either is OK.

> What is "blockade waiting"?

Some inodes/pages cannot be synced now for some reason and should be
retried after a while.

> > Then to do better ordering by adopting radix tree(or rbtree
> > if radix tree is not enough),
> 
> ordering of what?

Switch from time to location.

> > and lastly get rid of the list_heads to
> > avoid locking. Does it sound like a good path?
> 
> I'd have thaought that replacing list_heads with another data structure
> would be a simgle commit.

That would be easy. s_more_io and s_more_io_wait can all be converted
to radix trees.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]               ` <E1JF7yp-0006l8-5P@localhost.localdomain>
@ 2008-01-16 13:06                 ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-16 13:06 UTC (permalink / raw)
  To: Andrew Morton
  Cc: David Chinner, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, Jan 16, 2008 at 12:13:43AM -0800, Andrew Morton wrote:
> On Wed, 16 Jan 2008 18:55:38 +1100 David Chinner <dgc@sgi.com> wrote:
> 
> > On Tue, Jan 15, 2008 at 07:44:15PM -0800, Andrew Morton wrote:
> > > On Wed, 16 Jan 2008 11:01:08 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > > 
> > > > On Tue, Jan 15, 2008 at 09:53:42AM -0800, Michael Rubin wrote:
> > > > > On Jan 15, 2008 12:46 AM, Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> > > > > > Just a quick question, how does this interact/depend-uppon etc.. with
> > > > > > Fengguangs patches I still have in my mailbox? (Those from Dec 28th)
> > > > > 
> > > > > They don't. They apply to a 2.6.24rc7 tree. This is a candidte for 2.6.25.
> > > > > 
> > > > > This work was done before Fengguang's patches. I am trying to test
> > > > > Fengguang's for comparison but am having problems with getting mm1 to
> > > > > boot on my systems.
> > > > 
> > > > Yeah, they are independent ones. The initial motivation is to fix the
> > > > bug "sluggish writeback on small+large files". Michael introduced
> > > > a new rbtree, and me introduced a new list(s_more_io_wait).
> > > > 
> > > > Basically I think rbtree is an overkill to do time based ordering.
> > > > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > > > provides fair queuing between small/large files, and s_more_io_wait
> > > > provides waiting mechanism for blocked inodes.
> > > > 
> > > > The time ordered rbtree may delay io for a blocked inode simply by
> > > > modifying its dirtied_when and reinsert it. But it would no longer be
> > > > that easy if it is to be ordered by location.
> > > 
> > > What does the term "ordered by location" mean?  Attemting to sort inodes by
> > > physical disk address?  By using their i_ino as a key?
> > > 
> > > That sounds optimistic.
> > 
> > In XFS, inode number is an encoding of it's location on disk, so
> > ordering inode writeback by inode number *does* make sense.
> 
> This code is mainly concerned with writing pagecache data, not inodes.

Now I tend to agree with you. It is not easy to sort writeback pages
by disk location. There are three main obstacles:
- there can be multiple filesystems on the same disk
- ext3/reiserfs/... make use of blockdev inode, which spans the whole
  partition;
- xfs may store inode metadata/data separately;


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  3:01       ` Fengguang Wu
  2008-01-16  3:44         ` Andrew Morton
@ 2008-01-16 18:55         ` Michael Rubin
       [not found]           ` <E1JFLTR-0002pn-4Y@localhost.localdomain>
  1 sibling, 1 reply; 34+ messages in thread
From: Michael Rubin @ 2008-01-16 18:55 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Peter Zijlstra, akpm, linux-kernel, linux-mm

On Jan 15, 2008 7:01 PM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> Basically I think rbtree is an overkill to do time based ordering.
> Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> provides fair queuing between small/large files, and s_more_io_wait
> provides waiting mechanism for blocked inodes.

I think the flush_tree (which is a little more than just an rbtree)
provides the same queuing mechanisms that the three or four list_heads
do and manages to do it in one structure. The i_flushed_when
provides the ability to have blocked inodes wait their turn, so to
speak.

Another motivation behind the rbtree patch is to unify the data
structure that handles the priority and mechanism of how we write out
the pages of the inodes. There are some ideas about introducing
priority schemes for QOS and such in the future. I am not saying this
patch is about making that happen, but the idea is, if possible, to
unify the four stages of lists into a single structure to facilitate
efforts like that.
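
To give a feel for the shape of it, the structure is roughly the
following (a simplified sketch, not the code from the patch; the
helper and the use of i_list here are only illustrative):

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/rbtree.h>

struct flush_tree_node {
	struct rb_node		rb;		/* keyed by dirtied_when */
	unsigned long		dirtied_when;
	struct list_head	inodes;		/* ties, kept in i_flush_gen order */
};

/*
 * Next inode to write back: smallest dirtied_when first; within a
 * duplicate dirtied_when, the head of the list (smallest i_flush_gen).
 */
static struct inode *flush_tree_next(struct rb_root *root)
{
	struct rb_node *first = rb_first(root);
	struct flush_tree_node *node;

	if (!first)
		return NULL;
	node = rb_entry(first, struct flush_tree_node, rb);
	return list_entry(node->inodes.next, struct inode, i_list);
}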

mrubin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]                       ` <E1JF4Ey-0000x4-5p@localhost.localdomain>
  2008-01-16  9:07                         ` Fengguang Wu
@ 2008-01-16 22:35                         ` David Chinner
       [not found]                           ` <E1JFLEW-0002oE-G1@localhost.localdomain>
  1 sibling, 1 reply; 34+ messages in thread
From: David Chinner @ 2008-01-16 22:35 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > Then to do better ordering by adopting radix tree(or rbtree
> > > if radix tree is not enough),
> > 
> > ordering of what?
> 
> Switch from time to location.

Note that data writeback may be adversely affected by location
based writeback rather than time based writeback - think of
the effect of location based data writeback on an app that
creates lots of short term (<30s) temp files and then removes
them before they are written back.

Also, data writeback location cannot be easily derived from
the inode number in pretty much all cases. "near" in terms
of XFS means the same AG which means the data could be up to
a TB away from the inode, and if you have >1TB filesystems
using the default inode32 allocator, file data is *never*
placed near the inode - the inodes are in the first TB of
the filesystem, the data is rotored around the rest of the
filesystem.

And with delayed allocation, you don't know where the data is even
going to be written ahead of the filesystem ->writepage call, so you
can't do optimal location ordering for data in this case.

> > > and lastly get rid of the list_heads to
> > > avoid locking. Does it sound like a good path?
> > 
> > I'd have thaought that replacing list_heads with another data structure
> > would be a simgle commit.
> 
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.

Makes sense for location based writeback of the inodes themselves,
but not for data.

Hmmmm - I'm wondering if we'd do better to split data writeback from
inode writeback. i.e. we do two passes.  The first pass writes all
the data back in time order, the second pass writes all the inodes
back in location order.

Right now we interleave data and inode writeback, (i.e.  we do data,
inode, data, inode, data, inode, ....). I'd much prefer to see all
data written out first, then the inodes. ->writepage often dirties
the inode and hence if we need to do multiple do_writepages() calls
on an inode to flush all the data (e.g. congestion, large amounts of
data to be written, etc), we really shouldn't be calling
write_inode() after every do_writepages() call. The inode
should not be written until all the data is written....
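
i.e. something along these lines - purely illustrative, and glossing
over locking, requeueing and congestion handling:

#include <linux/fs.h>
#include <linux/writeback.h>

/* Two passes per superblock: data in time order, then the inodes. */
static void writeback_sb_two_pass(struct super_block *sb,
				  struct writeback_control *wbc)
{
	struct inode *inode;

	/* pass 1: push dirty data, oldest dirtied_when first */
	list_for_each_entry(inode, &sb->s_io, i_list)
		do_writepages(inode->i_mapping, wbc);

	/* pass 2: write the inodes themselves (ideally sorted by
	 * location), only after ->writepage has finished dirtying them */
	list_for_each_entry(inode, &sb->s_io, i_list)
		write_inode_now(inode, 0);
}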

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]                           ` <E1JFLEW-0002oE-G1@localhost.localdomain>
@ 2008-01-17  3:16                             ` Fengguang Wu
  2008-01-17  5:21                             ` David Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-17  3:16 UTC (permalink / raw)
  To: David Chinner
  Cc: Andrew Morton, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > if radix tree is not enough),
> > > 
> > > ordering of what?
> > 
> > Switch from time to location.
> 
> Note that data writeback may be adversely affected by location
> based writeback rather than time based writeback - think of
> the effect of location based data writeback on an app that
> creates lots of short term (<30s) temp files and then removes
> them before they are written back.

A small(e.g. 5s) time window can still be enforced, but...

> Also, data writeback locatio cannot be easily derived from
> the inode number in pretty much all cases. "near" in terms
> of XFS means the same AG which means the data could be up to
> a TB away from the inode, and if you have >1TB filesystems
> usingthe default inode32 allocator, file data is *never*
> placed near the inode - the inodes are in the first TB of
> the filesystem, the data is rotored around the rest of the
> filesystem.
> 
> And with delayed allocation, you don't know where the data is even
> going to be written ahead of the filesystem ->writepage call, so you
> can't do optimal location ordering for data in this case.

Agreed.

> Hmmmm - I'm wondering if we'd do better to split data writeback from
> inode writeback. i.e. we do two passes.  The first pass writes all
> the data back in time order, the second pass writes all the inodes
> back in location order.
> 
> Right now we interleave data and inode writeback, (i.e.  we do data,
> inode, data, inode, data, inode, ....). I'd much prefer to see all
> data written out first, then the inodes. ->writepage often dirties
> the inode and hence if we need to do multiple do_writepages() calls
> on an inode to flush all the data (e.g. congestion, large amounts of
> data to be written, etc), we really shouldn't be calling
> write_inode() after every do_writepages() call. The inode
> should not be written until all the data is written....

That may do good for XFS. Another case is documented as follows:
"the write_inode() function of a typical fs will perform no I/O, but
will mark buffers in the blockdev mapping as dirty."


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]           ` <E1JFLTR-0002pn-4Y@localhost.localdomain>
@ 2008-01-17  3:31             ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-17  3:31 UTC (permalink / raw)
  To: Michael Rubin; +Cc: Peter Zijlstra, akpm, linux-kernel, linux-mm

On Wed, Jan 16, 2008 at 10:55:28AM -0800, Michael Rubin wrote:
> On Jan 15, 2008 7:01 PM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > Basically I think rbtree is an overkill to do time based ordering.
> > Sorry, Michael. But s_dirty would be enough for that. Plus, s_more_io
> > provides fair queuing between small/large files, and s_more_io_wait
> > provides waiting mechanism for blocked inodes.
> 
> I think the flush_tree (which is a little more than just an rbtree)
> provides the same queuing mechanisms that the three or four lists
> heads do and manages to do it in one structure. The i_flushed_when
> provides the ability to have blocked inodes wait their turn so to
> speak.
> 
> Another motivation behind the rbtree patch is to unify the data
> structure that handles the priority and mechanism of how we write out
> the pages of the inodes. There are some ideas about introducing
> priority schemes for QOS and such in the future. I am not saying this
> patch is about making that happen, but the idea is to if possible
> unify the four stages of lists into a single structure to facilitate
> efforts like that.

Yeah, rbtree is better than list_heads after all. Let's make it happen.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]                           ` <E1JFLEW-0002oE-G1@localhost.localdomain>
  2008-01-17  3:16                             ` Fengguang Wu
@ 2008-01-17  5:21                             ` David Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: David Chinner @ 2008-01-17  5:21 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: David Chinner, Andrew Morton, Michael Rubin, Peter Zijlstra,
	linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 11:16:00AM +0800, Fengguang Wu wrote:
> On Thu, Jan 17, 2008 at 09:35:10AM +1100, David Chinner wrote:
> > On Wed, Jan 16, 2008 at 05:07:20PM +0800, Fengguang Wu wrote:
> > > On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
> > > > > Then to do better ordering by adopting radix tree(or rbtree
> > > > > if radix tree is not enough),
> > > > 
> > > > ordering of what?
> > > 
> > > Switch from time to location.
> > 
> > Note that data writeback may be adversely affected by location
> > based writeback rather than time based writeback - think of
> > the effect of location based data writeback on an app that
> > creates lots of short term (<30s) temp files and then removes
> > them before they are written back.
> 
> A small(e.g. 5s) time window can still be enforced, but...

Yes, you could, but that will then result in non-deterministic
performance for repeated workloads because the order of file
writeback will not be consistent.

e.g.  the first run is fast because the output file is at lower
offset than the temp file meaning the temp file gets deleted
without being written.

The second run is slow because the location of the files is
reversed and the temp file is written to disk before the
final output file and hence the run is much slower because
it writes much more.

The third run is also slow, but the files are like the first
fast run. However, pdflush tries to write the temp file back
within 5s of it being dirtied so it skips it and writes
the output file first.

The difference between the first+second case can be found by
knowing that inode number determines writeback order, but
there is no obvious clue as to why the first+third runs are
different.

This is exactly the sort of non-deterministic behaviour we 
want to avoid in a writeback algorithm.

> > Hmmmm - I'm wondering if we'd do better to split data writeback from
> > inode writeback. i.e. we do two passes.  The first pass writes all
> > the data back in time order, the second pass writes all the inodes
> > back in location order.
> > 
> > Right now we interleave data and inode writeback, (i.e.  we do data,
> > inode, data, inode, data, inode, ....). I'd much prefer to see all
> > data written out first, then the inodes. ->writepage often dirties
> > the inode and hence if we need to do multiple do_writepages() calls
> > on an inode to flush all the data (e.g. congestion, large amounts of
> > data to be written, etc), we really shouldn't be calling
> > write_inode() after every do_writepages() call. The inode
> > should not be written until all the data is written....
> 
> That may do good to XFS. Another case is documented as follows:
> "the write_inode() function of a typical fs will perform no I/O, but
> will mark buffers in the blockdev mapping as dirty."

Yup, but in that situation ->write_inode() does not do any I/O, so
it will work with any high level inode writeback ordering or timing
scheme equally well.  As a result, that's not the case we need to
optimise at all.

FWIW, the NFS client is likely to work better with split data/
inode writeback as it also has to mark the inode dirty on async
write completion (to get ->write_inode called to issue a commit
RPC). Hence delaying the inode write until after all the data
is written makes sense there as well....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found] ` <E1JFRFm-00011Q-0q@localhost.localdomain>
@ 2008-01-17  9:41   ` Fengguang Wu
  2008-01-17 21:07     ` Michael Rubin
  0 siblings, 1 reply; 34+ messages in thread
From: Fengguang Wu @ 2008-01-17  9:41 UTC (permalink / raw)
  To: Michael Rubin; +Cc: a.p.zijlstra, akpm, linux-kernel, linux-mm

On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> 1) Adding a datastructure to guarantee fairness when writing
>    inodes to disk and simplify the code base.
> 
>    When inodes are parked for writeback they are parked in the
>    flush_tree. The flush tree is a data structure based on an rb tree.

The main benefit of rbtree is possibly better support of future policies.
Can you demonstrate an example?

(grumble:
Apart from the benefit of flexibility, I don't think it makes things
simpler, nor does the introduction of rbtree automatically fix bugs.
Bugs can only be avoided by good understanding of all possible cases.)

The most tricky writeback issues could be starvation prevention
between
        - small/large files
        - new/old files
        - superblocks
Some kind of limit should be applied for each. They used to be:
        - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
          this preempts big files
        - refill s_io iff it is drained
          this prevents promotion of big/old files
        - return from sync_sb_inodes() after one go of s_io
          (todo: don't restart from the first superblock in writeback_inodes())
          this prevents busy superblock from starving others
          and ensures fairness between superblocks
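
In skeleton form the old scheme was roughly (grossly simplified, not
the real fs/fs-writeback.c, locking and the per-inode details omitted):

static void sync_sb_inodes_old_sketch(struct super_block *sb,
				      struct writeback_control *wbc)
{
	/* refill s_io iff it is drained */
	if (list_empty(&sb->s_io))
		list_splice_init(&sb->s_dirty, &sb->s_io);

	while (!list_empty(&sb->s_io)) {
		/* oldest inodes sit at the tail of s_io */
		struct inode *inode = list_entry(sb->s_io.prev,
						 struct inode, i_list);

		do_writepages(inode->i_mapping, wbc);
		if (wbc->nr_to_write <= 0) {
			/* MAX_WRITEBACK_PAGES reached: requeue, so a
			 * big file cannot monopolise this pass */
			list_move(&inode->i_list, &sb->s_more_io);
			break;
		}
		list_del_init(&inode->i_list);
	}
	/* return after one go of s_io: other superblocks get a turn */
}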
 
Michael, could you sort out and document the new starvation prevention schemes?

>    Duplicate keys are handled by making a list in the tree for each key
>    value. The order of how we choose the next inode to flush is decided
>    by two fields. First the earliest dirtied_when value. If there are
>    duplicate dirtied_when values then the earliest i_flush_gen value
>    determines who gets flushed next.
> 
>    The flush tree organizes the dirtied_when keys with the rb_tree. Any
>    inodes with a duplicate dirtied_when value are link listed together. This
>    link list is sorted by the inode's i_flush_gen. When both the
>    dirtied_when and the i_flush_gen are identical the order in the
>    linked list determines the order we flush the inodes.

Introduce i_flush_gen to help restarting from the last inode?
Well, it's not as simple as list_heads.

> 2) Added an inode flag to allow inodes to be marked so that they
>    are never written back to disk.
> 
>    The motivation behind this change is several fold. The first is
>    to insure fairness in the writeback algorithm. The second is to

What do you mean by fairness? Why cannot I_WRITEBACK_NEVER be in a
decoupled standalone patch?

>    deal with a bug where the writing to large files concurrently
>    to smaller ones creates a situation where writeback cannot
>    keep up with traffic and memory baloons until the we hit the
>    threshold watermark. This can result in surprising long latency
>    with respect to disk traffic. This latency can take minutes. The

>    flush tree fixes this issue and fixes several other minor issues
>    with fairness also.

More details about the fixes, please?


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-17  9:41   ` Fengguang Wu
@ 2008-01-17 21:07     ` Michael Rubin
       [not found]       ` <E1JFjGz-0001eU-3O@localhost.localdomain>
  2008-01-18  5:01       ` David Chinner
  0 siblings, 2 replies; 34+ messages in thread
From: Michael Rubin @ 2008-01-17 21:07 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: a.p.zijlstra, akpm, linux-kernel, linux-mm

On Jan 17, 2008 1:41 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> The main benefit of rbtree is possibly better support of future policies.
> Can you demonstrate an example?

These are ill-formed thoughts as of now on my end but the idea was
that keeping one tree sorted via a scheme might be simpler than
multiple list_heads.

> Bugs can only be avoided by good understanding of all possible cases.)

I take the above statement as a tautology.  And am trying my best to do so. :-)

> The most tricky writeback issues could be starvation prevention
> between


>         - small/large files
>         - new/old files
>         - superblocks

So I have written tests and believe I have covered these issues. If
you are concerned about any one in particular and have a test case,
please let me know.

> Some kind of limit should be applied for each. They used to be:
>         - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
>           this preempts big files

The patch uses the same limit.

>         - refill s_io iif it is drained
>           this prevents promotion of big/old files

Once a big file gets its first do_writepages it is moved behind the
other smaller files via i_flushed_when. And the same in reverse for
big vs old.

>         - return from sync_sb_inodes() after one go of s_io

I am not sure how this limit helps things out. Is this for superblock
starvation? Can you elaborate?

> Michael, could you sort out and document the new starvation prevention schemes?

The basic idea behind the writeback algorithm is to handle starvation.
The overarching idea is that we want to preserve the order of writeback
based on when an inode was dirtied, and also preserve the dirtied_when
contents until the inode has been written back (partially or fully).

Every sync_sb_inodes we find the least recently dirtied inodes. To deal
with large or small file starvation we have an s_flush_gen for each
iteration of sync_sb_inodes; every time we issue a writeback we mark
that the inode cannot be processed until the next s_flush_gen. This
way we don't process one inode and never get to the rest, since we keep
pushing the ones we have serviced into subsequent s_flush_gens.

Let me know if you want more detail or structured responses.

> Introduce i_flush_gen to help restarting from the last inode?
> Well, it's not as simple as list_heads.
>
> > 2) Added an inode flag to allow inodes to be marked so that they
> >    are never written back to disk.
> >
> >    The motivation behind this change is several fold. The first is
> >    to insure fairness in the writeback algorithm. The second is to
>
> What do you mean by fairness?

So originally this comment was written when I was trying to fix a bug
in 2.6.23. The one where we were starving large files from being
flushed. There was a fairness issue where small files were being
flushed but the large ones were just ballooning in memory.

> Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?

The WRITEBACK_NEVER could be in a patch preceding the rbtree one, but
not in a subsequent patch. The rbtree depends on the
WRITEBACK_NEVER patch, otherwise we run into problems in
generic_delete_inode. Now that you point it out I think I can split
this patch into two patches and put the WRITEBACK_NEVER in the first
one.

> More details about the fixings, please?

So again this comment was written against 2.6.23. The biggest fix is
the starving of big files. I remember there were other smaller issues,
but there have been so many changes in the patch sets that I need to
go back to quantify.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]       ` <E1JFjGz-0001eU-3O@localhost.localdomain>
@ 2008-01-18  4:56         ` Fengguang Wu
  2008-01-18  5:41           ` Andi Kleen
  2008-01-18  6:43           ` Michael Rubin
  0 siblings, 2 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-18  4:56 UTC (permalink / raw)
  To: Michael Rubin; +Cc: a.p.zijlstra, akpm, linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 1:41 AM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Tue, Jan 15, 2008 at 12:09:21AM -0800, Michael Rubin wrote:
> > The main benefit of rbtree is possibly better support of future policies.
> > Can you demonstrate an example?
> 
> These are ill-formed thoughts as of now on my end but the idea was
> that keeping one tree sorted via a scheme might be simpler than
> multiple list_heads.

Suppose we want to grant a longer expiration window to temp files;
adding a new list named s_dirty_tmpfile would be a handy solution.

So the question is: should we need more than 3 QoS classes?

> > The most tricky writeback issues could be starvation prevention
> > between
> 
> 
> >         - small/large files
> >         - new/old files
> >         - superblocks
> 
> So I have written tests and believe I have covered these issues. If
> you are concerned in specific on any and have a test case please let
> me know.

OK.

> > Some kind of limit should be applied for each. They used to be:
> >         - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> >           this preempts big files
> 
> The patch uses th same limit.
> 
> >         - refill s_io iif it is drained
> >           this prevents promotion of big/old files
> 
> Once a big file gets its first do_writepages it is moved behind the
> other smaller files via i_flushed_when. And the same in reverse for
> big vs old.

You mean i_flush_gen? No, sync_sb_inodes() will abort on every
MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
Hence the sync will restart from big/old files.

> 
> >         - return from sync_sb_inodes() after one go of s_io
> 
> I am not sure how this limit helps things out. Is this for superblock
> starvation? Can you elaborate?

We should have a way to go to the next superblock even if new dirty inodes
or pages are emerging fast in this superblock. Filling and draining s_io
only once and then aborting helps.

s_io is a stable and bounded working set in one go of superblock.

> > Michael, could you sort out and document the new starvation prevention schemes?
> 
> The basic idea behind the writeback algorithm to handle starvation.
> The over arching idea is that we want to preserve order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully)
> 
> Every sync_sb_inodes we find the least recent inodes dirtied. To deal
> with large or small starvation we have a s_flush_gen for each
> iteration of sync_sb_inodes every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one get to the rest since we keep pushing them
> into subsequent s_fush_gen's.
> 
> Let me know if you want more detail or structured responses.
> 
> > Introduce i_flush_gen to help restarting from the last inode?
> > Well, it's not as simple as list_heads.

Basically you make one list_head in each rbtree node.
That list_head is recycled cyclically, and is an analog to the
old-fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
So an extra indicator must be introduced - i_flush_gen. It's awkward.

We are simply repeating the aged list_heads' problem.

> > > 2) Added an inode flag to allow inodes to be marked so that they
> > >    are never written back to disk.
> > >
> > >    The motivation behind this change is several fold. The first is
> > >    to insure fairness in the writeback algorithm. The second is to
> >
> > What do you mean by fairness?
> 
> So originally this comment was written when I was trying to fix a bug
> in 2.6.23. The one where we were starving large files from being
> flushed. There was a fairness issue where small files were being
> flushed but the large ones were just ballooning in memory.

In fact the bug is turned around rather than fixed - now the small
files could be starved.

> > Why cannot I_WRITEBACK_NEVER be in a decoupled standalone patch?
> 
> The WRITEBACK_NEVER could be in a previous patch to the rbtree. But
> not a subsequent patch to the rbtree. The rbtree depends on the
> WRITEBACK_NEVER patch otherwise we run in to problems in
> generic_delete_inode. Now that you point it out I think I can split
> this patch into two patches and make the WRITEBACK_NEVER in the first
> one.

OK.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-17 21:07     ` Michael Rubin
       [not found]       ` <E1JFjGz-0001eU-3O@localhost.localdomain>
@ 2008-01-18  5:01       ` David Chinner
  2008-01-18  5:38         ` Michael Rubin
       [not found]         ` <E1JFjyv-0001hU-FA@localhost.localdomain>
  1 sibling, 2 replies; 34+ messages in thread
From: David Chinner @ 2008-01-18  5:01 UTC (permalink / raw)
  To: Michael Rubin; +Cc: Fengguang Wu, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Michael, could you sort out and document the new starvation prevention schemes?
> 
> The basic idea behind the writeback algorithm to handle starvation.
> The over arching idea is that we want to preserve order of writeback
> based on when an inode was dirtied and also preserve the dirtied_when
> contents until the inode has been written back (partially or fully)
> 
> Every sync_sb_inodes we find the least recent inodes dirtied. To deal
> with large or small starvation we have a s_flush_gen for each
> iteration of sync_sb_inodes every time we issue a writeback we mark
> that the inode cannot be processed until the next s_flush_gen. This
> way we don't process one get to the rest since we keep pushing them
> into subsequent s_fush_gen's.

This seems suboptimal for large files. If you keep feeding in
new least recently dirtied files, the large files will never
get an unimpeded go at the disk and hence we'll struggle to
get decent bandwidth under anything but pure large file
write loads.

Fairness is a tradeoff between seeks and bandwidth.  Ideally, we
want to spend 50% of the *disk* time servicing sequential writes and
50% of the time servicing seeky writes - that way neither get
penalised unfairly by the other type of workload.

Switching inodes during writeback implies a seek to the new write
location, while continuing to write the same inode has no seek
penalty because the writeback is sequential.  It follows from this
that allowing large files a disproportionate amount of data
writeback is desirable.

Also, cycling rapidly through all the large files to write 4MB to each is
going to cause us to spend time seeking rather than writing compared
to cycling slower and writing 40MB from each large file at a time.

i.e. servicing one large file for 100ms is going to result in higher
writeback throughput than servicing 10 large files for 10ms each
because there's going to be less seeking and more writing done by
the disks.

That is, think of large file writes like process scheduler batch
jobs - bulk throughput is what matters, so the larger the time slice
you give them the higher the throughput.

IMO, the sort of result we should be looking at is a
writeback design that results in cycling somewhat like:

	slice 1: iterate over small files
	slice 2: flush large file 1
	slice 3: iterate over small files
	slice 4: flush large file 2
	......
	slice n-1: flush large file N
	slice n: iterate over small files
	slice n+1: flush large file N+1

So that we keep the disk busy with a relatively fair mix of
small and large I/Os while both are necessary.

Furthermore, as disk bandwidth goes up, the relationship
between large file and small file writes changes if we want
to maintain writeback at a significant percentage of the
maximum bandwidth of the drive (which is extremely desirable).
So if we take a 4k page machine and a 1024page writeback slice,
for different disks, we get a bandwidth slice in terms of disk
seeks like:

slow disk: 20MB/s, 10ms seek (say a laptop drive)
	- 4MB write takes 200ms, or equivalent of 20 seeks

normal disk: 60MB/s, 8ms seek (desktop)
	- 4MB write takes 66ms, or equivalent of 8 seeks

fast disk: 120MB/s, 5ms seek (15krpm SAS)
	- 4MB write takes 33ms,  or equivalent of 6 seeks

small RAID5 lun: 200MB/s, 4ms seek
	- 4MB write takes 20ms, or equivalent of 5 seeks

Large RAID5 lun: 1GB/s, 2ms seek
	- 4MB write takes 4ms, or equivalent of 2 seeks
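
(Those figures are just the 4MB write time divided by the seek time;
a trivial check, nothing more:)

#include <stdio.h>

int main(void)
{
	const double slice_mb = 4;	/* 1024 pages x 4k */
	const struct { const char *name; double mb_per_s, seek_ms; } disk[] = {
		{ "laptop drive",  20, 10 },
		{ "desktop",       60,  8 },
		{ "15krpm SAS",   120,  5 },
		{ "small RAID5",  200,  4 },
		{ "large RAID5", 1000,  2 },
	};
	int i;

	for (i = 0; i < 5; i++) {
		double write_ms = slice_mb / disk[i].mb_per_s * 1000;
		printf("%-12s: %6.1fms write ~= %4.1f seeks\n",
		       disk[i].name, write_ms, write_ms / disk[i].seek_ms);
	}
	return 0;
}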

Put simply:

	The higher the bandwidth of the device, the more frequently
	we need to be servicing the inodes with large amounts of
	dirty data to be written to maintain write throughput at a
	significant percentage of the device capability.

The writeback algorithm needs to take this into account for it
to be able to scale effectively for high throughput devices.

BTW, it needs to be recognised that if we are under memory pressure
we can clean much more memory in a short period of time by writing
out all the large files first. This would clearly benefit the system
as a whole as we'd get the most pages available for reclaim as
possible in as short a time as possible. The writeback algorithm
should really have a mode that allows this sort of flush ordering to
occur....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  5:01       ` David Chinner
@ 2008-01-18  5:38         ` Michael Rubin
  2008-01-18  8:54           ` David Chinner
       [not found]         ` <E1JFjyv-0001hU-FA@localhost.localdomain>
  1 sibling, 1 reply; 34+ messages in thread
From: Michael Rubin @ 2008-01-18  5:38 UTC (permalink / raw)
  To: David Chinner; +Cc: Fengguang Wu, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Jan 17, 2008 9:01 PM, David Chinner <dgc@sgi.com> wrote:

First off thank you for the very detailed reply. This rocks and gives
me much to think about.

> On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> This seems suboptimal for large files. If you keep feeding in
> new least recently dirtied files, the large files will never
> get an unimpeded go at the disk and hence we'll struggle to
> get decent bandwidth under anything but pure large file
> write loads.

You're right. I understand now. I just  changed a dial on my tests,
ran it and found pdflush not keeping up like it should. I need to
address this.

> Switching inodes during writeback implies a seek to the new write
> location, while continuing to write the same inode has no seek
> penalty because the writeback is sequential.  It follows from this
> that allowing larges file a disproportionate amount of data
> writeback is desirable.
>
> Also, cycling rapidly through all the large files to write 4MB to each is
> going to cause us to spend time seeking rather than writing compared
> to cycling slower and writing 40MB from each large file at a time.
>
> i.e. servicing one large file for 100ms is going to result in higher
> writeback throughput than servicing 10 large files for 10ms each
> because there's going to be less seeking and more writing done by
> the disks.
>
> That is, think of large file writes like process scheduler batch
> jobs - bulk throughput is what matters, so the larger the time slice
> you give them the higher the throughput.
>
> IMO, the sort of result we should be looking at is a
> writeback design that results in cycling somewhat like:
>
>         slice 1: iterate over small files
>         slice 2: flush large file 1
>         slice 3: iterate over small files
>         slice 4: flush large file 2
>         ......
>         slice n-1: flush large file N
>         slice n: iterate over small files
>         slice n+1: flush large file N+1
>
> So that we keep the disk busy with a relatively fair mix of
> small and large I/Os while both are necessary.

I am getting where you are coming from. But if we are going to make
changes to optimize for seeks maybe we need to be more aggressive in
write back in how we organize both time and location. Right now AFAIK
there is no attention to location in the writeback path.

>         The higher the bandwidth of the device, the more frequently
>         we need to be servicing the inodes with large amounts of
>         dirty data to be written to maintain write throughput at a
>         significant percentage of the device capability.
>

Could you expand that to say it's not the inodes of large files but
the ones with data that we can exploit locality? Often large files are
fragmented. Would it make more sense to pursue cracking the inodes and
grouping their blocks's locations? Or is this all overkill and should
be handled at a lower level like the elevator?

> BTW, it needs to be recognised that if we are under memory pressure
> we can clean much more memory in a short period of time by writing
> out all the large files first. This would clearly benefit the system
> as a whole as we'd get the most pages available for reclaim as
> possible in a short a time as possible. The writeback algorithm
> should really have a mode that allows this sort of flush ordering to
> occur....

I completely agree.

mrubin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  4:56         ` Fengguang Wu
@ 2008-01-18  5:41           ` Andi Kleen
       [not found]             ` <E1JFkHy-0001jR-VD@localhost.localdomain>
  2008-01-18  7:48             ` Mike Waychison
  2008-01-18  6:43           ` Michael Rubin
  1 sibling, 2 replies; 34+ messages in thread
From: Andi Kleen @ 2008-01-18  5:41 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Michael Rubin, a.p.zijlstra, akpm, linux-kernel, linux-mm

Fengguang Wu <wfg@mail.ustc.edu.cn> writes:
>
> Suppose we want to grant longer expiration window for temp files,
> adding a new list named s_dirty_tmpfile would be a handy solution.

How would the kernel know that a file is a tmp file?

> So the question is: should we need more than 3 QoS classes?

[just a random idea; i have not worked out all the implications]

Would it be possible to derive a writeback priority from the ionice
level of the process originating the IO? e.g. we have long-standing
problems that background jobs, even when niced, can cause
significant slowdowns to foreground processes by starving IO
and pushing out pages. ionice was supposed to help with that
but in practice it does not seem to have helped too much and I suspect
it needs more prioritization higher up the VM food chain. Adding
such priorities to writeback would seem like a step in the right
direction, although it would of course not solve the problem
completely.

-Andi

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]         ` <E1JFjyv-0001hU-FA@localhost.localdomain>
@ 2008-01-18  5:41           ` Fengguang Wu
  2008-01-19  2:50           ` David Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-18  5:41 UTC (permalink / raw)
  To: David Chinner; +Cc: Michael Rubin, a.p.zijlstra, akpm, linux-kernel, linux-mm

> Fairness is a tradeoff between seeks and bandwidth.  Ideally, we
> want to spend 50% of the *disk* time servicing sequential writes and
> 50% of the time servicing seeky writes - that way neither get
> penalised unfairly by the other type of workload.
> 
> Switching inodes during writeback implies a seek to the new write
> location, while continuing to write the same inode has no seek
> penalty because the writeback is sequential.  It follows from this
> that allowing larges file a disproportionate amount of data
> writeback is desirable.
> 
> Also, cycling rapidly through all the large files to write 4MB to each is
> going to cause us to spend time seeking rather than writing compared
> to cycling slower and writing 40MB from each large file at a time.
> 
> i.e. servicing one large file for 100ms is going to result in higher
> writeback throughput than servicing 10 large files for 10ms each
> because there's going to be less seeking and more writing done by
> the disks.
> 
> That is, think of large file writes like process scheduler batch
> jobs - bulk throughput is what matters, so the larger the time slice
> you give them the higher the throughput.
> 
> IMO, the sort of result we should be looking at is a
> writeback design that results in cycling somewhat like:
> 
> 	slice 1: iterate over small files
> 	slice 2: flush large file 1
> 	slice 3: iterate over small files
> 	slice 4: flush large file 2
> 	......
> 	slice n-1: flush large file N
> 	slice n: iterate over small files
> 	slice n+1: flush large file N+1
> 
> So that we keep the disk busy with a relatively fair mix of
> small and large I/Os while both are necessary.

If we can sync fast enough, the lower layer would be able to merge
those 4MB requests. Whatever the slice size is, we end up sending the
same set of pages to disk: the writable pages of expired inodes. We
only have to worry about choosing the right set of pages being written
in one single batch (in every pdflush wakeup). That set is
currently defined by the 5s/30s dirty expiration rules.

> Furthermore, as disk bandwidth goes up, the relationship
> between large file and small file writes changes if we want
> to maintain writeback at a significant percentage of the
> maximum bandwidth of the drive (which is extremely desirable).
> So if we take a 4k page machine and a 1024page writeback slice,
> for different disks, we get a bandwidth slice in terms of disk
> seeks like:
> 
> slow disk: 20MB/s, 10ms seek (say a laptop drive)
> 	- 4MB write takes 200ms, or equivalent of 10 seeks
> 
> normal disk: 60MB/s, 8ms seek (desktop)
> 	- 4MB write takes 66ms, or equivalent of 8 seeks
> 
> fast disk: 120MB/s, 5ms seek (15krpm SAS)
> 	- 4MB write takes 33ms,  or equivalent of 6 seeks
> 
> small RAID5 lun: 200MB/s, 4ms seek
> 	- 4MB write takes 20ms, or equivalent of 5 seeks
> 
> Large RAID5 lun: 1GB/s, 2ms seek
> 	- 4MB write takes 4ms, or equivalent of 2 seeks
> 
> Put simply:
> 
> 	The higher the bandwidth of the device, the more frequently
> 	we need to be servicing the inodes with large amounts of
> 	dirty data to be written to maintain write throughput at a
> 	significant percentage of the device capability.
> 
> The writeback algorithm needs to take this into account for it
> to be able to scale effectively for high throughput devices.

Slow queues go full first. Currently the writeback code will skip
_and_ congestion_wait() for congested filesystems. The better policy
is to congestion_wait() _after_ all other writable pages have been
synced. The same have been done for blocked inodes/pages in my
patchset.

> BTW, it needs to be recognised that if we are under memory pressure
> we can clean much more memory in a short period of time by writing
> out all the large files first. This would clearly benefit the system
> as a whole as we'd get the most pages available for reclaim as
> possible in a short a time as possible. The writeback algorithm
> should really have a mode that allows this sort of flush ordering to
> occur....

We might do write-behind for large files :-)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]             ` <E1JFkHy-0001jR-VD@localhost.localdomain>
@ 2008-01-18  6:01               ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-18  6:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Michael Rubin, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Fri, Jan 18, 2008 at 06:41:09AM +0100, Andi Kleen wrote:
> Fengguang Wu <wfg@mail.ustc.edu.cn> writes:
> >
> > Suppose we want to grant longer expiration window for temp files,
> > adding a new list named s_dirty_tmpfile would be a handy solution.
> 
> How would the kernel know that a file is a tmp file?

No idea - but it makes a good example ;-)

But for those making different filesystems for /tmp, /var, /data etc, 
per-superblock expiration parameters may help.

> > So the question is: should we need more than 3 QoS classes?
> 
> [just a random idea; i have not worked out all the implications]
> 
> Would it be possible to derive a writeback apriority from the ionice
> level of the process originating the IO? e.g. we have long standing
> problems that background jobs even when niced and can cause
> significant slow downs to foreground processes by starving IO 
> and pushing out pages. ionice was supposed to help with that
> but in practice it does not seem to have helped too much and I suspect
> it needs more prioritization higher up the VM food chain. Adding
> such priorities to writeback would seem like a step in the right
> direction, although it would of course not solve the problem
> completely.

Good idea. Michael may well be considering similar interfaces :-)


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  4:56         ` Fengguang Wu
  2008-01-18  5:41           ` Andi Kleen
@ 2008-01-18  6:43           ` Michael Rubin
       [not found]             ` <E1JFnZz-00015z-Vq@localhost.localdomain>
  1 sibling, 1 reply; 34+ messages in thread
From: Michael Rubin @ 2008-01-18  6:43 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: a.p.zijlstra, akpm, linux-kernel, linux-mm

On Jan 17, 2008 8:56 PM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:

Once again thanks for the speedy replies. :-)

> On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> Suppose we want to grant longer expiration window for temp files,
> adding a new list named s_dirty_tmpfile would be a handy solution.

When you mean tmp do you mean files that eventually get written to
disk? If not I would just use the WRITEBACK_NEVER. If so I am not sure
if that feature is worth making a special case. It seems like the
location based ideas may be more useful.

> So the question is: should we need more than 3 QoS classes?
>
> > > The most tricky writeback issues could be starvation prevention
> > > between
> >
> >
> > >         - small/large files
> > >         - new/old files
> > >         - superblocks
> >
> > So I have written tests and believe I have covered these issues. If
> > you are concerned in specific on any and have a test case please let
> > me know.
>
> OK.
>
> > > Some kind of limit should be applied for each. They used to be:
> > >         - requeue to s_more_io whenever MAX_WRITEBACK_PAGES is reached
> > >           this preempts big files
> >
> > The patch uses th same limit.
> >
> > >         - refill s_io iif it is drained
> > >           this prevents promotion of big/old files
> >
> > Once a big file gets its first do_writepages it is moved behind the
> > other smaller files via i_flushed_when. And the same in reverse for
> > big vs old.
>
> You mean i_flush_gen?

Yeah sorry. It was once called i_flush_when. (sheepish)

> No, sync_sb_inodes() will abort on every
> MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
> Hence the sync will restart from big/old files.

If I understand you correctly I am not sure I agree. Here is what I
think happens in the patch:

1) pull big inode off of flush tree
2) sync big inode
3) Hit MAX_WRITEBACK_PAGES
4) Re-insert big inode (without modifying the dirtied_when)
5) update the i_flush_gen on big inode and re-insert behind small
inodes we have not synced yet.

In a subsequent sync_sb_inode we end up retrieving the small inode we
had not serviced yet.
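
Or, in rough code form (a sketch only, not the patch itself;
flush_tree_min(), flush_tree_remove(), flush_tree_insert() and
inode_still_dirty() are made-up helper names, and locking is ignored):

static void sync_sb_inodes_sketch(struct super_block *sb,
				  struct writeback_control *wbc)
{
	struct inode *inode;

	/* only inodes not yet serviced in this generation are eligible */
	while ((inode = flush_tree_min(sb)) &&
	       inode->i_flush_gen < sb->s_flush_gen) {
		flush_tree_remove(sb, inode);
		/* capped at MAX_WRITEBACK_PAGES per go */
		do_writepages(inode->i_mapping, wbc);

		if (inode_still_dirty(inode)) {
			/* dirtied_when is preserved; bumping i_flush_gen
			 * queues the inode behind anything with the same
			 * dirtied_when that we have not serviced yet */
			inode->i_flush_gen = sb->s_flush_gen;
			flush_tree_insert(sb, inode);
		}
	}
	sb->s_flush_gen++;
}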

> > >         - return from sync_sb_inodes() after one go of s_io
> >
> > I am not sure how this limit helps things out. Is this for superblock
> > starvation? Can you elaborate?
>
> We should have a way to go to next superblock even if new dirty inodes
> or pages are emerging fast in this superblock. Fill and drain s_io
> only once and then abort helps.

Got it.

> s_io is a stable and bounded working set in one go of superblock.

Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit.

> Basically you make one list_head in each rbtree node.
> That list_head is recycled cyclic, and is an analog to the old
> fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
> So an extra indicator must be introduced - i_flush_gen. It's awkward.
> We are simply repeating the aged list_heads' problem.

To me they both feel a little awkward. I feel like the original
problem in 2.6.23 led to a lot of examination which is bringing new
possibilities to light.

BTW the issue that started me on this whole path (starving large
files) was still present in 2.6.23-rc8 but now looks fixed in
2.6.24-rc3.
Still no idea about your changes in 2.6.24-rc6-mm1. I have given up
trying to get that thing to boot.

mrubin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-16  9:07                         ` Fengguang Wu
@ 2008-01-18  7:36                           ` Mike Waychison
  0 siblings, 0 replies; 34+ messages in thread
From: Mike Waychison @ 2008-01-18  7:36 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: Andrew Morton, Michael Rubin, Peter Zijlstra, linux-kernel, linux-mm

Fengguang Wu wrote:
> On Tue, Jan 15, 2008 at 09:51:49PM -0800, Andrew Morton wrote:
>> On Wed, 16 Jan 2008 12:55:07 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>>
>>> On Tue, Jan 15, 2008 at 08:42:36PM -0800, Andrew Morton wrote:
>>>> On Wed, 16 Jan 2008 12:25:53 +0800 Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
>>>>
>>>>> list_heads are OK if we use them for one and only function.
>>>> Not really.  They're inappropriate when you wish to remember your
>>>> position in the list while you dropped the lock (as we must do in
>>>> writeback).
>>>>
>>>> A data structure which permits us to interate across the search key rather
>>>> than across the actual storage locations is more appropriate.
>>> I totally agree with you. What I mean is to first do the split of
>>> functions - into three: ordering, starvation prevention, and blockade
>>> waiting.
>> Does "ordering" here refer to ordering bt time-of-first-dirty?
> 
> Ordering by dirtied_when or i_ino, either is OK.
> 
>> What is "blockade waiting"?
> 
> Some inodes/pages cannot be synced now for some reason and should be
> retried after a while.
> 
>>> Then to do better ordering by adopting radix tree(or rbtree
>>> if radix tree is not enough),
>> ordering of what?
> 
> Switch from time to location.
> 

Given the way LBAs are located on disk and the fact that rotational 
latency is a large factor in changing locations of a drive head, any 
attempts to do a C-SCAN pass are pretty much useless.  Further 
complicating this is any volume management that sits between the fs and 
the actual storage.

A nice feature to have longer term is to have the write_inodes paths for
background flushing understand storage congestion _through_ any volume
management. This would allow us to back off background flushing on a
per-spindle basis (when using drives of course) and avoid write congestion
in both the io scheduler and in the drive's write caches, which I
believe (but don't have hard evidence) get congested today, knocking the
drive into FIFO behaviour in firmware.

A data structure that lets us keep dirtied_when values consistent
across back-offs and blocking allows us to further develop the
background writeout paths to get to this point (though exposing this 
congestion information will require more work deeper in the stack).

>>> and lastly get rid of the list_heads to
>>> avoid locking. Does it sound like a good path?
>> I'd have thaought that replacing list_heads with another data structure
>> would be a simgle commit.
> 
> That would be easy. s_more_io and s_more_io_wait can all be converted
> to radix trees.
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  5:41           ` Andi Kleen
       [not found]             ` <E1JFkHy-0001jR-VD@localhost.localdomain>
@ 2008-01-18  7:48             ` Mike Waychison
  1 sibling, 0 replies; 34+ messages in thread
From: Mike Waychison @ 2008-01-18  7:48 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Fengguang Wu, Michael Rubin, a.p.zijlstra, akpm, linux-kernel, linux-mm

We have some code internally that does just this, though it slightly 
abuses struct page by tagging pages with the highest priority that 
dirties them.

I'm not sure what a better solution is, though there has been talk about
rewriting it to use a per-mapping radix tree to keep mem_map small.

The idea is to record current->ioprio in each page as it is dirtied;
a higher priority overrides a lower one.  Buffer_heads are done in the
same way.

Come IO submission, bios get the highest IO priority associated with
the submitted pages/buffer_heads, and these priorities are passed on via
the struct request into the io scheduler.

Similar changes are underway for making sure all AIO get the right ioprios.
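
In sketch form it looks something like the below (illustrative only:
struct page has no ->ioprio field - that is the part our code abuses
struct page for - and the comparison here ignores the ioprio class
encoding):

/* remember the highest ioprio that ever dirtied this page */
static void page_mark_dirty_ioprio(struct page *page)
{
	if (current->ioprio > page->ioprio)
		page->ioprio = current->ioprio;
}

/* at submission time the bio inherits the highest tag among its pages */
static void bio_inherit_ioprio(struct bio *bio)
{
	unsigned short prio = 0;
	int i;

	for (i = 0; i < bio->bi_vcnt; i++)
		if (bio->bi_io_vec[i].bv_page->ioprio > prio)
			prio = bio->bi_io_vec[i].bv_page->ioprio;

	/* carried into the struct request, where the io scheduler sees it */
	bio_set_prio(bio, prio);
}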

It will take some cleaning to get it ready for lkml submission, though 
I'm still a bit unsure of what we should do to avoid struct page growth.

Mike Waychison

Andi Kleen wrote:
> Fengguang Wu <wfg@mail.ustc.edu.cn> writes:
>> Suppose we want to grant longer expiration window for temp files,
>> adding a new list named s_dirty_tmpfile would be a handy solution.
> 
> How would the kernel know that a file is a tmp file?
> 
>> So the question is: should we need more than 3 QoS classes?
> 
> [just a random idea; i have not worked out all the implications]
> 
> Would it be possible to derive a writeback apriority from the ionice
> level of the process originating the IO? e.g. we have long standing
> problems that background jobs even when niced and can cause
> significant slow downs to foreground processes by starving IO 
> and pushing out pages. ionice was supposed to help with that
> but in practice it does not seem to have helped too much and I suspect
> it needs more prioritization higher up the VM food chain. Adding
> such priorities to writeback would seem like a step in the right
> direction, although it would of course not solve the problem
> completely.
> 
> -Andi
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
> 


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  5:38         ` Michael Rubin
@ 2008-01-18  8:54           ` David Chinner
  2008-01-18  9:26             ` Michael Rubin
  0 siblings, 1 reply; 34+ messages in thread
From: David Chinner @ 2008-01-18  8:54 UTC (permalink / raw)
  To: Michael Rubin
  Cc: David Chinner, Fengguang Wu, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 09:38:24PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 9:01 PM, David Chinner <dgc@sgi.com> wrote:
> 
> First off thank you for the very detailed reply. This rocks and gives
> me much to think about.
> 
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > This seems suboptimal for large files. If you keep feeding in
> > new least recently dirtied files, the large files will never
> > get an unimpeded go at the disk and hence we'll struggle to
> > get decent bandwidth under anything but pure large file
> > write loads.
> 
> You're right. I understand now. I just  changed a dial on my tests,
> ran it and found pdflush not keeping up like it should. I need to
> address this.
> 
> > Switching inodes during writeback implies a seek to the new write
> > location, while continuing to write the same inode has no seek
> > penalty because the writeback is sequential.  It follows from this
> > that allowing larges file a disproportionate amount of data
> > writeback is desirable.
> >
> > Also, cycling rapidly through all the large files to write 4MB to each is
> > going to cause us to spend time seeking rather than writing compared
> > to cycling slower and writing 40MB from each large file at a time.
> >
> > i.e. servicing one large file for 100ms is going to result in higher
> > writeback throughput than servicing 10 large files for 10ms each
> > because there's going to be less seeking and more writing done by
> > the disks.
> >
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> >
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> >
> >         slice 1: iterate over small files
> >         slice 2: flush large file 1
> >         slice 3: iterate over small files
> >         slice 4: flush large file 2
> >         ......
> >         slice n-1: flush large file N
> >         slice n: iterate over small files
> >         slice n+1: flush large file N+1
> >
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
> 
> I am getting where you are coming from. But if we are going to make
> changes to optimize for seeks maybe we need to be more aggressive in
> write back in how we organize both time and location. Right now AFAIK
> there is no attention to location in the writeback path.

True. But IMO, locality ordering really only impacts the small file
data writes and the inodes themselves, because there are typically
lots of seeks in doing that.

For large sequential writes to a file, writing a significant
chunk of data gives that bit of writeback its own locality
because it does not cause seeks. Hence simply writing large
enough chunks avoids any need to order the writeback by locality.

Hence I see writeback ordering by locality as more a function of
optimising the "iterate over small files" aspect of the writeback.

> >         The higher the bandwidth of the device, the more frequently
> >         we need to be servicing the inodes with large amounts of
> >         dirty data to be written to maintain write throughput at a
> >         significant percentage of the device capability.
> >
> 
> Could you expand that to say it's not the inodes of large files but
> the ones with data that we can exploit locality?

Not sure I understand what you mean. Can you rephrase that?

> Often large files are fragmented.

Then the filesystem is not doing its job. Fragmentation does
not happen very frequently in XFS for large files - that is one
of the reasons it is extremely good for large files and high
throughput applications...

> Would it make more sense to pursue cracking the inodes and
> grouping their blocks' locations? Or is this all overkill and should
> be handled at a lower level like the elevator?

For large files it is overkill. For filesystems that do delayed
allocation, it is often impossible (no block mapping until
the writeback is executed unless it's an overwrite).

At this point, I'd say it is best to leave it to the filesystem and
the elevator to do their jobs properly.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
  2008-01-18  8:54           ` David Chinner
@ 2008-01-18  9:26             ` Michael Rubin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael Rubin @ 2008-01-18  9:26 UTC (permalink / raw)
  To: David Chinner; +Cc: Fengguang Wu, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Jan 18, 2008 12:54 AM, David Chinner <dgc@sgi.com> wrote:
> At this point, I'd say it is best to leave it to the filesystem and
> the elevator to do their jobs properly.

Amen.

mrubin

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]             ` <E1JFnZz-00015z-Vq@localhost.localdomain>
@ 2008-01-18  9:32               ` Fengguang Wu
  0 siblings, 0 replies; 34+ messages in thread
From: Fengguang Wu @ 2008-01-18  9:32 UTC (permalink / raw)
  To: Michael Rubin; +Cc: a.p.zijlstra, akpm, linux-kernel, linux-mm

On Thu, Jan 17, 2008 at 10:43:15PM -0800, Michael Rubin wrote:
> On Jan 17, 2008 8:56 PM, Fengguang Wu <wfg@mail.ustc.edu.cn> wrote:
> > On Thu, Jan 17, 2008 at 01:07:05PM -0800, Michael Rubin wrote:
> > Suppose we want to grant longer expiration window for temp files,
> > adding a new list named s_dirty_tmpfile would be a handy solution.
> 
> When you say tmp do you mean files that eventually get written to

Yes, they are disk based and can be synced on.

> disk? If not I would just use the WRITEBACK_NEVER. If so I am not sure
> if that feature is worth making a special case. It seems like the
> location based ideas may be more useful.

I'm not interested in WRITEBACK_NEVER or location based writeback
for now :-)

> > > >         - refill s_io iif it is drained
> > > >           this prevents promotion of big/old files
> > >
> > > Once a big file gets its first do_writepages it is moved behind the
> > > other smaller files via i_flushed_when. And the same in reverse for
> > > big vs old.
> >
> > You mean i_flush_gen?
> 
> Yeah sorry. It was once called i_flush_when. (sheepish)
> 
> > No, sync_sb_inodes() will abort on every
> > MAX_WRITEBACK_PAGES, and s_flush_gen will be updated accordingly.
> > Hence the sync will restart from big/old files.
> 
> If I understand you correctly I am not sure I agree. Here is what I
> think happens in the patch:
> 
> 1) pull big inode off of flush tree
> 2) sync big inode
> 3) Hit MAX_WRITEBACK_PAGES
> 4) Re-insert big inode (without modifying the dirtied_when)
> 5) update the i_flush_gen on big inode and re-insert behind small
> inodes we have not synced yet.
> 
> In a subsequent sync_sb_inodes we end up retrieving the small inode we
> had not serviced yet.

Yes, exactly. And then it will continue to sync the big one again.
It will never be able to move forward to the next dirtied_when before
exhausting the inodes in the current list (with the oldest dirtied_when).
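
To illustrate, each pass over the superblock behaves roughly like
this (heavily simplified sketch, not the real patch code - the
helper names here are all made up):

	static void one_writeback_pass(struct super_block *sb)
	{
		unsigned long prev_time = 0;
		struct inode *inode;

		while ((inode = next_inode_in_flush_tree(sb, prev_time))) {
			prev_time = inode->dirtied_when;
			write_some_pages(inode);  /* stops at MAX_WRITEBACK_PAGES */
			if (inode_still_dirty(inode))
				reinsert(sb, inode);  /* keeps the old dirtied_when */
			if (out_of_budget())
				return;
		}
		/*
		 * Every pass restarts from the oldest dirtied_when, so
		 * newer inodes are not touched until that bucket is
		 * completely cleaned.
		 */
	}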

> > > >         - return from sync_sb_inodes() after one go of s_io
> > >
> > > I am not sure how this limit helps things out. Is this for superblock
> > > starvation? Can you elaborate?
> >
> > We should have a way to go to the next superblock even if new dirty
> > inodes or pages are emerging fast in this superblock. Filling and
> > draining s_io only once and then aborting helps.
> 
> Got it.
> 
> > s_io is a stable and bounded working set in one go of superblock.
> 
> Is this necessary with MAX_WRITEBACK_PAGES? It feels like a double limit.

We need a limit-and-continue scheme at each level. It was so hard to
sort them out that I'm really reluctant to restart all the fuss again.

> > Basically you make one list_head in each rbtree node.
> > That list_head is recycled cyclically, and is an analog to the
> > old-fashioned s_dirty. We need to know 'where we are' and 'where it ends'.
> > So an extra indicator must be introduced - i_flush_gen. It's awkward.
> > We are simply repeating the aged list_heads' problem.
> 
> To me they both feel a little awkward. I feel like the original
> problem in 2.6.23 led to a lot of examination which is bringing new
> possibilities to light.
> 
> BTW the issue that started me on this whole path (starving large
> files) was still present in 2.6.23-rc8 but now looks fixed in
> 2.6.24-rc3.
> Still no idea about your changes in 2.6.24-rc6-mm1. I have given up
> trying to get that thing to boot.

Hehe, I guess the bug is still there in 2.6.24-rc3. But should be gone
in the latest patchset.


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [patch] Converting writeback linked lists to a tree based data structure
       [not found]         ` <E1JFjyv-0001hU-FA@localhost.localdomain>
  2008-01-18  5:41           ` Fengguang Wu
@ 2008-01-19  2:50           ` David Chinner
  1 sibling, 0 replies; 34+ messages in thread
From: David Chinner @ 2008-01-19  2:50 UTC (permalink / raw)
  To: Fengguang Wu
  Cc: David Chinner, Michael Rubin, a.p.zijlstra, akpm, linux-kernel, linux-mm

On Fri, Jan 18, 2008 at 01:41:33PM +0800, Fengguang Wu wrote:
> > That is, think of large file writes like process scheduler batch
> > jobs - bulk throughput is what matters, so the larger the time slice
> > you give them the higher the throughput.
> > 
> > IMO, the sort of result we should be looking at is a
> > writeback design that results in cycling somewhat like:
> > 
> > 	slice 1: iterate over small files
> > 	slice 2: flush large file 1
> > 	slice 3: iterate over small files
> > 	slice 4: flush large file 2
> > 	......
> > 	slice n-1: flush large file N
> > 	slice n: iterate over small files
> > 	slice n+1: flush large file N+1
> > 
> > So that we keep the disk busy with a relatively fair mix of
> > small and large I/Os while both are necessary.
> 
> If we can sync fast enough, the lower layer would be able to merge
> those 4MB requests.

No, not necessarily - think of a stripe with a chunk size of 512k.
That 4MB will be split into 8x512k chunks and sent to different
devices (and hence elevator queues). The only way you get elevator
merging in this sort of config is if you send multiple stripe
*width* sized amounts to the device in a very short time period.
I see quite a few filesystems with stripe widths in the tens of MB
range.....
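
To put rough numbers on that (illustrative arithmetic only, nothing
to do with the patch):

	/* Hypothetical 8-disk stripe with a 512k chunk size. */
	#define CHUNK_SIZE	(512 * 1024)
	#define NR_DISKS	8
	#define STRIPE_WIDTH	(NR_DISKS * CHUNK_SIZE)	/* 4MB */

	/*
	 * One 4MB writeback slice hands each member disk a single 512k
	 * request, so each elevator queue sees one small, isolated I/O
	 * and has nothing to merge.  Only by issuing several
	 * STRIPE_WIDTHs back to back (16MB, 32MB, ...) in a short
	 * period does each disk get adjacent requests it can merge.
	 */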

> > Put simply:
> > 
> > 	The higher the bandwidth of the device, the more frequently
> > 	we need to be servicing the inodes with large amounts of
> > 	dirty data to be written to maintain write throughput at a
> > 	significant percentage of the device capability.
> > 
> > The writeback algorithm needs to take this into account for it
> > to be able to scale effectively for high throughput devices.
> 
> Slow queues go full first. Currently the writeback code will skip
> _and_ congestion_wait() for congested filesystems. The better policy
> is to congestion_wait() _after_ all other writable pages have been
> synced.

Agreed.

The comments I've made are mainly concerned with getting efficient
flushing of a single device occurring. Interactions between multiple
devices are a separable issue....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [patch] Converting writeback linked lists to a tree based data structure
@ 2007-12-13  0:32 Michael Rubin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael Rubin @ 2007-12-13  0:32 UTC (permalink / raw)
  To: a.p.zijlstra, akpm, linux-kernel, linux-mm, mrubin, wfg

From: Michael Rubin <mrubin@google.com>

This is an attempt to unify the writeback data structures. By adding an
rb tree we are able to have one consistent time ordering mechanism for
writeback. This should aid debugging and allow for future work with more
sophisticated time ordering methods. This is a proposal for 2.6.25.

The patch below includes the following changes.

1) Adding a data structure to guarantee fairness when writing inodes
to disk.  The flush_tree is based on an rbtree, with duplicate keys
being chained off the same rb_node.

2) Added an FS flag to mark file systems that are not disk backed so
we don't have to flush them. I am not sure I marked all of them, but
just marking these improves writeback performance.

3) Added an inode flag to allow inodes to be marked so that they are
never written back to disk. See get_pipe_inode.

Under autotest this patch has passed: fsx, bonnie, and iozone. I am
currently writing more writeback-focused tests (which it has so far
passed) to add into autotest.

Performance wise I ran a quick test.

a) I used sysctl to stop background writeback.
b) I ran the "sync" command.
c) Then I created 10,000,000 files in directories of 1000 files per directory.
d) Finally I timed the "sync" command for all the dirty inodes that had been parked.

I ran the perf test 5 times on each kernel. The average for the 2.6.24
kernel was 87.8 seconds. With this patch it was 69.3 seconds.

Signed-off-by: Michael Rubin <mrubin@google.com>
---

Index: 2624rc3_wb/fs/block_dev.c
===================================================================
--- 2624rc3_wb.orig/fs/block_dev.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/block_dev.c	2007-12-11 13:55:01.000000000 -0800
@@ -518,6 +518,7 @@ static struct file_system_type bd_type =
 	.name		= "bdev",
 	.get_sb		= bd_get_sb,
 	.kill_sb	= kill_anon_super,
+	.fs_flags	= FS_ANONYMOUS,
 };
 
 static struct vfsmount *bd_mnt __read_mostly;
Index: 2624rc3_wb/fs/fs-writeback.c
===================================================================
--- 2624rc3_wb.orig/fs/fs-writeback.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/fs-writeback.c	2007-12-11 14:19:33.000000000 -0800
@@ -23,8 +23,174 @@
 #include <linux/blkdev.h>
 #include <linux/backing-dev.h>
 #include <linux/buffer_head.h>
+#include <linux/rbtree.h>
 #include "internal.h"
 
+#define rb_to_inode(node) rb_entry((node), struct inode, i_flush_node)
+
+/*
+ * When inodes are parked for writeback they are parked in the
+ * flush_tree. The flush tree is a data structure based on an rb tree.
+ *
+ * Duplicate keys are handled by making a list in the tree for each key
+ * value. The order of how we choose the next inode to flush is decided
+ * by two fields. First the earliest dirtied_when value. If there are
+ * duplicate dirtied_when values then the earliest i_flushed_when value
+ * determines who gets flushed next.
+ *
+ * The flush tree organizes the dirtied_when keys with the rb_tree. Any
+ * inodes with a duplicate dirtied_when value are linked together in a
+ * list. This linked list is sorted by the inode's i_flushed_when. When both
+ * the dirtied_when and the i_flushed_when are identical, the order in the
+ * linked list determines the order we flush the inodes.
+ */
+
+/*
+ * Find an rb_node matching the key in the flush tree. There are no duplicate
+ * rb_nodes in the tree. Instead they are chained off the first node.
+ */
+static struct inode *flush_tree_search(struct super_block *sb,
+				       unsigned long ts)
+{
+	struct rb_node *n = sb->s_flush_root.rb_node;
+	assert_spin_locked(&inode_lock);
+	while (n) {
+		struct inode *inode = rb_to_inode(n);
+		if (time_before(ts, inode->dirtied_when)) {
+			n = n->rb_left;
+		} else if (time_after(ts, inode->dirtied_when)) {
+			n = n->rb_right;
+		} else {
+			return inode;
+		}
+	}
+	return NULL;
+}
+
+/*
+ * Inserting an inode into the flush tree. The tree is keyed by the
+ * dirtied_when member.
+ *
+ * If there is a duplicate key in the tree already the new inode is put
+ * on the tail of a list of the rb_node.
+ * All inserted inodes must have one of the I_DIRTY flags set.
+ */
+static void flush_tree_insert(struct super_block *sb, struct inode *inode)
+{
+	struct rb_node **new = &(sb->s_flush_root.rb_node);
+	struct rb_node *parent = NULL;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON((inode->i_state & I_DIRTY) == 0);
+	BUG_ON(inode->i_state & (I_FREEING|I_CLEAR));
+	BUG_ON(RB_LINKED_NODE(&inode->i_flush_node));
+
+	sb->s_flush_count++;
+
+	list_del_init(&inode->i_list);
+	while (*new) {
+		struct inode *this = rb_to_inode(*new);
+		parent = *new;
+		if (time_before(inode->dirtied_when, this->dirtied_when))
+			new = &((*new)->rb_left);
+		else if (time_after(inode->dirtied_when,
+				      this->dirtied_when)) {
+			new = &((*new)->rb_right);
+		} else {
+			list_add_tail(&inode->i_list, &this->i_list);
+			return;
+		}
+	}
+
+	/* Add in the new node and re-balance the tree */
+	rb_link_node(&inode->i_flush_node, parent, new);
+	rb_insert_color(&inode->i_flush_node, &sb->s_flush_root);
+}
+
+
+/*
+ * Here we return the inode that has the smallest key in the flush tree
+ * that is greater than the parameter "prev_time".
+ */
+static struct inode *flush_tree_min_greater(struct super_block *sb,
+					    unsigned long prev_time)
+{
+	struct rb_node *node = sb->s_flush_root.rb_node;
+	struct inode *bsf = NULL;
+	/* best so far */
+	assert_spin_locked(&inode_lock);
+	while (node) {
+		struct inode *data = rb_to_inode(node);
+		/* Just trying to get lucky */
+		if ((prev_time + 1) == data->dirtied_when)
+			return data;
+
+		/* If this value is greater than our prev_time and is
+		 * less than the best so far, this is our new best so far. */
+		if ((data->dirtied_when > prev_time) &&
+		    (bsf ? bsf->dirtied_when > data->dirtied_when : 1))
+			bsf = data;
+
+		/* Search all the way down to the bottom of the tree */
+		if (time_before(prev_time, data->dirtied_when))
+			node = node->rb_left;
+		else if (time_after_eq(prev_time, data->dirtied_when))
+			node = node->rb_right;
+	}
+	return bsf;
+}
+
+/*
+ * Here is where we iterate to find the next inode to process. The
+ * strategy is to first look for any other inodes with the same dirtied_when
+ * value. If we have already processed that node then we need to find
+ * the next highest dirtied_when value in the tree.
+ */
+static struct inode *flush_tree_next(struct super_block *sb,
+				     unsigned long start_time,
+				     unsigned long prev_time)
+{
+	struct inode *inode = flush_tree_search(sb, prev_time);
+	assert_spin_locked(&inode_lock);
+	/* We have a duplicate timed inode as the last processed */
+	if (inode && (time_before(inode->i_flushed_when, start_time)))
+		return inode;
+
+	/* Now we have to find the oldest one next */
+	return flush_tree_min_greater(sb, prev_time);
+}
+
+/* Removing a node from the flushtree. */
+void flush_tree_remove(struct super_block *sb, struct inode *inode)
+{
+	struct rb_node *rb_node = &inode->i_flush_node;
+	struct rb_root *rb_root = &sb->s_flush_root;
+
+	assert_spin_locked(&inode_lock);
+	BUG_ON((inode->i_state & I_DIRTY) == 0);
+
+	sb->s_flush_count--;
+
+	/* There is no chain on this inode. Just remove it from the tree */
+	if (list_empty(&inode->i_list)) {
+		BUG_ON(!RB_LINKED_NODE(rb_node));
+		rb_erase(rb_node, rb_root);
+		memset(rb_node, 0, sizeof(*rb_node));
+		return;
+	}
+
+	/* This node is on a chain AND is in the rb_tree */
+	if (RB_LINKED_NODE(rb_node)) {
+		struct inode *new = list_entry(inode->i_list.next,
+					       struct inode, i_list);
+		rb_replace_node(rb_node, &new->i_flush_node, rb_root);
+		memset(rb_node, 0, sizeof(*rb_node));
+	}
+	/* Take it off the list */
+	list_del_init(&inode->i_list);
+}
+
+
 /**
  *	__mark_inode_dirty -	internal function
  *	@inode: inode to mark
@@ -32,7 +198,7 @@
  *	Mark an inode as dirty. Callers should use mark_inode_dirty or
  *  	mark_inode_dirty_sync.
  *
- * Put the inode on the super block's dirty list.
+ * Put the inode in the super block's flush_tree.
  *
  * CAREFUL! We mark it dirty unconditionally, but move it onto the
  * dirty list only if it is hashed or if it refers to a blockdev.
@@ -75,6 +241,13 @@ void __mark_inode_dirty(struct inode *in
 	if ((inode->i_state & flags) == flags)
 		return;
 
+	/* anonymous file systems do not write data back */
+	if (inode->i_sb->s_type->fs_flags & FS_ANONYMOUS)
+		return;
+
+	if (inode->i_state & I_DIRTY_NEVER)
+		return;
+
 	if (unlikely(block_dump)) {
 		struct dentry *dentry = NULL;
 		const char *name = "?";
@@ -97,14 +270,7 @@ void __mark_inode_dirty(struct inode *in
 	if ((inode->i_state & flags) != flags) {
 		const int was_dirty = inode->i_state & I_DIRTY;
 
-		inode->i_state |= flags;
-
-		/*
-		 * If the inode is being synced, just update its dirty state.
-		 * The unlocker will place the inode on the appropriate
-		 * superblock list, based upon its state.
-		 */
-		if (inode->i_state & I_SYNC)
+		if (inode->i_state & (I_FREEING|I_CLEAR))
 			goto out;
 
 		/*
@@ -115,16 +281,25 @@ void __mark_inode_dirty(struct inode *in
 			if (hlist_unhashed(&inode->i_hash))
 				goto out;
 		}
-		if (inode->i_state & (I_FREEING|I_CLEAR))
+
+		inode->i_state |= flags;
+
+		/*
+		 * If the inode is being synced, just update its dirty state.
+		 * The unlocker will place the inode on the appropriate
+		 * superblock list, based upon its state.
+		 */
+		if (inode->i_state & I_SYNC)
 			goto out;
 
 		/*
-		 * If the inode was already on s_dirty/s_io/s_more_io, don't
-		 * reposition it (that would break s_dirty time-ordering).
+		 * If the inode was already in the flushtree don't
+		 * re-insert it (that would break time-ordering).
 		 */
 		if (!was_dirty) {
 			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
+			inode->i_flushed_when = jiffies;
+			flush_tree_insert(sb, inode);
 		}
 	}
 out:
@@ -140,38 +315,6 @@ static int write_inode(struct inode *ino
 	return 0;
 }
 
-/*
- * Redirty an inode: set its when-it-was dirtied timestamp and move it to the
- * furthest end of its superblock's dirty-inode list.
- *
- * Before stamping the inode's ->dirtied_when, we check to see whether it is
- * already the most-recently-dirtied inode on the s_dirty list.  If that is
- * the case then the inode must have been redirtied while it was being written
- * out and we don't reset its dirtied_when.
- */
-static void redirty_tail(struct inode *inode)
-{
-	struct super_block *sb = inode->i_sb;
-
-	if (!list_empty(&sb->s_dirty)) {
-		struct inode *tail_inode;
-
-		tail_inode = list_entry(sb->s_dirty.next, struct inode, i_list);
-		if (!time_after_eq(inode->dirtied_when,
-				tail_inode->dirtied_when))
-			inode->dirtied_when = jiffies;
-	}
-	list_move(&inode->i_list, &sb->s_dirty);
-}
-
-/*
- * requeue inode for re-scanning after sb->s_io list is exhausted.
- */
-static void requeue_io(struct inode *inode)
-{
-	list_move(&inode->i_list, &inode->i_sb->s_more_io);
-}
-
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -181,38 +324,9 @@ static void inode_sync_complete(struct i
 	wake_up_bit(&inode->i_state, __I_SYNC);
 }
 
-/*
- * Move expired dirty inodes from @delaying_queue to @dispatch_queue.
- */
-static void move_expired_inodes(struct list_head *delaying_queue,
-			       struct list_head *dispatch_queue,
-				unsigned long *older_than_this)
-{
-	while (!list_empty(delaying_queue)) {
-		struct inode *inode = list_entry(delaying_queue->prev,
-						struct inode, i_list);
-		if (older_than_this &&
-			time_after(inode->dirtied_when, *older_than_this))
-			break;
-		list_move(&inode->i_list, dispatch_queue);
-	}
-}
-
-/*
- * Queue all expired dirty inodes for io, eldest first.
- */
-static void queue_io(struct super_block *sb,
-				unsigned long *older_than_this)
-{
-	list_splice_init(&sb->s_more_io, sb->s_io.prev);
-	move_expired_inodes(&sb->s_dirty, &sb->s_io, older_than_this);
-}
-
 int sb_has_dirty_inodes(struct super_block *sb)
 {
-	return !list_empty(&sb->s_dirty) ||
-	       !list_empty(&sb->s_io) ||
-	       !list_empty(&sb->s_more_io);
+	return !RB_EMPTY_ROOT(&sb->s_flush_root);
 }
 EXPORT_SYMBOL(sb_has_dirty_inodes);
 
@@ -221,7 +335,7 @@ EXPORT_SYMBOL(sb_has_dirty_inodes);
  * If `wait' is set, wait on the writeout.
  *
  * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
+ * starvation of particular inodes when others are being re-dirtied, prevent
  * livelocks, etc.
  *
  * Called under inode_lock.
@@ -237,6 +351,7 @@ __sync_single_inode(struct inode *inode,
 	BUG_ON(inode->i_state & I_SYNC);
 
 	/* Set I_SYNC, reset I_DIRTY */
+	flush_tree_remove(inode->i_sb, inode);
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state |= I_SYNC;
 	inode->i_state &= ~I_DIRTY;
@@ -266,7 +381,7 @@ __sync_single_inode(struct inode *inode,
 			/*
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything. Redirty
-			 * the inode; Move it from s_io onto s_more_io/s_dirty.
+			 * the inode;
 			 */
 			/*
 			 * akpm: if the caller was the kupdate function we put
@@ -279,29 +394,32 @@ __sync_single_inode(struct inode *inode,
 			 */
 			if (wbc->for_kupdate) {
 				/*
-				 * For the kupdate function we move the inode
-				 * to s_more_io so it will get more writeout as
-				 * soon as the queue becomes uncongested.
+				 * For the kupdate function we leave the
+				 * dirtied_when field untouched and return the
+				 * inode to the flush_tree. The next iteration
+				 * of kupdate will flush more pages when
+				 * the queue is no longer congested.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				requeue_io(inode);
+				flush_tree_insert(inode->i_sb, inode);
 			} else {
 				/*
-				 * Otherwise fully redirty the inode so that
+				 * Otherwise fully re-dirty the inode so that
 				 * other inodes on this superblock will get some
 				 * writeout.  Otherwise heavy writing to one
 				 * file would indefinitely suspend writeout of
 				 * all the other files.
 				 */
 				inode->i_state |= I_DIRTY_PAGES;
-				redirty_tail(inode);
+				inode->dirtied_when = jiffies;
+				flush_tree_insert(inode->i_sb, inode);
 			}
 		} else if (inode->i_state & I_DIRTY) {
 			/*
-			 * Someone redirtied the inode while were writing back
+			 * Someone re-dirtied the inode while we were writing back
 			 * the pages.
 			 */
-			redirty_tail(inode);
+			flush_tree_insert(inode->i_sb, inode);
 		} else if (atomic_read(&inode->i_count)) {
 			/*
 			 * The inode is clean, inuse
@@ -333,27 +451,20 @@ __writeback_single_inode(struct inode *i
 	else
 		WARN_ON(inode->i_state & I_WILL_FREE);
 
-	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
-		struct address_space *mapping = inode->i_mapping;
-		int ret;
+	BUG_ON((inode->i_state & I_DIRTY) == 0);
 
+	/*
+	 * If the inode is locked and we are not going to wait for it
+	 * to be unlocked then we can just exit the routine. Since the
+	 * inode is marked I_SYNC it will be inserted into the flush
+	 * tree by sync_single_inode when the I_SYNC is released.
+	 */
+	if ((wbc->sync_mode != WB_SYNC_ALL) && (inode->i_state & I_SYNC)) {
 		/*
 		 * We're skipping this inode because it's locked, and we're not
-		 * doing writeback-for-data-integrity.  Move it to s_more_io so
-		 * that writeback can proceed with the other inodes on s_io.
-		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of s_io.
+		 * doing writeback-for-data-integrity.
 		 */
-		requeue_io(inode);
-
-		/*
-		 * Even if we don't actually write the inode itself here,
-		 * we can at least start some of the data writeout..
-		 */
-		spin_unlock(&inode_lock);
-		ret = do_writepages(mapping, wbc);
-		spin_lock(&inode_lock);
-		return ret;
+		return 0;
 	}
 
 	/*
@@ -380,11 +491,11 @@ __writeback_single_inode(struct inode *i
  * If older_than_this is non-NULL, then only write out inodes which
  * had their first dirtying at a time earlier than *older_than_this.
  *
- * If we're a pdlfush thread, then implement pdflush collision avoidance
+ * If we're a pdflush thread, then implement pdflush collision avoidance
  * against the entire list.
  *
- * WB_SYNC_HOLD is a hack for sys_sync(): reattach the inode to sb->s_dirty so
- * that it can be located for waiting on in __writeback_single_inode().
+ * WB_SYNC_HOLD is a hack for sys_sync(): so that it can be located for
+ * waiting on in __writeback_single_inode().
  *
  * Called under inode_lock.
  *
@@ -393,33 +504,34 @@ __writeback_single_inode(struct inode *i
  * a variety of queues, so all inodes are searched.  For other superblocks,
  * assume that all inodes are backed by the same queue.
  *
- * FIXME: this linear search could get expensive with many fileystems.  But
+ * FIXME: this linear search could get expensive with many filesystems.  But
  * how to fix?  We need to go from an address_space to all inodes which share
  * a queue with that address_space.  (Easy: have a global "dirty superblocks"
  * list).
  *
- * The inodes to be written are parked on sb->s_io.  They are moved back onto
- * sb->s_dirty as they are selected for writing.  This way, none can be missed
- * on the writer throttling path, and we get decent balancing between many
- * throttled threads: we don't want them all piling up on inode_sync_wait.
+ * The inodes to be written are inserted into the flush_tree.
  */
 static void
 sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
 {
 	const unsigned long start = jiffies;	/* livelock avoidance */
+	struct inode *inode = NULL;
+	unsigned long prev_time = 0;
 
-	if (!wbc->for_kupdate || list_empty(&sb->s_io))
-		queue_io(sb, wbc->older_than_this);
+	if (sb->s_type->fs_flags & FS_ANONYMOUS)
+		return;
 
-	while (!list_empty(&sb->s_io)) {
-		struct inode *inode = list_entry(sb->s_io.prev,
-						struct inode, i_list);
+	mutex_lock(&sb->s_flush_lock);
+	spin_lock(&inode_lock);
+	while ((inode = flush_tree_next(sb, start, prev_time)) != NULL) {
 		struct address_space *mapping = inode->i_mapping;
 		struct backing_dev_info *bdi = mapping->backing_dev_info;
 		long pages_skipped;
 
+		prev_time = inode->dirtied_when;
+		inode->i_flushed_when = start;
+
 		if (!bdi_cap_writeback_dirty(bdi)) {
-			redirty_tail(inode);
 			if (sb_is_blkdev_sb(sb)) {
 				/*
 				 * Dirty memory-backed blockdev: the ramdisk
@@ -439,14 +551,12 @@ sync_sb_inodes(struct super_block *sb, s
 			wbc->encountered_congestion = 1;
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* Skip a congested fs */
-			requeue_io(inode);
 			continue;		/* Skip a congested blockdev */
 		}
 
 		if (wbc->bdi && bdi != wbc->bdi) {
 			if (!sb_is_blkdev_sb(sb))
 				break;		/* fs has the wrong queue */
-			requeue_io(inode);
 			continue;		/* blockdev has wrong queue */
 		}
 
@@ -454,6 +564,11 @@ sync_sb_inodes(struct super_block *sb, s
 		if (time_after(inode->dirtied_when, start))
 			break;
 
+		/* Was this inode dirtied too recently? */
+		if (wbc->older_than_this && time_after(inode->dirtied_when,
+						*wbc->older_than_this))
+			break;
+
 		/* Is another pdflush already flushing this queue? */
 		if (current_is_pdflush() && !writeback_acquire(bdi))
 			break;
@@ -462,19 +577,8 @@ sync_sb_inodes(struct super_block *sb, s
 		__iget(inode);
 		pages_skipped = wbc->pages_skipped;
 		__writeback_single_inode(inode, wbc);
-		if (wbc->sync_mode == WB_SYNC_HOLD) {
-			inode->dirtied_when = jiffies;
-			list_move(&inode->i_list, &sb->s_dirty);
-		}
 		if (current_is_pdflush())
 			writeback_release(bdi);
-		if (wbc->pages_skipped != pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode);
-		}
 		spin_unlock(&inode_lock);
 		iput(inode);
 		cond_resched();
@@ -482,8 +586,8 @@ sync_sb_inodes(struct super_block *sb, s
 		if (wbc->nr_to_write <= 0)
 			break;
 	}
-	if (!list_empty(&sb->s_more_io))
-		wbc->more_io = 1;
+	spin_unlock(&inode_lock);
+	mutex_unlock(&sb->s_flush_lock);
 	return;		/* Leave any unwritten inodes on s_io */
 }
 
@@ -492,9 +596,9 @@ sync_sb_inodes(struct super_block *sb, s
  *
  * Note:
  * We don't need to grab a reference to superblock here. If it has non-empty
- * ->s_dirty it's hadn't been killed yet and kill_super() won't proceed
- * past sync_inodes_sb() until the ->s_dirty/s_io/s_more_io lists are all
- * empty. Since __sync_single_inode() regains inode_lock before it finally moves
+ * flush_tree it hasn't been killed yet and kill_super() won't proceed
+ * past sync_inodes_sb() until the flush_tree is empty.
+ * Since __sync_single_inode() regains inode_lock before it finally moves
  * inode from superblock lists we are OK.
  *
  * If `older_than_this' is non-zero then only flush inodes which have a
@@ -527,9 +631,7 @@ restart:
 			 */
 			if (down_read_trylock(&sb->s_umount)) {
 				if (sb->s_root) {
-					spin_lock(&inode_lock);
 					sync_sb_inodes(sb, wbc);
-					spin_unlock(&inode_lock);
 				}
 				up_read(&sb->s_umount);
 			}
@@ -546,7 +648,7 @@ restart:
 /*
  * writeback and wait upon the filesystem's dirty inodes.  The caller will
  * do this in two passes - one to write, and one to wait.  WB_SYNC_HOLD is
- * used to park the written inodes on sb->s_dirty for the wait pass.
+ * used to park the written inodes on the flush_tree for the wait pass.
  *
  * A finite limit is set on the number of pages which will be written.
  * To prevent infinite livelock of sys_sync().
@@ -568,9 +670,7 @@ void sync_inodes_sb(struct super_block *
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
 			nr_dirty + nr_unstable;
 	wbc.nr_to_write += wbc.nr_to_write / 2;		/* Bit more for luck */
-	spin_lock(&inode_lock);
 	sync_sb_inodes(sb, &wbc);
-	spin_unlock(&inode_lock);
 }
 
 /*
Index: 2624rc3_wb/fs/inode.c
===================================================================
--- 2624rc3_wb.orig/fs/inode.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/inode.c	2007-12-11 13:55:01.000000000 -0800
@@ -143,6 +143,7 @@ static struct inode *alloc_inode(struct 
 		inode->i_cdev = NULL;
 		inode->i_rdev = 0;
 		inode->dirtied_when = 0;
+		memset(&inode->i_flush_node, 0, sizeof(inode->i_flush_node));
 		if (security_inode_alloc(inode)) {
 			if (inode->i_sb->s_op->destroy_inode)
 				inode->i_sb->s_op->destroy_inode(inode);
@@ -1044,6 +1045,10 @@ void generic_delete_inode(struct inode *
 {
 	const struct super_operations *op = inode->i_sb->s_op;
 
+	if ((inode->i_state & I_DIRTY)) {
+		flush_tree_remove(inode->i_sb, inode);
+		inode->i_state &= ~I_DIRTY;
+	}
 	list_del_init(&inode->i_list);
 	list_del_init(&inode->i_sb_list);
 	inode->i_state |= I_FREEING;
Index: 2624rc3_wb/fs/pipe.c
===================================================================
--- 2624rc3_wb.orig/fs/pipe.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/pipe.c	2007-12-11 13:55:01.000000000 -0800
@@ -931,12 +931,10 @@ static struct inode * get_pipe_inode(voi
 	inode->i_fop = &rdwr_pipe_fops;
 
 	/*
-	 * Mark the inode dirty from the very beginning,
-	 * that way it will never be moved to the dirty
-	 * list because "mark_inode_dirty()" will think
-	 * that it already _is_ on the dirty list.
+	 * Mark the inode "never dirty" from the very beginning,
+	 * that way it will never be written back.
 	 */
-	inode->i_state = I_DIRTY;
+	inode->i_state = I_DIRTY_NEVER;
 	inode->i_mode = S_IFIFO | S_IRUSR | S_IWUSR;
 	inode->i_uid = current->fsuid;
 	inode->i_gid = current->fsgid;
Index: 2624rc3_wb/fs/proc/root.c
===================================================================
--- 2624rc3_wb.orig/fs/proc/root.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/proc/root.c	2007-12-11 13:55:01.000000000 -0800
@@ -102,6 +102,7 @@ struct file_system_type proc_fs_type = {
 	.name		= "proc",
 	.get_sb		= proc_get_sb,
 	.kill_sb	= proc_kill_sb,
+	.fs_flags	= FS_ANONYMOUS,
 };
 
 void __init proc_root_init(void)
Index: 2624rc3_wb/fs/super.c
===================================================================
--- 2624rc3_wb.orig/fs/super.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/super.c	2007-12-11 13:55:01.000000000 -0800
@@ -61,9 +61,8 @@ static struct super_block *alloc_super(s
 			s = NULL;
 			goto out;
 		}
-		INIT_LIST_HEAD(&s->s_dirty);
-		INIT_LIST_HEAD(&s->s_io);
-		INIT_LIST_HEAD(&s->s_more_io);
+		s->s_flush_root = RB_ROOT;
+		mutex_init(&s->s_flush_lock);
 		INIT_LIST_HEAD(&s->s_files);
 		INIT_LIST_HEAD(&s->s_instances);
 		INIT_HLIST_HEAD(&s->s_anon);
@@ -103,6 +102,7 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+	mutex_destroy(&s->s_flush_lock);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s);
Index: 2624rc3_wb/fs/sysfs/mount.c
===================================================================
--- 2624rc3_wb.orig/fs/sysfs/mount.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/fs/sysfs/mount.c	2007-12-11 13:55:01.000000000 -0800
@@ -80,6 +80,7 @@ static struct file_system_type sysfs_fs_
 	.name		= "sysfs",
 	.get_sb		= sysfs_get_sb,
 	.kill_sb	= kill_anon_super,
+	.fs_flags	= FS_ANONYMOUS,
 };
 
 int __init sysfs_init(void)
Index: 2624rc3_wb/include/linux/fs.h
===================================================================
--- 2624rc3_wb.orig/include/linux/fs.h	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/include/linux/fs.h	2007-12-11 13:55:01.000000000 -0800
@@ -90,9 +90,10 @@ extern int dir_notify_enable;
 #define SEL_EX		4
 
 /* public flags for file_system_type */
-#define FS_REQUIRES_DEV 1 
-#define FS_BINARY_MOUNTDATA 2
-#define FS_HAS_SUBTYPE 4
+#define FS_REQUIRES_DEV		1
+#define FS_BINARY_MOUNTDATA	2
+#define FS_HAS_SUBTYPE		4
+#define FS_ANONYMOUS		8
 #define FS_REVAL_DOT	16384	/* Check the paths ".", ".." for staleness */
 #define FS_RENAME_DOES_D_MOVE	32768	/* FS will handle d_move()
 					 * during rename() internally.
@@ -285,6 +286,7 @@ extern int dir_notify_enable;
 #include <linux/pid.h>
 #include <linux/mutex.h>
 #include <linux/capability.h>
+#include <linux/rbtree.h>
 
 #include <asm/atomic.h>
 #include <asm/semaphore.h>
@@ -592,6 +594,8 @@ struct inode {
 	struct hlist_node	i_hash;
 	struct list_head	i_list;
 	struct list_head	i_sb_list;
+	struct rb_node		i_flush_node;
+	unsigned long		i_flushed_when;
 	struct list_head	i_dentry;
 	unsigned long		i_ino;
 	atomic_t		i_count;
@@ -1003,9 +1007,11 @@ struct super_block {
 	struct xattr_handler	**s_xattr;
 
 	struct list_head	s_inodes;	/* all inodes */
-	struct list_head	s_dirty;	/* dirty inodes */
-	struct list_head	s_io;		/* parked for writeback */
-	struct list_head	s_more_io;	/* parked for more writeback */
+
+	struct rb_root		s_flush_root;
+	unsigned long		s_flush_count;
+	struct mutex		s_flush_lock;
+
 	struct hlist_head	s_anon;		/* anonymous dentries for (nfs) exporting */
 	struct list_head	s_files;
 
@@ -1315,17 +1321,18 @@ struct super_operations {
  * Q: igrab() only checks on (I_FREEING|I_WILL_FREE).  Should it also check on
  *    I_CLEAR?  If not, why?
  */
-#define I_DIRTY_SYNC		1
-#define I_DIRTY_DATASYNC	2
-#define I_DIRTY_PAGES		4
-#define I_NEW			8
-#define I_WILL_FREE		16
-#define I_FREEING		32
-#define I_CLEAR			64
+#define I_DIRTY_SYNC		(1 << 0)
+#define I_DIRTY_DATASYNC	(1 << 1)
+#define I_DIRTY_PAGES		(1 << 2)
+#define I_NEW			(1 << 3)
+#define I_WILL_FREE		(1 << 4)
+#define I_FREEING		(1 << 5)
+#define I_CLEAR			(1 << 6)
 #define __I_LOCK		7
 #define I_LOCK			(1 << __I_LOCK)
 #define __I_SYNC		8
 #define I_SYNC			(1 << __I_SYNC)
+#define I_DIRTY_NEVER		(1 << 9)
 
 #define I_DIRTY (I_DIRTY_SYNC | I_DIRTY_DATASYNC | I_DIRTY_PAGES)
 
Index: 2624rc3_wb/include/linux/rbtree.h
===================================================================
--- 2624rc3_wb.orig/include/linux/rbtree.h	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/include/linux/rbtree.h	2007-12-11 13:55:01.000000000 -0800
@@ -135,6 +135,8 @@ static inline void rb_set_color(struct r
 #define RB_EMPTY_ROOT(root)	((root)->rb_node == NULL)
 #define RB_EMPTY_NODE(node)	(rb_parent(node) == node)
 #define RB_CLEAR_NODE(node)	(rb_set_parent(node, node))
+#define RB_LINKED_NODE(node)	((node)->rb_parent_color || \
+				 (node)->rb_left || (node)->rb_right)
 
 extern void rb_insert_color(struct rb_node *, struct rb_root *);
 extern void rb_erase(struct rb_node *, struct rb_root *);
Index: 2624rc3_wb/include/linux/writeback.h
===================================================================
--- 2624rc3_wb.orig/include/linux/writeback.h	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/include/linux/writeback.h	2007-12-11 13:55:01.000000000 -0800
@@ -62,7 +62,6 @@ struct writeback_control {
 	unsigned for_reclaim:1;		/* Invoked from the page allocator */
 	unsigned for_writepages:1;	/* This is a writepages() call */
 	unsigned range_cyclic:1;	/* range_start is cyclic */
-	unsigned more_io:1;		/* more io to be dispatched */
 };
 
 /*
@@ -72,6 +71,8 @@ void writeback_inodes(struct writeback_c
 int inode_wait(void *);
 void sync_inodes_sb(struct super_block *, int wait);
 void sync_inodes(int wait);
+void flush_tree_remove(struct super_block *sb, struct inode *inode);
+
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
Index: 2624rc3_wb/mm/shmem.c
===================================================================
--- 2624rc3_wb.orig/mm/shmem.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/mm/shmem.c	2007-12-11 13:55:01.000000000 -0800
@@ -2460,6 +2460,7 @@ static struct file_system_type tmpfs_fs_
 	.name		= "tmpfs",
 	.get_sb		= shmem_get_sb,
 	.kill_sb	= kill_litter_super,
+	.fs_flags	= FS_ANONYMOUS,
 };
 static struct vfsmount *shm_mnt;
 
Index: 2624rc3_wb/mm/tiny-shmem.c
===================================================================
--- 2624rc3_wb.orig/mm/tiny-shmem.c	2007-12-11 13:52:47.000000000 -0800
+++ 2624rc3_wb/mm/tiny-shmem.c	2007-12-11 13:55:01.000000000 -0800
@@ -24,6 +24,7 @@ static struct file_system_type tmpfs_fs_
 	.name		= "tmpfs",
 	.get_sb		= ramfs_get_sb,
 	.kill_sb	= kill_litter_super,
+	.fs_flags	= FS_ANONYMOUS,
 };
 
 static struct vfsmount *shm_mnt;
Index: 2624rc3_wb/mm/page-writeback.c
===================================================================
--- 2624rc3_wb.orig/mm/page-writeback.c	2007-12-11 13:54:53.000000000 -0800
+++ 2624rc3_wb/mm/page-writeback.c	2007-12-11 13:55:49.000000000 -0800
@@ -558,7 +558,6 @@ static void background_writeout(unsigned
 			global_page_state(NR_UNSTABLE_NFS) < background_thresh
 				&& min_pages <= 0)
 			break;
-		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		wbc.pages_skipped = 0;
@@ -566,7 +565,7 @@ static void background_writeout(unsigned
 		min_pages -= MAX_WRITEBACK_PAGES - wbc.nr_to_write;
 		if (wbc.nr_to_write > 0 || wbc.pages_skipped > 0) {
 			/* Wrote less than expected */
-			if (wbc.encountered_congestion || wbc.more_io)
+			if (wbc.encountered_congestion)
 				congestion_wait(WRITE, HZ/10);
 			else
 				break;
@@ -633,12 +632,11 @@ static void wb_kupdate(unsigned long arg
 			global_page_state(NR_UNSTABLE_NFS) +
 			(inodes_stat.nr_inodes - inodes_stat.nr_unused);
 	while (nr_to_write > 0) {
-		wbc.more_io = 0;
 		wbc.encountered_congestion = 0;
 		wbc.nr_to_write = MAX_WRITEBACK_PAGES;
 		writeback_inodes(&wbc);
 		if (wbc.nr_to_write > 0) {
-			if (wbc.encountered_congestion || wbc.more_io)
+			if (wbc.encountered_congestion)
 				congestion_wait(WRITE, HZ/10);
 			else
 				break;	/* All the old data is written */

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2008-01-19  2:51 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2008-01-15  8:09 [patch] Converting writeback linked lists to a tree based data structure Michael Rubin
2008-01-15  8:46 ` Peter Zijlstra
2008-01-15 17:53   ` Michael Rubin
     [not found]     ` <E1JEyWa-0001Ys-F9@localhost.localdomain>
2008-01-16  3:01       ` Fengguang Wu
2008-01-16  3:44         ` Andrew Morton
     [not found]           ` <E1JEzqb-0003YX-Rg@localhost.localdomain>
2008-01-16  4:25             ` Fengguang Wu
2008-01-16  4:42               ` Andrew Morton
     [not found]                 ` <E1JF0It-0000yD-Mi@localhost.localdomain>
2008-01-16  4:55                   ` Fengguang Wu
2008-01-16  5:51                     ` Andrew Morton
     [not found]                       ` <E1JF4Ey-0000x4-5p@localhost.localdomain>
2008-01-16  9:07                         ` Fengguang Wu
2008-01-18  7:36                           ` Mike Waychison
2008-01-16 22:35                         ` David Chinner
     [not found]                           ` <E1JFLEW-0002oE-G1@localhost.localdomain>
2008-01-17  3:16                             ` Fengguang Wu
2008-01-17  5:21                             ` David Chinner
2008-01-16  7:55           ` David Chinner
2008-01-16  8:13             ` Andrew Morton
     [not found]               ` <E1JF7yp-0006l8-5P@localhost.localdomain>
2008-01-16 13:06                 ` Fengguang Wu
2008-01-16 18:55         ` Michael Rubin
     [not found]           ` <E1JFLTR-0002pn-4Y@localhost.localdomain>
2008-01-17  3:31             ` Fengguang Wu
     [not found] ` <E1JFRFm-00011Q-0q@localhost.localdomain>
2008-01-17  9:41   ` Fengguang Wu
2008-01-17 21:07     ` Michael Rubin
     [not found]       ` <E1JFjGz-0001eU-3O@localhost.localdomain>
2008-01-18  4:56         ` Fengguang Wu
2008-01-18  5:41           ` Andi Kleen
     [not found]             ` <E1JFkHy-0001jR-VD@localhost.localdomain>
2008-01-18  6:01               ` Fengguang Wu
2008-01-18  7:48             ` Mike Waychison
2008-01-18  6:43           ` Michael Rubin
     [not found]             ` <E1JFnZz-00015z-Vq@localhost.localdomain>
2008-01-18  9:32               ` Fengguang Wu
2008-01-18  5:01       ` David Chinner
2008-01-18  5:38         ` Michael Rubin
2008-01-18  8:54           ` David Chinner
2008-01-18  9:26             ` Michael Rubin
     [not found]         ` <E1JFjyv-0001hU-FA@localhost.localdomain>
2008-01-18  5:41           ` Fengguang Wu
2008-01-19  2:50           ` David Chinner
  -- strict thread matches above, loose matches on Subject: below --
2007-12-13  0:32 Michael Rubin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).