All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread
@ 2012-03-20 22:56 Jan Kara
  2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
                   ` (6 more replies)
  0 siblings, 7 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel


  Hi,

  this is reworked series getting rid of iput() from flusher thread. The
contents is last two patches from my previous submission but I've split them
up as Christoph asked. Also I've changed some calling conventions and fixed
a couple of bugs I found in the process. Comments welcome.

								Honza

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete()
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-04-30 14:36   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

Move clearing of I_SYNC into inode_sync_complete().  It is more logical to have
clearing of I_SYNC bit and waking of waiters in one place. Also later we will
have two places needing to clear I_SYNC and wake up waiters so this allows them
to use the common helper. Moving of I_SYNC clearing to a later stage of
writeback_single_inode() is safe since we hold i_lock all the time.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |    7 +------
 1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index be84e28..afeecbf 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -231,11 +231,7 @@ static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
 
 static void inode_sync_complete(struct inode *inode)
 {
-	/*
-	 * Prevent speculative execution through
-	 * spin_unlock(&wb->list_lock);
-	 */
-
+	inode->i_state &= ~I_SYNC;
 	smp_mb();
 	wake_up_bit(&inode->i_state, __I_SYNC);
 }
@@ -436,7 +432,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 
 	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	inode->i_state &= ~I_SYNC;
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
  2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-04-30 14:38   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

When writeback_single_inode() is called on inode which has I_SYNC already
set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However
this makes sense only if the caller is flusher thread. For other callers of
writeback_single_inode() it doesn't really make sense and may be even wrong
- flusher thread may be doing WB_SYNC_ALL writeback in parallel.

So we move requeueing from writeback_single_inode() to writeback_sb_inodes().

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c                |   30 ++++++++++++++++--------------
 include/trace/events/writeback.h |   36 +++++++++++++++++++++++++++++-------
 2 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index afeecbf..5f17a16 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -372,21 +372,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		WARN_ON(inode->i_state & I_WILL_FREE);
 
 	if (inode->i_state & I_SYNC) {
-		/*
-		 * If this inode is locked for writeback and we are not doing
-		 * writeback-for-data-integrity, move it to b_more_io so that
-		 * writeback can proceed with the other inodes on s_io.
-		 *
-		 * We'll have another go at writing back this inode when we
-		 * completed a full scan of b_io.
-		 */
-		if (wbc->sync_mode != WB_SYNC_ALL) {
-			requeue_io(inode, wb);
-			trace_writeback_single_inode_requeue(inode, wbc,
-							     nr_to_write);
+		if (wbc->sync_mode != WB_SYNC_ALL)
 			return 0;
-		}
-
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
@@ -575,6 +562,21 @@ static long writeback_sb_inodes(struct super_block *sb,
 			redirty_tail(inode, wb);
 			continue;
 		}
+		if ((inode->i_state & I_SYNC) && wbc.sync_mode != WB_SYNC_ALL) {
+			/*
+			 * If this inode is locked for writeback and we are not
+			 * doing writeback-for-data-integrity, move it to
+			 * b_more_io so that writeback can proceed with the
+			 * other inodes on s_io.
+			 *
+			 * We'll have another go at writing back this inode
+			 * when we completed a full scan of b_io.
+			 */
+			spin_unlock(&inode->i_lock);
+			requeue_io(inode, wb);
+			trace_writeback_sb_inodes_requeue(inode);
+			continue;
+		}
 		__iget(inode);
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index 5973410..7a260d0 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -373,6 +373,35 @@ TRACE_EVENT(balance_dirty_pages,
 	  )
 );
 
+TRACE_EVENT(writeback_sb_inodes_requeue,
+
+	TP_PROTO(struct inode *inode),
+	TP_ARGS(inode),
+
+	TP_STRUCT__entry(
+		__array(char, name, 32)
+		__field(unsigned long, ino)
+		__field(unsigned long, state)
+		__field(unsigned long, dirtied_when)
+	),
+
+	TP_fast_assign(
+		strncpy(__entry->name,
+		        dev_name(inode_to_bdi(inode)->dev), 32);
+		__entry->ino		= inode->i_ino;
+		__entry->state		= inode->i_state;
+		__entry->dirtied_when	= inode->dirtied_when;
+	),
+
+	TP_printk("bdi %s: ino=%lu state=%s dirtied_when=%lu age=%lu",
+		  __entry->name,
+		  __entry->ino,
+		  show_inode_state(__entry->state),
+		  __entry->dirtied_when,
+		  (jiffies - __entry->dirtied_when) / HZ
+	)
+);
+
 DECLARE_EVENT_CLASS(writeback_congest_waited_template,
 
 	TP_PROTO(unsigned int usec_timeout, unsigned int usec_delayed),
@@ -451,13 +480,6 @@ DECLARE_EVENT_CLASS(writeback_single_inode_template,
 	)
 );
 
-DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode_requeue,
-	TP_PROTO(struct inode *inode,
-		 struct writeback_control *wbc,
-		 unsigned long nr_to_write),
-	TP_ARGS(inode, wbc, nr_to_write)
-);
-
 DEFINE_EVENT(writeback_single_inode_template, writeback_single_inode,
 	TP_PROTO(struct inode *inode,
 		 struct writeback_control *wbc,
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
  2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
  2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-03-22  2:41   ` Fengguang Wu
  2012-04-30 14:41   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
writing them all, just clear the bit only when we succeeded writing all the
pages. We also move the clearing of the bit close to other i_state handling to
separate it from writeback list handling. This is desirable because list
handling will differ for flusher thread and other writeback_single_inode()
callers in future. No filesystem plays any tricks with I_DIRTY_PAGES (like
checking it in ->writepages or ->write_inode implementation) so this movement
is safe.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |    5 +++--
 1 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 5f17a16..0ef272e 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -384,7 +384,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
-	inode->i_state &= ~I_DIRTY_PAGES;
 	spin_unlock(&inode->i_lock);
 	spin_unlock(&wb->list_lock);
 
@@ -407,6 +406,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	 * write_inode()
 	 */
 	spin_lock(&inode->i_lock);
+	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
+	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
+		inode->i_state &= ~I_DIRTY_PAGES;
 	dirty = inode->i_state & I_DIRTY;
 	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
 	spin_unlock(&inode->i_lock);
@@ -434,7 +436,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * We didn't write back all the pages.  nfs_writepages()
 			 * sometimes bales out without doing anything.
 			 */
-			inode->i_state |= I_DIRTY_PAGES;
 			if (wbc->nr_to_write <= 0) {
 				/*
 				 * slice used up: queue for next turn
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 4/7] writeback: Separate inode requeueing after writeback
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
                   ` (2 preceding siblings ...)
  2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-04-30 14:43   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

Move inode requeueing after inode has been written out into a separate
function.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   94 ++++++++++++++++++++++++++--------------------------
 1 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0ef272e..789f829 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -343,6 +343,52 @@ static void inode_wait_for_writeback(struct inode *inode,
 	}
 }
 
+static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
+			     struct writeback_control *wbc)
+{
+	if (inode->i_state & I_FREEING)
+		return;
+
+	/*
+	 * Sync livelock prevention. Each inode is tagged and synced in one
+	 * shot. If still dirty, it will be redirty_tail()'ed below.  Update
+	 * the dirty time to prevent enqueue and sync it again.
+	 */
+	if ((inode->i_state & I_DIRTY) &&
+	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		inode->dirtied_when = jiffies;
+
+	if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
+		/*
+		 * We didn't write back all the pages.  nfs_writepages()
+		 * sometimes bales out without doing anything.
+		 */
+		if (wbc->nr_to_write <= 0) {
+			/* Slice used up. Queue for next turn. */
+			requeue_io(inode, wb);
+		} else {
+			/*
+			 * Writeback blocked by something other than
+			 * congestion. Delay the inode for some time to
+			 * avoid spinning on the CPU (100% iowait)
+			 * retrying writeback of the dirty page/inode
+			 * that cannot be performed immediately.
+			 */
+			redirty_tail(inode, wb);
+		}
+	} else if (inode->i_state & I_DIRTY) {
+		/*
+		 * Filesystems can dirty the inode during writeback operations,
+		 * such as delayed allocation during submission or metadata
+		 * updates after data IO completion.
+		 */
+		redirty_tail(inode, wb);
+	} else {
+		/* The inode is clean. Remove from writeback lists. */
+		list_del_init(&inode->i_wb_list);
+	}
+}
+
 /*
  * Write out an inode's dirty pages.  Called under wb->list_lock and
  * inode->i_lock.  Either the caller has an active reference on the inode or
@@ -421,53 +467,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 
 	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	if (!(inode->i_state & I_FREEING)) {
-		/*
-		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
-		 */
-		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
-			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
-			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
-			 */
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
-		} else if (inode->i_state & I_DIRTY) {
-			/*
-			 * Filesystems can dirty the inode during writeback
-			 * operations, such as delayed allocation during
-			 * submission or metadata updates after data IO
-			 * completion.
-			 */
-			redirty_tail(inode, wb);
-		} else {
-			/*
-			 * The inode is clean.  At this point we either have
-			 * a reference to the inode or it's on it's way out.
-			 * No need to add it back to the LRU.
-			 */
-			list_del_init(&inode->i_wb_list);
-		}
-	}
+	inode_wb_requeue(inode, wb, wbc);
 	inode_sync_complete(inode);
 	trace_writeback_single_inode(inode, wbc, nr_to_write);
 	return ret;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode()
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
                   ` (3 preceding siblings ...)
  2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-04-30 14:44   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
  2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
  6 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
So remove the requirement. This makes locking of writeback_single_inode()
temporarily awkward (entering with i_lock, returning with i_lock and
wb->list_lock) but it will be sanitized in the next patch.

Also inode_wait_for_writeback() doesn't need wb->list_lock for anything. It was
just taking it to make usage convenient for callers but with
writeback_single_inode() changing it's not very convenient anymore. So remove
the lock from that function.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   17 +++++++----------
 1 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 789f829..27f7191 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -327,8 +327,7 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 /*
  * Wait for writeback on an inode to complete.
  */
-static void inode_wait_for_writeback(struct inode *inode,
-				     struct bdi_writeback *wb)
+static void inode_wait_for_writeback(struct inode *inode)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -336,9 +335,7 @@ static void inode_wait_for_writeback(struct inode *inode,
 	wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
 	while (inode->i_state & I_SYNC) {
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&wb->list_lock);
 		__wait_on_bit(wqh, &wq, inode_wait, TASK_UNINTERRUPTIBLE);
-		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 	}
 }
@@ -409,7 +406,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
 
 	if (!atomic_read(&inode->i_count))
@@ -423,7 +419,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		/*
 		 * It's a data-integrity sync.  We must wait.
 		 */
-		inode_wait_for_writeback(inode, wb);
+		inode_wait_for_writeback(inode);
 	}
 
 	BUG_ON(inode->i_state & I_SYNC);
@@ -431,7 +427,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Set I_SYNC, reset I_DIRTY_PAGES */
 	inode->i_state |= I_SYNC;
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
 
 	ret = do_writepages(mapping, wbc);
 
@@ -578,6 +573,8 @@ static long writeback_sb_inodes(struct super_block *sb,
 			trace_writeback_sb_inodes_requeue(inode);
 			continue;
 		}
+		spin_unlock(&wb->list_lock);
+
 		__iget(inode);
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
@@ -794,8 +791,10 @@ static long wb_writeback(struct bdi_writeback *wb,
 			trace_writeback_wait(wb->bdi, work);
 			inode = wb_inode(wb->b_more_io.prev);
 			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
+			spin_unlock(&wb->list_lock);
+			inode_wait_for_writeback(inode);
 			spin_unlock(&inode->i_lock);
+			spin_lock(&wb->list_lock);
 		}
 	}
 	spin_unlock(&wb->list_lock);
@@ -1341,7 +1340,6 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, wb, &wbc);
 	spin_unlock(&inode->i_lock);
@@ -1366,7 +1364,6 @@ int sync_inode(struct inode *inode, struct writeback_control *wbc)
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
 	int ret;
 
-	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
 	ret = writeback_single_inode(inode, wb, wbc);
 	spin_unlock(&inode->i_lock);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 6/7] writeback: Refactor writeback_single_inode()
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
                   ` (4 preceding siblings ...)
  2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-03-20 23:35   ` Dave Chinner
  2012-04-30 14:46   ` Christoph Hellwig
  2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
  6 siblings, 2 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

The code in writeback_single_inode() is relatively complex. The list requeing
logic makes sense only for flusher thread but not really for sync_inode() or
write_inode_now() callers. Also when we want to get rid of inode references
held by flusher thread, we will need a special I_SYNC handling there.

So separate part of writeback_single_inode() which does the real writeback work
into __writeback_single_inode() and make writeback_single_inode() do only stuff
necessary for callers writing only one inode, moving the special list handling
into writeback_sb_inodes(). As a sideeffect this fixes a possible race where we
could skip some inode during sync(2) because other writer refiled it from b_io
to b_dirty list. Also I_SYNC handling is moved into the callers of
__writeback_single_inode() to make locking easier.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |  137 +++++++++++++++++++++++++++++++----------------------
 1 files changed, 81 insertions(+), 56 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 27f7191..6f5f930 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -355,6 +355,15 @@ static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
 	    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
 		inode->dirtied_when = jiffies;
 
+	if (wbc->pages_skipped) {
+		/*
+		 * writeback is not making progress due to locked
+		 * buffers. Skip this inode for now.
+		 */
+		redirty_tail(inode, wb);
+		return;
+	}
+
 	if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_DIRTY)) {
 		/*
 		 * We didn't write back all the pages.  nfs_writepages()
@@ -387,46 +396,20 @@ static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
 }
 
 /*
- * Write out an inode's dirty pages.  Called under wb->list_lock and
- * inode->i_lock.  Either the caller has an active reference on the inode or
- * the inode has I_WILL_FREE set.
- *
- * If `wait' is set, wait on the writeout.
- *
- * The whole writeout design is quite complex and fragile.  We want to avoid
- * starvation of particular inodes when others are being redirtied, prevent
- * livelocks, etc.
+ * Write out an inode and its dirty pages. Do not update the writeback list
+ * linkage. That is left to the caller. The caller is also responsible for
+ * setting I_SYNC flag and calling inode_sync_complete() to clear it.
  */
 static int
-writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
-		       struct writeback_control *wbc)
+__writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+			 struct writeback_control *wbc)
 {
 	struct address_space *mapping = inode->i_mapping;
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
 
-	assert_spin_locked(&inode->i_lock);
-
-	if (!atomic_read(&inode->i_count))
-		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
-	else
-		WARN_ON(inode->i_state & I_WILL_FREE);
-
-	if (inode->i_state & I_SYNC) {
-		if (wbc->sync_mode != WB_SYNC_ALL)
-			return 0;
-		/*
-		 * It's a data-integrity sync.  We must wait.
-		 */
-		inode_wait_for_writeback(inode);
-	}
-
-	BUG_ON(inode->i_state & I_SYNC);
-
-	/* Set I_SYNC, reset I_DIRTY_PAGES */
-	inode->i_state |= I_SYNC;
-	spin_unlock(&inode->i_lock);
+	WARN_ON(!(inode->i_state & I_SYNC));
 
 	ret = do_writepages(mapping, wbc);
 
@@ -459,12 +442,65 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		if (ret == 0)
 			ret = err;
 	}
+	trace_writeback_single_inode(inode, wbc, nr_to_write);
+	return ret;
+}
+
+/*
+ * Write out an inode's dirty pages. Either the caller has an active reference
+ * on the inode or the inode has I_WILL_FREE set.
+ *
+ * This function is designed to be called for writing back one inode which
+ * we go e.g. from filesystem. Flusher thread uses __writeback_single_inode()
+ * and does more profound writeback list handling in writeback_sb_inodes().
+ */
+static int
+writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
+		       struct writeback_control *wbc)
+{
+	int ret = 0;
+
+	spin_lock(&inode->i_lock);
+	if (!atomic_read(&inode->i_count))
+		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
+	else
+		WARN_ON(inode->i_state & I_WILL_FREE);
+
+	if (inode->i_state & I_SYNC) {
+		if (wbc->sync_mode != WB_SYNC_ALL)
+			goto out;
+		/*
+		 * It's a data-integrity sync.  We must wait.
+		 */
+		inode_wait_for_writeback(inode);
+	}
+	BUG_ON(inode->i_state & I_SYNC);
+	/*
+	 * Skip inode if it is clean. We don't want to mess with writeback
+	 * lists in this function since flusher thread may be doing for example
+	 * sync in parallel and if we move the inode, it could get skipped. So
+	 * here we make sure inode is on some writeback list and leave it there
+	 * unless we have completely cleaned the inode.
+	 */
+	if (!(inode->i_state & I_DIRTY))
+		goto out;
+	inode->i_state |= I_SYNC;
+	spin_unlock(&inode->i_lock);
+
+	ret = __writeback_single_inode(inode, wb, wbc);
 
 	spin_lock(&wb->list_lock);
 	spin_lock(&inode->i_lock);
-	inode_wb_requeue(inode, wb, wbc);
+	/*
+	 * If inode is clean, remove it from writeback lists. Otherwise don't
+	 * touch it. See comment above for explanation.
+	 */
+	if (!(inode->i_state & I_DIRTY))
+		list_del_init(&inode->i_wb_list);
+	spin_unlock(&wb->list_lock);
 	inode_sync_complete(inode);
-	trace_writeback_single_inode(inode, wbc, nr_to_write);
+out:
+	spin_unlock(&inode->i_lock);
 	return ret;
 }
 
@@ -576,23 +612,24 @@ static long writeback_sb_inodes(struct super_block *sb,
 		spin_unlock(&wb->list_lock);
 
 		__iget(inode);
+		if (inode->i_state & I_SYNC)
+			inode_wait_for_writeback(inode);
+		inode->i_state |= I_SYNC;
+		spin_unlock(&inode->i_lock);
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
-		writeback_single_inode(inode, wb, &wbc);
+		__writeback_single_inode(inode, wb, &wbc);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+		spin_lock(&wb->list_lock);
+		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode, wb);
-		}
+		inode_wb_requeue(inode, wb, &wbc);
+		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);
@@ -1328,7 +1365,6 @@ EXPORT_SYMBOL(sync_inodes_sb);
 int write_inode_now(struct inode *inode, int sync)
 {
 	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-	int ret;
 	struct writeback_control wbc = {
 		.nr_to_write = LONG_MAX,
 		.sync_mode = sync ? WB_SYNC_ALL : WB_SYNC_NONE,
@@ -1340,11 +1376,7 @@ int write_inode_now(struct inode *inode, int sync)
 		wbc.nr_to_write = 0;
 
 	might_sleep();
-	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, &wbc);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
-	return ret;
+	return writeback_single_inode(inode, wb, &wbc);
 }
 EXPORT_SYMBOL(write_inode_now);
 
@@ -1361,14 +1393,7 @@ EXPORT_SYMBOL(write_inode_now);
  */
 int sync_inode(struct inode *inode, struct writeback_control *wbc)
 {
-	struct bdi_writeback *wb = &inode_to_bdi(inode)->wb;
-	int ret;
-
-	spin_lock(&inode->i_lock);
-	ret = writeback_single_inode(inode, wb, wbc);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&wb->list_lock);
-	return ret;
+	return writeback_single_inode(inode, &inode_to_bdi(inode)->wb, wbc);
 }
 EXPORT_SYMBOL(sync_inode);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
                   ` (5 preceding siblings ...)
  2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
@ 2012-03-20 22:56 ` Jan Kara
  2012-03-20 23:50   ` Dave Chinner
                     ` (2 more replies)
  6 siblings, 3 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-20 22:56 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara

Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
because iput() can do a lot of work - for example truncate the inode if it's
the last iput on unlinked file. Some filesystems depend on flusher thread
progressing (e.g. because they need to flush delay allocated blocks to reduce
allocation uncertainty) and so flusher thread doing truncate creates
interesting dependencies and possibilities for deadlocks.

We get rid of iput() in flusher thread by using the fact that I_SYNC inode
flag effectively pins the inode in memory. So if we take care to either hold
i_lock or have I_SYNC set, we can get away without taking inode reference
in writeback_sb_inodes().

To make things work we have to move waiting for I_SYNC from end_writeback() to
evict() just before calling of ->evict_inode. This is because several
filesystems call end_writeback() after they have deleted the inode (btrfs,
gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
waiting for I_SYNC because they call end_writeback() from within a transaction.
Both were not really a problem previously because flusher thread and
->evict_inode() could not run in parallel but now these two could race.
So moving of I_SYNC wait prevents possible races..

As a side effect of these changes, we also fix possible use-after-free in
wb_writeback() because inode_wait_for_writeback() call could try to reacquire
i_lock on the inode that was already free.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c         |   52 ++++++++++++++++++++++++++++++++++----------
 fs/inode.c                |   13 ++++++++++-
 include/linux/fs.h        |    7 +++--
 include/linux/writeback.h |    7 +-----
 4 files changed, 57 insertions(+), 22 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 6f5f930..e5ca4b3 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -325,9 +325,10 @@ static int write_inode(struct inode *inode, struct writeback_control *wbc)
 }
 
 /*
- * Wait for writeback on an inode to complete.
+ * Wait for writeback on an inode to complete. Called with i_lock held.
+ * Caller must make sure inode cannot go away when we drop i_lock.
  */
-static void inode_wait_for_writeback(struct inode *inode)
+void inode_wait_for_writeback(struct inode *inode)
 {
 	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
 	wait_queue_head_t *wqh;
@@ -340,6 +341,26 @@ static void inode_wait_for_writeback(struct inode *inode)
 	}
 }
 
+/*
+ * Sleep until I_SYNC is cleared. This function must be called with i_lock
+ * held and drops it. It is aimed for callers not holding any inode reference
+ * so once i_lock is dropped, inode can go away.
+ */
+static void inode_sleep_on_writeback(struct inode *inode)
+	__releases(inode->i_lock)
+{
+	DEFINE_WAIT(wait);
+	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
+	int sleep;
+
+	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
+	sleep = inode->i_state & I_SYNC;
+	spin_unlock(&inode->i_lock);
+	if (sleep)
+		schedule();
+	finish_wait(wqh, &wait);
+}
+
 static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
 			     struct writeback_control *wbc)
 {
@@ -470,7 +491,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 		if (wbc->sync_mode != WB_SYNC_ALL)
 			goto out;
 		/*
-		 * It's a data-integrity sync.  We must wait.
+		 * It's a data-integrity sync. We must wait. Since callers hold
+		 * inode reference or inode has I_WILL_FREE set, it cannot go
+		 * away under us.
 		 */
 		inode_wait_for_writeback(inode);
 	}
@@ -611,15 +634,23 @@ static long writeback_sb_inodes(struct super_block *sb,
 		}
 		spin_unlock(&wb->list_lock);
 
-		__iget(inode);
-		if (inode->i_state & I_SYNC)
-			inode_wait_for_writeback(inode);
+		if (inode->i_state & I_SYNC) {
+			/* Wait for I_SYNC. This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
+			/* Inode may be gone, start again */
+			continue;
+		}
 		inode->i_state |= I_SYNC;
 		spin_unlock(&inode->i_lock);
+
 		write_chunk = writeback_chunk_size(wb->bdi, work);
 		wbc.nr_to_write = write_chunk;
 		wbc.pages_skipped = 0;
 
+		/*
+		 * We use I_SYNC to pin the inode in memory. While it is set
+		 * end_writeback() will wait so the inode cannot be freed.
+		 */
 		__writeback_single_inode(inode, wb, &wbc);
 
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
@@ -631,10 +662,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		inode_wb_requeue(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		spin_unlock(&wb->list_lock);
-		iput(inode);
-		cond_resched();
-		spin_lock(&wb->list_lock);
+		cond_resched_lock(&wb->list_lock);
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
@@ -829,8 +857,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 			inode = wb_inode(wb->b_more_io.prev);
 			spin_lock(&inode->i_lock);
 			spin_unlock(&wb->list_lock);
-			inode_wait_for_writeback(inode);
-			spin_unlock(&inode->i_lock);
+			/* This function drops i_lock... */
+			inode_sleep_on_writeback(inode);
 			spin_lock(&wb->list_lock);
 		}
 	}
diff --git a/fs/inode.c b/fs/inode.c
index d3ebdbe..3869714 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
 	BUG_ON(!list_empty(&inode->i_data.private_list));
 	BUG_ON(!(inode->i_state & I_FREEING));
 	BUG_ON(inode->i_state & I_CLEAR);
-	inode_sync_wait(inode);
 	/* don't need i_lock here, no concurrent mods to i_state */
 	inode->i_state = I_FREEING | I_CLEAR;
 }
@@ -541,6 +540,18 @@ static void evict(struct inode *inode)
 
 	inode_sb_list_del(inode);
 
+	/*
+	 * Wait for flusher thread to be done with the inode so that filesystem
+	 * does not start destroying it while writeback is still running. Since
+	 * the inode has I_FREEING set, flusher thread won't start new work on
+	 * the inode.  We just have to wait for running writeback to finish. We
+	 * must use i_lock here because flusher thread might be working with
+	 * the inode without I_SYNC set but under i_lock.
+	 */
+	spin_lock(&inode->i_lock);
+	inode_wait_for_writeback(inode);
+	spin_unlock(&inode->i_lock);
+
 	if (op->evict_inode) {
 		op->evict_inode(inode);
 	} else {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..ddb2f55 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1742,9 +1742,10 @@ struct super_operations {
  *			anew.  Other functions will just ignore such inodes,
  *			if appropriate.  I_NEW is used for waiting.
  *
- * I_SYNC		Synchonized write of dirty inode data.  The bits is
- *			set during data writeback, and cleared with a wakeup
- *			on the bit address once it is done.
+ * I_SYNC		Writeback of inode is running. The bit is set during
+ *			data writeback, and cleared with a wakeup on the bit
+ *			address once it is done. The bit is also used to pin
+ *			the inode in memory for flusher thread.
  *
  * I_REFERENCED		Marks the inode as recently references on the LRU list.
  *
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 995b8bf..b785b74 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -94,6 +94,7 @@ long writeback_inodes_wb(struct bdi_writeback *wb, long nr_pages,
 				enum wb_reason reason);
 long wb_do_writeback(struct bdi_writeback *wb, int force_wait);
 void wakeup_flusher_threads(long nr_pages, enum wb_reason reason);
+void inode_wait_for_writeback(struct inode *inode);
 
 /* writeback.h requires fs.h; it, too, is not included from here. */
 static inline void wait_on_inode(struct inode *inode)
@@ -101,12 +102,6 @@ static inline void wait_on_inode(struct inode *inode)
 	might_sleep();
 	wait_on_bit(&inode->i_state, __I_NEW, inode_wait, TASK_UNINTERRUPTIBLE);
 }
-static inline void inode_sync_wait(struct inode *inode)
-{
-	might_sleep();
-	wait_on_bit(&inode->i_state, __I_SYNC, inode_wait,
-							TASK_UNINTERRUPTIBLE);
-}
 
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] writeback: Refactor writeback_single_inode()
  2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
@ 2012-03-20 23:35   ` Dave Chinner
  2012-03-21 10:03     ` Jan Kara
  2012-04-30 14:46   ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-20 23:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:30PM +0100, Jan Kara wrote:
> The code in writeback_single_inode() is relatively complex. The list requeing
> logic makes sense only for flusher thread but not really for sync_inode() or
> write_inode_now() callers. Also when we want to get rid of inode references
> held by flusher thread, we will need a special I_SYNC handling there.
> 
> So separate part of writeback_single_inode() which does the real writeback work
> into __writeback_single_inode() and make writeback_single_inode() do only stuff
> necessary for callers writing only one inode, moving the special list handling
> into writeback_sb_inodes(). As a sideeffect this fixes a possible race where we
> could skip some inode during sync(2) because other writer refiled it from b_io
> to b_dirty list. Also I_SYNC handling is moved into the callers of
> __writeback_single_inode() to make locking easier.
.....
> +static int
> +writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> +		       struct writeback_control *wbc)
> +{
> +	int ret = 0;
> +
> +	spin_lock(&inode->i_lock);
> +	if (!atomic_read(&inode->i_count))
> +		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> +	else
> +		WARN_ON(inode->i_state & I_WILL_FREE);
> +
> +	if (inode->i_state & I_SYNC) {
> +		if (wbc->sync_mode != WB_SYNC_ALL)
> +			goto out;
> +		/*
> +		 * It's a data-integrity sync.  We must wait.
> +		 */
> +		inode_wait_for_writeback(inode);
> +	}
> +	BUG_ON(inode->i_state & I_SYNC);

BUG_ON() seems a little harsh to me. I mean, having I_SYNC set is
not really a fatal error. It's an indication of a problem, but
writeback will continue and not fail catastrophically if this
occurs. So perhaps WARN_ON() might be better here.

.....

> @@ -576,23 +612,24 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		spin_unlock(&wb->list_lock);
>  
>  		__iget(inode);
> +		if (inode->i_state & I_SYNC)
> +			inode_wait_for_writeback(inode);

Even for WB_SYNC_NONE writeback?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
@ 2012-03-20 23:50   ` Dave Chinner
  2012-03-21 10:25     ` Jan Kara
  2012-03-22  3:01   ` Fengguang Wu
  2012-04-30 15:30   ` Christoph Hellwig
  2 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-20 23:50 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> because iput() can do a lot of work - for example truncate the inode if it's
> the last iput on unlinked file. Some filesystems depend on flusher thread
> progressing (e.g. because they need to flush delay allocated blocks to reduce
> allocation uncertainty) and so flusher thread doing truncate creates
> interesting dependencies and possibilities for deadlocks.
> 
> We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> flag effectively pins the inode in memory. So if we take care to either hold
> i_lock or have I_SYNC set, we can get away without taking inode reference
> in writeback_sb_inodes().
> 
> To make things work we have to move waiting for I_SYNC from end_writeback() to
> evict() just before calling of ->evict_inode. This is because several
> filesystems call end_writeback() after they have deleted the inode (btrfs,
> gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> waiting for I_SYNC because they call end_writeback() from within a transaction.
> Both were not really a problem previously because flusher thread and
> ->evict_inode() could not run in parallel but now these two could race.
> So moving of I_SYNC wait prevents possible races..
> 
> As a side effect of these changes, we also fix possible use-after-free in
> wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> i_lock on the inode that was already free.
.....

> diff --git a/fs/inode.c b/fs/inode.c
> index d3ebdbe..3869714 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
>  	BUG_ON(!list_empty(&inode->i_data.private_list));
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(inode->i_state & I_CLEAR);
> -	inode_sync_wait(inode);
>  	/* don't need i_lock here, no concurrent mods to i_state */
>  	inode->i_state = I_FREEING | I_CLEAR;
>  }
> @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
>  
>  	inode_sb_list_del(inode);
>  
> +	/*
> +	 * Wait for flusher thread to be done with the inode so that filesystem
> +	 * does not start destroying it while writeback is still running. Since
> +	 * the inode has I_FREEING set, flusher thread won't start new work on
> +	 * the inode.  We just have to wait for running writeback to finish. We
> +	 * must use i_lock here because flusher thread might be working with
> +	 * the inode without I_SYNC set but under i_lock.
> +	 */
> +	spin_lock(&inode->i_lock);
> +	inode_wait_for_writeback(inode);
> +	spin_unlock(&inode->i_lock);
> +

Why move this wait from end_writeback() to here?  The whole point
of end_writeback() is to provide a barrier that guarantees that
there is no async writeback running when it returns, so it seems
strange to move the barrier out of the function that is supposed to
provide the barrier....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] writeback: Refactor writeback_single_inode()
  2012-03-20 23:35   ` Dave Chinner
@ 2012-03-21 10:03     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-21 10:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Wed 21-03-12 10:35:54, Dave Chinner wrote:
> On Tue, Mar 20, 2012 at 11:56:30PM +0100, Jan Kara wrote:
> > The code in writeback_single_inode() is relatively complex. The list requeing
> > logic makes sense only for flusher thread but not really for sync_inode() or
> > write_inode_now() callers. Also when we want to get rid of inode references
> > held by flusher thread, we will need a special I_SYNC handling there.
> > 
> > So separate part of writeback_single_inode() which does the real writeback work
> > into __writeback_single_inode() and make writeback_single_inode() do only stuff
> > necessary for callers writing only one inode, moving the special list handling
> > into writeback_sb_inodes(). As a sideeffect this fixes a possible race where we
> > could skip some inode during sync(2) because other writer refiled it from b_io
> > to b_dirty list. Also I_SYNC handling is moved into the callers of
> > __writeback_single_inode() to make locking easier.
> .....
> > +static int
> > +writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> > +		       struct writeback_control *wbc)
> > +{
> > +	int ret = 0;
> > +
> > +	spin_lock(&inode->i_lock);
> > +	if (!atomic_read(&inode->i_count))
> > +		WARN_ON(!(inode->i_state & (I_WILL_FREE|I_FREEING)));
> > +	else
> > +		WARN_ON(inode->i_state & I_WILL_FREE);
> > +
> > +	if (inode->i_state & I_SYNC) {
> > +		if (wbc->sync_mode != WB_SYNC_ALL)
> > +			goto out;
> > +		/*
> > +		 * It's a data-integrity sync.  We must wait.
> > +		 */
> > +		inode_wait_for_writeback(inode);
> > +	}
> > +	BUG_ON(inode->i_state & I_SYNC);
> 
> BUG_ON() seems a little harsh to me. I mean, having I_SYNC set is
> not really a fatal error. It's an indication of a problem, but
> writeback will continue and not fail catastrophically if this
> occurs. So perhaps WARN_ON() might be better here.
  Yeah, I just copy-and-pasted this from the old code but I agree. I'll
change this.

> > @@ -576,23 +612,24 @@ static long writeback_sb_inodes(struct super_block *sb,
> >  		spin_unlock(&wb->list_lock);
> >  
> >  		__iget(inode);
> > +		if (inode->i_state & I_SYNC)
> > +			inode_wait_for_writeback(inode);
> 
> Even for WB_SYNC_NONE writeback?
  Above, we already requeued the inode if I_SYNC was set and we are doing
WB_SYNC_NONE writeback. So this is really only for WB_SYNC_ALL writeback.
I'll add a short comment about this.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-20 23:50   ` Dave Chinner
@ 2012-03-21 10:25     ` Jan Kara
  2012-03-22  3:03       ` Fengguang Wu
  2012-03-22  6:27       ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-21 10:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > because iput() can do a lot of work - for example truncate the inode if it's
> > the last iput on unlinked file. Some filesystems depend on flusher thread
> > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > allocation uncertainty) and so flusher thread doing truncate creates
> > interesting dependencies and possibilities for deadlocks.
> > 
> > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > flag effectively pins the inode in memory. So if we take care to either hold
> > i_lock or have I_SYNC set, we can get away without taking inode reference
> > in writeback_sb_inodes().
> > 
> > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > evict() just before calling of ->evict_inode. This is because several
> > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > Both were not really a problem previously because flusher thread and
> > ->evict_inode() could not run in parallel but now these two could race.
> > So moving of I_SYNC wait prevents possible races..
> > 
> > As a side effect of these changes, we also fix possible use-after-free in
> > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > i_lock on the inode that was already free.
> .....
> 
> > diff --git a/fs/inode.c b/fs/inode.c
> > index d3ebdbe..3869714 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> >  	BUG_ON(!(inode->i_state & I_FREEING));
> >  	BUG_ON(inode->i_state & I_CLEAR);
> > -	inode_sync_wait(inode);
> >  	/* don't need i_lock here, no concurrent mods to i_state */
> >  	inode->i_state = I_FREEING | I_CLEAR;
> >  }
> > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> >  
> >  	inode_sb_list_del(inode);
> >  
> > +	/*
> > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > +	 * does not start destroying it while writeback is still running. Since
> > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > +	 * the inode.  We just have to wait for running writeback to finish. We
> > +	 * must use i_lock here because flusher thread might be working with
> > +	 * the inode without I_SYNC set but under i_lock.
> > +	 */
> > +	spin_lock(&inode->i_lock);
> > +	inode_wait_for_writeback(inode);
> > +	spin_unlock(&inode->i_lock);
> > +
> 
> Why move this wait from end_writeback() to here?  The whole point
> of end_writeback() is to provide a barrier that guarantees that
> there is no async writeback running when it returns, so it seems
> strange to move the barrier out of the function that is supposed to
> provide the barrier....
  I agree that end_writeback() will be misnamed after this change. The
thing is (as I tried to explain in the changelog) that a lot of filesystems
get it wrong and call end_writeback() from places where
  a) it is too late and writeback could be scribbling over a freed inode
  b) they cannot really handle waiting for writeback to finish
And nobody really noticed because writeback couldn't be racing with
->evict_inode().

It is not easy to fix this e.g. in GFS2 because end_writeback() does
actually two things: It checks that inode is properly teared down (has no
pages etc.) and waits for writeback. And while waiting for writeback should
happen outside of a running transaction, checking of inode has to happen
between truncate_inode_pages() and deleting inode which, in case of GFS2,
has to be inside a transaction.

So what I decided to do is to split off waiting for writeback from checking
the inode and move waiting for writeback before ->evict_inode() is called.
That makes life easier for the filesystems and AFAICS it is a safe thing to
do.

Maybe we could rename end_writeback() to something like clear_inode() to
reflect this change?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
@ 2012-03-22  2:41   ` Fengguang Wu
  2012-03-22  8:35     ` Jan Kara
  2012-04-30 14:41   ` Christoph Hellwig
  1 sibling, 1 reply; 32+ messages in thread
From: Fengguang Wu @ 2012-03-22  2:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, David Howells

On Tue, Mar 20, 2012 at 11:56:27PM +0100, Jan Kara wrote:
> Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
> writing them all, just clear the bit only when we succeeded writing all the
> pages.

Is this just to reduce one line of code? Well it hardly deserves to
think about the tricky implications..

> We also move the clearing of the bit close to other i_state handling to
> separate it from writeback list handling. This is desirable because list
> handling will differ for flusher thread and other writeback_single_inode()
> callers in future.
>
> No filesystem plays any tricks with I_DIRTY_PAGES (like checking it
> in ->writepages or ->write_inode implementation) so this movement is
> safe.

afs_writeback_all() calls 

        __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

which creates the subtle state (I_DIRTY_PAGES && !PAGECACHE_TAG_DIRTY)
which will no longer exist after this patch.

That line is introduced in 2007 in commit 31143d5d5 ("AFS: implement
basic file write support"), so it should not be trying to alter the
requeue/redirty behavior here. But this patch might still alter the
behavior from redirty_tail() to list_del_init() since the
(inode->i_state & I_DIRTY) test may no longer be true.

> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |    5 +++--
>  1 files changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 5f17a16..0ef272e 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -384,7 +384,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  
>  	/* Set I_SYNC, reset I_DIRTY_PAGES */

That comment is obsoleted (and not really useful).

>  	inode->i_state |= I_SYNC;
> -	inode->i_state &= ~I_DIRTY_PAGES;
>  	spin_unlock(&inode->i_lock);
>  	spin_unlock(&wb->list_lock);
>  
> @@ -407,6 +406,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	 * write_inode()
>  	 */
>  	spin_lock(&inode->i_lock);
> +	/* Clear I_DIRTY_PAGES if we've written out all dirty pages */
> +	if (!mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
> +		inode->i_state &= ~I_DIRTY_PAGES;
>  	dirty = inode->i_state & I_DIRTY;
>  	inode->i_state &= ~(I_DIRTY_SYNC | I_DIRTY_DATASYNC);
>  	spin_unlock(&inode->i_lock);
> @@ -434,7 +436,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  			 * We didn't write back all the pages.  nfs_writepages()
>  			 * sometimes bales out without doing anything.
>  			 */
> -			inode->i_state |= I_DIRTY_PAGES;
>  			if (wbc->nr_to_write <= 0) {
>  				/*
>  				 * slice used up: queue for next turn
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
  2012-03-20 23:50   ` Dave Chinner
@ 2012-03-22  3:01   ` Fengguang Wu
  2012-04-30 15:30   ` Christoph Hellwig
  2 siblings, 0 replies; 32+ messages in thread
From: Fengguang Wu @ 2012-03-22  3:01 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel

> +/*
> + * Sleep until I_SYNC is cleared. This function must be called with i_lock
> + * held and drops it. It is aimed for callers not holding any inode reference
> + * so once i_lock is dropped, inode can go away.
> + */
> +static void inode_sleep_on_writeback(struct inode *inode)
> +	__releases(inode->i_lock)
> +{
> +	DEFINE_WAIT(wait);
> +	wait_queue_head_t *wqh = bit_waitqueue(&inode->i_state, __I_SYNC);
> +	int sleep;

i_state is "unsigned long", it's better to use the same type.

> +
> +	prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> +	sleep = inode->i_state & I_SYNC;
> +	spin_unlock(&inode->i_lock);
> +	if (sleep)
> +		schedule();
> +	finish_wait(wqh, &wait);
> +}
> +

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-21 10:25     ` Jan Kara
@ 2012-03-22  3:03       ` Fengguang Wu
  2012-03-22  6:27       ` Dave Chinner
  1 sibling, 0 replies; 32+ messages in thread
From: Fengguang Wu @ 2012-03-22  3:03 UTC (permalink / raw)
  To: Jan Kara; +Cc: Dave Chinner, Christoph Hellwig, linux-fsdevel

On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote:
> On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > > because iput() can do a lot of work - for example truncate the inode if it's
> > > the last iput on unlinked file. Some filesystems depend on flusher thread
> > > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > > allocation uncertainty) and so flusher thread doing truncate creates
> > > interesting dependencies and possibilities for deadlocks.
> > > 
> > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > > flag effectively pins the inode in memory. So if we take care to either hold
> > > i_lock or have I_SYNC set, we can get away without taking inode reference
> > > in writeback_sb_inodes().
> > > 
> > > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > > evict() just before calling of ->evict_inode. This is because several
> > > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > > Both were not really a problem previously because flusher thread and
> > > ->evict_inode() could not run in parallel but now these two could race.
> > > So moving of I_SYNC wait prevents possible races..
> > > 
> > > As a side effect of these changes, we also fix possible use-after-free in
> > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > > i_lock on the inode that was already free.
> > .....
> > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index d3ebdbe..3869714 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> > >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> > >  	BUG_ON(!(inode->i_state & I_FREEING));
> > >  	BUG_ON(inode->i_state & I_CLEAR);
> > > -	inode_sync_wait(inode);
> > >  	/* don't need i_lock here, no concurrent mods to i_state */
> > >  	inode->i_state = I_FREEING | I_CLEAR;
> > >  }
> > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> > >  
> > >  	inode_sb_list_del(inode);
> > >  
> > > +	/*
> > > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > > +	 * does not start destroying it while writeback is still running. Since
> > > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > > +	 * the inode.  We just have to wait for running writeback to finish. We
> > > +	 * must use i_lock here because flusher thread might be working with
> > > +	 * the inode without I_SYNC set but under i_lock.
> > > +	 */
> > > +	spin_lock(&inode->i_lock);
> > > +	inode_wait_for_writeback(inode);
> > > +	spin_unlock(&inode->i_lock);
> > > +
> > 
> > Why move this wait from end_writeback() to here?  The whole point
> > of end_writeback() is to provide a barrier that guarantees that
> > there is no async writeback running when it returns, so it seems
> > strange to move the barrier out of the function that is supposed to
> > provide the barrier....
>   I agree that end_writeback() will be misnamed after this change. The
> thing is (as I tried to explain in the changelog) that a lot of filesystems
> get it wrong and call end_writeback() from places where
>   a) it is too late and writeback could be scribbling over a freed inode
>   b) they cannot really handle waiting for writeback to finish
> And nobody really noticed because writeback couldn't be racing with
> ->evict_inode().
> 
> It is not easy to fix this e.g. in GFS2 because end_writeback() does
> actually two things: It checks that inode is properly teared down (has no
> pages etc.) and waits for writeback. And while waiting for writeback should
> happen outside of a running transaction, checking of inode has to happen
> between truncate_inode_pages() and deleting inode which, in case of GFS2,
> has to be inside a transaction.
> 
> So what I decided to do is to split off waiting for writeback from checking
> the inode and move waiting for writeback before ->evict_inode() is called.
> That makes life easier for the filesystems and AFAICS it is a safe thing to
> do.
> 
> Maybe we could rename end_writeback() to something like clear_inode() to
> reflect this change?

end_writeback() does not imply "sleeping", but you'll at least remove
that explicit might_sleep() line inside it ;)

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-21 10:25     ` Jan Kara
  2012-03-22  3:03       ` Fengguang Wu
@ 2012-03-22  6:27       ` Dave Chinner
  2012-03-22  9:50         ` Jan Kara
  1 sibling, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2012-03-22  6:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote:
> On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > > because iput() can do a lot of work - for example truncate the inode if it's
> > > the last iput on unlinked file. Some filesystems depend on flusher thread
> > > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > > allocation uncertainty) and so flusher thread doing truncate creates
> > > interesting dependencies and possibilities for deadlocks.
> > > 
> > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > > flag effectively pins the inode in memory. So if we take care to either hold
> > > i_lock or have I_SYNC set, we can get away without taking inode reference
> > > in writeback_sb_inodes().
> > > 
> > > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > > evict() just before calling of ->evict_inode. This is because several
> > > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > > Both were not really a problem previously because flusher thread and
> > > ->evict_inode() could not run in parallel but now these two could race.
> > > So moving of I_SYNC wait prevents possible races..
> > > 
> > > As a side effect of these changes, we also fix possible use-after-free in
> > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > > i_lock on the inode that was already free.
> > .....
> > 
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index d3ebdbe..3869714 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> > >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> > >  	BUG_ON(!(inode->i_state & I_FREEING));
> > >  	BUG_ON(inode->i_state & I_CLEAR);
> > > -	inode_sync_wait(inode);
> > >  	/* don't need i_lock here, no concurrent mods to i_state */
> > >  	inode->i_state = I_FREEING | I_CLEAR;
> > >  }
> > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> > >  
> > >  	inode_sb_list_del(inode);
> > >  
> > > +	/*
> > > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > > +	 * does not start destroying it while writeback is still running. Since
> > > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > > +	 * the inode.  We just have to wait for running writeback to finish. We
> > > +	 * must use i_lock here because flusher thread might be working with
> > > +	 * the inode without I_SYNC set but under i_lock.
> > > +	 */
> > > +	spin_lock(&inode->i_lock);
> > > +	inode_wait_for_writeback(inode);
> > > +	spin_unlock(&inode->i_lock);
> > > +
> > 
> > Why move this wait from end_writeback() to here?  The whole point
> > of end_writeback() is to provide a barrier that guarantees that
> > there is no async writeback running when it returns, so it seems
> > strange to move the barrier out of the function that is supposed to
> > provide the barrier....
>   I agree that end_writeback() will be misnamed after this change. The
> thing is (as I tried to explain in the changelog) that a lot of filesystems
> get it wrong and call end_writeback() from places where
>   a) it is too late and writeback could be scribbling over a freed inode
>   b) they cannot really handle waiting for writeback to finish
> And nobody really noticed because writeback couldn't be racing with
> ->evict_inode().
> 
> It is not easy to fix this e.g. in GFS2 because end_writeback() does
> actually two things: It checks that inode is properly teared down (has no
> pages etc.) and waits for writeback. And while waiting for writeback should
> happen outside of a running transaction, checking of inode has to happen
> between truncate_inode_pages() and deleting inode which, in case of GFS2,
> has to be inside a transaction.

I don't see any problem there. Are you concerned simply because it
is called from the flusher thread? 

We currently have a sane inode lifecycle model via the reference
counting of all active users of inodes, including flusher threads.
IIUC, your concern is that some filesystems have added dependencies
that mean it is unsafe for them to drop the final reference to an
inode in certain contexts? Isn't that a problem with the filesystem
implementation rather than a flaw in the inode reference counting
model?

Indeed, XFS is probably the filesystem that has done transactions in
the context iput_final() for the longest - it has done this from the
initial port to linux (because that's how it worked on Irix).  XFS
has always done truncation of extents beyond EOF on inodes that need
it, and truncation and final unlink of files with a zero link count
when the final reference to an inode is dropped.

Also, XFS now relies on the flusher threads to write all the file
data in the page cache, so if it stalled the flusher thread because
truncation had dependencies on data flushing, then it would have
been broken years ago. XFS avoids dependencies between iput_final
and data flushing by ensuring that it doesn't do transactions until
after waiting for the writeback to end. That is, after
end_writeback() returns in .evict, the inode is guaranteed to be
clean, not visible to writeback, and not dependent on any other
inode (data or metadata) to be truncated. And because of that, we
know that iput_final() will not be stalled in any context - we have
a guarantee of forward progress.

IMO, ensuring data/metadata dependencies don't get tangled are a
filesystem design issue. Trying to hack around them by special
casing certain types of inode usage in the VFS to move it outside
the reference counting model is not the right solution. I really
think that is a slippery slope - the next deadlock will be solved
with some new lifecycle hack, and soon we'll end up with a lifecycle
model that nobody understands or can debug....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-22  2:41   ` Fengguang Wu
@ 2012-03-22  8:35     ` Jan Kara
  2012-03-28  3:11       ` Fengguang Wu
  0 siblings, 1 reply; 32+ messages in thread
From: Jan Kara @ 2012-03-22  8:35 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, David Howells

On Thu 22-03-12 10:41:14, Wu Fengguang wrote:
> On Tue, Mar 20, 2012 at 11:56:27PM +0100, Jan Kara wrote:
> > Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
> > writing them all, just clear the bit only when we succeeded writing all the
> > pages.
> 
> Is this just to reduce one line of code? Well it hardly deserves to
> think about the tricky implications..
  It's not because of the reduction. It's because I don't want to take
i_lock in the beginning of writeback_single_inode() just to clear
I_DIRTY_PAGES when it's not really needed.

> > We also move the clearing of the bit close to other i_state handling to
> > separate it from writeback list handling. This is desirable because list
> > handling will differ for flusher thread and other writeback_single_inode()
> > callers in future.
> >
> > No filesystem plays any tricks with I_DIRTY_PAGES (like checking it
> > in ->writepages or ->write_inode implementation) so this movement is
> > safe.
> 
> afs_writeback_all() calls 
> 
>         __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> 
> which creates the subtle state (I_DIRTY_PAGES && !PAGECACHE_TAG_DIRTY)
> which will no longer exist after this patch.
> 
> That line is introduced in 2007 in commit 31143d5d5 ("AFS: implement
> basic file write support"), so it should not be trying to alter the
> requeue/redirty behavior here. But this patch might still alter the
> behavior from redirty_tail() to list_del_init() since the
> (inode->i_state & I_DIRTY) test may no longer be true.
  Hmm, I don't really see a reason for that __mark_inode_dirty() call.
Since AFS uses PAGECACHE_TAG_DIRTY to track dirty pages in it's
writepages() implementation, having I_DIRTY_PAGES set without any dirty
page just doesn't make any sense. David, do you remember why that
__mark_inode_dirty() is there?

> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/fs-writeback.c |    5 +++--
> >  1 files changed, 3 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> > index 5f17a16..0ef272e 100644
> > --- a/fs/fs-writeback.c
> > +++ b/fs/fs-writeback.c
> > @@ -384,7 +384,6 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> >  
> >  	/* Set I_SYNC, reset I_DIRTY_PAGES */
> 
> That comment is obsoleted (and not really useful).
  Yes. I've removed it, just in some later patch.

								Honza

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-22  6:27       ` Dave Chinner
@ 2012-03-22  9:50         ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-03-22  9:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Thu 22-03-12 17:27:56, Dave Chinner wrote:
> On Wed, Mar 21, 2012 at 11:25:50AM +0100, Jan Kara wrote:
> > On Wed 21-03-12 10:50:05, Dave Chinner wrote:
> > > On Tue, Mar 20, 2012 at 11:56:31PM +0100, Jan Kara wrote:
> > > > Doing iput() from flusher thread (writeback_sb_inodes()) can create problems
> > > > because iput() can do a lot of work - for example truncate the inode if it's
> > > > the last iput on unlinked file. Some filesystems depend on flusher thread
> > > > progressing (e.g. because they need to flush delay allocated blocks to reduce
> > > > allocation uncertainty) and so flusher thread doing truncate creates
> > > > interesting dependencies and possibilities for deadlocks.
> > > > 
> > > > We get rid of iput() in flusher thread by using the fact that I_SYNC inode
> > > > flag effectively pins the inode in memory. So if we take care to either hold
> > > > i_lock or have I_SYNC set, we can get away without taking inode reference
> > > > in writeback_sb_inodes().
> > > > 
> > > > To make things work we have to move waiting for I_SYNC from end_writeback() to
> > > > evict() just before calling of ->evict_inode. This is because several
> > > > filesystems call end_writeback() after they have deleted the inode (btrfs,
> > > > gfs2, ...) and other filesystems (ext3, ext4, reiserfs, ...) can deadlock when
> > > > waiting for I_SYNC because they call end_writeback() from within a transaction.
> > > > Both were not really a problem previously because flusher thread and
> > > > ->evict_inode() could not run in parallel but now these two could race.
> > > > So moving of I_SYNC wait prevents possible races..
> > > > 
> > > > As a side effect of these changes, we also fix possible use-after-free in
> > > > wb_writeback() because inode_wait_for_writeback() call could try to reacquire
> > > > i_lock on the inode that was already free.
> > > .....
> > > 
> > > > diff --git a/fs/inode.c b/fs/inode.c
> > > > index d3ebdbe..3869714 100644
> > > > --- a/fs/inode.c
> > > > +++ b/fs/inode.c
> > > > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> > > >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> > > >  	BUG_ON(!(inode->i_state & I_FREEING));
> > > >  	BUG_ON(inode->i_state & I_CLEAR);
> > > > -	inode_sync_wait(inode);
> > > >  	/* don't need i_lock here, no concurrent mods to i_state */
> > > >  	inode->i_state = I_FREEING | I_CLEAR;
> > > >  }
> > > > @@ -541,6 +540,18 @@ static void evict(struct inode *inode)
> > > >  
> > > >  	inode_sb_list_del(inode);
> > > >  
> > > > +	/*
> > > > +	 * Wait for flusher thread to be done with the inode so that filesystem
> > > > +	 * does not start destroying it while writeback is still running. Since
> > > > +	 * the inode has I_FREEING set, flusher thread won't start new work on
> > > > +	 * the inode.  We just have to wait for running writeback to finish. We
> > > > +	 * must use i_lock here because flusher thread might be working with
> > > > +	 * the inode without I_SYNC set but under i_lock.
> > > > +	 */
> > > > +	spin_lock(&inode->i_lock);
> > > > +	inode_wait_for_writeback(inode);
> > > > +	spin_unlock(&inode->i_lock);
> > > > +
> > > 
> > > Why move this wait from end_writeback() to here?  The whole point
> > > of end_writeback() is to provide a barrier that guarantees that
> > > there is no async writeback running when it returns, so it seems
> > > strange to move the barrier out of the function that is supposed to
> > > provide the barrier....
> >   I agree that end_writeback() will be misnamed after this change. The
> > thing is (as I tried to explain in the changelog) that a lot of filesystems
> > get it wrong and call end_writeback() from places where
> >   a) it is too late and writeback could be scribbling over a freed inode
> >   b) they cannot really handle waiting for writeback to finish
> > And nobody really noticed because writeback couldn't be racing with
> > ->evict_inode().
> > 
> > It is not easy to fix this e.g. in GFS2 because end_writeback() does
> > actually two things: It checks that inode is properly teared down (has no
> > pages etc.) and waits for writeback. And while waiting for writeback should
> > happen outside of a running transaction, checking of inode has to happen
> > between truncate_inode_pages() and deleting inode which, in case of GFS2,
> > has to be inside a transaction.
> 
> I don't see any problem there. Are you concerned simply because it
> is called from the flusher thread? 
  So one of my concern is that if inode_sync_wait() in end_writeback() ever
did anything, then lots of filesystems would start to have problems and
some (e.g. GFS2 as I wrote above) would be unfixable by just moving
end_writeback() elsewhere.

IMO it would be much saner to put the writeback barrier in VFS before
->evict_inode() is called. Currently, this is implicitely true because
flusher thread holds inode reference and I'd like to formulate it as a
rule.

> We currently have a sane inode lifecycle model via the reference
> counting of all active users of inodes, including flusher threads.
> IIUC, your concern is that some filesystems have added dependencies
> that mean it is unsafe for them to drop the final reference to an
> inode in certain contexts? Isn't that a problem with the filesystem
> implementation rather than a flaw in the inode reference counting
> model?
  Well, there are contexts where calling iput_final() is fine, there are
contexts where it certainly is not, and I believe flusher threads are
somewhat on the boundary.

The reason for this whole iput() from flusher thread exercise is that
mm guys would like to get a reasonable guarantee of a forward progress
of an allocation in a specific node / zone (so that they can get rid of
writepage() call from kswapd). That needs targetted cleaning by flusher
thread and some guarantee that flusher thread can make progress. It is
easier to make such arguments when you know iput_final() won't happen
from the flusher thread because then you know only ->writepages() and
->write_inode() can happen from flusher thread.

> Indeed, XFS is probably the filesystem that has done transactions in
> the context iput_final() for the longest - it has done this from the
> initial port to linux (because that's how it worked on Irix).  XFS
> has always done truncation of extents beyond EOF on inodes that need
> it, and truncation and final unlink of files with a zero link count
> when the final reference to an inode is dropped.
> 
> Also, XFS now relies on the flusher threads to write all the file
> data in the page cache, so if it stalled the flusher thread because
> truncation had dependencies on data flushing, then it would have
> been broken years ago. XFS avoids dependencies between iput_final
> and data flushing by ensuring that it doesn't do transactions until
> after waiting for the writeback to end. That is, after
> end_writeback() returns in .evict, the inode is guaranteed to be
> clean, not visible to writeback, and not dependent on any other
> inode (data or metadata) to be truncated. And because of that, we
> know that iput_final() will not be stalled in any context - we have
> a guarantee of forward progress.
  Sure I believe XFS is perfect and doesn't have problems with iput_final()
from flusher thread ;). However there are also other filesystems...

> IMO, ensuring data/metadata dependencies don't get tangled are a
> filesystem design issue. Trying to hack around them by special
> casing certain types of inode usage in the VFS to move it outside
> the reference counting model is not the right solution. I really
> think that is a slippery slope - the next deadlock will be solved
> with some new lifecycle hack, and soon we'll end up with a lifecycle
> model that nobody understands or can debug....
  I agree we trade somewhat simpler rules for flusher thread / filesystems
for some complexity in inode lifecycle. My opinon is that it's worth it.

								Honza

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-22  8:35     ` Jan Kara
@ 2012-03-28  3:11       ` Fengguang Wu
  2012-03-28 15:12         ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Fengguang Wu @ 2012-03-28  3:11 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, David Howells

On Thu, Mar 22, 2012 at 09:35:40AM +0100, Jan Kara wrote:
> On Thu 22-03-12 10:41:14, Wu Fengguang wrote:
> > On Tue, Mar 20, 2012 at 11:56:27PM +0100, Jan Kara wrote:
> > > Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
> > > writing them all, just clear the bit only when we succeeded writing all the
> > > pages.
> > 
> > Is this just to reduce one line of code? Well it hardly deserves to
> > think about the tricky implications..
>   It's not because of the reduction. It's because I don't want to take
> i_lock in the beginning of writeback_single_inode() just to clear
> I_DIRTY_PAGES when it's not really needed.

OK, that makes good sense.

> > > We also move the clearing of the bit close to other i_state handling to
> > > separate it from writeback list handling. This is desirable because list
> > > handling will differ for flusher thread and other writeback_single_inode()
> > > callers in future.
> > >
> > > No filesystem plays any tricks with I_DIRTY_PAGES (like checking it
> > > in ->writepages or ->write_inode implementation) so this movement is
> > > safe.
> > 
> > afs_writeback_all() calls 
> > 
> >         __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > 
> > which creates the subtle state (I_DIRTY_PAGES && !PAGECACHE_TAG_DIRTY)
> > which will no longer exist after this patch.
> > 
> > That line is introduced in 2007 in commit 31143d5d5 ("AFS: implement
> > basic file write support"), so it should not be trying to alter the
> > requeue/redirty behavior here. But this patch might still alter the
> > behavior from redirty_tail() to list_del_init() since the
> > (inode->i_state & I_DIRTY) test may no longer be true.
>   Hmm, I don't really see a reason for that __mark_inode_dirty() call.
> Since AFS uses PAGECACHE_TAG_DIRTY to track dirty pages in it's
> writepages() implementation, having I_DIRTY_PAGES set without any dirty
> page just doesn't make any sense. David, do you remember why that
> __mark_inode_dirty() is there?

Cannot speak for AFS, however I don't see the difference, either.
Why would afs_fsync/afs_setattr need to keep the inode in dirty list?
It seems fine to just remove that line?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-28  3:11       ` Fengguang Wu
@ 2012-03-28 15:12         ` Christoph Hellwig
  2012-04-30 14:39           ` Christoph Hellwig
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2012-03-28 15:12 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, David Howells

On Wed, Mar 28, 2012 at 11:11:54AM +0800, Fengguang Wu wrote:
> Cannot speak for AFS, however I don't see the difference, either.
> Why would afs_fsync/afs_setattr need to keep the inode in dirty list?
> It seems fine to just remove that line?

All of afs_fsync looks bogus to me.  I'd take a bet that if you
removed everything but the filemap_write_and_wait_range call it would
still work exactly as it did before.  Is there a good testsuite for afs?


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete()
  2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
@ 2012-04-30 14:36   ` Christoph Hellwig
  2012-04-30 21:30     ` Jan Kara
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:25PM +0100, Jan Kara wrote:
> Move clearing of I_SYNC into inode_sync_complete().  It is more logical to have
> clearing of I_SYNC bit and waking of waiters in one place. Also later we will
> have two places needing to clear I_SYNC and wake up waiters so this allows them
> to use the common helper. Moving of I_SYNC clearing to a later stage of
> writeback_single_inode() is safe since we hold i_lock all the time.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

The code changes look good, but should we really remove a comment that
describes memory barrier?  IMHO any undocumented barrier is a bug
waiting to happen.


Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()
  2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
@ 2012-04-30 14:38   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:26PM +0100, Jan Kara wrote:
> When writeback_single_inode() is called on inode which has I_SYNC already
> set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However
> this makes sense only if the caller is flusher thread. For other callers of
> writeback_single_inode() it doesn't really make sense and may be even wrong
> - flusher thread may be doing WB_SYNC_ALL writeback in parallel.
> 
> So we move requeueing from writeback_single_inode() to writeback_sb_inodes().
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-28 15:12         ` Christoph Hellwig
@ 2012-04-30 14:39           ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:39 UTC (permalink / raw)
  To: Fengguang Wu; +Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, David Howells

On Wed, Mar 28, 2012 at 11:12:47AM -0400, Christoph Hellwig wrote:
> On Wed, Mar 28, 2012 at 11:11:54AM +0800, Fengguang Wu wrote:
> > Cannot speak for AFS, however I don't see the difference, either.
> > Why would afs_fsync/afs_setattr need to keep the inode in dirty list?
> > It seems fine to just remove that line?
> 
> All of afs_fsync looks bogus to me.  I'd take a bet that if you
> removed everything but the filemap_write_and_wait_range call it would
> still work exactly as it did before.  Is there a good testsuite for afs?

Dave, any chance to get your input on afs_fsync?  I think it could and
should be dramatically simplified.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
  2012-03-22  2:41   ` Fengguang Wu
@ 2012-04-30 14:41   ` Christoph Hellwig
  2012-04-30 21:21     ` Jan Kara
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:41 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:27PM +0100, Jan Kara wrote:
> Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
> writing them all, just clear the bit only when we succeeded writing all the
> pages. We also move the clearing of the bit close to other i_state handling to
> separate it from writeback list handling. This is desirable because list
> handling will differ for flusher thread and other writeback_single_inode()
> callers in future. No filesystem plays any tricks with I_DIRTY_PAGES (like
> checking it in ->writepages or ->write_inode implementation) so this movement
> is safe.

Looks good to me,

Reviewed-by: Christoph Hellwig <hch@lst.de>

In addition to this cleanup I wonder if we could go further and just
kill I_DIRTY_PAGES off.  I can't see any value add over using
mapping_tagged(mapping, PAGECACHE_TAG_DIRTY), which has the big benefit
of beeing managed transparently by the pagecache radix tree.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/7] writeback: Separate inode requeueing after writeback
  2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
@ 2012-04-30 14:43   ` Christoph Hellwig
  2012-04-30 21:42     ` Jan Kara
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

> +static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
> +			     struct writeback_control *wbc)

I'd just call this requeue_inode to fit the naming scheme in elsewhere
in the file.  I'd also suggest adding a small top of function comment
explaining it.

Otherwise looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode()
  2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
@ 2012-04-30 14:44   ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

On Tue, Mar 20, 2012 at 11:56:29PM +0100, Jan Kara wrote:
> writeback_single_inode() doesn't need wb->list_lock for anything on entry now.
> So remove the requirement. This makes locking of writeback_single_inode()
> temporarily awkward (entering with i_lock, returning with i_lock and
> wb->list_lock) but it will be sanitized in the next patch.
> 
> Also inode_wait_for_writeback() doesn't need wb->list_lock for anything. It was
> just taking it to make usage convenient for callers but with
> writeback_single_inode() changing it's not very convenient anymore. So remove
> the lock from that function.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Looks good,


Reviewed-by: Christoph Hellwig <hch@lst.de>

In general this should have the sparse lock acquire/release annotations,
but if the next patch changes this anyway it might not be as important.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 6/7] writeback: Refactor writeback_single_inode()
  2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
  2012-03-20 23:35   ` Dave Chinner
@ 2012-04-30 14:46   ` Christoph Hellwig
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 14:46 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, linux-fsdevel

Looks good if the comments from Dave are addressed,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
  2012-03-20 23:50   ` Dave Chinner
  2012-03-22  3:01   ` Fengguang Wu
@ 2012-04-30 15:30   ` Christoph Hellwig
  2012-04-30 21:58     ` Jan Kara
  2 siblings, 1 reply; 32+ messages in thread
From: Christoph Hellwig @ 2012-04-30 15:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: Wu Fengguang, linux-fsdevel, Al Viro

> - * Wait for writeback on an inode to complete.
> + * Wait for writeback on an inode to complete. Called with i_lock held.
> + * Caller must make sure inode cannot go away when we drop i_lock.
>   */
> -static void inode_wait_for_writeback(struct inode *inode)
> +void inode_wait_for_writeback(struct inode *inode)
>  {
>  	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
>  	wait_queue_head_t *wqh;
> @@ -340,6 +341,26 @@ static void inode_wait_for_writeback(struct inode *inode)
>  	}
>  }

This needs way better documentation, preferably using kernel doc.

I also suspect we should rename inode_wait_for_writeback to
__inode_wait_for_writeback and make inode_wait_for_writeback the public
wrapper that takes i_lock.

> diff --git a/fs/inode.c b/fs/inode.c
> index d3ebdbe..3869714 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
>  	BUG_ON(!list_empty(&inode->i_data.private_list));
>  	BUG_ON(!(inode->i_state & I_FREEING));
>  	BUG_ON(inode->i_state & I_CLEAR);
> -	inode_sync_wait(inode);
>  	/* don't need i_lock here, no concurrent mods to i_state */
>  	inode->i_state = I_FREEING | I_CLEAR;
>  }

end_writeback is really misnamed now and needs a better name.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling
  2012-04-30 14:41   ` Christoph Hellwig
@ 2012-04-30 21:21     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-04-30 21:21 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Wu Fengguang, linux-fsdevel

On Mon 30-04-12 10:41:38, Christoph Hellwig wrote:
> On Tue, Mar 20, 2012 at 11:56:27PM +0100, Jan Kara wrote:
> > Instead of clearing I_DIRTY_PAGES and resetting it when we didn't succeed in
> > writing them all, just clear the bit only when we succeeded writing all the
> > pages. We also move the clearing of the bit close to other i_state handling to
> > separate it from writeback list handling. This is desirable because list
> > handling will differ for flusher thread and other writeback_single_inode()
> > callers in future. No filesystem plays any tricks with I_DIRTY_PAGES (like
> > checking it in ->writepages or ->write_inode implementation) so this movement
> > is safe.
> 
> Looks good to me,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
  Thanks for review.

> In addition to this cleanup I wonder if we could go further and just
> kill I_DIRTY_PAGES off.  I can't see any value add over using
> mapping_tagged(mapping, PAGECACHE_TAG_DIRTY), which has the big benefit
> of beeing managed transparently by the pagecache radix tree.
  The only reason I can thing off is that checking whether inode is dirty
is simpler with I_DIRTY_PAGES. Without it the check would have to be
(inode->i_state & I_DIRTY) || mapping_tagged(inode->i_mapping,
    PAGECACHE_TAG_DIRTY)
which would need a separate helper function I guess.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete()
  2012-04-30 14:36   ` Christoph Hellwig
@ 2012-04-30 21:30     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-04-30 21:30 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Wu Fengguang, linux-fsdevel

On Mon 30-04-12 10:36:41, Christoph Hellwig wrote:
> On Tue, Mar 20, 2012 at 11:56:25PM +0100, Jan Kara wrote:
> > Move clearing of I_SYNC into inode_sync_complete().  It is more logical to have
> > clearing of I_SYNC bit and waking of waiters in one place. Also later we will
> > have two places needing to clear I_SYNC and wake up waiters so this allows them
> > to use the common helper. Moving of I_SYNC clearing to a later stage of
> > writeback_single_inode() is safe since we hold i_lock all the time.
> > 
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> The code changes look good, but should we really remove a comment that
> describes memory barrier?  IMHO any undocumented barrier is a bug
> waiting to happen.
> 
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
  The old comment didn't look too useful to me and the necessity of a
barrier is documented in wake_up_bit(). But yes, some comment should be
there. I added there simple:
        inode->i_state &= ~I_SYNC;
        /* Waiters must see I_SYNC cleared before being woken up */
        smp_mb();
        wake_up_bit(&inode->i_state, __I_SYNC);

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 4/7] writeback: Separate inode requeueing after writeback
  2012-04-30 14:43   ` Christoph Hellwig
@ 2012-04-30 21:42     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-04-30 21:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Wu Fengguang, linux-fsdevel

On Mon 30-04-12 10:43:40, Christoph Hellwig wrote:
> > +static void inode_wb_requeue(struct inode *inode, struct bdi_writeback *wb,
> > +			     struct writeback_control *wbc)
> 
> I'd just call this requeue_inode to fit the naming scheme in elsewhere
> in the file.  I'd also suggest adding a small top of function comment
> explaining it.
> 
> Otherwise looks good,
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>
  OK, comment added:
/*
 * Find proper writeback list for the inode depending on its current state and
 * possibly also change of its state while we were doing writeback.  Here we
 * handle things such as livelock prevention or fairness of writeback among
 * inodes. This function can be called only by flusher thread - noone else
 * processes all inodes in writeback lists and requeueing inodes behind flusher
 * thread's back can have unexpected consequences.
 */

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 7/7] writeback: Avoid iput() from flusher thread
  2012-04-30 15:30   ` Christoph Hellwig
@ 2012-04-30 21:58     ` Jan Kara
  0 siblings, 0 replies; 32+ messages in thread
From: Jan Kara @ 2012-04-30 21:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Jan Kara, Wu Fengguang, linux-fsdevel, Al Viro

On Mon 30-04-12 11:30:27, Christoph Hellwig wrote:
> > - * Wait for writeback on an inode to complete.
> > + * Wait for writeback on an inode to complete. Called with i_lock held.
> > + * Caller must make sure inode cannot go away when we drop i_lock.
> >   */
> > -static void inode_wait_for_writeback(struct inode *inode)
> > +void inode_wait_for_writeback(struct inode *inode)
> >  {
> >  	DEFINE_WAIT_BIT(wq, &inode->i_state, __I_SYNC);
> >  	wait_queue_head_t *wqh;
> > @@ -340,6 +341,26 @@ static void inode_wait_for_writeback(struct inode *inode)
> >  	}
> >  }
> 
> This needs way better documentation, preferably using kernel doc.
  OK, will improve.

> I also suspect we should rename inode_wait_for_writeback to
> __inode_wait_for_writeback and make inode_wait_for_writeback the public
> wrapper that takes i_lock.
  Good idea. Will do.

> > diff --git a/fs/inode.c b/fs/inode.c
> > index d3ebdbe..3869714 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -510,7 +510,6 @@ void end_writeback(struct inode *inode)
> >  	BUG_ON(!list_empty(&inode->i_data.private_list));
> >  	BUG_ON(!(inode->i_state & I_FREEING));
> >  	BUG_ON(inode->i_state & I_CLEAR);
> > -	inode_sync_wait(inode);
> >  	/* don't need i_lock here, no concurrent mods to i_state */
> >  	inode->i_state = I_FREEING | I_CLEAR;
> >  }
> 
> end_writeback is really misnamed now and needs a better name.
  Yes. I think old good clear_inode() may be fine? The only thing we really
*do* in that function is setting of I_CLEAR. I guess I'll do the rename in
one big jumbo patch. Would that be OK? I don't want to spam with 50
one-line patches...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2012-04-30 21:58 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-03-20 22:56 [PATCH 0/7 v2] writeback: Avoid iput() from flusher thread Jan Kara
2012-03-20 22:56 ` [PATCH 1/7] writeback: Move clearing of I_SYNC into inode_sync_complete() Jan Kara
2012-04-30 14:36   ` Christoph Hellwig
2012-04-30 21:30     ` Jan Kara
2012-03-20 22:56 ` [PATCH 2/7] writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() Jan Kara
2012-04-30 14:38   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 3/7] writeback: Move I_DIRTY_PAGES handling Jan Kara
2012-03-22  2:41   ` Fengguang Wu
2012-03-22  8:35     ` Jan Kara
2012-03-28  3:11       ` Fengguang Wu
2012-03-28 15:12         ` Christoph Hellwig
2012-04-30 14:39           ` Christoph Hellwig
2012-04-30 14:41   ` Christoph Hellwig
2012-04-30 21:21     ` Jan Kara
2012-03-20 22:56 ` [PATCH 4/7] writeback: Separate inode requeueing after writeback Jan Kara
2012-04-30 14:43   ` Christoph Hellwig
2012-04-30 21:42     ` Jan Kara
2012-03-20 22:56 ` [PATCH 5/7] writeback: Remove wb->list_lock from writeback_single_inode() Jan Kara
2012-04-30 14:44   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 6/7] writeback: Refactor writeback_single_inode() Jan Kara
2012-03-20 23:35   ` Dave Chinner
2012-03-21 10:03     ` Jan Kara
2012-04-30 14:46   ` Christoph Hellwig
2012-03-20 22:56 ` [PATCH 7/7] writeback: Avoid iput() from flusher thread Jan Kara
2012-03-20 23:50   ` Dave Chinner
2012-03-21 10:25     ` Jan Kara
2012-03-22  3:03       ` Fengguang Wu
2012-03-22  6:27       ` Dave Chinner
2012-03-22  9:50         ` Jan Kara
2012-03-22  3:01   ` Fengguang Wu
2012-04-30 15:30   ` Christoph Hellwig
2012-04-30 21:58     ` Jan Kara

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.