* [PATCH 1/2] writeback: Improve busyloop prevention
@ 2011-09-08  0:44 Jan Kara
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2011-09-08  0:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang, Dave Chinner, Jan Kara, Christoph Hellwig

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through the b_io list,
just wait for a moment and try again.

CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..f506542 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).  Wait for a while to avoid busylooping.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
-		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		schedule_timeout(pause);
+		pause <<= 1;
+		if (pause > HZ / 10)
+			pause = HZ / 10;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1



* [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
@ 2011-09-08  0:44 ` Jan Kara
  2011-09-08  1:22   ` Wu Fengguang
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-09-08  0:44 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang, Dave Chinner, Jan Kara, Christoph Hellwig

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
some cases and in other cases it is a really bad thing. In particular XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is a common thing with XFS while extending files, the result
can be that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback(), we can
call requeue_io() in cases where a quick retry is required, without fear of
raising CPU consumption too much.
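
For reference, the practical difference between the two calls is which queue
the inode lands on. A simplified sketch of the two helpers (not the exact
code in fs/fs-writeback.c, details trimmed):

	/*
	 * redirty_tail() parks the inode back on b_dirty, so it is only
	 * considered again once it looks expired (up to
	 * dirty_expire_centisecs later), while requeue_io() puts it on
	 * b_more_io, which wb_writeback() retries in its very next pass.
	 */
	static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
	{
		if (!list_empty(&wb->b_dirty)) {
			struct inode *newest = wb_inode(wb->b_dirty.next);

			/* keep b_dirty ordered by dirtied_when */
			if (time_before(inode->dirtied_when, newest->dirtied_when))
				inode->dirtied_when = jiffies;
		}
		list_move(&inode->i_wb_list, &wb->b_dirty);
	}

	static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
	{
		list_move(&inode->i_wb_list, &wb->b_more_io);
	}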

CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index f506542..9bb4e96 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. nfs_writepages()
+			 * sometimes bales out without doing anything or we
+			 * just run out of our writeback slice.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
+			requeue_io(inode, wb);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1



* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
@ 2011-09-08  0:57 ` Wu Fengguang
  2011-09-08 13:49   ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-09-08  0:57 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

On Thu, Sep 08, 2011 at 08:44:43AM +0800, Jan Kara wrote:
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again.
 
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>

with comments below.

> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);

__set_current_state(TASK_INTERRUPTIBLE);

> +		schedule_timeout(pause);

> +		pause <<= 1;
> +		if (pause > HZ / 10)
> +			pause = HZ / 10;

It's a bit safer to do

                if (pause < HZ / 10)
                        pause <<= 1;

in case someone hacked HZ=1.
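
A toy user-space illustration of why the ordering matters (not kernel code;
with integer division HZ / 10 is 0 when HZ is 1, so the ordering in the
patch collapses pause to 0 and schedule_timeout(0) returns immediately):

	#include <stdio.h>

	/* ordering from the patch: shift first, then cap at HZ / 10 */
	static long cap_after_shift(long pause, long hz)
	{
		pause <<= 1;
		if (pause > hz / 10)
			pause = hz / 10;
		return pause;
	}

	/* suggested ordering: only shift while still below HZ / 10 */
	static long shift_if_below_cap(long pause, long hz)
	{
		if (pause < hz / 10)
			pause <<= 1;
		return pause;
	}

	int main(void)
	{
		long hz = 1;	/* someone hacked HZ=1 */
		long a = 1, b = 1;
		int i;

		for (i = 1; i <= 3; i++) {
			a = cap_after_shift(a, hz);	/* drops to 0: busyloop */
			b = shift_if_below_cap(b, hz);	/* stays at 1: still sleeps */
			printf("pass %d: cap-after-shift=%ld shift-if-below-cap=%ld\n",
			       i, a, b);
		}
		return 0;
	}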

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
@ 2011-09-08  1:22   ` Wu Fengguang
  2011-09-08 15:03     ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-09-08  1:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

Jan,

> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}

write_inode() typically returns an error after redirtying the inode.
So the conditions inode_written=false and (inode->i_state & I_DIRTY)
are mostly on/off together. For the cases where they disagree, it's probably
a filesystem bug -- at least I don't think some FS will deliberately
return success while redirtying the inode, or the reverse.

>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);

Can you elaborate on the livelock in the inode_written=true case below?
Why is the sleep in the wb_writeback() loop not enough?

> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have

Thanks,
Fengguang


* Re: [PATCH 1/2] writeback: Improve busyloop prevention
  2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
@ 2011-09-08 13:49   ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2011-09-08 13:49 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig

On Thu 08-09-11 08:57:50, Wu Fengguang wrote:
> On Thu, Sep 08, 2011 at 08:44:43AM +0800, Jan Kara wrote:
> > Writeback of an inode can be stalled by things like internal fs locks being
> > held. So in case we didn't write anything during a pass through b_io list,
> > just wait for a moment and try again.
>  
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> 
> with comments below.
> 
> > +		trace_writeback_wait(wb->bdi, work);
> > +		spin_unlock(&wb->list_lock);
> 
> __set_current_state(TASK_INTERRUPTIBLE);
  Ah, right. Thanks for catching this.
> 
> > +		schedule_timeout(pause);
> 
> > +		pause <<= 1;
> > +		if (pause > HZ / 10)
> > +			pause = HZ / 10;
> 
> It's a bit more safer to do
> 
>                 if (pause < HZ / 10)
>                         pause <<= 1;
> 
> in case someone hacked HZ=1.
  Good idea. Done. Thanks for review.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08  1:22   ` Wu Fengguang
@ 2011-09-08 15:03     ` Jan Kara
  2011-09-18 14:07       ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-09-08 15:03 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig

On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> Jan,
> 
> > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> >  		int err = write_inode(inode, wbc);
> > +		if (!err)
> > +			inode_written = true;
> >  		if (ret == 0)
> >  			ret = err;
> >  	}
> 
> write_inode() typically return error after redirtying the inode.
> So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> are mostly on/off together. For the cases they disagree, it's probably
> a filesystem bug -- at least I don't think some FS will deliberately 
> return success while redirtying the inode, or the reverse.
  There is a possibility that someone else redirties the inode between the
moment the I_DIRTY bits are cleared in writeback_single_inode() and the
moment I_DIRTY is checked again after ->write_inode() has been called.
Especially when write_inode() blocks waiting for some IO, this isn't that
hard to hit. So there are valid (although relatively rare) cases when
inode_written differs from the result of the I_DIRTY check.
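
A toy user-space model of that window (names are made up just to show the
timing; the real players are writeback_single_inode(), ->write_inode() and
a concurrent mark_inode_dirty()):

	#include <pthread.h>
	#include <stdatomic.h>
	#include <stdbool.h>
	#include <stdio.h>
	#include <unistd.h>

	static atomic_bool dirty;		/* stands in for I_DIRTY */

	static void *redirtier(void *arg)
	{
		(void)arg;
		usleep(1000);			/* dirty the "inode" while the write is in flight */
		atomic_store(&dirty, true);
		return NULL;
	}

	int main(void)
	{
		pthread_t t;
		bool inode_written;

		atomic_store(&dirty, false);	/* writeback clears the dirty bits ... */
		pthread_create(&t, NULL, redirtier, NULL);

		usleep(10000);			/* ... ->write_inode() blocks on IO for a while ... */
		inode_written = true;		/* ... and eventually succeeds ... */

		pthread_join(t, NULL);
		/* ... yet the post-write check can still see the inode dirty */
		printf("inode_written=%d dirty=%d\n",
		       inode_written, atomic_load(&dirty));
		return 0;
	}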

> >  		} else if (inode->i_state & I_DIRTY) {
> >  			/*
> >  			 * Filesystems can dirty the inode during writeback
> >  			 * operations, such as delayed allocation during
> >  			 * submission or metadata updates after data IO
> > -			 * completion.
> > +			 * completion. Also inode could have been dirtied by
> > +			 * some process aggressively touching metadata.
> > +			 * Finally, filesystem could just fail to write the
> > +			 * inode for some reason. We have to distinguish the
> > +			 * last case from the previous ones - in the last case
> > +			 * we want to give the inode quick retry, in the
> > +			 * other cases we want to put it back to the dirty list
> > +			 * to avoid livelocking of writeback.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			if (inode_written)
> > +				redirty_tail(inode, wb);
> 
> Can you elaborate the livelock in the below inode_written=true case?
> Why the sleep in the wb_writeback() loop is not enough?
  If someone were able to consistently trigger the race window and redirty
the inode before we check here, we would loop for a long time always
writing just this inode and thus effectively stalling other writeback.
That's why I push the redirtied inode behind the other inodes in the dirty
list.

								Honza
> 
> > +			else
> > +				requeue_io(inode, wb);
> >  		} else {
> >  			/*
> >  			 * The inode is clean.  At this point we either have
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-08 15:03     ` Jan Kara
@ 2011-09-18 14:07       ` Wu Fengguang
  2011-10-05 17:39         ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-09-18 14:07 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

On Thu, Sep 08, 2011 at 11:03:40PM +0800, Jan Kara wrote:
> On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> > Jan,
> > 
> > > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> > >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> > >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> > >  		int err = write_inode(inode, wbc);
> > > +		if (!err)
> > > +			inode_written = true;
> > >  		if (ret == 0)
> > >  			ret = err;
> > >  	}
> > 
> > write_inode() typically return error after redirtying the inode.
> > So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> > are mostly on/off together. For the cases they disagree, it's probably
> > a filesystem bug -- at least I don't think some FS will deliberately 
> > return success while redirtying the inode, or the reverse.
>   There is a possibility someone else redirties the inode between the moment
> I_DIRTY bits are cleared in writeback_single_inode() and the check for
> I_DIRTY is done after ->write_inode() is called. Especially when
> write_inode() blocks waiting for some IO this isn't that hard to happen. So
> there are valid (although relatively rare) cases when inode_written is
> different from the result of I_DIRTY check.

Ah yes, that's a good point.

> > >  		} else if (inode->i_state & I_DIRTY) {
> > >  			/*
> > >  			 * Filesystems can dirty the inode during writeback
> > >  			 * operations, such as delayed allocation during
> > >  			 * submission or metadata updates after data IO
> > > -			 * completion.
> > > +			 * completion. Also inode could have been dirtied by
> > > +			 * some process aggressively touching metadata.
> > > +			 * Finally, filesystem could just fail to write the
> > > +			 * inode for some reason. We have to distinguish the
> > > +			 * last case from the previous ones - in the last case
> > > +			 * we want to give the inode quick retry, in the
> > > +			 * other cases we want to put it back to the dirty list
> > > +			 * to avoid livelocking of writeback.
> > >  			 */
> > > -			redirty_tail(inode, wb);
> > > +			if (inode_written)
> > > +				redirty_tail(inode, wb);
> > 
> > Can you elaborate the livelock in the below inode_written=true case?
> > Why the sleep in the wb_writeback() loop is not enough?
>   In case someone would be able to consistently trigger the race window and
> redirty the inode before we check here, we would loop for a long time
> always writing just this inode and thus effectivelly stalling other
> writeback. That's why I push redirtied inode behind other inodes in the
> dirty list.

Agreed. All that's left to do is to confirm whether this addresses
Christoph's original problem.

Acked-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-09-18 14:07       ` Wu Fengguang
@ 2011-10-05 17:39         ` Jan Kara
  2011-10-07 13:43           ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-05 17:39 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig

On Sun 18-09-11 22:07:37, Wu Fengguang wrote:
> On Thu, Sep 08, 2011 at 11:03:40PM +0800, Jan Kara wrote:
> > On Thu 08-09-11 09:22:36, Wu Fengguang wrote:
> > > Jan,
> > > 
> > > > @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
> > > >  	/* Don't write the inode if only I_DIRTY_PAGES was set */
> > > >  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
> > > >  		int err = write_inode(inode, wbc);
> > > > +		if (!err)
> > > > +			inode_written = true;
> > > >  		if (ret == 0)
> > > >  			ret = err;
> > > >  	}
> > > 
> > > write_inode() typically return error after redirtying the inode.
> > > So the conditions inode_written=false and (inode->i_state & I_DIRTY)
> > > are mostly on/off together. For the cases they disagree, it's probably
> > > a filesystem bug -- at least I don't think some FS will deliberately 
> > > return success while redirtying the inode, or the reverse.
> >   There is a possibility someone else redirties the inode between the moment
> > I_DIRTY bits are cleared in writeback_single_inode() and the check for
> > I_DIRTY is done after ->write_inode() is called. Especially when
> > write_inode() blocks waiting for some IO this isn't that hard to happen. So
> > there are valid (although relatively rare) cases when inode_written is
> > different from the result of I_DIRTY check.
> 
> Ah yes, that's good point.
> 
> > > >  		} else if (inode->i_state & I_DIRTY) {
> > > >  			/*
> > > >  			 * Filesystems can dirty the inode during writeback
> > > >  			 * operations, such as delayed allocation during
> > > >  			 * submission or metadata updates after data IO
> > > > -			 * completion.
> > > > +			 * completion. Also inode could have been dirtied by
> > > > +			 * some process aggressively touching metadata.
> > > > +			 * Finally, filesystem could just fail to write the
> > > > +			 * inode for some reason. We have to distinguish the
> > > > +			 * last case from the previous ones - in the last case
> > > > +			 * we want to give the inode quick retry, in the
> > > > +			 * other cases we want to put it back to the dirty list
> > > > +			 * to avoid livelocking of writeback.
> > > >  			 */
> > > > -			redirty_tail(inode, wb);
> > > > +			if (inode_written)
> > > > +				redirty_tail(inode, wb);
> > > 
> > > Can you elaborate the livelock in the below inode_written=true case?
> > > Why the sleep in the wb_writeback() loop is not enough?
> >   In case someone would be able to consistently trigger the race window and
> > redirty the inode before we check here, we would loop for a long time
> > always writing just this inode and thus effectivelly stalling other
> > writeback. That's why I push redirtied inode behind other inodes in the
> > dirty list.
> 
> Agreed. All the left to do is to confirm whether this addresses
> Christoph's original problem.
> 
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
  Great, thanks for review! I'll resend the two patches to Christoph so
that he can try them.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-05 17:39         ` Jan Kara
@ 2011-10-07 13:43           ` Wu Fengguang
  2011-10-07 14:22             ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-07 13:43 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

>   Great, thanks for review! I'll resend the two patches to Christoph so
> that he can try them.

Jan, I'd like to test out your updated patches with my stupid dd
workloads. Would you (re)send them publicly?

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 13:43           ` Wu Fengguang
@ 2011-10-07 14:22             ` Jan Kara
  2011-10-07 14:29               ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-07 14:22 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 599 bytes --]

On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> >   Great, thanks for review! I'll resend the two patches to Christoph so
> > that he can try them.
> 
> Jan, I'd like to test out your updated patches with my stupid dd
> workloads. Would you (re)send them publicly?
  Ah, I resent them publicly on Wednesday
(http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
apparently does not include addresses from Acked-by tags in the list of
recipients, so you didn't get them. Sorry for that. The patches are attached
for your convenience.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

[-- Attachment #2: 0001-writeback-Improve-busyloop-prevention.patch --]
[-- Type: text/x-patch, Size: 2241 bytes --]

From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:05:25 +0200
Subject: [PATCH 1/2] writeback: Improve busyloop prevention

Writeback of an inode can be stalled by things like internal fs locks being
held. So in case we didn't write anything during a pass through the b_io list,
just wait for a moment and try again.

CC: Christoph Hellwig <hch@infradead.org>
Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   26 ++++++++++++++------------
 1 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 04cf3b9..bdeb26a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
 	unsigned long wb_start = jiffies;
 	long nr_pages = work->nr_pages;
 	unsigned long oldest_jif;
-	struct inode *inode;
 	long progress;
+	long pause = 1;
 
 	oldest_jif = jiffies;
 	work->older_than_this = &oldest_jif;
@@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
 		 * mean the overall work is done. So we keep looping as long
 		 * as made some progress on cleaning pages or inodes.
 		 */
-		if (progress)
+		if (progress) {
+			pause = 1;
 			continue;
+		}
 		/*
 		 * No more inodes for IO, bail
 		 */
 		if (list_empty(&wb->b_more_io))
 			break;
 		/*
-		 * Nothing written. Wait for some inode to
-		 * become available for writeback. Otherwise
-		 * we'll just busyloop.
+		 * Nothing written (some internal fs locks were unavailable or
+		 * inode was under writeback from balance_dirty_pages() or
+		 * similar conditions).  Wait for a while to avoid busylooping.
 		 */
-		if (!list_empty(&wb->b_more_io))  {
-			trace_writeback_wait(wb->bdi, work);
-			inode = wb_inode(wb->b_more_io.prev);
-			spin_lock(&inode->i_lock);
-			inode_wait_for_writeback(inode, wb);
-			spin_unlock(&inode->i_lock);
-		}
+		trace_writeback_wait(wb->bdi, work);
+		spin_unlock(&wb->list_lock);
+		__set_current_state(TASK_INTERRUPTIBLE);
+		schedule_timeout(pause);
+		if (pause < HZ / 10)
+			pause <<= 1;
+		spin_lock(&wb->list_lock);
 	}
 	spin_unlock(&wb->list_lock);
 
-- 
1.7.1


[-- Attachment #3: 0002-writeback-Replace-some-redirty_tail-calls-with-reque.patch --]
[-- Type: text/x-patch, Size: 5390 bytes --]

From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:46:42 +0200
Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
some cases and in other cases it is a really bad thing. In particular XFS
tries to be nice to writeback and when ->write_inode is called for an inode
with a locked ilock, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is a common thing with XFS while extending files, the result
can be that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback(), we can
call requeue_io() in cases where a quick retry is required, without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..c786023 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. nfs_writepages()
+			 * sometimes bales out without doing anything or we
+			 * just run out of our writeback slice.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
+			requeue_io(inode, wb);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1



* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:22             ` Jan Kara
@ 2011-10-07 14:29               ` Wu Fengguang
  2011-10-07 14:45                 ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-07 14:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > that he can try them.
> > 
> > Jan, I'd like to test out your updated patches with my stupid dd
> > workloads. Would you (re)send them publicly?
>   Ah, I resent them publicly on Wednesday
> (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> apparently does not include emails from Acked-by into list of recipients so
> you didn't get them. Sorry for that. The patches are attached for your
> convenience.

OK thanks. I only checked the linux-fsdevel list before asking..
The results should be ready tomorrow.

Thanks,
Fengguang

> From a042c2a839ad3cf89d8ee158b2bb4b94b573f578 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 8 Sep 2011 01:05:25 +0200
> Subject: [PATCH 1/2] writeback: Improve busyloop prevention
> 
> Writeback of an inode can be stalled by things like internal fs locks being
> held. So in case we didn't write anything during a pass through b_io list,
> just wait for a moment and try again.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Reviewed-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   26 ++++++++++++++------------
>  1 files changed, 14 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 04cf3b9..bdeb26a 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -699,8 +699,8 @@ static long wb_writeback(struct bdi_writeback *wb,
>  	unsigned long wb_start = jiffies;
>  	long nr_pages = work->nr_pages;
>  	unsigned long oldest_jif;
> -	struct inode *inode;
>  	long progress;
> +	long pause = 1;
>  
>  	oldest_jif = jiffies;
>  	work->older_than_this = &oldest_jif;
> @@ -755,25 +755,27 @@ static long wb_writeback(struct bdi_writeback *wb,
>  		 * mean the overall work is done. So we keep looping as long
>  		 * as made some progress on cleaning pages or inodes.
>  		 */
> -		if (progress)
> +		if (progress) {
> +			pause = 1;
>  			continue;
> +		}
>  		/*
>  		 * No more inodes for IO, bail
>  		 */
>  		if (list_empty(&wb->b_more_io))
>  			break;
>  		/*
> -		 * Nothing written. Wait for some inode to
> -		 * become available for writeback. Otherwise
> -		 * we'll just busyloop.
> +		 * Nothing written (some internal fs locks were unavailable or
> +		 * inode was under writeback from balance_dirty_pages() or
> +		 * similar conditions).  Wait for a while to avoid busylooping.
>  		 */
> -		if (!list_empty(&wb->b_more_io))  {
> -			trace_writeback_wait(wb->bdi, work);
> -			inode = wb_inode(wb->b_more_io.prev);
> -			spin_lock(&inode->i_lock);
> -			inode_wait_for_writeback(inode, wb);
> -			spin_unlock(&inode->i_lock);
> -		}
> +		trace_writeback_wait(wb->bdi, work);
> +		spin_unlock(&wb->list_lock);
> +		__set_current_state(TASK_INTERRUPTIBLE);
> +		schedule_timeout(pause);
> +		if (pause < HZ / 10)
> +			pause <<= 1;
> +		spin_lock(&wb->list_lock);
>  	}
>  	spin_unlock(&wb->list_lock);
>  
> -- 
> 1.7.1
> 

> From 0a4a2cb4d5432f5446215b1e6e44f7d83032dba3 Mon Sep 17 00:00:00 2001
> From: Jan Kara <jack@suse.cz>
> Date: Thu, 8 Sep 2011 01:46:42 +0200
> Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
> 
> Calling redirty_tail() can put off inode writeback for upto 30 seconds (or
> whatever dirty_expire_centisecs is). This is unnecessarily big delay in some
> cases and in other cases it is a really bad thing. In particular XFS tries to
> be nice to writeback and when ->write_inode is called for an inode with locked
> ilock, it just redirties the inode and returns EAGAIN. That currently causes
> writeback_single_inode() to redirty_tail() the inode. As contended ilock is
> common thing with XFS while extending files the result can be that inode
> writeout is put off for a really long time.
> 
> Now that we have more robust busyloop prevention in wb_writeback() we can
> call requeue_io() in cases where quick retry is required without fear of
> raising CPU consumption too much.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
>  1 files changed, 28 insertions(+), 33 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index bdeb26a..c786023 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	long nr_to_write = wbc->nr_to_write;
>  	unsigned dirty;
>  	int ret;
> +	bool inode_written = false;
>  
>  	assert_spin_locked(&wb->list_lock);
>  	assert_spin_locked(&inode->i_lock);
> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}
> @@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	if (!(inode->i_state & I_FREEING)) {
>  		/*
>  		 * Sync livelock prevention. Each inode is tagged and synced in
> -		 * one shot. If still dirty, it will be redirty_tail()'ed below.
> -		 * Update the dirty time to prevent enqueue and sync it again.
> +		 * one shot. If still dirty, update dirty time and put it back
> +		 * to dirty list to prevent enqueue and syncing it again.
>  		 */
>  		if ((inode->i_state & I_DIRTY) &&
> -		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>  			inode->dirtied_when = jiffies;
> -
> -		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +			redirty_tail(inode, wb);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> -			 * We didn't write back all the pages.  nfs_writepages()
> -			 * sometimes bales out without doing anything.
> +			 * We didn't write back all the pages. nfs_writepages()
> +			 * sometimes bales out without doing anything or we
> +			 * just run our of our writeback slice.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
> -			if (wbc->nr_to_write <= 0) {
> -				/*
> -				 * slice used up: queue for next turn
> -				 */
> -				requeue_io(inode, wb);
> -			} else {
> -				/*
> -				 * Writeback blocked by something other than
> -				 * congestion. Delay the inode for some time to
> -				 * avoid spinning on the CPU (100% iowait)
> -				 * retrying writeback of the dirty page/inode
> -				 * that cannot be performed immediately.
> -				 */
> -				redirty_tail(inode, wb);
> -			}
> +			requeue_io(inode, wb);
>  		} else if (inode->i_state & I_DIRTY) {
>  			/*
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);
> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have
> @@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
> @@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!grab_super_passive(sb)) {
> -			/*
> -			 * grab_super_passive() may fail consistently due to
> -			 * s_umount being grabbed by someone else. Don't use
> -			 * requeue_io() to avoid busy retrying the inode/sb.
> -			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);
> -- 
> 1.7.1
> 



* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:29               ` Wu Fengguang
@ 2011-10-07 14:45                 ` Jan Kara
  2011-10-07 15:29                   ` Wu Fengguang
  2011-10-08  4:00                   ` Wu Fengguang
  0 siblings, 2 replies; 28+ messages in thread
From: Jan Kara @ 2011-10-07 14:45 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig

On Fri 07-10-11 22:29:28, Wu Fengguang wrote:
> On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> > On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > > that he can try them.
> > > 
> > > Jan, I'd like to test out your updated patches with my stupid dd
> > > workloads. Would you (re)send them publicly?
> >   Ah, I resent them publicly on Wednesday
> > (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> > apparently does not include emails from Acked-by into list of recipients so
> > you didn't get them. Sorry for that. The patches are attached for your
> > convenience.
> 
> OK thanks. I only checked the linux-fsdevel list before asking..
> The results should be ready tomorrow.
  Thanks for testing!

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:45                 ` Jan Kara
@ 2011-10-07 15:29                   ` Wu Fengguang
  2011-10-08  4:00                   ` Wu Fengguang
  1 sibling, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2011-10-07 15:29 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig

On Fri, Oct 07, 2011 at 10:45:04PM +0800, Jan Kara wrote:
> On Fri 07-10-11 22:29:28, Wu Fengguang wrote:
> > On Fri, Oct 07, 2011 at 10:22:01PM +0800, Jan Kara wrote:
> > > On Fri 07-10-11 21:43:47, Wu Fengguang wrote:
> > > > >   Great, thanks for review! I'll resend the two patches to Christoph so
> > > > > that he can try them.
> > > > 
> > > > Jan, I'd like to test out your updated patches with my stupid dd
> > > > workloads. Would you (re)send them publicly?
> > >   Ah, I resent them publicly on Wednesday
> > > (http://comments.gmane.org/gmane.linux.kernel/1199713) but git send-email
> > > apparently does not include emails from Acked-by into list of recipients so
> > > you didn't get them. Sorry for that. The patches are attached for your
> > > convenience.
> > 
> > OK thanks. I only checked the linux-fsdevel list before asking..
> > The results should be ready tomorrow.
>   Thanks for testing!

You are welcome! Not to mention it's hands down work :)

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-07 14:45                 ` Jan Kara
  2011-10-07 15:29                   ` Wu Fengguang
@ 2011-10-08  4:00                   ` Wu Fengguang
  2011-10-08 11:52                     ` Wu Fengguang
  2011-10-10 11:21                     ` Jan Kara
  1 sibling, 2 replies; 28+ messages in thread
From: Wu Fengguang @ 2011-10-08  4:00 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

Hi Jan,

The test results do not look good: btrfs is heavily impacted and the
other filesystems are slightly impacted.

I'll send you the detailed logs in private emails (too large for the
mailing list). Basically I noticed many writeback_wait traces that
never appear w/o this patch. In the btrfs cases that see larger
regressions, I see large fluctuations in the writeout bandwidth and
long disk idle periods. It's still a bit puzzling how all of this
happens.

      3.1.0-rc8-ioless6+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   59.39       -82.9%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.68       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.92       -80.0%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   38.02        -1.0%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   45.20        +1.7%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   42.50        -0.8%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   47.50        -2.5%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   58.18        -3.0%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   55.79        -2.1%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.89       -19.3%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   58.06        -4.2%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.94        -1.1%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   60.29       -35.9%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   58.80       -33.2%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   58.53       -21.5%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   31.96        -4.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   36.19        -1.0%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   45.03        -2.7%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   51.47        -2.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.19        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.41        -1.0%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   43.44        -8.4%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   49.83        -3.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.70        -0.8%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   57.12       -85.5%         8.27  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   59.29       -84.7%         9.05  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   59.23       -84.9%         8.97  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   33.63        -3.3%        32.51  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   48.30        -4.7%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   46.77        -4.5%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   36.58        -2.2%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   57.35        -0.3%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   52.82        -1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   32.19        -4.5%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.86        -1.4%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   48.96       -33.1%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1810.02       -22.1%      1410.49  TOTAL

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08  4:00                   ` Wu Fengguang
@ 2011-10-08 11:52                     ` Wu Fengguang
  2011-10-08 13:49                       ` Wu Fengguang
  2011-10-10 11:21                     ` Jan Kara
  1 sibling, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-08 11:52 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> Hi Jan,
> 
> The test results look not good: btrfs is heavily impacted and the
> other filesystems are slightly impacted.
> 
> I'll send you the detailed logs in private emails (too large for the
> mailing list). Basically I noticed many writeback_wait traces that
> never appear w/o this patch. In the btrfs cases that see larger
> regressions, I see large fluctuations in the writeout bandwidth and
> long disk idle periods. It's still a bit puzzling how all these
> happen..

Sorry, I find that part of the regressions (about 2-3%) was caused by
a recent change to my test scripts. Here are fairer comparisons,
and they show regressions only in btrfs and xfs:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                  787.25        +0.7%       792.47  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                  428.07        -6.1%       401.99  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
------------------------  ------------------------
                   58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                  344.06       -54.3%       157.24  TOTAL

I'm now bisecting the patches to find out the root cause.

Thanks,
Fengguang


* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08 11:52                     ` Wu Fengguang
@ 2011-10-08 13:49                       ` Wu Fengguang
  2011-10-09  0:27                         ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-08 13:49 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > Hi Jan,
> > 
> > The test results look not good: btrfs is heavily impacted and the
> > other filesystems are slightly impacted.
> > 
> > I'll send you the detailed logs in private emails (too large for the
> > mailing list). Basically I noticed many writeback_wait traces that
> > never appear w/o this patch. In the btrfs cases that see larger
> > regressions, I see large fluctuations in the writeout bandwidth and
> > long disk idle periods. It's still a bit puzzling how all these
> > happen..
> 
> Sorry I find that part of the regressions (about 2-3%) are caused by
> change of my test scripts recently. Here are the more fair compares
> and they show only regressions in btrfs and xfs:
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                   787.25        +0.7%       792.47  TOTAL
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                   428.07        -6.1%       401.99  TOTAL
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                   344.06       -54.3%       157.24  TOTAL
> 
> I'm now bisecting the patches to find out the root cause.

Current findings are that when only the first patch is applied, or the second
patch is reduced to the one below, the btrfs performance is restored:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue2+  
------------------------  ------------------------  
                   58.23        -0.3%        58.06  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -0.4%        58.19  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -0.5%        58.25  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        -0.4%        56.30  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        +0.1%        56.19  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        -0.2%        56.12  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -2.1%        49.36  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -2.2%        56.82  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        +1.6%        42.42  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -1.0%        47.85  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -1.5%        51.57  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                  651.29        -0.8%       646.12  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue3+
------------------------  ------------------------
                   56.55        -3.3%        54.70  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        -0.4%        55.91  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        +0.7%        56.58  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                  168.87        -1.0%       167.20  TOTAL

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 20:51:22.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08 13:49                       ` Wu Fengguang
@ 2011-10-09  0:27                         ` Wu Fengguang
  2011-10-09  8:44                           ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-09  0:27 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > Hi Jan,
> > > 
> > > The test results look not good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > > 
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list). Basically I noticed many writeback_wait traces that
> > > never appear w/o this patch. In the btrfs cases that see larger
> > > regressions, I see large fluctuations in the writeout bandwidth and
> > > long disk idle periods. It's still a bit puzzling how all these
> > > happen..
> > 
> > Sorry I find that part of the regressions (about 2-3%) are caused by
> > change of my test scripts recently. Here are the more fair compares
> > and they show only regressions in btrfs and xfs:
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> >                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> >                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> >                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> >                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> >                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> >                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> >                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> >                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> >                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> >                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> >                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> >                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> >                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> >                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> >                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> >                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> >                   787.25        +0.7%       792.47  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> >                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> >                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> >                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> >                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> >                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> >                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> >                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> >                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> >                   428.07        -6.1%       401.99  TOTAL
> > 
> >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > ------------------------  ------------------------
> >                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> >                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> >                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> >                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> >                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> >                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> >                   344.06       -54.3%       157.24  TOTAL
> > 
> > I'm now bisecting the patches to find out the root cause.
> 
> Current findings are, when only applying the first patch, or reduce the second
> patch to the below one, the btrfs regressions are restored:

And the below reduced patch is also OK:

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue4+
------------------------  ------------------------
                   58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1559.39        -1.4%      1537.83  TOTAL

Subject: writeback: Replace some redirty_tail() calls with requeue_io()
Date: Thu, 8 Sep 2011 01:46:42 +0200

From: Jan Kara <jack@suse.cz>

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in some
cases and in other cases it is a really bad thing. In particular XFS tries to
be nice to writeback and when ->write_inode is called for an inode with locked
ilock, it just redirties the inode and returns EAGAIN. That currently causes
writeback_single_inode() to redirty_tail() the inode. As a contended ilock is a
common thing with XFS while extending files, the result can be that inode
writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |   30 +++++++++++++++++++-----------
 1 file changed, 19 insertions(+), 11 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-10-08 20:49:31.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-08 21:51:00.000000000 +0800
@@ -370,6 +370,7 @@ writeback_single_inode(struct inode *ino
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -434,6 +435,8 @@ writeback_single_inode(struct inode *ino
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -477,9 +480,19 @@ writeback_single_inode(struct inode *ino
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -632,12 +645,7 @@ static long __writeback_inodes_wb(struct
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-09  0:27                         ` Wu Fengguang
@ 2011-10-09  8:44                           ` Wu Fengguang
  0 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2011-10-09  8:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Sun, Oct 09, 2011 at 08:27:36AM +0800, Wu Fengguang wrote:
> On Sat, Oct 08, 2011 at 09:49:27PM +0800, Wu Fengguang wrote:
> > On Sat, Oct 08, 2011 at 07:52:27PM +0800, Wu Fengguang wrote:
> > > On Sat, Oct 08, 2011 at 12:00:36PM +0800, Wu Fengguang wrote:
> > > > Hi Jan,
> > > > 
> > > > The test results look not good: btrfs is heavily impacted and the
> > > > other filesystems are slightly impacted.
> > > > 
> > > > I'll send you the detailed logs in private emails (too large for the
> > > > mailing list). Basically I noticed many writeback_wait traces that
> > > > never appear w/o this patch. In the btrfs cases that see larger
> > > > regressions, I see large fluctuations in the writeout bandwidth and
> > > > long disk idle periods. It's still a bit puzzling how all these
> > > > happen..
> > > 
> > > Sorry I find that part of the regressions (about 2-3%) are caused by
> > > change of my test scripts recently. Here are the more fair compares
> > > and they show only regressions in btrfs and xfs:
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    37.34        +0.8%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
> > >                    44.44        +3.4%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
> > >                    41.70        +1.0%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
> > >                    46.45        -0.3%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
> > >                    56.60        -0.3%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
> > >                    54.14        +0.9%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
> > >                    30.66        -0.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
> > >                    35.24        +1.6%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
> > >                    43.58        +0.5%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
> > >                    50.42        -0.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
> > >                    56.23        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
> > >                    58.12        -0.5%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
> > >                    45.37        +1.4%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
> > >                    43.71        +2.2%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
> > >                    35.58        +0.5%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
> > >                    56.39        +1.4%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
> > >                    51.26        +1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
> > >                   787.25        +0.7%       792.47  TOTAL
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    44.53       -18.6%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
> > >                    55.89        -0.4%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
> > >                    51.11        +0.5%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
> > >                    41.76        -4.8%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
> > >                    48.34        -0.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
> > >                    52.36        -0.2%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
> > >                    31.07        -1.1%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
> > >                    55.44        -0.6%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
> > >                    47.59       -31.2%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
> > >                   428.07        -6.1%       401.99  TOTAL
> > > 
> > >      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue+
> > > ------------------------  ------------------------
> > >                    58.23       -82.6%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
> > >                    58.43       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
> > >                    58.53       -79.9%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
> > >                    56.55       -31.7%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
> > >                    56.11       -30.1%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
> > >                    56.21       -18.3%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
> > >                   344.06       -54.3%       157.24  TOTAL
> > > 
> > > I'm now bisecting the patches to find out the root cause.
> > 
> > Current findings are, when only applying the first patch, or reduce the second
> > patch to the below one, the btrfs regressions are restored:
> 
> And the below reduced patch is also OK:
> 
>      3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue4+
> ------------------------  ------------------------
>                    58.23        -0.4%        57.98  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.43        -2.2%        57.13  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.53        -1.2%        57.83  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    37.34        -0.7%        37.07  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    44.44        +0.2%        44.52  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    41.70        +0.0%        41.72  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    46.45        -0.7%        46.10  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    56.60        -0.8%        56.15  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    54.14        +0.3%        54.33  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    44.53        -7.3%        41.29  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    55.89        +0.9%        56.39  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.11        +1.0%        51.60  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    56.55        -1.0%        55.97  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    56.11        -1.5%        55.28  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    56.21        -1.9%        55.16  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                    30.66        -2.7%        29.82  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    35.24        -0.7%        35.00  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    43.58        -2.1%        42.65  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    50.42        -2.4%        49.21  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.23        -2.2%        55.00  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.12        -1.8%        57.08  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    41.76        -5.1%        39.61  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    48.34        -2.6%        47.06  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.36        -3.3%        50.64  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    45.37        +0.7%        45.70  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    43.71        +0.7%        44.00  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    35.58        +0.7%        35.82  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    56.39        -1.1%        55.77  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    51.26        -0.6%        50.94  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                    31.07       -13.3%        26.94  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.44        +0.5%        55.72  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    47.59        +1.6%        48.33  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                  1559.39        -1.4%      1537.83  TOTAL

Now I got slightly better results with the below incremental patch.

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue5+
------------------------  ------------------------
                   58.23        +0.9%        58.76  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
                   58.43        -0.4%        58.19  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
                   58.53        -0.1%        58.48  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
                   37.34        +1.4%        37.88  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
                   44.44        -0.3%        44.30  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
                   41.70        +1.3%        42.24  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
                   46.45        +0.3%        46.59  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
                   56.60        -0.7%        56.22  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
                   54.14        +0.7%        54.50  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
                   44.53        -2.4%        43.45  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
                   55.89        +0.2%        56.00  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
                   51.11        -0.0%        51.10  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
                   56.55        +0.0%        56.56  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
                   56.11        +0.5%        56.38  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
                   56.21        +0.1%        56.29  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
                   30.66        +0.2%        30.72  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
                   35.24        +0.8%        35.54  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
                   43.58        +0.1%        43.64  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
                   50.42        -0.9%        49.99  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
                   56.23        -0.0%        56.21  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
                   58.12        -1.9%        57.02  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
                   41.76        -5.9%        39.30  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
                   48.34        -1.4%        47.67  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
                   52.36        -1.8%        51.41  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
                   54.37        +0.6%        54.71  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
                   56.11        +1.7%        57.08  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
                   56.22        +0.8%        56.67  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
                   32.21        -0.2%        32.15  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
                   45.37        +0.2%        45.45  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
                   43.71        +0.0%        43.72  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
                   35.58        +0.1%        35.61  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
                   56.39        +1.0%        56.98  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
                   51.26        +0.4%        51.44  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
                   31.07       -11.5%        27.49  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
                   55.44        +0.7%        55.81  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
                   47.59        +1.0%        48.05  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
                 1758.29        -0.3%      1753.58  TOTAL

     3.1.0-rc8-ioless6a+  3.1.0-rc8-ioless6-requeue5+
------------------------  ------------------------
                   62.64        -0.8%        62.13  3G-UKEY-HDD/ext4-1dd-4k-8p-4096M-20:10-X
                   59.80        -0.1%        59.76  3G-UKEY-HDD/ext4-2dd-4k-8p-4096M-20:10-X
                   53.42        -1.1%        52.84  3G-UKEY-HDD/xfs-10dd-4k-8p-4096M-20:10-X
                   59.00        -0.1%        58.92  3G-UKEY-HDD/xfs-1dd-4k-8p-4096M-20:10-X
                   56.14        -0.3%        55.95  3G-UKEY-HDD/xfs-2dd-4k-8p-4096M-20:10-X
                  291.00        -0.5%       289.60  TOTAL

> @@ -597,10 +610,10 @@ static long writeback_sb_inodes(struct s
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);

The above change to requeue_io() is not desirable in the case where the
inode was redirty_tail()ed in writeback_single_inode(). So I'd propose
to just leave the original code as it is, or to remove the code as in the
patch below. Either way, we get slightly better performance, as
demonstrated by the larger bandwidth in 3.1.0-rc8-ioless6-requeue3+
and 3.1.0-rc8-ioless6-requeue5+.

Thanks,
Fengguang
---

--- linux-next.orig/fs/fs-writeback.c	2011-10-09 08:34:38.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-10-09 08:35:06.000000000 +0800
@@ -608,13 +608,6 @@ static long writeback_sb_inodes(struct s
 		wrote += write_chunk - wbc.nr_to_write;
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * Writeback is not making progress due to unavailable
-			 * fs locks or similar condition. Retry in next round.
-			 */
-			requeue_io(inode, wb);
-		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-08  4:00                   ` Wu Fengguang
  2011-10-08 11:52                     ` Wu Fengguang
@ 2011-10-10 11:21                     ` Jan Kara
  2011-10-10 11:31                       ` Wu Fengguang
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-10 11:21 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

  Hi Fengguang,

On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> The test results look not good: btrfs is heavily impacted and the
> other filesystems are slightly impacted.
>
> I'll send you the detailed logs in private emails (too large for the
> mailing list).  Basically I noticed many writeback_wait traces that never
> appear w/o this patch.
  OK, thanks for running these tests. I'll have a look at the detailed logs.
I guess the difference can be caused by the changes in redirty/requeue logic in
the second patch (the changes in the first patch could possibly turn one
writeback_wait event into several but could never introduce new
events).

I guess I'll also try to reproduce the problem since it should be pretty
easy when you see such a huge regression even with 1 dd process on a btrfs
filesystem.

> In the btrfs cases that see larger regressions, I see large fluctuations
> in the writeout bandwidth and long disk idle periods. It's still a bit
> puzzling how all these happen..
  Yes, I don't understand it yet either...

									Honza

> 
>       3.1.0-rc8-ioless6+  3.1.0-rc8-ioless6-requeue+
> ------------------------  ------------------------
>                    59.39       -82.9%        10.13  thresh=100M/btrfs-10dd-4k-8p-4096M-100M:10-X
>                    58.68       -80.3%        11.54  thresh=100M/btrfs-1dd-4k-8p-4096M-100M:10-X
>                    58.92       -80.0%        11.76  thresh=100M/btrfs-2dd-4k-8p-4096M-100M:10-X
>                    38.02        -1.0%        37.65  thresh=100M/ext3-10dd-4k-8p-4096M-100M:10-X
>                    45.20        +1.7%        45.96  thresh=100M/ext3-1dd-4k-8p-4096M-100M:10-X
>                    42.50        -0.8%        42.14  thresh=100M/ext3-2dd-4k-8p-4096M-100M:10-X
>                    47.50        -2.5%        46.32  thresh=100M/ext4-10dd-4k-8p-4096M-100M:10-X
>                    58.18        -3.0%        56.41  thresh=100M/ext4-1dd-4k-8p-4096M-100M:10-X
>                    55.79        -2.1%        54.63  thresh=100M/ext4-2dd-4k-8p-4096M-100M:10-X
>                    44.89       -19.3%        36.23  thresh=100M/xfs-10dd-4k-8p-4096M-100M:10-X
>                    58.06        -4.2%        55.64  thresh=100M/xfs-1dd-4k-8p-4096M-100M:10-X
>                    51.94        -1.1%        51.35  thresh=100M/xfs-2dd-4k-8p-4096M-100M:10-X
>                    60.29       -35.9%        38.63  thresh=1G/btrfs-100dd-4k-8p-4096M-1024M:10-X
>                    58.80       -33.2%        39.25  thresh=1G/btrfs-10dd-4k-8p-4096M-1024M:10-X
>                    58.53       -21.5%        45.93  thresh=1G/btrfs-1dd-4k-8p-4096M-1024M:10-X
>                    31.96        -4.7%        30.44  thresh=1G/ext3-100dd-4k-8p-4096M-1024M:10-X
>                    36.19        -1.0%        35.82  thresh=1G/ext3-10dd-4k-8p-4096M-1024M:10-X
>                    45.03        -2.7%        43.80  thresh=1G/ext3-1dd-4k-8p-4096M-1024M:10-X
>                    51.47        -2.6%        50.14  thresh=1G/ext4-100dd-4k-8p-4096M-1024M:10-X
>                    56.19        -1.0%        55.64  thresh=1G/ext4-10dd-4k-8p-4096M-1024M:10-X
>                    58.41        -1.0%        57.84  thresh=1G/ext4-1dd-4k-8p-4096M-1024M:10-X
>                    43.44        -8.4%        39.77  thresh=1G/xfs-100dd-4k-8p-4096M-1024M:10-X
>                    49.83        -3.3%        48.18  thresh=1G/xfs-10dd-4k-8p-4096M-1024M:10-X
>                    52.70        -0.8%        52.26  thresh=1G/xfs-1dd-4k-8p-4096M-1024M:10-X
>                    57.12       -85.5%         8.27  thresh=8M/btrfs-10dd-4k-8p-4096M-8M:10-X
>                    59.29       -84.7%         9.05  thresh=8M/btrfs-1dd-4k-8p-4096M-8M:10-X
>                    59.23       -84.9%         8.97  thresh=8M/btrfs-2dd-4k-8p-4096M-8M:10-X
>                    33.63        -3.3%        32.51  thresh=8M/ext3-10dd-4k-8p-4096M-8M:10-X
>                    48.30        -4.7%        46.03  thresh=8M/ext3-1dd-4k-8p-4096M-8M:10-X
>                    46.77        -4.5%        44.69  thresh=8M/ext3-2dd-4k-8p-4096M-8M:10-X
>                    36.58        -2.2%        35.77  thresh=8M/ext4-10dd-4k-8p-4096M-8M:10-X
>                    57.35        -0.3%        57.16  thresh=8M/ext4-1dd-4k-8p-4096M-8M:10-X
>                    52.82        -1.5%        52.04  thresh=8M/ext4-2dd-4k-8p-4096M-8M:10-X
>                    32.19        -4.5%        30.74  thresh=8M/xfs-10dd-4k-8p-4096M-8M:10-X
>                    55.86        -1.4%        55.09  thresh=8M/xfs-1dd-4k-8p-4096M-8M:10-X
>                    48.96       -33.1%        32.74  thresh=8M/xfs-2dd-4k-8p-4096M-8M:10-X
>                  1810.02       -22.1%      1410.49  TOTAL
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 11:21                     ` Jan Kara
@ 2011-10-10 11:31                       ` Wu Fengguang
  2011-10-10 23:30                         ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-10 11:31 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > The test results look not good: btrfs is heavily impacted and the
> > other filesystems are slightly impacted.
> >
> > I'll send you the detailed logs in private emails (too large for the
> > mailing list).  Basically I noticed many writeback_wait traces that never
> > appear w/o this patch.
>   OK, thanks for running these tests. I'll have a look at detailed logs.
> I guess the difference can be caused by changes in redirty/requeue logic in
> the second patch (the changes in the first patch could possibly make
> several writeback_wait events from one event but never could introduce new
> events).
> 
> I guess I'll also try to reproduce the problem since it should be pretty
> easy when you see such a huge regression even with 1 dd process on btrfs
> filesystem.
> 
> > In the btrfs cases that see larger regressions, I see large fluctuations
> > in the writeout bandwidth and long disk idle periods. It's still a bit
> > puzzling how all these happen..
>   Yes, I don't understand it yet either...

Jan, it's obviously caused by this chunk, which is not really
necessary for fixing Christoph's problem. So the easy way is to go
ahead without this chunk.

The remaining problem is that the simple dd tests may not be suitable
workloads to demonstrate the patches' usefulness to XFS.

Thanks,
Fengguang
---

                if ((inode->i_state & I_DIRTY) &&
-                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
                        inode->dirtied_when = jiffies;
-
-               if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+                       redirty_tail(inode, wb);
+               } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
                        /*
-                        * We didn't write back all the pages.  nfs_writepages()
-                        * sometimes bales out without doing anything.
+                        * We didn't write back all the pages. nfs_writepages()
+                        * sometimes bales out without doing anything or we
+                        * just run out of our writeback slice.
                         */
                        inode->i_state |= I_DIRTY_PAGES;
-                       if (wbc->nr_to_write <= 0) {
-                               /*
-                                * slice used up: queue for next turn
-                                */
-                               requeue_io(inode, wb);
-                       } else {
-                               /*
-                                * Writeback blocked by something other than
-                                * congestion. Delay the inode for some time to
-                                * avoid spinning on the CPU (100% iowait)
-                                * retrying writeback of the dirty page/inode
-                                * that cannot be performed immediately.
-                                */
-                               redirty_tail(inode, wb);
-                       }
+                       requeue_io(inode, wb);
                } else if (inode->i_state & I_DIRTY) {

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 11:31                       ` Wu Fengguang
@ 2011-10-10 23:30                         ` Jan Kara
  2011-10-11  2:36                           ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-10 23:30 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> >   Hi Fengguang,
> > 
> > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > The test results look not good: btrfs is heavily impacted and the
> > > other filesystems are slightly impacted.
> > >
> > > I'll send you the detailed logs in private emails (too large for the
> > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > appear w/o this patch.
> >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > I guess the difference can be caused by changes in redirty/requeue logic in
> > the second patch (the changes in the first patch could possibly make
> > several writeback_wait events from one event but never could introduce new
> > events).
> > 
> > I guess I'll also try to reproduce the problem since it should be pretty
> > easy when you see such a huge regression even with 1 dd process on btrfs
> > filesystem.
> > 
> > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > puzzling how all these happen..
> >   Yes, I don't understand it yet either...
> 
> Jan, it's obviously caused by this chunk, which is not really
> necessary for fixing Christoph's problem. So the easy way is to go
> ahead without this chunk.
  Yes, thanks a lot for debugging this! I'd still like to understand why
the hunk below is causing such a big problem for btrfs. I was looking into
the traces and all I could find so far was that for some reason the relevant
inode (ino 257) was not getting queued for writeback for a long time (e.g.
20 seconds), which introduced disk idle times and thus bad throughput. But I
don't understand why the inode was not queued for such a long time yet...
Today it's too late but I'll continue with my investigation tomorrow.

> The remaining problems is, the simple dd tests may not be the suitable
> workloads to demonstrate the patches' usefulness to XFS.
  Maybe; hopefully Christoph will tell us whether the patches work for him or
not.

								Honza
> 
> Thanks,
> Fengguang
> ---
> 
>                 if ((inode->i_state & I_DIRTY) &&
> -                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +                   (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>                         inode->dirtied_when = jiffies;
> -
> -               if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +                       redirty_tail(inode, wb);
> +               } else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>                         /*
> -                        * We didn't write back all the pages.  nfs_writepages()
> -                        * sometimes bales out without doing anything.
> +                        * We didn't write back all the pages. nfs_writepages()
> +                        * sometimes bales out without doing anything or we
> +                        * just run our of our writeback slice.
>                          */
>                         inode->i_state |= I_DIRTY_PAGES;
> -                       if (wbc->nr_to_write <= 0) {
> -                               /*
> -                                * slice used up: queue for next turn
> -                                */
> -                               requeue_io(inode, wb);
> -                       } else {
> -                               /*
> -                                * Writeback blocked by something other than
> -                                * congestion. Delay the inode for some time to
> -                                * avoid spinning on the CPU (100% iowait)
> -                                * retrying writeback of the dirty page/inode
> -                                * that cannot be performed immediately.
> -                                */
> -                               redirty_tail(inode, wb);
> -                       }
> +                       requeue_io(inode, wb);
>                 } else if (inode->i_state & I_DIRTY) {
> 
> Thanks,
> Fengguang
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-10 23:30                         ` Jan Kara
@ 2011-10-11  2:36                           ` Wu Fengguang
  2011-10-11 21:53                             ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-11  2:36 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > >   Hi Fengguang,
> > > 
> > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > The test results look not good: btrfs is heavily impacted and the
> > > > other filesystems are slightly impacted.
> > > >
> > > > I'll send you the detailed logs in private emails (too large for the
> > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > appear w/o this patch.
> > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > the second patch (the changes in the first patch could possibly make
> > > several writeback_wait events from one event but never could introduce new
> > > events).
> > > 
> > > I guess I'll also try to reproduce the problem since it should be pretty
> > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > filesystem.
> > > 
> > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > puzzling how all these happen..
> > >   Yes, I don't understand it yet either...
> > 
> > Jan, it's obviously caused by this chunk, which is not really
> > necessary for fixing Christoph's problem. So the easy way is to go
> > ahead without this chunk.
>   Yes, thanks a lot for debugging this! I'd still like to understand why
> the hunk below is causing such a big problem to btrfs. I was looking into
> the traces and all I could find so far was that for some reason relevant
> inode (ino 257) was not getting queued for writeback for a long time (e.g.
> 20 seconds) which introduced disk idle times and thus bad throughput. But I
> don't understand why the inode was not queue for such a long time yet...
> Today it's too late but I'll continue with my investigation tomorrow.

Yeah, I have exactly the same observation and puzzle..

> > The remaining problems is, the simple dd tests may not be the suitable
> > workloads to demonstrate the patches' usefulness to XFS.
>   Maybe, hopefully Christoph will tell use whether patches work for him or
> not.

The explanation could be that there are only negligible differences between
redirty_tail() and requeue_io() for XFS background writeback, because
background writeback simply ignores inode->dirtied_when.
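
For reference, the two queueing helpers being debated look roughly like this
in the fs/fs-writeback.c of that era (a paraphrased sketch from mainline, so
it may differ in detail from the exact tree under test):

/*
 * Park the inode back on the tail of b_dirty; refresh dirtied_when if
 * needed so the list stays sorted by dirty time.  An expired-only
 * (kupdate) pass will not look at it again until it appears expired.
 */
static void redirty_tail(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	if (!list_empty(&wb->b_dirty)) {
		struct inode *tail;

		tail = wb_inode(wb->b_dirty.next);
		if (time_before(inode->dirtied_when, tail->dirtied_when))
			inode->dirtied_when = jiffies;
	}
	list_move(&inode->i_wb_list, &wb->b_dirty);
}

/*
 * Requeue the inode for re-scanning once the b_io list is exhausted;
 * dirtied_when is left untouched.
 */
static void requeue_io(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	list_move(&inode->i_wb_list, &wb->b_more_io);
}

So redirty_tail() can delay an inode by up to dirty_expire_centisecs as far
as kupdate style writeback is concerned, while requeue_io() gets it retried
almost immediately; background writeback, which does not filter on
dirtied_when, sees little difference between the two.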

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-11  2:36                           ` Wu Fengguang
@ 2011-10-11 21:53                             ` Jan Kara
  2011-10-12  2:44                               ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-11 21:53 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

[-- Attachment #1: Type: text/plain, Size: 4339 bytes --]

On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > >   Hi Fengguang,
> > > > 
> > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > other filesystems are slightly impacted.
> > > > >
> > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > appear w/o this patch.
> > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > the second patch (the changes in the first patch could possibly make
> > > > several writeback_wait events from one event but never could introduce new
> > > > events).
> > > > 
> > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > filesystem.
> > > > 
> > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > puzzling how all these happen..
> > > >   Yes, I don't understand it yet either...
> > > 
> > > Jan, it's obviously caused by this chunk, which is not really
> > > necessary for fixing Christoph's problem. So the easy way is to go
> > > ahead without this chunk.
> >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > the hunk below is causing such a big problem to btrfs. I was looking into
> > the traces and all I could find so far was that for some reason relevant
> > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > don't understand why the inode was not queue for such a long time yet...
> > Today it's too late but I'll continue with my investigation tomorrow.
> 
> Yeah, I have exactly the same observation and puzzle..
  OK, I dug more into this and I think I found an explanation. The problem
starts at
   flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
  in the btrfs trace you sent me when we start "kupdate" style writeback
for bdi "btrfs-1". This work then blocks the flusher thread up to the moment:
   flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic
   flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
reason=periodic

  (i.e. for 25 seconds). The reason why this work blocks the flusher thread for
so long is that btrfs has a "btree inode" - essentially an inode holding
filesystem metadata - and btrfs ignores any ->writepages() request for this
inode coming from kupdate style writeback. So we always try to write this
inode, make no progress, requeue the inode (as its mapping is still tagged as
dirty), see that b_more_io is nonempty, so we sleep for a while and then
retry. We do not include inode 257, which holds the real dirty data, in the
writeback because this is kupdate style writeback and inode 257 does not have
a dirty timestamp old enough. This loop would break either after 30s when the
inode with data becomes old enough or - as we see above - at the moment when
btrfs decides to do a transaction commit and cleans the metadata inode by its
own methods. In either case this is far too late...
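
The stall can be modeled outside the kernel with a tiny stand-alone program
(an illustration only, with made-up times and a dirty_expire of 30s; it is
not kernel code and all names are just placeholders):

#include <stdio.h>
#include <stdbool.h>

#define DIRTY_EXPIRE 30	/* seconds, stands in for dirty_expire_centisecs/100 */

struct sim_inode {
	int ino;
	int dirtied_when;	/* simulated time the inode was dirtied */
	bool kupdate_noop;	/* ->writepages() makes no progress under kupdate */
};

/* a kupdate pass only queues inodes whose dirty timestamp is old enough */
static bool expired(const struct sim_inode *i, int now)
{
	return now - i->dirtied_when >= DIRTY_EXPIRE;
}

static bool writeback_one(const struct sim_inode *i, int now)
{
	if (!expired(i, now))
		return false;	/* too young, stays parked on b_dirty */
	if (i->kupdate_noop) {
		printf("t=%2ds: inode %d: ->writepages() bales out, requeued to b_more_io\n",
		       now, i->ino);
		return false;	/* still dirty, no pages written */
	}
	printf("t=%2ds: inode %d written\n", now, i->ino);
	return true;
}

int main(void)
{
	struct sim_inode btree = { 1, 0, true };	/* metadata, never progresses here */
	struct sim_inode data = { 257, 10, false };	/* holds the real dirty data */
	int now;

	for (now = 30; now <= 45; now += 5) {
		bool p1 = writeback_one(&btree, now);
		bool p2 = writeback_one(&data, now);

		if (!p1 && !p2)
			printf("t=%2ds: no progress, b_more_io non-empty -> sleep, retry (disk idle)\n",
			       now);
	}
	return 0;
}

Until inode 257 looks expired, each pass keeps cycling over the metadata inode
without writing anything, which matches the long disk idle periods seen in the
traces.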
 
  So for now I don't see a better alternative than to revert to the old
behavior in writeback_single_inode() as you suggested earlier. That way we
would redirty_tail() inodes which we cannot clean and thus they won't cause
livelocking of the kupdate work. Longer term we might want to be more clever
about switching away from kupdate style writeback to pure background writeback,
but it's not yet clear to me what the logic should be so that we give
enough preference to old inodes...

  New version of the second patch is attached.

								Honza

[-- Attachment #2: 0002-writeback-Replace-some-redirty_tail-calls-with-reque.patch --]
[-- Type: text/x-patch, Size: 5656 bytes --]

From 97fb1a2f4d334787e9dd23ef58ead1cf4e439cc2 Mon Sep 17 00:00:00 2001
From: Jan Kara <jack@suse.cz>
Date: Thu, 8 Sep 2011 01:46:42 +0200
Subject: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily big delay in some
cases and in other cases it is a really bad thing. In particular XFS tries to
be nice to writeback and when ->write_inode is called for an inode with locked
ilock, it just redirties the inode and returns EAGAIN. That currently causes
writeback_single_inode() to redirty_tail() the inode. As a contended ilock is a
common thing with XFS while extending files, the result can be that inode
writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   55 ++++++++++++++++++++++++++++++----------------------
 1 files changed, 32 insertions(+), 23 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..b03878a 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,17 +433,20 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. We may have just
+			 * run out of our writeback slice, or nfs_writepages()
+			 * sometimes bales out without doing anything, or e.g.
+			 * btrfs ignores for_kupdate writeback requests for
+			 * metadata inodes.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
 			if (wbc->nr_to_write <= 0) {
@@ -450,11 +456,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 				requeue_io(inode, wb);
 			} else {
 				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
+				 * Writeback blocked by something. Put inode
+				 * back to dirty list to prevent livelocking of
+				 * writeback.
 				 */
 				redirty_tail(inode, wb);
 			}
@@ -463,9 +467,19 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +632,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-11 21:53                             ` Jan Kara
@ 2011-10-12  2:44                               ` Wu Fengguang
  2011-10-12 19:34                                 ` Jan Kara
  0 siblings, 1 reply; 28+ messages in thread
From: Wu Fengguang @ 2011-10-12  2:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > >   Hi Fengguang,
> > > > > 
> > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > other filesystems are slightly impacted.
> > > > > >
> > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > > appear w/o this patch.
> > > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > the second patch (the changes in the first patch could possibly make
> > > > > several writeback_wait events from one event but never could introduce new
> > > > > events).
> > > > > 
> > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > filesystem.
> > > > > 
> > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > puzzling how all these happen..
> > > > >   Yes, I don't understand it yet either...
> > > > 
> > > > Jan, it's obviously caused by this chunk, which is not really
> > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > ahead without this chunk.
> > >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > the traces and all I could find so far was that for some reason relevant
> > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > don't understand why the inode was not queue for such a long time yet...
> > > Today it's too late but I'll continue with my investigation tomorrow.
> > 
> > Yeah, I have exactly the same observation and puzzle..
>   OK, I dug more into this and I think I found an explanation. The problem
> starts at
>    flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>   in the btrfs trace you sent me when we start "kupdate" style writeback
> for bdi "btrfs-1". This work then blocks flusher thread upto moment:
>    flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
>    flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
> sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> reason=periodic
> 
>   (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> so long is that btrfs has "btree inode" - essentially an inode holding
> filesystem metadata and btrfs ignores any ->writepages() request for this
> inode coming from kupdate style writeback. So we always try to write this
> inode, make no progress, requeue inode (as it has still mapping tagged as
> dirty), see that b_more_io is nonempty so we sleep for a while and then
> retry. We do not include inode 257 with real dirty data into writeback
> because this is kupdate style writeback and inode 257 does not have dirty
> timestamp old enough. This loop would break either after 30s when inode
> with data becomes old enough or - as we see above - at the moment when
> btrfs decided to do transaction commit and cleaned metadata inode by it's
> own methods. In either case this is far too late...

Yes indeed. Good catch! 

The implication of this case is: never put an inode on b_more_io
unless we made some progress on cleaning some of its pages or metadata.

Failing to do so will lead to

- busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")

- blocking the current work (and in turn the other queued works) for a long
  time, where the other pending works may work on a different set of inodes
  or have different criteria for the FS to make progress. The existing
  examples are the for_kupdate test in btrfs and the SYNC vs ASYNC tests in
  general. And I'm planning to send writeback works from the vmscan code to
  write a specific inode...

In this sense, converting the redirty_tail() calls to requeue_io() does not
look like the right direction.

If we change redirty_tail() to the earlier proposed requeue_io_wait(),
all the known problems can be solved nicely.
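
For readers who missed the earlier proposal, a minimal sketch of what such a
helper could look like, assuming a hypothetical b_more_io_wait list on struct
bdi_writeback (the list name and the splice point in queue_io() are
assumptions for illustration, not necessarily the real proposal):

static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
{
	assert_spin_locked(&wb->list_lock);
	/*
	 * Park the inode until the next writeback round instead of
	 * retrying it immediately (requeue_io) or pushing its dirty
	 * time forward (redirty_tail).
	 */
	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
}

queue_io() would then list_splice_init() b_more_io_wait onto b_io when it
refills the queue, so parked inodes get retried on the next round rather
than only after dirty_expire_centisecs.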

>   So for now I don't see a better alternative than to revert to old
> behavior in writeback_single_inode() as you suggested earlier. That way we
> would redirty_tail() inodes which we cannot clean and thus they won't cause
> livelocking of kupdate work.

requeue_io_wait() can equally avoid touching inode->dirtied_when :)

> Longer term we might want to be more clever in
> switching away from kupdate style writeback to pure background writeback
> but it's not yet clear to me what the logic should be so that we give
> enough preference to old inodes...

We'll need to adequately update older_than_this in the wb_writeback()
loop for background work. Then we can make the switch.
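
Roughly, something along these lines inside the wb_writeback() retry loop
(only a sketch; the exact condition and placement are assumptions):

	for (;;) {
		...
		if (work->for_kupdate) {
			oldest_jif = jiffies -
				msecs_to_jiffies(dirty_expire_interval * 10);
		} else if (work->for_background) {
			/* refresh each pass so freshly expired inodes
			 * are picked up by the next queue_io() */
			oldest_jif = jiffies;
		}
		...
	}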

>   New version of the second patch is attached.
> 
> 								Honza

> @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
>  			wrote++;
>  		if (wbc.pages_skipped) {
>  			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> +			 * Writeback is not making progress due to unavailable
> +			 * fs locks or similar condition. Retry in next round.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);

In the case where writeback_single_inode() has just redirty_tail()ed the
inode, it's not good to requeue_io() it here. So I'd suggest keeping the
original code, or removing the if (pages_skipped) {} block entirely.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-12  2:44                               ` Wu Fengguang
@ 2011-10-12 19:34                                 ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2011-10-12 19:34 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel, Dave Chinner, Christoph Hellwig, Chris Mason

On Wed 12-10-11 10:44:36, Wu Fengguang wrote:
> On Wed, Oct 12, 2011 at 05:53:59AM +0800, Jan Kara wrote:
> > On Tue 11-10-11 10:36:38, Wu Fengguang wrote:
> > > On Tue, Oct 11, 2011 at 07:30:07AM +0800, Jan Kara wrote:
> > > > On Mon 10-10-11 19:31:30, Wu Fengguang wrote:
> > > > > On Mon, Oct 10, 2011 at 07:21:33PM +0800, Jan Kara wrote:
> > > > > >   Hi Fengguang,
> > > > > > 
> > > > > > On Sat 08-10-11 12:00:36, Wu Fengguang wrote:
> > > > > > > The test results look not good: btrfs is heavily impacted and the
> > > > > > > other filesystems are slightly impacted.
> > > > > > >
> > > > > > > I'll send you the detailed logs in private emails (too large for the
> > > > > > > mailing list).  Basically I noticed many writeback_wait traces that never
> > > > > > > appear w/o this patch.
> > > > > >   OK, thanks for running these tests. I'll have a look at detailed logs.
> > > > > > I guess the difference can be caused by changes in redirty/requeue logic in
> > > > > > the second patch (the changes in the first patch could possibly make
> > > > > > several writeback_wait events from one event but never could introduce new
> > > > > > events).
> > > > > > 
> > > > > > I guess I'll also try to reproduce the problem since it should be pretty
> > > > > > easy when you see such a huge regression even with 1 dd process on btrfs
> > > > > > filesystem.
> > > > > > 
> > > > > > > In the btrfs cases that see larger regressions, I see large fluctuations
> > > > > > > in the writeout bandwidth and long disk idle periods. It's still a bit
> > > > > > > puzzling how all these happen..
> > > > > >   Yes, I don't understand it yet either...
> > > > > 
> > > > > Jan, it's obviously caused by this chunk, which is not really
> > > > > necessary for fixing Christoph's problem. So the easy way is to go
> > > > > ahead without this chunk.
> > > >   Yes, thanks a lot for debugging this! I'd still like to understand why
> > > > the hunk below is causing such a big problem to btrfs. I was looking into
> > > > the traces and all I could find so far was that for some reason relevant
> > > > inode (ino 257) was not getting queued for writeback for a long time (e.g.
> > > > 20 seconds) which introduced disk idle times and thus bad throughput. But I
> > > > don't understand why the inode was not queue for such a long time yet...
> > > > Today it's too late but I'll continue with my investigation tomorrow.
> > > 
> > > Yeah, I have exactly the same observation and puzzle..
> >   OK, I dug more into this and I think I found an explanation. The problem
> > starts at
> >    flush-btrfs-1-1336  [005]    20.688011: writeback_start: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=23685 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> >   in the btrfs trace you sent me when we start "kupdate" style writeback
> > for bdi "btrfs-1". This work then blocks flusher thread upto moment:
> >    flush-btrfs-1-1336  [007]    45.707479: writeback_start: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> >    flush-btrfs-1-1336  [007]    45.707479: writeback_written: bdi btrfs-1:
> > sb_dev 0:0 nr_pages=18173 sync_mode=0 kupdate=1 range_cyclic=1 background=0
> > reason=periodic
> > 
> >   (i.e. for 25 seconds). The reason why this work blocks flusher thread for
> > so long is that btrfs has "btree inode" - essentially an inode holding
> > filesystem metadata and btrfs ignores any ->writepages() request for this
> > inode coming from kupdate style writeback. So we always try to write this
> > inode, make no progress, requeue inode (as it has still mapping tagged as
> > dirty), see that b_more_io is nonempty so we sleep for a while and then
> > retry. We do not include inode 257 with real dirty data into writeback
> > because this is kupdate style writeback and inode 257 does not have dirty
> > timestamp old enough. This loop would break either after 30s when inode
> > with data becomes old enough or - as we see above - at the moment when
> > btrfs decided to do transaction commit and cleaned metadata inode by it's
> > own methods. In either case this is far too late...
> 
> Yes indeed. Good catch! 
> 
> The implication of this case is, never put an inode to b_more_io
> unless made some progress on cleaning some pages or the metadata.
  Well, it is not so simple. The problem really is a lack of information
propagation from the filesystem to the writeback code. It may well happen
(similar to the XFS case Christoph described) that we failed to write anything
because of lock contention or a similarly temporary reason. In that case,
retrying via b_more_io would be fine. On the other hand, it may be the case
that we never succeed in writing anything from the inode, as btrfs shows, and
finally, it may even be the case that even though we succeed in writing
something, the inode remains dirty, e.g. because it is a metadata inode which
gets dirtied on each filesystem change. In this light, even the current code
which does requeue_io() when nr_to_write <= 0 is in theory livelockable by a
constantly dirty inode which always has enough pages to write. But so far no
one has observed this condition in practice so I wouldn't be too worried...
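
As a purely hypothetical userspace illustration of such a constantly dirty
inode (file name and sizes made up), something like this keeps redirtying the
mapping's pages faster than writeback can clean them:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	const size_t len = 1UL << 30;	/* 1 GiB backing file */
	size_t off;
	char *p;
	int fd = open("dirty-forever", O_RDWR | O_CREAT | O_TRUNC, 0644);

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;
	for (;;)			/* redirty every page, forever */
		for (off = 0; off < len; off += 4096)
			p[off]++;
}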

> Failing to do so will lead to
> 
> - busy looping (which can be fixed by patch 1/2 "writeback: Improve busyloop prevention")
> 
> - block the current work (and in turn the other queued works) for long time,
>   where the other pending works may tend to work on a different set of
>   inodes or have different criteria for the FS to make progress. The
>   existing examples are the for_kupdate test in btrfs and the SYNC vs
>   ASYNC tests in general. And I'm planning to send writeback works
>   from the vmscan code to write a specific inode..
> 
> In this sense, it looks not the right direction to convert the
> redirty_tail() calls to requeue_io().
  Well, there has to be some balance. Surely, giving up on inodes where we
have problems doing writeback easily prevents livelocks, but on the other
hand it introduces the type of problem Christoph saw with XFS - some inodes
can get starved of writeback.

> If we change redirty_tail() to the earlier proposed requeue_io_wait(),
> all the known problems can be solved nicely.
  Right, so I see that using requeue_io_wait() will avoid livelocks, and the
possibility of starvation will be lower than with redirty_tail(). But what
I don't like about your implementation is the uncertainty about when
writeback will be retried (using requeue_io_wait() may effectively mean
putting off writeback of the inode for dirty_writeback_interval = 5 seconds)
and also the more subtle list handling logic. But you inspired me as to how
we could implement the requeue_io_wait() logic without these problems: when
we have a writeback work which cannot make progress, we could check whether
there's other work to do (plus, if we are doing for_kupdate work, we'll check
whether for_background work would be needed) and, if yes, we just finish the
current work and continue with the new work. Then requeue_io() for an inode
which cannot make progress will effectively work like requeue_io_wait(), but
we'll get well-defined retry times and no additional list logic. What do you
think? I'll send updated patches in a moment...
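
A rough sketch of that check in the wb_writeback() retry loop (the exact
condition, the work_list test and the over_bground_thresh() call are
assumptions, just to make the idea concrete):

		if (progress)
			continue;
		if (list_empty(&wb->b_more_io))
			break;
		/*
		 * No progress on this work item. If other work is queued,
		 * or kupdate work should give way to background writeback,
		 * finish this item instead of sleeping with the whole
		 * flusher queue blocked behind it.
		 */
		if (!list_empty(&wb->bdi->work_list) ||
		    (work->for_kupdate && over_bground_thresh()))
			break;
		/* otherwise fall through to the existing backoff sleep */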

> >   So for now I don't see a better alternative than to revert to old
> > behavior in writeback_single_inode() as you suggested earlier. That way we
> > would redirty_tail() inodes which we cannot clean and thus they won't cause
> > livelocking of kupdate work.
> 
> requeue_io_wait() can equally avoid touching inode->dirtied_when :)
> 
> > Longer term we might want to be more clever in
> > switching away from kupdate style writeback to pure background writeback
> > but it's not yet clear to me what the logic should be so that we give
> > enough preference to old inodes...
> 
> We'll need to adequately update older_than_this in the wb_writeback()
> loop for background work. Then we can make the switch.
> 
> > @@ -583,10 +597,10 @@ static long writeback_sb_inodes(struct super_block *sb,
> >  			wrote++;
> >  		if (wbc.pages_skipped) {
> >  			/*
> > -			 * writeback is not making progress due to locked
> > -			 * buffers.  Skip this inode for now.
> > +			 * Writeback is not making progress due to unavailable
> > +			 * fs locks or similar condition. Retry in next round.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			requeue_io(inode, wb);
> >  		}
> >  		spin_unlock(&inode->i_lock);
> >  		spin_unlock(&wb->list_lock);
> 
> In the case writeback_single_inode() just redirty_tail()ed the inode,
> it's not good to requeue_io() it here. So I'd suggest to keep the
> original code, or remove the if(pages_skipped){} block totally.
  Good point. I'll just remove that if. Hmm, that will make pages_skipped
unused, so we can remove it altogether. I can post that as a separate
patch.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
@ 2011-10-13 14:30   ` Wu Fengguang
  0 siblings, 0 replies; 28+ messages in thread
From: Wu Fengguang @ 2011-10-13 14:30 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel, Christoph Hellwig, Dave Chinner

Jan,

This looks fine, too.

If there are no objections, I'd like to push the patches (preferably with
an updated patch 1) to linux-next ASAP.

Thanks,
Fengguang

On Thu, Oct 13, 2011 at 04:57:23AM +0800, Jan Kara wrote:
> Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
> whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
> some cases, and in other cases it is a really bad thing. In particular, XFS
> tries to be nice to writeback: when ->write_inode is called for an inode whose
> ilock is already locked, it just redirties the inode and returns EAGAIN. That
> currently causes writeback_single_inode() to redirty_tail() the inode. As a
> contended ilock is common with XFS while extending files, the result can be
> that inode writeout is put off for a really long time.
> 
> Now that we have more robust busyloop prevention in wb_writeback() we can
> call requeue_io() in cases where quick retry is required without fear of
> raising CPU consumption too much.
> 
> CC: Christoph Hellwig <hch@infradead.org>
> Acked-by: Wu Fengguang <fengguang.wu@intel.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/fs-writeback.c |   56 +++++++++++++++++++++++++++-------------------------
>  1 files changed, 29 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index b619f3a..094afcd 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	long nr_to_write = wbc->nr_to_write;
>  	unsigned dirty;
>  	int ret;
> +	bool inode_written = false;
>  
>  	assert_spin_locked(&wb->list_lock);
>  	assert_spin_locked(&inode->i_lock);
> @@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	/* Don't write the inode if only I_DIRTY_PAGES was set */
>  	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
>  		int err = write_inode(inode, wbc);
> +		if (!err)
> +			inode_written = true;
>  		if (ret == 0)
>  			ret = err;
>  	}
> @@ -430,17 +433,20 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  	if (!(inode->i_state & I_FREEING)) {
>  		/*
>  		 * Sync livelock prevention. Each inode is tagged and synced in
> -		 * one shot. If still dirty, it will be redirty_tail()'ed below.
> -		 * Update the dirty time to prevent enqueue and sync it again.
> +		 * one shot. If still dirty, update dirty time and put it back
> +		 * to dirty list to prevent enqueue and syncing it again.
>  		 */
>  		if ((inode->i_state & I_DIRTY) &&
> -		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
> +		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
>  			inode->dirtied_when = jiffies;
> -
> -		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
> +			redirty_tail(inode, wb);
> +		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
>  			/*
> -			 * We didn't write back all the pages.  nfs_writepages()
> -			 * sometimes bales out without doing anything.
> +			 * We didn't write back all the pages. We may have just
> +			 * run out of our writeback slice, or nfs_writepages()
> +			 * sometimes bales out without doing anything, or e.g.
> +			 * btrfs ignores for_kupdate writeback requests for
> +			 * metadata inodes.
>  			 */
>  			inode->i_state |= I_DIRTY_PAGES;
>  			if (wbc->nr_to_write <= 0) {
> @@ -450,11 +456,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  				requeue_io(inode, wb);
>  			} else {
>  				/*
> -				 * Writeback blocked by something other than
> -				 * congestion. Delay the inode for some time to
> -				 * avoid spinning on the CPU (100% iowait)
> -				 * retrying writeback of the dirty page/inode
> -				 * that cannot be performed immediately.
> +				 * Writeback blocked by something. Put inode
> +				 * back to dirty list to prevent livelocking of
> +				 * writeback.
>  				 */
>  				redirty_tail(inode, wb);
>  			}
> @@ -463,9 +467,19 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
>  			 * Filesystems can dirty the inode during writeback
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
> -			 * completion.
> +			 * completion. Also inode could have been dirtied by
> +			 * some process aggressively touching metadata.
> +			 * Finally, filesystem could just fail to write the
> +			 * inode for some reason. We have to distinguish the
> +			 * last case from the previous ones - in the last case
> +			 * we want to give the inode quick retry, in the
> +			 * other cases we want to put it back to the dirty list
> +			 * to avoid livelocking of writeback.
>  			 */
> -			redirty_tail(inode, wb);
> +			if (inode_written)
> +				redirty_tail(inode, wb);
> +			else
> +				requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have
> @@ -581,13 +595,6 @@ static long writeback_sb_inodes(struct super_block *sb,
>  		wrote += write_chunk - wbc.nr_to_write;
>  		if (!(inode->i_state & I_DIRTY))
>  			wrote++;
> -		if (wbc.pages_skipped) {
> -			/*
> -			 * writeback is not making progress due to locked
> -			 * buffers.  Skip this inode for now.
> -			 */
> -			redirty_tail(inode, wb);
> -		}
>  		spin_unlock(&inode->i_lock);
>  		spin_unlock(&wb->list_lock);
>  		iput(inode);
> @@ -618,12 +625,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
>  		struct super_block *sb = inode->i_sb;
>  
>  		if (!grab_super_passive(sb)) {
> -			/*
> -			 * grab_super_passive() may fail consistently due to
> -			 * s_umount being grabbed by someone else. Don't use
> -			 * requeue_io() to avoid busy retrying the inode/sb.
> -			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  			continue;
>  		}
>  		wrote += writeback_sb_inodes(sb, wb, work);
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
@ 2011-10-12 20:57 ` Jan Kara
  2011-10-13 14:30   ` Wu Fengguang
  0 siblings, 1 reply; 28+ messages in thread
From: Jan Kara @ 2011-10-12 20:57 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: linux-fsdevel, Christoph Hellwig, Dave Chinner, Jan Kara

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
some cases, and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback: when ->write_inode is called for an inode whose
ilock is already locked, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is common with XFS while extending files, the result can be
that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   56 +++++++++++++++++++++++++++-------------------------
 1 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index b619f3a..094afcd 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,17 +433,20 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. We may have just
+			 * run out of our writeback slice, or nfs_writepages()
+			 * sometimes bales out without doing anything, or e.g.
+			 * btrfs ignores for_kupdate writeback requests for
+			 * metadata inodes.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
 			if (wbc->nr_to_write <= 0) {
@@ -450,11 +456,9 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 				requeue_io(inode, wb);
 			} else {
 				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
+				 * Writeback blocked by something. Put inode
+				 * back to dirty list to prevent livelocking of
+				 * writeback.
 				 */
 				redirty_tail(inode, wb);
 			}
@@ -463,9 +467,19 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -581,13 +595,6 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wrote += write_chunk - wbc.nr_to_write;
 		if (!(inode->i_state & I_DIRTY))
 			wrote++;
-		if (wbc.pages_skipped) {
-			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
-			 */
-			redirty_tail(inode, wb);
-		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
 		iput(inode);
@@ -618,12 +625,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io()
  2011-10-05 17:58 [PATCH 0/2] Avoid putting of writeback of inodes for too long (v3) Jan Kara
@ 2011-10-05 17:58 ` Jan Kara
  0 siblings, 0 replies; 28+ messages in thread
From: Jan Kara @ 2011-10-05 17:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: LKML, Jan Kara, Christoph Hellwig

Calling redirty_tail() can put off inode writeback for up to 30 seconds (or
whatever dirty_expire_centisecs is). This is an unnecessarily long delay in
some cases, and in other cases it is a really bad thing. In particular, XFS
tries to be nice to writeback: when ->write_inode is called for an inode whose
ilock is already locked, it just redirties the inode and returns EAGAIN. That
currently causes writeback_single_inode() to redirty_tail() the inode. As a
contended ilock is common with XFS while extending files, the result can be
that inode writeout is put off for a really long time.

Now that we have more robust busyloop prevention in wb_writeback() we can
call requeue_io() in cases where quick retry is required without fear of
raising CPU consumption too much.

CC: Christoph Hellwig <hch@infradead.org>
Acked-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/fs-writeback.c |   61 ++++++++++++++++++++++++----------------------------
 1 files changed, 28 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index bdeb26a..c786023 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -356,6 +356,7 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	long nr_to_write = wbc->nr_to_write;
 	unsigned dirty;
 	int ret;
+	bool inode_written = false;
 
 	assert_spin_locked(&wb->list_lock);
 	assert_spin_locked(&inode->i_lock);
@@ -420,6 +421,8 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	/* Don't write the inode if only I_DIRTY_PAGES was set */
 	if (dirty & (I_DIRTY_SYNC | I_DIRTY_DATASYNC)) {
 		int err = write_inode(inode, wbc);
+		if (!err)
+			inode_written = true;
 		if (ret == 0)
 			ret = err;
 	}
@@ -430,42 +433,39 @@ writeback_single_inode(struct inode *inode, struct bdi_writeback *wb,
 	if (!(inode->i_state & I_FREEING)) {
 		/*
 		 * Sync livelock prevention. Each inode is tagged and synced in
-		 * one shot. If still dirty, it will be redirty_tail()'ed below.
-		 * Update the dirty time to prevent enqueue and sync it again.
+		 * one shot. If still dirty, update dirty time and put it back
+		 * to dirty list to prevent enqueue and syncing it again.
 		 */
 		if ((inode->i_state & I_DIRTY) &&
-		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages))
+		    (wbc->sync_mode == WB_SYNC_ALL || wbc->tagged_writepages)) {
 			inode->dirtied_when = jiffies;
-
-		if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
+			redirty_tail(inode, wb);
+		} else if (mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
 			/*
-			 * We didn't write back all the pages.  nfs_writepages()
-			 * sometimes bales out without doing anything.
+			 * We didn't write back all the pages. nfs_writepages()
+			 * sometimes bales out without doing anything or we
+			 * just run out of our writeback slice.
 			 */
 			inode->i_state |= I_DIRTY_PAGES;
-			if (wbc->nr_to_write <= 0) {
-				/*
-				 * slice used up: queue for next turn
-				 */
-				requeue_io(inode, wb);
-			} else {
-				/*
-				 * Writeback blocked by something other than
-				 * congestion. Delay the inode for some time to
-				 * avoid spinning on the CPU (100% iowait)
-				 * retrying writeback of the dirty page/inode
-				 * that cannot be performed immediately.
-				 */
-				redirty_tail(inode, wb);
-			}
+			requeue_io(inode, wb);
 		} else if (inode->i_state & I_DIRTY) {
 			/*
 			 * Filesystems can dirty the inode during writeback
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
-			 * completion.
+			 * completion. Also inode could have been dirtied by
+			 * some process aggressively touching metadata.
+			 * Finally, filesystem could just fail to write the
+			 * inode for some reason. We have to distinguish the
+			 * last case from the previous ones - in the last case
+			 * we want to give the inode quick retry, in the
+			 * other cases we want to put it back to the dirty list
+			 * to avoid livelocking of writeback.
 			 */
-			redirty_tail(inode, wb);
+			if (inode_written)
+				redirty_tail(inode, wb);
+			else
+				requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have
@@ -583,10 +583,10 @@ static long writeback_sb_inodes(struct super_block *sb,
 			wrote++;
 		if (wbc.pages_skipped) {
 			/*
-			 * writeback is not making progress due to locked
-			 * buffers.  Skip this inode for now.
+			 * Writeback is not making progress due to unavailable
+			 * fs locks or similar condition. Retry in next round.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		}
 		spin_unlock(&inode->i_lock);
 		spin_unlock(&wb->list_lock);
@@ -618,12 +618,7 @@ static long __writeback_inodes_wb(struct bdi_writeback *wb,
 		struct super_block *sb = inode->i_sb;
 
 		if (!grab_super_passive(sb)) {
-			/*
-			 * grab_super_passive() may fail consistently due to
-			 * s_umount being grabbed by someone else. Don't use
-			 * requeue_io() to avoid busy retrying the inode/sb.
-			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 			continue;
 		}
 		wrote += writeback_sb_inodes(sb, wb, work);
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2011-10-13 14:30 UTC | newest]

Thread overview: 28+ messages
2011-09-08  0:44 [PATCH 1/2] writeback: Improve busyloop prevention Jan Kara
2011-09-08  0:44 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-09-08  1:22   ` Wu Fengguang
2011-09-08 15:03     ` Jan Kara
2011-09-18 14:07       ` Wu Fengguang
2011-10-05 17:39         ` Jan Kara
2011-10-07 13:43           ` Wu Fengguang
2011-10-07 14:22             ` Jan Kara
2011-10-07 14:29               ` Wu Fengguang
2011-10-07 14:45                 ` Jan Kara
2011-10-07 15:29                   ` Wu Fengguang
2011-10-08  4:00                   ` Wu Fengguang
2011-10-08 11:52                     ` Wu Fengguang
2011-10-08 13:49                       ` Wu Fengguang
2011-10-09  0:27                         ` Wu Fengguang
2011-10-09  8:44                           ` Wu Fengguang
2011-10-10 11:21                     ` Jan Kara
2011-10-10 11:31                       ` Wu Fengguang
2011-10-10 23:30                         ` Jan Kara
2011-10-11  2:36                           ` Wu Fengguang
2011-10-11 21:53                             ` Jan Kara
2011-10-12  2:44                               ` Wu Fengguang
2011-10-12 19:34                                 ` Jan Kara
2011-09-08  0:57 ` [PATCH 1/2] writeback: Improve busyloop prevention Wu Fengguang
2011-09-08 13:49   ` Jan Kara
2011-10-05 17:58 [PATCH 0/2] Avoid putting of writeback of inodes for too long (v3) Jan Kara
2011-10-05 17:58 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-12 20:57 [PATCH 0/2 v4] writeback: Improve busyloop prevention and inode requeueing Jan Kara
2011-10-12 20:57 ` [PATCH 2/2] writeback: Replace some redirty_tail() calls with requeue_io() Jan Kara
2011-10-13 14:30   ` Wu Fengguang
