* [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Christoph Hellwig @ 2011-08-27  6:14 UTC
  To: Wu Fengguang; +Cc: linux-fsdevel, xfs

Right now ->write_inode has no way to safely return an EAGAIN without explicitly
redirtying the inode, as we would lose the dirty state otherwise.  Most
filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
the flusher thread when ->write_inode hits contended inode locks.  A
contended ilock is something XFS can hit very easily when extending files, as
the data I/O completion handler takes the lock to update the size, and the
->write_inode call can race with it fairly easily when writing enough data
in one go so that the completion for the first write comes in just before
we call ->write_inode.

Change the handling of this case to use requeue_io for a quick retry instead
of redirty_tail, which keeps moving the dirtied_when timestamp forward and thus
keeps delaying the writeout more and more with every failed attempt to get the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
+++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
@@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
 			 * completion.
+			 *
+			 * For the latter case it is very important to give
+			 * the inode another turn on b_more_io instead of
+			 * redirtying it.  Constantly moving dirtied_when
+			 * forward will prevent us from ever writing out
+			 * the metadata dirtied in the I/O completion handler.
+			 *
+			 * For files on XFS that constantly get appended to
+			 * calling redirty_tail means they will never get
+			 * their updated i_size written out.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean.  At this point we either have


* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Wu Fengguang @ 2011-08-27 13:58 UTC
  To: Christoph Hellwig; +Cc: linux-fsdevel, xfs, Jan Kara, Dave Chinner

[-- Attachment #1: Type: text/plain, Size: 2805 bytes --]

Christoph,

On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> redirtying the inode, as we would lose the dirty state otherwise.  Most
> filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> the flusher thread when ->write_inode hits contended inode locks.  A
> contended ilock is something XFS can hit very easily when extending files, as
> the data I/O completion handler takes the lock to update the size, and the
> ->write_inode call can race with it fairly easily when writing enough data
> in one go so that the completion for the first write comes in just before
> we call ->write_inode.
> 
> Change the handling of this case to use requeue_io for a quick retry instead
> of redirty_tail, which keeps moving the dirtied_when timestamp forward and thus
> keeps delaying the writeout more and more with every failed attempt to get the lock.

Yeah, redirty_tail() does have the problem of possibly delaying inodes
for too long. However, requeue_io() always has the danger of
triggering busy wb_writeback() loops in corner cases.

For example, nfs_write_inode()/nfs_commit_unstable_pages() often
redirty the inode without doing anything (hence making no progress, a
prerequisite for busy loops) depending on the in-flight writes, which
unfortunately further depend on _external_ network/server states.
That means a stalled network or server state could lead to busy loops
in NFS clients.

The alternative solution may be to first apply the attached patch,
and change this one to:

  -			redirty_tail(inode, wb);
  +			requeue_io_wait(inode, wb);

Thanks,
Fengguang

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
> +++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
> @@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
>  			 * completion.
> +			 *
> +			 * For the latter case it is very important to give
> +			 * the inode another turn on b_more_io instead of
> +			 * redirtying it.  Constantly moving dirtied_when
> +			 * forward will prevent us from ever writing out
> +			 * the metadata dirtied in the I/O completion handler.
> +			 *
> +			 * For files on XFS that constantly get appended to
> +			 * calling redirty_tail means they will never get
> +			 * their updated i_size written out.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean.  At this point we either have

[-- Attachment #2: writeback-more_io_wait.patch --]
[-- Type: text/x-diff, Size: 5015 bytes --]

Subject: writeback: introduce queue b_more_io_wait
Date: Sun Jul 31 18:44:44 CST 2011

The problem is that redirty_tail() may update dirtied_when, resulting in
a delay of up to 30s. If redirty_tail() is called often enough, some
inodes may even be delayed forever.

So introduce the b_more_io_wait queue to park inodes that for some
reason cannot be synced immediately. The inodes will be sent to b_io at
the next b_io refill time, but won't be busy-retried the way b_more_io is.

The new data flow after converting all redirty_tail() calls to
requeue_io_wait():

b_dirty --> b_io --> b_more_io/b_more_io_wait --+
             ^                                  |
             |                                  |
             +----------------------------------+

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   10 ++++++++++
 include/linux/backing-dev.h |    8 +++++---
 mm/backing-dev.c            |   10 ++++++++--
 3 files changed, 23 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
@@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 	trace_writeback_queue_io(wb, older_than_this, moved);
 }
--- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
@@ -59,6 +59,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty)	||
+	       !list_empty(&wb->b_io)		||
+	       !list_empty(&wb->b_more_io)	||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_more_io_wait:     %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 		spin_unlock(&bdi->wb.list_lock);
 		spin_unlock(&dst->list_lock);
 	}


* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Jan Kara @ 2011-09-03  1:13 UTC
  To: Wu Fengguang
  Cc: Christoph Hellwig, linux-fsdevel, xfs, Jan Kara, Dave Chinner

On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> Christoph,
> 
> On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> > redirtying the inode, as we would lose the dirty state otherwise.  Most
> > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> > the flusher thread when ->write_inode hits contended inode locks.  A
> > contended ilock is something XFS can hit very easily when extending files, as
> > the data I/O completion handler takes the lock to update the size, and the
> > ->write_inode call can race with it fairly easily when writing enough data
> > in one go so that the completion for the first write comes in just before
> > we call ->write_inode.
> > 
> > Change the handling of this case to use requeue_io for a quick retry instead
> > of redirty_tail, which keeps moving the dirtied_when timestamp forward and thus
> > keeps delaying the writeout more and more with every failed attempt to get the lock.
> 
> Yeah, redirty_tail() does have the problem of possibly delaying inodes
> for too long. However, requeue_io() always has the danger of
> triggering busy wb_writeback() loops in corner cases.
> 
> For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> redirty the inode without doing anything (hence making no progress, a
> prerequisite for busy loops) depending on the in-flight writes, which
> unfortunately further depend on _external_ network/server states.
> That means a stalled network or server state could lead to busy loops
> in NFS clients.
> 
> The alternative solution may be to first apply the attached patch,
> and change this one to:
> 
>   -			redirty_tail(inode, wb);
>   +			requeue_io_wait(inode, wb);
  But your patch doesn't solve the busyloop when the problematic inodes are
the only ones under writeback, does it? Then b_more_io and b_more_io_wait
are effectively the same, if I understand it right.

I think busylooping in cases like these could be fixed by improving the
busyloop prevention at the end of the loop in wb_writeback(). Maybe we
could just take a short nap before continuing with writeback instead of /
in addition to waiting for inode writeback. What do you think?

								Honza

> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > Index: linux-2.6/fs/fs-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
> > +++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
> > @@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
> >  			 * operations, such as delayed allocation during
> >  			 * submission or metadata updates after data IO
> >  			 * completion.
> > +			 *
> > +			 * For the latter case it is very important to give
> > +			 * the inode another turn on b_more_io instead of
> > +			 * redirtying it.  Constantly moving dirtied_when
> > +			 * forward will prevent us from ever writing out
> > +			 * the metadata dirtied in the I/O completion handler.
> > +			 *
> > +			 * For files on XFS that constantly get appended to
> > +			 * calling redirty_tail means they will never get
> > +			 * their updated i_size written out.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			requeue_io(inode, wb);
> >  		} else {
> >  			/*
> >  			 * The inode is clean.  At this point we either have

> Subject: writeback: introduce queue b_more_io_wait
> Date: Sun Jul 31 18:44:44 CST 2011
> 
> The problem is that redirty_tail() may update dirtied_when, resulting in
> a delay of up to 30s. If redirty_tail() is called often enough, some
> inodes may even be delayed forever.
> 
> So introduce the b_more_io_wait queue to park inodes that for some
> reason cannot be synced immediately. The inodes will be sent to b_io at
> the next b_io refill time, but won't be busy-retried the way b_more_io is.
> 
> The new data flow after converting all redirty_tail() calls to
> requeue_io_wait():
> 
> b_dirty --> b_io --> b_more_io/b_more_io_wait --+
>              ^                                  |
>              |                                  |
>              +----------------------------------+
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   10 ++++++++++
>  include/linux/backing-dev.h |    8 +++++---
>  mm/backing-dev.c            |   10 ++++++++--
>  3 files changed, 23 insertions(+), 5 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
> @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
>  	list_move(&inode->i_wb_list, &wb->b_more_io);
>  }
>  
> +/*
> + * The inode should be retried in an opportunistic way.
> + */
> +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
> +{
> +	assert_spin_locked(&wb->list_lock);
> +	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
> +}
> +
>  static void inode_sync_complete(struct inode *inode)
>  {
>  	/*
> @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
>  	int moved;
>  	assert_spin_locked(&wb->list_lock);
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> +	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
>  	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  	trace_writeback_queue_io(wb, older_than_this, moved);
>  }
> --- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
> @@ -59,6 +59,7 @@ struct bdi_writeback {
>  	struct list_head b_dirty;	/* dirty inodes */
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
> +	struct list_head b_more_io_wait;/* opportunistic retry io */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
>  
>  static inline int wb_has_dirty_io(struct bdi_writeback *wb)
>  {
> -	return !list_empty(&wb->b_dirty) ||
> -	       !list_empty(&wb->b_io) ||
> -	       !list_empty(&wb->b_more_io);
> +	return !list_empty(&wb->b_dirty)	||
> +	       !list_empty(&wb->b_io)		||
> +	       !list_empty(&wb->b_more_io)	||
> +	       !list_empty(&wb->b_more_io_wait);
>  }
>  
>  static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> --- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
> @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long nr_dirty, nr_io, nr_more_io;
> +	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
>  	struct inode *inode;
>  
> -	nr_dirty = nr_io = nr_more_io = 0;
> +	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
>  	spin_lock(&wb->list_lock);
>  	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
>  		nr_dirty++;
> @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
>  		nr_io++;
>  	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
>  		nr_more_io++;
> +	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
> +		nr_more_io_wait++;
>  	spin_unlock(&wb->list_lock);
>  
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
> @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
>  		   "b_dirty:            %10lu\n"
>  		   "b_io:               %10lu\n"
>  		   "b_more_io:          %10lu\n"
> +		   "b_more_io_wait:     %10lu\n"
>  		   "bdi_list:           %10u\n"
>  		   "state:              %10lx\n",
>  		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
>  		   nr_dirty,
>  		   nr_io,
>  		   nr_more_io,
> +		   nr_more_io_wait,
>  		   !list_empty(&bdi->bdi_list), bdi->state);
>  #undef K
>  
> @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_more_io_wait);
>  	spin_lock_init(&wb->list_lock);
>  	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
>  }
> @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
>  		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
>  		list_splice(&bdi->wb.b_io, &dst->b_io);
>  		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> +		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
>  		spin_unlock(&bdi->wb.list_lock);
>  		spin_unlock(&dst->list_lock);
>  	}

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread


* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-03  1:13     ` Jan Kara
@ 2011-09-03 21:35       ` Wu Fengguang
  -1 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2011-09-03 21:35 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner

On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote:
> On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> > Christoph,
> > 
> > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > > Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> > > redirtying the inode, as we would lose the dirty state otherwise.  Most
> > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> > > the flusher thread when ->write_inode hits contended inode locks.  A
> > > contended ilock is something XFS can hit very easily when extending files, as
> > > the data I/O completion handler takes the lock to update the size, and the
> > > ->write_inode call can race with it fairly easily if writing enough data
> > > in one go so that the completion for the first write comes in just before
> > > we call ->write_inode.
> > > 
> > > Change the handling of this case to use requeue_io for a quick retry instead
> > > of redirty_tail, which keeps moving the dirtied_when timestamp forward and
> > > thus keeps delaying the writeout more and more with every failed attempt
> > > to get the lock.
> > 
> > Yeah redirty_tail() does have the problem of possibly delaying inodes
> > for too long. However, you know requeue_io() always has the danger of
> > triggering busy wb_writeback() loops in corner cases.
> > 
> > For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> > redirty the inode without doing anything (hence making no progress, a
> > prerequisite for busy loops) depending on the in-flight writes, which
> > unfortunately further depend on _external_ network/server states.
> > That means a stalled network/server state could lead to busy loops
> > in NFS clients.
> > 
> > An alternative solution would be to first apply the attached patch,
> > and change this one to:
> > 
> >   -			redirty_tail(inode, wb);
> >   +			requeue_io_wait(inode, wb);
>   But your patch doesn't solve the busyloop when the problematic inodes are
> the only ones under writeback, does it? Then b_more_io and b_more_io_wait
> are effectively the same if I understand it right.

The difference lies in the

                /*
                 * No more inodes for IO, bail
                 */
                if (list_empty(&wb->b_more_io))
                        break;

check in wb_writeback(). So when all that is left are b_more_io_wait
inodes, the above check takes effect and breaks the loop. This is the
tricky point of the patch: it relies on code that the patch does not
touch. I've updated the changelog to explain this.

> I think that busylooping in cases like these could be fixed by improving the
> busyloop prevention at the end of the loop in wb_writeback(). Maybe we
> could just take a short nap before continuing with writeback instead of /
> in addition to waiting for inode writeback. What do you think?

That's a reasonably robust option, though at the cost of keeping the
writeback code in some ambiguous state ;)

Thanks,
Fengguang
---
Subject: writeback: introduce queue b_more_io_wait
Date: Sun Jul 31 18:44:44 CST 2011

The problem is that redirty_tail() may update i_dirtied_when and result
in a delay of up to 30s. If redirty_tail() is called often enough, some
inodes may even be delayed forever.

So introduce the b_more_io_wait queue to park inodes that for some
reason cannot be synced immediately. The inodes will be sent to b_io at
the next b_io refill time, but won't be busy-retried the way b_more_io
is: when the redirtied inodes are all in b_more_io_wait, wb_writeback()
will see an empty b_more_io and hence break out of the loop.

The new data flow after converting all redirty_tail() calls to
requeue_io_wait():

b_dirty --> b_io --> b_more_io/b_more_io_wait --+
             ^                                  |
             |                                  |
             +----------------------------------+

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   10 ++++++++++
 include/linux/backing-dev.h |    8 +++++---
 mm/backing-dev.c            |   10 ++++++++--
 3 files changed, 23 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
@@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 	trace_writeback_queue_io(wb, older_than_this, moved);
 }
--- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
@@ -59,6 +59,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty)	||
+	       !list_empty(&wb->b_io)		||
+	       !list_empty(&wb->b_more_io)	||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:            %10lu\n"
 		   "b_io:               %10lu\n"
 		   "b_more_io:          %10lu\n"
+		   "b_more_io_wait:     %10lu\n"
 		   "bdi_list:           %10u\n"
 		   "state:              %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
 		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 		list_splice(&bdi->wb.b_io, &dst->b_io);
 		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 		spin_unlock(&bdi->wb.list_lock);
 		spin_unlock(&dst->list_lock);
 	}



* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-03 21:35       ` Wu Fengguang
@ 2011-09-05 11:11         ` Jan Kara
  -1 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2011-09-05 11:11 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner

On Sun 04-09-11 05:35:27, Wu Fengguang wrote:
> On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote:
> > On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> > > Christoph,
> > > 
> > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > > > Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> > > > redirtying the inode, as we would lose the dirty state otherwise.  Most
> > > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> > > > the flusher thread when ->write_inode hits contended inode locks.  A
> > > > contended ilock is something XFS can hit very easily when extending files, as
> > > > the data I/O completion handler takes the lock to update the size, and the
> > > > ->write_inode call can race with it fairly easily if writing enough data
> > > > in one go so that the completion for the first write comes in just before
> > > > we call ->write_inode.
> > > > 
> > > > Change the handling of this case to use requeue_io for a quick retry instead
> > > > of redirty_tail, which keeps moving the dirtied_when timestamp forward and
> > > > thus keeps delaying the writeout more and more with every failed attempt
> > > > to get the lock.
> > > 
> > > Yeah redirty_tail() does have the problem of possibly delaying inodes
> > > for too long. However, you know requeue_io() always has the danger of
> > > triggering busy wb_writeback() loops in corner cases.
> > > 
> > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> > > redirty the inode without doing anything (hence making no progress, a
> > > prerequisite for busy loops) depending on the in-flight writes, which
> > > unfortunately further depend on _external_ network/server states.
> > > That means a stalled network/server state could lead to busy loops
> > > in NFS clients.
> > > 
> > > An alternative solution would be to first apply the attached patch,
> > > and change this one to:
> > > 
> > >   -			redirty_tail(inode, wb);
> > >   +			requeue_io_wait(inode, wb);
> >   But your patch doesn't solve the busyloop when the problematic inodes are
> > the only ones under writeback, does it? Then b_more_io and b_more_io_wait
> > are effectively the same if I understand it right.
> 
> The difference lies in the
> 
>                 /*
>                  * No more inodes for IO, bail
>                  */
>                 if (list_empty(&wb->b_more_io))
>                         break;
> 
> check in wb_writeback(). So when all that is left are b_more_io_wait
> inodes, the above check takes effect and breaks the loop. This is the
> tricky point of the patch: it relies on code that the patch does not
> touch. I've updated the changelog to explain this.
  Argh, that's really subtle! Two points here.
1) We will immediately retry inodes in the b_more_io_wait list because of the
	if (progress)
		continue;
check.

2) The time the writeout will be delayed is at most dirty_expire_centisecs,
but it can be arbitrarily small if someone submits more work. Also, if
wb_writeback() was called from wb_do_writeback(), we will retry the
b_more_io_wait inodes twice immediately because of the
wb_check_old_data_flush() and wb_check_background_flush() calls.

> > I think that busylooping in cases like these could be fixed by improving the
> > busyloop prevention at the end of the loop in wb_writeback(). Maybe we
> > could just take a short nap before continuing with writeback instead of /
> > in addition to waiting for inode writeback. What do you think?
> 
> That's a reasonably robust option, though at the cost of keeping the
> writeback code in some ambiguous state ;)
  What exactly do you mean by an ambiguous state? I don't see anything
ambiguous in waiting for a jiffy or so. Not that I'd be completely happy
about "just wait for a while and see if things are better", but your
solution does not seem ideal either...

								Honza

> ---
> Subject: writeback: introduce queue b_more_io_wait
> Date: Sun Jul 31 18:44:44 CST 2011
> 
> The problem is that redirty_tail() may update i_dirtied_when and result
> in a delay of up to 30s. If redirty_tail() is called often enough, some
> inodes may even be delayed forever.
> 
> So introduce the b_more_io_wait queue to park inodes that for some
> reason cannot be synced immediately. The inodes will be sent to b_io at
> the next b_io refill time, but won't be busy-retried the way b_more_io
> is: when the redirtied inodes are all in b_more_io_wait, wb_writeback()
> will see an empty b_more_io and hence break out of the loop.
> 
> The new data flow after converting all redirty_tail() calls to
> requeue_io_wait():
> 
> b_dirty --> b_io --> b_more_io/b_more_io_wait --+
>              ^                                  |
>              |                                  |
>              +----------------------------------+
> 
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   10 ++++++++++
>  include/linux/backing-dev.h |    8 +++++---
>  mm/backing-dev.c            |   10 ++++++++--
>  3 files changed, 23 insertions(+), 5 deletions(-)
> 
> --- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
> @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
>  	list_move(&inode->i_wb_list, &wb->b_more_io);
>  }
>  
> +/*
> + * The inode should be retried in an opportunistic way.
> + */
> +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
> +{
> +	assert_spin_locked(&wb->list_lock);
> +	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
> +}
> +
>  static void inode_sync_complete(struct inode *inode)
>  {
>  	/*
> @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
>  	int moved;
>  	assert_spin_locked(&wb->list_lock);
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> +	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
>  	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  	trace_writeback_queue_io(wb, older_than_this, moved);
>  }
> --- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
> @@ -59,6 +59,7 @@ struct bdi_writeback {
>  	struct list_head b_dirty;	/* dirty inodes */
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
> +	struct list_head b_more_io_wait;/* opportunistic retry io */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
>  
> @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
>  
>  static inline int wb_has_dirty_io(struct bdi_writeback *wb)
>  {
> -	return !list_empty(&wb->b_dirty) ||
> -	       !list_empty(&wb->b_io) ||
> -	       !list_empty(&wb->b_more_io);
> +	return !list_empty(&wb->b_dirty)	||
> +	       !list_empty(&wb->b_io)		||
> +	       !list_empty(&wb->b_more_io)	||
> +	       !list_empty(&wb->b_more_io_wait);
>  }
>  
>  static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> --- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
> @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long nr_dirty, nr_io, nr_more_io;
> +	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
>  	struct inode *inode;
>  
> -	nr_dirty = nr_io = nr_more_io = 0;
> +	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
>  	spin_lock(&wb->list_lock);
>  	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
>  		nr_dirty++;
> @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
>  		nr_io++;
>  	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
>  		nr_more_io++;
> +	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
> +		nr_more_io_wait++;
>  	spin_unlock(&wb->list_lock);
>  
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
> @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
>  		   "b_dirty:            %10lu\n"
>  		   "b_io:               %10lu\n"
>  		   "b_more_io:          %10lu\n"
> +		   "b_more_io_wait:     %10lu\n"
>  		   "bdi_list:           %10u\n"
>  		   "state:              %10lx\n",
>  		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
>  		   nr_dirty,
>  		   nr_io,
>  		   nr_more_io,
> +		   nr_more_io_wait,
>  		   !list_empty(&bdi->bdi_list), bdi->state);
>  #undef K
>  
> @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_more_io_wait);
>  	spin_lock_init(&wb->list_lock);
>  	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
>  }
> @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
>  		list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
>  		list_splice(&bdi->wb.b_io, &dst->b_io);
>  		list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> +		list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
>  		spin_unlock(&bdi->wb.list_lock);
>  		spin_unlock(&dst->list_lock);
>  	}
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-05 11:11         ` Jan Kara
@ 2011-09-05 13:22           ` Wu Fengguang
  -1 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2011-09-05 13:22 UTC (permalink / raw)
  To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner

On Mon, Sep 05, 2011 at 07:11:53PM +0800, Jan Kara wrote:
> On Sun 04-09-11 05:35:27, Wu Fengguang wrote:
> > On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote:
> > > On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> > > > Christoph,
> > > > 
> > > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > > > > Right now ->write_inode has no way to safely return an EAGAIN without
> > > > > explicitly redirtying the inode, as we would lose the dirty state
> > > > > otherwise.  Most filesystems get this wrong, but XFS makes heavy use
> > > > > of it to avoid blocking the flusher thread when ->write_inode hits a
> > > > > contended inode lock.  A contended ilock is something XFS can hit very
> > > > > easily when extending files, as the data I/O completion handler takes
> > > > > the lock to update the size, and the ->write_inode call can race with
> > > > > it fairly easily if enough data is written in one go that the
> > > > > completion for the first write comes in just before we call
> > > > > ->write_inode.
> > > > > 
> > > > > Change the handling of this case to use requeue_io for a quick retry
> > > > > instead of redirty_tail, which keeps moving the dirtied_when timestamp
> > > > > forward and thus keeps delaying the writeout more and more with every
> > > > > failed attempt to get the lock.
> > > > 
> > > > Yeah, redirty_tail() does have the problem of possibly delaying inodes
> > > > for too long. However, requeue_io() always has the danger of
> > > > triggering busy wb_writeback() loops in corner cases.
> > > > 
> > > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> > > > redirty the inode without doing anything (hence making no progress, a
> > > > prerequisite for busy loops) depending on the in-flight writes, which
> > > > unfortunately further depend on _external_ network/server states.
> > > > That means some stalled network/server state could lead to busy loops
> > > > in NFS clients.
> > > > 
> > > > The alternative solution may be to firstly apply the attached patch,
> > > > and change this one to:
> > > > 
> > > >   -			redirty_tail(inode, wb);
> > > >   +			requeue_io_wait(inode, wb);
> > >   But your patch doesn't solve the busyloop when the problematic inodes are
> > > the only ones under writeback, does it? Then b_more_io and b_more_io_wait
> > > are effectively the same if I understand it right.
> > 
> > The difference lies in the
> > 
> >                 /*
> >                  * No more inodes for IO, bail
> >                  */
> >                 if (list_empty(&wb->b_more_io))
> >                         break;
> > 
> > check in wb_writeback(). So when what's left are all b_more_io_wait
> > inodes, the above check will take effect and break the loop. This is
> > the tricky point of the patch: it relies on the code not touched by
> > the patch to work. I've updated the changelog to explain this.
>   Argh, that's really subtle! Two points here.

Yes, it is..

> 1) We will immediately retry inodes in b_more_io_wait list because of
> 	if (progress)
> 		continue;
> check.

That's right. I'm aware of this and think it's reasonable not to
sleep as long as there are b_more_io inodes to write. In the typical
situation, the time taken to write the b_more_io inodes naturally
serves as the necessary delay before retrying the b_more_io_wait inodes.

> 2) The time the writeout will be delayed is only <= dirty_expire_centisecs,
> but it can be arbitrarily small if someone submits more work.

Exactly.

> Also if wb_writeback() was called from wb_do_writeback(), we will
> retry b_more_io_wait inodes twice immediately because of
> wb_check_old_data_flush() and wb_check_background_flush() calls.

Right. However, given that wb_check_old_data_flush() is ratelimited by
dirty_writeback_interval=5s and wb_check_background_flush() is limited
by over_bground_thresh(), it's hardly a real problem.

> > > I think that busylooping in cases like these could be fixed improving the
> > > busyloop prevention at the end of the loop in wb_writeback(). Maybe we
> > > could just take a short nap before continuting with writeback instead of /
> > > in addition to waiting for inode writeback. What do you think?
> > 
> > That's a reasonable robust option, however at the cost of keeping the
> > writeback code in some ambiguous state ;)
>   What do you exactly mean by ambiguous state?

I mean in Christoph's case, it will be calling requeue_io() and at the
same time relying on your suggested unconditional sleep at the end of the
wb_writeback() loop to avoid a busy loop. Or in other words, b_more_io
will be holding both the inodes that should be busy-retried and the
inodes to be opportunistically retried.  However, I admit it's not a big
problem if we take b_more_io as a general "to be retried ASAP" list.

> I don't see anything ambiguous in waiting for a jiffie or so. Not
> that I'd be completely happy about "just wait for a while and see if
> things are better" but your solution does not seem ideal either... 

There are no big differences (that matter) in terms of "how much exact
time to wait" in this XFS case.  What makes me prefer b_more_io_wait is
that it looks like a more general solution to replace the majority of
redirty_tail() calls to avoid modifying dirtied_when.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-05 13:22           ` Wu Fengguang
@ 2011-09-07 11:52             ` Christoph Hellwig
  -1 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2011-09-07 11:52 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel

On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote:
> > > That's a reasonable robust option, however at the cost of keeping the
> > > writeback code in some ambiguous state ;)
> >   What do you exactly mean by ambiguous state?
> 
> I mean in Christoph's case, it will be calling requeue_io() and at the
> same time relying on your suggested unconditional sleep at the end of the
> wb_writeback() loop to avoid a busy loop. Or in other words, b_more_io
> will be holding both the inodes that should be busy-retried and the
> inodes to be opportunistically retried.  However, I admit it's not a big
> problem if we take b_more_io as a general "to be retried ASAP" list.
> 
> > I don't see anything ambiguous in waiting for a jiffie or so. Not
> > that I'd be completely happy about "just wait for a while and see if
> > things are better" but your solution does not seem ideal either... 
> 
> There are no big differences (that matter) in terms of "how much exact
> time to wait" in this XFS case.  What makes me prefer b_more_io_wait is
> that it looks like a more general solution to replace the majority of
> redirty_tail() calls to avoid modifying dirtied_when.

FYI, we had a few more users hit this issue recently.  I'm not sure why,
but we are seeing it fairly often now.  I'd really like to get some
sort of fix for this merged ASAP, as it causes data loss for users.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-07 11:52             ` Christoph Hellwig
  (?)
@ 2011-09-07 12:51             ` Wu Fengguang
  2011-09-08  0:51                 ` Jan Kara
  -1 siblings, 1 reply; 17+ messages in thread
From: Wu Fengguang @ 2011-09-07 12:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, Jan Kara, xfs

On Wed, Sep 07, 2011 at 07:52:37PM +0800, Christoph Hellwig wrote:
> On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote:
> > > > That's a reasonable robust option, however at the cost of keeping the
> > > > writeback code in some ambiguous state ;)
> > >   What do you exactly mean by ambiguous state?
> > 
> > I mean in Christoph's case, it will be calling requeue_io() and at the
> > same time relying on your suggested unconditional sleep at the end of the
> > wb_writeback() loop to avoid a busy loop. Or in other words, b_more_io
> > will be holding both the inodes that should be busy-retried and the
> > inodes to be opportunistically retried.  However, I admit it's not a big
> > problem if we take b_more_io as a general "to be retried ASAP" list.
> > 
> > > I don't see anything ambiguous in waiting for a jiffie or so. Not
> > > that I'd be completely happy about "just wait for a while and see if
> > > things are better" but your solution does not seem ideal either... 
> > 
> > There are no big differences (that matter) in terms of "how much exact
> > time to wait" in this XFS case.  What makes me prefer b_more_io_wait is
> > that it looks like a more general solution to replace the majority of
> > redirty_tail() calls to avoid modifying dirtied_when.
> 
> FYI, we had a few more users hit this issue recently.  I'm not sure why,
> but we are seeing this fairly often now.  I'd really like to get some
> sort of fix for this in ASAP as it causes data loss for users.  

Jan, do you agree to push the b_more_io_wait patch into linux-next?

If not, let's do a patch to do unconditional sleep at the end of the
wb_writeback() loop?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
  2011-09-07 12:51             ` Wu Fengguang
@ 2011-09-08  0:51                 ` Jan Kara
  0 siblings, 0 replies; 17+ messages in thread
From: Jan Kara @ 2011-09-08  0:51 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, xfs, Dave Chinner

On Wed 07-09-11 20:51:05, Wu Fengguang wrote:
> On Wed, Sep 07, 2011 at 07:52:37PM +0800, Christoph Hellwig wrote:
> > On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote:
> > > > > That's a reasonable robust option, however at the cost of keeping the
> > > > > writeback code in some ambiguous state ;)
> > > >   What do you exactly mean by ambiguous state?
> > > 
> > > I mean in Christoph's case, it will be calling requeue_io() and at the
> > > same time relying on your suggested unconditional sleep at the end of the
> > > wb_writeback() loop to avoid a busy loop. Or in other words, b_more_io
> > > will be holding both the inodes that should be busy-retried and the
> > > inodes to be opportunistically retried.  However, I admit it's not a big
> > > problem if we take b_more_io as a general "to be retried ASAP" list.
> > > 
> > > > I don't see anything ambiguous in waiting for a jiffie or so. Not
> > > > that I'd be completely happy about "just wait for a while and see if
> > > > things are better" but your solution does not seem ideal either... 
> > > 
> > > There are no big differences (that matter) in terms of "how much exact
> > > time to wait" in this XFS case.  What makes me prefer b_more_io_wait is
> > > that it looks like a more general solution to replace the majority of
> > > redirty_tail() calls to avoid modifying dirtied_when.
> > 
> > FYI, we had a few more users hit this issue recently.  I'm not sure why,
> > but we are seeing this fairly often now.  I'd really like to get some
> > sort of fix for this ASAP as it causes data loss for users.
> 
> Jan, do you agree to push the b_more_io_wait patch into linux-next?
> 
> If not, let's do a patch to do unconditional sleep at the end of the
> wb_writeback() loop?
  Well, what I don't like about b_more_io_wait is that the logic shifting
inodes between lists becomes subtle and I'm afraid we could easily break it
in the future. Also the times at which inodes are retried are not so well
defined, although I agree that most likely that's not going to be a problem
in practice. That's why I'd prefer the more robust approach of just waiting
in the loop when we couldn't make any progress. I've just sent a patch
which does that, and a patch which converts redirty_tail() calls to
requeue_io() where it makes sense. Note that the writeback_single_inode()
change is a bit more complex in order to keep livelock avoidance working.
Please have a look at whether the patches would be fine with you. Thanks.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

end of thread, other threads:[~2011-09-08  0:52 UTC | newest]

Thread overview: 17+ messages
-- links below jump to the message on this page --
2011-08-27  6:14 [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY Christoph Hellwig
2011-08-27  6:14 ` Christoph Hellwig
2011-08-27 13:58 ` Wu Fengguang
2011-08-27 13:58   ` Wu Fengguang
2011-09-03  1:13   ` Jan Kara
2011-09-03  1:13     ` Jan Kara
2011-09-03 21:35     ` Wu Fengguang
2011-09-03 21:35       ` Wu Fengguang
2011-09-05 11:11       ` Jan Kara
2011-09-05 11:11         ` Jan Kara
2011-09-05 13:22         ` Wu Fengguang
2011-09-05 13:22           ` Wu Fengguang
2011-09-07 11:52           ` Christoph Hellwig
2011-09-07 11:52             ` Christoph Hellwig
2011-09-07 12:51             ` Wu Fengguang
2011-09-08  0:51               ` Jan Kara
2011-09-08  0:51                 ` Jan Kara
