* [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Christoph Hellwig @ 2011-08-27  6:14 UTC (permalink / raw)
To: Wu Fengguang; +Cc: linux-fsdevel, xfs

Right now ->write_inode has no way to safely return an EAGAIN without explicitly
redirtying the inode, as we would lose the dirty state otherwise.  Most
filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
the flusher thread when ->write_inode hits contended inode locks.  A
contended ilock is something XFS can hit very easily when extending files, as
the data I/O completion handler takes the lock to update the size, and the
->write_inode call can race with it fairly easily if writing enough data
in one go so that the completion for the first write comes in just before
we call ->write_inode.

Change the handling of this case to use requeue_io for a quick retry instead
of redirty_tail, which keeps moving the dirtied_when date forward and thus keeps
delaying the writeout more and more with every failed attempt to get the lock.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/fs/fs-writeback.c
===================================================================
--- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
+++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
@@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
 			 * operations, such as delayed allocation during
 			 * submission or metadata updates after data IO
 			 * completion.
+			 *
+			 * For the latter case it is very important to give
+			 * the inode another turn on b_more_io instead of
+			 * redirtying it.  Constantly moving dirtied_when
+			 * forward will prevent us from ever writing out
+			 * the metadata dirtied in the I/O completion handler.
+			 *
+			 * For files on XFS that constantly get appended to,
+			 * calling redirty_tail means they will never get
+			 * their updated i_size written out.
 			 */
-			redirty_tail(inode, wb);
+			requeue_io(inode, wb);
 		} else {
 			/*
 			 * The inode is clean. At this point we either have
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Wu Fengguang @ 2011-08-27 13:58 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linux-fsdevel, xfs, Jan Kara, Dave Chinner

Christoph,

On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> redirtying the inode, as we would lose the dirty state otherwise.  Most
> filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> the flusher thread when ->write_inode hits contended inode locks.  A
> contended ilock is something XFS can hit very easily when extending files, as
> the data I/O completion handler takes the lock to update the size, and the
> ->write_inode call can race with it fairly easily if writing enough data
> in one go so that the completion for the first write comes in just before
> we call ->write_inode.
>
> Change the handling of this case to use requeue_io for a quick retry instead
> of redirty_tail, which keeps moving the dirtied_when date forward and thus keeps
> delaying the writeout more and more with every failed attempt to get the lock.

Yeah, redirty_tail() does have the problem of possibly delaying inodes
for too long.  However, requeue_io() always has the danger of
triggering busy wb_writeback() loops in corner cases.

For example, nfs_write_inode()/nfs_commit_unstable_pages() often
redirty the inode without doing anything (hence making no progress, a
prerequisite for busy loops) depending on the in-flight writes, which
unfortunately further depend on _external_ network/server states.
That means some stalled network/server state could lead to busy loops
in NFS clients.

The alternative solution may be to first apply the attached patch,
and change this one to:

-			redirty_tail(inode, wb);
+			requeue_io_wait(inode, wb);

Thanks,
Fengguang

> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> Index: linux-2.6/fs/fs-writeback.c
> ===================================================================
> --- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
> +++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
> @@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
>  			 * operations, such as delayed allocation during
>  			 * submission or metadata updates after data IO
>  			 * completion.
> +			 *
> +			 * For the latter case it is very important to give
> +			 * the inode another turn on b_more_io instead of
> +			 * redirtying it.  Constantly moving dirtied_when
> +			 * forward will prevent us from ever writing out
> +			 * the metadata dirtied in the I/O completion handler.
> +			 *
> +			 * For files on XFS that constantly get appended to,
> +			 * calling redirty_tail means they will never get
> +			 * their updated i_size written out.
>  			 */
> -			redirty_tail(inode, wb);
> +			requeue_io(inode, wb);
>  		} else {
>  			/*
>  			 * The inode is clean. At this point we either have

[-- Attachment #2: writeback-more_io_wait.patch --]

Subject: writeback: introduce queue b_more_io_wait
Date: Sun Jul 31 18:44:44 CST 2011

The problem is, redirty_tail() may update i_dirtied_when and result in
a 30s max delay.  If redirty_tail() is called often enough, some inode
may even be delayed forever.

So introduce the b_more_io_wait queue to park inodes that for some
reason cannot be synced immediately.  The inodes will be sent to b_io
at the next b_io refill time, but won't be busy-retried the way
b_more_io is.

The new data flow after converting all redirty_tail() calls to
requeue_io_wait():

	b_dirty --> b_io --> b_more_io/b_more_io_wait --+
	             ^                                  |
	             |                                  |
	             +----------------------------------+

Cc: Jan Kara <jack@suse.cz>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c           |   10 ++++++++++
 include/linux/backing-dev.h |    8 +++++---
 mm/backing-dev.c            |   10 ++++++++--
 3 files changed, 23 insertions(+), 5 deletions(-)

--- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
+++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
@@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
 	list_move(&inode->i_wb_list, &wb->b_more_io);
 }
 
+/*
+ * The inode should be retried in an opportunistic way.
+ */
+static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
+{
+	assert_spin_locked(&wb->list_lock);
+	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
+}
+
 static void inode_sync_complete(struct inode *inode)
 {
 	/*
@@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
 	int moved;
 	assert_spin_locked(&wb->list_lock);
 	list_splice_init(&wb->b_more_io, &wb->b_io);
+	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
 	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
 	trace_writeback_queue_io(wb, older_than_this, moved);
 }
--- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
@@ -59,6 +59,7 @@ struct bdi_writeback {
 	struct list_head b_dirty;	/* dirty inodes */
 	struct list_head b_io;		/* parked for writeback */
 	struct list_head b_more_io;	/* parked for more writeback */
+	struct list_head b_more_io_wait;/* opportunistic retry io */
 	spinlock_t list_lock;		/* protects the b_* lists */
 };
 
@@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
 
 static inline int wb_has_dirty_io(struct bdi_writeback *wb)
 {
-	return !list_empty(&wb->b_dirty) ||
-	       !list_empty(&wb->b_io) ||
-	       !list_empty(&wb->b_more_io);
+	return !list_empty(&wb->b_dirty) ||
+	       !list_empty(&wb->b_io) ||
+	       !list_empty(&wb->b_more_io) ||
+	       !list_empty(&wb->b_more_io_wait);
 }
 
 static inline void __add_bdi_stat(struct backing_dev_info *bdi,
--- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
+++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
@@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
 	unsigned long background_thresh;
 	unsigned long dirty_thresh;
 	unsigned long bdi_thresh;
-	unsigned long nr_dirty, nr_io, nr_more_io;
+	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
 	struct inode *inode;
 
-	nr_dirty = nr_io = nr_more_io = 0;
+	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
 	spin_lock(&wb->list_lock);
 	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
 		nr_dirty++;
@@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
 		nr_io++;
 	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
 		nr_more_io++;
+	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
+		nr_more_io_wait++;
 	spin_unlock(&wb->list_lock);
 
 	global_dirty_limits(&background_thresh, &dirty_thresh);
@@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
 		   "b_dirty:          %10lu\n"
 		   "b_io:             %10lu\n"
 		   "b_more_io:        %10lu\n"
+		   "b_more_io_wait:   %10lu\n"
 		   "bdi_list:         %10u\n"
 		   "state:            %10lx\n",
 		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
@@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
 		   nr_dirty,
 		   nr_io,
 		   nr_more_io,
+		   nr_more_io_wait,
 		   !list_empty(&bdi->bdi_list), bdi->state);
 #undef K
 
@@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
 	INIT_LIST_HEAD(&wb->b_dirty);
 	INIT_LIST_HEAD(&wb->b_io);
 	INIT_LIST_HEAD(&wb->b_more_io);
+	INIT_LIST_HEAD(&wb->b_more_io_wait);
 	spin_lock_init(&wb->list_lock);
 	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
 }
@@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
 	list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
 	list_splice(&bdi->wb.b_io, &dst->b_io);
 	list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
+	list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
 	spin_unlock(&bdi->wb.list_lock);
 	spin_unlock(&dst->list_lock);
 }
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
From: Jan Kara @ 2011-09-03  1:13 UTC (permalink / raw)
To: Wu Fengguang
Cc: Christoph Hellwig, linux-fsdevel, xfs, Jan Kara, Dave Chinner

On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> Christoph,
>
> On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > Right now ->write_inode has no way to safely return an EAGAIN without explicitly
> > redirtying the inode, as we would lose the dirty state otherwise.  Most
> > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> > the flusher thread when ->write_inode hits contended inode locks.  A
> > contended ilock is something XFS can hit very easily when extending files, as
> > the data I/O completion handler takes the lock to update the size, and the
> > ->write_inode call can race with it fairly easily if writing enough data
> > in one go so that the completion for the first write comes in just before
> > we call ->write_inode.
> >
> > Change the handling of this case to use requeue_io for a quick retry instead
> > of redirty_tail, which keeps moving the dirtied_when date forward and thus keeps
> > delaying the writeout more and more with every failed attempt to get the lock.
>
> Yeah, redirty_tail() does have the problem of possibly delaying inodes
> for too long.  However, requeue_io() always has the danger of
> triggering busy wb_writeback() loops in corner cases.
>
> For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> redirty the inode without doing anything (hence making no progress, a
> prerequisite for busy loops) depending on the in-flight writes, which
> unfortunately further depend on _external_ network/server states.
> That means some stalled network/server state could lead to busy loops
> in NFS clients.
>
> The alternative solution may be to first apply the attached patch,
> and change this one to:
>
> -			redirty_tail(inode, wb);
> +			requeue_io_wait(inode, wb);

  But your patch doesn't solve the busyloop when the problematic inodes
are the only ones under writeback, does it?  Then b_more_io and
b_more_io_wait are effectively the same, if I understand it right.

  I think that busylooping in cases like these could be fixed by
improving the busyloop prevention at the end of the loop in
wb_writeback().  Maybe we could just take a short nap before continuing
with writeback, instead of / in addition to waiting for inode writeback.
What do you think?

								Honza

> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> >
> > Index: linux-2.6/fs/fs-writeback.c
> > ===================================================================
> > --- linux-2.6.orig/fs/fs-writeback.c	2011-08-26 14:47:42.137050059 +0200
> > +++ linux-2.6/fs/fs-writeback.c	2011-08-26 15:06:47.003493601 +0200
> > @@ -464,8 +464,18 @@ writeback_single_inode(struct inode *ino
> >  			 * operations, such as delayed allocation during
> >  			 * submission or metadata updates after data IO
> >  			 * completion.
> > +			 *
> > +			 * For the latter case it is very important to give
> > +			 * the inode another turn on b_more_io instead of
> > +			 * redirtying it.  Constantly moving dirtied_when
> > +			 * forward will prevent us from ever writing out
> > +			 * the metadata dirtied in the I/O completion handler.
> > +			 *
> > +			 * For files on XFS that constantly get appended to,
> > +			 * calling redirty_tail means they will never get
> > +			 * their updated i_size written out.
> >  			 */
> > -			redirty_tail(inode, wb);
> > +			requeue_io(inode, wb);
> >  		} else {
> >  			/*
> >  			 * The inode is clean. At this point we either have
>
> Subject: writeback: introduce queue b_more_io_wait
> Date: Sun Jul 31 18:44:44 CST 2011
>
> The problem is, redirty_tail() may update i_dirtied_when and result in
> a 30s max delay.  If redirty_tail() is called often enough, some inode
> may even be delayed forever.
>
> So introduce the b_more_io_wait queue to park inodes that for some
> reason cannot be synced immediately.  The inodes will be sent to b_io
> at the next b_io refill time, but won't be busy-retried the way
> b_more_io is.
>
> The new data flow after converting all redirty_tail() calls to
> requeue_io_wait():
>
> 	b_dirty --> b_io --> b_more_io/b_more_io_wait --+
> 	             ^                                  |
> 	             |                                  |
> 	             +----------------------------------+
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
>  fs/fs-writeback.c           |   10 ++++++++++
>  include/linux/backing-dev.h |    8 +++++---
>  mm/backing-dev.c            |   10 ++++++++--
>  3 files changed, 23 insertions(+), 5 deletions(-)
>
> --- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
> @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
>  	list_move(&inode->i_wb_list, &wb->b_more_io);
>  }
> 
> +/*
> + * The inode should be retried in an opportunistic way.
> + */
> +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb)
> +{
> +	assert_spin_locked(&wb->list_lock);
> +	list_move(&inode->i_wb_list, &wb->b_more_io_wait);
> +}
> +
>  static void inode_sync_complete(struct inode *inode)
>  {
>  	/*
> @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac
>  	int moved;
>  	assert_spin_locked(&wb->list_lock);
>  	list_splice_init(&wb->b_more_io, &wb->b_io);
> +	list_splice_init(&wb->b_more_io_wait, &wb->b_io);
>  	moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this);
>  	trace_writeback_queue_io(wb, older_than_this, moved);
>  }
> --- linux-next.orig/include/linux/backing-dev.h	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/include/linux/backing-dev.h	2011-08-27 15:45:10.000000000 +0800
> @@ -59,6 +59,7 @@ struct bdi_writeback {
>  	struct list_head b_dirty;	/* dirty inodes */
>  	struct list_head b_io;		/* parked for writeback */
>  	struct list_head b_more_io;	/* parked for more writeback */
> +	struct list_head b_more_io_wait;/* opportunistic retry io */
>  	spinlock_t list_lock;		/* protects the b_* lists */
>  };
> 
> @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list
> 
>  static inline int wb_has_dirty_io(struct bdi_writeback *wb)
>  {
> -	return !list_empty(&wb->b_dirty) ||
> -	       !list_empty(&wb->b_io) ||
> -	       !list_empty(&wb->b_more_io);
> +	return !list_empty(&wb->b_dirty) ||
> +	       !list_empty(&wb->b_io) ||
> +	       !list_empty(&wb->b_more_io) ||
> +	       !list_empty(&wb->b_more_io_wait);
>  }
> 
>  static inline void __add_bdi_stat(struct backing_dev_info *bdi,
> --- linux-next.orig/mm/backing-dev.c	2011-08-26 19:27:20.000000000 +0800
> +++ linux-next/mm/backing-dev.c	2011-08-27 15:45:10.000000000 +0800
> @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s
>  	unsigned long background_thresh;
>  	unsigned long dirty_thresh;
>  	unsigned long bdi_thresh;
> -	unsigned long nr_dirty, nr_io, nr_more_io;
> +	unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait;
>  	struct inode *inode;
> 
> -	nr_dirty = nr_io = nr_more_io = 0;
> +	nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0;
>  	spin_lock(&wb->list_lock);
>  	list_for_each_entry(inode, &wb->b_dirty, i_wb_list)
>  		nr_dirty++;
> @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s
>  		nr_io++;
>  	list_for_each_entry(inode, &wb->b_more_io, i_wb_list)
>  		nr_more_io++;
> +	list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list)
> +		nr_more_io_wait++;
>  	spin_unlock(&wb->list_lock);
> 
>  	global_dirty_limits(&background_thresh, &dirty_thresh);
> @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s
>  		   "b_dirty:          %10lu\n"
>  		   "b_io:             %10lu\n"
>  		   "b_more_io:        %10lu\n"
> +		   "b_more_io_wait:   %10lu\n"
>  		   "bdi_list:         %10u\n"
>  		   "state:            %10lx\n",
>  		   (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)),
> @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s
>  		   nr_dirty,
>  		   nr_io,
>  		   nr_more_io,
> +		   nr_more_io_wait,
>  		   !list_empty(&bdi->bdi_list), bdi->state);
>  #undef K
> 
> @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write
>  	INIT_LIST_HEAD(&wb->b_dirty);
>  	INIT_LIST_HEAD(&wb->b_io);
>  	INIT_LIST_HEAD(&wb->b_more_io);
> +	INIT_LIST_HEAD(&wb->b_more_io_wait);
>  	spin_lock_init(&wb->list_lock);
>  	setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi);
>  }
> @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info
>  	list_splice(&bdi->wb.b_dirty, &dst->b_dirty);
>  	list_splice(&bdi->wb.b_io, &dst->b_io);
>  	list_splice(&bdi->wb.b_more_io, &dst->b_more_io);
> +	list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait);
>  	spin_unlock(&bdi->wb.list_lock);
>  	spin_unlock(&dst->list_lock);
>  }
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
nr_dirty = nr_io = nr_more_io = 0; > + nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0; > spin_lock(&wb->list_lock); > list_for_each_entry(inode, &wb->b_dirty, i_wb_list) > nr_dirty++; > @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s > nr_io++; > list_for_each_entry(inode, &wb->b_more_io, i_wb_list) > nr_more_io++; > + list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list) > + nr_more_io_wait++; > spin_unlock(&wb->list_lock); > > global_dirty_limits(&background_thresh, &dirty_thresh); > @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s > "b_dirty: %10lu\n" > "b_io: %10lu\n" > "b_more_io: %10lu\n" > + "b_more_io_wait: %10lu\n" > "bdi_list: %10u\n" > "state: %10lx\n", > (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), > @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s > nr_dirty, > nr_io, > nr_more_io, > + nr_more_io_wait, > !list_empty(&bdi->bdi_list), bdi->state); > #undef K > > @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write > INIT_LIST_HEAD(&wb->b_dirty); > INIT_LIST_HEAD(&wb->b_io); > INIT_LIST_HEAD(&wb->b_more_io); > + INIT_LIST_HEAD(&wb->b_more_io_wait); > spin_lock_init(&wb->list_lock); > setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi); > } > @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info > list_splice(&bdi->wb.b_dirty, &dst->b_dirty); > list_splice(&bdi->wb.b_io, &dst->b_io); > list_splice(&bdi->wb.b_more_io, &dst->b_more_io); > + list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait); > spin_unlock(&bdi->wb.list_lock); > spin_unlock(&dst->list_lock); > } -- Jan Kara <jack@suse.cz> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY 2011-09-03 1:13 ` Jan Kara @ 2011-09-03 21:35 ` Wu Fengguang 0 siblings, 0 replies; 17+ messages in thread From: Wu Fengguang @ 2011-09-03 21:35 UTC (permalink / raw) To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote: > On Sat 27-08-11 21:58:25, Wu Fengguang wrote: > > Christoph, > > > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote: > > > Right now ->write_inode has no way to safely return an EAGAIN without explicitly > > > redirtying the inode, as we would lose the dirty state otherwise. Most > > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking > > > the flusher thread when ->write_inode hits contended inode locks. A > > > contended ilock is something XFS can hit very easily when extending files, as > > > the data I/O completion handler takes the lock to update the size, and the > > > ->write_inode call can race with it fairly easily if writing enough data > > > in one go so that the completion for the first write comes in just before > > > we call ->write_inode. > > > > > > Change the handling of this case to use requeue_io for a quick retry instead > > > of redirty_tail, which keeps moving out the dirtied_when date and thus keeps > > > delaying the writeout more and more with every failed attempt to get the lock. > > > > Yeah redirty_tail() does have the problem of possibly delaying inodes for > > too long. However, you know requeue_io() always has the danger of > > triggering busy wb_writeback() loops in corner cases. > > > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often > > redirty the inode without doing anything (hence no progress, a > > prerequisite for busy loops) depending on the in-flight writes, which > > unfortunately further depend on _external_ network/server states..
> > That means some stalled network/server state could lead to busy loops > > in NFS clients. > > > > The alternative solution may be to firstly apply the attached patch, > > and change this one to: > > > > - redirty_tail(inode, wb); > > + requeue_io_wait(inode, wb); > But your patch doesn't solve the busyloop when the problematic inodes are > the only ones under writeback, does it? Then b_more_io and b_more_io_wait > are effectively the same if I understand it right. The difference lies in the /* * No more inodes for IO, bail */ if (list_empty(&wb->b_more_io)) break; check in wb_writeback(). So when what's left are all b_more_io_wait inodes, the above check will take effect and break the loop. This is the tricky point of the patch: it relies on the code not touched by the patch to work. I've updated the changelog to explain this. > I think that busylooping in cases like these could be fixed by improving the > busyloop prevention at the end of the loop in wb_writeback(). Maybe we > could just take a short nap before continuing with writeback instead of / > in addition to waiting for inode writeback. What do you think? That's a reasonably robust option, however at the cost of keeping the writeback code in some ambiguous state ;) Thanks, Fengguang --- Subject: writeback: introduce queue b_more_io_wait Date: Sun Jul 31 18:44:44 CST 2011 The problem is, redirty_tail() may update i_dirtied_when and result in a 30s max delay. If redirty_tail() is called often enough, some inode may even be delayed forever. So introduce the b_more_io_wait queue to park inodes that for some reason cannot be synced immediately. The inodes will be sent to b_io at the next b_io refill time, however they won't be busy retried as b_more_io is: when the redirtied inodes are all in b_more_io_wait, wb_writeback() will see an empty b_more_io and hence break out of the loop.
The new data flow after converting all redirty_tail() calls to requeue_io_wait(): b_dirty --> b_io --> b_more_io/b_more_io_wait --+ ^ | | | +----------------------------------+ Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> --- fs/fs-writeback.c | 10 ++++++++++ include/linux/backing-dev.h | 8 +++++--- mm/backing-dev.c | 10 ++++++++-- 3 files changed, 23 insertions(+), 5 deletions(-) --- linux-next.orig/fs/fs-writeback.c 2011-08-27 15:28:27.000000000 +0800 +++ linux-next/fs/fs-writeback.c 2011-08-27 15:45:10.000000000 +0800 @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino list_move(&inode->i_wb_list, &wb->b_more_io); } +/* + * The inode should be retried in an opportunistic way. + */ +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb) +{ + assert_spin_locked(&wb->list_lock); + list_move(&inode->i_wb_list, &wb->b_more_io_wait); +} + static void inode_sync_complete(struct inode *inode) { /* @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac int moved; assert_spin_locked(&wb->list_lock); list_splice_init(&wb->b_more_io, &wb->b_io); + list_splice_init(&wb->b_more_io_wait, &wb->b_io); moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); trace_writeback_queue_io(wb, older_than_this, moved); } --- linux-next.orig/include/linux/backing-dev.h 2011-08-26 19:27:20.000000000 +0800 +++ linux-next/include/linux/backing-dev.h 2011-08-27 15:45:10.000000000 +0800 @@ -59,6 +59,7 @@ struct bdi_writeback { struct list_head b_dirty; /* dirty inodes */ struct list_head b_io; /* parked for writeback */ struct list_head b_more_io; /* parked for more writeback */ + struct list_head b_more_io_wait;/* opportunistic retry io */ spinlock_t list_lock; /* protects the b_* lists */ }; @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list static inline int wb_has_dirty_io(struct bdi_writeback *wb) { - return !list_empty(&wb->b_dirty) || - 
!list_empty(&wb->b_io) || - !list_empty(&wb->b_more_io); + return !list_empty(&wb->b_dirty) || + !list_empty(&wb->b_io) || + !list_empty(&wb->b_more_io) || + !list_empty(&wb->b_more_io_wait); } static inline void __add_bdi_stat(struct backing_dev_info *bdi, --- linux-next.orig/mm/backing-dev.c 2011-08-26 19:27:20.000000000 +0800 +++ linux-next/mm/backing-dev.c 2011-08-27 15:45:10.000000000 +0800 @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s unsigned long background_thresh; unsigned long dirty_thresh; unsigned long bdi_thresh; - unsigned long nr_dirty, nr_io, nr_more_io; + unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait; struct inode *inode; - nr_dirty = nr_io = nr_more_io = 0; + nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0; spin_lock(&wb->list_lock); list_for_each_entry(inode, &wb->b_dirty, i_wb_list) nr_dirty++; @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s nr_io++; list_for_each_entry(inode, &wb->b_more_io, i_wb_list) nr_more_io++; + list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list) + nr_more_io_wait++; spin_unlock(&wb->list_lock); global_dirty_limits(&background_thresh, &dirty_thresh); @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s "b_dirty: %10lu\n" "b_io: %10lu\n" "b_more_io: %10lu\n" + "b_more_io_wait: %10lu\n" "bdi_list: %10u\n" "state: %10lx\n", (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s nr_dirty, nr_io, nr_more_io, + nr_more_io_wait, !list_empty(&bdi->bdi_list), bdi->state); #undef K @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write INIT_LIST_HEAD(&wb->b_dirty); INIT_LIST_HEAD(&wb->b_io); INIT_LIST_HEAD(&wb->b_more_io); + INIT_LIST_HEAD(&wb->b_more_io_wait); spin_lock_init(&wb->list_lock); setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi); } @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info list_splice(&bdi->wb.b_dirty, &dst->b_dirty); list_splice(&bdi->wb.b_io, &dst->b_io); 
list_splice(&bdi->wb.b_more_io, &dst->b_more_io); + list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait); spin_unlock(&bdi->wb.list_lock); spin_unlock(&dst->list_lock); } ^ permalink raw reply [flat|nested] 17+ messages in thread
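To make the retry semantics discussed above concrete, here is a small userspace toy model (an editorial sketch, not kernel code; the function and queue names are made up to mirror the kernel's b_io/b_more_io/b_more_io_wait lists). It shows why requeueing a repeatedly failing inode to b_more_io busy-retries inside a single wb_writeback() invocation, while parking it on b_more_io_wait lets the "No more inodes for IO, bail" check end the pass after one iteration:

```c
#include <assert.h>

/* Where our single simulated inode currently sits. */
enum queue { Q_IO, Q_MORE_IO, Q_MORE_IO_WAIT, Q_CLEAN };

/*
 * Toy model of one wb_writeback() invocation working on a single inode
 * whose ->write_inode() fails `failures` times before succeeding.
 * Returns how many loop iterations this invocation runs.
 */
static int wb_loop_iterations(int failures, int park_on_wait)
{
    enum queue q = Q_IO;
    int iters = 0;

    for (;;) {
        iters++;

        /* queue_io(): refill b_io from the retry lists */
        if (q == Q_MORE_IO || q == Q_MORE_IO_WAIT)
            q = Q_IO;

        if (failures > 0) {
            failures--;
            /* ->write_inode() lost the ilock race again */
            q = park_on_wait ? Q_MORE_IO_WAIT : Q_MORE_IO;
        } else {
            q = Q_CLEAN;        /* writeout finally succeeded */
        }

        if (q == Q_CLEAN)
            break;              /* nothing dirty left */
        if (q != Q_MORE_IO)
            break;              /* "No more inodes for IO, bail" */
    }
    return iters;
}
```

Under this model a busy retry spins failures+1 iterations within one invocation, while parking bails after a single pass and defers the retry to the next b_io refill, which is the property the untouched b_more_io check provides.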
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY 2011-09-03 21:35 ` Wu Fengguang @ 2011-09-05 11:11 ` Jan Kara 0 siblings, 0 replies; 17+ messages in thread From: Jan Kara @ 2011-09-05 11:11 UTC (permalink / raw) To: Wu Fengguang Cc: Jan Kara, Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner On Sun 04-09-11 05:35:27, Wu Fengguang wrote: > On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote: > > On Sat 27-08-11 21:58:25, Wu Fengguang wrote: > > > Christoph, > > > > > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote: > > > > Right now ->write_inode has no way to safely return an EAGAIN without explicitly > > > > redirtying the inode, as we would lose the dirty state otherwise. Most > > > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking > > > > the flusher thread when ->write_inode hits contended inode locks. A > > > > contended ilock is something XFS can hit very easily when extending files, as > > > > the data I/O completion handler takes the lock to update the size, and the > > > > ->write_inode call can race with it fairly easily if writing enough data > > > > in one go so that the completion for the first write comes in just before > > > > we call ->write_inode. > > > > > > > > Change the handling of this case to use requeue_io for a quick retry instead > > > > of redirty_tail, which keeps moving out the dirtied_when date and thus keeps > > > > delaying the writeout more and more with every failed attempt to get the lock. > > > > > > Yeah redirty_tail() does have the problem of possibly delaying inodes for > > > too long. However, you know requeue_io() always has the danger of > > > triggering busy wb_writeback() loops in corner cases.
> > > > > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often > > > redirty the inode without doing anything (hence no progress, a > > > prerequisite for busy loops) depending on the in-flight writes, which > > > unfortunately further depend on _external_ network/server states.. > > > That means some stalled network/server state could lead to busy loops > > > in NFS clients. > > > > > > The alternative solution may be to firstly apply the attached patch, > > > and change this one to: > > > > > > - redirty_tail(inode, wb); > > > + requeue_io_wait(inode, wb); > > But your patch doesn't solve the busyloop when the problematic inodes are > > the only ones under writeback, does it? Then b_more_io and b_more_io_wait > > are effectively the same if I understand it right. > > The difference lies in the > > /* > * No more inodes for IO, bail > */ > if (list_empty(&wb->b_more_io)) > break; > > check in wb_writeback(). So when what's left are all b_more_io_wait > inodes, the above check will take effect and break the loop. This is > the tricky point of the patch: it relies on the code not touched by > the patch to work. I've updated the changelog to explain this. Argh, that's really subtle! Two points here. 1) We will immediately retry inodes on the b_more_io_wait list because of the if (progress) continue; check. 2) The time the writeout will be delayed is only <=dirty_expire_centisecs but can be arbitrarily small if someone submits more work. Also, if wb_writeback() was called from wb_do_writeback(), we will retry b_more_io_wait inodes twice immediately because of the wb_check_old_data_flush() and wb_check_background_flush() calls. > > I think that busylooping in cases like these could be fixed by improving the > > busyloop prevention at the end of the loop in wb_writeback(). Maybe we > > could just take a short nap before continuing with writeback instead of /
> > That's a reasonable robust option, however at the cost of keeping the > writeback code in some ambiguous state ;) What do you exactly mean by ambiguous state? I don't see anything ambiguous in waiting for a jiffie or so. Not that I'd be completely happy about "just wait for a while and see if things are better" but your solution does not seem ideal either... Honza > --- > Subject: writeback: introduce queue b_more_io_wait > Date: Sun Jul 31 18:44:44 CST 2011 > > The problem is, redirty_tail() may update i_dirtied_when and result in > 30s max delay. If redirty_tail() is called often enough, some inode may > even be delayed for ever. > > So introduce the b_more_io_wait queue to park inodes that for some > reason cannot be synced immediately. The inodes will be sent to b_io at > the next b_io refill time, however won't be busy retried as b_more_io: > when the redirtied inodes are all in b_more_io_wait, wb_writeback() will > see empty b_more_io and hence break out of the loop. > > The new data flow after converting all redirty_tail() calls to > requeue_io_wait(): > > b_dirty --> b_io --> b_more_io/b_more_io_wait --+ > ^ | > | | > +----------------------------------+ > > Cc: Jan Kara <jack@suse.cz> > Cc: Dave Chinner <david@fromorbit.com> > Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> > --- > fs/fs-writeback.c | 10 ++++++++++ > include/linux/backing-dev.h | 8 +++++--- > mm/backing-dev.c | 10 ++++++++-- > 3 files changed, 23 insertions(+), 5 deletions(-) > > --- linux-next.orig/fs/fs-writeback.c 2011-08-27 15:28:27.000000000 +0800 > +++ linux-next/fs/fs-writeback.c 2011-08-27 15:45:10.000000000 +0800 > @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino > list_move(&inode->i_wb_list, &wb->b_more_io); > } > > +/* > + * The inode should be retried in an opportunistic way. 
> + */ > +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb) > +{ > + assert_spin_locked(&wb->list_lock); > + list_move(&inode->i_wb_list, &wb->b_more_io_wait); > +} > + > static void inode_sync_complete(struct inode *inode) > { > /* > @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac > int moved; > assert_spin_locked(&wb->list_lock); > list_splice_init(&wb->b_more_io, &wb->b_io); > + list_splice_init(&wb->b_more_io_wait, &wb->b_io); > moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); > trace_writeback_queue_io(wb, older_than_this, moved); > } > --- linux-next.orig/include/linux/backing-dev.h 2011-08-26 19:27:20.000000000 +0800 > +++ linux-next/include/linux/backing-dev.h 2011-08-27 15:45:10.000000000 +0800 > @@ -59,6 +59,7 @@ struct bdi_writeback { > struct list_head b_dirty; /* dirty inodes */ > struct list_head b_io; /* parked for writeback */ > struct list_head b_more_io; /* parked for more writeback */ > + struct list_head b_more_io_wait;/* opportunistic retry io */ > spinlock_t list_lock; /* protects the b_* lists */ > }; > > @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list > > static inline int wb_has_dirty_io(struct bdi_writeback *wb) > { > - return !list_empty(&wb->b_dirty) || > - !list_empty(&wb->b_io) || > - !list_empty(&wb->b_more_io); > + return !list_empty(&wb->b_dirty) || > + !list_empty(&wb->b_io) || > + !list_empty(&wb->b_more_io) || > + !list_empty(&wb->b_more_io_wait); > } > > static inline void __add_bdi_stat(struct backing_dev_info *bdi, > --- linux-next.orig/mm/backing-dev.c 2011-08-26 19:27:20.000000000 +0800 > +++ linux-next/mm/backing-dev.c 2011-08-27 15:45:10.000000000 +0800 > @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > - unsigned long nr_dirty, nr_io, nr_more_io; > + unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait; > struct inode *inode; > > - 
nr_dirty = nr_io = nr_more_io = 0; > + nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0; > spin_lock(&wb->list_lock); > list_for_each_entry(inode, &wb->b_dirty, i_wb_list) > nr_dirty++; > @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s > nr_io++; > list_for_each_entry(inode, &wb->b_more_io, i_wb_list) > nr_more_io++; > + list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list) > + nr_more_io_wait++; > spin_unlock(&wb->list_lock); > > global_dirty_limits(&background_thresh, &dirty_thresh); > @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s > "b_dirty: %10lu\n" > "b_io: %10lu\n" > "b_more_io: %10lu\n" > + "b_more_io_wait: %10lu\n" > "bdi_list: %10u\n" > "state: %10lx\n", > (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), > @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s > nr_dirty, > nr_io, > nr_more_io, > + nr_more_io_wait, > !list_empty(&bdi->bdi_list), bdi->state); > #undef K > > @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write > INIT_LIST_HEAD(&wb->b_dirty); > INIT_LIST_HEAD(&wb->b_io); > INIT_LIST_HEAD(&wb->b_more_io); > + INIT_LIST_HEAD(&wb->b_more_io_wait); > spin_lock_init(&wb->list_lock); > setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi); > } > @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info > list_splice(&bdi->wb.b_dirty, &dst->b_dirty); > list_splice(&bdi->wb.b_io, &dst->b_io); > list_splice(&bdi->wb.b_more_io, &dst->b_more_io); > + list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait); > spin_unlock(&bdi->wb.list_lock); > spin_unlock(&dst->list_lock); > } -- Jan Kara <jack@suse.cz> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
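Jan's nap-based alternative can be modelled in the same toy style (again a hypothetical userspace sketch, not the real wb_writeback(); the `naps_taken` counter stands in for something like schedule_timeout(1) before the next pass): the flusher keeps retrying in its loop, but every no-progress pass costs a short sleep rather than a spin.

```c
#include <assert.h>

static int naps_taken;  /* how many 1-jiffy naps the model took */

/*
 * Model of a wb_writeback() loop that keeps busy-retrying a failing
 * inode but naps whenever a pass makes no progress, as Jan suggests.
 * Returns the number of passes until the inode is written out.
 */
static int wb_loop_with_nap(int failures)
{
    int passes = 0;

    naps_taken = 0;
    for (;;) {
        passes++;
        if (failures == 0)
            break;          /* ->write_inode() succeeded, inode clean */
        failures--;
        /* no progress this pass: sleep a jiffy instead of spinning
         * (schedule_timeout(1) in kernel terms) */
        naps_taken++;
    }
    return passes;
}
```

The tradeoff the two are debating is visible here: the inode is retried promptly (no 30s redirty_tail delay) and CPU burn is bounded by the naps, but the flusher thread stays parked in its loop for the whole retry sequence.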
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY @ 2011-09-05 11:11 ` Jan Kara 0 siblings, 0 replies; 17+ messages in thread From: Jan Kara @ 2011-09-05 11:11 UTC (permalink / raw) To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel, Jan Kara, xfs On Sun 04-09-11 05:35:27, Wu Fengguang wrote: > On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote: > > On Sat 27-08-11 21:58:25, Wu Fengguang wrote: > > > Christoph, > > > > > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote: > > > > Right now ->write_inode has no way to safely return a EAGAIN without explicitly > > > > redirtying the inode, as we would lose the dirty state otherwise. Most > > > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking > > > > the flusher thread when ->write_inode hits contentended inode locks. A > > > > contended ilock is something XFS can hit very easibly when extending files, as > > > > the data I/O completion handler takes the lock to update the size, and the > > > > ->write_inode call can race with it fairly easily if writing enough data > > > > in one go so that the completion for the first write come in just before > > > > we call ->write_inode. > > > > > > > > Change the handling of this case to use requeue_io for a quick retry instead > > > > of redirty_tail, which keeps moving out the dirtied_when data and thus keeps > > > > delaying the writeout more and more with every failed attempt to get the lock. > > > > > > Yeah redirty_tail() does have the problem of possibly delay inodes for > > > too long time. However, you know requeue_io() always has the danger of > > > triggering busy wb_writeback() loops in corner cases. 
> > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> > > redirty the inode without doing anything (hence no progress at all, a
> > > prerequisite for busy loops) depending on the in-flight writes, which
> > > unfortunately further depends on _external_ network/server states..
> > > That means some stalled network/server state could lead to busy loops
> > > in NFS clients.
> > >
> > > The alternative solution may be to first apply the attached patch,
> > > and change this one to:
> > >
> > > -		redirty_tail(inode, wb);
> > > +		requeue_io_wait(inode, wb);
> > But your patch doesn't solve the busyloop when the problematic inodes are
> > the only ones under writeback, does it? Then b_more_io and b_more_io_wait
> > are effectively the same if I understand it right.
>
> The difference lies in the
>
> 		/*
> 		 * No more inodes for IO, bail
> 		 */
> 		if (list_empty(&wb->b_more_io))
> 			break;
>
> check in wb_writeback(). So when what's left are all b_more_io_wait
> inodes, the above check will take effect and break the loop. This is
> the tricky point of the patch: it relies on the code not touched by
> the patch to work. I've updated the changelog to explain this.

Argh, that's really subtle! Two points here.

1) We will immediately retry inodes in the b_more_io_wait list because of the
       if (progress)
           continue;
   check.

2) The time writeout will be delayed is only <= dirty_expire_centisecs, but
   can be arbitrarily small if someone submits more work.

Also, if wb_writeback() was called from wb_do_writeback(), we will
retry b_more_io_wait inodes twice immediately because of the
wb_check_old_data_flush() and wb_check_background_flush() calls.

> > I think that busylooping in cases like these could be fixed by improving the
> > busyloop prevention at the end of the loop in wb_writeback(). Maybe we
> > could just take a short nap before continuing with writeback instead of /
> > in addition to waiting for inode writeback. What do you think?
> > That's a reasonable robust option, however at the cost of keeping the
> > writeback code in some ambiguous state ;)

What exactly do you mean by ambiguous state? I don't see anything ambiguous
in waiting for a jiffie or so. Not that I'd be completely happy about "just
wait for a while and see if things are better" but your solution does not
seem ideal either...

								Honza

> ---
> Subject: writeback: introduce queue b_more_io_wait
> Date: Sun Jul 31 18:44:44 CST 2011
>
> The problem is, redirty_tail() may update i_dirtied_when and result in
> 30s max delay. If redirty_tail() is called often enough, some inode may
> even be delayed forever.
>
> So introduce the b_more_io_wait queue to park inodes that for some
> reason cannot be synced immediately. The inodes will be sent to b_io at
> the next b_io refill time, but won't be busy-retried the way b_more_io is:
> when the redirtied inodes are all in b_more_io_wait, wb_writeback() will
> see an empty b_more_io and hence break out of the loop.
>
> The new data flow after converting all redirty_tail() calls to
> requeue_io_wait():
>
> b_dirty --> b_io --> b_more_io/b_more_io_wait --+
>      ^                                          |
>      |                                          |
>      +------------------------------------------+
>
> Cc: Jan Kara <jack@suse.cz>
> Cc: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
> ---
> fs/fs-writeback.c           | 10 ++++++++++
> include/linux/backing-dev.h |  8 +++++---
> mm/backing-dev.c            | 10 ++++++++--
> 3 files changed, 23 insertions(+), 5 deletions(-)
>
> --- linux-next.orig/fs/fs-writeback.c	2011-08-27 15:28:27.000000000 +0800
> +++ linux-next/fs/fs-writeback.c	2011-08-27 15:45:10.000000000 +0800
> @@ -220,6 +220,15 @@ static void requeue_io(struct inode *ino
>  	list_move(&inode->i_wb_list, &wb->b_more_io);
>  }
>
> +/*
> + * The inode should be retried in an opportunistic way.
> + */ > +static void requeue_io_wait(struct inode *inode, struct bdi_writeback *wb) > +{ > + assert_spin_locked(&wb->list_lock); > + list_move(&inode->i_wb_list, &wb->b_more_io_wait); > +} > + > static void inode_sync_complete(struct inode *inode) > { > /* > @@ -307,6 +316,7 @@ static void queue_io(struct bdi_writebac > int moved; > assert_spin_locked(&wb->list_lock); > list_splice_init(&wb->b_more_io, &wb->b_io); > + list_splice_init(&wb->b_more_io_wait, &wb->b_io); > moved = move_expired_inodes(&wb->b_dirty, &wb->b_io, older_than_this); > trace_writeback_queue_io(wb, older_than_this, moved); > } > --- linux-next.orig/include/linux/backing-dev.h 2011-08-26 19:27:20.000000000 +0800 > +++ linux-next/include/linux/backing-dev.h 2011-08-27 15:45:10.000000000 +0800 > @@ -59,6 +59,7 @@ struct bdi_writeback { > struct list_head b_dirty; /* dirty inodes */ > struct list_head b_io; /* parked for writeback */ > struct list_head b_more_io; /* parked for more writeback */ > + struct list_head b_more_io_wait;/* opportunistic retry io */ > spinlock_t list_lock; /* protects the b_* lists */ > }; > > @@ -129,9 +130,10 @@ extern struct list_head bdi_pending_list > > static inline int wb_has_dirty_io(struct bdi_writeback *wb) > { > - return !list_empty(&wb->b_dirty) || > - !list_empty(&wb->b_io) || > - !list_empty(&wb->b_more_io); > + return !list_empty(&wb->b_dirty) || > + !list_empty(&wb->b_io) || > + !list_empty(&wb->b_more_io) || > + !list_empty(&wb->b_more_io_wait); > } > > static inline void __add_bdi_stat(struct backing_dev_info *bdi, > --- linux-next.orig/mm/backing-dev.c 2011-08-26 19:27:20.000000000 +0800 > +++ linux-next/mm/backing-dev.c 2011-08-27 15:45:10.000000000 +0800 > @@ -74,10 +74,10 @@ static int bdi_debug_stats_show(struct s > unsigned long background_thresh; > unsigned long dirty_thresh; > unsigned long bdi_thresh; > - unsigned long nr_dirty, nr_io, nr_more_io; > + unsigned long nr_dirty, nr_io, nr_more_io, nr_more_io_wait; > struct inode *inode; > > - 
nr_dirty = nr_io = nr_more_io = 0; > + nr_dirty = nr_io = nr_more_io = nr_more_io_wait = 0; > spin_lock(&wb->list_lock); > list_for_each_entry(inode, &wb->b_dirty, i_wb_list) > nr_dirty++; > @@ -85,6 +85,8 @@ static int bdi_debug_stats_show(struct s > nr_io++; > list_for_each_entry(inode, &wb->b_more_io, i_wb_list) > nr_more_io++; > + list_for_each_entry(inode, &wb->b_more_io_wait, i_wb_list) > + nr_more_io_wait++; > spin_unlock(&wb->list_lock); > > global_dirty_limits(&background_thresh, &dirty_thresh); > @@ -103,6 +105,7 @@ static int bdi_debug_stats_show(struct s > "b_dirty: %10lu\n" > "b_io: %10lu\n" > "b_more_io: %10lu\n" > + "b_more_io_wait: %10lu\n" > "bdi_list: %10u\n" > "state: %10lx\n", > (unsigned long) K(bdi_stat(bdi, BDI_WRITEBACK)), > @@ -116,6 +119,7 @@ static int bdi_debug_stats_show(struct s > nr_dirty, > nr_io, > nr_more_io, > + nr_more_io_wait, > !list_empty(&bdi->bdi_list), bdi->state); > #undef K > > @@ -635,6 +639,7 @@ static void bdi_wb_init(struct bdi_write > INIT_LIST_HEAD(&wb->b_dirty); > INIT_LIST_HEAD(&wb->b_io); > INIT_LIST_HEAD(&wb->b_more_io); > + INIT_LIST_HEAD(&wb->b_more_io_wait); > spin_lock_init(&wb->list_lock); > setup_timer(&wb->wakeup_timer, wakeup_timer_fn, (unsigned long)bdi); > } > @@ -701,6 +706,7 @@ void bdi_destroy(struct backing_dev_info > list_splice(&bdi->wb.b_dirty, &dst->b_dirty); > list_splice(&bdi->wb.b_io, &dst->b_io); > list_splice(&bdi->wb.b_more_io, &dst->b_more_io); > + list_splice(&bdi->wb.b_more_io_wait, &dst->b_more_io_wait); > spin_unlock(&bdi->wb.list_lock); > spin_unlock(&dst->list_lock); > } -- Jan Kara <jack@suse.cz> SUSE Labs, CR _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY
2011-09-05 11:11 ` Jan Kara
@ 2011-09-05 13:22 ` Wu Fengguang
-1 siblings, 0 replies; 17+ messages in thread
From: Wu Fengguang @ 2011-09-05 13:22 UTC (permalink / raw)
To: Jan Kara; +Cc: Christoph Hellwig, linux-fsdevel, xfs, Dave Chinner

On Mon, Sep 05, 2011 at 07:11:53PM +0800, Jan Kara wrote:
> On Sun 04-09-11 05:35:27, Wu Fengguang wrote:
> > On Sat, Sep 03, 2011 at 09:13:15AM +0800, Jan Kara wrote:
> > > On Sat 27-08-11 21:58:25, Wu Fengguang wrote:
> > > > Christoph,
> > > >
> > > > On Sat, Aug 27, 2011 at 02:14:09PM +0800, Christoph Hellwig wrote:
> > > > > Right now ->write_inode has no way to safely return EAGAIN without explicitly
> > > > > redirtying the inode, as we would lose the dirty state otherwise. Most
> > > > > filesystems get this wrong, but XFS makes heavy use of it to avoid blocking
> > > > > the flusher thread when ->write_inode hits contended inode locks. A
> > > > > contended ilock is something XFS can hit very easily when extending files, as
> > > > > the data I/O completion handler takes the lock to update the size, and the
> > > > > ->write_inode call can race with it fairly easily when writing enough data
> > > > > in one go that the completion for the first write comes in just before
> > > > > we call ->write_inode.
> > > > >
> > > > > Change the handling of this case to use requeue_io for a quick retry instead
> > > > > of redirty_tail, which keeps moving out the dirtied_when date and thus keeps
> > > > > delaying the writeout more and more with every failed attempt to get the lock.
> > > >
> > > > Yeah, redirty_tail() does have the problem of possibly delaying inodes for
> > > > too long. However, you know requeue_io() always has the danger of
> > > > triggering busy wb_writeback() loops in corner cases.
> > > > For example, nfs_write_inode()/nfs_commit_unstable_pages() often
> > > > redirty the inode without doing anything (hence no progress at all, a
> > > > prerequisite for busy loops) depending on the in-flight writes, which
> > > > unfortunately further depends on _external_ network/server states..
> > > > That means some stalled network/server state could lead to busy loops
> > > > in NFS clients.
> > > >
> > > > The alternative solution may be to first apply the attached patch,
> > > > and change this one to:
> > > >
> > > > -		redirty_tail(inode, wb);
> > > > +		requeue_io_wait(inode, wb);
> > > But your patch doesn't solve the busyloop when the problematic inodes are
> > > the only ones under writeback, does it? Then b_more_io and b_more_io_wait
> > > are effectively the same if I understand it right.
> >
> > The difference lies in the
> >
> > 		/*
> > 		 * No more inodes for IO, bail
> > 		 */
> > 		if (list_empty(&wb->b_more_io))
> > 			break;
> >
> > check in wb_writeback(). So when what's left are all b_more_io_wait
> > inodes, the above check will take effect and break the loop. This is
> > the tricky point of the patch: it relies on the code not touched by
> > the patch to work. I've updated the changelog to explain this.
> Argh, that's really subtle! Two points here.

Yes, it is..

> 1) We will immediately retry inodes in the b_more_io_wait list because of the
>        if (progress)
>            continue;
>    check.

That's right. I'm aware of this and think it's reasonable not to sleep as
long as there are b_more_io inodes to write. The typical situation will be
that the time taken to write the b_more_io inodes naturally serves as the
necessary delay before retrying the b_more_io_wait inodes.

> 2) The time writeout will be delayed is only <= dirty_expire_centisecs, but
>    can be arbitrarily small if someone submits more work.

Exactly.
> Also, if wb_writeback() was called from wb_do_writeback(), we will
> retry b_more_io_wait inodes twice immediately because of the
> wb_check_old_data_flush() and wb_check_background_flush() calls.

Right. However, given that wb_check_old_data_flush() is ratelimited by
dirty_writeback_interval=5s and wb_check_background_flush() is limited by
over_bground_thresh(), it's hardly a real problem.

> > > I think that busylooping in cases like these could be fixed by improving the
> > > busyloop prevention at the end of the loop in wb_writeback(). Maybe we
> > > could just take a short nap before continuing with writeback instead of /
> > > in addition to waiting for inode writeback. What do you think?
> >
> > That's a reasonable robust option, however at the cost of keeping the
> > writeback code in some ambiguous state ;)
> What exactly do you mean by ambiguous state?

I mean that in Christoph's case, it will be calling requeue_io() and at the
same time relying on your suggested unconditional sleep at the end of the
wb_writeback() loop to avoid a busy loop. In other words, b_more_io will be
holding both the inodes that should be busy-retried and the inodes to be
opportunistically retried. However, I admit it's not a big problem if we
take b_more_io as a general "to be retried ASAP" list.

> I don't see anything ambiguous in waiting for a jiffie or so. Not
> that I'd be completely happy about "just wait for a while and see if
> things are better" but your solution does not seem ideal either...

There are no big differences (that matter) in terms of exactly how long to
wait in this XFS case. What makes me prefer b_more_io_wait is that it looks
like a more general solution, replacing the majority of redirty_tail()
calls to avoid modifying dirtied_when.

Thanks,
Fengguang
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY 2011-09-05 13:22 ` Wu Fengguang @ 2011-09-07 11:52 ` Christoph Hellwig -1 siblings, 0 replies; 17+ messages in thread From: Christoph Hellwig @ 2011-09-07 11:52 UTC (permalink / raw) To: Wu Fengguang; +Cc: Christoph Hellwig, linux-fsdevel On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote: > > > That's a reasonable robust option, however at the cost of keeping the > > > writeback code in some ambiguous state ;) > > What do you exactly mean by ambiguous state? > > I mean in Christoph's case, it will be calling requeue_io() and at the > same time rely on your suggested unconditional sleep at the end of > wb_writeback() loop to avoid busy loop. Or in other words, b_more_io > will be holding both inodes that should be busy retried and the inodes > to be opportunistically retried. However I admit it's not a big > problem if we take b_more_io as general "to be retried ASAP". > > > I don't see anything ambiguous in waiting for a jiffie or so. Not > > that I'd be completely happy about "just wait for a while and see if > > things are better" but your solution does not seem ideal either... > > There are no big differences (that matter) in terms of "how much exact > time to wait" in this XFS case. What make me prefer b_more_io_wait is > that it looks a more general solution to replace the majority > redirty_tail() calls to avoid modifying dirtied_when. FYI, we had a few more users hit this issue recently. I'm not sure why, but we are seeing this fairly often now. I'd really like to get some sort of fix for this in ASAP as it causes data loss for users. _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY 2011-09-07 11:52 ` Christoph Hellwig (?) @ 2011-09-07 12:51 ` Wu Fengguang 2011-09-08 0:51 ` Jan Kara -1 siblings, 1 reply; 17+ messages in thread From: Wu Fengguang @ 2011-09-07 12:51 UTC (permalink / raw) To: Christoph Hellwig; +Cc: linux-fsdevel, Jan Kara, xfs On Wed, Sep 07, 2011 at 07:52:37PM +0800, Christoph Hellwig wrote: > On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote: > > > > That's a reasonable robust option, however at the cost of keeping the > > > > writeback code in some ambiguous state ;) > > > What do you exactly mean by ambiguous state? > > > > I mean in Christoph's case, it will be calling requeue_io() and at the > > same time rely on your suggested unconditional sleep at the end of > > wb_writeback() loop to avoid busy loop. Or in other words, b_more_io > > will be holding both inodes that should be busy retried and the inodes > > to be opportunistically retried. However I admit it's not a big > > problem if we take b_more_io as general "to be retried ASAP". > > > > > I don't see anything ambiguous in waiting for a jiffie or so. Not > > > that I'd be completely happy about "just wait for a while and see if > > > things are better" but your solution does not seem ideal either... > > > > There are no big differences (that matter) in terms of "how much exact > > time to wait" in this XFS case. What make me prefer b_more_io_wait is > > that it looks a more general solution to replace the majority > > redirty_tail() calls to avoid modifying dirtied_when. > > FYI, we had a few more users hit this issue recently. I'm not sure why, > but we are seeing this fairly often now. I'd really like to get some > sort of fix for this in ASAP as it causes data loss for users. Jan, do you agree to push the b_more_io_wait patch into linux-next? If not, let's do a patch to do unconditional sleep at the end of the wb_writeback() loop? 
Thanks, Fengguang _______________________________________________ xfs mailing list xfs@oss.sgi.com http://oss.sgi.com/mailman/listinfo/xfs ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY 2011-09-07 12:51 ` Wu Fengguang @ 2011-09-08 0:51 ` Jan Kara 0 siblings, 0 replies; 17+ messages in thread From: Jan Kara @ 2011-09-08 0:51 UTC (permalink / raw) To: Wu Fengguang Cc: Christoph Hellwig, Jan Kara, linux-fsdevel, xfs, Dave Chinner On Wed 07-09-11 20:51:05, Wu Fengguang wrote: > On Wed, Sep 07, 2011 at 07:52:37PM +0800, Christoph Hellwig wrote: > > On Mon, Sep 05, 2011 at 09:22:16PM +0800, Wu Fengguang wrote: > > > > > That's a reasonable robust option, however at the cost of keeping the > > > > > writeback code in some ambiguous state ;) > > > > What do you exactly mean by ambiguous state? > > > > > > I mean in Christoph's case, it will be calling requeue_io() and at the > > > same time rely on your suggested unconditional sleep at the end of > > > wb_writeback() loop to avoid busy loop. Or in other words, b_more_io > > > will be holding both inodes that should be busy retried and the inodes > > > to be opportunistically retried. However I admit it's not a big > > > problem if we take b_more_io as general "to be retried ASAP". > > > > > > > I don't see anything ambiguous in waiting for a jiffie or so. Not > > > > that I'd be completely happy about "just wait for a while and see if > > > > things are better" but your solution does not seem ideal either... > > > > > > There are no big differences (that matter) in terms of "how much exact > > > time to wait" in this XFS case. What make me prefer b_more_io_wait is > > > that it looks a more general solution to replace the majority > > > redirty_tail() calls to avoid modifying dirtied_when. > > > > FYI, we had a few more users hit this issue recently. I'm not sure why, > > but we are seeing this fairly often now. I'd really like to get some > > sort of fix for this in ASAP as it causes data loss for users. > > Jan, do you agree to push the b_more_io_wait patch into linux-next? 
> > If not, let's do a patch to do an unconditional sleep at the end of the
> > wb_writeback() loop?

Well, what I don't like about b_more_io_wait is that the logic shifting
inodes between lists becomes subtle and I'm afraid we could easily break it
in the future. Also, the times when inodes are retried are not so well
defined, although I agree that most likely that's not going to be a problem
in practice. So that's why I'd prefer the more robust approach of just
waiting in the loop when we couldn't make any progress. I've just sent a
patch which does that, and a patch which converts redirty_tail()s to
requeue_io() where it makes sense. Note that the writeback_single_inode()
change is a bit more complex, to keep livelock avoidance working. Please
have a look at whether the patches would be fine with you. Thanks.

								Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2011-09-08 0:52 UTC | newest]

Thread overview: 17+ messages
2011-08-27  6:14 [PATCH, RFC] writeback: avoid redirtying when ->write_inode failed to clear I_DIRTY Christoph Hellwig
2011-08-27  6:14 ` Christoph Hellwig
2011-08-27 13:58 ` Wu Fengguang
2011-08-27 13:58   ` Wu Fengguang
2011-09-03  1:13   ` Jan Kara
2011-09-03  1:13     ` Jan Kara
2011-09-03 21:35     ` Wu Fengguang
2011-09-03 21:35       ` Wu Fengguang
2011-09-05 11:11       ` Jan Kara
2011-09-05 11:11         ` Jan Kara
2011-09-05 13:22         ` Wu Fengguang
2011-09-05 13:22           ` Wu Fengguang
2011-09-07 11:52           ` Christoph Hellwig
2011-09-07 11:52             ` Christoph Hellwig
2011-09-07 12:51             ` Wu Fengguang
2011-09-08  0:51               ` Jan Kara
2011-09-08  0:51                 ` Jan Kara