All of lore.kernel.org
* [PATCH 00/10] remove xfsbufd
@ 2012-03-27 16:44 Christoph Hellwig
  2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
                   ` (10 more replies)
  0 siblings, 11 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

Now that all dirty metadata is tracked in the AIL, and except for a few
special cases only written through it, there is no point in keeping the
current delayed write buffer list and xfsbufd around.

This series removes a few more of the remaining special cases and then
replaces the global delwri buffer list with a local on-stack one.  The
main consumer is xfsaild, which is used more often now.

Besides removing a lot of code, this change reduces buffer cache lookups
from xfsaild on loaded systems, because we can now determine entirely
locally that a buffer is already under writeback.
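
The local-list idea can be sketched in plain userspace C.  This is a
simplified model with made-up names (struct buf, delwri_queue), not the
actual xfs_buf code: a buffer that is already queued has a non-empty list
node, so detecting "already under writeback" needs no cache lookup.

```c
#include <assert.h>
#include <stddef.h>

struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }
static int  list_empty(const struct list_node *n) { return n->next == n; }
static void list_add_tail(struct list_node *head, struct list_node *n)
{
	n->prev = head->prev; n->next = head;
	head->prev->next = n; head->prev = n;
}

struct buf {
	int id;
	struct list_node delwri_node;	/* non-empty => queued somewhere */
};

/*
 * Queue a buffer on a caller-provided (on-stack) delwri list.  Returns 0
 * if it was already queued, mirroring the "skip, it is already under
 * writeback" case - a purely local check on the buffer itself.
 */
static int delwri_queue(struct list_node *list, struct buf *bp)
{
	if (!list_empty(&bp->delwri_node))
		return 0;
	list_add_tail(list, &bp->delwri_node);
	return 1;
}
```
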

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-03-27 18:17   ` Mark Tinguely
  2012-04-13  9:36   ` Dave Chinner
  2012-03-27 16:44 ` [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush " Christoph Hellwig
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-delete-dquot-from-ail-earlier --]
[-- Type: text/plain, Size: 1683 bytes --]

If a filesystem has been forced shutdown we are never going to write dquots
to disk, which means the dquot items will stay in the AIL forever.
Currently that is not a problem, but a pending change requires us to
empty the AIL before shutting down the filesystem, in which case this
behaviour is lethal.  Make sure to remove the log item from the AIL
to allow emptying the AIL on shutdown filesystems.
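
The removal pattern used by this patch can be modelled in a few lines of
userspace C (illustrative names only; in the real code the lock is
mp->m_ail->xa_lock, and xfs_trans_ail_delete drops it itself):

```c
#include <assert.h>

#define LI_IN_AIL 0x1u

struct ail {
	int locked;	/* models the AIL spinlock */
	int nitems;
};

struct log_item {
	unsigned flags;
};

static void ail_lock(struct ail *ailp)   { assert(!ailp->locked); ailp->locked = 1; }
static void ail_unlock(struct ail *ailp) { assert(ailp->locked);  ailp->locked = 0; }

/* Like xfs_trans_ail_delete: entered with the lock held, drops it. */
static void ail_delete(struct ail *ailp, struct log_item *lip)
{
	lip->flags &= ~LI_IN_AIL;
	ailp->nitems--;
	ail_unlock(ailp);
}

/*
 * The shutdown path: remove the item if it is in the AIL, making sure
 * the lock is dropped exactly once on either branch.
 */
static void shutdown_remove(struct ail *ailp, struct log_item *lip)
{
	ail_lock(ailp);
	if (lip->flags & LI_IN_AIL)
		ail_delete(ailp, lip);	/* drops the lock */
	else
		ail_unlock(ailp);
}
```
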

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_dquot.c |   13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

Index: xfs/fs/xfs/xfs_dquot.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot.c	2012-02-23 17:52:53.916002428 -0800
+++ xfs/fs/xfs/xfs_dquot.c	2012-02-23 17:53:01.829335739 -0800
@@ -904,10 +904,21 @@ xfs_qm_dqflush(
 	/*
 	 * This may have been unpinned because the filesystem is shutting
 	 * down forcibly. If that's the case we must not write this dquot
-	 * to disk, because the log record didn't make it to disk!
+	 * to disk, because the log record didn't make it to disk.
+	 *
+	 * We also have to remove the log item from the AIL in this case,
+	 * as we wait for an empty AIL as part of the unmount process.
 	 */
 	if (XFS_FORCED_SHUTDOWN(mp)) {
+		struct xfs_log_item	*lip = &dqp->q_logitem.qli_item;
 		dqp->dq_flags &= ~XFS_DQ_DIRTY;
+
+		spin_lock(&mp->m_ail->xa_lock);
+		if (lip->li_flags & XFS_LI_IN_AIL)
+			xfs_trans_ail_delete(mp->m_ail, lip);
+		else
+			spin_unlock(&mp->m_ail->xa_lock);
+
 		xfs_dqfunlock(dqp);
 		return XFS_ERROR(EIO);
 	}


* [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush after a shutdown
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
  2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13  9:37   ` Dave Chinner
  2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
                   ` (8 subsequent siblings)
  10 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-delete-inode-from-ail-earlier --]
[-- Type: text/plain, Size: 3719 bytes --]

If a filesystem has been forced shutdown we are never going to write inodes
to disk, which means the inode items will stay in the AIL until we free
the inode. Currently that is not a problem, but a pending change requires us
to empty the AIL before shutting down the filesystem. In that case leaving
the inode in the AIL is lethal. Make sure to remove the log item from the AIL
to allow emptying the AIL on shutdown filesystems.

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_iget.c  |   18 +-----------------
 fs/xfs/xfs_inode.c |   17 +++++++++--------
 fs/xfs/xfs_sync.c  |    1 +
 3 files changed, 11 insertions(+), 25 deletions(-)

Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2012-03-16 12:44:56.037030588 +0100
+++ xfs/fs/xfs/xfs_inode.c	2012-03-16 12:47:03.697032954 +0100
@@ -2397,7 +2397,6 @@ xfs_iflush(
 	xfs_inode_t		*ip,
 	uint			flags)
 {
-	xfs_inode_log_item_t	*iip;
 	xfs_buf_t		*bp;
 	xfs_dinode_t		*dip;
 	xfs_mount_t		*mp;
@@ -2410,7 +2409,6 @@ xfs_iflush(
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_d.di_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 
-	iip = ip->i_itemp;
 	mp = ip->i_mount;
 
 	/*
@@ -2447,13 +2445,14 @@ xfs_iflush(
 	/*
 	 * This may have been unpinned because the filesystem is shutting
 	 * down forcibly. If that's the case we must not write this inode
-	 * to disk, because the log record didn't make it to disk!
+	 * to disk, because the log record didn't make it to disk.
+	 *
+	 * We also have to remove the log item from the AIL in this case,
+	 * as we wait for an empty AIL as part of the unmount process.
 	 */
 	if (XFS_FORCED_SHUTDOWN(mp)) {
-		if (iip)
-			iip->ili_fields = 0;
-		xfs_ifunlock(ip);
-		return XFS_ERROR(EIO);
+		error = XFS_ERROR(EIO);
+		goto abort_out;
 	}
 
 	/*
@@ -2500,11 +2499,13 @@ corrupt_out:
 	xfs_buf_relse(bp);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 cluster_corrupt_out:
+	error = XFS_ERROR(EFSCORRUPTED);
+abort_out:
 	/*
 	 * Unlocks the flush lock
 	 */
 	xfs_iflush_abort(ip);
-	return XFS_ERROR(EFSCORRUPTED);
+	return error;
 }
 
 
Index: xfs/fs/xfs/xfs_iget.c
===================================================================
--- xfs.orig/fs/xfs/xfs_iget.c	2012-03-16 12:44:56.047030587 +0100
+++ xfs/fs/xfs/xfs_iget.c	2012-03-16 12:47:03.697032954 +0100
@@ -123,23 +123,7 @@ xfs_inode_free(
 		xfs_idestroy_fork(ip, XFS_ATTR_FORK);
 
 	if (ip->i_itemp) {
-		/*
-		 * Only if we are shutting down the fs will we see an
-		 * inode still in the AIL. If it is there, we should remove
-		 * it to prevent a use-after-free from occurring.
-		 */
-		xfs_log_item_t	*lip = &ip->i_itemp->ili_item;
-		struct xfs_ail	*ailp = lip->li_ailp;
-
-		ASSERT(((lip->li_flags & XFS_LI_IN_AIL) == 0) ||
-				       XFS_FORCED_SHUTDOWN(ip->i_mount));
-		if (lip->li_flags & XFS_LI_IN_AIL) {
-			spin_lock(&ailp->xa_lock);
-			if (lip->li_flags & XFS_LI_IN_AIL)
-				xfs_trans_ail_delete(ailp, lip);
-			else
-				spin_unlock(&ailp->xa_lock);
-		}
+		ASSERT(!(ip->i_itemp->ili_item.li_flags & XFS_LI_IN_AIL));
 		xfs_inode_item_destroy(ip);
 		ip->i_itemp = NULL;
 	}
Index: xfs/fs/xfs/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/xfs_sync.c	2012-03-16 12:44:57.707030619 +0100
+++ xfs/fs/xfs/xfs_sync.c	2012-03-16 12:47:03.697032954 +0100
@@ -783,6 +783,7 @@ restart:
 		goto reclaim;
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
 		xfs_iunpin_wait(ip);
+		xfs_iflush_abort(ip);
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {


* [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
  2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
  2012-03-27 16:44 ` [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush " Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-03-27 18:18   ` Mark Tinguely
  2012-04-13  9:42   ` Dave Chinner
  2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
                   ` (7 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-lockless-xlog_assign_tail_lsn --]
[-- Type: text/plain, Size: 5206 bytes --]

Provide a variant of xlog_assign_tail_lsn for callers that already hold
the AIL lock.  By doing so we only perform an additional atomic_read +
atomic_set under the lock, which comes down to two instructions.

Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the
new version to reduce the number of lock roundtrips, and prepare for
a new addition that would require a third lock roundtrip in
xfs_trans_ail_delete_bulk.  This addition is also the reason for
slightly rearranging the conditionals and relying on xfs_log_space_wake
for checking that the filesystem has been shut down internally.
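
The locked/unlocked split introduced here can be sketched as a userspace
model (simplified types, the spinlock modelled as a plain flag; names are
stand-ins for xlog_assign_tail_lsn and friends):

```c
#include <assert.h>

struct ail {
	int  locked;		/* models the AIL spinlock */
	long min_item_lsn;	/* LSN of the first AIL item, 0 if empty */
	long last_sync_lsn;
	long tail_lsn;
};

/* The _locked variant: expects the lock held, does the actual work. */
static long assign_tail_lsn_locked(struct ail *ailp)
{
	long tail;

	assert(ailp->locked);
	tail = ailp->min_item_lsn ? ailp->min_item_lsn
				  : ailp->last_sync_lsn;
	ailp->tail_lsn = tail;
	return tail;
}

/* The plain variant is a thin wrapper taking and dropping the lock. */
static long assign_tail_lsn(struct ail *ailp)
{
	long tail;

	assert(!ailp->locked); ailp->locked = 1;
	tail = assign_tail_lsn_locked(ailp);
	ailp->locked = 0;
	return tail;
}
```

Callers that already hold the lock call the _locked variant directly and
save a lock roundtrip, which is exactly what the bulk AIL update and
delete paths below do.
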

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_log.c        |   31 +++++++++++++++++++++++--------
 fs/xfs/xfs_log.h        |    1 +
 fs/xfs/xfs_trans_ail.c  |   22 +++++++++++++++-------
 fs/xfs/xfs_trans_priv.h |    1 +
 4 files changed, 40 insertions(+), 15 deletions(-)

Index: xfs/fs/xfs/xfs_log.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log.c	2012-03-16 12:44:55.880363918 +0100
+++ xfs/fs/xfs/xfs_log.c	2012-03-16 12:50:24.040370003 +0100
@@ -915,27 +915,42 @@ xfs_log_need_covered(xfs_mount_t *mp)
  * We may be holding the log iclog lock upon entering this routine.
  */
 xfs_lsn_t
-xlog_assign_tail_lsn(
+xlog_assign_tail_lsn_locked(
 	struct xfs_mount	*mp)
 {
-	xfs_lsn_t		tail_lsn;
 	struct log		*log = mp->m_log;
+	struct xfs_log_item	*lip;
+	xfs_lsn_t		tail_lsn;
+
+	assert_spin_locked(&mp->m_ail->xa_lock);
 
 	/*
 	 * To make sure we always have a valid LSN for the log tail we keep
 	 * track of the last LSN which was committed in log->l_last_sync_lsn,
-	 * and use that when the AIL was empty and xfs_ail_min_lsn returns 0.
-	 *
-	 * If the AIL has been emptied we also need to wake any process
-	 * waiting for this condition.
+	 * and use that when the AIL was empty.
 	 */
-	tail_lsn = xfs_ail_min_lsn(mp->m_ail);
-	if (!tail_lsn)
+	lip = xfs_ail_min(mp->m_ail);
+	if (lip)
+		tail_lsn = lip->li_lsn;
+	else
 		tail_lsn = atomic64_read(&log->l_last_sync_lsn);
 	atomic64_set(&log->l_tail_lsn, tail_lsn);
 	return tail_lsn;
 }
 
+xfs_lsn_t
+xlog_assign_tail_lsn(
+	struct xfs_mount	*mp)
+{
+	xfs_lsn_t		tail_lsn;
+
+	spin_lock(&mp->m_ail->xa_lock);
+	tail_lsn = xlog_assign_tail_lsn_locked(mp);
+	spin_unlock(&mp->m_ail->xa_lock);
+
+	return tail_lsn;
+}
+
 /*
  * Return the space in the log between the tail and the head.  The head
  * is passed in the cycle/bytes formal parms.  In the special case where
Index: xfs/fs/xfs/xfs_log.h
===================================================================
--- xfs.orig/fs/xfs/xfs_log.h	2012-03-16 12:44:55.893697252 +0100
+++ xfs/fs/xfs/xfs_log.h	2012-03-16 12:47:09.127033055 +0100
@@ -152,6 +152,7 @@ int	  xfs_log_mount(struct xfs_mount	*mp
 			int		 	num_bblocks);
 int	  xfs_log_mount_finish(struct xfs_mount *mp);
 xfs_lsn_t xlog_assign_tail_lsn(struct xfs_mount *mp);
+xfs_lsn_t xlog_assign_tail_lsn_locked(struct xfs_mount *mp);
 void	  xfs_log_space_wake(struct xfs_mount *mp);
 int	  xfs_log_notify(struct xfs_mount	*mp,
 			 struct xlog_in_core	*iclog,
Index: xfs/fs/xfs/xfs_trans_ail.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_ail.c	2012-03-16 12:44:55.917030586 +0100
+++ xfs/fs/xfs/xfs_trans_ail.c	2012-03-16 12:50:20.483703269 +0100
@@ -79,7 +79,7 @@ xfs_ail_check(
  * Return a pointer to the first item in the AIL.  If the AIL is empty, then
  * return NULL.
  */
-static xfs_log_item_t *
+xfs_log_item_t *
 xfs_ail_min(
 	struct xfs_ail  *ailp)
 {
@@ -667,11 +667,15 @@ xfs_trans_ail_update_bulk(
 
 	if (!list_empty(&tmp))
 		xfs_ail_splice(ailp, cur, &tmp, lsn);
-	spin_unlock(&ailp->xa_lock);
 
-	if (mlip_changed && !XFS_FORCED_SHUTDOWN(ailp->xa_mount)) {
-		xlog_assign_tail_lsn(ailp->xa_mount);
+	if (mlip_changed) {
+		if (!XFS_FORCED_SHUTDOWN(ailp->xa_mount))
+			xlog_assign_tail_lsn_locked(ailp->xa_mount);
+		spin_unlock(&ailp->xa_lock);
+
 		xfs_log_space_wake(ailp->xa_mount);
+	} else {
+		spin_unlock(&ailp->xa_lock);
 	}
 }
 
@@ -729,11 +733,15 @@ xfs_trans_ail_delete_bulk(
 		if (mlip == lip)
 			mlip_changed = 1;
 	}
-	spin_unlock(&ailp->xa_lock);
 
-	if (mlip_changed && !XFS_FORCED_SHUTDOWN(ailp->xa_mount)) {
-		xlog_assign_tail_lsn(ailp->xa_mount);
+	if (mlip_changed) {
+		if (!XFS_FORCED_SHUTDOWN(ailp->xa_mount))
+			xlog_assign_tail_lsn_locked(ailp->xa_mount);
+		spin_unlock(&ailp->xa_lock);
+
 		xfs_log_space_wake(ailp->xa_mount);
+	} else {
+		spin_unlock(&ailp->xa_lock);
 	}
 }
 
Index: xfs/fs/xfs/xfs_trans_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_priv.h	2012-03-16 12:44:55.943697253 +0100
+++ xfs/fs/xfs/xfs_trans_priv.h	2012-03-16 12:49:31.993702371 +0100
@@ -102,6 +102,7 @@ xfs_trans_ail_delete(
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
+struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
 xfs_lsn_t		xfs_ail_min_lsn(struct xfs_ail *ailp);
 
 struct xfs_log_item *	xfs_trans_ail_cursor_first(struct xfs_ail *ailp,


* [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (2 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 10:04   ` Dave Chinner
                     ` (2 more replies)
  2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
                   ` (6 subsequent siblings)
  10 siblings, 3 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-empty-ail-on-freeze --]
[-- Type: text/plain, Size: 11915 bytes --]

Now that we write back all metadata either synchronously or through the AIL
we can simply implement metadata freezing in terms of emptying the AIL.

The implementation is fairly simple and straightforward:  A new routine
is added that increments a counter telling xfsaild not to stop until the
AIL is empty, and then waits for a wakeup from
xfs_trans_ail_delete_bulk signalling that the AIL is empty.

As usual the devil is in the details, in this case the filesystem shutdown
code.  Currently we are a bit sloppy there and do not continue AIL pushing
in that case, and thus never reach the code in the log item implementations
that can unwind in case of a shut down filesystem.  The code to
abort inode and dquot flushes was also rather sloppy before and did not
remove the log items from the AIL, which had to be fixed as well.

Also treat unmount the same way as freeze now, except that we still keep a
synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.

As an upside we can now remove the radix tree based inode writeback and
xfs_unmountfs_writesb.
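
The counter-plus-wakeup mechanism can be reduced to a minimal
single-threaded userspace model (the kernel version sleeps on the
xa_empty wait queue between pushes and wakes the xfsaild task; all names
here are illustrative):

```c
#include <assert.h>

struct ail {
	int nitems;	/* items currently in the AIL */
	int wait_empty;	/* >0: pushers must target the whole AIL */
};

/*
 * One pass of the pusher: normally it pushes up to a target (modelled
 * here as a single item); when a drain is requested it pushes everything.
 */
static void aild_push(struct ail *ailp)
{
	if (ailp->wait_empty)
		ailp->nitems = 0;
	else if (ailp->nitems)
		ailp->nitems--;
}

/*
 * Model of xfs_ail_push_all_sync: bump the counter so that multiple
 * concurrent callers compose, then loop until the AIL is drained.
 */
static void ail_push_all_sync(struct ail *ailp)
{
	ailp->wait_empty++;
	while (ailp->nitems)
		aild_push(ailp);
	ailp->wait_empty--;
}
```
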

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_mount.c      |   56 ++++++-----------------------
 fs/xfs/xfs_mount.h      |    1 
 fs/xfs/xfs_sync.c       |   90 ++++--------------------------------------------
 fs/xfs/xfs_trans_ail.c  |   49 ++++++++++++++++++++++----
 fs/xfs/xfs_trans_priv.h |    3 +
 5 files changed, 65 insertions(+), 134 deletions(-)

Index: xfs/fs/xfs/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/xfs_sync.c	2012-03-25 16:41:19.484551080 +0200
+++ xfs/fs/xfs/xfs_sync.c	2012-03-25 17:23:59.994598559 +0200
@@ -241,45 +241,6 @@ xfs_sync_inode_data(
 	return error;
 }
 
-STATIC int
-xfs_sync_inode_attr(
-	struct xfs_inode	*ip,
-	struct xfs_perag	*pag,
-	int			flags)
-{
-	int			error = 0;
-
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-	if (xfs_inode_clean(ip))
-		goto out_unlock;
-	if (!xfs_iflock_nowait(ip)) {
-		if (!(flags & SYNC_WAIT))
-			goto out_unlock;
-		xfs_iflock(ip);
-	}
-
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		goto out_unlock;
-	}
-
-	error = xfs_iflush(ip, flags);
-
-	/*
-	 * We don't want to try again on non-blocking flushes that can't run
-	 * again immediately. If an inode really must be written, then that's
-	 * what the SYNC_WAIT flag is for.
-	 */
-	if (error == EAGAIN) {
-		ASSERT(!(flags & SYNC_WAIT));
-		error = 0;
-	}
-
- out_unlock:
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	return error;
-}
-
 /*
  * Write out pagecache data for the whole filesystem.
  */
@@ -300,19 +261,6 @@ xfs_sync_data(
 	return 0;
 }
 
-/*
- * Write out inode metadata (attributes) for the whole filesystem.
- */
-STATIC int
-xfs_sync_attr(
-	struct xfs_mount	*mp,
-	int			flags)
-{
-	ASSERT((flags & ~SYNC_WAIT) == 0);
-
-	return xfs_inode_ag_iterator(mp, xfs_sync_inode_attr, flags);
-}
-
 STATIC int
 xfs_sync_fsdata(
 	struct xfs_mount	*mp)
@@ -350,7 +298,7 @@ xfs_sync_fsdata(
  * First stage of freeze - no writers will make progress now we are here,
  * so we flush delwri and delalloc buffers here, then wait for all I/O to
  * complete.  Data is frozen at that point. Metadata is not frozen,
- * transactions can still occur here so don't bother flushing the buftarg
+ * transactions can still occur here so don't bother emptying the AIL
  * because it'll just get dirty again.
  */
 int
@@ -379,33 +327,6 @@ xfs_quiesce_data(
 	return error ? error : error2;
 }
 
-STATIC void
-xfs_quiesce_fs(
-	struct xfs_mount	*mp)
-{
-	int	count = 0, pincount;
-
-	xfs_reclaim_inodes(mp, 0);
-	xfs_flush_buftarg(mp->m_ddev_targp, 0);
-
-	/*
-	 * This loop must run at least twice.  The first instance of the loop
-	 * will flush most meta data but that will generate more meta data
-	 * (typically directory updates).  Which then must be flushed and
-	 * logged before we can write the unmount record. We also so sync
-	 * reclaim of inodes to catch any that the above delwri flush skipped.
-	 */
-	do {
-		xfs_reclaim_inodes(mp, SYNC_WAIT);
-		xfs_sync_attr(mp, SYNC_WAIT);
-		pincount = xfs_flush_buftarg(mp->m_ddev_targp, 1);
-		if (!pincount) {
-			delay(50);
-			count++;
-		}
-	} while (count < 2);
-}
-
 /*
  * Second stage of a quiesce. The data is already synced, now we have to take
  * care of the metadata. New transactions are already blocked, so we need to
@@ -421,8 +342,8 @@ xfs_quiesce_attr(
 	while (atomic_read(&mp->m_active_trans) > 0)
 		delay(100);
 
-	/* flush inodes and push all remaining buffers out to disk */
-	xfs_quiesce_fs(mp);
+	/* flush all pending changes from the AIL */
+	xfs_ail_push_all_sync(mp->m_ail);
 
 	/*
 	 * Just warn here till VFS can correctly support
@@ -436,7 +357,12 @@ xfs_quiesce_attr(
 		xfs_warn(mp, "xfs_attr_quiesce: failed to log sb changes. "
 				"Frozen image may not be consistent.");
 	xfs_log_unmount_write(mp);
-	xfs_unmountfs_writesb(mp);
+
+	/*
+	 * At this point we might have modified the superblock again and thus
+	 * added an item to the AIL, thus flush it again.
+	 */
+	xfs_ail_push_all_sync(mp->m_ail);
 }
 
 static void
Index: xfs/fs/xfs/xfs_trans_ail.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_ail.c	2012-03-25 16:41:19.917884421 +0200
+++ xfs/fs/xfs/xfs_trans_ail.c	2012-03-25 17:23:21.141264505 +0200
@@ -383,9 +383,8 @@ xfsaild_push(
 		spin_lock(&ailp->xa_lock);
 	}
 
-	target = ailp->xa_target;
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
-	if (!lip || XFS_FORCED_SHUTDOWN(mp)) {
+	if (!lip) {
 		/*
 		 * AIL is empty or our push has reached the end.
 		 */
@@ -397,6 +396,15 @@ xfsaild_push(
 	XFS_STATS_INC(xs_push_ail);
 
 	/*
+	 * If we are draining the AIL push all items, not just the current
+	 * threshold.
+	 */
+	if (atomic_read(&ailp->xa_wait_empty))
+		target = xfs_ail_max(ailp)->li_lsn;
+	else
+		target = ailp->xa_target;
+
+	/*
 	 * While the item we are looking at is below the given threshold
 	 * try to flush it out. We'd like not to stop until we've at least
 	 * tried to push on everything in the AIL with an LSN less than
@@ -466,11 +474,6 @@ xfsaild_push(
 		}
 
 		spin_lock(&ailp->xa_lock);
-		/* should we bother continuing? */
-		if (XFS_FORCED_SHUTDOWN(mp))
-			break;
-		ASSERT(mp->m_log);
-
 		count++;
 
 		/*
@@ -611,6 +614,34 @@ xfs_ail_push_all(
 }
 
 /*
+ * Push out all items in the AIL immediately and wait until the AIL is empty.
+ */
+void
+xfs_ail_push_all_sync(
+	struct xfs_ail  *ailp)
+{
+	DEFINE_WAIT(wait);
+
+	/*
+	 * We use a counter instead of a flag here to support multiple
+	 * processes calling into sync at the same time.
+	 */
+	atomic_inc(&ailp->xa_wait_empty);
+	do {
+		prepare_to_wait(&ailp->xa_empty, &wait, TASK_UNINTERRUPTIBLE);
+
+		wake_up_process(ailp->xa_task);
+
+		if (!xfs_ail_min_lsn(ailp))
+			break;
+		schedule();
+	} while (xfs_ail_min_lsn(ailp));
+	atomic_dec(&ailp->xa_wait_empty);
+
+	finish_wait(&ailp->xa_empty, &wait);
+}
+
+/*
  * xfs_trans_ail_update - bulk AIL insertion operation.
  *
  * @xfs_trans_ail_update takes an array of log items that all need to be
@@ -737,6 +768,8 @@ xfs_trans_ail_delete_bulk(
 	if (mlip_changed) {
 		if (!XFS_FORCED_SHUTDOWN(ailp->xa_mount))
 			xlog_assign_tail_lsn_locked(ailp->xa_mount);
+		if (list_empty(&ailp->xa_ail))
+			wake_up_all(&ailp->xa_empty);
 		spin_unlock(&ailp->xa_lock);
 
 		xfs_log_space_wake(ailp->xa_mount);
@@ -773,6 +806,8 @@ xfs_trans_ail_init(
 	INIT_LIST_HEAD(&ailp->xa_ail);
 	INIT_LIST_HEAD(&ailp->xa_cursors);
 	spin_lock_init(&ailp->xa_lock);
+	init_waitqueue_head(&ailp->xa_empty);
+	atomic_set(&ailp->xa_wait_empty, 0);
 
 	ailp->xa_task = kthread_run(xfsaild, ailp, "xfsaild/%s",
 			ailp->xa_mount->m_fsname);
Index: xfs/fs/xfs/xfs_trans_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_priv.h	2012-03-25 16:41:19.921217755 +0200
+++ xfs/fs/xfs/xfs_trans_priv.h	2012-03-25 17:23:21.167931172 +0200
@@ -71,6 +71,8 @@ struct xfs_ail {
 	spinlock_t		xa_lock;
 	xfs_lsn_t		xa_last_pushed_lsn;
 	int			xa_log_flush;
+	wait_queue_head_t	xa_empty;
+	atomic_t		xa_wait_empty;
 };
 
 /*
@@ -102,6 +104,7 @@ xfs_trans_ail_delete(
 
 void			xfs_ail_push(struct xfs_ail *, xfs_lsn_t);
 void			xfs_ail_push_all(struct xfs_ail *);
+void			xfs_ail_push_all_sync(struct xfs_ail *);
 struct xfs_log_item	*xfs_ail_min(struct xfs_ail  *ailp);
 xfs_lsn_t		xfs_ail_min_lsn(struct xfs_ail *ailp);
 
Index: xfs/fs/xfs/xfs_mount.c
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.c	2012-03-25 16:41:00.881217402 +0200
+++ xfs/fs/xfs/xfs_mount.c	2012-03-25 16:41:20.901217773 +0200
@@ -22,6 +22,7 @@
 #include "xfs_log.h"
 #include "xfs_inum.h"
 #include "xfs_trans.h"
+#include "xfs_trans_priv.h"
 #include "xfs_sb.h"
 #include "xfs_ag.h"
 #include "xfs_dir2.h"
@@ -1475,15 +1476,15 @@ xfs_unmountfs(
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
 	/*
-	 * Do a delwri reclaim pass first so that as many dirty inodes are
-	 * queued up for IO as possible. Then flush the buffers before making
-	 * a synchronous path to catch all the remaining inodes are reclaimed.
-	 * This makes the reclaim process as quick as possible by avoiding
-	 * synchronous writeout and blocking on inodes already in the delwri
-	 * state as much as possible.
+	 * Flush all pending changes from the AIL.
+	 */
+	xfs_ail_push_all_sync(mp->m_ail);
+
+	/*
+	 * And reclaim all inodes.  At this point there should be no dirty
+	 * inode, and none should be pinned or locked, but use synchronous
+	 * reclaim just to be sure.
 	 */
-	xfs_reclaim_inodes(mp, 0);
-	xfs_flush_buftarg(mp->m_ddev_targp, 1);
 	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	xfs_qm_unmount(mp);
@@ -1519,15 +1520,12 @@ xfs_unmountfs(
 	if (error)
 		xfs_warn(mp, "Unable to update superblock counters. "
 				"Freespace may not be correct on next mount.");
-	xfs_unmountfs_writesb(mp);
 
 	/*
-	 * Make sure all buffers have been flushed and completed before
-	 * unmounting the log.
+	 * At this point we might have modified the superblock again and thus
+	 * added an item to the AIL, thus flush it again.
 	 */
-	error = xfs_flush_buftarg(mp->m_ddev_targp, 1);
-	if (error)
-		xfs_warn(mp, "%d busy buffers during unmount.", error);
+	xfs_ail_push_all_sync(mp->m_ail);
 	xfs_wait_buftarg(mp->m_ddev_targp);
 
 	xfs_log_unmount_write(mp);
@@ -1588,36 +1586,6 @@ xfs_log_sbcount(xfs_mount_t *mp)
 	return error;
 }
 
-int
-xfs_unmountfs_writesb(xfs_mount_t *mp)
-{
-	xfs_buf_t	*sbp;
-	int		error = 0;
-
-	/*
-	 * skip superblock write if fs is read-only, or
-	 * if we are doing a forced umount.
-	 */
-	if (!((mp->m_flags & XFS_MOUNT_RDONLY) ||
-		XFS_FORCED_SHUTDOWN(mp))) {
-
-		sbp = xfs_getsb(mp, 0);
-
-		XFS_BUF_UNDONE(sbp);
-		XFS_BUF_UNREAD(sbp);
-		xfs_buf_delwri_dequeue(sbp);
-		XFS_BUF_WRITE(sbp);
-		XFS_BUF_UNASYNC(sbp);
-		ASSERT(sbp->b_target == mp->m_ddev_targp);
-		xfsbdstrat(mp, sbp);
-		error = xfs_buf_iowait(sbp);
-		if (error)
-			xfs_buf_ioerror_alert(sbp, __func__);
-		xfs_buf_relse(sbp);
-	}
-	return error;
-}
-
 /*
  * xfs_mod_sb() can be used to copy arbitrary changes to the
  * in-core superblock into the superblock buffer to be logged.
Index: xfs/fs/xfs/xfs_mount.h
===================================================================
--- xfs.orig/fs/xfs/xfs_mount.h	2012-03-25 16:41:00.897884068 +0200
+++ xfs/fs/xfs/xfs_mount.h	2012-03-25 16:41:20.901217773 +0200
@@ -378,7 +378,6 @@ extern __uint64_t xfs_default_resblks(xf
 extern int	xfs_mountfs(xfs_mount_t *mp);
 
 extern void	xfs_unmountfs(xfs_mount_t *);
-extern int	xfs_unmountfs_writesb(xfs_mount_t *);
 extern int	xfs_mod_incore_sb(xfs_mount_t *, xfs_sb_field_t, int64_t, int);
 extern int	xfs_mod_incore_sb_batch(xfs_mount_t *, xfs_mod_sb_t *,
 			uint, int);


* [PATCH 05/10] xfs: do flush inodes from background inode reclaim
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (3 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 10:14   ` Dave Chinner
  2012-04-16 19:25   ` Mark Tinguely
  2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-do-not-write-inodes-from-async-reclaim --]
[-- Type: text/plain, Size: 6468 bytes --]

We already flush dirty inodes through the AIL regularly, so there is no
reason to have a second thread compete with it and disturb the I/O
pattern.  We still write inodes when doing a synchronous reclaim from
the shrinker or during unmount for now.
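
The resulting reclaim decision can be condensed into a small sketch
(illustrative only; the real code in xfs_reclaim_inode also handles the
bad, shutdown and pinned states listed in the table further down):

```c
#include <assert.h>

enum action { REQUEUE, RECLAIM, FLUSH_AND_RECLAIM };

/*
 * Decision for an unpinned inode once the locks are held: stale or clean
 * inodes are reclaimed outright; dirty inodes are flushed only by a
 * blocking (SYNC_WAIT) reclaim and otherwise requeued, leaving the
 * writeback to AIL pushing.
 */
static enum action reclaim_decide(int stale, int clean, int sync_wait)
{
	if (stale || clean)
		return RECLAIM;
	return sync_wait ? FLUSH_AND_RECLAIM : REQUEUE;
}
```
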

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_sync.c |  104 ++++++++++++++++++++++--------------------------------
 1 file changed, 43 insertions(+), 61 deletions(-)

Index: xfs/fs/xfs/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/xfs_sync.c	2012-03-24 17:57:12.447345506 +0100
+++ xfs/fs/xfs/xfs_sync.c	2012-03-24 18:03:14.510685553 +0100
@@ -628,11 +628,8 @@ xfs_reclaim_inode_grab(
 }
 
 /*
- * Inodes in different states need to be treated differently, and the return
- * value of xfs_iflush is not sufficient to get this right. The following table
- * lists the inode states and the reclaim actions necessary for non-blocking
- * reclaim:
- *
+ * Inodes in different states need to be treated differently. The following
+ * table lists the inode states and the reclaim actions necessary:
  *
  *	inode state	     iflush ret		required action
  *      ---------------      ----------         ---------------
@@ -642,9 +639,8 @@ xfs_reclaim_inode_grab(
  *	stale, unpinned		0		reclaim
  *	clean, pinned(*)	0		requeue
  *	stale, pinned		EAGAIN		requeue
- *	dirty, delwri ok	0		requeue
- *	dirty, delwri blocked	EAGAIN		requeue
- *	dirty, sync flush	0		reclaim
+ *	dirty, async		-		requeue
+ *	dirty, sync		0		reclaim
  *
  * (*) dgc: I don't think the clean, pinned state is possible but it gets
  * handled anyway given the order of checks implemented.
@@ -655,26 +651,23 @@ xfs_reclaim_inode_grab(
  *
  * Also, because we get the flush lock first, we know that any inode that has
  * been flushed delwri has had the flush completed by the time we check that
- * the inode is clean. The clean inode check needs to be done before flushing
- * the inode delwri otherwise we would loop forever requeuing clean inodes as
- * we cannot tell apart a successful delwri flush and a clean inode from the
- * return value of xfs_iflush().
- *
- * Note that because the inode is flushed delayed write by background
- * writeback, the flush lock may already be held here and waiting on it can
- * result in very long latencies. Hence for sync reclaims, where we wait on the
- * flush lock, the caller should push out delayed write inodes first before
- * trying to reclaim them to minimise the amount of time spent waiting. For
- * background relaim, we just requeue the inode for the next pass.
+ * the inode is clean.
+ *
+ * Note that because the inode is flushed delayed write by AIL pushing, the
+ * flush lock may already be held here and waiting on it can result in very
+ * long latencies.  Hence for sync reclaims, where we wait on the flush lock,
+ * the caller should push the AIL first before trying to reclaim inodes to
+ * minimise the amount of time spent waiting.  For background reclaim, we only
+ * bother to reclaim clean inodes anyway.
  *
  * Hence the order of actions after gaining the locks should be:
  *	bad		=> reclaim
  *	shutdown	=> unpin and reclaim
- *	pinned, delwri	=> requeue
+ *	pinned, async	=> requeue
  *	pinned, sync	=> unpin
  *	stale		=> reclaim
  *	clean		=> reclaim
- *	dirty, delwri	=> flush and requeue
+ *	dirty, async	=> requeue
  *	dirty, sync	=> flush, wait and reclaim
  */
 STATIC int
@@ -713,10 +706,8 @@ restart:
 		goto reclaim;
 	}
 	if (xfs_ipincount(ip)) {
-		if (!(sync_mode & SYNC_WAIT)) {
-			xfs_ifunlock(ip);
-			goto out;
-		}
+		if (!(sync_mode & SYNC_WAIT))
+			goto out_ifunlock;
 		xfs_iunpin_wait(ip);
 	}
 	if (xfs_iflags_test(ip, XFS_ISTALE))
@@ -725,6 +716,13 @@ restart:
 		goto reclaim;
 
 	/*
+	 * Never flush out dirty data during non-blocking reclaim, as it would
+	 * just contend with AIL pushing trying to do the same job.
+	 */
+	if (!(sync_mode & SYNC_WAIT))
+		goto out_ifunlock;
+
+	/*
 	 * Now we have an inode that needs flushing.
 	 *
 	 * We do a nonblocking flush here even if we are doing a SYNC_WAIT
@@ -742,42 +740,13 @@ restart:
 	 * pass through will see the stale flag set on the inode.
 	 */
 	error = xfs_iflush(ip, SYNC_TRYLOCK | sync_mode);
-	if (sync_mode & SYNC_WAIT) {
-		if (error == EAGAIN) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			/* backoff longer than in xfs_ifree_cluster */
-			delay(2);
-			goto restart;
-		}
-		xfs_iflock(ip);
-		goto reclaim;
+	if (error == EAGAIN) {
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		/* backoff longer than in xfs_ifree_cluster */
+		delay(2);
+		goto restart;
 	}
-
-	/*
-	 * When we have to flush an inode but don't have SYNC_WAIT set, we
-	 * flush the inode out using a delwri buffer and wait for the next
-	 * call into reclaim to find it in a clean state instead of waiting for
-	 * it now. We also don't return errors here - if the error is transient
-	 * then the next reclaim pass will flush the inode, and if the error
-	 * is permanent then the next sync reclaim will reclaim the inode and
-	 * pass on the error.
-	 */
-	if (error && error != EAGAIN && !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-		xfs_warn(ip->i_mount,
-			"inode 0x%llx background reclaim flush failed with %d",
-			(long long)ip->i_ino, error);
-	}
-out:
-	xfs_iflags_clear(ip, XFS_IRECLAIM);
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-	/*
-	 * We could return EAGAIN here to make reclaim rescan the inode tree in
-	 * a short while. However, this just burns CPU time scanning the tree
-	 * waiting for IO to complete and xfssyncd never goes back to the idle
-	 * state. Instead, return 0 to let the next scheduled background reclaim
-	 * attempt to reclaim the inode again.
-	 */
-	return 0;
+	xfs_iflock(ip);
 
 reclaim:
 	xfs_ifunlock(ip);
@@ -811,8 +780,21 @@ reclaim:
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 
 	xfs_inode_free(ip);
-
 	return error;
+
+out_ifunlock:
+	xfs_ifunlock(ip);
+out:
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * We could return EAGAIN here to make reclaim rescan the inode tree in
+	 * a short while. However, this just burns CPU time scanning the tree
+	 * waiting for IO to complete and xfssyncd never goes back to the idle
+	 * state. Instead, return 0 to let the next scheduled background reclaim
+	 * attempt to reclaim the inode again.
+	 */
+	return 0;
 }
 
 /*


* [PATCH 06/10] xfs: do not write the buffer from xfs_iflush
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (4 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 10:31   ` Dave Chinner
  2012-04-18 13:33   ` Mark Tinguely
  2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
                   ` (4 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-iflush-dont-write-buffer --]
[-- Type: text/plain, Size: 8160 bytes --]

Instead of writing the buffer directly from inside xfs_iflush, return it to
the caller and let the caller decide what to do with the buffer.  Also
remove the pincount check in xfs_iflush that all non-blocking callers
already implement, and the now unused flags parameter.
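For illustration only, the new calling convention is roughly the standalone
sketch below (the `iflush`/`buf` names are made up for the example, not the
kernel API): the flush routine hands the backing buffer to the caller through
an out parameter, and each caller then picks its own write policy.

```c
#include <assert.h>
#include <stddef.h>

struct buf { int dirty; int written; };

/* Flush sketch: return the backing buffer instead of writing it. */
static int iflush(struct buf *inode_buf, struct buf **bpp)
{
	*bpp = NULL;
	if (!inode_buf->dirty)
		return -1;		/* nothing to flush */
	*bpp = inode_buf;		/* caller now owns the write decision */
	return 0;
}

/* A synchronous caller (reclaim-style) writes the buffer immediately;
 * an async caller would queue it for delayed write instead. */
static int reclaim_flush(struct buf *b)
{
	struct buf *bp = NULL;
	int error = iflush(b, &bp);

	if (!error) {
		bp->written = 1;	/* stands in for xfs_bwrite() */
		bp->dirty = 0;
	}
	return error;
}
```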

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_inode.c      |   54 ++++++++++++++----------------------------------
 fs/xfs/xfs_inode.h      |    2 -
 fs/xfs/xfs_inode_item.c |   17 ++++++++++++++-
 fs/xfs/xfs_sync.c       |   29 +++++++++++++------------
 4 files changed, 48 insertions(+), 54 deletions(-)

Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2012-03-25 16:41:19.481217746 +0200
+++ xfs/fs/xfs/xfs_inode.c	2012-03-25 16:41:21.647884454 +0200
@@ -2384,22 +2384,22 @@ cluster_corrupt_out:
 }
 
 /*
- * xfs_iflush() will write a modified inode's changes out to the
- * inode's on disk home.  The caller must have the inode lock held
- * in at least shared mode and the inode flush completion must be
- * active as well.  The inode lock will still be held upon return from
- * the call and the caller is free to unlock it.
- * The inode flush will be completed when the inode reaches the disk.
- * The flags indicate how the inode's buffer should be written out.
+ * Flush dirty inode metadata into the backing buffer.
+ *
+ * The caller must have the inode lock and the inode flush lock held.  The
+ * inode lock will still be held upon return to the caller, and the inode
+ * flush lock will be released after the inode has reached the disk.
+ *
+ * The caller must write out the buffer returned in *bpp and release it.
  */
 int
 xfs_iflush(
-	xfs_inode_t		*ip,
-	uint			flags)
+	struct xfs_inode	*ip,
+	struct xfs_buf		**bpp)
 {
-	xfs_buf_t		*bp;
-	xfs_dinode_t		*dip;
-	xfs_mount_t		*mp;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_buf		*bp;
+	struct xfs_dinode	*dip;
 	int			error;
 
 	XFS_STATS_INC(xs_iflush_count);
@@ -2409,24 +2409,8 @@ xfs_iflush(
 	ASSERT(ip->i_d.di_format != XFS_DINODE_FMT_BTREE ||
 	       ip->i_d.di_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK));
 
-	mp = ip->i_mount;
+	*bpp = NULL;
 
-	/*
-	 * We can't flush the inode until it is unpinned, so wait for it if we
-	 * are allowed to block.  We know no one new can pin it, because we are
-	 * holding the inode lock shared and you need to hold it exclusively to
-	 * pin the inode.
-	 *
-	 * If we are not allowed to block, force the log out asynchronously so
-	 * that when we come back the inode will be unpinned. If other inodes
-	 * in the same cluster are dirty, they will probably write the inode
-	 * out for us if they occur after the log force completes.
-	 */
-	if (!(flags & SYNC_WAIT) && xfs_ipincount(ip)) {
-		xfs_iunpin(ip);
-		xfs_ifunlock(ip);
-		return EAGAIN;
-	}
 	xfs_iunpin_wait(ip);
 
 	/*
@@ -2458,8 +2442,7 @@ xfs_iflush(
 	/*
 	 * Get the buffer containing the on-disk inode.
 	 */
-	error = xfs_itobp(mp, NULL, ip, &dip, &bp,
-				(flags & SYNC_TRYLOCK) ? XBF_TRYLOCK : XBF_LOCK);
+	error = xfs_itobp(mp, NULL, ip, &dip, &bp, XBF_TRYLOCK);
 	if (error || !bp) {
 		xfs_ifunlock(ip);
 		return error;
@@ -2487,13 +2470,8 @@ xfs_iflush(
 	if (error)
 		goto cluster_corrupt_out;
 
-	if (flags & SYNC_WAIT)
-		error = xfs_bwrite(bp);
-	else
-		xfs_buf_delwri_queue(bp);
-
-	xfs_buf_relse(bp);
-	return error;
+	*bpp = bp;
+	return 0;
 
 corrupt_out:
 	xfs_buf_relse(bp);
Index: xfs/fs/xfs/xfs_inode.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.h	2012-03-25 16:41:00.701217399 +0200
+++ xfs/fs/xfs/xfs_inode.h	2012-03-25 16:41:21.647884454 +0200
@@ -528,7 +528,7 @@ int		xfs_iunlink(struct xfs_trans *, xfs
 
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
 void		xfs_iunpin_wait(xfs_inode_t *);
-int		xfs_iflush(xfs_inode_t *, uint);
+int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
 void		xfs_promote_inode(struct xfs_inode *);
 void		xfs_lock_inodes(xfs_inode_t **, int, uint);
 void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
Index: xfs/fs/xfs/xfs_inode_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode_item.c	2012-03-25 16:41:00.711217397 +0200
+++ xfs/fs/xfs/xfs_inode_item.c	2012-03-25 16:41:21.651217787 +0200
@@ -506,6 +506,15 @@ xfs_inode_item_trylock(
 	if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 		return XFS_ITEM_LOCKED;
 
+	/*
+	 * Re-check the pincount now that we stabilized the value by
+	 * taking the ilock.
+	 */
+	if (xfs_ipincount(ip) > 0) {
+		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		return XFS_ITEM_PINNED;
+	}
+
 	if (!xfs_iflock_nowait(ip)) {
 		/*
 		 * inode has already been flushed to the backing buffer,
@@ -666,6 +675,8 @@ xfs_inode_item_push(
 {
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
+	struct xfs_buf		*bp = NULL;
+	int			error;
 
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
 	ASSERT(xfs_isiflocked(ip));
@@ -689,7 +700,11 @@ xfs_inode_item_push(
 	 * will pull the inode from the AIL, mark it clean and unlock the flush
 	 * lock.
 	 */
-	(void) xfs_iflush(ip, SYNC_TRYLOCK);
+	error = xfs_iflush(ip, &bp);
+	if (!error) {
+		xfs_buf_delwri_queue(bp);
+		xfs_buf_relse(bp);
+	}
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 }
 
Index: xfs/fs/xfs/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/xfs_sync.c	2012-03-25 16:41:21.304551114 +0200
+++ xfs/fs/xfs/xfs_sync.c	2012-03-25 16:45:50.037889431 +0200
@@ -645,10 +645,6 @@ xfs_reclaim_inode_grab(
  * (*) dgc: I don't think the clean, pinned state is possible but it gets
  * handled anyway given the order of checks implemented.
  *
- * As can be seen from the table, the return value of xfs_iflush() is not
- * sufficient to correctly decide the reclaim action here. The checks in
- * xfs_iflush() might look like duplicates, but they are not.
- *
  * Also, because we get the flush lock first, we know that any inode that has
  * been flushed delwri has had the flush completed by the time we check that
  * the inode is clean.
@@ -676,7 +672,8 @@ xfs_reclaim_inode(
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	int	error;
+	struct xfs_buf		*bp = NULL;
+	int			error;
 
 restart:
 	error = 0;
@@ -725,29 +722,33 @@ restart:
 	/*
 	 * Now we have an inode that needs flushing.
 	 *
-	 * We do a nonblocking flush here even if we are doing a SYNC_WAIT
-	 * reclaim as we can deadlock with inode cluster removal.
+	 * Note that xfs_iflush will never block on the inode buffer lock, as
 	 * xfs_ifree_cluster() can lock the inode buffer before it locks the
-	 * ip->i_lock, and we are doing the exact opposite here. As a result,
-	 * doing a blocking xfs_itobp() to get the cluster buffer will result
+	 * ip->i_lock, and we are doing the exact opposite here.  As a result,
+	 * doing a blocking xfs_itobp() to get the cluster buffer would result
 	 * in an ABBA deadlock with xfs_ifree_cluster().
 	 *
 	 * As xfs_ifree_cluster() must gather all inodes that are active in the
 	 * cache to mark them stale, if we hit this case we don't actually want
 	 * to do IO here - we want the inode marked stale so we can simply
-	 * reclaim it. Hence if we get an EAGAIN error on a SYNC_WAIT flush,
-	 * just unlock the inode, back off and try again. Hopefully the next
-	 * pass through will see the stale flag set on the inode.
+	 * reclaim it.  Hence if we get an EAGAIN error here, just unlock the
+	 * inode, back off and try again.  Hopefully the next pass through will
+	 * see the stale flag set on the inode.
 	 */
-	error = xfs_iflush(ip, SYNC_TRYLOCK | sync_mode);
+	error = xfs_iflush(ip, &bp);
 	if (error == EAGAIN) {
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		/* backoff longer than in xfs_ifree_cluster */
 		delay(2);
 		goto restart;
 	}
-	xfs_iflock(ip);
 
+	if (!error) {
+		error = xfs_bwrite(bp);
+		xfs_buf_relse(bp);
+	}
+
+	xfs_iflock(ip);
 reclaim:
 	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);


* [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (5 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 10:33   ` Dave Chinner
  2012-04-18 21:11   ` Mark Tinguely
  2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-dqflush-dont-write-buffer --]
[-- Type: text/plain, Size: 6612 bytes --]

Instead of writing the buffer directly from inside xfs_qm_dqflush, return it
to the caller and let the caller decide what to do with the buffer.  Also
remove the pincount check in xfs_qm_dqflush that all non-blocking callers
already implement, the now unused flags parameter, and the XFS_DQ_IS_DIRTY
check that all callers already perform.
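The diff also adds a re-check of the pincount once the dquot lock is held:
the unlocked read done by the AIL is only a hint, and the value is only
stable under the lock.  A minimal userspace sketch of that double-check
pattern (made-up names, single-threaded stand-in for the real locking):

```c
#include <assert.h>
#include <stdbool.h>

struct dquot {
	bool locked;		/* stands in for the dquot lock */
	int pincount;		/* only stable while locked */
};

enum push_result { ITEM_LOCKED, ITEM_PINNED, ITEM_OK };

static bool dqlock_nowait(struct dquot *dqp)
{
	if (dqp->locked)
		return false;
	dqp->locked = true;
	return true;
}

/* Trylock path: any pincount read before locking may be stale, so
 * re-check it now that taking the lock has stabilized the value. */
static enum push_result trylock_item(struct dquot *dqp)
{
	if (!dqlock_nowait(dqp))
		return ITEM_LOCKED;

	if (dqp->pincount > 0) {
		dqp->locked = false;	/* unlock; the AIL will retry */
		return ITEM_PINNED;
	}
	return ITEM_OK;
}
```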

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_dquot.c      |   43 +++++++++++++------------------------------
 fs/xfs/xfs_dquot.h      |    2 +-
 fs/xfs/xfs_dquot_item.c |   21 +++++++++++++++++++--
 fs/xfs/xfs_qm.c         |   23 +++++++++++++++++++----
 4 files changed, 52 insertions(+), 37 deletions(-)

Index: xfs/fs/xfs/xfs_dquot.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot.c	2012-03-25 13:29:53.121004759 +0200
+++ xfs/fs/xfs/xfs_dquot.c	2012-03-25 13:36:25.947678708 +0200
@@ -878,8 +878,8 @@ xfs_qm_dqflush_done(
  */
 int
 xfs_qm_dqflush(
-	xfs_dquot_t		*dqp,
-	uint			flags)
+	struct xfs_dquot	*dqp,
+	struct xfs_buf		**bpp)
 {
 	struct xfs_mount	*mp = dqp->q_mount;
 	struct xfs_buf		*bp;
@@ -891,14 +891,8 @@ xfs_qm_dqflush(
 
 	trace_xfs_dqflush(dqp);
 
-	/*
-	 * If not dirty, or it's pinned and we are not supposed to block, nada.
-	 */
-	if (!XFS_DQ_IS_DIRTY(dqp) ||
-	    ((flags & SYNC_TRYLOCK) && atomic_read(&dqp->q_pincount) > 0)) {
-		xfs_dqfunlock(dqp);
-		return 0;
-	}
+	*bpp = NULL;
+
 	xfs_qm_dqunpin_wait(dqp);
 
 	/*
@@ -918,9 +912,8 @@ xfs_qm_dqflush(
 			xfs_trans_ail_delete(mp->m_ail, lip);
 		else
 			spin_unlock(&mp->m_ail->xa_lock);
-
-		xfs_dqfunlock(dqp);
-		return XFS_ERROR(EIO);
+		error = XFS_ERROR(EIO);
+		goto out_unlock;
 	}
 
 	/*
@@ -928,11 +921,8 @@ xfs_qm_dqflush(
 	 */
 	error = xfs_trans_read_buf(mp, NULL, mp->m_ddev_targp, dqp->q_blkno,
 				   mp->m_quotainfo->qi_dqchunklen, 0, &bp);
-	if (error) {
-		ASSERT(error != ENOENT);
-		xfs_dqfunlock(dqp);
-		return error;
-	}
+	if (error)
+		goto out_unlock;
 
 	/*
 	 * Calculate the location of the dquot inside the buffer.
@@ -978,20 +968,13 @@ xfs_qm_dqflush(
 		xfs_log_force(mp, 0);
 	}
 
-	if (flags & SYNC_WAIT)
-		error = xfs_bwrite(bp);
-	else
-		xfs_buf_delwri_queue(bp);
-
-	xfs_buf_relse(bp);
-
 	trace_xfs_dqflush_done(dqp);
+	*bpp = bp;
+	return 0;
 
-	/*
-	 * dqp is still locked, but caller is free to unlock it now.
-	 */
-	return error;
-
+out_unlock:
+	xfs_dqfunlock(dqp);
+	return XFS_ERROR(EIO);
 }
 
 /*
Index: xfs/fs/xfs/xfs_dquot.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot.h	2012-03-25 13:29:53.134338092 +0200
+++ xfs/fs/xfs/xfs_dquot.h	2012-03-25 13:31:55.844340367 +0200
@@ -141,7 +141,7 @@ static inline xfs_dquot_t *xfs_inode_dqu
 extern int		xfs_qm_dqread(struct xfs_mount *, xfs_dqid_t, uint,
 					uint, struct xfs_dquot	**);
 extern void		xfs_qm_dqdestroy(xfs_dquot_t *);
-extern int		xfs_qm_dqflush(xfs_dquot_t *, uint);
+extern int		xfs_qm_dqflush(struct xfs_dquot *, struct xfs_buf **);
 extern void		xfs_qm_dqunpin_wait(xfs_dquot_t *);
 extern void		xfs_qm_adjust_dqtimers(xfs_mount_t *,
 					xfs_disk_dquot_t *);
Index: xfs/fs/xfs/xfs_dquot_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot_item.c	2012-03-25 13:29:53.144338092 +0200
+++ xfs/fs/xfs/xfs_dquot_item.c	2012-03-25 13:36:38.981012285 +0200
@@ -119,10 +119,12 @@ xfs_qm_dquot_logitem_push(
 	struct xfs_log_item	*lip)
 {
 	struct xfs_dquot	*dqp = DQUOT_ITEM(lip)->qli_dquot;
+	struct xfs_buf		*bp = NULL;
 	int			error;
 
 	ASSERT(XFS_DQ_IS_LOCKED(dqp));
 	ASSERT(!completion_done(&dqp->q_flush));
+	ASSERT(atomic_read(&dqp->q_pincount) == 0);
 
 	/*
 	 * Since we were able to lock the dquot's flush lock and
@@ -133,10 +135,16 @@ xfs_qm_dquot_logitem_push(
 	 * lock without sleeping, then there must not have been
 	 * anyone in the process of flushing the dquot.
 	 */
-	error = xfs_qm_dqflush(dqp, SYNC_TRYLOCK);
-	if (error)
+	error = xfs_qm_dqflush(dqp, &bp);
+	if (error) {
 		xfs_warn(dqp->q_mount, "%s: push error %d on dqp %p",
 			__func__, error, dqp);
+		goto out_unlock;
+	}
+
+	xfs_buf_delwri_queue(bp);
+	xfs_buf_relse(bp);
+out_unlock:
 	xfs_dqunlock(dqp);
 }
 
@@ -239,6 +247,15 @@ xfs_qm_dquot_logitem_trylock(
 	if (!xfs_dqlock_nowait(dqp))
 		return XFS_ITEM_LOCKED;
 
+	/*
+	 * Re-check the pincount now that we stabilized the value by
+	 * taking the quota lock.
+	 */
+	if (atomic_read(&dqp->q_pincount) > 0) {
+		xfs_dqunlock(dqp);
+		return XFS_ITEM_PINNED;
+	}
+
 	if (!xfs_dqflock_nowait(dqp)) {
 		/*
 		 * dquot has already been flushed to the backing buffer,
Index: xfs/fs/xfs/xfs_qm.c
===================================================================
--- xfs.orig/fs/xfs/xfs_qm.c	2012-03-25 13:29:53.161004759 +0200
+++ xfs/fs/xfs/xfs_qm.c	2012-03-25 13:36:22.031011971 +0200
@@ -175,16 +175,21 @@ xfs_qm_dqpurge(
 	 * we're unmounting, we do care, so we flush it and wait.
 	 */
 	if (XFS_DQ_IS_DIRTY(dqp)) {
-		int	error;
+		struct xfs_buf	*bp = NULL;
+		int		error;
 
 		/*
 		 * We don't care about getting disk errors here. We need
 		 * to purge this dquot anyway, so we go ahead regardless.
 		 */
-		error = xfs_qm_dqflush(dqp, SYNC_WAIT);
+		error = xfs_qm_dqflush(dqp, &bp);
-		if (error)
+		if (error) {
 			xfs_warn(mp, "%s: dquot %p flush failed",
 				__func__, dqp);
+		} else {
+			error = xfs_bwrite(bp);
+			xfs_buf_relse(bp);
+		}
 		xfs_dqflock(dqp);
 	}
 
@@ -1184,6 +1189,7 @@ STATIC int
 xfs_qm_flush_one(
 	struct xfs_dquot	*dqp)
 {
+	struct xfs_buf		*bp = NULL;
 	int			error = 0;
 
 	xfs_dqlock(dqp);
@@ -1195,8 +1201,12 @@ xfs_qm_flush_one(
 	if (!xfs_dqflock_nowait(dqp))
 		xfs_dqflock_pushbuf_wait(dqp);
 
-	error = xfs_qm_dqflush(dqp, 0);
+	error = xfs_qm_dqflush(dqp, &bp);
+	if (error)
+		goto out_unlock;
 
+	xfs_buf_delwri_queue(bp);
+	xfs_buf_relse(bp);
 out_unlock:
 	xfs_dqunlock(dqp);
 	return error;
@@ -1463,18 +1473,23 @@ xfs_qm_dqreclaim_one(
 	 * dirty dquots.
 	 */
 	if (XFS_DQ_IS_DIRTY(dqp)) {
+		struct xfs_buf	*bp = NULL;
+
 		trace_xfs_dqreclaim_dirty(dqp);
 
 		/*
 		 * We flush it delayed write, so don't bother releasing the
 		 * freelist lock.
 		 */
-		error = xfs_qm_dqflush(dqp, 0);
+		error = xfs_qm_dqflush(dqp, &bp);
 		if (error) {
 			xfs_warn(mp, "%s: dquot %p flush failed",
 				 __func__, dqp);
+			goto out_busy;
 		}
 
+		xfs_buf_delwri_queue(bp);
+		xfs_buf_relse(bp);
 		/*
 		 * Give the dquot another try on the freelist, as the
 		 * flushing will take some time.


* [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (6 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 10:35   ` Dave Chinner
  2012-04-18 21:11   ` Mark Tinguely
  2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
                   ` (2 subsequent siblings)
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-buf-item-write-later --]
[-- Type: text/plain, Size: 2183 bytes --]

Instead of adding buffers to the delwri list as soon as they are logged,
even if they can't be written until committed because they are pinned,
defer adding them to the delwri list until xfsaild pushes them.  This
makes the code more similar to other log items and prepares for writing
buffers directly from xfsaild.

The complication here is that we need to fail buffers that were added
but not logged yet in xfs_buf_item_unpin, borrowing code from
xfs_bioerror.
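Roughly, the unpin-time failure path looks like the sketch below (simplified,
with made-up field names; the real xfs_buf_item_unpin also handles the stale
case and AIL removal): flag the error, clear the done state, mark the buffer
stale and run I/O completion so waiters are released.

```c
#include <assert.h>
#include <stdbool.h>

struct buf {
	int pin_count;
	int error;		/* set to EIO once failed */
	bool done;
	bool stale;
	int ioend_calls;	/* how often I/O completion ran */
};

/* Unpin sketch: if the last unpin removes a buffer that can no longer
 * be written (the transaction that logged it aborted), fail it the way
 * xfs_bioerror does instead of leaving waiters stuck forever. */
static void buf_item_unpin(struct buf *bp, bool remove)
{
	bool freed = (--bp->pin_count == 0);

	if (freed && remove) {
		bp->error = 5;		/* EIO */
		bp->done = false;	/* XFS_BUF_UNDONE() */
		bp->stale = true;	/* xfs_buf_stale() */
		bp->ioend_calls++;	/* xfs_buf_ioend() wakes waiters */
	}
}
```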

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_buf_item.c  |   11 ++++++++---
 fs/xfs/xfs_trans_buf.c |    2 --
 2 files changed, 8 insertions(+), 5 deletions(-)

Index: xfs/fs/xfs/xfs_buf_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_buf_item.c	2012-03-16 09:51:16.830170721 +0100
+++ xfs/fs/xfs/xfs_buf_item.c	2012-03-16 09:52:14.600171793 +0100
@@ -460,6 +460,12 @@ xfs_buf_item_unpin(
 			ASSERT(bp->b_fspriv == NULL);
 		}
 		xfs_buf_relse(bp);
+	} else if (freed && remove) {
+		xfs_buf_lock(bp);
+		xfs_buf_ioerror(bp, EIO);
+		XFS_BUF_UNDONE(bp);
+		xfs_buf_stale(bp);
+		xfs_buf_ioend(bp, 0);
 	}
 }
 
@@ -604,9 +610,7 @@ xfs_buf_item_committed(
 }
 
 /*
- * The buffer is locked, but is not a delayed write buffer. This happens
- * if we race with IO completion and hence we don't want to try to write it
- * again. Just release the buffer.
+ * The buffer is locked, but is not a delayed write buffer.
  */
 STATIC void
 xfs_buf_item_push(
@@ -620,6 +624,7 @@ xfs_buf_item_push(
 
 	trace_xfs_buf_item_push(bip);
 
+	xfs_buf_delwri_queue(bp);
 	xfs_buf_relse(bp);
 }
 
Index: xfs/fs/xfs/xfs_trans_buf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_buf.c	2012-03-16 09:51:16.846837387 +0100
+++ xfs/fs/xfs/xfs_trans_buf.c	2012-03-16 09:52:14.600171793 +0100
@@ -626,8 +626,6 @@ xfs_trans_log_buf(xfs_trans_t	*tp,
 	bp->b_iodone = xfs_buf_iodone_callbacks;
 	bip->bli_item.li_cb = xfs_buf_iodone;
 
-	xfs_buf_delwri_queue(bp);
-
 	trace_xfs_trans_log_buf(bip);
 
 	/*


* [PATCH 09/10] xfs: on-stack delayed write buffer lists
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (7 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 11:37   ` Dave Chinner
  2012-04-20 18:19   ` Mark Tinguely
  2012-03-27 16:44 ` [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c Christoph Hellwig
  2012-03-28  0:53 ` [PATCH 00/10] remove xfsbufd Dave Chinner
  10 siblings, 2 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-kill-xfsbufd --]
[-- Type: text/plain, Size: 77642 bytes --]

Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
and write back the buffers per-process instead of by waking up xfsbufd.

This is now easily doable given that we have very few places left that write
delwri buffers:

 - log recovery:
	Only done at mount time, and already forcing out the buffers
	synchronously using xfs_flush_buftarg

 - quotacheck:
	Same story.

 - dquot reclaim:
	Writes out dirty dquots on the LRU under memory pressure.  We might
	want to look into doing more of this via xfsaild, but it's already
	an improvement over synchronous inode reclaim, which writes each
	buffer synchronously.

 - xfsaild:
	This is the main beneficiary of the change.  By keeping a local list
	of buffers to write we reduce latency of writing out buffers, and
	more importantly we can remove all the delwri list promotions which
	were hitting the buffer cache hard under sustained metadata loads.

The implementation is very straightforward - xfs_buf_delwri_queue now gets
a new list_head pointer that it adds the delwri buffers to, and all callers
need to eventually submit the list using xfs_buf_delwri_submit or
xfs_buf_delwri_submit_nowait.  Buffers that are already on a delwri list are
skipped in xfs_buf_delwri_queue, on the assumption that they sit on another
caller's delwri list.  The biggest change needed to pass down the buffer
list was to the AIL pushing.  Now that we operate on buffers, the trylock,
push and pushbuf log item methods are merged into a single push routine,
which tries to lock the item and, if possible, adds the buffer that needs
writeback to the buffer list.  This leads to much simpler code than the
previous split, but requires the individual IOP_PUSH instances to unlock
and reacquire the AIL lock around calls to blocking routines.

Given that xfsaild now also handles writing out buffers, the conditions for
log forcing and the sleep times needed some small changes.  The most
important one is that we consider the AIL busy as long as we still have
buffers to push; the other is that we do increment the pushed LSN for
buffers that are currently being flushed, but still count them towards
the stuck items for restart purposes.  Without this we could hammer on
stuck items without ever forcing the log, and fail to make progress under
heavy random delete workloads on fast flash storage devices.
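The core of the on-stack scheme can be sketched in standalone C with a
minimal re-implementation of the kernel's intrusive list (a toy, not list.h;
the real xfs_buf_delwri_queue keys off the _XBF_DELWRI_Q flag rather than
bare list membership): queueing is idempotent, no locking is needed because
the list is local to the caller, and submission simply drains it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for the kernel's intrusive struct list_head. */
struct list_head { struct list_head *next, *prev; };

#define LIST_HEAD_INIT(name)	{ &(name), &(name) }

static void list_init(struct list_head *h) { h->next = h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	list_init(e);
}

struct buf {
	struct list_head b_list;
	int written;
};

/* Queue a buffer on a caller-provided (typically on-stack) list.
 * Returns false if the buffer already sits on some delwri list. */
static bool delwri_queue(struct buf *bp, struct list_head *list)
{
	if (!list_empty(&bp->b_list))
		return false;
	list_add_tail(&bp->b_list, list);
	return true;
}

/* Drain the local list, "writing" each queued buffer. */
static int delwri_submit(struct list_head *list)
{
	int count = 0;

	while (!list_empty(list)) {
		struct buf *bp = (struct buf *)((char *)list->next -
				offsetof(struct buf, b_list));
		list_del_init(&bp->b_list);
		bp->written = 1;	/* stands in for the actual I/O */
		count++;
	}
	return count;
}
```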

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_buf.c          |  336 ++++++++++++++++------------------------------
 fs/xfs/xfs_buf.h          |   28 ---
 fs/xfs/xfs_buf_item.c     |   96 +++----------
 fs/xfs/xfs_dquot.c        |   33 ----
 fs/xfs/xfs_dquot.h        |    1 
 fs/xfs/xfs_dquot_item.c   |  161 ++++------------------
 fs/xfs/xfs_extfree_item.c |   53 +------
 fs/xfs/xfs_inode.c        |   25 ---
 fs/xfs/xfs_inode.h        |    1 
 fs/xfs/xfs_inode_item.c   |  152 ++++----------------
 fs/xfs/xfs_log_recover.c  |   46 +++---
 fs/xfs/xfs_qm.c           |  150 +++++++++-----------
 fs/xfs/xfs_super.c        |   16 --
 fs/xfs/xfs_sync.c         |   18 --
 fs/xfs/xfs_trace.h        |    7 
 fs/xfs/xfs_trans.h        |   18 --
 fs/xfs/xfs_trans_ail.c    |  133 +++++++-----------
 fs/xfs/xfs_trans_buf.c    |   86 +++--------
 fs/xfs/xfs_trans_priv.h   |    1 
 19 files changed, 437 insertions(+), 924 deletions(-)

Index: xfs/fs/xfs/xfs_log_recover.c
===================================================================
--- xfs.orig/fs/xfs/xfs_log_recover.c	2012-03-25 16:41:00.134550721 +0200
+++ xfs/fs/xfs/xfs_log_recover.c	2012-03-25 16:46:11.977889836 +0200
@@ -2103,6 +2103,7 @@ xlog_recover_do_dquot_buffer(
 STATIC int
 xlog_recover_buffer_pass2(
 	xlog_t			*log,
+	struct list_head	*buffer_list,
 	xlog_recover_item_t	*item)
 {
 	xfs_buf_log_format_t	*buf_f = item->ri_buf[0].i_addr;
@@ -2173,7 +2174,7 @@ xlog_recover_buffer_pass2(
 	} else {
 		ASSERT(bp->b_target->bt_mount == mp);
 		bp->b_iodone = xlog_recover_iodone;
-		xfs_buf_delwri_queue(bp);
+		xfs_buf_delwri_queue(bp, buffer_list);
 	}
 
 	xfs_buf_relse(bp);
@@ -2183,6 +2184,7 @@ xlog_recover_buffer_pass2(
 STATIC int
 xlog_recover_inode_pass2(
 	xlog_t			*log,
+	struct list_head	*buffer_list,
 	xlog_recover_item_t	*item)
 {
 	xfs_inode_log_format_t	*in_f;
@@ -2436,7 +2438,7 @@ xlog_recover_inode_pass2(
 write_inode_buffer:
 	ASSERT(bp->b_target->bt_mount == mp);
 	bp->b_iodone = xlog_recover_iodone;
-	xfs_buf_delwri_queue(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
 error:
 	if (need_free)
@@ -2477,6 +2479,7 @@ xlog_recover_quotaoff_pass1(
 STATIC int
 xlog_recover_dquot_pass2(
 	xlog_t			*log,
+	struct list_head	*buffer_list,
 	xlog_recover_item_t	*item)
 {
 	xfs_mount_t		*mp = log->l_mp;
@@ -2558,7 +2561,7 @@ xlog_recover_dquot_pass2(
 	ASSERT(dq_f->qlf_size == 2);
 	ASSERT(bp->b_target->bt_mount == mp);
 	bp->b_iodone = xlog_recover_iodone;
-	xfs_buf_delwri_queue(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
 
 	return (0);
@@ -2712,21 +2715,22 @@ STATIC int
 xlog_recover_commit_pass2(
 	struct log		*log,
 	struct xlog_recover	*trans,
+	struct list_head	*buffer_list,
 	xlog_recover_item_t	*item)
 {
 	trace_xfs_log_recover_item_recover(log, trans, item, XLOG_RECOVER_PASS2);
 
 	switch (ITEM_TYPE(item)) {
 	case XFS_LI_BUF:
-		return xlog_recover_buffer_pass2(log, item);
+		return xlog_recover_buffer_pass2(log, buffer_list, item);
 	case XFS_LI_INODE:
-		return xlog_recover_inode_pass2(log, item);
+		return xlog_recover_inode_pass2(log, buffer_list, item);
 	case XFS_LI_EFI:
 		return xlog_recover_efi_pass2(log, item, trans->r_lsn);
 	case XFS_LI_EFD:
 		return xlog_recover_efd_pass2(log, item);
 	case XFS_LI_DQUOT:
-		return xlog_recover_dquot_pass2(log, item);
+		return xlog_recover_dquot_pass2(log, buffer_list, item);
 	case XFS_LI_QUOTAOFF:
 		/* nothing to do in pass2 */
 		return 0;
@@ -2750,8 +2754,9 @@ xlog_recover_commit_trans(
 	struct xlog_recover	*trans,
 	int			pass)
 {
-	int			error = 0;
+	int			error = 0, error2;
 	xlog_recover_item_t	*item;
+	LIST_HEAD		(buffer_list);
 
 	hlist_del(&trans->r_list);
 
@@ -2760,16 +2765,27 @@ xlog_recover_commit_trans(
 		return error;
 
 	list_for_each_entry(item, &trans->r_itemq, ri_list) {
-		if (pass == XLOG_RECOVER_PASS1)
+		switch (pass) {
+		case XLOG_RECOVER_PASS1:
 			error = xlog_recover_commit_pass1(log, trans, item);
-		else
-			error = xlog_recover_commit_pass2(log, trans, item);
+			break;
+		case XLOG_RECOVER_PASS2:
+			error = xlog_recover_commit_pass2(log, trans,
+							  &buffer_list, item);
+			break;
+		default:
+			ASSERT(0);
+		}
+
 		if (error)
-			return error;
+			goto out;
 	}
 
 	xlog_recover_free_trans(trans);
-	return 0;
+
+out:
+	error2 = xfs_buf_delwri_submit(&buffer_list);
+	return error ? error : error2;
 }
 
 STATIC int
@@ -3650,11 +3666,8 @@ xlog_do_recover(
 	 * First replay the images in the log.
 	 */
 	error = xlog_do_log_recovery(log, head_blk, tail_blk);
-	if (error) {
+	if (error)
 		return error;
-	}
-
-	xfs_flush_buftarg(log->l_mp->m_ddev_targp, 1);
 
 	/*
 	 * If IO errors happened during recovery, bail out.
@@ -3681,7 +3694,6 @@ xlog_do_recover(
 	bp = xfs_getsb(log->l_mp, 0);
 	XFS_BUF_UNDONE(bp);
 	ASSERT(!(XFS_BUF_ISWRITE(bp)));
-	ASSERT(!(XFS_BUF_ISDELAYWRITE(bp)));
 	XFS_BUF_READ(bp);
 	XFS_BUF_UNASYNC(bp);
 	xfsbdstrat(log->l_mp, bp);
Index: xfs/fs/xfs/xfs_buf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_buf.c	2012-03-25 16:41:00.144550722 +0200
+++ xfs/fs/xfs/xfs_buf.c	2012-03-25 17:11:17.154584413 +0200
@@ -42,7 +42,6 @@
 #include "xfs_trace.h"
 
 static kmem_zone_t *xfs_buf_zone;
-STATIC int xfsbufd(void *);
 
 static struct workqueue_struct *xfslogd_workqueue;
 
@@ -144,8 +143,11 @@ void
 xfs_buf_stale(
 	struct xfs_buf	*bp)
 {
+	ASSERT(xfs_buf_islocked(bp));
+
 	bp->b_flags |= XBF_STALE;
-	xfs_buf_delwri_dequeue(bp);
+	bp->b_flags &= ~_XBF_DELWRI_Q;
+
 	atomic_set(&(bp)->b_lru_ref, 0);
 	if (!list_empty(&bp->b_lru)) {
 		struct xfs_buftarg *btp = bp->b_target;
@@ -592,10 +594,10 @@ _xfs_buf_read(
 {
 	int			status;
 
-	ASSERT(!(flags & (XBF_DELWRI|XBF_WRITE)));
+	ASSERT(!(flags & XBF_WRITE));
 	ASSERT(bp->b_bn != XFS_BUF_DADDR_NULL);
 
-	bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_DELWRI | XBF_READ_AHEAD);
+	bp->b_flags &= ~(XBF_WRITE | XBF_ASYNC | XBF_READ_AHEAD);
 	bp->b_flags |= flags & (XBF_READ | XBF_ASYNC | XBF_READ_AHEAD);
 
 	status = xfs_buf_iorequest(bp);
@@ -855,7 +857,7 @@ xfs_buf_rele(
 			spin_unlock(&pag->pag_buf_lock);
 		} else {
 			xfs_buf_lru_del(bp);
-			ASSERT(!(bp->b_flags & (XBF_DELWRI|_XBF_DELWRI_Q)));
+			ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 			rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 			spin_unlock(&pag->pag_buf_lock);
 			xfs_perag_put(pag);
@@ -915,13 +917,6 @@ xfs_buf_lock(
 	trace_xfs_buf_lock_done(bp, _RET_IP_);
 }
 
-/*
- *	Releases the lock on the buffer object.
- *	If the buffer is marked delwri but is not queued, do so before we
- *	unlock the buffer as we need to set flags correctly.  We also need to
- *	take a reference for the delwri queue because the unlocker is going to
- *	drop their's and they don't know we just queued it.
- */
 void
 xfs_buf_unlock(
 	struct xfs_buf		*bp)
@@ -1019,10 +1014,11 @@ xfs_bwrite(
 {
 	int			error;
 
+	ASSERT(xfs_buf_islocked(bp));
+
 	bp->b_flags |= XBF_WRITE;
-	bp->b_flags &= ~(XBF_ASYNC | XBF_READ);
+	bp->b_flags &= ~(XBF_ASYNC | XBF_READ | _XBF_DELWRI_Q);
 
-	xfs_buf_delwri_dequeue(bp);
 	xfs_bdstrat_cb(bp);
 
 	error = xfs_buf_iowait(bp);
@@ -1254,7 +1250,7 @@ xfs_buf_iorequest(
 {
 	trace_xfs_buf_iorequest(bp, _RET_IP_);
 
-	ASSERT(!(bp->b_flags & XBF_DELWRI));
+	ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 
 	if (bp->b_flags & XBF_WRITE)
 		xfs_buf_wait_unpin(bp);
@@ -1435,11 +1431,9 @@ xfs_free_buftarg(
 {
 	unregister_shrinker(&btp->bt_shrinker);
 
-	xfs_flush_buftarg(btp, 1);
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
 
-	kthread_stop(btp->bt_task);
 	kmem_free(btp);
 }
 
@@ -1491,20 +1485,6 @@ xfs_setsize_buftarg(
 	return xfs_setsize_buftarg_flags(btp, blocksize, sectorsize, 1);
 }
 
-STATIC int
-xfs_alloc_delwri_queue(
-	xfs_buftarg_t		*btp,
-	const char		*fsname)
-{
-	INIT_LIST_HEAD(&btp->bt_delwri_queue);
-	spin_lock_init(&btp->bt_delwri_lock);
-	btp->bt_flags = 0;
-	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
-	if (IS_ERR(btp->bt_task))
-		return PTR_ERR(btp->bt_task);
-	return 0;
-}
-
 xfs_buftarg_t *
 xfs_alloc_buftarg(
 	struct xfs_mount	*mp,
@@ -1527,8 +1507,6 @@ xfs_alloc_buftarg(
 	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	if (xfs_alloc_delwri_queue(btp, fsname))
-		goto error;
 	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&btp->bt_shrinker);
@@ -1539,125 +1517,52 @@ error:
 	return NULL;
 }
 
-
 /*
- *	Delayed write buffer handling
+ * Add a buffer to the delayed write list.
+ *
+ * This queues a buffer for writeout if it hasn't already been.  Note that
+ * neither this routine nor the buffer list submission functions perform
+ * any internal synchronization.  It is expected that the lists are thread-local
+ * to the callers.
+ *
+ * Returns true if we queued up the buffer, or false if it already had
+ * been on the buffer list.
  */
-void
+bool
 xfs_buf_delwri_queue(
-	xfs_buf_t		*bp)
+	struct xfs_buf		*bp,
+	struct list_head	*list)
 {
-	struct xfs_buftarg	*btp = bp->b_target;
-
-	trace_xfs_buf_delwri_queue(bp, _RET_IP_);
-
+	ASSERT(xfs_buf_islocked(bp));
 	ASSERT(!(bp->b_flags & XBF_READ));
 
-	spin_lock(&btp->bt_delwri_lock);
-	if (!list_empty(&bp->b_list)) {
-		/* if already in the queue, move it to the tail */
-		ASSERT(bp->b_flags & _XBF_DELWRI_Q);
-		list_move_tail(&bp->b_list, &btp->bt_delwri_queue);
-	} else {
-		/* start xfsbufd as it is about to have something to do */
-		if (list_empty(&btp->bt_delwri_queue))
-			wake_up_process(bp->b_target->bt_task);
-
-		atomic_inc(&bp->b_hold);
-		bp->b_flags |= XBF_DELWRI | _XBF_DELWRI_Q | XBF_ASYNC;
-		list_add_tail(&bp->b_list, &btp->bt_delwri_queue);
-	}
-	bp->b_queuetime = jiffies;
-	spin_unlock(&btp->bt_delwri_lock);
-}
-
-void
-xfs_buf_delwri_dequeue(
-	xfs_buf_t		*bp)
-{
-	int			dequeued = 0;
-
-	spin_lock(&bp->b_target->bt_delwri_lock);
-	if ((bp->b_flags & XBF_DELWRI) && !list_empty(&bp->b_list)) {
-		ASSERT(bp->b_flags & _XBF_DELWRI_Q);
-		list_del_init(&bp->b_list);
-		dequeued = 1;
+	/*
+	 * If the buffer is already marked delwri it is already queued up
+	 * by someone else for immediate writeout.  Just ignore it in that
+	 * case.
+	 */
+	if (bp->b_flags & _XBF_DELWRI_Q) {
+		trace_xfs_buf_delwri_queued(bp, _RET_IP_);
+		return false;
 	}
-	bp->b_flags &= ~(XBF_DELWRI|_XBF_DELWRI_Q);
-	spin_unlock(&bp->b_target->bt_delwri_lock);
-
-	if (dequeued)
-		xfs_buf_rele(bp);
-
-	trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
-}
 
-/*
- * If a delwri buffer needs to be pushed before it has aged out, then promote
- * it to the head of the delwri queue so that it will be flushed on the next
- * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
- * than the age currently needed to flush the buffer. Hence the next time the
- * xfsbufd sees it is guaranteed to be considered old enough to flush.
- */
-void
-xfs_buf_delwri_promote(
-	struct xfs_buf	*bp)
-{
-	struct xfs_buftarg *btp = bp->b_target;
-	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
-
-	ASSERT(bp->b_flags & XBF_DELWRI);
-	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+	trace_xfs_buf_delwri_queue(bp, _RET_IP_);
 
 	/*
-	 * Check the buffer age before locking the delayed write queue as we
-	 * don't need to promote buffers that are already past the flush age.
+	 * If a buffer gets written out synchronously while it is on a delwri
+	 * list we lazily remove it, aka only the _XBF_DELWRI_Q flag gets
+	 * cleared, but it remains referenced and on the list.  In a rare
+	 * corner case it might get re-added to a delwri list after the
+	 * synchronous writeout, in which case we just need to re-add
+	 * the flag here.
 	 */
-	if (bp->b_queuetime < jiffies - age)
-		return;
-	bp->b_queuetime = jiffies - age;
-	spin_lock(&btp->bt_delwri_lock);
-	list_move(&bp->b_list, &btp->bt_delwri_queue);
-	spin_unlock(&btp->bt_delwri_lock);
-}
-
-/*
- * Move as many buffers as specified to the supplied list
- * idicating if we skipped any buffers to prevent deadlocks.
- */
-STATIC int
-xfs_buf_delwri_split(
-	xfs_buftarg_t	*target,
-	struct list_head *list,
-	unsigned long	age)
-{
-	xfs_buf_t	*bp, *n;
-	int		skipped = 0;
-	int		force;
-
-	force = test_and_clear_bit(XBT_FORCE_FLUSH, &target->bt_flags);
-	INIT_LIST_HEAD(list);
-	spin_lock(&target->bt_delwri_lock);
-	list_for_each_entry_safe(bp, n, &target->bt_delwri_queue, b_list) {
-		ASSERT(bp->b_flags & XBF_DELWRI);
-
-		if (!xfs_buf_ispinned(bp) && xfs_buf_trylock(bp)) {
-			if (!force &&
-			    time_before(jiffies, bp->b_queuetime + age)) {
-				xfs_buf_unlock(bp);
-				break;
-			}
-
-			bp->b_flags &= ~(XBF_DELWRI | _XBF_DELWRI_Q);
-			bp->b_flags |= XBF_WRITE;
-			list_move_tail(&bp->b_list, list);
-			trace_xfs_buf_delwri_split(bp, _RET_IP_);
-		} else
-			skipped++;
+	bp->b_flags |= _XBF_DELWRI_Q;
+	if (list_empty(&bp->b_list)) {
+		atomic_inc(&bp->b_hold);
+		list_add_tail(&bp->b_list, list);
 	}
 
-	spin_unlock(&target->bt_delwri_lock);
-	return skipped;
+	return true;
 }
 
 /*
@@ -1683,99 +1588,106 @@ xfs_buf_cmp(
 	return 0;
 }
 
-STATIC int
-xfsbufd(
-	void		*data)
-{
-	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;
-
-	current->flags |= PF_MEMALLOC;
-
-	set_freezable();
+static int
+__xfs_buf_delwri_submit(
+	struct list_head	*submit_list,
+	struct list_head	*list,
+	bool			wait)
+{
+	struct blk_plug		plug;
+	struct xfs_buf		*bp, *n;
+	int			pinned = 0;
+
+	list_for_each_entry_safe(bp, n, list, b_list) {
+		if (!wait) {
+			if (xfs_buf_ispinned(bp)) {
+				pinned++;
+				continue;
+			}
+			if (!xfs_buf_trylock(bp))
+				continue;
+		} else {
+			xfs_buf_lock(bp);
+		}
 
-	do {
-		long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
-		long	tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
-		struct list_head tmp;
-		struct blk_plug plug;
+		/*
+		 * Someone else might have written the buffer synchronously
+		 * in the meantime.  In that case only the _XBF_DELWRI_Q flag
+		 * got cleared, and we have to drop the reference and remove
+		 * it from the list here.
+		 */
+		if (!(bp->b_flags & _XBF_DELWRI_Q)) {
+			list_del_init(&bp->b_list);
+			xfs_buf_relse(bp);
+			continue;
+		}
 
-		if (unlikely(freezing(current)))
-			try_to_freeze();
+		list_move_tail(&bp->b_list, submit_list);
+		trace_xfs_buf_delwri_split(bp, _RET_IP_);
+	}
 
-		/* sleep for a long time if there is nothing to do. */
-		if (list_empty(&target->bt_delwri_queue))
-			tout = MAX_SCHEDULE_TIMEOUT;
-		schedule_timeout_interruptible(tout);
+	list_sort(NULL, submit_list, xfs_buf_cmp);
 
-		xfs_buf_delwri_split(target, &tmp, age);
-		list_sort(NULL, &tmp, xfs_buf_cmp);
+	blk_start_plug(&plug);
+	list_for_each_entry_safe(bp, n, submit_list, b_list) {
+		bp->b_flags &= ~_XBF_DELWRI_Q;
+		bp->b_flags |= XBF_WRITE;
 
-		blk_start_plug(&plug);
-		while (!list_empty(&tmp)) {
-			struct xfs_buf *bp;
-			bp = list_first_entry(&tmp, struct xfs_buf, b_list);
+		if (!wait) {
+			bp->b_flags |= XBF_ASYNC;
 			list_del_init(&bp->b_list);
-			xfs_bdstrat_cb(bp);
 		}
-		blk_finish_plug(&plug);
-	} while (!kthread_should_stop());
+		xfs_bdstrat_cb(bp);
+	}
+	blk_finish_plug(&plug);
 
-	return 0;
+	return pinned;
 }
 
 /*
- *	Go through all incore buffers, and release buffers if they belong to
- *	the given device. This is used in filesystem error handling to
- *	preserve the consistency of its metadata.
+ * Write out a buffer list asynchronously.
+ *
+ * This will take the buffer list, write all non-locked and non-pinned buffers
+ * out and not wait for I/O completion on any of the buffers.  This interface
+ * is only safely usable for callers that can track I/O completion by higher
+ * level means, e.g. AIL pushing.
  */
 int
-xfs_flush_buftarg(
-	xfs_buftarg_t	*target,
-	int		wait)
-{
-	xfs_buf_t	*bp;
-	int		pincount = 0;
-	LIST_HEAD(tmp_list);
-	LIST_HEAD(wait_list);
-	struct blk_plug plug;
+xfs_buf_delwri_submit_nowait(
+	struct list_head	*list)
+{
+	LIST_HEAD		(submit_list);
+	return __xfs_buf_delwri_submit(&submit_list, list, false);
+}
 
-	flush_workqueue(xfslogd_workqueue);
+/*
+ * Write out a buffer list synchronously.
+ *
+ * This will take the buffer list, write all buffers out and wait for I/O
+ * completion on all of the buffers.
+ */
+int
+xfs_buf_delwri_submit(
+	struct list_head	*list)
+{
+	LIST_HEAD		(submit_list);
+	int			error = 0, error2;
+	struct xfs_buf		*bp;
 
-	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
-	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
+	__xfs_buf_delwri_submit(&submit_list, list, true);
 
-	/*
-	 * Dropped the delayed write list lock, now walk the temporary list.
-	 * All I/O is issued async and then if we need to wait for completion
-	 * we do that after issuing all the IO.
-	 */
-	list_sort(NULL, &tmp_list, xfs_buf_cmp);
+	/* Wait for IO to complete. */
+	while (!list_empty(&submit_list)) {
+		bp = list_first_entry(&submit_list, struct xfs_buf, b_list);
 
-	blk_start_plug(&plug);
-	while (!list_empty(&tmp_list)) {
-		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
-		ASSERT(target == bp->b_target);
 		list_del_init(&bp->b_list);
-		if (wait) {
-			bp->b_flags &= ~XBF_ASYNC;
-			list_add(&bp->b_list, &wait_list);
-		}
-		xfs_bdstrat_cb(bp);
-	}
-	blk_finish_plug(&plug);
-
-	if (wait) {
-		/* Wait for IO to complete. */
-		while (!list_empty(&wait_list)) {
-			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
-
-			list_del_init(&bp->b_list);
-			xfs_buf_iowait(bp);
-			xfs_buf_relse(bp);
-		}
+		error2 = xfs_buf_iowait(bp);
+		xfs_buf_relse(bp);
+		if (!error)
+			error = error2;
 	}
 
-	return pincount;
+	return error;
 }
 
 int __init
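
[The core of the new interface in the xfs_buf.c changes above: xfs_buf_delwri_queue()
adds a locked buffer to a caller-owned list and returns false if it is already
queued, and the submit functions later clear _XBF_DELWRI_Q and drop the list's
reference.  The userspace model below is a hypothetical sketch of just those
semantics -- simplified types and names, not the kernel code:]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define XBF_DELWRI_Q	(1 << 0)

/* Toy stand-in for struct xfs_buf: a flag word, a list link, a refcount. */
struct xbuf {
	unsigned int	flags;
	struct xbuf	*next;	/* singly linked stand-in for b_list */
	int		hold;	/* stand-in for the b_hold refcount */
};

/* Queue a buffer on a caller-owned list.  Returns false if it was
 * already queued; whoever queued it first owns the writeout. */
static bool xbuf_delwri_queue(struct xbuf *bp, struct xbuf **list)
{
	if (bp->flags & XBF_DELWRI_Q)
		return false;
	bp->flags |= XBF_DELWRI_Q;
	bp->hold++;		/* the list holds a reference */
	bp->next = *list;
	*list = bp;
	return true;
}

/* Submit the whole list: clear the flag and drop the list reference
 * on each buffer.  Returns the number of buffers "written". */
static int xbuf_delwri_submit(struct xbuf **list)
{
	int count = 0;

	while (*list) {
		struct xbuf *bp = *list;

		*list = bp->next;
		bp->next = NULL;
		bp->flags &= ~XBF_DELWRI_Q;
		bp->hold--;	/* I/O done, release the buffer */
		count++;
	}
	return count;
}
```

[Because the list is thread-local to the caller, no spinlock is needed
anywhere in this path -- that is the locking the patch removes.]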
Index: xfs/fs/xfs/xfs_buf_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_buf_item.c	2012-03-25 16:46:11.337889825 +0200
+++ xfs/fs/xfs/xfs_buf_item.c	2012-03-25 17:16:31.607923578 +0200
@@ -418,7 +418,6 @@ xfs_buf_item_unpin(
 	if (freed && stale) {
 		ASSERT(bip->bli_flags & XFS_BLI_STALE);
 		ASSERT(xfs_buf_islocked(bp));
-		ASSERT(!(XFS_BUF_ISDELAYWRITE(bp)));
 		ASSERT(XFS_BUF_ISSTALE(bp));
 		ASSERT(bip->bli_format.blf_flags & XFS_BLF_CANCEL);
 
@@ -469,34 +468,28 @@ xfs_buf_item_unpin(
 	}
 }
 
-/*
- * This is called to attempt to lock the buffer associated with this
- * buf log item.  Don't sleep on the buffer lock.  If we can't get
- * the lock right away, return 0.  If we can get the lock, take a
- * reference to the buffer. If this is a delayed write buffer that
- * needs AIL help to be written back, invoke the pushbuf routine
- * rather than the normal success path.
- */
 STATIC uint
-xfs_buf_item_trylock(
-	struct xfs_log_item	*lip)
+xfs_buf_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
 	struct xfs_buf		*bp = bip->bli_buf;
+	uint			rval = XFS_ITEM_SUCCESS;
 
 	if (xfs_buf_ispinned(bp))
 		return XFS_ITEM_PINNED;
 	if (!xfs_buf_trylock(bp))
 		return XFS_ITEM_LOCKED;
 
-	/* take a reference to the buffer.  */
-	xfs_buf_hold(bp);
-
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
-	trace_xfs_buf_item_trylock(bip);
-	if (XFS_BUF_ISDELAYWRITE(bp))
-		return XFS_ITEM_PUSHBUF;
-	return XFS_ITEM_SUCCESS;
+
+	trace_xfs_buf_item_push(bip);
+
+	if (!xfs_buf_delwri_queue(bp, buffer_list))
+		rval = XFS_ITEM_FLUSHING;
+	xfs_buf_unlock(bp);
+	return rval;
 }
 
 /*
@@ -609,48 +602,6 @@ xfs_buf_item_committed(
 	return lsn;
 }
 
-/*
- * The buffer is locked, but is not a delayed write buffer.
- */
-STATIC void
-xfs_buf_item_push(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
-	struct xfs_buf		*bp = bip->bli_buf;
-
-	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
-	ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
-
-	trace_xfs_buf_item_push(bip);
-
-	xfs_buf_delwri_queue(bp);
-	xfs_buf_relse(bp);
-}
-
-/*
- * The buffer is locked and is a delayed write buffer. Promote the buffer
- * in the delayed write queue as the caller knows that they must invoke
- * the xfsbufd to get this buffer written. We have to unlock the buffer
- * to allow the xfsbufd to write it, too.
- */
-STATIC bool
-xfs_buf_item_pushbuf(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_buf_log_item	*bip = BUF_ITEM(lip);
-	struct xfs_buf		*bp = bip->bli_buf;
-
-	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
-	ASSERT(XFS_BUF_ISDELAYWRITE(bp));
-
-	trace_xfs_buf_item_pushbuf(bip);
-
-	xfs_buf_delwri_promote(bp);
-	xfs_buf_relse(bp);
-	return true;
-}
-
 STATIC void
 xfs_buf_item_committing(
 	struct xfs_log_item	*lip,
@@ -666,11 +617,9 @@ static const struct xfs_item_ops xfs_buf
 	.iop_format	= xfs_buf_item_format,
 	.iop_pin	= xfs_buf_item_pin,
 	.iop_unpin	= xfs_buf_item_unpin,
-	.iop_trylock	= xfs_buf_item_trylock,
 	.iop_unlock	= xfs_buf_item_unlock,
 	.iop_committed	= xfs_buf_item_committed,
 	.iop_push	= xfs_buf_item_push,
-	.iop_pushbuf	= xfs_buf_item_pushbuf,
 	.iop_committing = xfs_buf_item_committing
 };
 
@@ -989,20 +938,27 @@ xfs_buf_iodone_callbacks(
 	 * If the write was asynchronous then no one will be looking for the
 	 * error.  Clear the error state and write the buffer out again.
 	 *
-	 * During sync or umount we'll write all pending buffers again
-	 * synchronous, which will catch these errors if they keep hanging
-	 * around.
+	 * XXX: This helps against transient write errors, but we need to find
+	 * a way to shut the filesystem down if the writes keep failing.
+	 *
+	 * In practice we'll shut the filesystem down soon, as non-transient
+	 * errors tend to affect the whole device and a failing log write
+	 * will make us give up.  But we really ought to do better here.
 	 */
 	if (XFS_BUF_ISASYNC(bp)) {
+		ASSERT(bp->b_iodone != NULL);
+
+		trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
+
 		xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag */
 
 		if (!XFS_BUF_ISSTALE(bp)) {
-			xfs_buf_delwri_queue(bp);
-			XFS_BUF_DONE(bp);
+			bp->b_flags |= XBF_WRITE | XBF_ASYNC | XBF_DONE;
+			xfs_bdstrat_cb(bp);
+		} else {
+			xfs_buf_relse(bp);
 		}
-		ASSERT(bp->b_iodone != NULL);
-		trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
-		xfs_buf_relse(bp);
+
 		return;
 	}
 
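[With ->iop_trylock and ->iop_pushbuf folded into a single ->iop_push above,
the AIL push loop only has to interpret one return code per item.  A much
simplified, hypothetical tally of those results -- illustrative only, the real
loop lives in xfs_trans_ail.c and does considerably more:]

```c
#include <assert.h>

/* Return codes as in the patch; the values here are arbitrary stand-ins. */
enum push_result {
	ITEM_SUCCESS,	/* flushed and queued to the local buffer list */
	ITEM_PINNED,	/* waiting on a log force */
	ITEM_LOCKED,	/* someone else holds the lock, skip it */
	ITEM_FLUSHING,	/* writeback already in progress elsewhere */
};

struct push_stats {
	unsigned long	flushed;
	unsigned long	stuck;	/* pinned + locked + flushing */
	unsigned long	pinned;	/* pinned only: may warrant a log force */
};

/* Tally one item's push result, roughly as the AIL worker might. */
static void tally_push(struct push_stats *st, enum push_result res)
{
	switch (res) {
	case ITEM_SUCCESS:
		st->flushed++;
		break;
	case ITEM_PINNED:
		st->pinned++;
		/* fall through: pinned items are also stuck */
	case ITEM_LOCKED:
	case ITEM_FLUSHING:
		st->stuck++;
		break;
	}
}
```

[A high pinned count tells the worker a log force is needed; a high stuck
count with nothing flushed tells it to back off and retry later.]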
Index: xfs/fs/xfs/xfs_dquot_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot_item.c	2012-03-25 16:46:10.211223137 +0200
+++ xfs/fs/xfs/xfs_dquot_item.c	2012-03-25 16:47:50.337891661 +0200
@@ -108,46 +108,6 @@ xfs_qm_dquot_logitem_unpin(
 		wake_up(&dqp->q_pinwait);
 }
 
-/*
- * Given the logitem, this writes the corresponding dquot entry to disk
- * asynchronously. This is called with the dquot entry securely locked;
- * we simply get xfs_qm_dqflush() to do the work, and unlock the dquot
- * at the end.
- */
-STATIC void
-xfs_qm_dquot_logitem_push(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_dquot	*dqp = DQUOT_ITEM(lip)->qli_dquot;
-	struct xfs_buf		*bp = NULL;
-	int			error;
-
-	ASSERT(XFS_DQ_IS_LOCKED(dqp));
-	ASSERT(!completion_done(&dqp->q_flush));
-	ASSERT(atomic_read(&dqp->q_pincount) == 0);
-
-	/*
-	 * Since we were able to lock the dquot's flush lock and
-	 * we found it on the AIL, the dquot must be dirty.  This
-	 * is because the dquot is removed from the AIL while still
-	 * holding the flush lock in xfs_dqflush_done().  Thus, if
-	 * we found it in the AIL and were able to obtain the flush
-	 * lock without sleeping, then there must not have been
-	 * anyone in the process of flushing the dquot.
-	 */
-	error = xfs_qm_dqflush(dqp, &bp);
-	if (error) {
-		xfs_warn(dqp->q_mount, "%s: push error %d on dqp %p",
-			__func__, error, dqp);
-		goto out_unlock;
-	}
-
-	xfs_buf_delwri_queue(bp);
-	xfs_buf_relse(bp);
-out_unlock:
-	xfs_dqunlock(dqp);
-}
-
 STATIC xfs_lsn_t
 xfs_qm_dquot_logitem_committed(
 	struct xfs_log_item	*lip,
@@ -179,67 +139,15 @@ xfs_qm_dqunpin_wait(
 	wait_event(dqp->q_pinwait, (atomic_read(&dqp->q_pincount) == 0));
 }
 
-/*
- * This is called when IOP_TRYLOCK returns XFS_ITEM_PUSHBUF to indicate that
- * the dquot is locked by us, but the flush lock isn't. So, here we are
- * going to see if the relevant dquot buffer is incore, waiting on DELWRI.
- * If so, we want to push it out to help us take this item off the AIL as soon
- * as possible.
- *
- * We must not be holding the AIL lock at this point. Calling incore() to
- * search the buffer cache can be a time consuming thing, and AIL lock is a
- * spinlock.
- */
-STATIC bool
-xfs_qm_dquot_logitem_pushbuf(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_dq_logitem	*qlip = DQUOT_ITEM(lip);
-	struct xfs_dquot	*dqp = qlip->qli_dquot;
-	struct xfs_buf		*bp;
-	bool			ret = true;
-
-	ASSERT(XFS_DQ_IS_LOCKED(dqp));
-
-	/*
-	 * If flushlock isn't locked anymore, chances are that the
-	 * inode flush completed and the inode was taken off the AIL.
-	 * So, just get out.
-	 */
-	if (completion_done(&dqp->q_flush) ||
-	    !(lip->li_flags & XFS_LI_IN_AIL)) {
-		xfs_dqunlock(dqp);
-		return true;
-	}
-
-	bp = xfs_incore(dqp->q_mount->m_ddev_targp, qlip->qli_format.qlf_blkno,
-			dqp->q_mount->m_quotainfo->qi_dqchunklen, XBF_TRYLOCK);
-	xfs_dqunlock(dqp);
-	if (!bp)
-		return true;
-	if (XFS_BUF_ISDELAYWRITE(bp))
-		xfs_buf_delwri_promote(bp);
-	if (xfs_buf_ispinned(bp))
-		ret = false;
-	xfs_buf_relse(bp);
-	return ret;
-}
-
-/*
- * This is called to attempt to lock the dquot associated with this
- * dquot log item.  Don't sleep on the dquot lock or the flush lock.
- * If the flush lock is already held, indicating that the dquot has
- * been or is in the process of being flushed, then see if we can
- * find the dquot's buffer in the buffer cache without sleeping.  If
- * we can and it is marked delayed write, then we want to send it out.
- * We delay doing so until the push routine, though, to avoid sleeping
- * in any device strategy routines.
- */
 STATIC uint
-xfs_qm_dquot_logitem_trylock(
-	struct xfs_log_item	*lip)
+xfs_qm_dquot_logitem_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	struct xfs_dquot	*dqp = DQUOT_ITEM(lip)->qli_dquot;
+	struct xfs_buf		*bp = NULL;
+	uint			rval = XFS_ITEM_SUCCESS;
+	int			error;
 
 	if (atomic_read(&dqp->q_pincount) > 0)
 		return XFS_ITEM_PINNED;
@@ -252,20 +160,36 @@ xfs_qm_dquot_logitem_trylock(
 	 * taking the quota lock.
 	 */
 	if (atomic_read(&dqp->q_pincount) > 0) {
-		xfs_dqunlock(dqp);
-		return XFS_ITEM_PINNED;
+		rval = XFS_ITEM_PINNED;
+		goto out_unlock;
 	}
 
+	/*
+	 * Someone else is already flushing the dquot.  Nothing we can do
+	 * here but wait for the flush to finish and remove the item from
+	 * the AIL.
+	 */
 	if (!xfs_dqflock_nowait(dqp)) {
-		/*
-		 * dquot has already been flushed to the backing buffer,
-		 * leave it locked, pushbuf routine will unlock it.
-		 */
-		return XFS_ITEM_PUSHBUF;
+		rval = XFS_ITEM_FLUSHING;
+		goto out_unlock;
 	}
 
-	ASSERT(lip->li_flags & XFS_LI_IN_AIL);
-	return XFS_ITEM_SUCCESS;
+	spin_unlock(&lip->li_ailp->xa_lock);
+
+	error = xfs_qm_dqflush(dqp, &bp);
+	if (error) {
+		xfs_warn(dqp->q_mount, "%s: push error %d on dqp %p",
+			__func__, error, dqp);
+	} else {
+		if (!xfs_buf_delwri_queue(bp, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+		xfs_buf_relse(bp);
+	}
+
+	spin_lock(&lip->li_ailp->xa_lock);
+out_unlock:
+	xfs_dqunlock(dqp);
+	return rval;
 }
 
 /*
@@ -316,11 +240,9 @@ static const struct xfs_item_ops xfs_dqu
 	.iop_format	= xfs_qm_dquot_logitem_format,
 	.iop_pin	= xfs_qm_dquot_logitem_pin,
 	.iop_unpin	= xfs_qm_dquot_logitem_unpin,
-	.iop_trylock	= xfs_qm_dquot_logitem_trylock,
 	.iop_unlock	= xfs_qm_dquot_logitem_unlock,
 	.iop_committed	= xfs_qm_dquot_logitem_committed,
 	.iop_push	= xfs_qm_dquot_logitem_push,
-	.iop_pushbuf	= xfs_qm_dquot_logitem_pushbuf,
 	.iop_committing = xfs_qm_dquot_logitem_committing
 };
 
@@ -415,11 +337,13 @@ xfs_qm_qoff_logitem_unpin(
 }
 
 /*
- * Quotaoff items have no locking, so just return success.
+ * There isn't much you can do to push a quotaoff item.  It is simply
+ * stuck waiting for the log to be flushed to disk.
  */
 STATIC uint
-xfs_qm_qoff_logitem_trylock(
-	struct xfs_log_item	*lip)
+xfs_qm_qoff_logitem_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	return XFS_ITEM_LOCKED;
 }
@@ -446,17 +370,6 @@ xfs_qm_qoff_logitem_committed(
 	return lsn;
 }
 
-/*
- * There isn't much you can do to push on an quotaoff item.  It is simply
- * stuck waiting for the log to be flushed to disk.
- */
-STATIC void
-xfs_qm_qoff_logitem_push(
-	struct xfs_log_item	*lip)
-{
-}
-
-
 STATIC xfs_lsn_t
 xfs_qm_qoffend_logitem_committed(
 	struct xfs_log_item	*lip,
@@ -504,7 +417,6 @@ static const struct xfs_item_ops xfs_qm_
 	.iop_format	= xfs_qm_qoff_logitem_format,
 	.iop_pin	= xfs_qm_qoff_logitem_pin,
 	.iop_unpin	= xfs_qm_qoff_logitem_unpin,
-	.iop_trylock	= xfs_qm_qoff_logitem_trylock,
 	.iop_unlock	= xfs_qm_qoff_logitem_unlock,
 	.iop_committed	= xfs_qm_qoffend_logitem_committed,
 	.iop_push	= xfs_qm_qoff_logitem_push,
@@ -519,7 +431,6 @@ static const struct xfs_item_ops xfs_qm_
 	.iop_format	= xfs_qm_qoff_logitem_format,
 	.iop_pin	= xfs_qm_qoff_logitem_pin,
 	.iop_unpin	= xfs_qm_qoff_logitem_unpin,
-	.iop_trylock	= xfs_qm_qoff_logitem_trylock,
 	.iop_unlock	= xfs_qm_qoff_logitem_unlock,
 	.iop_committed	= xfs_qm_qoff_logitem_committed,
 	.iop_push	= xfs_qm_qoff_logitem_push,
Index: xfs/fs/xfs/xfs_extfree_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_extfree_item.c	2012-03-25 16:41:00.177884056 +0200
+++ xfs/fs/xfs/xfs_extfree_item.c	2012-03-25 16:48:54.091226177 +0200
@@ -147,22 +147,20 @@ xfs_efi_item_unpin(
 }
 
 /*
- * Efi items have no locking or pushing.  However, since EFIs are
- * pulled from the AIL when their corresponding EFDs are committed
- * to disk, their situation is very similar to being pinned.  Return
- * XFS_ITEM_PINNED so that the caller will eventually flush the log.
- * This should help in getting the EFI out of the AIL.
+ * Efi items have no locking or pushing.  However, since EFIs are pulled from
+ * the AIL when their corresponding EFDs are committed to disk, their situation
+ * is very similar to being pinned.  Return XFS_ITEM_PINNED so that the caller
+ * will eventually flush the log.  This should help in getting the EFI out of
+ * the AIL.
  */
 STATIC uint
-xfs_efi_item_trylock(
-	struct xfs_log_item	*lip)
+xfs_efi_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	return XFS_ITEM_PINNED;
 }
 
-/*
- * Efi items have no locking, so just return.
- */
 STATIC void
 xfs_efi_item_unlock(
 	struct xfs_log_item	*lip)
@@ -190,17 +188,6 @@ xfs_efi_item_committed(
 }
 
 /*
- * There isn't much you can do to push on an efi item.  It is simply
- * stuck waiting for all of its corresponding efd items to be
- * committed to disk.
- */
-STATIC void
-xfs_efi_item_push(
-	struct xfs_log_item	*lip)
-{
-}
-
-/*
  * The EFI dependency tracking op doesn't do squat.  It can't because
  * it doesn't know where the free extent is coming from.  The dependency
  * tracking has to be handled by the "enclosing" metadata object.  For
@@ -222,7 +209,6 @@ static const struct xfs_item_ops xfs_efi
 	.iop_format	= xfs_efi_item_format,
 	.iop_pin	= xfs_efi_item_pin,
 	.iop_unpin	= xfs_efi_item_unpin,
-	.iop_trylock	= xfs_efi_item_trylock,
 	.iop_unlock	= xfs_efi_item_unlock,
 	.iop_committed	= xfs_efi_item_committed,
 	.iop_push	= xfs_efi_item_push,
@@ -404,19 +390,17 @@ xfs_efd_item_unpin(
 }
 
 /*
- * Efd items have no locking, so just return success.
+ * There isn't much you can do to push on an efd item.  It is simply stuck
+ * waiting for the log to be flushed to disk.
  */
 STATIC uint
-xfs_efd_item_trylock(
-	struct xfs_log_item	*lip)
+xfs_efd_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	return XFS_ITEM_LOCKED;
 }
 
-/*
- * Efd items have no locking or pushing, so return failure
- * so that the caller doesn't bother with us.
- */
 STATIC void
 xfs_efd_item_unlock(
 	struct xfs_log_item	*lip)
@@ -451,16 +435,6 @@ xfs_efd_item_committed(
 }
 
 /*
- * There isn't much you can do to push on an efd item.  It is simply
- * stuck waiting for the log to be flushed to disk.
- */
-STATIC void
-xfs_efd_item_push(
-	struct xfs_log_item	*lip)
-{
-}
-
-/*
  * The EFD dependency tracking op doesn't do squat.  It can't because
  * it doesn't know where the free extent is coming from.  The dependency
  * tracking has to be handled by the "enclosing" metadata object.  For
@@ -482,7 +456,6 @@ static const struct xfs_item_ops xfs_efd
 	.iop_format	= xfs_efd_item_format,
 	.iop_pin	= xfs_efd_item_pin,
 	.iop_unpin	= xfs_efd_item_unpin,
-	.iop_trylock	= xfs_efd_item_trylock,
 	.iop_unlock	= xfs_efd_item_unlock,
 	.iop_committed	= xfs_efd_item_committed,
 	.iop_push	= xfs_efd_item_push,
Index: xfs/fs/xfs/xfs_inode_item.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode_item.c	2012-03-25 16:41:21.651217787 +0200
+++ xfs/fs/xfs/xfs_inode_item.c	2012-03-25 16:49:09.501226460 +0200
@@ -480,25 +480,16 @@ xfs_inode_item_unpin(
 		wake_up_bit(&ip->i_flags, __XFS_IPINNED_BIT);
 }
 
-/*
- * This is called to attempt to lock the inode associated with this
- * inode log item, in preparation for the push routine which does the actual
- * iflush.  Don't sleep on the inode lock or the flush lock.
- *
- * If the flush lock is already held, indicating that the inode has
- * been or is in the process of being flushed, then (ideally) we'd like to
- * see if the inode's buffer is still incore, and if so give it a nudge.
- * We delay doing so until the pushbuf routine, though, to avoid holding
- * the AIL lock across a call to the blackhole which is the buffer cache.
- * Also we don't want to sleep in any device strategy routines, which can happen
- * if we do the subsequent bawrite in here.
- */
 STATIC uint
-xfs_inode_item_trylock(
-	struct xfs_log_item	*lip)
+xfs_inode_item_push(
+	struct xfs_log_item	*lip,
+	struct list_head	*buffer_list)
 {
 	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
 	struct xfs_inode	*ip = iip->ili_inode;
+	struct xfs_buf		*bp = NULL;
+	uint			rval = XFS_ITEM_SUCCESS;
+	int			error;
 
 	if (xfs_ipincount(ip) > 0)
 		return XFS_ITEM_PINNED;
@@ -511,34 +502,45 @@ xfs_inode_item_trylock(
 	 * taking the ilock.
 	 */
 	if (xfs_ipincount(ip) > 0) {
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
-		return XFS_ITEM_PINNED;
+		rval = XFS_ITEM_PINNED;
+		goto out_unlock;
 	}
 
+	/*
+	 * Someone else is already flushing the inode.  Nothing we can do
+	 * here but wait for the flush to finish and remove the item from
+	 * the AIL.
+	 */
 	if (!xfs_iflock_nowait(ip)) {
-		/*
-		 * inode has already been flushed to the backing buffer,
-		 * leave it locked in shared mode, pushbuf routine will
-		 * unlock it.
-		 */
-		return XFS_ITEM_PUSHBUF;
+		rval = XFS_ITEM_FLUSHING;
+		goto out_unlock;
 	}
 
-	/* Stale items should force out the iclog */
+	/*
+	 * Stale inode items should force out the iclog.
+	 */
 	if (ip->i_flags & XFS_ISTALE) {
 		xfs_ifunlock(ip);
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return XFS_ITEM_PINNED;
 	}
 
-#ifdef DEBUG
-	if (!XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-		ASSERT(iip->ili_fields != 0);
-		ASSERT(iip->ili_logged == 0);
-		ASSERT(lip->li_flags & XFS_LI_IN_AIL);
+	ASSERT(iip->ili_fields != 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
+	ASSERT(iip->ili_logged == 0 || XFS_FORCED_SHUTDOWN(ip->i_mount));
+
+	spin_unlock(&lip->li_ailp->xa_lock);
+
+	error = xfs_iflush(ip, &bp);
+	if (!error) {
+		if (!xfs_buf_delwri_queue(bp, buffer_list))
+			rval = XFS_ITEM_FLUSHING;
+		xfs_buf_relse(bp);
 	}
-#endif
-	return XFS_ITEM_SUCCESS;
+
+	spin_lock(&lip->li_ailp->xa_lock);
+out_unlock:
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	return rval;
 }
 
 /*
@@ -623,92 +625,6 @@ xfs_inode_item_committed(
 }
 
 /*
- * This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
- * failed to get the inode flush lock but did get the inode locked SHARED.
- * Here we're trying to see if the inode buffer is incore, and if so whether it's
- * marked delayed write. If that's the case, we'll promote it and that will
- * allow the caller to write the buffer by triggering the xfsbufd to run.
- */
-STATIC bool
-xfs_inode_item_pushbuf(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
-	struct xfs_inode	*ip = iip->ili_inode;
-	struct xfs_buf		*bp;
-	bool			ret = true;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
-
-	/*
-	 * If a flush is not in progress anymore, chances are that the
-	 * inode was taken off the AIL. So, just get out.
-	 */
-	if (!xfs_isiflocked(ip) ||
-	    !(lip->li_flags & XFS_LI_IN_AIL)) {
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
-		return true;
-	}
-
-	bp = xfs_incore(ip->i_mount->m_ddev_targp, iip->ili_format.ilf_blkno,
-			iip->ili_format.ilf_len, XBF_TRYLOCK);
-
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-	if (!bp)
-		return true;
-	if (XFS_BUF_ISDELAYWRITE(bp))
-		xfs_buf_delwri_promote(bp);
-	if (xfs_buf_ispinned(bp))
-		ret = false;
-	xfs_buf_relse(bp);
-	return ret;
-}
-
-/*
- * This is called to asynchronously write the inode associated with this
- * inode log item out to disk. The inode will already have been locked by
- * a successful call to xfs_inode_item_trylock().
- */
-STATIC void
-xfs_inode_item_push(
-	struct xfs_log_item	*lip)
-{
-	struct xfs_inode_log_item *iip = INODE_ITEM(lip);
-	struct xfs_inode	*ip = iip->ili_inode;
-	struct xfs_buf		*bp = NULL;
-	int			error;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
-	ASSERT(xfs_isiflocked(ip));
-
-	/*
-	 * Since we were able to lock the inode's flush lock and
-	 * we found it on the AIL, the inode must be dirty.  This
-	 * is because the inode is removed from the AIL while still
-	 * holding the flush lock in xfs_iflush_done().  Thus, if
-	 * we found it in the AIL and were able to obtain the flush
-	 * lock without sleeping, then there must not have been
-	 * anyone in the process of flushing the inode.
-	 */
-	ASSERT(XFS_FORCED_SHUTDOWN(ip->i_mount) || iip->ili_fields != 0);
-
-	/*
-	 * Push the inode to it's backing buffer. This will not remove the
-	 * inode from the AIL - a further push will be required to trigger a
-	 * buffer push. However, this allows all the dirty inodes to be pushed
-	 * to the buffer before it is pushed to disk. The buffer IO completion
-	 * will pull the inode from the AIL, mark it clean and unlock the flush
-	 * lock.
-	 */
-	error = xfs_iflush(ip, &bp);
-	if (!error) {
-		xfs_buf_delwri_queue(bp);
-		xfs_buf_relse(bp);
-	}
-	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-}
-
-/*
  * XXX rcc - this one really has to do something.  Probably needs
  * to stamp in a new field in the incore inode.
  */
@@ -728,11 +644,9 @@ static const struct xfs_item_ops xfs_ino
 	.iop_format	= xfs_inode_item_format,
 	.iop_pin	= xfs_inode_item_pin,
 	.iop_unpin	= xfs_inode_item_unpin,
-	.iop_trylock	= xfs_inode_item_trylock,
 	.iop_unlock	= xfs_inode_item_unlock,
 	.iop_committed	= xfs_inode_item_committed,
 	.iop_push	= xfs_inode_item_push,
-	.iop_pushbuf	= xfs_inode_item_pushbuf,
 	.iop_committing = xfs_inode_item_committing
 };
 
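[The dquot and inode push routines above share one shape: drop the AIL
spinlock, do the potentially blocking flush, queue the resulting buffer, then
retake the lock before returning.  A hypothetical sketch of that shape with a
toy lock -- the kernel uses ailp->xa_lock and xfs_iflush()/xfs_qm_dqflush():]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for the AIL spinlock: tracks held/not-held so the
 * sketch can assert where the lock is dropped and retaken. */
static bool ail_locked;

static void ail_lock(void)   { assert(!ail_locked); ail_locked = true; }
static void ail_unlock(void) { assert(ail_locked);  ail_locked = false; }

/* Stand-ins for the flush and delwri-queue steps; both succeed in
 * this model, and the flush asserts it never runs under the lock. */
static int  flush_item(void) { assert(!ail_locked); return 0; }
static bool queue_buf(void)  { return true; }

/* Called with the AIL lock held, as ->iop_push is.  Returns 0 for
 * success, 1 for the XFS_ITEM_FLUSHING analogue. */
static int push_item(void)
{
	int rval = 0;

	/* Flushing can block on buffer locks, so it must not run
	 * under the AIL spinlock. */
	ail_unlock();

	if (flush_item() == 0) {
		if (!queue_buf())
			rval = 1;
	}

	ail_lock();		/* retake before returning to the AIL loop */
	return rval;
}
```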
Index: xfs/fs/xfs/xfs_trace.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trace.h	2012-03-25 16:41:18.207884389 +0200
+++ xfs/fs/xfs/xfs_trace.h	2012-03-25 17:09:36.381249212 +0200
@@ -328,7 +328,7 @@ DEFINE_BUF_EVENT(xfs_buf_unlock);
 DEFINE_BUF_EVENT(xfs_buf_iowait);
 DEFINE_BUF_EVENT(xfs_buf_iowait_done);
 DEFINE_BUF_EVENT(xfs_buf_delwri_queue);
-DEFINE_BUF_EVENT(xfs_buf_delwri_dequeue);
+DEFINE_BUF_EVENT(xfs_buf_delwri_queued);
 DEFINE_BUF_EVENT(xfs_buf_delwri_split);
 DEFINE_BUF_EVENT(xfs_buf_get_uncached);
 DEFINE_BUF_EVENT(xfs_bdstrat_shut);
@@ -486,12 +486,10 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_forma
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pin);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unpin);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unpin_stale);
-DEFINE_BUF_ITEM_EVENT(xfs_buf_item_trylock);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
-DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
@@ -881,10 +879,9 @@ DEFINE_EVENT(xfs_log_item_class, name, \
 	TP_PROTO(struct xfs_log_item *lip), \
 	TP_ARGS(lip))
 DEFINE_LOG_ITEM_EVENT(xfs_ail_push);
-DEFINE_LOG_ITEM_EVENT(xfs_ail_pushbuf);
-DEFINE_LOG_ITEM_EVENT(xfs_ail_pushbuf_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_pinned);
 DEFINE_LOG_ITEM_EVENT(xfs_ail_locked);
+DEFINE_LOG_ITEM_EVENT(xfs_ail_flushing);
 
 
 DECLARE_EVENT_CLASS(xfs_file_class,
Index: xfs/fs/xfs/xfs_trans.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans.h	2012-03-25 16:41:00.214550722 +0200
+++ xfs/fs/xfs/xfs_trans.h	2012-03-25 16:46:11.987889837 +0200
@@ -345,11 +345,9 @@ struct xfs_item_ops {
 	void (*iop_format)(xfs_log_item_t *, struct xfs_log_iovec *);
 	void (*iop_pin)(xfs_log_item_t *);
 	void (*iop_unpin)(xfs_log_item_t *, int remove);
-	uint (*iop_trylock)(xfs_log_item_t *);
+	uint (*iop_push)(struct xfs_log_item *, struct list_head *);
 	void (*iop_unlock)(xfs_log_item_t *);
 	xfs_lsn_t (*iop_committed)(xfs_log_item_t *, xfs_lsn_t);
-	void (*iop_push)(xfs_log_item_t *);
-	bool (*iop_pushbuf)(xfs_log_item_t *);
 	void (*iop_committing)(xfs_log_item_t *, xfs_lsn_t);
 };
 
@@ -357,20 +355,18 @@ struct xfs_item_ops {
 #define IOP_FORMAT(ip,vp)	(*(ip)->li_ops->iop_format)(ip, vp)
 #define IOP_PIN(ip)		(*(ip)->li_ops->iop_pin)(ip)
 #define IOP_UNPIN(ip, remove)	(*(ip)->li_ops->iop_unpin)(ip, remove)
-#define IOP_TRYLOCK(ip)		(*(ip)->li_ops->iop_trylock)(ip)
+#define IOP_PUSH(ip, list)	(*(ip)->li_ops->iop_push)(ip, list)
 #define IOP_UNLOCK(ip)		(*(ip)->li_ops->iop_unlock)(ip)
 #define IOP_COMMITTED(ip, lsn)	(*(ip)->li_ops->iop_committed)(ip, lsn)
-#define IOP_PUSH(ip)		(*(ip)->li_ops->iop_push)(ip)
-#define IOP_PUSHBUF(ip)		(*(ip)->li_ops->iop_pushbuf)(ip)
 #define IOP_COMMITTING(ip, lsn) (*(ip)->li_ops->iop_committing)(ip, lsn)
 
 /*
- * Return values for the IOP_TRYLOCK() routines.
+ * Return values for the IOP_PUSH() routines.
  */
-#define	XFS_ITEM_SUCCESS	0
-#define	XFS_ITEM_PINNED		1
-#define	XFS_ITEM_LOCKED		2
-#define XFS_ITEM_PUSHBUF	3
+#define XFS_ITEM_SUCCESS	0
+#define XFS_ITEM_PINNED		1
+#define XFS_ITEM_LOCKED		2
+#define XFS_ITEM_FLUSHING	3
 
 /*
  * This is the type of function which can be given to xfs_trans_callback()
Index: xfs/fs/xfs/xfs_trans_ail.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_ail.c	2012-03-25 16:41:20.901217773 +0200
+++ xfs/fs/xfs/xfs_trans_ail.c	2012-03-25 17:16:19.787923358 +0200
@@ -364,29 +364,31 @@ xfsaild_push(
 	xfs_log_item_t		*lip;
 	xfs_lsn_t		lsn;
 	xfs_lsn_t		target;
-	long			tout = 10;
+	long			tout;
 	int			stuck = 0;
+	int			flushing = 0;
 	int			count = 0;
-	int			push_xfsbufd = 0;
 
 	/*
-	 * If last time we ran we encountered pinned items, force the log first
-	 * and wait for it before pushing again.
+	 * If we encountered pinned items or did not finish writing out all
+	 * buffers the last time we ran, force the log first and wait for it
+	 * before pushing again.
 	 */
-	spin_lock(&ailp->xa_lock);
-	if (ailp->xa_last_pushed_lsn == 0 && ailp->xa_log_flush &&
-	    !list_empty(&ailp->xa_ail)) {
+	if (ailp->xa_log_flush && ailp->xa_last_pushed_lsn == 0 &&
+	    (!list_empty_careful(&ailp->xa_buf_list) ||
+	     xfs_ail_min_lsn(ailp))) {
 		ailp->xa_log_flush = 0;
-		spin_unlock(&ailp->xa_lock);
+
 		XFS_STATS_INC(xs_push_ail_flush);
 		xfs_log_force(mp, XFS_LOG_SYNC);
-		spin_lock(&ailp->xa_lock);
 	}
 
+	spin_lock(&ailp->xa_lock);
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
 	if (!lip) {
 		/*
-		 * AIL is empty or our push has reached the end.
+		 * If the AIL is empty or our push has reached the end, we
+		 * are done now.
 		 */
 		xfs_trans_ail_cursor_done(ailp, &cur);
 		spin_unlock(&ailp->xa_lock);
@@ -396,64 +398,49 @@ xfsaild_push(
 	XFS_STATS_INC(xs_push_ail);
 
 	/*
-	 * If we are draining the AIL push all items, not just the current
-	 * threshold.
+	 * If at least one caller asks us to drain the AIL, we have to push
+	 * out all items, not just those below the current threshold.
 	 */
 	if (atomic_read(&ailp->xa_wait_empty))
 		target = xfs_ail_max(ailp)->li_lsn;
 	else
 		target = ailp->xa_target;
 
-	/*
-	 * While the item we are looking at is below the given threshold
-	 * try to flush it out. We'd like not to stop until we've at least
-	 * tried to push on everything in the AIL with an LSN less than
-	 * the given threshold.
-	 *
-	 * However, we will stop after a certain number of pushes and wait
-	 * for a reduced timeout to fire before pushing further. This
-	 * prevents use from spinning when we can't do anything or there is
-	 * lots of contention on the AIL lists.
-	 */
 	lsn = lip->li_lsn;
 	while ((XFS_LSN_CMP(lip->li_lsn, target) <= 0)) {
 		int	lock_result;
+
 		/*
-		 * If we can lock the item without sleeping, unlock the AIL
-		 * lock and flush the item.  Then re-grab the AIL lock so we
-		 * can look for the next item on the AIL. List changes are
-		 * handled by the AIL lookup functions internally
-		 *
-		 * If we can't lock the item, either its holder will flush it
-		 * or it is already being flushed or it is being relogged.  In
-		 * any of these case it is being taken care of and we can just
-		 * skip to the next item in the list.
+		 * Note that IOP_PUSH may unlock and reacquire the AIL lock.
+		 * We rely on the AIL cursor implementation to be able to
+		 * deal with the dropped lock.
 		 */
-		lock_result = IOP_TRYLOCK(lip);
-		spin_unlock(&ailp->xa_lock);
+		lock_result = IOP_PUSH(lip, &ailp->xa_buf_list);
 		switch (lock_result) {
 		case XFS_ITEM_SUCCESS:
 			XFS_STATS_INC(xs_push_ail_success);
 			trace_xfs_ail_push(lip);
 
-			IOP_PUSH(lip);
 			ailp->xa_last_pushed_lsn = lsn;
 			break;
+		case XFS_ITEM_FLUSHING:
+			/*
+			 * The item or its backing buffer is already being
+			 * flushed.  The typical reason for that is that an
+			 * inode buffer is locked because we already pushed
+			 * the updates to it as part of inode clustering.
+			 *
+			 * We do not want to stop flushing just because
+			 * lots of items are already being flushed, but we
+			 * need to re-try the flushing relatively soon if
+			 * most of the AIL is being flushed.
+			 */
+			XFS_STATS_INC(xs_push_ail_flushing);
+			trace_xfs_ail_flushing(lip);
 
-		case XFS_ITEM_PUSHBUF:
-			XFS_STATS_INC(xs_push_ail_pushbuf);
-			trace_xfs_ail_pushbuf(lip);
-
-			if (!IOP_PUSHBUF(lip)) {
-				trace_xfs_ail_pushbuf_pinned(lip);
-				stuck++;
-				ailp->xa_log_flush++;
-			} else {
-				ailp->xa_last_pushed_lsn = lsn;
-			}
-			push_xfsbufd = 1;
+			flushing++;
+			ailp->xa_last_pushed_lsn = lsn;
 			break;
-
 		case XFS_ITEM_PINNED:
 			XFS_STATS_INC(xs_push_ail_pinned);
 			trace_xfs_ail_pinned(lip);
@@ -461,23 +448,22 @@ xfsaild_push(
 			stuck++;
 			ailp->xa_log_flush++;
 			break;
-
 		case XFS_ITEM_LOCKED:
 			XFS_STATS_INC(xs_push_ail_locked);
 			trace_xfs_ail_locked(lip);
+
 			stuck++;
 			break;
-
 		default:
 			ASSERT(0);
 			break;
 		}
 
-		spin_lock(&ailp->xa_lock);
 		count++;
 
 		/*
 		 * Are there too many items we can't do anything with?
+		 *
 		 * If we are skipping too many items because we can't flush
 		 * them or they are already being flushed, we back off and
 		 * given them time to complete whatever operation is being
@@ -499,42 +485,36 @@ xfsaild_push(
 	xfs_trans_ail_cursor_done(ailp, &cur);
 	spin_unlock(&ailp->xa_lock);
 
-	if (push_xfsbufd) {
-		/* we've got delayed write buffers to flush */
-		wake_up_process(mp->m_ddev_targp->bt_task);
-	}
+	if (xfs_buf_delwri_submit_nowait(&ailp->xa_buf_list))
+		ailp->xa_log_flush++;
 
-	/* assume we have more work to do in a short while */
+	if (!count || XFS_LSN_CMP(lsn, target) >= 0) {
 out_done:
-	if (!count) {
-		/* We're past our target or empty, so idle */
-		ailp->xa_last_pushed_lsn = 0;
-		ailp->xa_log_flush = 0;
-
-		tout = 50;
-	} else if (XFS_LSN_CMP(lsn, target) >= 0) {
 		/*
-		 * We reached the target so wait a bit longer for I/O to
-		 * complete and remove pushed items from the AIL before we
-		 * start the next scan from the start of the AIL.
+		 * We reached the target or the AIL is empty, so wait a bit
+		 * longer for I/O to complete and remove pushed items from the
+		 * AIL before we start the next scan from the start of the AIL.
 		 */
 		tout = 50;
 		ailp->xa_last_pushed_lsn = 0;
-	} else if ((stuck * 100) / count > 90) {
+	} else if (((stuck + flushing) * 100) / count > 90) {
 		/*
-		 * Either there is a lot of contention on the AIL or we
-		 * are stuck due to operations in progress. "Stuck" in this
-		 * case is defined as >90% of the items we tried to push
-		 * were stuck.
+		 * Either there is a lot of contention on the AIL or we are
+		 * stuck due to operations in progress. "Stuck" in this case
+		 * is defined as >90% of the items we tried to push were stuck.
 		 *
 		 * Backoff a bit more to allow some I/O to complete before
-		 * restarting from the start of the AIL. This prevents us
-		 * from spinning on the same items, and if they are pinned will
-		 * all the restart to issue a log force to unpin the stuck
-		 * items.
+		 * restarting from the start of the AIL. This prevents us from
+		 * spinning on the same items, and if they are pinned will
+		 * allow the restart to issue a log force to unpin the stuck
+		 * items.
 		 */
 		tout = 20;
 		ailp->xa_last_pushed_lsn = 0;
+	} else {
+		/*
+		 * Assume we have more work to do in a short while.
+		 */
+		tout = 10;
 	}
 
 	return tout;
@@ -547,6 +527,8 @@ xfsaild(
 	struct xfs_ail	*ailp = data;
 	long		tout = 0;	/* milliseconds */
 
+	current->flags |= PF_MEMALLOC;
+
 	while (!kthread_should_stop()) {
 		if (tout && tout <= 20)
 			__set_current_state(TASK_KILLABLE);
@@ -806,6 +788,7 @@ xfs_trans_ail_init(
 	INIT_LIST_HEAD(&ailp->xa_ail);
 	INIT_LIST_HEAD(&ailp->xa_cursors);
 	spin_lock_init(&ailp->xa_lock);
+	INIT_LIST_HEAD(&ailp->xa_buf_list);
 	init_waitqueue_head(&ailp->xa_empty);
 	atomic_set(&ailp->xa_wait_empty, 0);
 
Index: xfs/fs/xfs/xfs_trans_buf.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_buf.c	2012-03-25 16:46:11.341223159 +0200
+++ xfs/fs/xfs/xfs_trans_buf.c	2012-03-25 16:46:11.991223170 +0200
@@ -165,14 +165,6 @@ xfs_trans_get_buf(xfs_trans_t	*tp,
 			XFS_BUF_DONE(bp);
 		}
 
-		/*
-		 * If the buffer is stale then it was binval'ed
-		 * since last read.  This doesn't matter since the
-		 * caller isn't allowed to use the data anyway.
-		 */
-		else if (XFS_BUF_ISSTALE(bp))
-			ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
-
 		ASSERT(bp->b_transp == tp);
 		bip = bp->b_fspriv;
 		ASSERT(bip != NULL);
@@ -418,19 +410,6 @@ xfs_trans_read_buf(
 	return 0;
 
 shutdown_abort:
-	/*
-	 * the theory here is that buffer is good but we're
-	 * bailing out because the filesystem is being forcibly
-	 * shut down.  So we should leave the b_flags alone since
-	 * the buffer's not staled and just get out.
-	 */
-#if defined(DEBUG)
-	if (XFS_BUF_ISSTALE(bp) && XFS_BUF_ISDELAYWRITE(bp))
-		xfs_notice(mp, "about to pop assert, bp == 0x%p", bp);
-#endif
-	ASSERT((bp->b_flags & (XBF_STALE|XBF_DELWRI)) !=
-				     (XBF_STALE|XBF_DELWRI));
-
 	trace_xfs_trans_read_buf_shut(bp, _RET_IP_);
 	xfs_buf_relse(bp);
 	*bpp = NULL;
@@ -649,22 +628,33 @@ xfs_trans_log_buf(xfs_trans_t	*tp,
 
 
 /*
- * This called to invalidate a buffer that is being used within
- * a transaction.  Typically this is because the blocks in the
- * buffer are being freed, so we need to prevent it from being
- * written out when we're done.  Allowing it to be written again
- * might overwrite data in the free blocks if they are reallocated
- * to a file.
- *
- * We prevent the buffer from being written out by clearing the
- * B_DELWRI flag.  We can't always
- * get rid of the buf log item at this point, though, because
- * the buffer may still be pinned by another transaction.  If that
- * is the case, then we'll wait until the buffer is committed to
- * disk for the last time (we can tell by the ref count) and
- * free it in xfs_buf_item_unpin().  Until it is cleaned up we
- * will keep the buffer locked so that the buffer and buf log item
- * are not reused.
+ * Invalidate a buffer that is being used within a transaction.
+ *
+ * Typically this is because the blocks in the buffer are being freed, so we
+ * need to prevent it from being written out when we're done.  Allowing it
+ * to be written again might overwrite data in the free blocks if they are
+ * reallocated to a file.
+ *
+ * We prevent the buffer from being written out by marking it stale.  We can't
+ * get rid of the buf log item at this point because the buffer may still be
+ * pinned by another transaction.  If that is the case, then we'll wait until
+ * the buffer is committed to disk for the last time (we can tell by the ref
+ * count) and free it in xfs_buf_item_unpin().  Until that happens we will
+ * keep the buffer locked so that the buffer and buf log item are not reused.
+ *
+ * We also set the XFS_BLF_CANCEL flag in the buf log format structure and log
+ * the buf item.  This will be used at recovery time to determine that copies
+ * of the buffer in the log before this should not be replayed.
+ *
+ * We mark the item descriptor and the transaction dirty so that we'll hold
+ * the buffer until after the commit.
+ *
+ * Since we're invalidating the buffer, we also clear the state about which
+ * parts of the buffer have been logged.  We also clear the flag indicating
+ * that this is an inode buffer since the data in the buffer will no longer
+ * be valid.
+ *
+ * We set the stale bit in the buffer as well since we're getting rid of it.
  */
 void
 xfs_trans_binval(
@@ -684,7 +674,6 @@ xfs_trans_binval(
 		 * If the buffer is already invalidated, then
 		 * just return.
 		 */
-		ASSERT(!(XFS_BUF_ISDELAYWRITE(bp)));
 		ASSERT(XFS_BUF_ISSTALE(bp));
 		ASSERT(!(bip->bli_flags & (XFS_BLI_LOGGED | XFS_BLI_DIRTY)));
 		ASSERT(!(bip->bli_format.blf_flags & XFS_BLF_INODE_BUF));
@@ -694,27 +683,8 @@ xfs_trans_binval(
 		return;
 	}
 
-	/*
-	 * Clear the dirty bit in the buffer and set the STALE flag
-	 * in the buf log item.  The STALE flag will be used in
-	 * xfs_buf_item_unpin() to determine if it should clean up
-	 * when the last reference to the buf item is given up.
-	 * We set the XFS_BLF_CANCEL flag in the buf log format structure
-	 * and log the buf item.  This will be used at recovery time
-	 * to determine that copies of the buffer in the log before
-	 * this should not be replayed.
-	 * We mark the item descriptor and the transaction dirty so
-	 * that we'll hold the buffer until after the commit.
-	 *
-	 * Since we're invalidating the buffer, we also clear the state
-	 * about which parts of the buffer have been logged.  We also
-	 * clear the flag indicating that this is an inode buffer since
-	 * the data in the buffer will no longer be valid.
-	 *
-	 * We set the stale bit in the buffer as well since we're getting
-	 * rid of it.
-	 */
 	xfs_buf_stale(bp);
+
 	bip->bli_flags |= XFS_BLI_STALE;
 	bip->bli_flags &= ~(XFS_BLI_INODE_BUF | XFS_BLI_LOGGED | XFS_BLI_DIRTY);
 	bip->bli_format.blf_flags &= ~XFS_BLF_INODE_BUF;
Index: xfs/fs/xfs/xfs_qm.c
===================================================================
--- xfs.orig/fs/xfs/xfs_qm.c	2012-03-25 16:46:10.211223137 +0200
+++ xfs/fs/xfs/xfs_qm.c	2012-03-25 17:09:36.364582545 +0200
@@ -65,7 +65,8 @@ STATIC int
 xfs_qm_dquot_walk(
 	struct xfs_mount	*mp,
 	int			type,
-	int			(*execute)(struct xfs_dquot *dqp))
+	int			(*execute)(struct xfs_dquot *dqp, void *data),
+	void			*data)
 {
 	struct xfs_quotainfo	*qi = mp->m_quotainfo;
 	struct radix_tree_root	*tree = XFS_DQUOT_TREE(qi, type);
@@ -97,7 +98,7 @@ restart:
 
 			next_index = be32_to_cpu(dqp->q_core.d_id) + 1;
 
-			error = execute(batch[i]);
+			error = execute(batch[i], data);
 			if (error == EAGAIN) {
 				skipped++;
 				continue;
@@ -129,7 +130,8 @@ restart:
  */
 STATIC int
 xfs_qm_dqpurge(
-	struct xfs_dquot	*dqp)
+	struct xfs_dquot	*dqp,
+	void			*data)
 {
 	struct xfs_mount	*mp = dqp->q_mount;
 	struct xfs_quotainfo	*qi = mp->m_quotainfo;
@@ -153,21 +155,7 @@ xfs_qm_dqpurge(
 
 	dqp->dq_flags |= XFS_DQ_FREEING;
 
-	/*
-	 * If we're turning off quotas, we have to make sure that, for
-	 * example, we don't delete quota disk blocks while dquots are
-	 * in the process of getting written to those disk blocks.
-	 * This dquot might well be on AIL, and we can't leave it there
-	 * if we're turning off quotas. Basically, we need this flush
-	 * lock, and are willing to block on it.
-	 */
-	if (!xfs_dqflock_nowait(dqp)) {
-		/*
-		 * Block on the flush lock after nudging dquot buffer,
-		 * if it is incore.
-		 */
-		xfs_dqflock_pushbuf_wait(dqp);
-	}
+	xfs_dqflock(dqp);
 
 	/*
 	 * If we are turning this type of quotas off, we don't care
@@ -183,7 +171,7 @@ xfs_qm_dqpurge(
 		 * to purge this dquot anyway, so we go ahead regardless.
 		 */
 		error = xfs_qm_dqflush(dqp, &bp);
-		if (error)
+		if (error) {
 			xfs_warn(mp, "%s: dquot %p flush failed",
 				__func__, dqp);
 		} else {
@@ -231,11 +219,11 @@ xfs_qm_dqpurge_all(
 	uint			flags)
 {
 	if (flags & XFS_QMOPT_UQUOTA)
-		xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_dqpurge);
+		xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_dqpurge, NULL);
 	if (flags & XFS_QMOPT_GQUOTA)
-		xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_dqpurge);
+		xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_dqpurge, NULL);
 	if (flags & XFS_QMOPT_PQUOTA)
-		xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_dqpurge);
+		xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_dqpurge, NULL);
 }
 
 /*
@@ -860,15 +848,16 @@ xfs_qm_reset_dqcounts(
 
 STATIC int
 xfs_qm_dqiter_bufs(
-	xfs_mount_t	*mp,
-	xfs_dqid_t	firstid,
-	xfs_fsblock_t	bno,
-	xfs_filblks_t	blkcnt,
-	uint		flags)
+	struct xfs_mount	*mp,
+	xfs_dqid_t		firstid,
+	xfs_fsblock_t		bno,
+	xfs_filblks_t		blkcnt,
+	uint			flags,
+	struct list_head	*buffer_list)
 {
-	xfs_buf_t	*bp;
-	int		error;
-	int		type;
+	struct xfs_buf		*bp;
+	int			error;
+	int			type;
 
 	ASSERT(blkcnt > 0);
 	type = flags & XFS_QMOPT_UQUOTA ? XFS_DQ_USER :
@@ -892,7 +881,7 @@ xfs_qm_dqiter_bufs(
 			break;
 
 		xfs_qm_reset_dqcounts(mp, bp, firstid, type);
-		xfs_buf_delwri_queue(bp);
+		xfs_buf_delwri_queue(bp, buffer_list);
 		xfs_buf_relse(bp);
 		/*
 		 * goto the next block.
@@ -900,6 +889,7 @@ xfs_qm_dqiter_bufs(
 		bno++;
 		firstid += mp->m_quotainfo->qi_dqperchunk;
 	}
+
 	return error;
 }
 
@@ -909,11 +899,12 @@ xfs_qm_dqiter_bufs(
  */
 STATIC int
 xfs_qm_dqiterate(
-	xfs_mount_t	*mp,
-	xfs_inode_t	*qip,
-	uint		flags)
+	struct xfs_mount	*mp,
+	struct xfs_inode	*qip,
+	uint			flags,
+	struct list_head	*buffer_list)
 {
-	xfs_bmbt_irec_t		*map;
+	struct xfs_bmbt_irec	*map;
 	int			i, nmaps;	/* number of map entries */
 	int			error;		/* return value */
 	xfs_fileoff_t		lblkno;
@@ -980,21 +971,17 @@ xfs_qm_dqiterate(
 			 * Iterate thru all the blks in the extent and
 			 * reset the counters of all the dquots inside them.
 			 */
-			if ((error = xfs_qm_dqiter_bufs(mp,
-						       firstid,
-						       map[i].br_startblock,
-						       map[i].br_blockcount,
-						       flags))) {
-				break;
-			}
+			error = xfs_qm_dqiter_bufs(mp, firstid,
+						   map[i].br_startblock,
+						   map[i].br_blockcount,
+						   flags, buffer_list);
+			if (error)
+				goto out;
 		}
-
-		if (error)
-			break;
 	} while (nmaps > 0);
 
+out:
 	kmem_free(map);
-
 	return error;
 }
 
@@ -1187,8 +1174,10 @@ error0:
 
 STATIC int
 xfs_qm_flush_one(
-	struct xfs_dquot	*dqp)
+	struct xfs_dquot	*dqp,
+	void			*data)
 {
+	struct list_head	*buffer_list = data;
 	struct xfs_buf		*bp = NULL;
 	int			error = 0;
 
@@ -1198,14 +1187,12 @@ xfs_qm_flush_one(
 	if (!XFS_DQ_IS_DIRTY(dqp))
 		goto out_unlock;
 
-	if (!xfs_dqflock_nowait(dqp))
-		xfs_dqflock_pushbuf_wait(dqp);
-
+	xfs_dqflock(dqp);
 	error = xfs_qm_dqflush(dqp, &bp);
 	if (error)
 		goto out_unlock;
 
-	xfs_buf_delwri_queue(bp);
+	xfs_buf_delwri_queue(bp, buffer_list);
 	xfs_buf_relse(bp);
 out_unlock:
 	xfs_dqunlock(dqp);
@@ -1225,6 +1212,7 @@ xfs_qm_quotacheck(
 	size_t		structsz;
 	xfs_inode_t	*uip, *gip;
 	uint		flags;
+	LIST_HEAD	(buffer_list);
 
 	count = INT_MAX;
 	structsz = 1;
@@ -1243,7 +1231,8 @@ xfs_qm_quotacheck(
 	 */
 	uip = mp->m_quotainfo->qi_uquotaip;
 	if (uip) {
-		error = xfs_qm_dqiterate(mp, uip, XFS_QMOPT_UQUOTA);
+		error = xfs_qm_dqiterate(mp, uip, XFS_QMOPT_UQUOTA,
+					 &buffer_list);
 		if (error)
 			goto error_return;
 		flags |= XFS_UQUOTA_CHKD;
@@ -1252,7 +1241,8 @@ xfs_qm_quotacheck(
 	gip = mp->m_quotainfo->qi_gquotaip;
 	if (gip) {
 		error = xfs_qm_dqiterate(mp, gip, XFS_IS_GQUOTA_ON(mp) ?
-					XFS_QMOPT_GQUOTA : XFS_QMOPT_PQUOTA);
+					 XFS_QMOPT_GQUOTA : XFS_QMOPT_PQUOTA,
+					 &buffer_list);
 		if (error)
 			goto error_return;
 		flags |= XFS_OQUOTA_CHKD;
@@ -1275,19 +1265,27 @@ xfs_qm_quotacheck(
 	 * We've made all the changes that we need to make incore.  Flush them
 	 * down to disk buffers if everything was updated successfully.
 	 */
-	if (XFS_IS_UQUOTA_ON(mp))
-		error = xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_flush_one);
+	if (XFS_IS_UQUOTA_ON(mp)) {
+		error = xfs_qm_dquot_walk(mp, XFS_DQ_USER, xfs_qm_flush_one,
+					  &buffer_list);
+	}
 	if (XFS_IS_GQUOTA_ON(mp)) {
-		error2 = xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_flush_one);
+		error2 = xfs_qm_dquot_walk(mp, XFS_DQ_GROUP, xfs_qm_flush_one,
+					   &buffer_list);
 		if (!error)
 			error = error2;
 	}
 	if (XFS_IS_PQUOTA_ON(mp)) {
-		error2 = xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_flush_one);
+		error2 = xfs_qm_dquot_walk(mp, XFS_DQ_PROJ, xfs_qm_flush_one,
+					   &buffer_list);
 		if (!error)
 			error = error2;
 	}
 
+	error2 = xfs_buf_delwri_submit(&buffer_list);
+	if (!error)
+		error = error2;
+
 	/*
 	 * We can get this error if we couldn't do a dquot allocation inside
 	 * xfs_qm_dqusage_adjust (via bulkstat). We don't care about the
@@ -1301,15 +1299,6 @@ xfs_qm_quotacheck(
 	}
 
 	/*
-	 * We didn't log anything, because if we crashed, we'll have to
-	 * start the quotacheck from scratch anyway. However, we must make
-	 * sure that our dquot changes are secure before we put the
-	 * quotacheck'd stamp on the superblock. So, here we do a synchronous
-	 * flush.
-	 */
-	xfs_flush_buftarg(mp->m_ddev_targp, 1);
-
-	/*
 	 * If one type of quotas is off, then it will lose its
 	 * quotachecked status, since we won't be doing accounting for
 	 * that type anymore.
@@ -1318,6 +1307,13 @@ xfs_qm_quotacheck(
 	mp->m_qflags |= flags;
 
  error_return:
+	while (!list_empty(&buffer_list)) {
+		struct xfs_buf *bp =
+			list_first_entry(&buffer_list, struct xfs_buf, b_list);
+		list_del_init(&bp->b_list);
+		xfs_buf_relse(bp);
+	}
+
 	if (error) {
 		xfs_warn(mp,
 	"Quotacheck: Unsuccessful (Error %d): Disabling quotas.",
@@ -1434,6 +1430,7 @@ xfs_qm_dqfree_one(
 STATIC void
 xfs_qm_dqreclaim_one(
 	struct xfs_dquot	*dqp,
+	struct list_head	*buffer_list,
 	struct list_head	*dispose_list)
 {
 	struct xfs_mount	*mp = dqp->q_mount;
@@ -1466,21 +1463,11 @@ xfs_qm_dqreclaim_one(
 	if (!xfs_dqflock_nowait(dqp))
 		goto out_busy;
 
-	/*
-	 * We have the flush lock so we know that this is not in the
-	 * process of being flushed. So, if this is dirty, flush it
-	 * DELWRI so that we don't get a freelist infested with
-	 * dirty dquots.
-	 */
 	if (XFS_DQ_IS_DIRTY(dqp)) {
 		struct xfs_buf	*bp = NULL;
 
 		trace_xfs_dqreclaim_dirty(dqp);
 
-		/*
-		 * We flush it delayed write, so don't bother releasing the
-		 * freelist lock.
-		 */
 		error = xfs_qm_dqflush(dqp, &bp);
 		if (error) {
 			xfs_warn(mp, "%s: dquot %p flush failed",
@@ -1488,7 +1475,7 @@ xfs_qm_dqreclaim_one(
 			goto out_busy;
 		}
 
-		xfs_buf_delwri_queue(bp);
+		xfs_buf_delwri_queue(bp, buffer_list);
 		xfs_buf_relse(bp);
 		/*
 		 * Give the dquot another try on the freelist, as the
@@ -1533,8 +1520,10 @@ xfs_qm_shake(
 	struct xfs_quotainfo	*qi =
 		container_of(shrink, struct xfs_quotainfo, qi_shrinker);
 	int			nr_to_scan = sc->nr_to_scan;
+	LIST_HEAD		(buffer_list);
 	LIST_HEAD		(dispose_list);
 	struct xfs_dquot	*dqp;
+	int			error;
 
 	if ((sc->gfp_mask & (__GFP_FS|__GFP_WAIT)) != (__GFP_FS|__GFP_WAIT))
 		return 0;
@@ -1547,15 +1536,20 @@ xfs_qm_shake(
 			break;
 		dqp = list_first_entry(&qi->qi_lru_list, struct xfs_dquot,
 				       q_lru);
-		xfs_qm_dqreclaim_one(dqp, &dispose_list);
+		xfs_qm_dqreclaim_one(dqp, &buffer_list, &dispose_list);
 	}
 	mutex_unlock(&qi->qi_lru_lock);
 
+	error = xfs_buf_delwri_submit(&buffer_list);
+	if (error)
+		xfs_warn(NULL, "%s: dquot reclaim failed", __func__);
+
 	while (!list_empty(&dispose_list)) {
 		dqp = list_first_entry(&dispose_list, struct xfs_dquot, q_lru);
 		list_del_init(&dqp->q_lru);
 		xfs_qm_dqfree_one(dqp);
 	}
+
 out:
 	return (qi->qi_lru_count / 100) * sysctl_vfs_cache_pressure;
 }
Index: xfs/fs/xfs/xfs_inode.c
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.c	2012-03-25 16:41:21.647884454 +0200
+++ xfs/fs/xfs/xfs_inode.c	2012-03-25 17:11:20.831251148 +0200
@@ -2347,11 +2347,11 @@ cluster_corrupt_out:
 	 */
 	rcu_read_unlock();
 	/*
-	 * Clean up the buffer.  If it was B_DELWRI, just release it --
+	 * Clean up the buffer.  If it was delwri, just release it --
 	 * brelse can handle it with no problems.  If not, shut down the
 	 * filesystem before releasing the buffer.
 	 */
-	bufwasdelwri = XFS_BUF_ISDELAYWRITE(bp);
+	bufwasdelwri = (bp->b_flags & _XBF_DELWRI_Q);
 	if (bufwasdelwri)
 		xfs_buf_relse(bp);
 
@@ -2685,27 +2685,6 @@ corrupt_out:
 	return XFS_ERROR(EFSCORRUPTED);
 }
 
-void
-xfs_promote_inode(
-	struct xfs_inode	*ip)
-{
-	struct xfs_buf		*bp;
-
-	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL|XFS_ILOCK_SHARED));
-
-	bp = xfs_incore(ip->i_mount->m_ddev_targp, ip->i_imap.im_blkno,
-			ip->i_imap.im_len, XBF_TRYLOCK);
-	if (!bp)
-		return;
-
-	if (XFS_BUF_ISDELAYWRITE(bp)) {
-		xfs_buf_delwri_promote(bp);
-		wake_up_process(ip->i_mount->m_ddev_targp->bt_task);
-	}
-
-	xfs_buf_relse(bp);
-}
-
 /*
  * Return a pointer to the extent record at file index idx.
  */
Index: xfs/fs/xfs/xfs_trans_priv.h
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_priv.h	2012-03-25 16:41:20.901217773 +0200
+++ xfs/fs/xfs/xfs_trans_priv.h	2012-03-25 16:46:11.994556504 +0200
@@ -71,6 +71,7 @@ struct xfs_ail {
 	spinlock_t		xa_lock;
 	xfs_lsn_t		xa_last_pushed_lsn;
 	int			xa_log_flush;
+	struct list_head	xa_buf_list;
 	wait_queue_head_t	xa_empty;
 	atomic_t		xa_wait_empty;
 };
Index: xfs/fs/xfs/xfs_super.c
===================================================================
--- xfs.orig/fs/xfs/xfs_super.c	2012-03-25 16:41:00.284550724 +0200
+++ xfs/fs/xfs/xfs_super.c	2012-03-25 16:46:11.994556504 +0200
@@ -967,15 +967,7 @@ xfs_fs_put_super(
 
 	xfs_syncd_stop(mp);
 
-	/*
-	 * Blow away any referenced inode in the filestreams cache.
-	 * This can and will cause log traffic as inodes go inactive
-	 * here.
-	 */
 	xfs_filestream_unmount(mp);
-
-	xfs_flush_buftarg(mp->m_ddev_targp, 1);
-
 	xfs_unmountfs(mp);
 	xfs_freesb(mp);
 	xfs_icsb_destroy_counters(mp);
@@ -1391,15 +1383,7 @@ out_destroy_workqueues:
  out_syncd_stop:
 	xfs_syncd_stop(mp);
  out_unmount:
-	/*
-	 * Blow away any referenced inode in the filestreams cache.
-	 * This can and will cause log traffic as inodes go inactive
-	 * here.
-	 */
 	xfs_filestream_unmount(mp);
-
-	xfs_flush_buftarg(mp->m_ddev_targp, 1);
-
 	xfs_unmountfs(mp);
 	goto out_free_sb;
 }
Index: xfs/fs/xfs/xfs_sync.c
===================================================================
--- xfs.orig/fs/xfs/xfs_sync.c	2012-03-25 16:45:50.037889431 +0200
+++ xfs/fs/xfs/xfs_sync.c	2012-03-25 17:11:20.847917816 +0200
@@ -313,17 +313,10 @@ xfs_quiesce_data(
 	/* write superblock and hoover up shutdown errors */
 	error = xfs_sync_fsdata(mp);
 
-	/* make sure all delwri buffers are written out */
-	xfs_flush_buftarg(mp->m_ddev_targp, 1);
-
 	/* mark the log as covered if needed */
 	if (xfs_log_need_covered(mp))
 		error2 = xfs_fs_log_dummy(mp);
 
-	/* flush data-only devices */
-	if (mp->m_rtdev_targp)
-		xfs_flush_buftarg(mp->m_rtdev_targp, 1);
-
 	return error ? error : error2;
 }
 
@@ -681,17 +674,6 @@ restart:
 	if (!xfs_iflock_nowait(ip)) {
 		if (!(sync_mode & SYNC_WAIT))
 			goto out;
-
-		/*
-		 * If we only have a single dirty inode in a cluster there is
-		 * a fair chance that the AIL push may have pushed it into
-		 * the buffer, but xfsbufd won't touch it until 30 seconds
-		 * from now, and thus we will lock up here.
-		 *
-		 * Promote the inode buffer to the front of the delwri list
-		 * and wake up xfsbufd now.
-		 */
-		xfs_promote_inode(ip);
 		xfs_iflock(ip);
 	}
 
Index: xfs/fs/xfs/xfs_dquot.h
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot.h	2012-03-25 16:46:10.207889804 +0200
+++ xfs/fs/xfs/xfs_dquot.h	2012-03-25 16:46:11.997889838 +0200
@@ -152,7 +152,6 @@ extern int		xfs_qm_dqget(xfs_mount_t *,
 extern void		xfs_qm_dqput(xfs_dquot_t *);
 
 extern void		xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *);
-extern void		xfs_dqflock_pushbuf_wait(struct xfs_dquot *dqp);
 
 static inline struct xfs_dquot *xfs_qm_dqhold(struct xfs_dquot *dqp)
 {
Index: xfs/fs/xfs/xfs_buf.h
===================================================================
--- xfs.orig/fs/xfs/xfs_buf.h	2012-03-25 16:41:00.317884058 +0200
+++ xfs/fs/xfs/xfs_buf.h	2012-03-25 16:46:11.997889838 +0200
@@ -50,8 +50,7 @@ typedef enum {
 #define XBF_MAPPED	(1 << 3) /* buffer mapped (b_addr valid) */
 #define XBF_ASYNC	(1 << 4) /* initiator will not wait for completion */
 #define XBF_DONE	(1 << 5) /* all pages in the buffer uptodate */
-#define XBF_DELWRI	(1 << 6) /* buffer has dirty pages */
-#define XBF_STALE	(1 << 7) /* buffer has been staled, do not find it */
+#define XBF_STALE	(1 << 6) /* buffer has been staled, do not find it */
 
 /* I/O hints for the BIO layer */
 #define XBF_SYNCIO	(1 << 10)/* treat this buffer as synchronous I/O */
@@ -66,7 +65,7 @@ typedef enum {
 /* flags used only internally */
 #define _XBF_PAGES	(1 << 20)/* backed by refcounted pages */
 #define _XBF_KMEM	(1 << 21)/* backed by heap memory */
-#define _XBF_DELWRI_Q	(1 << 22)/* buffer on delwri queue */
+#define _XBF_DELWRI_Q	(1 << 22)/* buffer on a delwri queue */
 
 typedef unsigned int xfs_buf_flags_t;
 
@@ -77,7 +76,6 @@ typedef unsigned int xfs_buf_flags_t;
 	{ XBF_MAPPED,		"MAPPED" }, \
 	{ XBF_ASYNC,		"ASYNC" }, \
 	{ XBF_DONE,		"DONE" }, \
-	{ XBF_DELWRI,		"DELWRI" }, \
 	{ XBF_STALE,		"STALE" }, \
 	{ XBF_SYNCIO,		"SYNCIO" }, \
 	{ XBF_FUA,		"FUA" }, \
@@ -89,10 +87,6 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
 
-typedef enum {
-	XBT_FORCE_FLUSH = 0,
-} xfs_buftarg_flags_t;
-
 typedef struct xfs_buftarg {
 	dev_t			bt_dev;
 	struct block_device	*bt_bdev;
@@ -102,12 +96,6 @@ typedef struct xfs_buftarg {
 	unsigned int		bt_sshift;
 	size_t			bt_smask;
 
-	/* per device delwri queue */
-	struct task_struct	*bt_task;
-	struct list_head	bt_delwri_queue;
-	spinlock_t		bt_delwri_lock;
-	unsigned long		bt_flags;
-
 	/* LRU control structures */
 	struct shrinker		bt_shrinker;
 	struct list_head	bt_lru;
@@ -151,7 +139,6 @@ typedef struct xfs_buf {
 	struct xfs_trans	*b_transp;
 	struct page		**b_pages;	/* array of page pointers */
 	struct page		*b_page_array[XB_PAGES]; /* inline pages */
-	unsigned long		b_queuetime;	/* time buffer was queued */
 	atomic_t		b_pin_count;	/* pin count */
 	atomic_t		b_io_remaining;	/* #outstanding I/O requests */
 	unsigned int		b_page_count;	/* size of page array */
@@ -221,24 +208,22 @@ static inline int xfs_buf_geterror(xfs_b
 extern xfs_caddr_t xfs_buf_offset(xfs_buf_t *, size_t);
 
 /* Delayed Write Buffer Routines */
-extern void xfs_buf_delwri_queue(struct xfs_buf *);
-extern void xfs_buf_delwri_dequeue(struct xfs_buf *);
-extern void xfs_buf_delwri_promote(struct xfs_buf *);
+extern bool xfs_buf_delwri_queue(struct xfs_buf *, struct list_head *);
+extern int xfs_buf_delwri_submit(struct list_head *);
+extern int xfs_buf_delwri_submit_nowait(struct list_head *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
 extern void xfs_buf_terminate(void);
 
 #define XFS_BUF_ZEROFLAGS(bp) \
-	((bp)->b_flags &= ~(XBF_READ|XBF_WRITE|XBF_ASYNC|XBF_DELWRI| \
+	((bp)->b_flags &= ~(XBF_READ|XBF_WRITE|XBF_ASYNC| \
 			    XBF_SYNCIO|XBF_FUA|XBF_FLUSH))
 
 void xfs_buf_stale(struct xfs_buf *bp);
 #define XFS_BUF_UNSTALE(bp)	((bp)->b_flags &= ~XBF_STALE)
 #define XFS_BUF_ISSTALE(bp)	((bp)->b_flags & XBF_STALE)
 
-#define XFS_BUF_ISDELAYWRITE(bp)	((bp)->b_flags & XBF_DELWRI)
-
 #define XFS_BUF_DONE(bp)	((bp)->b_flags |= XBF_DONE)
 #define XFS_BUF_UNDONE(bp)	((bp)->b_flags &= ~XBF_DONE)
 #define XFS_BUF_ISDONE(bp)	((bp)->b_flags & XBF_DONE)
@@ -288,7 +273,6 @@ extern xfs_buftarg_t *xfs_alloc_buftarg(
 extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
-extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
 
 #define xfs_getsize_buftarg(buftarg)	block_size((buftarg)->bt_bdev)
 #define xfs_readonly_buftarg(buftarg)	bdev_read_only((buftarg)->bt_bdev)
Index: xfs/fs/xfs/xfs_inode.h
===================================================================
--- xfs.orig/fs/xfs/xfs_inode.h	2012-03-25 16:41:21.647884454 +0200
+++ xfs/fs/xfs/xfs_inode.h	2012-03-25 16:46:11.997889838 +0200
@@ -529,7 +529,6 @@ int		xfs_iunlink(struct xfs_trans *, xfs
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
 void		xfs_iunpin_wait(xfs_inode_t *);
 int		xfs_iflush(struct xfs_inode *, struct xfs_buf **);
-void		xfs_promote_inode(struct xfs_inode *);
 void		xfs_lock_inodes(xfs_inode_t **, int, uint);
 void		xfs_lock_two_inodes(xfs_inode_t *, xfs_inode_t *, uint);
 
Index: xfs/fs/xfs/xfs_dquot.c
===================================================================
--- xfs.orig/fs/xfs/xfs_dquot.c	2012-03-25 16:46:10.207889804 +0200
+++ xfs/fs/xfs/xfs_dquot.c	2012-03-25 17:09:36.354582544 +0200
@@ -1005,39 +1005,6 @@ xfs_dqlock2(
 	}
 }
 
-/*
- * Give the buffer a little push if it is incore and
- * wait on the flush lock.
- */
-void
-xfs_dqflock_pushbuf_wait(
-	xfs_dquot_t	*dqp)
-{
-	xfs_mount_t	*mp = dqp->q_mount;
-	xfs_buf_t	*bp;
-
-	/*
-	 * Check to see if the dquot has been flushed delayed
-	 * write.  If so, grab its buffer and send it
-	 * out immediately.  We'll be able to acquire
-	 * the flush lock when the I/O completes.
-	 */
-	bp = xfs_incore(mp->m_ddev_targp, dqp->q_blkno,
-			mp->m_quotainfo->qi_dqchunklen, XBF_TRYLOCK);
-	if (!bp)
-		goto out_lock;
-
-	if (XFS_BUF_ISDELAYWRITE(bp)) {
-		if (xfs_buf_ispinned(bp))
-			xfs_log_force(mp, 0);
-		xfs_buf_delwri_promote(bp);
-		wake_up_process(bp->b_target->bt_task);
-	}
-	xfs_buf_relse(bp);
-out_lock:
-	xfs_dqflock(dqp);
-}
-
 int __init
 xfs_qm_init(void)
 {

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (8 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
@ 2012-03-27 16:44 ` Christoph Hellwig
  2012-04-13 11:37   ` Dave Chinner
  2012-03-28  0:53 ` [PATCH 00/10] remove xfsbufd Dave Chinner
  10 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-27 16:44 UTC (permalink / raw)
  To: xfs

[-- Attachment #1: xfs-cleanup-ail-comments --]
[-- Type: text/plain, Size: 1171 bytes --]

Signed-off-by: Christoph Hellwig <hch@lst.de>

---
 fs/xfs/xfs_trans_ail.c |   14 --------------
 1 file changed, 14 deletions(-)

Index: xfs/fs/xfs/xfs_trans_ail.c
===================================================================
--- xfs.orig/fs/xfs/xfs_trans_ail.c	2012-03-25 17:16:19.787923358 +0200
+++ xfs/fs/xfs/xfs_trans_ail.c	2012-03-25 17:16:37.094590345 +0200
@@ -760,20 +760,6 @@ xfs_trans_ail_delete_bulk(
 	}
 }
 
-/*
- * The active item list (AIL) is a doubly linked list of log
- * items sorted by ascending lsn.  The base of the list is
- * a forw/back pointer pair embedded in the xfs mount structure.
- * The base is initialized with both pointers pointing to the
- * base.  This case always needs to be distinguished, because
- * the base has no lsn to look at.  We almost always insert
- * at the end of the list, so on inserts we search from the
- * end of the list to find where the new item belongs.
- */
-
-/*
- * Initialize the doubly linked list to point only to itself.
- */
 int
 xfs_trans_ail_init(
 	xfs_mount_t	*mp)


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown
  2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
@ 2012-03-27 18:17   ` Mark Tinguely
  2012-04-13  9:36   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-03-27 18:17 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> If a filesystem has been forced shutdown we are never going to write dquots
> to disk, which means the dquot items will stay in the AIL forever.
> Currently that is not a problem, but a pending change requires us to
> empty the AIL before shutting down the filesystem, in which case this
> behaviour is lethal.  Make sure to remove the log item from the AIL
> to allow emptying the AIL on shutdown filesystems.
>
> Signed-off-by: Christoph Hellwig<hch@lst.de>

Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held
  2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
@ 2012-03-27 18:18   ` Mark Tinguely
  2012-04-13  9:42   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-03-27 18:18 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Provide a variant of xlog_assign_tail_lsn that has the AIL lock already
> held.  By doing so we do an additional atomic_read + atomic_set under
> the lock, which comes down to two instructions.
>
> Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the
> new version to reduce the number of lock roundtrips, and prepare for
> a new addition that would require a third lock roundtrip in
> xfs_trans_ail_delete_bulk.  This addition is also the reason for
> slightly rearranging the conditionals and relying on xfs_log_space_wake
> for checking that the filesystem has been shut down internally.
>
> Signed-off-by: Christoph Hellwig<hch@lst.de>


Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/10] remove xfsbufd
  2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
                   ` (9 preceding siblings ...)
  2012-03-27 16:44 ` [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c Christoph Hellwig
@ 2012-03-28  0:53 ` Dave Chinner
  2012-03-28 15:10   ` Christoph Hellwig
  10 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2012-03-28  0:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:00PM -0400, Christoph Hellwig wrote:
> Now that all dirty metadata is tracked in the AIL, and except for a few
> special cases only written through it, there is no point in keeping the
> current delayed buffers list and xfsbufd around.
> 
> This series removes a few more of the remaining special cases and then
> replaces the global delwri buffer list with a local on-stack one.  The
> main consumer is xfsaild, which is used more often now.
> 
> Besides removing a lot of code, this change reduces buffer cache lookups
> on loaded systems from xfsaild, because we can figure out that a buffer
> already is under writeback entirely locally now.

From a quick set of tests, I can't see any significant degradation
in IO patterns and performance under heavy load here with this
patch set. It doesn't, however, reduce the buffer cache lookups all
that much on such workloads - about 10% at most - as most of the
lookups come from the directory and inode buffer
modifications. Here's a sample profile:

-  10.09%  [kernel]  [k] _xfs_buf_find
   - _xfs_buf_find
      - 99.57% xfs_buf_get
         - 99.35% xfs_buf_read
            - 99.87% xfs_trans_read_buf
               + 50.36% xfs_da_do_buf
               + 26.12% xfs_btree_read_buf_block.constprop.24
               + 12.36% xfs_imap_to_bp.isra.9
               + 10.73% xfs_read_agi

This shows that 50% of the lookups come from the directory code, 25% from
the inode btree lookups, 12% from mapping inodes, and 10% from
reading the AGI buffer during inode allocation.

You know, I suspect that we could avoid almost all those AGI buffer
lookups by moving to a similar in-core log and flush technique that
the inodes use. We've already got all the information in the struct
xfs_perag - rearranging it to have "in-core on disk" structures
for the AGI, AGF and AGFL would make a lot of the "select an AG"
code much simpler than having to read and modify the AG buffers
directly. It might even be possible to do such a change without
needing to change the on-disk journal format for them...

I think I'll put that on my list of stuff to do - right next to
in-core unlinked inode lists....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/10] remove xfsbufd
  2012-03-28  0:53 ` [PATCH 00/10] remove xfsbufd Dave Chinner
@ 2012-03-28 15:10   ` Christoph Hellwig
  2012-03-29  0:52     ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-28 15:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Mar 28, 2012 at 11:53:37AM +1100, Dave Chinner wrote:
> in IO patterns and performance under heavy load here with this
> patch set. It doesn't, however, reduce the buffer cache lookups all
> that much on such workloads - about 10% at most - as most of the
> lookups come from the directory and inode buffer
> modifications. Here's a sample profile:

10% might not be extremely huge, but it's pretty significant.

> This shows that 50% of the lookups come from the directory code, 25% from
> the inode btree lookups, 12% from mapping inodes, and 10% from
> reading the AGI buffer during inode allocation. 
> 
> You know, I suspect that we could avoid almost all those AGI buffer
> lookups by moving to a similar in-core log and flush technique that
> the inodes use. We've already got all the information in the struct
> xfs_perag - rearranging it to have "in-core on disk" structures
> for the AGI, AGF and AGFL would make a lot of the "select an AG"
> code much simpler than having to read and modify the AG buffers
> directly. It might even be possible to do such a change without
> needing to change the on-disk journal format for them...
> 
> I think I'll put that on my list of stuff to do - right next to
> in-core unlinked inode lists....

Sounds fine.  A simple short-term fix might be to simply pin a reference
to the AGI buffers and add a pointer from struct xfs_perag to them.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/10] remove xfsbufd
  2012-03-28 15:10   ` Christoph Hellwig
@ 2012-03-29  0:52     ` Dave Chinner
  2012-03-29 19:38       ` Christoph Hellwig
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2012-03-29  0:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Mar 28, 2012 at 11:10:41AM -0400, Christoph Hellwig wrote:
> On Wed, Mar 28, 2012 at 11:53:37AM +1100, Dave Chinner wrote:
> > in IO patterns and performance under heavy load here with this
> > patch set. It doesn't, however, reduce the buffer cache lookups all
> > that much on such workloads - about 10% at most - as most of the
> > lookups come from the directory and inode buffer
> > modifications. Here's a sample profile:
> 
> 10% might not be extremely huge, but it's pretty significant.

Yes, I didn't mean to belittle the improvement it makes, as every
little bit helps, just that the buffer cache lookups are dominated
by other types of lookups.

> > This shows that 50% of the lookups come from the directory code, 25% from
> > the inode btree lookups, 12% from mapping inodes, and 10% from
> > reading the AGI buffer during inode allocation. 
> > 
> > You know, I suspect that we could avoid almost all those AGI buffer
> > lookups by moving to a similar in-core log and flush technique that
> > the inodes use. We've already got all the information in the struct
> > xfs_perag - rearranging it to have "in-core on disk" structures
> > for the AGI, AGF and AGFL would make a lot of the "select an AG"
> > code much simpler than having to read and modify the AG buffers
> > directly. It might even be possible to do such a change without
> > needing to change the on-disk journal format for them...

I just had a crazy thought - it would be relatively easy to make
object based caches for finding buffers. Add an rbtree root to
various structures (e.g. inode, AGI, AGF, etc) and index all the
buffers associated with the btrees on that object in the object
rbtree. Need to find a directory/bmapbt/attr buffer? look up the
rbtree on the inode. Need to find a freespace btree buffer? Look up
the rbtree on the AGF.

I suspect that this can be done without much API churn, and it would
remove the central per-AG buffer cache lookups for most operations.
Smaller caches means less lookup overhead for most operations - with
10-11% of CPU time being spent in lookups on an 8p machine, that's
almost an entire CPU worth of time being used. Hence reducing the
rbtree lookup and modification overhead should be a significant win.

Crazy idea, yes, but I'm going to think about it some more,
especially as the shrinker operates off the LRU and is entirely
independent of the rbtree indexing.....

> > I think I'll put that on my list of stuff to do - right next to
> > in-core unlinked inode lists....
> 
> Sounds fine.  A simple short-term fix might be to simply pin a reference
> to the AGI buffers and add a pointer from struct xfs_perag to them.

I'd prefer not to do that - filesystems with lots of AGs will then
pin significant amounts of memory that would otherwise be
reclaimable. Besides, I don't think the problem is that significant
to need immediate resolution in such a way.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 00/10] remove xfsbufd
  2012-03-29  0:52     ` Dave Chinner
@ 2012-03-29 19:38       ` Christoph Hellwig
  0 siblings, 0 replies; 42+ messages in thread
From: Christoph Hellwig @ 2012-03-29 19:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Mar 29, 2012 at 11:52:24AM +1100, Dave Chinner wrote:
> I suspect that this can be done without much API churn, and it would
> remove the central per-AG buffer cache lookups for most operations.
> Smaller caches means less lookup overhead for most operations - with
> 10-11% of CPU time being spent in lookups on an 8p machine, that's
> almost an entire CPU worth of time being used. Hence reducing the
> rbtree lookup and modification overhead should be a significant win.
> 
> Crazy idea, yes, but I'm going to think about it some more,
> especially as the shrinker operates off the LRU and is entirely
> independent of the rbtree indexing.....

Sounds like a good idea to me.  I think the biggest win will be to index
the directory blocks logically, e.g. have a tree hanging off the inode
to find them.  The other worthwhile optimization besides avoiding AGI buffer
lookups during inode scanning would be to do something more about inode
buffers - probably a tree that maps directly from inode number to the
backing buffer.

I'd probably add the secondary tree linkage directly to the buffer to
keep things simpler.  In fact I'm not even sure we need a secondary
linkage - at least from a quick look I can't see why we'd need to keep
buffers on the physically indexed per-ag rbtree if we can find them by
other means.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown
  2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
  2012-03-27 18:17   ` Mark Tinguely
@ 2012-04-13  9:36   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13  9:36 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:01PM -0400, Christoph Hellwig wrote:
> If a filesystem has been forced shutdown we are never going to write dquots
> to disk, which means the dquot items will stay in the AIL forever.
> Currently that is not a problem, but a pending change requires us to
> empty the AIL before shutting down the filesystem, in which case this
> behaviour is lethal.  Make sure to remove the log item from the AIL
> to allow emptying the AIL on shutdown filesystems.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush after a shutdown
  2012-03-27 16:44 ` [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush " Christoph Hellwig
@ 2012-04-13  9:37   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13  9:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:02PM -0400, Christoph Hellwig wrote:
> If a filesystem has been forced shutdown we are never going to write inodes
> to disk, which means the inode items will stay in the AIL until we free
> the inode. Currently that is not a problem, but a pending change requires us
> to empty the AIL before shutting down the filesystem. In that case leaving
> the inode in the AIL is lethal. Make sure to remove the log item from the AIL
> to allow emptying the AIL on shutdown filesystems.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Simplifies things a lot - abort immediately rather than leave it for
someone else to cleanup.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held
  2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
  2012-03-27 18:18   ` Mark Tinguely
@ 2012-04-13  9:42   ` Dave Chinner
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13  9:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:03PM -0400, Christoph Hellwig wrote:
> Provide a variant of xlog_assign_tail_lsn that has the AIL lock already
> held.  By doing so we do an additional atomic_read + atomic_set under
> the lock, which comes down to two instructions.
> 
> Switch xfs_trans_ail_update_bulk and xfs_trans_ail_delete_bulk to the
> new version to reduce the number of lock roundtrips, and prepare for
> a new addition that would require a third lock roundtrip in
> xfs_trans_ail_delete_bulk.  This addition is also the reason for
> slightly rearranging the conditionals and relying on xfs_log_space_wake
> for checking that the filesystem has been shut down internally.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good, and will be slightly more efficient, too.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
@ 2012-04-13 10:04   ` Dave Chinner
  2012-04-16 13:33   ` Mark Tinguely
  2012-04-16 13:47   ` Mark Tinguely
  2 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 10:04 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:04PM -0400, Christoph Hellwig wrote:
> Now that we write back all metadata either synchronously or through the AIL
> we can simply implement metadata freezing in terms of emptying the AIL.
> 
> The implementation for this is fairly simple and straightforward:  A new
> routine is added that increments a counter that tells xfsaild to not stop
> until the AIL is empty and then waits on a wakeup from
> xfs_trans_ail_delete_bulk to signal that the AIL is empty.
> 
> As usual the devil is in the details, in this case the filesystem shutdown
> code.  Currently we are a bit sloppy there and do not continue ail pushing
> in that case, and thus never reach the code in the log item implementations
> that can unwind in case of a shutdown filesystem.  Also the code to 
> abort inode and dquot flushes was rather sloppy before and did not remove
> the log items from the AIL, which had to be fixed as well.

Probably don't need this bit in the commit message - the previous
commits kind of explain the reason....

> Also treat unmount the same way as freeze now, except that we still keep a
> synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.

Actually, I think we need an inode reclaim pass when freezing, too,
otherwise the shrinker or background reclaim will get stuck trying
to reclaim them.

.....

> -STATIC void
> -xfs_quiesce_fs(
> -	struct xfs_mount	*mp)
> -{
> -	int	count = 0, pincount;
> -
> -	xfs_reclaim_inodes(mp, 0);
> -	xfs_flush_buftarg(mp->m_ddev_targp, 0);

here's where we used to do inode reclaim during a freeze...

....
> @@ -421,8 +342,8 @@ xfs_quiesce_attr(
>  	while (atomic_read(&mp->m_active_trans) > 0)
>  		delay(100);
>  
> -	/* flush inodes and push all remaining buffers out to disk */
> -	xfs_quiesce_fs(mp);
> +	/* flush all pending changes from the AIL */
> +	xfs_ail_push_all_sync(mp->m_ail);

and now that doesn't happen. I think we still need the reclaim pass
here...

> @@ -397,6 +396,15 @@ xfsaild_push(
>  	XFS_STATS_INC(xs_push_ail);
>  
>  	/*
> +	 * If we are draining the AIL push all items, not just the current
> +	 * threshold.
> +	 */
> +	if (atomic_read(&ailp->xa_wait_empty))
> +		target = xfs_ail_max(ailp)->li_lsn;
> +	else
> +		target = ailp->xa_target;
> +

I'm not sure this is the best way to do this. Effectively you've
implemented xfs_ail_push_all() differently, and added a new counter
to do it.

....
> @@ -611,6 +614,34 @@ xfs_ail_push_all(
>  }
>  
>  /*
> + * Push out all items in the AIL immediately and wait until the AIL is empty.
> + */
> +void
> +xfs_ail_push_all_sync(
> +	struct xfs_ail  *ailp)
> +{
> +	DEFINE_WAIT(wait);
> +
> +	/*
> +	 * We use a counter instead of a flag here to support multiple
> +	 * processes calling into sync at the same time.
> +	 */
> +	atomic_inc(&ailp->xa_wait_empty);

if we just set the target here appropriately, we don't need the
atomic counter, just:

	do {
		prepare_to_wait()
		ailp->xa_target = xfs_ail_max(ailp)->li_lsn;
		wake_up_process(ailp->xa_task);
		if (!xfs_ail_min_lsn(ailp))
			break;
		schedule();
	} while (xfs_ail_min_lsn(ailp));


All the other changes look OK.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 05/10] xfs: do flush inodes from background inode reclaim
  2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
@ 2012-04-13 10:14   ` Dave Chinner
  2012-04-16 19:25   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 10:14 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:05PM -0400, Christoph Hellwig wrote:
> We already flush dirty inodes through the AIL regularly; there is no reason
> to have a second thread compete with it and disturb the I/O pattern.  We still
> do write inodes when doing a synchronous reclaim from the shrinker or during
> unmount for now.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

I think the subject line should say "don't" rather than "do".

....
> -
> -	/*
> -	 * When we have to flush an inode but don't have SYNC_WAIT set, we
> -	 * flush the inode out using a delwri buffer and wait for the next
> -	 * call into reclaim to find it in a clean state instead of waiting for
> -	 * it now. We also don't return errors here - if the error is transient
> -	 * then the next reclaim pass will flush the inode, and if the error
> -	 * is permanent then the next sync reclaim will reclaim the inode and
> -	 * pass on the error.
> -	 */
> -	if (error && error != EAGAIN && !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> -		xfs_warn(ip->i_mount,
> -			"inode 0x%llx background reclaim flush failed with %d",
> -			(long long)ip->i_ino, error);
> -	}
> -out:
> -	xfs_iflags_clear(ip, XFS_IRECLAIM);
> -	xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -	/*
> -	 * We could return EAGAIN here to make reclaim rescan the inode tree in
> -	 * a short while. However, this just burns CPU time scanning the tree
> -	 * waiting for IO to complete and xfssyncd never goes back to the idle
> -	 * state. Instead, return 0 to let the next scheduled background reclaim
> -	 * attempt to reclaim the inode again.
> -	 */
> -	return 0;

Getting rid of this mess is great. Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/10] xfs: do not write the buffer from xfs_iflush
  2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
@ 2012-04-13 10:31   ` Dave Chinner
  2012-04-18 13:33   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 10:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:06PM -0400, Christoph Hellwig wrote:
> Instead of writing the buffer directly from inside xfs_iflush return it to
> the caller and let the caller decide what to do with the buffer.  Also
> remove the pincount check in xfs_iflush that all non-blocking callers already
> implement and the now unused flags parameter.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush
  2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
@ 2012-04-13 10:33   ` Dave Chinner
  2012-04-18 21:11   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 10:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:07PM -0400, Christoph Hellwig wrote:
> Instead of writing the buffer directly from inside xfs_qm_dqflush return it
> to the caller and let the caller decide what to do with the buffer.  Also
> remove the pincount check in xfs_qm_dqflush that all non-blocking callers
> already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY
> check that all callers already perform.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks ok.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed
  2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
@ 2012-04-13 10:35   ` Dave Chinner
  2012-04-18 21:11   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 10:35 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:08PM -0400, Christoph Hellwig wrote:
> Instead of adding buffers to the delwri list as soon as they are logged,
> even if they can't be written until commited because they are pinned
> defer adding them to the delwri list until xfsaild pushes them.  This
> makes the code more similar to other log items and prepares for writing
> buffers directly from xfsaild.
> 
> The complication here is that we need to fail buffers that were added
> but not logged yet in xfs_buf_item_unpin, borrowing code from
> xfs_bioerror.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 09/10] xfs: on-stack delayed write buffer lists
  2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
@ 2012-04-13 11:37   ` Dave Chinner
  2012-04-20 18:19   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 11:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:09PM -0400, Christoph Hellwig wrote:
> Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
> and write back the buffers per-process instead of by waking up xfsbufd.
> 
> This is now easily doable given that we have very few places left that write
> delwri buffers:
> 
>  - log recovery:
> 	Only done at mount time, and already forcing out the buffers
> 	synchronously using xfs_flush_buftarg
> 
>  - quotacheck:
> 	Same story.
> 
>  - dquot reclaim:
> 	Writes out dirty dquots on the LRU under memory pressure.  We might
> 	want to look into doing more of this via xfsaild, but it's already
> 	more optimal than the synchronous inode reclaim that writes each
> 	buffer synchronously.
> 
>  - xfsaild:
> 	This is the main beneficiary of the change.  By keeping a local list
> 	of buffers to write we reduce latency of writing out buffers, and
> 	more importantly we can remove all the delwri list promotions which
> 	were hitting the buffer cache hard under sustained metadata loads.
> 
> The implementation is straightforward: xfs_buf_delwri_queue now takes a
> new list_head pointer that it adds the delwri buffers to, and all callers
> need to eventually submit the list using xfs_buf_delwri_submit or
> xfs_buf_delwri_submit_nowait.  Buffers that are already on a delwri list
> are skipped in xfs_buf_delwri_queue, on the assumption that the other
> delwri list will write them out.  The biggest change needed to pass down
> the buffer list was to the AIL pushing.  Now that we operate on buffers,
> the trylock, push and pushbuf log item methods are merged into a single
> push routine, which tries to lock the item and, if possible, adds the
> buffer that needs writeback to the buffer list.  This leads to much
> simpler code than the previous split, but requires the individual
> IOP_PUSH instances to unlock and reacquire the AIL around calls to
> blocking routines.
> 
> Given that xfsaild now also handles writing out buffers, the conditions
> for log forcing and the sleep times needed some small changes.  The most
> important one is that we consider the AIL busy as long as we still have
> buffers to push; the other is that we do increment the pushed LSN for
> buffers that are being flushed at this moment, but still count them towards
> the stuck items for restart purposes.  Without this we could hammer on stuck
> items without ever forcing the log and fail to make progress under heavy
> random delete workloads on fast flash storage devices.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Log recovery changes look OK - makes it quite obvious that the
buffers are all submitted at once. That's probably good - recovery
can do lots of reading to bring objects in, so avoiding writing at
the same time will help speed that up.

> Index: xfs/fs/xfs/xfs_buf.c
> ===================================================================
> --- xfs.orig/fs/xfs/xfs_buf.c	2012-03-25 16:41:00.144550722 +0200
> +++ xfs/fs/xfs/xfs_buf.c	2012-03-25 17:11:17.154584413 +0200
> @@ -42,7 +42,6 @@
>  #include "xfs_trace.h"
>  
>  static kmem_zone_t *xfs_buf_zone;
> -STATIC int xfsbufd(void *);
>  
>  static struct workqueue_struct *xfslogd_workqueue;
>  
> @@ -144,8 +143,11 @@ void
>  xfs_buf_stale(
>  	struct xfs_buf	*bp)
>  {
> +	ASSERT(xfs_buf_islocked(bp));
> +
>  	bp->b_flags |= XBF_STALE;
> -	xfs_buf_delwri_dequeue(bp);
> +	bp->b_flags &= ~_XBF_DELWRI_Q;
> +

The reason for clearing the DELWRI_Q flag is not obvious here.
Perhaps a comment to say that the delwri list has a reference and
clearing the flag will ensure that it does not write it out?

.....

> -void
> -xfs_buf_delwri_promote(
> -	struct xfs_buf	*bp)
> -{
> -	struct xfs_buftarg *btp = bp->b_target;
> -	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
> -
> -	ASSERT(bp->b_flags & XBF_DELWRI);
> -	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> +	trace_xfs_buf_delwri_queue(bp, _RET_IP_);
>  
>  	/*
> -	 * Check the buffer age before locking the delayed write queue as we
> -	 * don't need to promote buffers that are already past the flush age.
> +	 * If a buffer gets written out synchronously while it is on a delwri
> +	 * list we lazily remove it, aka only the _XBF_DELWRI_Q flag gets

                                     - the write will clear the _XBF_DELWRI_Q flag

....

> +static int
> +__xfs_buf_delwri_submit(
> +	struct list_head	*submit_list,
> +	struct list_head	*list,
> +	bool			wait)

It might be worth adding a comment describing the way the lists are
used. I was a little confused about the names of them - @submit_list
is the list of buffers that IO was started on, but @list is the list
of buffers that we are submitting for write processing. Perhaps just
naming them better will avoid that confusion in future. e.g. io_list
for the list of buffers we started IO on, buffer_list for the
incoming buffer list that we need to process?

.....
>  /*
> - *	Go through all incore buffers, and release buffers if they belong to
> - *	the given device. This is used in filesystem error handling to
> - *	preserve the consistency of its metadata.
> + * Write out a buffer list asynchronously.
> + *
> + * This will take the buffer list, write all non-locked and non-pinned buffers

And the incoming list is called the buffer list here, so maybe
naming it that is best, and using it consistently for all 3 delwri
submit functions...

....

> @@ -989,20 +938,27 @@ xfs_buf_iodone_callbacks(
>  	 * If the write was asynchronous then no one will be looking for the
>  	 * error.  Clear the error state and write the buffer out again.
>  	 *
> -	 * During sync or umount we'll write all pending buffers again
> -	 * synchronous, which will catch these errors if they keep hanging
> -	 * around.
> +	 * XXX: This helps against transient write errors, but we need to find
> +	 * a way to shut the filesystem down if the writes keep failing.
> +	 *
> +	 * In practice we'll shut the filesystem down soon, as non-transient
> +	 * errors tend to affect the whole device and a failing log write
> +	 * will make us give up.  But we really ought to do better here.
>  	 */
>  	if (XFS_BUF_ISASYNC(bp)) {
> +		ASSERT(bp->b_iodone != NULL);
> +
> +		trace_xfs_buf_item_iodone_async(bp, _RET_IP_);
> +
>  		xfs_buf_ioerror(bp, 0); /* errno of 0 unsets the flag */
>  
>  		if (!XFS_BUF_ISSTALE(bp)) {
> -			xfs_buf_delwri_queue(bp);
> -			XFS_BUF_DONE(bp);
> +			bp->b_flags |= XBF_WRITE | XBF_ASYNC | XBF_DONE;
> +			xfs_bdstrat_cb(bp);

I don't think this is an equivalent transformation.

This will just resubmit the IO immediately after it is failed, while
previously it will only be pushed again after it ages out (15s
later). Perhaps it can just be left to be pushed by the aild next
time it passes over it?

> + * There isn't much you can do to push on an efd item.  It is simply stuck
> + * waiting for the log to be flushed to disk.
>   */
>  STATIC uint
> -xfs_efd_item_trylock(
> -	struct xfs_log_item	*lip)
> +xfs_efd_item_push(
> +	struct xfs_log_item	*lip,
> +	struct list_head	*buffer_list)
>  {
>  	return XFS_ITEM_LOCKED;

Perhaps that should actually be XFS_ITEM_PINNED, like the efi item.

> Index: xfs/fs/xfs/xfs_trans_ail.c

The aild pushing changes look OK.

> @@ -547,6 +527,8 @@ xfsaild(
>  	struct xfs_ail	*ailp = data;
>  	long		tout = 0;	/* milliseconds */
>  
> +	current->flags |= PF_MEMALLOC;
> +
>  	while (!kthread_should_stop()) {
>  		if (tout && tout <= 20)
>  			__set_current_state(TASK_KILLABLE);

I'm not sure that PF_MEMALLOC is really necessary for the aild. Is
there any particular reason for adding the flag here?

> @@ -183,7 +171,7 @@ xfs_qm_dqpurge(
>  		 * to purge this dquot anyway, so we go ahead regardless.
>  		 */
>  		error = xfs_qm_dqflush(dqp, &bp);
> -		if (error)
> +		if (error) {
>  			xfs_warn(mp, "%s: dquot %p flush failed",
>  				__func__, dqp);
>  		} else {

I think that's fixing a problem from a previous patch that I
missed....

Otherwise it looks fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c
  2012-03-27 16:44 ` [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c Christoph Hellwig
@ 2012-04-13 11:37   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-13 11:37 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Tue, Mar 27, 2012 at 12:44:10PM -0400, Christoph Hellwig wrote:
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
  2012-04-13 10:04   ` Dave Chinner
@ 2012-04-16 13:33   ` Mark Tinguely
  2012-04-16 13:47   ` Mark Tinguely
  2 siblings, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-16 13:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
  2012-04-13 10:04   ` Dave Chinner
  2012-04-16 13:33   ` Mark Tinguely
@ 2012-04-16 13:47   ` Mark Tinguely
  2012-04-16 23:54     ` Dave Chinner
  2 siblings, 1 reply; 42+ messages in thread
From: Mark Tinguely @ 2012-04-16 13:47 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Now that we write back all metadata either synchronously or through the AIL,
> we can simply implement metadata freezing in terms of emptying the AIL.
>
> The implementation for this is fairly simple and straightforward:  A new
> routine is added that increments a counter telling xfsaild not to stop
> until the AIL is empty, and then waits for a wakeup from
> xfs_trans_ail_delete_bulk signalling that the AIL is empty.
>
> As usual the devil is in the details, in this case the filesystem shutdown
> code.  Currently we are a bit sloppy there and do not continue AIL pushing
> in that case, and thus never reach the code in the log item implementations
> that can unwind in case of a shutdown filesystem.  Also the code to
> abort inode and dquot flushes was rather sloppy before and did not remove
> the log items from the AIL, which had to be fixed as well.
>
> Also treat unmount the same way as freeze now, except that we still keep a
> synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
>
> As an upside we can now remove the radix tree based inode writeback and
> xfs_unmountfs_writesb.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Sorry for the empty email.

This series hangs my test boxes, and this patch is the first one in the 
series to trigger the hang. After a reboot with patch 4 removed, the tests 
are successful.

The machine is still responsive. Only the SCRATCH filesystem from the 
test suite is hung.

Per Dave's observation, I added a couple of inode reclaims to this patch 
and the test gets further (hanging on run 9 of test 068 rather than run 3).

The backtraces below are from a Linux 3.4-rc2 kernel with just patches 0-4 
of this series applied; this run does not have the extra inode reclaims. 
The hang is in test 068. I did an ls and a sync on the filesystem, so I 
included their tracebacks as well.  The system was still live when these 
were captured.

I have looked at the remaining patches in the series, but have not 
reviewed them because they depend on this patch...

--Mark.
---

crash> bt -f 20050
PID: 20050  TASK: ffff88034a6943c0  CPU: 0   COMMAND: "fsstress"
  #0 [ffff88034aa93d18] __schedule at ffffffff81416e50
     ffff88034aa93d20: 0000000000000082 ffff88034aa92010
     ffff88034aa93d30: 0000000000012400 0000000000012400
     ffff88034aa93d40: 0000000000012400 0000000000012400
     ffff88034aa93d50: ffff88034aa93fd8 ffff88034aa93fd8
     ffff88034aa93d60: 0000000000012400 ffff88034a6943c0
     ffff88034aa93d70: ffffffff81813020 ffff88034a6a4060
     ffff88034aa93d80: 0000000000000029 ffff88034aa93df8
     ffff88034aa93d90: ffffffff811167fd 80000002b65ff065
     ffff88034aa93da0: ffff88035fc92478 ffff88034c33a018
     ffff88034aa93db0: 000000000060c048 ffffea000b847410
     ffff88034aa93dc0: ffff88034c6a5680 ffffea000b8473e0
     ffff88034aa93dd0: ffff88034c33a018 000000000060c048
     ffff88034aa93de0: ffff88034c6a5680 ffff88034b66a558
     ffff88034aa93df0: 0000000000000029 ffff88034aa93e38
     ffff88034aa93e00: ffffffff81116a1d ffff88034ad56080
     ffff88034aa93e10: ffff88034ad56080 ffff88034aa93ee8
     ffff88034aa93e20: 0000000000000000 ffff88034a6943c0
     ffff88034aa93e30: ffff88034a6943b0 ffff88034a694888
     ffff88034aa93e40: ffff88034aa93ee8 ffff88034a6943c0
     ffff88034aa93e50: ffff88034a6943c0 ffff88034aa93e68
     ffff88034aa93e60: ffffffff814171c4
  #1 [ffff88034aa93e60] schedule at ffffffff814171c4
     ffff88034aa93e68: ffff88034aa93ed8 ffffffff81040e39
  #2 [ffff88034aa93e70] do_wait at ffffffff81040e39
     ffff88034aa93e78: 0000000000000000 ffff88034a6943c0
     ffff88034aa93e88: ffff88034a6943c0 ffff88034aa93f10
     ffff88034aa93e98: ffff88034a6943c0 ffff88034aa93f30
     ffff88034aa93ea8: ffff88034a6948f0 ffffffffffffffea
     ffff88034aa93eb8: 0000000000000004 0000000000000000
     ffff88034aa93ec8: 0000000000000000 00007fff47fe1c2c
     ffff88034aa93ed8: ffff88034aa93f78 ffffffff81040f11
  #3 [ffff88034aa93ee0] sys_wait4 at ffffffff81040f11
     ffff88034aa93ee8: 0000000400000003 0000000000000000
     ffff88034aa93ef8: 0000000000000000 00007fff47fe1c2c
     ffff88034aa93f08: 0000000000000000 00007fff00000000
     ffff88034aa93f18: ffff88034a6943c0 ffffffff8103f510
     ffff88034aa93f28: ffff88034baa2098 ffff88034baa2098
     ffff88034aa93f38: 0000000000000000 00007fff47fe1c00
     ffff88034aa93f48: 0000000000000000 00007fff47fe1c2c
     ffff88034aa93f58: 00007fff47fe1b50 0000000000000003
     ffff88034aa93f68: 0000000000000000 00007fff47fe1c00
     ffff88034aa93f78: 0000000000000002 ffffffff8141fff9
  #4 [ffff88034aa93f80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007fbe427e5244  RSP: 00007fff47fdfab0  RFLAGS: 00010246
     RAX: 000000000000003d  RBX: ffffffff8141fff9  RCX: 00007fff47fdfa50
     RDX: 0000000000000000  RSI: 00007fff47fe1c2c  RDI: ffffffffffffffff
     RBP: 0000000000000002   R8: 0000000000004e52   R9: 0000000000004e52
     R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff47fe1c00
     R13: 0000000000000000  R14: 0000000000000003  R15: 00007fff47fe1b50
     ORIG_RAX: 000000000000003d  CS: 0033  SS: 002b

PID: 20051  TASK: ffff88034e31e600  CPU: 3   COMMAND: "fsstress"
  #0 [ffff88034c5c1c08] __schedule at ffffffff81416e50
     ffff88034c5c1c10: 0000000000000086 ffff88034c5c0010
     ffff88034c5c1c20: 0000000000012400 0000000000012400
     ffff88034c5c1c30: 0000000000012400 0000000000012400
     ffff88034c5c1c40: ffff88034c5c1fd8 ffff88034c5c1fd8
     ffff88034c5c1c50: 0000000000012400 ffff88034e31e600
     ffff88034c5c1c60: ffff88034fa12580 8080808080808080
     ffff88034c5c1c70: fefefefefefefeff 000000010000002e
     ffff88034c5c1c80: ffff88034c312000 ffff88034c5c1cd8
     ffff88034c5c1c90: ffffffff8115a045 ffff8802b7ae1324
     ffff88034c5c1ca0: ffff88034f4b2ac0 ffff88034c5c1cd8
     ffff88034c5c1cb0: ffffffff811580c2 0000000000000041
     ffff88034c5c1cc0: 0000000000001051 0000000000000000
     ffff88034c5c1cd0: ffff88034c5c1db8 ffff88034c5c1d68
     ffff88034c5c1ce0: ffffffff8115c4a4 0000000000000000
     ffff88034c5c1cf0: ffff88034c5c1dc8 ffff88034c5c1d08
     ffff88034c5c1d00: ffffffff8116b49c ffff88034c5c1d28
     ffff88034c5c1d10: 0000000000000246 ffff88034c5c1d58
     ffff88034c5c1d20: ffff88034c5c1d88 0000000000013160
     ffff88034c5c1d30: ffff88034c5c1df8 ffff88034c5c1ed8
     ffff88034c5c1d40: 00000000001b90b8 ffff88034c5c1d58
     ffff88034c5c1d50: ffffffff814171c4
  #1 [ffff88034c5c1d50] schedule at ffffffff814171c4
     ffff88034c5c1d58: ffff88034c5c1de8 ffffffffa044d4b5
  #2 [ffff88034c5c1d60] xfs_file_aio_write at ffffffffa044d4b5 [xfs]
     ffff88034c5c1d68: ffff88034f4b2ac0 ffff8802b7ae11f8
     ffff88034c5c1d78: ffff8802b7ae10c0 0000000000000001
     ffff88034c5c1d88: 0000000000000000 ffff88034e31e600
     ffff88034c5c1d98: ffffffff8105e3e0 ffff88034be7aeb0
     ffff88034c5c1da8: ffff88034b84f918 0000000000017777
     ffff88034c5c1db8: ffff88034e3d46a0 ffff88034c5c1df8
     ffff88034c5c1dc8: ffff88034c5c1ed8 ffff88034f4b2ac0
     ffff88034c5c1dd8: ffff88034c5c1f48 0000000000000000
     ffff88034c5c1de8: ffff88034c5c1f08 ffffffff8114d3d9
  #3 [ffff88034c5c1df0] do_sync_write at ffffffff8114d3d9
     ffff88034c5c1df8: 0000000000000002 0000000000000001
     ffff88034c5c1e08: 0000000000000000 ffffffff00000001
     ffff88034c5c1e18: ffff88034f4b2ac0 0000000000000000
     ffff88034c5c1e28: 0000000000000000 0000000000000000
     ffff88034c5c1e38: 0000000000000000 ffff88034e31e600
     ffff88034c5c1e48: 0000000000000000 00000000001b90b8
     ffff88034c5c1e58: 0000000000000808 0000000000000098
     ffff88034c5c1e68: 0000000000017777 00000000000081b6
     ffff88034c5c1e78: 0000000000017777 0000000000000000
     ffff88034c5c1e88: 000000000019b7bd 0000000000001000
     ffff88034c5c1e98: ffff88034c5c1ea8 ffffffff811ffcd3
     ffff88034c5c1ea8: ffff88034c5c1ed8 ffffffff811db75d
     ffff88034c5c1eb8: 0000000000017777 ffff88034f4b2ac0
     ffff88034c5c1ec8: 0000000000000001 00007fbe3c000d10
     ffff88034c5c1ed8: 00007fbe3c000d10 0000000000017777
     ffff88034c5c1ee8: 0000000000017777 ffff88034f4b2ac0
     ffff88034c5c1ef8: ffff88034c5c1f48 00007fbe3c000d10
     ffff88034c5c1f08: ffff88034c5c1f38 ffffffff8114da0b
  #4 [ffff88034c5c1f10] vfs_write at ffffffff8114da0b
     ffff88034c5c1f18: ffff88034f4b2ac0 fffffffffffffff7
     ffff88034c5c1f28: 0000000000017777 00007fbe3c000d10
     ffff88034c5c1f38: ffff88034c5c1f78 ffffffff8114db60
  #5 [ffff88034c5c1f40] sys_write at ffffffff8114db60
     ffff88034c5c1f48: 00000000001b90b8 0000000000001000
     ffff88034c5c1f58: 00007fbe3c000d10 00007fff47fdfa20
     ffff88034c5c1f68: 0000000000000003 0000000000000085
     ffff88034c5c1f78: 0000000000017777 ffffffff8141fff9
  #6 [ffff88034c5c1f80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007fbe427e46f0  RSP: 00007fff47fde6b8  RFLAGS: 00010246
     RAX: 0000000000000001  RBX: ffffffff8141fff9  RCX: 0000000000000000
     RDX: 0000000000017777  RSI: 00007fbe3c000d10  RDI: 0000000000000003
     RBP: 0000000000017777   R8: 0000000000000077   R9: 0000000000200000
     R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000085
     R13: 0000000000000003  R14: 00007fff47fdfa20  R15: 00007fbe3c000d10
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

PID: 20052  TASK: ffff88034ad56080  CPU: 3   COMMAND: "fsstress"
  #0 [ffff88034a88fbb8] __schedule at ffffffff81416e50
     ffff88034a88fbc0: 0000000000000086 ffff88034a88e010
     ffff88034a88fbd0: 0000000000012400 0000000000012400
     ffff88034a88fbe0: 0000000000012400 0000000000012400
     ffff88034a88fbf0: ffff88034a88ffd8 ffff88034a88ffd8
     ffff88034a88fc00: 0000000000012400 ffff88034ad56080
     ffff88034a88fc10: ffff88034fa12580 0000000000000001
     ffff88034a88fc20: ffff88034a88fc60 ffffffff81075faa
     ffff88034a88fc30: ffff88034a88fcd0 ffffffff810017ef
     ffff88034a88fc40: ffff88034ad56080 ffff88034fa12bd8
     ffff88034a88fc50: 000000034a66e288 ffff88035fcd2478
     ffff88034a88fc60: ffff88034a88fc70 ffff88034ad566d8
     ffff88034a88fc70: ffff88034a88fca0 ffffffff81072d2f
     ffff88034a88fc80: ffff88034b65e2c8 ffff88034a88fcc8
     ffff88034a88fc90: ffffffff810732a8 ffff88035fcd2e40
     ffff88034a88fca0: ffff88034b65e2c8 ffff88035fc52478
     ffff88034a88fcb0: 0000000000000001 0000000000000001
     ffff88034a88fcc0: 0000000000000004 ffff88034a88fcf8
     ffff88034a88fcd0: 7fffffffffffffff ffff88034a88fe98
     ffff88034a88fce0: 7fffffffffffffff ffff88034ad56080
     ffff88034a88fcf0: 0000000000000000 ffff88034a88fd08
     ffff88034a88fd00: ffffffff814171c4
  #1 [ffff88034a88fd00] schedule at ffffffff814171c4
     ffff88034a88fd08: ffff88034a88fda8 ffffffff81415455
  #2 [ffff88034a88fd10] schedule_timeout at ffffffff81415455
     ffff88034a88fd18: ffff88035fc52400 0000000000000005
     ffff88034a88fd28: ffff88034a88fd58 ffffffff8106c2a1
     ffff88034a88fd38: ffff88034a88fd58 ffffffff81069895
     ffff88034a88fd48: ffff88035fc52400 ffff88034b65e280
     ffff88034a88fd58: ffff88034a88fd88 ffffffff81069918
     ffff88034a88fd68: ffff88034b65e280 ffff88035fc52400
     ffff88034a88fd78: 0000000000000000 7fffffffffffffff
     ffff88034a88fd88: ffff88034a88fe98 ffff88034a88fea0
     ffff88034a88fd98: ffff88034ad56080 0000000000000000
     ffff88034a88fda8: ffff88034a88fe38 ffffffff814166b7
  #3 [ffff88034a88fdb0] wait_for_common at ffffffff814166b7
     ffff88034a88fdb8: ffff88034a88fe08 ffff88034ad56080
     ffff88034a88fdc8: 0000000200000000 0000000000000002
     ffff88034a88fdd8: 0000000000000001 ffff88034ad56080
     ffff88034a88fde8: ffffffff810702d0 ffff88034a88fea8
     ffff88034a88fdf8: ffff88034a88fea8 0000000000000246
     ffff88034a88fe08: ffff88034a88fe18 ffff88034be7ac00
     ffff88034a88fe18: ffff88034a88fe58 ffff88034a88fe98
     ffff88034a88fe28: ffff88034a88ff6c ffffffff8117a4b0
     ffff88034a88fe38: ffff88034a88fe48 ffffffff81416828
  #4 [ffff88034a88fe40] wait_for_completion at ffffffff81416828
     ffff88034a88fe48: ffff88034a88fed8 ffffffff81174eaa
  #5 [ffff88034a88fe50] sync_inodes_sb at ffffffff81174eaa
     ffff88034a88fe58: 7fffffffffffffff ffff88034be7ac00
     ffff88034a88fe68: ffff88034b84fd90 0000000000000001
     ffff88034a88fe78: 0000000000000002 ffff88034a88fe80
     ffff88034a88fe88: ffff88034a88fe80 ffff88034a88fe98
     ffff88034a88fe98: 0000000000000000 0000000000010001
     ffff88034a88fea8: ffff88034a88fdf0 ffff88034a88fdf0
     ffff88034a88feb8: ffffffff8123fd64 ffff88034be7ac00
     ffff88034a88fec8: 0000000000000001 ffff88034b5a5000
     ffff88034a88fed8: ffff88034a88fef8 ffffffff8117a4a0
  #6 [ffff88034a88fee0] __sync_filesystem at ffffffff8117a4a0
     ffff88034a88fee8: ffff88034be7ac00 ffff88034be7ac68
     ffff88034a88fef8: ffff88034a88ff08 ffffffff8117a4c7
  #7 [ffff88034a88ff00] sync_one_sb at ffffffff8117a4c7
     ffff88034a88ff08: ffff88034a88ff48 ffffffff8115126b
  #8 [ffff88034a88ff10] iterate_supers at ffffffff8115126b
     ffff88034a88ff18: ffff88034a88ff48 ffff88034a88ff6c
     ffff88034a88ff28: 0000000051eb851f 0000000000000003
     ffff88034a88ff38: 0000000000000000 00007fff47fe1c00
     ffff88034a88ff48: ffff88034a88ff78 ffffffff8117a515
  #9 [ffff88034a88ff50] sys_sync at ffffffff8117a515
     ffff88034a88ff58: 0000000000000003 000000000000006c
     ffff88034a88ff68: 0000000100000003 0000000000000072
     ffff88034a88ff78: 0000000000000072 ffffffff8141fff9
#10 [ffff88034a88ff80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007fbe42532fe7  RSP: 00007fff47fde8e8  RFLAGS: 00010246
     RAX: 00000000000000a2  RBX: ffffffff8141fff9  RCX: 0000000000000000
     RDX: 0000000000000073  RSI: 000000003532c506  RDI: 0000000000000072
     RBP: 0000000000000072   R8: 0000000064264f93   R9: 00007fbe3c000078
     R10: 0000000000000000  R11: 0000000000000206  R12: 0000000000000072
     R13: 0000000100000003  R14: 000000000000006c  R15: 0000000000000003
     ORIG_RAX: 00000000000000a2  CS: 0033  SS: 002b

PID: 20089  TASK: ffff88034c5ca340  CPU: 2   COMMAND: "xfs_freeze"
  #0 [ffff88034aaafd18] __schedule at ffffffff81416e50
     ffff88034aaafd20: 0000000000000086 ffff88034aaae010
     ffff88034aaafd30: 0000000000012400 0000000000012400
     ffff88034aaafd40: 0000000000012400 0000000000012400
     ffff88034aaafd50: ffff88034aaaffd8 ffff88034aaaffd8
     ffff88034aaafd60: 0000000000012400 ffff88034c5ca340
     ffff88034aaafd70: ffff88034f9d6440 ffffffff810017ef
     ffff88034aaafd80: ffff88034c5ca340 ffff88034b5f49d8
     ffff88034aaafd90: 000000024b5f43c8 ffff88035fc92478
     ffff88034aaafda0: ffff88034aaafdb0 ffff88034c5ca998
     ffff88034aaafdb0: ffff88034aaafde0 ffffffff81072d2f
     ffff88034aaafdc0: ffff88034e603728 ffff88035fc92478
     ffff88034aaafdd0: ffff88034b5f43c8 ffff88034b5f43c8
     ffff88034aaafde0: ffff88034aaafe20 ffff88034bcfabc0
     ffff88034aaafdf0: ffff88035fc92400 ffff88034bbd3300
     ffff88034aaafe00: ffff88034bcfabc0 ffff88035fc92400
     ffff88034aaafe10: ffff88034b42a4c0 ffff88034aaafee8
     ffff88034aaafe20: 0000000000000000 ffff88034c5ca340
     ffff88034aaafe30: ffff88034c5ca330 ffff88034c5ca808
     ffff88034aaafe40: ffff88034aaafee8 ffff88034c5ca340
     ffff88034aaafe50: ffff88034c5ca340 ffff88034aaafe68
     ffff88034aaafe60: ffffffff814171c4
  #1 [ffff88034aaafe60] schedule at ffffffff814171c4
     ffff88034aaafe68: ffff88034aaafed8 ffffffff81040e39
  #2 [ffff88034aaafe70] do_wait at ffffffff81040e39
     ffff88034aaafe78: ffff88034b5f4380 ffff88034c5ca340
     ffff88034aaafe88: ffff88034c5ca340 ffff88034aaaff10
     ffff88034aaafe98: ffff88034c5ca340 0000000000000000
     ffff88034aaafea8: ffff88034c5ca870 ffffffffffffffea
     ffff88034aaafeb8: 0000000000000004 0000000000000000
     ffff88034aaafec8: 0000000000000000 00007fff7cd9c3c4
     ffff88034aaafed8: ffff88034aaaff78 ffffffff81040f11
  #3 [ffff88034aaafee0] sys_wait4 at ffffffff81040f11
     ffff88034aaafee8: 0000000400000003 0000000000000000
     ffff88034aaafef8: 0000000000000000 00007fff7cd9c3c4
     ffff88034aaaff08: 0000000000000000 ffffffff00000000
     ffff88034aaaff18: ffff88034c5ca340 ffffffff8103f510
     ffff88034aaaff28: ffff88034c1d1a98 ffff88034c1d1a98
     ffff88034aaaff38: 0000000000000000 00000000ffffffff
     ffff88034aaaff48: 00000000ffffffff 0000000000000000
     ffff88034aaaff58: 00000000ffffffff 00000000ffffffff
     ffff88034aaaff68: 0000000000000000 0000000000000000
     ffff88034aaaff78: 00007fff7cd9c3c4 ffffffff8141fff9
  #4 [ffff88034aaaff80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007f9a536bd525  RSP: 00007fff7cd9c390  RFLAGS: 00000246
     RAX: 000000000000003d  RBX: ffffffff8141fff9  RCX: ffffffffffffffff
     RDX: 0000000000000000  RSI: 00007fff7cd9c3c4  RDI: ffffffffffffffff
     RBP: 00007fff7cd9c3c4   R8: 00000000006a33e0   R9: 00000000006a7390
     R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
     R13: 0000000000000000  R14: 00000000ffffffff  R15: 00000000ffffffff
     ORIG_RAX: 000000000000003d  CS: 0033  SS: 002b

PID: 20093  TASK: ffff88034b42a4c0  CPU: 1   COMMAND: "xfs_io"
  #0 [ffff88034c3abc98] __schedule at ffffffff81416e50
     ffff88034c3abca0: 0000000000000086 ffff88034c3aa010
     ffff88034c3abcb0: 0000000000012400 0000000000012400
     ffff88034c3abcc0: 0000000000012400 0000000000012400
     ffff88034c3abcd0: ffff88034c3abfd8 ffff88034c3abfd8
     ffff88034c3abce0: 0000000000012400 ffff88034b42a4c0
     ffff88034c3abcf0: ffff88034f99c300 ffff88034ddfd4d0
     ffff88034c3abd00: 00007f7d13560900 000000004c3abd38
     ffff88034c3abd10: ffffea000b862d18 0000000000000000
     ffff88034c3abd20: 000000004c3413f8 0000000000000200
     ffff88034c3abd30: ffff88034ae85b00 ffff880300000028
     ffff88034c3abd40: 0000000000000079 00007f7d13560000
     ffff88034c3abd50: ffffea000bcf5218 ffffea000b84ded0
     ffff88034c3abd60: 0000000000000000 0000000000000000
     ffff88034c3abd70: ffff88034c341978 ffff88034ae85b00
     ffff88034c3abd80: 0000000000000028 ffff88034c3abdf8
     ffff88034c3abd90: ffffffff811166c2 0000000000000000
     ffff88034c3abda0: ffff88034f4e70e8 ffff88034c3abde8
     ffff88034c3abdb0: 0000000000000002 ffff88034b42a4c0
     ffff88034c3abdc0: ffff88034be7ac68 ffff88034be7ac70
     ffff88034c3abdd0: ffffffffffffffff ffff88034c3abde8
     ffff88034c3abde0: ffffffff814171c4
  #1 [ffff88034c3abde0] schedule at ffffffff814171c4
     ffff88034c3abde8: ffff88034c3abe58 ffffffff81417de5
  #2 [ffff88034c3abdf0] rwsem_down_failed_common at ffffffff81417de5
     ffff88034c3abdf8: ffff88034be7ac78 ffff88034be7ac78
     ffff88034c3abe08: ffff88034b42a4c0 ffff880300000002
     ffff88034c3abe18: 00007f7d13560900 0000000000000000
     ffff88034c3abe28: ffff88034c3abf58 ffff88034be7ac00
     ffff88034c3abe38: 00007fffb132ee7c ffff88034be7ac68
     ffff88034c3abe48: 0000000000000003 00000000c0045878
     ffff88034c3abe58: ffff88034c3abe68 ffffffff81417e93
  #3 [ffff88034c3abe60] rwsem_down_write_failed at ffffffff81417e93
     ffff88034c3abe68: ffff88034c3abeb8 ffffffff8123fd93
  #4 [ffff88034c3abe70] call_rwsem_down_write_failed at ffffffff8123fd93
     ffff88034c3abe78: 0000000000000246 00007f7d135fef30
     ffff88034c3abe88: 000000000000000f 0000000000000003
     ffff88034c3abe98: 0000000000000015 ffff88035f054c00
     ffff88034c3abea8: ffff88034be7ac68 ffffffff81416110
  #5 [ffff88034c3abeb0] down_write at ffffffff81416110
     ffff88034c3abeb8: ffff88034c3abee8 ffffffff81150343
  #6 [ffff88034c3abec0] thaw_super at ffffffff81150343
     ffff88034c3abec8: 0000000000000000 ffff88034be7ac00
     ffff88034c3abed8: 00007fffb132ee7c 00007fffb132ee7c
     ffff88034c3abee8: ffff88034c3abf28 ffffffff8115efb8
  #7 [ffff88034c3abef0] do_vfs_ioctl at ffffffff8115efb8
     ffff88034c3abef8: 000000000087f38b 000000000087e00b
     ffff88034c3abf08: 000000000087e00b 0000000000000000
     ffff88034c3abf18: ffff88034bc6e280 00007fffb132ee7c
     ffff88034c3abf28: ffff88034c3abf78 ffffffff8115f139
  #8 [ffff88034c3abf30] sys_ioctl at ffffffff8115f139
     ffff88034c3abf38: 0000000000000000 00007fffb132eeb4
     ffff88034c3abf48: 0000000000000000 0000000000000001
     ffff88034c3abf58: 0000000000402090 000000000061e1d0
     ffff88034c3abf68: 000000000061e2e0 0000000000000000
     ffff88034c3abf78: 000000000061e300 ffffffff8141fff9
  #9 [ffff88034c3abf80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007f7d135b6d07  RSP: 00007fffb132ee58  RFLAGS: 00010202
     RAX: 0000000000000010  RBX: ffffffff8141fff9  RCX: 0000000000776168
     RDX: 00007fffb132ee7c  RSI: ffffffffc0045878  RDI: 0000000000000003
     RBP: 000000000061e300   R8: 000000000000ffff   R9: 000000000000000f
     R10: 00007f7d135fef30  R11: 0000000000000246  R12: 0000000000000000
     R13: 000000000061e2e0  R14: 000000000061e1d0  R15: 0000000000402090
     ORIG_RAX: 0000000000000010  CS: 0033  SS: 002b

PID: 20185  TASK: ffff88034c31c280  CPU: 1   COMMAND: "sync"
  #0 [ffff88034afe7b88] __schedule at ffffffff81416e50
     ffff88034afe7b90: 0000000000000086 ffff88034afe6010
     ffff88034afe7ba0: 0000000000012400 0000000000012400
     ffff88034afe7bb0: 0000000000012400 0000000000012400
     ffff88034afe7bc0: ffff88034afe7fd8 ffff88034afe7fd8
     ffff88034afe7bd0: 0000000000012400 ffff88034c31c280
     ffff88034afe7be0: ffff88034f99c300 ffff880300000028
     ffff88034afe7bf0: 000000000000013a 00007fae7b775000
     ffff88034afe7c00: 000000000bcdeb90 0000000000000000
     ffff88034afe7c10: 0000000100000000 ffff88034a979390
     ffff88034afe7c20: 000100004e702ef8 ffff88034b1e8720
     ffff88034afe7c30: 0000000000000028 ffff88034afe7ca8
     ffff88034afe7c40: ffffffff811166c2 ffff88034afe7c68
     ffff88034afe7c50: ffff88034b65e2c8 ffff88034afe7c98
     ffff88034afe7c60: ffffffff810732a8 ffff88035fc52e40
     ffff88034afe7c70: ffff88034b65e2c8 ffff88035fcd2478
     ffff88034afe7c80: 0000000000000001 0000000000000003
     ffff88034afe7c90: 0000000000000000 ffff88034afe7cc8
     ffff88034afe7ca0: 7fffffffffffffff ffff88034afe7e68
     ffff88034afe7cb0: 7fffffffffffffff ffff88034c31c280
     ffff88034afe7cc0: 0000000000000000 ffff88034afe7cd8
     ffff88034afe7cd0: ffffffff814171c4
  #1 [ffff88034afe7cd0] schedule at ffffffff814171c4
     ffff88034afe7cd8: ffff88034afe7d78 ffffffff81415455
  #2 [ffff88034afe7ce0] schedule_timeout at ffffffff81415455
     ffff88034afe7ce8: ffff88035fcd2400 0000000000000005
     ffff88034afe7cf8: ffff88034afe7d28 ffffffff8106c2a1
     ffff88034afe7d08: ffff88034afe7d28 ffffffff81069895
     ffff88034afe7d18: ffff88035fcd2400 ffff88034b65e280
     ffff88034afe7d28: ffff88034afe7d58 ffffffff81069918
     ffff88034afe7d38: ffff88034b65e280 ffff88035fcd2400
     ffff88034afe7d48: 0000000000000000 7fffffffffffffff
     ffff88034afe7d58: ffff88034afe7e68 ffff88034afe7e70
     ffff88034afe7d68: ffff88034c31c280 0000000000000000
     ffff88034afe7d78: ffff88034afe7e08 ffffffff814166b7
  #3 [ffff88034afe7d80] wait_for_common at ffffffff814166b7
     ffff88034afe7d88: ffff88034afe7dd8 ffff88034c31c280
     ffff88034afe7d98: 0000000200000000 0000000000000002
     ffff88034afe7da8: 0000000000000001 ffff88034c31c280
     ffff88034afe7db8: ffffffff810702d0 ffff88034afe7e78
     ffff88034afe7dc8: ffff88034afe7e78 0000000000000246
     ffff88034afe7dd8: ffff88034afe7de8 ffff88034a4c8000
     ffff88034afe7de8: ffff88034afe7e28 ffff88034afe7e68
     ffff88034afe7df8: 0000000000000000 ffffffff8117a4b0
     ffff88034afe7e08: ffff88034afe7e18 ffffffff81416828
  #4 [ffff88034afe7e10] wait_for_completion at ffffffff81416828
     ffff88034afe7e18: ffff88034afe7ea8 ffffffff81174c69
  #5 [ffff88034afe7e20] writeback_inodes_sb_nr at ffffffff81174c69
     ffff88034afe7e28: 000000000000ecde ffff88034a4c8000
     ffff88034afe7e38: 0000000000000000 0000000100000000
     ffff88034afe7e48: 0000000000000002 ffff88034baaa3d0
     ffff88034afe7e58: ffff88034a637ea8 ffff88034afe7e68
     ffff88034afe7e68: 0000000000000000 0000000000010001
     ffff88034afe7e78: ffff88034afe7dc0 ffff88034afe7dc0
     ffff88034afe7e88: 0000000000000017 0000000000000017
     ffff88034afe7e98: 0000000000000002 ffff88034a4c8000
     ffff88034afe7ea8: ffff88034afe7ed8 ffffffff8117522c
  #6 [ffff88034afe7eb0] writeback_inodes_sb at ffffffff8117522c
     ffff88034afe7eb8: ffff88034a4c8000 0000000000000000
     ffff88034afe7ec8: ffff88034eb38c00 ffff88034afe7f6c
     ffff88034afe7ed8: ffff88034afe7ef8 ffffffff8117a469
  #7 [ffff88034afe7ee0] __sync_filesystem at ffffffff8117a469
     ffff88034afe7ee8: ffff88034a4c8000 ffff88034a4c8068
     ffff88034afe7ef8: ffff88034afe7f08 ffffffff8117a4c7
  #8 [ffff88034afe7f00] sync_one_sb at ffffffff8117a4c7
     ffff88034afe7f08: ffff88034afe7f48 ffffffff8115126b
  #9 [ffff88034afe7f10] iterate_supers at ffffffff8115126b
     ffff88034afe7f18: ffff88034afe7f48 ffff88034afe7f6c
     ffff88034afe7f28: 0000000000401140 00007fffc53f4e70
     ffff88034afe7f38: 0000000000000000 0000000000000000
     ffff88034afe7f48: ffff88034afe7f78 ffffffff8117a4ff
#10 [ffff88034afe7f50] sys_sync at ffffffff8117a4ff
     ffff88034afe7f58: 0000000000000000 0000000000000000
     ffff88034afe7f68: 00000000c53f4e70 00007fffc53f4e78
     ffff88034afe7f78: 0000000000000001 ffffffff8141fff9
#11 [ffff88034afe7f80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007fae7b70bfe7  RSP: 00007fffc53f4d48  RFLAGS: 00010206
     RAX: 00000000000000a2  RBX: ffffffff8141fff9  RCX: 0000000000000000
     RDX: 00007fae7b9a913c  RSI: 0000000000000001  RDI: 0000000000000000
     RBP: 0000000000000001   R8: 00007fae7b773a70   R9: 0000000000000000
     R10: 00007fffc53f4b20  R11: 0000000000000206  R12: 00007fffc53f4e78
     R13: 00000000c53f4e70  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 00000000000000a2  CS: 0033  SS: 002b

PID: 20110  TASK: ffff88034a4820c0  CPU: 2   COMMAND: "ls"
  #0 [ffff88034a855c78] __schedule at ffffffff81416e50
     ffff88034a855c80: 0000000000000086 ffff88034a854010
     ffff88034a855c90: 0000000000012400 0000000000012400
     ffff88034a855ca0: 0000000000012400 0000000000012400
     ffff88034a855cb0: ffff88034a855fd8 ffff88034a855fd8
     ffff88034a855cc0: 0000000000012400 ffff88034a4820c0
     ffff88034a855cd0: ffff88034f9d6440 ffffea000b3d3f38
     ffff88034a855ce0: ffff88034e3c8d98 0000000000629db8
     ffff88034a855cf0: 8000000336121067 ffff88034a855d08
     ffff88034a855d00: ffffffff810fb108 ffff88034a855d38
     ffff88034a855d10: ffffffff8111ec05 ffff88034ddb2148
     ffff88034a855d20: ffff88034e3c8d98 ffffea000b3d3f38
     ffff88034a855d30: ffff88034ddb2148 ffff88034a855d88
     ffff88034a855d40: ffffffff811113a5 ffffea000b907f20
     ffff88034a855d50: ffff88034b4f54c0 ffffea000b907f20
     ffff88034a855d60: 0000000000000000 0000000000000000
     ffff88034a855d70: ffff88034e3c8d98 ffff88034ddb2148
     ffff88034a855d80: 0000000000000246 ffff88034a855dc8
     ffff88034a855d90: ffff88034f4e7000 ffff88034a855dd8
     ffff88034a855da0: 0000000000000024 ffff88034f4e7000
     ffff88034a855db0: ffff88034a855f38 ffff88034a855dc8
     ffff88034a855dc0: ffffffff814171c4
  #1 [ffff88034a855dc0] schedule at ffffffff814171c4
     ffff88034a855dc8: ffff88034a855e28 ffffffffa0499fb5
  #2 [ffff88034a855dd0] xfs_trans_alloc at ffffffffa0499fb5 [xfs]
     ffff88034a855dd8: 0000000000000000 ffff88034a4820c0
     ffff88034a855de8: ffffffff8105e3e0 ffff88034b84f918
     ffff88034a855df8: ffff88034be7aeb0 ffffffff81116a1d
     ffff88034a855e08: ffff88034a855f28 0000000000000001
     ffff88034a855e18: ffff8802b7822538 ffff8802b7822400
     ffff88034a855e28: ffff88034a855e58 ffffffffa0457aa2
  #3 [ffff88034a855e30] xfs_fs_dirty_inode at ffffffffa0457aa2 [xfs]
     ffff88034a855e38: 0000000000000001 ffff8802b7822538
     ffff88034a855e48: 000000004f872c1b 0000000016880b81
     ffff88034a855e58: ffff88034a855e98 ffffffff811753da
  #4 [ffff88034a855e60] __mark_inode_dirty at ffffffff811753da
     ffff88034a855e68: ffff8802b7822400 ffff8802b7822538
     ffff88034a855e78: ffff88034e3d46a0 000000004f872c1b
     ffff88034a855e88: 0000000016880b81 ffff88034a855f38
     ffff88034a855e98: ffff88034a855ee8 ffffffff811662db
  #5 [ffff88034a855ea0] touch_atime at ffffffff811662db
     ffff88034a855ea8: 000000004f872c1b 0000000016880b81
     ffff88034a855eb8: 000000004f872c1b 0000000016880b81
     ffff88034a855ec8: 0000000000000000 ffff88034be12ac0
     ffff88034a855ed8: ffff8802b7822538 ffffffff8115f5e0
     ffff88034a855ee8: ffff88034a855f28 ffffffff8115f934
  #6 [ffff88034a855ef0] vfs_readdir at ffffffff8115f934
     ffff88034a855ef8: ffff8802b78225d8 0000000000621db8
     ffff88034a855f08: ffff88034be12ac0 0000000000008000
     ffff88034a855f18: 0000000000000000 0000000000621d90
     ffff88034a855f28: ffff88034a855f78 ffffffff8115f9c3
  #7 [ffff88034a855f30] sys_getdents64 at ffffffff8115f9c3
     ffff88034a855f38: 0000000000621e10 0000000000621de8
     ffff88034a855f48: ffffffea00007fa8 ffffffff81418635
     ffff88034a855f58: 0000000000000001 0000000000621d90
     ffff88034a855f68: ffffffffffffff08 00007f2450c587a0
     ffff88034a855f78: 0000000000621db8 ffffffff8141fff9
  #8 [ffff88034a855f80] system_call_fastpath at ffffffff8141fff9
     RIP: 00007f244ff7ad9a  RSP: 00007fffed07c030  RFLAGS: 00010202
     RAX: 00000000000000d9  RBX: ffffffff8141fff9  RCX: 0000000000629db0
     RDX: 0000000000008000  RSI: 0000000000621db8  RDI: 0000000000000003
     RBP: 0000000000621db8   R8: 00007f2450248e80   R9: 00007f2450248ed8
     R10: 00007fffed07bee0  R11: 0000000000000246  R12: 00007f2450c587a0
     R13: ffffffffffffff08  R14: 0000000000621d90  R15: 0000000000000001
     ORIG_RAX: 00000000000000d9  CS: 0033  SS: 002b


* Re: [PATCH 05/10] xfs: do flush inodes from background inode reclaim
  2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
  2012-04-13 10:14   ` Dave Chinner
@ 2012-04-16 19:25   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-16 19:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> We already flush dirty inodes through the AIL regularly, so there is no reason
> to have a second thread compete with it and disturb the I/O pattern.  We still
> do write inodes when doing a synchronous reclaim from the shrinker or during
> unmount for now.
>
> Signed-off-by: Christoph Hellwig<hch@lst.de>
>
> ---

> -	 */
> -	return 0;
> +	xfs_iflock(ip);
>
>   reclaim:
>   	xfs_ifunlock(ip);

Is this flush lock / flush unlock cycle needed?

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-16 13:47   ` Mark Tinguely
@ 2012-04-16 23:54     ` Dave Chinner
  2012-04-17  4:20       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2012-04-16 23:54 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Mon, Apr 16, 2012 at 08:47:00AM -0500, Mark Tinguely wrote:
> On 03/27/12 11:44, Christoph Hellwig wrote:
> >Now that we write back all metadata either synchronously or through the AIL
> >we can simply implement metadata freezing in terms of emptying the AIL.
> >
> >The implementation for this is fairly simple and straightforward:  A new
> >routine is added that increments a counter that tells xfsaild to not stop
> >until the AIL is empty and then waits on a wakeup from
> >xfs_trans_ail_delete_bulk to signal that the AIL is empty.
> >
> >As usual the devil is in the details, in this case the filesystem shutdown
> >code.  Currently we are a bit sloppy there and do not continue ail pushing
> >in that case, and thus never reach the code in the log item implementations
> >that can unwind in case of a shutdown filesystem.  Also the code to
> >abort inode and dquot flushes was rather sloppy before and did not remove
> >the log items from the AIL, which had to be fixed as well.
> >
> >Also treat unmount the same way as freeze now, except that we still keep a
> >synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
> >
> >As an upside we can now remove the radix tree based inode writeback and
> >xfs_unmountfs_writesb.
> >
> >Signed-off-by: Christoph Hellwig<hch@lst.de>
> 
> Sorry for the empty email.
> 
> This series hangs my test boxes. This patch is the first indication
> of the hang. Reboot, remove patch 4, and the tests are successful.
> 
> The machine is still responsive. Only the SCRATCH filesystem from
> the test suite is hung.
> 
> Per Dave's observation, I added a couple inode reclaims to this
> patch and the test gets further (hangs on run 9 of test 068 rather
> than run 3).

That implies that there are dirty inodes at the VFS level leaking
through the freeze.

.....

> The back traces are from a Linux 3.4-rc2 kernel with just patches
> 0-4 of this series applied. This traceback does not have extra inode
> reclaims. The hang is in test 068. I did an ls and sync to the
> filesystem, so I included their tracebacks as well (live system).
> 
> I have looked at the remaining patches in the series, but have not
> reviewed them because they depend on this patch...
> 
> --Mark.
> ---
> 
> crash> bt -f 20050
> PID: 20050  TASK: ffff88034a6943c0  CPU: 0   COMMAND: "fsstress"
>  #0 [ffff88034aa93d18] __schedule at ffffffff81416e50
>  #1 [ffff88034aa93e60] schedule at ffffffff814171c4
>  #2 [ffff88034aa93e70] do_wait at ffffffff81040e39
>  #3 [ffff88034aa93ee0] sys_wait4 at ffffffff81040f11
>  #4 [ffff88034aa93f80] system_call_fastpath at ffffffff8141fff9
> 
> PID: 20051  TASK: ffff88034e31e600  CPU: 3   COMMAND: "fsstress"
>  #0 [ffff88034c5c1c08] __schedule at ffffffff81416e50
>  #1 [ffff88034c5c1d50] schedule at ffffffff814171c4
>  #2 [ffff88034c5c1d60] xfs_file_aio_write at ffffffffa044d4b5 [xfs]
>  #3 [ffff88034c5c1df0] do_sync_write at ffffffff8114d3d9
>  #4 [ffff88034c5c1f10] vfs_write at ffffffff8114da0b
>  #5 [ffff88034c5c1f40] sys_write at ffffffff8114db60
>  #6 [ffff88034c5c1f80] system_call_fastpath at ffffffff8141fff9

Frozen write, not holding any locks.

> PID: 20052  TASK: ffff88034ad56080  CPU: 3   COMMAND: "fsstress"
>  #0 [ffff88034a88fbb8] __schedule at ffffffff81416e50
>  #1 [ffff88034a88fd00] schedule at ffffffff814171c4
>  #2 [ffff88034a88fd10] schedule_timeout at ffffffff81415455
>  #3 [ffff88034a88fdb0] wait_for_common at ffffffff814166b7
>  #4 [ffff88034a88fe40] wait_for_completion at ffffffff81416828
>  #5 [ffff88034a88fe50] sync_inodes_sb at ffffffff81174eaa
>  #6 [ffff88034a88fee0] __sync_filesystem at ffffffff8117a4a0
>  #7 [ffff88034a88ff00] sync_one_sb at ffffffff8117a4c7
>  #8 [ffff88034a88ff10] iterate_supers at ffffffff8115126b
>  #9 [ffff88034a88ff50] sys_sync at ffffffff8117a515
> #10 [ffff88034a88ff80] system_call_fastpath at ffffffff8141fff9

Waiting for flusher thread completion, holding the sb->s_umount lock
in read mode.

> PID: 20089  TASK: ffff88034c5ca340  CPU: 2   COMMAND: "xfs_freeze"
>  #0 [ffff88034aaafd18] __schedule at ffffffff81416e50
>  #1 [ffff88034aaafe60] schedule at ffffffff814171c4
>  #2 [ffff88034aaafe70] do_wait at ffffffff81040e39
>  #3 [ffff88034aaafee0] sys_wait4 at ffffffff81040f11
>  #4 [ffff88034aaaff80] system_call_fastpath at ffffffff8141fff9
> 
> PID: 20093  TASK: ffff88034b42a4c0  CPU: 1   COMMAND: "xfs_io"
>  #0 [ffff88034c3abc98] __schedule at ffffffff81416e50
>  #1 [ffff88034c3abde0] schedule at ffffffff814171c4
>  #2 [ffff88034c3abdf0] rwsem_down_failed_common at ffffffff81417de5
>  #3 [ffff88034c3abe60] rwsem_down_write_failed at ffffffff81417e93
>  #4 [ffff88034c3abe70] call_rwsem_down_write_failed at ffffffff8123fd93
>  #5 [ffff88034c3abeb0] down_write at ffffffff81416110
>  #6 [ffff88034c3abec0] thaw_super at ffffffff81150343
>  #7 [ffff88034c3abef0] do_vfs_ioctl at ffffffff8115efb8
>  #8 [ffff88034c3abf30] sys_ioctl at ffffffff8115f139
>  #9 [ffff88034c3abf80] system_call_fastpath at ffffffff8141fff9

waiting for sb->s_umount, which can only be released by flusher
thread completion.

> PID: 20185  TASK: ffff88034c31c280  CPU: 1   COMMAND: "sync"
>  #0 [ffff88034afe7b88] __schedule at ffffffff81416e50
>  #1 [ffff88034afe7cd0] schedule at ffffffff814171c4
>  #2 [ffff88034afe7ce0] schedule_timeout at ffffffff81415455
>  #3 [ffff88034afe7d80] wait_for_common at ffffffff814166b7
>  #4 [ffff88034afe7e10] wait_for_completion at ffffffff81416828
>  #5 [ffff88034afe7e20] writeback_inodes_sb_nr at ffffffff81174c69
>  #6 [ffff88034afe7eb0] writeback_inodes_sb at ffffffff8117522c
>  #7 [ffff88034afe7ee0] __sync_filesystem at ffffffff8117a469
>  #8 [ffff88034afe7f00] sync_one_sb at ffffffff8117a4c7
>  #9 [ffff88034afe7f10] iterate_supers at ffffffff8115126b
> #10 [ffff88034afe7f50] sys_sync at ffffffff8117a4ff
> #11 [ffff88034afe7f80] system_call_fastpath at ffffffff8141fff9

waiting for flusher thread completion, holding the sb->s_umount lock
in read mode.

> 
> PID: 20110  TASK: ffff88034a4820c0  CPU: 2   COMMAND: "ls"
>  #0 [ffff88034a855c78] __schedule at ffffffff81416e50
>  #1 [ffff88034a855dc0] schedule at ffffffff814171c4
>  #2 [ffff88034a855dd0] xfs_trans_alloc at ffffffffa0499fb5 [xfs]
>  #3 [ffff88034a855e30] xfs_fs_dirty_inode at ffffffffa0457aa2 [xfs]
>  #4 [ffff88034a855e60] __mark_inode_dirty at ffffffff811753da
>  #5 [ffff88034a855ea0] touch_atime at ffffffff811662db
>  #6 [ffff88034a855ef0] vfs_readdir at ffffffff8115f934
>  #7 [ffff88034a855f30] sys_getdents64 at ffffffff8115f9c3
>  #8 [ffff88034a855f80] system_call_fastpath at ffffffff8141fff9

Frozen attribute modification, no locks held.

So, what are the flusher threads doing - where are they stuck?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-16 23:54     ` Dave Chinner
@ 2012-04-17  4:20       ` Dave Chinner
  2012-04-17  8:26         ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2012-04-17  4:20 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Tue, Apr 17, 2012 at 09:54:32AM +1000, Dave Chinner wrote:
> On Mon, Apr 16, 2012 at 08:47:00AM -0500, Mark Tinguely wrote:
> > On 03/27/12 11:44, Christoph Hellwig wrote:
> > >Now that we write back all metadata either synchronously or through the AIL
> > >we can simply implement metadata freezing in terms of emptying the AIL.
> > >
> > >The implementation for this is fairly simple and straightforward:  A new
> > >routine is added that increments a counter that tells xfsaild to not stop
> > >until the AIL is empty and then waits on a wakeup from
> > >xfs_trans_ail_delete_bulk to signal that the AIL is empty.
> > >
> > >As usual the devil is in the details, in this case the filesystem shutdown
> > >code.  Currently we are a bit sloppy there and do not continue ail pushing
> > >in that case, and thus never reach the code in the log item implementations
> > >that can unwind in case of a shutdown filesystem.  Also the code to
> > >abort inode and dquot flushes was rather sloppy before and did not remove
> > >the log items from the AIL, which had to be fixed as well.
> > >
> > >Also treat unmount the same way as freeze now, except that we still keep a
> > >synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
> > >
> > >As an upside we can now remove the radix tree based inode writeback and
> > >xfs_unmountfs_writesb.
> > >
> > >Signed-off-by: Christoph Hellwig<hch@lst.de>
> > 
> > Sorry for the empty email.
> > 
> > This series hangs my test boxes. This patch is the first indication
> > of the hang. Reboot, remove patch 4, and the tests are successful.
> > 
> > The machine is still responsive. Only the SCRATCH filesystem from
> > the test suite is hung.
> > 
> > Per Dave's observation, I added a couple inode reclaims to this
> > patch and the test gets further (hangs on run 9 of test 068 rather
> > than run 3).
> 
> That implies that there are dirty inodes at the VFS level leaking
> through the freeze.
> 
> .....
.....
> So, what are the flusher threads doing - where are they stuck?

I have an answer of sorts:

[90580.054767]   task                        PC stack   pid father
[90580.056035] flush-253:16    D 0000000000000001  4136 32084      2 0x00000000
[90580.056035]  ffff880004c558a0 0000000000000046 ffff880068b6cd48 ffff880004c55cb0
[90580.056035]  ffff88007b616280 ffff880004c55fd8 ffff880004c55fd8 ffff880004c55fd8
[90580.056035]  ffff88000681e340 ffff88007b616280 ffff880004c558b0 ffff88007981e000
[90580.056035] Call Trace:
[90580.056035]  [<ffffffff81afcd19>] schedule+0x29/0x70
[90580.056035]  [<ffffffff814801fd>] xfs_trans_alloc+0x5d/0xb0
[90580.056035]  [<ffffffff81099eb0>] ? add_wait_queue+0x60/0x60
[90580.056035]  [<ffffffff81416b14>] xfs_setfilesize_trans_alloc+0x34/0xb0
[90580.056035]  [<ffffffff814186f5>] xfs_vm_writepage+0x4a5/0x560
[90580.056035]  [<ffffffff81127507>] __writepage+0x17/0x40
[90580.056035]  [<ffffffff81127b3d>] write_cache_pages+0x20d/0x460
[90580.056035]  [<ffffffff811274f0>] ? set_page_dirty_lock+0x60/0x60
[90580.056035]  [<ffffffff81127dda>] generic_writepages+0x4a/0x70
[90580.056035]  [<ffffffff814167ec>] xfs_vm_writepages+0x4c/0x60
[90580.056035]  [<ffffffff81129711>] do_writepages+0x21/0x40
[90580.056035]  [<ffffffff8118ee42>] writeback_single_inode+0x112/0x380
[90580.056035]  [<ffffffff8118f25e>] writeback_sb_inodes+0x1ae/0x270
[90580.056035]  [<ffffffff8118f4c0>] wb_writeback+0xe0/0x320
[90580.056035]  [<ffffffff8108724a>] ? try_to_del_timer_sync+0x8a/0x110
[90580.056035]  [<ffffffff81190bc8>] wb_do_writeback+0xb8/0x1d0
[90580.056035]  [<ffffffff81085f40>] ? usleep_range+0x50/0x50
[90580.056035]  [<ffffffff81190d6b>] bdi_writeback_thread+0x8b/0x280
[90580.056035]  [<ffffffff81190ce0>] ? wb_do_writeback+0x1d0/0x1d0
[90580.056035]  [<ffffffff81099403>] kthread+0x93/0xa0
[90580.056035]  [<ffffffff81b06f64>] kernel_thread_helper+0x4/0x10
[90580.056035]  [<ffffffff81099370>] ? kthread_freezable_should_stop+0x70/0x70
[90580.056035]  [<ffffffff81b06f60>] ? gs_change+0x13/0x13

A dirty inode has slipped through the freeze process, and the
flusher thread is stuck trying to allocate a transaction for setting
the file size. I can reproduce this fairly easily, so a bit of
tracing should tell me exactly what is going wrong....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-17  4:20       ` Dave Chinner
@ 2012-04-17  8:26         ` Dave Chinner
  2012-04-18 13:13           ` Mark Tinguely
  2012-04-18 17:53           ` Mark Tinguely
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-17  8:26 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Tue, Apr 17, 2012 at 02:20:23PM +1000, Dave Chinner wrote:
> On Tue, Apr 17, 2012 at 09:54:32AM +1000, Dave Chinner wrote:
> > On Mon, Apr 16, 2012 at 08:47:00AM -0500, Mark Tinguely wrote:
> > > On 03/27/12 11:44, Christoph Hellwig wrote:
> > > >Now that we write back all metadata either synchronously or through the AIL
> > > >we can simply implement metadata freezing in terms of emptying the AIL.
> > > >
> > > >The implementation for this is fairly simple and straightforward:  A new
> > > >routine is added that increments a counter that tells xfsaild to not stop
> > > >until the AIL is empty and then waits on a wakeup from
> > > >xfs_trans_ail_delete_bulk to signal that the AIL is empty.
> > > >
> > > >As usual the devil is in the details, in this case the filesystem shutdown
> > > >code.  Currently we are a bit sloppy there and do not continue ail pushing
> > > >in that case, and thus never reach the code in the log item implementations
> > > >that can unwind in case of a shutdown filesystem.  Also the code to
> > > >abort inode and dquot flushes was rather sloppy before and did not remove
> > > >the log items from the AIL, which had to be fixed as well.
> > > >
> > > >Also treat unmount the same way as freeze now, except that we still keep a
> > > >synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
> > > >
> > > >As an upside we can now remove the radix tree based inode writeback and
> > > >xfs_unmountfs_writesb.
> > > >
> > > >Signed-off-by: Christoph Hellwig<hch@lst.de>
> > > 
> > > Sorry for the empty email.
> > > 
> > > This series hangs my test boxes. This patch is the first indication
> > > of the hang. Reboot, remove patch 4, and the tests are successful.
> > > 
> > > The machine is still responsive. Only the SCRATCH filesystem from
> > > the test suite is hung.
> > > 
> > > Per Dave's observation, I added a couple inode reclaims to this
> > > patch and the test gets further (hangs on run 9 of test 068 rather
> > > than run 3).
> > 
> > That implies that there are dirty inodes at the VFS level leaking
> > through the freeze.
> > 
> > .....
> .....
> > So, what are the flusher threads doing - where are they stuck?
> 
> I have an answer of sorts:
> 
> [90580.054767]   task                        PC stack   pid father
> [90580.056035] flush-253:16    D 0000000000000001  4136 32084      2 0x00000000
> [90580.056035]  ffff880004c558a0 0000000000000046 ffff880068b6cd48 ffff880004c55cb0
> [90580.056035]  ffff88007b616280 ffff880004c55fd8 ffff880004c55fd8 ffff880004c55fd8
> [90580.056035]  ffff88000681e340 ffff88007b616280 ffff880004c558b0 ffff88007981e000
> [90580.056035] Call Trace:
> [90580.056035]  [<ffffffff81afcd19>] schedule+0x29/0x70
> [90580.056035]  [<ffffffff814801fd>] xfs_trans_alloc+0x5d/0xb0
> [90580.056035]  [<ffffffff81099eb0>] ? add_wait_queue+0x60/0x60
> [90580.056035]  [<ffffffff81416b14>] xfs_setfilesize_trans_alloc+0x34/0xb0
> [90580.056035]  [<ffffffff814186f5>] xfs_vm_writepage+0x4a5/0x560
> [90580.056035]  [<ffffffff81127507>] __writepage+0x17/0x40
> [90580.056035]  [<ffffffff81127b3d>] write_cache_pages+0x20d/0x460
> [90580.056035]  [<ffffffff811274f0>] ? set_page_dirty_lock+0x60/0x60
> [90580.056035]  [<ffffffff81127dda>] generic_writepages+0x4a/0x70
> [90580.056035]  [<ffffffff814167ec>] xfs_vm_writepages+0x4c/0x60
> [90580.056035]  [<ffffffff81129711>] do_writepages+0x21/0x40
> [90580.056035]  [<ffffffff8118ee42>] writeback_single_inode+0x112/0x380
> [90580.056035]  [<ffffffff8118f25e>] writeback_sb_inodes+0x1ae/0x270
> [90580.056035]  [<ffffffff8118f4c0>] wb_writeback+0xe0/0x320
> [90580.056035]  [<ffffffff8108724a>] ? try_to_del_timer_sync+0x8a/0x110
> [90580.056035]  [<ffffffff81190bc8>] wb_do_writeback+0xb8/0x1d0
> [90580.056035]  [<ffffffff81085f40>] ? usleep_range+0x50/0x50
> [90580.056035]  [<ffffffff81190d6b>] bdi_writeback_thread+0x8b/0x280
> [90580.056035]  [<ffffffff81190ce0>] ? wb_do_writeback+0x1d0/0x1d0
> [90580.056035]  [<ffffffff81099403>] kthread+0x93/0xa0
> [90580.056035]  [<ffffffff81b06f64>] kernel_thread_helper+0x4/0x10
> [90580.056035]  [<ffffffff81099370>] ? kthread_freezable_should_stop+0x70/0x70
> [90580.056035]  [<ffffffff81b06f60>] ? gs_change+0x13/0x13
> 
> A dirty inode has slipped through the freeze process, and the
> flusher thread is stuck trying to allocate a transaction for setting
> the file size. I can reproduce this fairly easily, so a bit of
> tracing should tell me exactly what is going wrong....

Yeah, it's pretty clear what is happening here. We don't have
freeze protection against EOF zeroing operations. At least
xfs_setattr_size() and xfs_change_file_space() fail to check for
freeze, and that is initially what I thought was causing this problem.

However, adding freeze checks into the relevant paths didn't make
the hangs go away, so there's more to it than that. Basically, we've
been getting races between checking for freeze, the dirtying of the
pages, and the flusher thread syncing out the dirty data, i.e.:

Thread 1		Thread 2		freeze		flusher thread
write inode A
check for freeze
					grab s_umount
					SB_FREEZE_WRITE
					writeback_inodes_sb()
								iterate dirty inodes
								inode A not in flush
					sync_inodes_sb()
								iterate dirty inodes
								inode A not in flush
dirty pages
mark inode A dirty
write inode A done.
					SB_FREEZE_TRANS
					drop s_umount
					freeze done
			sync
			grab s_umount
								iterate dirty inodes
								Flush dirty inode A


Before we added the transactional inode size updates, this race
simply went unnoticed because nothing caused the flusher thread to
block. All the problems I see are due to overwrites of allocated
space - if there was real allocation then the delalloc conversion
would have always hung. Now we see that when we need to extend the
file size when writing, we have to allocate a transaction and hence
the flusher thread now hangs.

While I can "fix" the xfs_setattr_size() and xfs_change_file_space()
triggers, they don't close the above race condition, so this problem
is essentially unfixable in XFS. The only reason we have not tripped
over it before is that the flusher thread didn't hang waiting for a
transaction reservation when the race was hit.

So why didn't this happen before Christoph's patch set? That's
something I can't explain. Oh, wait, yes I can - 068 hangs even
without this patch of Christoph's. Actually, looking at my xfstests
logs, I can trace the start of the failures back to mid march, and
that coincided with an update to the xfstests installed on my test
boxes. Which coincides with when my machines first saw this change:

commit 281627df3eb55e1b729b9bb06fff5ff112929646
Author: Christoph Hellwig <hch@infradead.org>
Date:   Tue Mar 13 08:41:05 2012 +0000

    xfs: log file size updates at I/O completion time

That confirms my analysis above - the problem is being exposed by new
code in the writeback path that does transaction allocation where it
didn't before.

Clearly the problem is not really the new code in Christoph's
patches - it's an existing freeze problem that has previously
resulted in data writes occurring after a freeze has completed (of
which we have had rare complaints about). That sounds pretty dire,
except for one thing: Jan Kara's patch set that fixes all these
freeze problems:

https://lkml.org/lkml/2012/4/16/356

And now that I've run some testing with Jan's patch series, along
with Christoph's and mine (75-odd patches ;), a couple of my test
VMs have been running test 068 in a tight loop for about half an
hour without a hang. Given that I could reliably hang them in 2-3
minutes before adding Jan's patch set to my stack, I'd consider
this problem fixed by Jan's freeze fixes.

So the fix for this problem is getting Jan's patch set into the
kernel at the same time as the inode size logging changes. What do
people think of that plan?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-17  8:26         ` Dave Chinner
@ 2012-04-18 13:13           ` Mark Tinguely
  2012-04-18 18:14             ` Ben Myers
  2012-04-18 17:53           ` Mark Tinguely
  1 sibling, 1 reply; 42+ messages in thread
From: Mark Tinguely @ 2012-04-18 13:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On 04/17/12 03:26, Dave Chinner wrote:
> On Tue, Apr 17, 2012 at 02:20:23PM +1000, Dave Chinner wrote:
>> On Tue, Apr 17, 2012 at 09:54:32AM +1000, Dave Chinner wrote:
>>> On Mon, Apr 16, 2012 at 08:47:00AM -0500, Mark Tinguely wrote:
>>>> On 03/27/12 11:44, Christoph Hellwig wrote:
>>>>> Now that we write back all metadata either synchronously or through the AIL
>>>>> we can simply implement metadata freezing in terms of emptying the AIL.
>>>>>
>>>>> The implementation for this is fairly simple and straightforward:  A new
>>>>> routine is added that increments a counter that tells xfsaild to not stop
>>>>> until the AIL is empty and then waits on a wakeup from
>>>>> xfs_trans_ail_delete_bulk to signal that the AIL is empty.
>>>>>
>>>>> As usual the devil is in the details, in this case the filesystem shutdown
>>>>> code.  Currently we are a bit sloppy there and do not continue ail pushing
>>>>> in that case, and thus never reach the code in the log item implementations
>>>>> that can unwind in case of a shutdown filesystem.  Also the code to
>>>>> abort inode and dquot flushes was rather sloppy before and did not remove
>>>>> the log items from the AIL, which had to be fixed as well.
>>>>>
>>>>> Also treat unmount the same way as freeze now, except that we still keep a
>>>>> synchronous inode reclaim pass to make sure we reclaim all clean inodes, too.
>>>>>
>>>>> As an upside we can now remove the radix tree based inode writeback and
>>>>> xfs_unmountfs_writesb.
>>>>>
>>>>> Signed-off-by: Christoph Hellwig<hch@lst.de>
>>>>
>>>> Sorry for the empty email.
>>>>
>>>> This series hangs my test boxes. This patch is the first indication
>>>> of the hang. Reboot, remove patch 4, and the tests are successful.
>>>>
>>>> The machine is still responsive. Only the SCRATCH filesystem from
>>>> the test suite is hung.
>>>>
>>>> Per Dave's observation, I added a couple inode reclaims to this
>>>> patch and the test gets further (hangs on run 9 of test 068 rather
>>>> than run 3).
>>>
>>> That implies that there are dirty inodes at the VFS level leaking
>>> through the freeze.
>>>
>>> .....
>> .....
>>> So, what are the flusher threads doing - where are they stuck?
>>
>> I have an answer of sorts:
>>
>> [90580.054767]   task                        PC stack   pid father
>> [90580.056035] flush-253:16    D 0000000000000001  4136 32084      2 0x00000000
>> [90580.056035]  ffff880004c558a0 0000000000000046 ffff880068b6cd48 ffff880004c55cb0
>> [90580.056035]  ffff88007b616280 ffff880004c55fd8 ffff880004c55fd8 ffff880004c55fd8
>> [90580.056035]  ffff88000681e340 ffff88007b616280 ffff880004c558b0 ffff88007981e000
>> [90580.056035] Call Trace:
>> [90580.056035]  [<ffffffff81afcd19>] schedule+0x29/0x70
>> [90580.056035]  [<ffffffff814801fd>] xfs_trans_alloc+0x5d/0xb0
>> [90580.056035]  [<ffffffff81099eb0>] ? add_wait_queue+0x60/0x60
>> [90580.056035]  [<ffffffff81416b14>] xfs_setfilesize_trans_alloc+0x34/0xb0
>> [90580.056035]  [<ffffffff814186f5>] xfs_vm_writepage+0x4a5/0x560
>> [90580.056035]  [<ffffffff81127507>] __writepage+0x17/0x40
>> [90580.056035]  [<ffffffff81127b3d>] write_cache_pages+0x20d/0x460
>> [90580.056035]  [<ffffffff811274f0>] ? set_page_dirty_lock+0x60/0x60
>> [90580.056035]  [<ffffffff81127dda>] generic_writepages+0x4a/0x70
>> [90580.056035]  [<ffffffff814167ec>] xfs_vm_writepages+0x4c/0x60
>> [90580.056035]  [<ffffffff81129711>] do_writepages+0x21/0x40
>> [90580.056035]  [<ffffffff8118ee42>] writeback_single_inode+0x112/0x380
>> [90580.056035]  [<ffffffff8118f25e>] writeback_sb_inodes+0x1ae/0x270
>> [90580.056035]  [<ffffffff8118f4c0>] wb_writeback+0xe0/0x320
>> [90580.056035]  [<ffffffff8108724a>] ? try_to_del_timer_sync+0x8a/0x110
>> [90580.056035]  [<ffffffff81190bc8>] wb_do_writeback+0xb8/0x1d0
>> [90580.056035]  [<ffffffff81085f40>] ? usleep_range+0x50/0x50
>> [90580.056035]  [<ffffffff81190d6b>] bdi_writeback_thread+0x8b/0x280
>> [90580.056035]  [<ffffffff81190ce0>] ? wb_do_writeback+0x1d0/0x1d0
>> [90580.056035]  [<ffffffff81099403>] kthread+0x93/0xa0
>> [90580.056035]  [<ffffffff81b06f64>] kernel_thread_helper+0x4/0x10
>> [90580.056035]  [<ffffffff81099370>] ? kthread_freezable_should_stop+0x70/0x70
>> [90580.056035]  [<ffffffff81b06f60>] ? gs_change+0x13/0x13
>>
>> A dirty inode has slipped through the freeze process, and the
>> flusher thread is stuck trying to allocate a transaction for setting
>> the file size. I can reproduce this fairly easily, so a bit of
>> tracing should tell me exactly what is going wrong....
>
> Yeah, it's pretty clear what is happening here. We don't have
> freeze protection against EOF zeroing operations. At least
> xfs_setattr_size() and xfs_change_file_space() fail to check for
> freeze, and that is initially what I thought was causing this problem.
>
> However, adding freeze checks into the relevant paths didn't make
> the hangs go away, so there's more to it than that. Basically, we've
> been getting races between checking for freeze, the dirtying of the
> pages and the flusher thread syncing out the dirty data. i.e.:
>
> Thread 1		Thread 2		freeze		flusher thread
> write inode A
> check for freeze
> 					grab s_umount
> 					SB_FREEZE_WRITE
> 					writeback_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> 					sync_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> dirty pages
> mark inode A dirty
> write inode A done.
> 					SB_FREEZE_TRANS
> 					drop s_umount
> 					freeze done
> 			sync
> 			grab s_umount
> 								iterate dirty inodes
> 								Flush dirty inode A
>
>
> Before we added the transactional inode size updates, this race
> simply went unnoticed because nothing caused the flusher thread to
> block. All the problems I see are due to overwrites of allocated
> space - if there was real allocation then the delalloc conversion
> would have always hung. Now we see that when we need to extend the
> file size when writing, we have to allocate a transaction and hence
> the flusher thread now hangs.
>
> While I can "fix" the xfs_setattr_size() and xfs_change_file_space()
> triggers, they don't close the above race condition, so this problem
> is essentially unfixable in XFS. The only reason we have not tripped
> over it before is that the flusher thread didn't hang waiting for a
> transaction reservation when the race was hit.
>
> So why didn't this happen before Christoph's patch set? That's
> something I can't explain. Oh, wait, yes I can - 068 hangs even
> without this patch of Christoph's. Actually, looking at my xfstests
> logs, I can trace the start of the failures back to mid march, and
> that coincided with an update to the xfstests installed on my test
> boxes. Which coincides with when my machines first saw this change:
>
> commit 281627df3eb55e1b729b9bb06fff5ff112929646
> Author: Christoph Hellwig <hch@infradead.org>
> Date:   Tue Mar 13 08:41:05 2012 +0000
>
>      xfs: log file size updates at I/O completion time
>
> That confirms my analysis above - the problem is being exposed by new
> code in the writeback path that does transaction allocation where it
> didn't used to.
>
> Clearly the problem is not really the new code in Christoph's
> patches - it's an existing freeze problem that has previously
> resulted in data writes occurring after a freeze has completed (of
> which we have had rare complaints about). That sounds pretty dire,
> except for one thing: Jan Kara's patch set that fixes all these
> freeze problems:
>
> https://lkml.org/lkml/2012/4/16/356
>
> And now that I've run some testing with Jan's patch series, along
> with Christoph's and mine (75-odd patches ;), a couple of my test
> VMs have been running test 068 in a tight loop for about half an
> hour without a hang, so I'd consider this problem fixed by Jan's
> freeze fixes given I could reliably hang it in 2-3 minutes before
> adding Jan's patch set to my stack.
>
> So the fix for this problem is getting Jan's patch set into the
> kernel at the same time we get the inode size logging changes into
> the kernel. What do people think about that for a plan?
>
> Cheers,
>
> Dave.

Good job.

Jan's freeze patch set is at v5 and seems to be settling down. What is 
the status of Jan's freeze code getting into the kernel?

--Mark Tinguely

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH 06/10] xfs: do not write the buffer from xfs_iflush
  2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
  2012-04-13 10:31   ` Dave Chinner
@ 2012-04-18 13:33   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-18 13:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Instead of writing the buffer directly from inside xfs_iflush return it to
> the caller and let the caller decide what to do with the buffer.  Also
> remove the pincount check in xfs_iflush that all non-blocking callers already
> implement and the now unused flags parameter.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good,

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-17  8:26         ` Dave Chinner
  2012-04-18 13:13           ` Mark Tinguely
@ 2012-04-18 17:53           ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-18 17:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On 04/17/12 03:26, Dave Chinner wrote:

> Yeah, it's pretty clear what is happening here. We don't have
> freeze protection against EOF zeroing operations. At least
> xfs_setattr_size() and xfs_change_file_space() fail to check for
> freeze, and that is initially what I thought was causing this problem.
>
> However, adding freeze checks into the relevant paths didn't make
> the hangs go away, so there's more to it than that. Basically, we've
> been getting races between checking for freeze, the dirtying of the
> pages and the flusher thread syncing out the dirty data. i.e.:
>
> Thread 1		Thread 2		freeze		flusher thread
> write inode A
> check for freeze
> 					grab s_umount
> 					SB_FREEZE_WRITE
> 					writeback_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> 					sync_inodes_sb()
> 								iterate dirty inodes
> 								inode A not in flush
> dirty pages
> mark inode A dirty
> write inode A done.
> 					SB_FREEZE_TRANS
> 					drop s_umount
> 					freeze done
> 			sync
> 			grab s_umount
> 								iterate dirty inodes
> 								Flush dirty inode A
>
>
> Before we added the transactional inode size updates, this race
> simply went unnoticed because nothing caused the flusher thread to
> block. All the problems I see are due to overwrites of allocated
> space - if there was real allocation then the delalloc conversion
> would have always hung. Now we see that when we need to extend the
> file size when writing, we have to allocate a transaction and hence
> the flusher thread now hangs.
>
> While I can "fix" the xfs_setattr_size() and xfs_change_file_space()
> triggers, they don't close the above race condition, so this problem
> is essentially unfixable in XFS. The only reason we have not tripped
> over it before is that the flusher thread didn't hang waiting for a
> transaction reservation when the race was hit.
>
> So why didn't this happen before Christoph's patch set? That's
> something I can't explain. Oh, wait, yes I can - 068 hangs even
> without this patch of Christoph's. Actually, looking at my xfstests
> logs, I can trace the start of the failures back to mid march, and
> that coincided with an update to the xfstests installed on my test
> boxes. Which coincides with when my machines first saw this change:
>
> commit 281627df3eb55e1b729b9bb06fff5ff112929646
> Author: Christoph Hellwig <hch@infradead.org>
> Date:   Tue Mar 13 08:41:05 2012 +0000
>
>      xfs: log file size updates at I/O completion time
>
> That confirms my analysis above - the problem is being exposed by new
> code in the writeback path that does transaction allocation where it
> didn't used to.
>
> Clearly the problem is not really the new code in Christoph's
> patches - it's an existing freeze problem that has previously
> resulted in data writes occurring after a freeze has completed (of
> which we have had rare complaints about). That sounds pretty dire,
> except for one thing: Jan Kara's patch set that fixes all these
> freeze problems:
>
> https://lkml.org/lkml/2012/4/16/356
>
> And now that I've run some testing with Jan's patch series, along
> with Christoph's and mine (75-odd patches ;), a couple of my test
> VMs have been running test 068 in a tight loop for about half an
> hour without a hang, so I'd consider this problem fixed by Jan's
> freeze fixes given I could reliably hang it in 2-3 minutes before
> adding Jan's patch set to my stack.
>
> So the fix for this problem is getting Jan's patch set into the
> kernel at the same time we get the inode size logging changes into
> the kernel. What do people think about that for a plan?
>
> Cheers,
>
> Dave.
> -- Dave Chinner david@fromorbit.com

Just a heads up, Jan's freeze patch set did clear up the test 068 hang on my 
test box as well, but the 106 (quota test) hang on one of the mounts is 
still there.

--Mark.


* Re: [PATCH 04/10] xfs: implement freezing by emptying the AIL
  2012-04-18 13:13           ` Mark Tinguely
@ 2012-04-18 18:14             ` Ben Myers
  0 siblings, 0 replies; 42+ messages in thread
From: Ben Myers @ 2012-04-18 18:14 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Wed, Apr 18, 2012 at 08:13:55AM -0500, Mark Tinguely wrote:
> On 04/17/12 03:26, Dave Chinner wrote:
> >On Tue, Apr 17, 2012 at 02:20:23PM +1000, Dave Chinner wrote:
> >
> >commit 281627df3eb55e1b729b9bb06fff5ff112929646
> >Author: Christoph Hellwig <hch@infradead.org>
> >Date:   Tue Mar 13 08:41:05 2012 +0000
> >
> >     xfs: log file size updates at I/O completion time
> >
> >That confirms my analysis above - the problem is being exposed by new
> >code in the writeback path that does transaction allocation where it
> >didn't used to.
> >
> >Clearly the problem is not really the new code in Christoph's
> >patches - it's an existing freeze problem that has previously
> >resulted in data writes occurring after a freeze has completed (of
> >which we have had rare complaints about). That sounds pretty dire,
> >except for one thing: Jan Kara's patch set that fixes all these
> >freeze problems:
> >
> >https://lkml.org/lkml/2012/4/16/356
> >
> >And now that I've run some testing with Jan's patch series, along
> >with Christoph's and mine (75-odd patches ;), a couple of my test
> >VMs have been running test 068 in a tight loop for about half an
> >hour without a hang, so I'd consider this problem fixed by Jan's
> >freeze fixes given I could reliably hang it in 2-3 minutes before
> >adding Jan's patch set to my stack.
> >
> >So the fix for this problem is getting Jan's patch set into the
> >kernel at the same time we get the inode size logging changes into
> >the kernel. What do people think about that for a plan?
> >
> >Cheers,
> >
> >Dave.
> 
> Good job.
> 
> Jan's freeze patch set is at v5 and seems to be settling down. What
> is the status of Jan's freeze code getting into the kernel?

The trouble I was having yesterday seems to be related to the i386 box on
which I was running.  Apparently something has regressed badly since 3.3 on
that box.  Things seem to be working fine on another x86_64 machine.

-Ben


* Re: [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed
  2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
  2012-04-13 10:35   ` Dave Chinner
@ 2012-04-18 21:11   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-18 21:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Instead of adding buffers to the delwri list as soon as they are logged,
> even if they can't be written until committed because they are pinned,
> defer adding them to the delwri list until xfsaild pushes them.  This
> makes the code more similar to other log items and prepares for writing
> buffers directly from xfsaild.
>
> The complication here is that we need to fail buffers that were added
> but not logged yet in xfs_buf_item_unpin, borrowing code from
> xfs_bioerror.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush
  2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
  2012-04-13 10:33   ` Dave Chinner
@ 2012-04-18 21:11   ` Mark Tinguely
  1 sibling, 0 replies; 42+ messages in thread
From: Mark Tinguely @ 2012-04-18 21:11 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Instead of writing the buffer directly from inside xfs_qm_dqflush return it
> to the caller and let the caller decide what to do with the buffer.  Also
> remove the pincount check in xfs_qm_dqflush that all non-blocking callers
> already implement and the now unused flags parameter and the XFS_DQ_IS_DIRTY
> check that all callers already perform.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
>
> ---
Looks good.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>


* Re: [PATCH 09/10] xfs: on-stack delayed write buffer lists
  2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
  2012-04-13 11:37   ` Dave Chinner
@ 2012-04-20 18:19   ` Mark Tinguely
  2012-04-21  0:42     ` Dave Chinner
  1 sibling, 1 reply; 42+ messages in thread
From: Mark Tinguely @ 2012-04-20 18:19 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On 03/27/12 11:44, Christoph Hellwig wrote:
> Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
> and write back the buffers per-process instead of by waking up xfsbufd.
>
> This is now easily doable given that we have very few places left that write
> delwri buffers:
>
>   - log recovery:
> 	Only done at mount time, and already forcing out the buffers
> 	synchronously using xfs_flush_buftarg
>
>   - quotacheck:
> 	Same story.
>
>   - dquot reclaim:
> 	Writes out dirty dquots on the LRU under memory pressure.  We might
> 	want to look into doing more of this via xfsaild, but it's already
> 	more optimal than the synchronous inode reclaim that writes each
> 	buffer synchronously.
>
>   - xfsaild:
> 	This is the main beneficiary of the change.  By keeping a local list
> 	of buffers to write we reduce latency of writing out buffers, and
> 	more importantly we can remove all the delwri list promotions which
> 	were hitting the buffer cache hard under sustained metadata loads.
>
> The implementation is very straightforward - xfs_buf_delwri_queue now gets
> a new list_head pointer that it adds the delwri buffers to, and all callers
> need to eventually submit the list using xfs_buf_delwri_submit or
> xfs_buf_delwri_submit_nowait.  Buffers that already are on a delwri list are
> skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
> list.  The biggest change to pass down the buffer list was done to the AIL
> pushing. Now that we operate on buffers the trylock, push and pushbuf log
> item methods are merged into a single push routine, which tries to lock the
> item, and if possible add the buffer that needs writeback to the buffer list.
> This leads to much simpler code than the previous split but requires the
> individual IOP_PUSH instances to unlock and reacquire the AIL around calls
> to blocking routines.
>
> Given that xfsaild now also handles writing out buffers, the conditions for
> log forcing and the sleep times needed some small changes.  The most
> important one is that we consider an AIL busy as long as we still have buffers
> to push, and the other one is that we do increment the pushed LSN for
> buffers that are under flushing at this moment, but still count them towards
> the stuck items for restart purposes.  Without this we could hammer on stuck
> items without ever forcing the log and not make progress under heavy random
> delete workloads on fast flash storage devices.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Test 106 runs to completion with patch 06.

Patches 07 and 08 do not compile without patch 09.

Starting with patch 09, I get the following hang on every test 106:

PID: 27992  TASK: ffff8808310d00c0  CPU: 2   COMMAND: "mount"
  #0 [ffff880834237938] __schedule at ffffffff81417200
  #1 [ffff880834237a80] schedule at ffffffff81417574
  #2 [ffff880834237a90] schedule_timeout at ffffffff81415805
  #3 [ffff880834237b30] wait_for_common at ffffffff81416a67
  #4 [ffff880834237bc0] wait_for_completion at ffffffff81416bd8
  #5 [ffff880834237bd0] xfs_buf_iowait at ffffffffa04fc5a5 [xfs]
  #6 [ffff880834237c00] xfs_buf_delwri_submit at ffffffffa04fe4b9 [xfs]
  #7 [ffff880834237c40] xfs_qm_quotacheck at ffffffffa055cb2d [xfs]
  #8 [ffff880834237cc0] xfs_qm_mount_quotas at ffffffffa055cdf0 [xfs]
  #9 [ffff880834237cf0] xfs_mountfs at ffffffffa054c041 [xfs]
#10 [ffff880834237d40] xfs_fs_fill_super at ffffffffa050ca80 [xfs]
#11 [ffff880834237d70] mount_bdev at ffffffff81150c5c
#12 [ffff880834237de0] xfs_fs_mount at ffffffffa050ac00 [xfs]
#13 [ffff880834237df0] mount_fs at ffffffff811505f8
#14 [ffff880834237e40] vfs_kern_mount at ffffffff8116c070
#15 [ffff880834237e80] do_kern_mount at ffffffff8116c16e
#16 [ffff880834237ec0] do_mount at ffffffff8116d6f0
#17 [ffff880834237f20] sys_mount at ffffffff8116d7f3
#18 [ffff880834237f80] system_call_fastpath at ffffffff814203b9


The workers seem to be idle. For example the xfsaild:

PID: 27676  TASK: ffff880832880240  CPU: 3   COMMAND: "xfsaild/sda7"
  #0 [ffff880832933cb0] __schedule at ffffffff81417200
  #1 [ffff880832933df8] schedule at ffffffff81417574
  #2 [ffff880832933e08] schedule_timeout at ffffffff81415805
  #3 [ffff880832933ea8] xfsaild at ffffffffa0555935 [xfs]
  #4 [ffff880832933ee8] kthread at ffffffff8105dd6e
  #5 [ffff880832933f48] kernel_thread_helper at ffffffff814216a4


The hang is on the third quotacheck.

Should be easy to duplicate this.

--Mark Tinguely.



* Re: [PATCH 09/10] xfs: on-stack delayed write buffer lists
  2012-04-20 18:19   ` Mark Tinguely
@ 2012-04-21  0:42     ` Dave Chinner
  2012-04-23  1:57       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2012-04-21  0:42 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Fri, Apr 20, 2012 at 01:19:46PM -0500, Mark Tinguely wrote:
> On 03/27/12 11:44, Christoph Hellwig wrote:
> >Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
> >and write back the buffers per-process instead of by waking up xfsbufd.
> >
> >This is now easily doable given that we have very few places left that write
> >delwri buffers:
> >
> >  - log recovery:
> >	Only done at mount time, and already forcing out the buffers
> >	synchronously using xfs_flush_buftarg
> >
> >  - quotacheck:
> >	Same story.
> >
> >  - dquot reclaim:
> >	Writes out dirty dquots on the LRU under memory pressure.  We might
> >	want to look into doing more of this via xfsaild, but it's already
> >	more optimal than the synchronous inode reclaim that writes each
> >	buffer synchronously.
> >
> >  - xfsaild:
> >	This is the main beneficiary of the change.  By keeping a local list
> >	of buffers to write we reduce latency of writing out buffers, and
> >	more importantly we can remove all the delwri list promotions which
> >	were hitting the buffer cache hard under sustained metadata loads.
> >
> >The implementation is very straightforward - xfs_buf_delwri_queue now gets
> >a new list_head pointer that it adds the delwri buffers to, and all callers
> >need to eventually submit the list using xfs_buf_delwri_submit or
> >xfs_buf_delwri_submit_nowait.  Buffers that already are on a delwri list are
> >skipped in xfs_buf_delwri_queue, assuming they already are on another delwri
> >list.  The biggest change to pass down the buffer list was done to the AIL
> >pushing. Now that we operate on buffers the trylock, push and pushbuf log
> >item methods are merged into a single push routine, which tries to lock the
> >item, and if possible add the buffer that needs writeback to the buffer list.
> >This leads to much simpler code than the previous split but requires the
> >individual IOP_PUSH instances to unlock and reacquire the AIL around calls
> >to blocking routines.
> >
> >Given that xfsaild now also handles writing out buffers, the conditions for
> >log forcing and the sleep times needed some small changes.  The most
> >important one is that we consider an AIL busy as long as we still have buffers
> >to push, and the other one is that we do increment the pushed LSN for
> >buffers that are under flushing at this moment, but still count them towards
> >the stuck items for restart purposes.  Without this we could hammer on stuck
> >items without ever forcing the log and not make progress under heavy random
> >delete workloads on fast flash storage devices.
> >
> >Signed-off-by: Christoph Hellwig <hch@lst.de>
> 
> Test 106 runs to completion with patch 06.
> 
> Patches 07 and 08 do not compile without patch 09.
> 
> Starting with patch 09, I get the following hang on every test 106:

FYI, test 106 is not in the auto group, which means it typically
isn't run on regression test runs by anyone who isn't modifying
quota code. That'll be why nobody else is seeing this.

As it is, I don't understand why it isn't in the auto group. The
commit that removed it:

62f8947 Test case for repair dir2 freetab botch.

was completely unrelated to quota stuff - it added test 110, but
removed 106-108 from the auto group as well. Perhaps that was an
oversight? I note that 108 has been brought back into the auto
group, but not 106/107....

> 
> PID: 27992  TASK: ffff8808310d00c0  CPU: 2   COMMAND: "mount"
>  #0 [ffff880834237938] __schedule at ffffffff81417200
>  #1 [ffff880834237a80] schedule at ffffffff81417574
>  #2 [ffff880834237a90] schedule_timeout at ffffffff81415805
>  #3 [ffff880834237b30] wait_for_common at ffffffff81416a67
>  #4 [ffff880834237bc0] wait_for_completion at ffffffff81416bd8
>  #5 [ffff880834237bd0] xfs_buf_iowait at ffffffffa04fc5a5 [xfs]
>  #6 [ffff880834237c00] xfs_buf_delwri_submit at ffffffffa04fe4b9 [xfs]
>  #7 [ffff880834237c40] xfs_qm_quotacheck at ffffffffa055cb2d [xfs]
>  #8 [ffff880834237cc0] xfs_qm_mount_quotas at ffffffffa055cdf0 [xfs]
>  #9 [ffff880834237cf0] xfs_mountfs at ffffffffa054c041 [xfs]
> #10 [ffff880834237d40] xfs_fs_fill_super at ffffffffa050ca80 [xfs]
> #11 [ffff880834237d70] mount_bdev at ffffffff81150c5c
> #12 [ffff880834237de0] xfs_fs_mount at ffffffffa050ac00 [xfs]
> #13 [ffff880834237df0] mount_fs at ffffffff811505f8
> #14 [ffff880834237e40] vfs_kern_mount at ffffffff8116c070
> #15 [ffff880834237e80] do_kern_mount at ffffffff8116c16e
> #16 [ffff880834237ec0] do_mount at ffffffff8116d6f0
> #17 [ffff880834237f20] sys_mount at ffffffff8116d7f3
> #18 [ffff880834237f80] system_call_fastpath at ffffffff814203b9

An event trace is going to be the only way to find out why it is
still waiting.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 09/10] xfs: on-stack delayed write buffer lists
  2012-04-21  0:42     ` Dave Chinner
@ 2012-04-23  1:57       ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2012-04-23  1:57 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Christoph Hellwig, xfs

On Sat, Apr 21, 2012 at 10:42:56AM +1000, Dave Chinner wrote:
> On Fri, Apr 20, 2012 at 01:19:46PM -0500, Mark Tinguely wrote:
> > On 03/27/12 11:44, Christoph Hellwig wrote:
> > >Queue delwri buffers on a local on-stack list instead of a per-buftarg one,
> > >and write back the buffers per-process instead of by waking up xfsbufd.
.....
> > >Given that xfsaild now also handles writing out buffers, the conditions for
> > >log forcing and the sleep times needed some small changes.  The most
> > >important one is that we consider an AIL busy as long as we still have buffers
> > >to push, and the other one is that we do increment the pushed LSN for
> > >buffers that are under flushing at this moment, but still count them towards
> > >the stuck items for restart purposes.  Without this we could hammer on stuck
> > >items without ever forcing the log and not make progress under heavy random
> > >delete workloads on fast flash storage devices.
> > >
> > >Signed-off-by: Christoph Hellwig <hch@lst.de>
> > 
> > Test 106 runs to completion with patch 06.
> > 
> > Patches 07 and 08 do not compile without patch 09.
> > 
> > Starting with patch 09, I get the following hang on every test 106:
> 
> FYI, test 106 is not in the auto group, which means it typically
> isn't run on regression test runs by anyone who isn't modifying
> quota code. That'll be why nobody else is seeing this.

And I can reproduce it easily enough with 106.

> > PID: 27992  TASK: ffff8808310d00c0  CPU: 2   COMMAND: "mount"
> >  #0 [ffff880834237938] __schedule at ffffffff81417200
> >  #1 [ffff880834237a80] schedule at ffffffff81417574
> >  #2 [ffff880834237a90] schedule_timeout at ffffffff81415805
> >  #3 [ffff880834237b30] wait_for_common at ffffffff81416a67
> >  #4 [ffff880834237bc0] wait_for_completion at ffffffff81416bd8
> >  #5 [ffff880834237bd0] xfs_buf_iowait at ffffffffa04fc5a5 [xfs]
> >  #6 [ffff880834237c00] xfs_buf_delwri_submit at ffffffffa04fe4b9 [xfs]
> >  #7 [ffff880834237c40] xfs_qm_quotacheck at ffffffffa055cb2d [xfs]
> >  #8 [ffff880834237cc0] xfs_qm_mount_quotas at ffffffffa055cdf0 [xfs]
> >  #9 [ffff880834237cf0] xfs_mountfs at ffffffffa054c041 [xfs]
> > #10 [ffff880834237d40] xfs_fs_fill_super at ffffffffa050ca80 [xfs]
> > #11 [ffff880834237d70] mount_bdev at ffffffff81150c5c
> > #12 [ffff880834237de0] xfs_fs_mount at ffffffffa050ac00 [xfs]
> > #13 [ffff880834237df0] mount_fs at ffffffff811505f8
> > #14 [ffff880834237e40] vfs_kern_mount at ffffffff8116c070
> > #15 [ffff880834237e80] do_kern_mount at ffffffff8116c16e
> > #16 [ffff880834237ec0] do_mount at ffffffff8116d6f0
> > #17 [ffff880834237f20] sys_mount at ffffffff8116d7f3
> > #18 [ffff880834237f80] system_call_fastpath at ffffffff814203b9
> 
> An event trace is going to be the only way to find out why it is
> still waiting.....

Interesting. The buffer that we have hung on looks strange:

xfs_buf_read:         dev 253:16 bno 0x4d8 len 0x1000 hold 1 pincount 0
		      lock 0 flags READ|READ_AHEAD|ASYNC|TRYLOCK
		      caller xfs_buf_readahead
xfs_buf_iorequest:    dev 253:16 bno 0x4d8 nblks 0x8 hold 1 pincount 0
		      lock 0 flags READ|READ_AHEAD|ASYNC|TRYLOCK|PAGES
		      caller _xfs_buf_read

What's really strange about that is the flags. The first trace is
the IO request type trace - the flags indicate the type of IO to be
done, and here it is clearly readahead. The second is the IO
dispatch, where the flags come from the *buffer*, and according to
the xfs_buf_alloc code:

        /*
         * We don't want certain flags to appear in b_flags.
         */
        flags &= ~(XBF_LOCK|XBF_MAPPED|XBF_DONT_BLOCK|XBF_READ_AHEAD);

And according to the xfs_buf.h code:

        { XBF_LOCK,             "LOCK" },       /* should never be set */\
	{ XBF_TRYLOCK,          "TRYLOCK" },    /* ditto */\
	{ XBF_DONT_BLOCK,       "DONT_BLOCK" }, /* ditto */\

So clearly these don't all match up. We're letting the trylock flag
slip through, but this should be harmless. However, the
XBF_READ_AHEAD flag is clearly showing up, but it is supposed to be
there across the IO (set in _xfs_buf_read, cleared in xfs_buf_ioend)
so we can ignore that. The trylock is slipping through because it is
not masked out in xfs_buf_alloc() like it should be, but it is otherwise
unused so that isn't an issue. That leaves the only possible source
of the problem being the ASYNC flag.

We don't clear the async flag during IO completion because it is
used throughout IO completion to determine exactly what to do, and
completion has no idea if it is the final completion or not, so we
can't sanely clear it there.  Hence it has to be cleared on IO
submission. That's the bug.

In __xfs_buf_delwri_submit(), the buffer submit does:

                if (!wait) {
                        bp->b_flags |= XBF_ASYNC;
                        list_del_init(&bp->b_list);
                }

whereas the old xfs_flush_buftarg() code does:

                if (wait) {
                        bp->b_flags &= ~XBF_ASYNC;
                        list_add(&bp->b_list, &wait_list);
                }

Subtle difference, and one that I missed on review. That is, the old
code assumes that the XBF_ASYNC flag is already set because it is,
in fact, set in xfs_buf_delwri_queue():

		bp->b_flags |= XBF_DELWRI | _XBF_DELWRI_Q | XBF_ASYNC;

That is, for a buffer to be on the delwri queue, it has to have
XBF_ASYNC set on it. Hence for a blocking flush, we need to clear
that flag for IO completion to send wakeups.

The new code only does this in xfs_buf_delwri_queue() now:

	bp->b_flags |= _XBF_DELWRI_Q;

so the flushing code can make no assumption about the state of
XBF_ASYNC flag at all. Hence the new code in
__xfs_buf_delwri_submit() must clear the XBF_ASYNC flag to get it
into a known state for both blocking and async flushes:

        list_for_each_entry_safe(bp, n, submit_list, b_list) {
-               bp->b_flags &= ~_XBF_DELWRI_Q;
+               bp->b_flags &= ~(_XBF_DELWRI_Q | XBF_ASYNC);
                bp->b_flags |= XBF_WRITE;

                if (!wait) {
                        bp->b_flags |= XBF_ASYNC;
                        list_del_init(&bp->b_list);
                }
                xfs_bdstrat_cb(bp);
        }

I'll update the patch in my stack to make this change, and write a
new patch on top of my buffer flag cleanups to clean up the flag
exclusion code (because XBF_LOCK is already gone!).

With the above fix, test 106 runs to completion, but it still fails
because the golden output is full of unfiltered numbers and
irrelevant stuff. I'd say that's why the test is not in the auto
group - it needs work to give reliable, machine-independent
output...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



Thread overview: 42+ messages
2012-03-27 16:44 [PATCH 00/10] remove xfsbufd Christoph Hellwig
2012-03-27 16:44 ` [PATCH 01/10] xfs: remove log item from AIL in xfs_qm_dqflush after a shutdown Christoph Hellwig
2012-03-27 18:17   ` Mark Tinguely
2012-04-13  9:36   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 02/10] xfs: remove log item from AIL in xfs_iflush " Christoph Hellwig
2012-04-13  9:37   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 03/10] xfs: allow assigning the tail lsn with the AIL lock held Christoph Hellwig
2012-03-27 18:18   ` Mark Tinguely
2012-04-13  9:42   ` Dave Chinner
2012-03-27 16:44 ` [PATCH 04/10] xfs: implement freezing by emptying the AIL Christoph Hellwig
2012-04-13 10:04   ` Dave Chinner
2012-04-16 13:33   ` Mark Tinguely
2012-04-16 13:47   ` Mark Tinguely
2012-04-16 23:54     ` Dave Chinner
2012-04-17  4:20       ` Dave Chinner
2012-04-17  8:26         ` Dave Chinner
2012-04-18 13:13           ` Mark Tinguely
2012-04-18 18:14             ` Ben Myers
2012-04-18 17:53           ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 05/10] xfs: do flush inodes from background inode reclaim Christoph Hellwig
2012-04-13 10:14   ` Dave Chinner
2012-04-16 19:25   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 06/10] xfs: do not write the buffer from xfs_iflush Christoph Hellwig
2012-04-13 10:31   ` Dave Chinner
2012-04-18 13:33   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 07/10] xfs: do not write the buffer from xfs_qm_dqflush Christoph Hellwig
2012-04-13 10:33   ` Dave Chinner
2012-04-18 21:11   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 08/10] xfs: do not add buffers to the delwri queue until pushed Christoph Hellwig
2012-04-13 10:35   ` Dave Chinner
2012-04-18 21:11   ` Mark Tinguely
2012-03-27 16:44 ` [PATCH 09/10] xfs: on-stack delayed write buffer lists Christoph Hellwig
2012-04-13 11:37   ` Dave Chinner
2012-04-20 18:19   ` Mark Tinguely
2012-04-21  0:42     ` Dave Chinner
2012-04-23  1:57       ` Dave Chinner
2012-03-27 16:44 ` [PATCH 10/10] xfs: remove some obsolete comments in xfs_trans_ail.c Christoph Hellwig
2012-04-13 11:37   ` Dave Chinner
2012-03-28  0:53 ` [PATCH 00/10] remove xfsbufd Dave Chinner
2012-03-28 15:10   ` Christoph Hellwig
2012-03-29  0:52     ` Dave Chinner
2012-03-29 19:38       ` Christoph Hellwig
