* [PATCH 0/10] Delayed write metadata writeback V4
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

While I started with killing async inode writeback, the series has
grown. It's not really limited to inode writeback - it touches dquot
flushing, changes the way the AIL pushes on buffers, adds xfsbufd
sorting for delayed write buffers, adds a real non-blocking mode to
inode reclaim and avoids physical inode writeback from the VFS while
fixing bugs in handling delayed write inodes.  Hence this is more
about enabling efficient delayed write metadata than it is about
killing async inode writeback.

The idea behind this series is to have metadata buffers written
from xfsbufd via the delayed write queue rather than being issued
asynchronously from all over the place. To do this, async buffer
writeback is almost entirely removed from XFS, replaced instead by
delayed writes and a method to expedite flushing of delayed write
buffers when required.

The result of funnelling all the buffer IO into a single place
is that we can more tightly control, and therefore optimise, the
submission of metadata IO. Aggregating the buffers before dispatch
allows much better sort efficiency because the sort window is not
limited to the size of the elevator congestion hysteresis limit.
Hence we can approach 100% merge efficiency on large numbers of
buffers when they are dispatched for IO, and greatly reduce the
amount of seeking metadata writeback causes.
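
For reference, the mechanism is small. This is the block number
comparator that patch 04 of the series feeds to the generic
list_sort() when the xfsbufd splits buffers off the delwri queue
for dispatch (condensed from that patch):

	static int
	xfs_buf_cmp(
		void		*priv,
		struct list_head *a,
		struct list_head *b)
	{
		struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
		struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
		xfs_daddr_t	diff;

		/* b_bn is 64 bits wide, but the return value is only 32 bits */
		diff = ap->b_bn - bp->b_bn;
		if (diff < 0)
			return -1;
		if (diff > 0)
			return 1;
		return 0;
	}

The xfsbufd then calls list_sort(NULL, &tmp, xfs_buf_cmp) on the
list it splits off, so buffers are issued in ascending block order
and adjacent buffers merge in the elevator.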

The major change is to the inode flushing and reclaim code. Delayed
write inodes hold the flush lock for much longer than for async
writeback, and hence blocking on the flush lock can cause extremely
long latencies without other mechanisms to expedite the release of
the flush locks. To prevent needing to flush inodes immediately,
all operations are done non-blocking unless synchronous. This
required a significant rework of the inode reclaim code, but it
greatly simplified other pieces of code (e.g. log item pushing).

Version 4
- rework inode reclaim checks for better legibility
- add warning to reclaim code when delwri flush errors occur
- kill XFS_ITEM_FLUSHING now it is not used
- clean up sync_mode flags being pushed into xfs_iflush()
- kill the now unused xfs_bawrite() function
- include Christoph's fsync cache flush fix
- rework the inode locking and call to xfs_fsync() when doing
  synchronous inode writes to close races between the fsync and
  the background delwri flush afterwards.

Version 3
- rework inode reclaim to:
	- separate it from xfs_iflush return values
	- provide a non-blocking mode for background operation
- apply delwri buffer promotion tricks to dquot flushing
- kill unneeded dquot flushing flags, similar to inode flushing flag
  removal
- fix sync inode flush bug when trying to flush delwri inodes

Version 2:
- use generic list sort function
- when unmounting, push the delwri buffers first, then do sync inode
  reclaim so that reclaim doesn't block for 15 seconds waiting for
  delwri inode buffers to be aged and written before the inodes can
  be reclaimed.

Performance numbers for this version are the same as V2, which were
as follows:

Perf results (average of 3 runs) on a debug XFS build (means allocation
patterns are randomly varied, so runtimes are also a bit variable):

Untar 2.6.32 kernel tarball, sync, then remove:

                  Untar+sync     rm -rf
xfs-dev:           25.2s          13.0s
xfs-dev-delwri-1:  22.5s           9.1s
xfs-dev-delwri-2:  21.9s           8.4s

4 processes each creating 100,000 five-byte files in separate
directories concurrently, then 4 processes each removing one of
those directories concurrently.

                  create          rm -rf
xfs-dev:           8m32s           4m10s
xfs-dev-delwri-1:  4m55s           3m42s
xfs-dev-delwri-2:  4m56s           3m33s


The patch series (plus the couple of previous bug fixes that haven't
been pulled into the tree on oss yet) is available in the git
repository at:

  git://git.kernel.org/pub/scm/linux/kernel/dgc/xfs for-2.6.34

Christoph Hellwig (1):
      xfs: remove invalid barrier optimization from xfs_fsync

Dave Chinner (11):
      xfs: don't hold onto reserved blocks on remount,ro
      xfs: turn off sign warnings
      xfs: Make inode reclaim states explicit
      xfs: Use delayed write for inodes rather than async V2
      xfs: Don't issue buffer IO direct from AIL push V2
      xfs: Sort delayed write buffers before dispatch
      xfs: Use delay write promotion for dquot flushing
      xfs: kill the unused XFS_QMOPT_* flush flags V2
      xfs: move the inode locking outside xfs_fsync()
      xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
      xfs: kill xfs_bawrite

 fs/xfs/Makefile               |    2 +-
 fs/xfs/linux-2.6/xfs_buf.c    |  135 ++++++++++++++++++++++++++--------------
 fs/xfs/linux-2.6/xfs_buf.h    |    3 +-
 fs/xfs/linux-2.6/xfs_file.c   |    6 ++-
 fs/xfs/linux-2.6/xfs_lrw.c    |    5 +-
 fs/xfs/linux-2.6/xfs_super.c  |   82 ++++++++++++++++++-------
 fs/xfs/linux-2.6/xfs_sync.c   |  138 +++++++++++++++++++++++++++++++++-------
 fs/xfs/linux-2.6/xfs_trace.h  |    1 +
 fs/xfs/quota/xfs_dquot.c      |   38 +++++-------
 fs/xfs/quota/xfs_dquot_item.c |   87 ++++----------------------
 fs/xfs/quota/xfs_dquot_item.h |    4 -
 fs/xfs/quota/xfs_qm.c         |   14 ++---
 fs/xfs/xfs_buf_item.c         |   64 ++++++++++---------
 fs/xfs/xfs_inode.c            |   86 ++------------------------
 fs/xfs/xfs_inode.h            |   11 +---
 fs/xfs/xfs_inode_item.c       |  108 +++++++-------------------------
 fs/xfs/xfs_inode_item.h       |    6 --
 fs/xfs/xfs_mount.c            |   13 ++++-
 fs/xfs/xfs_mount.h            |    1 +
 fs/xfs/xfs_quota.h            |    8 +--
 fs/xfs/xfs_trans.h            |    3 +-
 fs/xfs/xfs_trans_ail.c        |   13 ++--
 fs/xfs/xfs_vnodeops.c         |   46 +++++--------
 fs/xfs/xfs_vnodeops.h         |    2 +-
 24 files changed, 415 insertions(+), 461 deletions(-)

* [PATCH 01/10] xfs: Make inode reclaim states explicit
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

A.K.A.: don't rely on xfs_iflush() return value in reclaim

We have gradually been moving checks out of the reclaim code because
they are duplicated in xfs_iflush(). We've had a history of problems
in this area, and many of them stem from the overloading of the
return values from xfs_iflush() and its interaction with inode flush
locking when determining if the inode is safe to reclaim.

With the desire to move to delayed write flushing of inodes and
non-blocking inode tree reclaim walks, the overloading of the
return value of xfs_iflush makes it very difficult to determine
the correct thing to do next.

This patch explicitly re-adds the checks to the inode reclaim code,
removing the reliance on the return value of xfs_iflush() to
determine what to do next. It also means that we can clearly
document all the inode states that reclaim must handle and hence
we can easily see that we handled all the necessary cases.

This also removes the need for the xfs_inode_clean() check in
xfs_iflush() as all callers now check this first (safely).

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_sync.c |   80 ++++++++++++++++++++++++++++++++----------
 fs/xfs/xfs_inode.c          |   11 +-----
 fs/xfs/xfs_inode.h          |    1 +
 3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index c9b863e..98b8937 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -706,12 +706,43 @@ __xfs_inode_clear_reclaim_tag(
 			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
 }
 
+/*
+ * Inodes in different states need to be treated differently, and the return
+ * value of xfs_iflush is not sufficient to get this right. The following table
+ * lists the inode states and the reclaim actions necessary for non-blocking
+ * reclaim:
+ *
+ *	inode state	     iflush ret		required action
+ *      ---------------      ----------         ---------------
+ *	bad			-		reclaim
+ *	shutdown		EIO		unpin and reclaim
+ *	clean, unpinned		0		reclaim
+ *	stale, unpinned		0		reclaim
+ *	clean, pinned(*)	0		unpin and reclaim
+ *	stale, pinned		0		unpin and reclaim
+ *	dirty, async		0		block on flush lock, reclaim
+ *	dirty, sync flush	0		block on flush lock, reclaim
+ *
+ * (*) dgc: I don't think the clean, pinned state is possible but it gets
+ * handled anyway given the order of checks implemented.
+ *
+ * Hence the order of actions after gaining the locks should be:
+ *	bad		=> reclaim
+ *	shutdown	=> unpin and reclaim
+ *	pinned		=> unpin
+ *	stale		=> reclaim
+ *	clean		=> reclaim
+ *	dirty		=> flush, wait and reclaim
+ */
 STATIC int
 xfs_reclaim_inode(
 	struct xfs_inode	*ip,
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
+	int	error;
+
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
 	 * with us starting reclaim on the inode.  Once we have the
@@ -729,30 +760,41 @@ xfs_reclaim_inode(
 	spin_unlock(&ip->i_flags_lock);
 	write_unlock(&pag->pag_ici_lock);
 
-	/*
-	 * If the inode is still dirty, then flush it out.  If the inode
-	 * is not in the AIL, then it will be OK to flush it delwri as
-	 * long as xfs_iflush() does not keep any references to the inode.
-	 * We leave that decision up to xfs_iflush() since it has the
-	 * knowledge of whether it's OK to simply do a delwri flush of
-	 * the inode or whether we need to wait until the inode is
-	 * pulled from the AIL.
-	 * We get the flush lock regardless, though, just to make sure
-	 * we don't free it while it is being flushed.
-	 */
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_iflock(ip);
 
-	/*
-	 * In the case of a forced shutdown we rely on xfs_iflush() to
-	 * wait for the inode to be unpinned before returning an error.
-	 */
-	if (!is_bad_inode(VFS_I(ip)) && xfs_iflush(ip, sync_mode) == 0) {
-		/* synchronize with xfs_iflush_done */
-		xfs_iflock(ip);
-		xfs_ifunlock(ip);
+	if (is_bad_inode(VFS_I(ip)))
+		goto reclaim;
+	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		xfs_iunpin_wait(ip);
+		goto reclaim;
+	}
+	if (xfs_ipincount(ip))
+		xfs_iunpin_wait(ip);
+	if (xfs_iflags_test(ip, XFS_ISTALE))
+		goto reclaim;
+	if (xfs_inode_clean(ip))
+		goto reclaim;
+
+	/* Now we have an inode that needs flushing */
+	error = xfs_iflush(ip, sync_mode);
+	if (!error) {
+		switch(sync_mode) {
+		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
+		case XFS_IFLUSH_DELWRI:
+		case XFS_IFLUSH_ASYNC:
+		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
+		case XFS_IFLUSH_SYNC:
+			/* IO issued, synchronise with IO completion */
+			xfs_iflock(ip);
+			break;
+		default:
+			ASSERT(0);
+			break;
+		}
 	}
 
+reclaim:
+	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_ireclaim(ip);
 	return 0;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index d0d1b5a..8d0666d 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2493,7 +2493,7 @@ __xfs_iunpin_wait(
 		wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
 }
 
-static inline void
+void
 xfs_iunpin_wait(
 	xfs_inode_t	*ip)
 {
@@ -2849,15 +2849,6 @@ xfs_iflush(
 	mp = ip->i_mount;
 
 	/*
-	 * If the inode isn't dirty, then just release the inode flush lock and
-	 * do nothing.
-	 */
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
-	/*
 	 * We can't flush the inode until it is unpinned, so wait for it if we
 	 * are allowed to block.  We know noone new can pin it, because we are
 	 * holding the inode lock shared and you need to hold it exclusively to
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index ec1f28c..8b618ea 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -483,6 +483,7 @@ int		xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
 void		xfs_iext_realloc(xfs_inode_t *, int, int);
 void		xfs_ipin(xfs_inode_t *);
 void		xfs_iunpin(xfs_inode_t *);
+void		xfs_iunpin_wait(xfs_inode_t *);
 int		xfs_iflush(xfs_inode_t *, uint);
 void		xfs_ichgtime(xfs_inode_t *, int);
 void		xfs_lock_inodes(xfs_inode_t **, int, uint);
-- 
1.6.5

* [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

We currently do background inode flushing asynchronously, resulting
in inodes being written in whatever order the background writeback
issues them. Not only that, there are also blocking and non-blocking
asynchronous inode flushes, depending on where the flush comes from.

This patch completely removes asynchronous inode writeback. It
removes all the strange writeback modes and replaces them with
either a synchronous flush or a non-blocking delayed write flush.
That is, inode flushes will only issue IO directly if they are
synchronous, and background flushing may do nothing if the operation
would block (e.g. on a pinned inode or buffer lock).
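
At the bottom of xfs_iflush() this reduces to a single decision, as
a hunk later in this patch shows:

	if (flags & SYNC_WAIT)
		error = xfs_bwrite(mp, bp);	/* issue the IO and wait for it */
	else
		xfs_bdwrite(mp, bp);		/* delayed write - leave it for the xfsbufd */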

Delayed write flushes will now result in the inode buffer sitting in
the delwri queue of the buffer cache to be flushed by either an AIL
push or by the xfsbufd timing out the buffer. This will allow
accumulation of dirty inode buffers in memory and allow optimisation
of inode cluster writeback at the xfsbufd level where we have much
greater queue depths than the block layer elevators. We will also
get adjacent inode cluster buffer IO merging for free when a later
patch in the series allows sorting of the delayed write buffers
before dispatch.

This effectively means that any inode that is written back by
background writeback will be seen as flush locked during AIL
pushing, which will result in the buffers being pushed from there.
This writeback path is currently non-optimal, but the next patch
in the series will fix that problem.

A side effect of this delayed write mechanism is that background
inode reclaim will no longer directly flush inodes, nor can it wait
on the flush lock. The result is that inode reclaim must leave the
inode in the reclaimable state until it is clean. Hence attempts to
reclaim a dirty inode in the background will simply skip the inode
until it is clean; this allows other mechanisms (i.e. the xfsbufd) to
do more optimal writeback of the dirty buffers. As a result, the
inode reclaim code has been rewritten so that it no longer relies on
the ambiguous return values of xfs_iflush() to determine whether it
is safe to reclaim an inode.
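
In outline, the rewritten xfs_reclaim_inode() makes each of those
decisions explicitly; a condensed sketch of its logic (requeue
bookkeeping, unlocking and error reporting elided - see the full
hunk below for the details):

	if (!xfs_iflock_nowait(ip)) {
		if (!(sync_mode & SYNC_WAIT))
			goto out;	/* background: leave it for the next pass */
		xfs_iflock(ip);
	}
	if (is_bad_inode(VFS_I(ip)))
		goto reclaim;
	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
		xfs_iunpin_wait(ip);
		goto reclaim;
	}
	if (xfs_ipincount(ip)) {
		if (!(sync_mode & SYNC_WAIT))
			goto out;	/* don't block waiting on a log force */
		xfs_iunpin_wait(ip);
	}
	if (xfs_iflags_test(ip, XFS_ISTALE) || xfs_inode_clean(ip))
		goto reclaim;

	/* dirty: a sync flush waits and reclaims, a delwri flush requeues */
	error = xfs_iflush(ip, sync_mode);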

Portions of this patch are derived from patches by Christoph
Hellwig.

Version 2:
- cleanup reclaim code as suggested by Christoph
- log background reclaim inode flush errors
- just pass sync flags to xfs_iflush

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_super.c |    4 +-
 fs/xfs/linux-2.6/xfs_sync.c  |  104 ++++++++++++++++++++++++++++++-----------
 fs/xfs/xfs_inode.c           |   75 ++----------------------------
 fs/xfs/xfs_inode.h           |   10 ----
 fs/xfs/xfs_inode_item.c      |   10 +++-
 fs/xfs/xfs_mount.c           |   13 +++++-
 6 files changed, 102 insertions(+), 114 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index e9c2145..226fe20 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1064,7 +1064,7 @@ xfs_fs_write_inode(
 		xfs_ilock(ip, XFS_ILOCK_SHARED);
 		xfs_iflock(ip);
 
-		error = xfs_iflush(ip, XFS_IFLUSH_SYNC);
+		error = xfs_iflush(ip, SYNC_WAIT);
 	} else {
 		error = EAGAIN;
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
@@ -1072,7 +1072,7 @@ xfs_fs_write_inode(
 		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
 			goto out_unlock;
 
-		error = xfs_iflush(ip, XFS_IFLUSH_ASYNC_NOBLOCK);
+		error = xfs_iflush(ip, 0);
 	}
 
  out_unlock:
diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index 98b8937..a9f6d20 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -270,8 +270,7 @@ xfs_sync_inode_attr(
 		goto out_unlock;
 	}
 
-	error = xfs_iflush(ip, (flags & SYNC_WAIT) ?
-			   XFS_IFLUSH_SYNC : XFS_IFLUSH_DELWRI);
+	error = xfs_iflush(ip, flags);
 
  out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
@@ -460,16 +459,18 @@ xfs_quiesce_fs(
 {
 	int	count = 0, pincount;
 
+	xfs_reclaim_inodes(mp, 0);
 	xfs_flush_buftarg(mp->m_ddev_targp, 0);
-	xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
 
 	/*
 	 * This loop must run at least twice.  The first instance of the loop
 	 * will flush most meta data but that will generate more meta data
 	 * (typically directory updates).  Which then must be flushed and
-	 * logged before we can write the unmount record.
+	 * logged before we can write the unmount record. We also do sync
+	 * reclaim of inodes to catch any that the above delwri flush skipped.
 	 */
 	do {
+		xfs_reclaim_inodes(mp, SYNC_WAIT);
 		xfs_sync_attr(mp, SYNC_WAIT);
 		pincount = xfs_flush_buftarg(mp->m_ddev_targp, 1);
 		if (!pincount) {
@@ -585,7 +586,7 @@ xfs_sync_worker(
 
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		xfs_log_force(mp, 0);
-		xfs_reclaim_inodes(mp, XFS_IFLUSH_DELWRI_ELSE_ASYNC);
+		xfs_reclaim_inodes(mp, 0);
 		/* dgc: errors ignored here */
 		error = xfs_qm_sync(mp, SYNC_TRYLOCK);
 		error = xfs_sync_fsdata(mp, SYNC_TRYLOCK);
@@ -719,21 +720,42 @@ __xfs_inode_clear_reclaim_tag(
  *	shutdown		EIO		unpin and reclaim
  *	clean, unpinned		0		reclaim
  *	stale, unpinned		0		reclaim
- *	clean, pinned(*)	0		unpin and reclaim
- *	stale, pinned		0		unpin and reclaim
- *	dirty, async		0		block on flush lock, reclaim
- *	dirty, sync flush	0		block on flush lock, reclaim
+ *	clean, pinned(*)	0		requeue
+ *	stale, pinned		EAGAIN		requeue
+ *	dirty, delwri ok	0		requeue
+ *	dirty, delwri blocked	EAGAIN		requeue
+ *	dirty, sync flush	0		reclaim
  *
  * (*) dgc: I don't think the clean, pinned state is possible but it gets
  * handled anyway given the order of checks implemented.
  *
+ * As can be seen from the table, the return value of xfs_iflush() is not
+ * sufficient to correctly decide the reclaim action here. The checks in
+ * xfs_iflush() might look like duplicates, but they are not.
+ *
+ * Also, because we get the flush lock first, we know that any inode that has
+ * been flushed delwri has had the flush completed by the time we check that
+ * the inode is clean. The clean inode check needs to be done before flushing
+ * the inode delwri otherwise we would loop forever requeuing clean inodes as
+ * we cannot tell apart a successful delwri flush and a clean inode from the
+ * return value of xfs_iflush().
+ *
+ * Note that because the inode is flushed delayed write by background
+ * writeback, the flush lock may already be held here and waiting on it can
+ * result in very long latencies. Hence for sync reclaims, where we wait on the
+ * flush lock, the caller should push out delayed write inodes first before
+ * trying to reclaim them to minimise the amount of time spent waiting. For
+ * background reclaim, we just requeue the inode for the next pass.
+ *
  * Hence the order of actions after gaining the locks should be:
  *	bad		=> reclaim
  *	shutdown	=> unpin and reclaim
- *	pinned		=> unpin
+ *	pinned, delwri	=> requeue
+ *	pinned, sync	=> unpin
  *	stale		=> reclaim
  *	clean		=> reclaim
- *	dirty		=> flush, wait and reclaim
+ *	dirty, delwri	=> flush and requeue
+ *	dirty, sync	=> flush, wait and reclaim
  */
 STATIC int
 xfs_reclaim_inode(
@@ -741,7 +763,7 @@ xfs_reclaim_inode(
 	struct xfs_perag	*pag,
 	int			sync_mode)
 {
-	int	error;
+	int	error = 0;
 
 	/*
 	 * The radix tree lock here protects a thread in xfs_iget from racing
@@ -761,7 +783,11 @@ xfs_reclaim_inode(
 	write_unlock(&pag->pag_ici_lock);
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	xfs_iflock(ip);
+	if (!xfs_iflock_nowait(ip)) {
+		if (!(sync_mode & SYNC_WAIT))
+			goto out;
+		xfs_iflock(ip);
+	}
 
 	if (is_bad_inode(VFS_I(ip)))
 		goto reclaim;
@@ -769,8 +795,13 @@ xfs_reclaim_inode(
 		xfs_iunpin_wait(ip);
 		goto reclaim;
 	}
-	if (xfs_ipincount(ip))
+	if (xfs_ipincount(ip)) {
+		if (!(sync_mode & SYNC_WAIT)) {
+			xfs_ifunlock(ip);
+			goto out;
+		}
 		xfs_iunpin_wait(ip);
+	}
 	if (xfs_iflags_test(ip, XFS_ISTALE))
 		goto reclaim;
 	if (xfs_inode_clean(ip))
@@ -778,26 +809,43 @@ xfs_reclaim_inode(
 
 	/* Now we have an inode that needs flushing */
 	error = xfs_iflush(ip, sync_mode);
-	if (!error) {
-		switch(sync_mode) {
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-		case XFS_IFLUSH_DELWRI:
-		case XFS_IFLUSH_ASYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-		case XFS_IFLUSH_SYNC:
-			/* IO issued, synchronise with IO completion */
-			xfs_iflock(ip);
-			break;
-		default:
-			ASSERT(0);
-			break;
-		}
+	if (sync_mode & SYNC_WAIT) {
+		xfs_iflock(ip);
+		goto reclaim;
 	}
 
+	/*
+	 * When we have to flush an inode but don't have SYNC_WAIT set, we
+	 * flush the inode out using a delwri buffer and wait for the next
+	 * call into reclaim to find it in a clean state instead of waiting for
+	 * it now. We also don't return errors here - if the error is transient
+	 * then the next reclaim pass will flush the inode, and if the error
+	 * is permanent then the next sync reclaim will reclaim the inode and
+	 * pass on the error.
+	 */
+	if (error && !XFS_FORCED_SHUTDOWN(ip->i_mount)) {
+		xfs_fs_cmn_err(CE_WARN, ip->i_mount,
+			"inode 0x%llx background reclaim flush failed with %d",
+			(long long)ip->i_ino, error);
+	}
+out:
+	xfs_iflags_clear(ip, XFS_IRECLAIM);
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+	/*
+	 * We could return EAGAIN here to make reclaim rescan the inode tree in
+	 * a short while. However, this just burns CPU time scanning the tree
+	 * waiting for IO to complete and xfssyncd never goes back to the idle
+	 * state. Instead, return 0 to let the next scheduled background reclaim
+	 * attempt to reclaim the inode again.
+	 */
+	return 0;
+
 reclaim:
 	xfs_ifunlock(ip);
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	xfs_ireclaim(ip);
-	return 0;
+	return error;
+
 }
 
 int
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 8d0666d..fa31360 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2835,8 +2835,6 @@ xfs_iflush(
 	xfs_dinode_t		*dip;
 	xfs_mount_t		*mp;
 	int			error;
-	int			noblock = (flags == XFS_IFLUSH_ASYNC_NOBLOCK);
-	enum { INT_DELWRI = (1 << 0), INT_ASYNC = (1 << 1) };
 
 	XFS_STATS_INC(xs_iflush_count);
 
@@ -2859,7 +2857,7 @@ xfs_iflush(
 	 * in the same cluster are dirty, they will probably write the inode
 	 * out for us if they occur after the log force completes.
 	 */
-	if (noblock && xfs_ipincount(ip)) {
+	if (!(flags & SYNC_WAIT) && xfs_ipincount(ip)) {
 		xfs_iunpin_nowait(ip);
 		xfs_ifunlock(ip);
 		return EAGAIN;
@@ -2893,60 +2891,10 @@ xfs_iflush(
 	}
 
 	/*
-	 * Decide how buffer will be flushed out.  This is done before
-	 * the call to xfs_iflush_int because this field is zeroed by it.
-	 */
-	if (iip != NULL && iip->ili_format.ilf_fields != 0) {
-		/*
-		 * Flush out the inode buffer according to the directions
-		 * of the caller.  In the cases where the caller has given
-		 * us a choice choose the non-delwri case.  This is because
-		 * the inode is in the AIL and we need to get it out soon.
-		 */
-		switch (flags) {
-		case XFS_IFLUSH_SYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-			flags = 0;
-			break;
-		case XFS_IFLUSH_ASYNC_NOBLOCK:
-		case XFS_IFLUSH_ASYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-			flags = INT_ASYNC;
-			break;
-		case XFS_IFLUSH_DELWRI:
-			flags = INT_DELWRI;
-			break;
-		default:
-			ASSERT(0);
-			flags = 0;
-			break;
-		}
-	} else {
-		switch (flags) {
-		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
-		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
-		case XFS_IFLUSH_DELWRI:
-			flags = INT_DELWRI;
-			break;
-		case XFS_IFLUSH_ASYNC_NOBLOCK:
-		case XFS_IFLUSH_ASYNC:
-			flags = INT_ASYNC;
-			break;
-		case XFS_IFLUSH_SYNC:
-			flags = 0;
-			break;
-		default:
-			ASSERT(0);
-			flags = 0;
-			break;
-		}
-	}
-
-	/*
 	 * Get the buffer containing the on-disk inode.
 	 */
 	error = xfs_itobp(mp, NULL, ip, &dip, &bp,
-				noblock ? XBF_TRYLOCK : XBF_LOCK);
+				(flags & SYNC_WAIT) ? XBF_LOCK : XBF_TRYLOCK);
 	if (error || !bp) {
 		xfs_ifunlock(ip);
 		return error;
@@ -2974,13 +2922,10 @@ xfs_iflush(
 	if (error)
 		goto cluster_corrupt_out;
 
-	if (flags & INT_DELWRI) {
-		xfs_bdwrite(mp, bp);
-	} else if (flags & INT_ASYNC) {
-		error = xfs_bawrite(mp, bp);
-	} else {
+	if (flags & SYNC_WAIT)
 		error = xfs_bwrite(mp, bp);
-	}
+	else
+		xfs_bdwrite(mp, bp);
 	return error;
 
 corrupt_out:
@@ -3015,16 +2960,6 @@ xfs_iflush_int(
 	iip = ip->i_itemp;
 	mp = ip->i_mount;
 
-
-	/*
-	 * If the inode isn't dirty, then just release the inode
-	 * flush lock and do nothing.
-	 */
-	if (xfs_inode_clean(ip)) {
-		xfs_ifunlock(ip);
-		return 0;
-	}
-
 	/* set *dip = inode's place in the buffer */
 	dip = (xfs_dinode_t *)xfs_buf_offset(bp, ip->i_imap.im_boffset);
 
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 8b618ea..6c912b0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -420,16 +420,6 @@ static inline void xfs_ifunlock(xfs_inode_t *ip)
 #define XFS_ILOCK_DEP(flags)	(((flags) & XFS_ILOCK_DEP_MASK) >> XFS_ILOCK_SHIFT)
 
 /*
- * Flags for xfs_iflush()
- */
-#define	XFS_IFLUSH_DELWRI_ELSE_SYNC	1
-#define	XFS_IFLUSH_DELWRI_ELSE_ASYNC	2
-#define	XFS_IFLUSH_SYNC			3
-#define	XFS_IFLUSH_ASYNC		4
-#define	XFS_IFLUSH_DELWRI		5
-#define	XFS_IFLUSH_ASYNC_NOBLOCK	6
-
-/*
  * Flags for xfs_itruncate_start().
  */
 #define	XFS_ITRUNC_DEFINITE	0x1
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 48ec1c0..207553e 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -866,10 +866,14 @@ xfs_inode_item_push(
 	       iip->ili_format.ilf_fields != 0);
 
 	/*
-	 * Write out the inode.  The completion routine ('iflush_done') will
-	 * pull it from the AIL, mark it clean, unlock the flush lock.
+	 * Push the inode to its backing buffer. This will not remove the
+	 * inode from the AIL - a further push will be required to trigger a
+	 * buffer push. However, this allows all the dirty inodes to be pushed
+	 * to the buffer before it is pushed to disk. The buffer IO completion
+	 * will pull the inode from the AIL, mark it clean and unlock the flush
+	 * lock.
 	 */
-	(void) xfs_iflush(ip, XFS_IFLUSH_ASYNC);
+	(void) xfs_iflush(ip, 0);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	return;
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index 7f81ed7..9a036f4 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1456,7 +1456,18 @@ xfs_unmountfs(
 	 * need to force the log first.
 	 */
 	xfs_log_force(mp, XFS_LOG_SYNC);
-	xfs_reclaim_inodes(mp, XFS_IFLUSH_ASYNC);
+
+	/*
+	 * Do a delwri reclaim pass first so that as many dirty inodes are
+	 * queued up for IO as possible. Then flush the buffers before making
+	 * a synchronous reclaim pass to catch all the remaining inodes.
+	 * This makes the reclaim process as quick as possible by avoiding
+	 * synchronous writeout and blocking on inodes already in the delwri
+	 * state as much as possible.
+	 */
+	xfs_reclaim_inodes(mp, 0);
+	XFS_bflush(mp->m_ddev_targp);
+	xfs_reclaim_inodes(mp, SYNC_WAIT);
 
 	xfs_qm_unmount(mp);
 
-- 
1.6.5

* [PATCH 03/10] xfs: Don't issue buffer IO direct from AIL push V2
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

All buffers logged into the AIL are marked as delayed write. When
the AIL needs to push a buffer out, it currently issues an async
write of the buffer. This means that IO patterns are dependent on
the order of buffers in the AIL.

Instead of flushing the buffer, promote the buffer in the delayed
write list so that the next time the xfsbufd runs the buffer will
be flushed by the xfsbufd. Return a new state to the xfsaild to
indicate that the buffer was promoted, so the xfsaild knows it
needs to make the xfsbufd run to flush the promoted buffers.

Using the xfsbufd for issuing the IO allows us to dispatch all
buffer IO from the one queue. This means that we can make much more
enlightened decisions on what order to flush buffers to disk as
we don't have multiple places issuing IO. Optimisations to xfsbufd
will be in a future patch.
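
The promotion itself is cheap - xfs_buf_delwri_promote(), added in
this patch, just backdates the buffer's queue time so the next
xfsbufd run is guaranteed to consider it old enough to flush:

	long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;

	/* buffers already past the flush age need no help */
	if (bp->b_queuetime < jiffies - age)
		return;

	bp->b_queuetime = jiffies - age;
	spin_lock(&btp->bt_delwrite_lock);
	list_move(&bp->b_list, &btp->bt_delwrite_queue);
	spin_unlock(&btp->bt_delwrite_lock);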

Version 2
- kill XFS_ITEM_FLUSHING as it is now unused.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_buf.c    |   29 ++++++++++++
 fs/xfs/linux-2.6/xfs_buf.h    |    2 +
 fs/xfs/linux-2.6/xfs_trace.h  |    1 +
 fs/xfs/quota/xfs_dquot_item.c |   85 +++++------------------------------
 fs/xfs/quota/xfs_dquot_item.h |    4 --
 fs/xfs/xfs_buf_item.c         |   64 +++++++++++++++------------
 fs/xfs/xfs_inode_item.c       |   98 ++++++----------------------------------
 fs/xfs/xfs_inode_item.h       |    6 ---
 fs/xfs/xfs_trans.h            |    3 +-
 fs/xfs/xfs_trans_ail.c        |   13 +++---
 10 files changed, 102 insertions(+), 203 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 44e20e5..b306265 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1778,6 +1778,35 @@ xfs_buf_delwri_dequeue(
 	trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
 }
 
+/*
+ * If a delwri buffer needs to be pushed before it has aged out, then promote
+ * it to the head of the delwri queue so that it will be flushed on the next
+ * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
+ * than the age currently needed to flush the buffer. Hence the next time the
+ * xfsbufd sees it is guaranteed to be considered old enough to flush.
+ */
+void
+xfs_buf_delwri_promote(
+	struct xfs_buf	*bp)
+{
+	struct xfs_buftarg *btp = bp->b_target;
+	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
+
+	ASSERT(bp->b_flags & XBF_DELWRI);
+	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
+
+	/*
+	 * Check the buffer age before locking the delayed write queue as we
+	 * don't need to promote buffers that are already past the flush age.
+	 */
+	if (bp->b_queuetime < jiffies - age)
+		return;
+	bp->b_queuetime = jiffies - age;
+	spin_lock(&btp->bt_delwrite_lock);
+	list_move(&bp->b_list, &btp->bt_delwrite_queue);
+	spin_unlock(&btp->bt_delwrite_lock);
+}
+
 STATIC void
 xfs_buf_runall_queues(
 	struct workqueue_struct	*queue)
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index ea8c198..be45e8c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -266,6 +266,7 @@ extern int xfs_buf_ispin(xfs_buf_t *);
 
 /* Delayed Write Buffer Routines */
 extern void xfs_buf_delwri_dequeue(xfs_buf_t *);
+extern void xfs_buf_delwri_promote(xfs_buf_t *);
 
 /* Buffer Daemon Setup Routines */
 extern int xfs_buf_init(void);
@@ -395,6 +396,7 @@ extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
 extern void xfs_wait_buftarg(xfs_buftarg_t *);
 extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
 extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
+
 #ifdef CONFIG_KDB_MODULES
 extern struct list_head *xfs_get_buftarg_list(void);
 #endif
diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
index 1bb09e7..a4574dc 100644
--- a/fs/xfs/linux-2.6/xfs_trace.h
+++ b/fs/xfs/linux-2.6/xfs_trace.h
@@ -483,6 +483,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
 DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
+DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
 DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index 1b56437..dda0fb0 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -212,66 +212,31 @@ xfs_qm_dquot_logitem_pushbuf(
 	xfs_dquot_t	*dqp;
 	xfs_mount_t	*mp;
 	xfs_buf_t	*bp;
-	uint		dopush;
 
 	dqp = qip->qli_dquot;
 	ASSERT(XFS_DQ_IS_LOCKED(dqp));
 
 	/*
-	 * The qli_pushbuf_flag keeps others from
-	 * trying to duplicate our effort.
-	 */
-	ASSERT(qip->qli_pushbuf_flag != 0);
-	ASSERT(qip->qli_push_owner == current_pid());
-
-	/*
 	 * If flushlock isn't locked anymore, chances are that the
 	 * inode flush completed and the inode was taken off the AIL.
 	 * So, just get out.
 	 */
 	if (completion_done(&dqp->q_flush)  ||
 	    ((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
-		qip->qli_pushbuf_flag = 0;
 		xfs_dqunlock(dqp);
 		return;
 	}
 	mp = dqp->q_mount;
 	bp = xfs_incore(mp->m_ddev_targp, qip->qli_format.qlf_blkno,
 		    XFS_QI_DQCHUNKLEN(mp), XBF_TRYLOCK);
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
-				  !completion_done(&dqp->q_flush));
-			qip->qli_pushbuf_flag = 0;
-			xfs_dqunlock(dqp);
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(mp, 0);
-
-			if (dopush) {
-				int	error;
-#ifdef XFSRACEDEBUG
-				delay_for_intr();
-				delay(300);
-#endif
-				error = xfs_bawrite(mp, bp);
-				if (error)
-					xfs_fs_cmn_err(CE_WARN, mp,
-	"xfs_qm_dquot_logitem_pushbuf: pushbuf error %d on qip %p, bp %p",
-							error, qip, bp);
-			} else {
-				xfs_buf_relse(bp);
-			}
-		} else {
-			qip->qli_pushbuf_flag = 0;
-			xfs_dqunlock(dqp);
-			xfs_buf_relse(bp);
-		}
+	xfs_dqunlock(dqp);
+	if (!bp)
 		return;
-	}
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
+	return;
 
-	qip->qli_pushbuf_flag = 0;
-	xfs_dqunlock(dqp);
 }
 
 /*
@@ -289,50 +254,24 @@ xfs_qm_dquot_logitem_trylock(
 	xfs_dq_logitem_t	*qip)
 {
 	xfs_dquot_t		*dqp;
-	uint			retval;
 
 	dqp = qip->qli_dquot;
 	if (atomic_read(&dqp->q_pincount) > 0)
-		return (XFS_ITEM_PINNED);
+		return XFS_ITEM_PINNED;
 
 	if (! xfs_qm_dqlock_nowait(dqp))
-		return (XFS_ITEM_LOCKED);
+		return XFS_ITEM_LOCKED;
 
-	retval = XFS_ITEM_SUCCESS;
 	if (!xfs_dqflock_nowait(dqp)) {
 		/*
-		 * The dquot is already being flushed.	It may have been
-		 * flushed delayed write, however, and we don't want to
-		 * get stuck waiting for that to complete.  So, we want to check
-		 * to see if we can lock the dquot's buffer without sleeping.
-		 * If we can and it is marked for delayed write, then we
-		 * hold it and send it out from the push routine.  We don't
-		 * want to do that now since we might sleep in the device
-		 * strategy routine.  We also don't want to grab the buffer lock
-		 * here because we'd like not to call into the buffer cache
-		 * while holding the AIL lock.
-		 * Make sure to only return PUSHBUF if we set pushbuf_flag
-		 * ourselves.  If someone else is doing it then we don't
-		 * want to go to the push routine and duplicate their efforts.
+		 * dquot has already been flushed to the backing buffer,
+		 * leave it locked, pushbuf routine will unlock it.
 		 */
-		if (qip->qli_pushbuf_flag == 0) {
-			qip->qli_pushbuf_flag = 1;
-			ASSERT(qip->qli_format.qlf_blkno == dqp->q_blkno);
-#ifdef DEBUG
-			qip->qli_push_owner = current_pid();
-#endif
-			/*
-			 * The dquot is left locked.
-			 */
-			retval = XFS_ITEM_PUSHBUF;
-		} else {
-			retval = XFS_ITEM_FLUSHING;
-			xfs_dqunlock_nonotify(dqp);
-		}
+		return XFS_ITEM_PUSHBUF;
 	}
 
 	ASSERT(qip->qli_item.li_flags & XFS_LI_IN_AIL);
-	return (retval);
+	return XFS_ITEM_SUCCESS;
 }
 
 
diff --git a/fs/xfs/quota/xfs_dquot_item.h b/fs/xfs/quota/xfs_dquot_item.h
index 5a63253..5acae2a 100644
--- a/fs/xfs/quota/xfs_dquot_item.h
+++ b/fs/xfs/quota/xfs_dquot_item.h
@@ -27,10 +27,6 @@ typedef struct xfs_dq_logitem {
 	xfs_log_item_t		 qli_item;	   /* common portion */
 	struct xfs_dquot	*qli_dquot;	   /* dquot ptr */
 	xfs_lsn_t		 qli_flush_lsn;	   /* lsn at last flush */
-	unsigned short		 qli_pushbuf_flag; /* 1 bit used in push_ail */
-#ifdef DEBUG
-	uint64_t		 qli_push_owner;
-#endif
 	xfs_dq_logformat_t	 qli_format;	   /* logged structure */
 } xfs_dq_logitem_t;
 
diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index e0a1158..f3c49e6 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -467,8 +467,10 @@ xfs_buf_item_unpin_remove(
 /*
  * This is called to attempt to lock the buffer associated with this
  * buf log item.  Don't sleep on the buffer lock.  If we can't get
- * the lock right away, return 0.  If we can get the lock, pull the
- * buffer from the free list, mark it busy, and return 1.
+ * the lock right away, return 0.  If we can get the lock, take a
+ * reference to the buffer. If this is a delayed write buffer that
+ * needs AIL help to be written back, invoke the pushbuf routine
+ * rather than the normal success path.
  */
 STATIC uint
 xfs_buf_item_trylock(
@@ -477,24 +479,18 @@ xfs_buf_item_trylock(
 	xfs_buf_t	*bp;
 
 	bp = bip->bli_buf;
-
-	if (XFS_BUF_ISPINNED(bp)) {
+	if (XFS_BUF_ISPINNED(bp))
 		return XFS_ITEM_PINNED;
-	}
-
-	if (!XFS_BUF_CPSEMA(bp)) {
+	if (!XFS_BUF_CPSEMA(bp))
 		return XFS_ITEM_LOCKED;
-	}
 
-	/*
-	 * Remove the buffer from the free list.  Only do this
-	 * if it's on the free list.  Private buffers like the
-	 * superblock buffer are not.
-	 */
+	/* take a reference to the buffer.  */
 	XFS_BUF_HOLD(bp);
 
 	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
 	trace_xfs_buf_item_trylock(bip);
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		return XFS_ITEM_PUSHBUF;
 	return XFS_ITEM_SUCCESS;
 }
 
@@ -626,11 +622,9 @@ xfs_buf_item_committed(
 }
 
 /*
- * This is called to asynchronously write the buffer associated with this
- * buf log item out to disk. The buffer will already have been locked by
- * a successful call to xfs_buf_item_trylock().  If the buffer still has
- * B_DELWRI set, then get it going out to disk with a call to bawrite().
- * If not, then just release the buffer.
+ * The buffer is locked, but is not a delayed write buffer. This happens
+ * if we race with IO completion and hence we don't want to try to write it
+ * again. Just release the buffer.
  */
 STATIC void
 xfs_buf_item_push(
@@ -642,17 +636,29 @@ xfs_buf_item_push(
 	trace_xfs_buf_item_push(bip);
 
 	bp = bip->bli_buf;
+	ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
+	xfs_buf_relse(bp);
+}
 
-	if (XFS_BUF_ISDELAYWRITE(bp)) {
-		int	error;
-		error = xfs_bawrite(bip->bli_item.li_mountp, bp);
-		if (error)
-			xfs_fs_cmn_err(CE_WARN, bip->bli_item.li_mountp,
-			"xfs_buf_item_push: pushbuf error %d on bip %p, bp %p",
-					error, bip, bp);
-	} else {
-		xfs_buf_relse(bp);
-	}
+/*
+ * The buffer is locked and is a delayed write buffer. Promote the buffer
+ * in the delayed write queue as the caller knows that they must invoke
+ * the xfsbufd to get this buffer written. We have to unlock the buffer
+ * to allow the xfsbufd to write it, too.
+ */
+STATIC void
+xfs_buf_item_pushbuf(
+	xfs_buf_log_item_t	*bip)
+{
+	xfs_buf_t	*bp;
+
+	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
+	trace_xfs_buf_item_pushbuf(bip);
+
+	bp = bip->bli_buf;
+	ASSERT(XFS_BUF_ISDELAYWRITE(bp));
+	xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
 }
 
 /* ARGSUSED */
@@ -677,7 +683,7 @@ static struct xfs_item_ops xfs_buf_item_ops = {
 	.iop_committed	= (xfs_lsn_t(*)(xfs_log_item_t*, xfs_lsn_t))
 					xfs_buf_item_committed,
 	.iop_push	= (void(*)(xfs_log_item_t*))xfs_buf_item_push,
-	.iop_pushbuf	= NULL,
+	.iop_pushbuf	= (void(*)(xfs_log_item_t*))xfs_buf_item_pushbuf,
 	.iop_committing = (void(*)(xfs_log_item_t*, xfs_lsn_t))
 					xfs_buf_item_committing
 };
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 207553e..d4dc063 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -602,33 +602,20 @@ xfs_inode_item_trylock(
 
 	if (!xfs_iflock_nowait(ip)) {
 		/*
-		 * If someone else isn't already trying to push the inode
-		 * buffer, we get to do it.
+		 * inode has already been flushed to the backing buffer,
+		 * leave it locked in shared mode, pushbuf routine will
+		 * unlock it.
 		 */
-		if (iip->ili_pushbuf_flag == 0) {
-			iip->ili_pushbuf_flag = 1;
-#ifdef DEBUG
-			iip->ili_push_owner = current_pid();
-#endif
-			/*
-			 * Inode is left locked in shared mode.
-			 * Pushbuf routine gets to unlock it.
-			 */
-			return XFS_ITEM_PUSHBUF;
-		} else {
-			/*
-			 * We hold the AIL lock, so we must specify the
-			 * NONOTIFY flag so that we won't double trip.
-			 */
-			xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
-			return XFS_ITEM_FLUSHING;
-		}
-		/* NOTREACHED */
+		return XFS_ITEM_PUSHBUF;
 	}
 
 	/* Stale items should force out the iclog */
 	if (ip->i_flags & XFS_ISTALE) {
 		xfs_ifunlock(ip);
+		/*
+		 * we hold the AIL lock - notify the unlock routine of this
+		 * so it doesn't try to get the lock again.
+		 */
 		xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
 		return XFS_ITEM_PINNED;
 	}
@@ -746,11 +733,8 @@ xfs_inode_item_committed(
  * This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
  * failed to get the inode flush lock but did get the inode locked SHARED.
  * Here we're trying to see if the inode buffer is incore, and if so whether it's
- * marked delayed write. If that's the case, we'll initiate a bawrite on that
- * buffer to expedite the process.
- *
- * We aren't holding the AIL lock (or the flush lock) when this gets called,
- * so it is inherently race-y.
+ * marked delayed write. If that's the case, we'll promote it and that will
+ * allow the caller to write the buffer by triggering the xfsbufd to run.
  */
 STATIC void
 xfs_inode_item_pushbuf(
@@ -759,26 +743,16 @@ xfs_inode_item_pushbuf(
 	xfs_inode_t	*ip;
 	xfs_mount_t	*mp;
 	xfs_buf_t	*bp;
-	uint		dopush;
 
 	ip = iip->ili_inode;
-
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
 
 	/*
-	 * The ili_pushbuf_flag keeps others from
-	 * trying to duplicate our effort.
-	 */
-	ASSERT(iip->ili_pushbuf_flag != 0);
-	ASSERT(iip->ili_push_owner == current_pid());
-
-	/*
 	 * If a flush is not in progress anymore, chances are that the
 	 * inode was taken off the AIL. So, just get out.
 	 */
 	if (completion_done(&ip->i_flush) ||
 	    ((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
-		iip->ili_pushbuf_flag = 0;
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		return;
 	}
@@ -787,53 +761,12 @@ xfs_inode_item_pushbuf(
 	bp = xfs_incore(mp->m_ddev_targp, iip->ili_format.ilf_blkno,
 		    iip->ili_format.ilf_len, XBF_TRYLOCK);
 
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			/*
-			 * We were racing with iflush because we don't hold
-			 * the AIL lock or the flush lock. However, at this point,
-			 * we have the buffer, and we know that it's dirty.
-			 * So, it's possible that iflush raced with us, and
-			 * this item is already taken off the AIL.
-			 * If not, we can flush it async.
-			 */
-			dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
-				  !completion_done(&ip->i_flush));
-			iip->ili_pushbuf_flag = 0;
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
-			trace_xfs_inode_item_push(bp, _RET_IP_);
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(mp, 0);
-
-			if (dopush) {
-				int	error;
-				error = xfs_bawrite(mp, bp);
-				if (error)
-					xfs_fs_cmn_err(CE_WARN, mp,
-		"xfs_inode_item_pushbuf: pushbuf error %d on iip %p, bp %p",
-							error, iip, bp);
-			} else {
-				xfs_buf_relse(bp);
-			}
-		} else {
-			iip->ili_pushbuf_flag = 0;
-			xfs_iunlock(ip, XFS_ILOCK_SHARED);
-			xfs_buf_relse(bp);
-		}
-		return;
-	}
-	/*
-	 * We have to be careful about resetting pushbuf flag too early (above).
-	 * Even though in theory we can do it as soon as we have the buflock,
-	 * we don't want others to be doing work needlessly. They'll come to
-	 * this function thinking that pushing the buffer is their
-	 * responsibility only to find that the buffer is still locked by
-	 * another doing the same thing
-	 */
-	iip->ili_pushbuf_flag = 0;
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (!bp)
+		return;
+	if (XFS_BUF_ISDELAYWRITE(bp))
+		xfs_buf_delwri_promote(bp);
+	xfs_buf_relse(bp);
 	return;
 }
 
@@ -937,7 +870,6 @@ xfs_inode_item_init(
 	/*
 	   We have zeroed memory. No need ...
 	   iip->ili_extents_buf = NULL;
-	   iip->ili_pushbuf_flag = 0;
 	 */
 
 	iip->ili_format.ilf_type = XFS_LI_INODE;
diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
index cc8df1a..9a46795 100644
--- a/fs/xfs/xfs_inode_item.h
+++ b/fs/xfs/xfs_inode_item.h
@@ -144,12 +144,6 @@ typedef struct xfs_inode_log_item {
 						      data exts */
 	struct xfs_bmbt_rec	*ili_aextents_buf; /* array of logged
 						      attr exts */
-	unsigned int            ili_pushbuf_flag;  /* one bit used in push_ail */
-
-#ifdef DEBUG
-	uint64_t                ili_push_owner;    /* one who sets pushbuf_flag
-						      above gets to push the buf */
-#endif
 #ifdef XFS_TRANS_DEBUG
 	int			ili_root_size;
 	char			*ili_orig_root;
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index ca64f33..c93e3a1 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -861,8 +861,7 @@ typedef struct xfs_item_ops {
 #define	XFS_ITEM_SUCCESS	0
 #define	XFS_ITEM_PINNED		1
 #define	XFS_ITEM_LOCKED		2
-#define	XFS_ITEM_FLUSHING	3
-#define XFS_ITEM_PUSHBUF	4
+#define XFS_ITEM_PUSHBUF	3
 
 /*
  * This structure is used to maintain a list of block ranges that have been
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index d7b1af8..e799824 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -253,6 +253,7 @@ xfsaild_push(
 	int		flush_log, count, stuck;
 	xfs_mount_t	*mp = ailp->xa_mount;
 	struct xfs_ail_cursor	*cur = &ailp->xa_cursors;
+	int		push_xfsbufd = 0;
 
 	spin_lock(&ailp->xa_lock);
 	xfs_trans_ail_cursor_init(ailp, cur);
@@ -308,6 +309,7 @@ xfsaild_push(
 			XFS_STATS_INC(xs_push_ail_pushbuf);
 			IOP_PUSHBUF(lip);
 			last_pushed_lsn = lsn;
+			push_xfsbufd = 1;
 			break;
 
 		case XFS_ITEM_PINNED:
@@ -322,12 +324,6 @@ xfsaild_push(
 			stuck++;
 			break;
 
-		case XFS_ITEM_FLUSHING:
-			XFS_STATS_INC(xs_push_ail_flushing);
-			last_pushed_lsn = lsn;
-			stuck++;
-			break;
-
 		default:
 			ASSERT(0);
 			break;
@@ -374,6 +370,11 @@ xfsaild_push(
 		xfs_log_force(mp, 0);
 	}
 
+	if (push_xfsbufd) {
+		/* we've got delayed write buffers to flush */
+		wake_up_process(mp->m_ddev_targp->bt_task);
+	}
+
 	if (!count) {
 		/* We're past our target or empty, so idle */
 		last_pushed_lsn = 0;
-- 
1.6.5

* [PATCH 04/10] xfs: Sort delayed write buffers before dispatch
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

Currently when the xfsbufd writes delayed write buffers, it pushes
them to disk in the order they come off the delayed write list. If
there are lots of buffers spread widely over the disk, this results
in overwhelming the elevator sort queues in the block layer and we
end up losing the possibility of merging adjacent buffers to minimise
the number of IOs.

Use the new generic list_sort function to sort the delwri dispatch
queue before issue to ensure that the buffers are pushed in the most
friendly order possible to the lower layers.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/linux-2.6/xfs_buf.c |   87 ++++++++++++++++++++++++++++++--------------
 1 files changed, 60 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index b306265..4556a4c 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -33,6 +33,7 @@
 #include <linux/migrate.h>
 #include <linux/backing-dev.h>
 #include <linux/freezer.h>
+#include <linux/list_sort.h>
 
 #include "xfs_sb.h"
 #include "xfs_inum.h"
@@ -1877,14 +1878,42 @@ xfs_buf_delwri_split(
 
 }
 
+/*
+ * Compare function is more complex than it needs to be because
+ * the return value is only 32 bits and we are doing comparisons
+ * on 64 bit values
+ */
+static int
+xfs_buf_cmp(
+	void		*priv,
+	struct list_head *a,
+	struct list_head *b)
+{
+	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
+	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
+	xfs_daddr_t		diff;
+
+	diff = ap->b_bn - bp->b_bn;
+	if (diff < 0)
+		return -1;
+	if (diff > 0)
+		return 1;
+	return 0;
+}
+
+void
+xfs_buf_delwri_sort(
+	xfs_buftarg_t	*target,
+	struct list_head *list)
+{
+	list_sort(NULL, list, xfs_buf_cmp);
+}
+
 STATIC int
 xfsbufd(
 	void		*data)
 {
-	struct list_head tmp;
-	xfs_buftarg_t	*target = (xfs_buftarg_t *)data;
-	int		count;
-	xfs_buf_t	*bp;
+	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;
 
 	current->flags |= PF_MEMALLOC;
 
@@ -1893,6 +1922,8 @@ xfsbufd(
 	do {
 		long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
 		long	tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
+		int	count = 0;
+		struct list_head tmp;
 
 		if (unlikely(freezing(current))) {
 			set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
@@ -1907,11 +1938,10 @@ xfsbufd(
 		schedule_timeout_interruptible(tout);
 
 		xfs_buf_delwri_split(target, &tmp, age);
-		count = 0;
+		list_sort(NULL, &tmp, xfs_buf_cmp);
 		while (!list_empty(&tmp)) {
-			bp = list_entry(tmp.next, xfs_buf_t, b_list);
-			ASSERT(target == bp->b_target);
-
+			struct xfs_buf *bp;
+			bp = list_first_entry(&tmp, struct xfs_buf, b_list);
 			list_del_init(&bp->b_list);
 			xfs_buf_iostrategy(bp);
 			count++;
@@ -1937,42 +1967,45 @@ xfs_flush_buftarg(
 	xfs_buftarg_t	*target,
 	int		wait)
 {
-	struct list_head tmp;
-	xfs_buf_t	*bp, *n;
+	xfs_buf_t	*bp;
 	int		pincount = 0;
+	LIST_HEAD(tmp_list);
+	LIST_HEAD(wait_list);
 
 	xfs_buf_runall_queues(xfsconvertd_workqueue);
 	xfs_buf_runall_queues(xfsdatad_workqueue);
 	xfs_buf_runall_queues(xfslogd_workqueue);
 
 	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
-	pincount = xfs_buf_delwri_split(target, &tmp, 0);
+	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
 
 	/*
-	 * Dropped the delayed write list lock, now walk the temporary list
+	 * Dropped the delayed write list lock, now walk the temporary list.
+	 * All I/O is issued async and then if we need to wait for completion
+	 * we do that after issuing all the IO.
 	 */
-	list_for_each_entry_safe(bp, n, &tmp, b_list) {
+	list_sort(NULL, &tmp_list, xfs_buf_cmp);
+	while (!list_empty(&tmp_list)) {
+		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
 		ASSERT(target == bp->b_target);
-		if (wait)
+		list_del_init(&bp->b_list);
+		if (wait) {
 			bp->b_flags &= ~XBF_ASYNC;
-		else
-			list_del_init(&bp->b_list);
-
+			list_add(&bp->b_list, &wait_list);
+		}
 		xfs_buf_iostrategy(bp);
 	}
 
-	if (wait)
+	if (wait) {
+		/* Expedite and wait for IO to complete. */
 		blk_run_address_space(target->bt_mapping);
+		while (!list_empty(&wait_list)) {
+			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
 
-	/*
-	 * Remaining list items must be flushed before returning
-	 */
-	while (!list_empty(&tmp)) {
-		bp = list_entry(tmp.next, xfs_buf_t, b_list);
-
-		list_del_init(&bp->b_list);
-		xfs_iowait(bp);
-		xfs_buf_relse(bp);
+			list_del_init(&bp->b_list);
+			xfs_iowait(bp);
+			xfs_buf_relse(bp);
+		}
 	}
 
 	return pincount;
-- 
1.6.5

* [PATCH 05/10] xfs: Use delay write promotion for dquot flushing
From: Dave Chinner @ 2010-02-02 23:24 UTC (permalink / raw)
  To: xfs

xfs_qm_dqflock_pushbuf_wait() does a very similar trick to the one
item pushing used to do to flush out delayed write dquot buffers.
Change it to use the new promotion method rather than an async flush.

Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
held, yet the callers make the assumption that after this call the
flush lock is held. Always return with the flush lock held.

Signed-off-by: Dave Chinner <david@fromorbit.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/quota/xfs_dquot.c |   25 ++++++++++---------------
 1 files changed, 10 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index f9baeed..1620a56 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1528,21 +1528,16 @@ xfs_qm_dqflock_pushbuf_wait(
 	 */
 	bp = xfs_incore(dqp->q_mount->m_ddev_targp, dqp->q_blkno,
 		    XFS_QI_DQCHUNKLEN(dqp->q_mount), XBF_TRYLOCK);
-	if (bp != NULL) {
-		if (XFS_BUF_ISDELAYWRITE(bp)) {
-			int	error;
-
-			if (XFS_BUF_ISPINNED(bp))
-				xfs_log_force(dqp->q_mount, 0);
-			error = xfs_bawrite(dqp->q_mount, bp);
-			if (error)
-				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
-					"xfs_qm_dqflock_pushbuf_wait: "
-					"pushbuf error %d on dqp %p, bp %p",
-					error, dqp, bp);
-		} else {
-			xfs_buf_relse(bp);
-		}
+	if (!bp)
+		goto out_lock;
+
+	if (XFS_BUF_ISDELAYWRITE(bp)) {
+		if (XFS_BUF_ISPINNED(bp))
+			xfs_log_force(dqp->q_mount, 0);
+		xfs_buf_delwri_promote(bp);
+		wake_up_process(bp->b_target->bt_task);
 	}
+	xfs_buf_relse(bp);
+out_lock:
 	xfs_dqflock(dqp);
 }
-- 
1.6.5

* [PATCH 06/10] xfs: kill the unused XFS_QMOPT_* flush flags V2
From: Dave Chinner @ 2010-02-02 23:25 UTC (permalink / raw)
  To: xfs

dquots are never flushed asynchronously. Remove the flag and the
async write support from the flush function. Make the default flush
a delwri flush to match the inode flush code, which leaves
XFS_QMOPT_SYNC as the only flag remaining.  Convert that to use
SYNC_WAIT instead, just like the inode flush code.

V2:
- just pass flush flags straight through
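
For reference, the call-site translation is mechanical (taken directly
from the hunks below):

	xfs_qm_dqflush(dqp, XFS_QMOPT_SYNC);	/* old: synchronous flush */
	xfs_qm_dqflush(dqp, SYNC_WAIT);		/* new equivalent */

	xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);	/* old: delayed write flush */
	xfs_qm_dqflush(dqp, 0);			/* new default */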

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/quota/xfs_dquot.c      |   13 ++++++-------
 fs/xfs/quota/xfs_dquot_item.c |    2 +-
 fs/xfs/quota/xfs_qm.c         |   14 ++++++--------
 fs/xfs/xfs_quota.h            |    8 +-------
 4 files changed, 14 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
index 1620a56..5f79dd7 100644
--- a/fs/xfs/quota/xfs_dquot.c
+++ b/fs/xfs/quota/xfs_dquot.c
@@ -1187,7 +1187,7 @@ xfs_qm_dqflush(
 	 * block, nada.
 	 */
 	if (!XFS_DQ_IS_DIRTY(dqp) ||
-	    (!(flags & XFS_QMOPT_SYNC) && atomic_read(&dqp->q_pincount) > 0)) {
+	    (!(flags & SYNC_WAIT) && atomic_read(&dqp->q_pincount) > 0)) {
 		xfs_dqfunlock(dqp);
 		return 0;
 	}
@@ -1251,18 +1251,17 @@ xfs_qm_dqflush(
 		xfs_log_force(mp, 0);
 	}
 
-	if (flags & XFS_QMOPT_DELWRI) {
-		xfs_bdwrite(mp, bp);
-	} else {
+	if (flags & SYNC_WAIT)
 		error = xfs_bwrite(mp, bp);
-	}
+	else
+		xfs_bdwrite(mp, bp);
 
 	trace_xfs_dqflush_done(dqp);
 
 	/*
 	 * dqp is still locked, but caller is free to unlock it now.
 	 */
-	return (error);
+	return error;
 
 }
 
@@ -1443,7 +1442,7 @@ xfs_qm_dqpurge(
 		 * We don't care about getting disk errors here. We need
 		 * to purge this dquot anyway, so we go ahead regardless.
 		 */
-		error = xfs_qm_dqflush(dqp, XFS_QMOPT_SYNC);
+		error = xfs_qm_dqflush(dqp, SYNC_WAIT);
 		if (error)
 			xfs_fs_cmn_err(CE_WARN, mp,
 				"xfs_qm_dqpurge: dquot %p flush failed", dqp);
diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
index dda0fb0..4e4ee9a 100644
--- a/fs/xfs/quota/xfs_dquot_item.c
+++ b/fs/xfs/quota/xfs_dquot_item.c
@@ -153,7 +153,7 @@ xfs_qm_dquot_logitem_push(
 	 * lock without sleeping, then there must not have been
 	 * anyone in the process of flushing the dquot.
 	 */
-	error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+	error = xfs_qm_dqflush(dqp, 0);
 	if (error)
 		xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dquot_logitem_push: push error %d on dqp %p",
diff --git a/fs/xfs/quota/xfs_qm.c b/fs/xfs/quota/xfs_qm.c
index 11cfd82..8699e51 100644
--- a/fs/xfs/quota/xfs_qm.c
+++ b/fs/xfs/quota/xfs_qm.c
@@ -450,7 +450,7 @@ xfs_qm_unmount_quotas(
 STATIC int
 xfs_qm_dqflush_all(
 	xfs_mount_t	*mp,
-	int		flags)
+	int		sync_mode)
 {
 	int		recl;
 	xfs_dquot_t	*dqp;
@@ -486,7 +486,7 @@ again:
 		 * across a disk write.
 		 */
 		xfs_qm_mplist_unlock(mp);
-		error = xfs_qm_dqflush(dqp, flags);
+		error = xfs_qm_dqflush(dqp, sync_mode);
 		xfs_dqunlock(dqp);
 		if (error)
 			return error;
@@ -926,13 +926,11 @@ xfs_qm_sync(
 {
 	int		recl, restarts;
 	xfs_dquot_t	*dqp;
-	uint		flush_flags;
 	int		error;
 
 	if (!XFS_IS_QUOTA_RUNNING(mp) || !XFS_IS_QUOTA_ON(mp))
 		return 0;
 
-	flush_flags = (flags & SYNC_WAIT) ? XFS_QMOPT_SYNC : XFS_QMOPT_DELWRI;
 	restarts = 0;
 
   again:
@@ -992,7 +990,7 @@ xfs_qm_sync(
 		 * across a disk write
 		 */
 		xfs_qm_mplist_unlock(mp);
-		error = xfs_qm_dqflush(dqp, flush_flags);
+		error = xfs_qm_dqflush(dqp, flags);
 		xfs_dqunlock(dqp);
 		if (error && XFS_FORCED_SHUTDOWN(mp))
 			return 0;	/* Need to prevent umount failure */
@@ -1796,7 +1794,7 @@ xfs_qm_quotacheck(
 	 * successfully.
 	 */
 	if (!error)
-		error = xfs_qm_dqflush_all(mp, XFS_QMOPT_DELWRI);
+		error = xfs_qm_dqflush_all(mp, 0);
 
 	/*
 	 * We can get this error if we couldn't do a dquot allocation inside
@@ -2018,7 +2016,7 @@ xfs_qm_shake_freelist(
 			 * We flush it delayed write, so don't bother
 			 * releasing the mplock.
 			 */
-			error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+			error = xfs_qm_dqflush(dqp, 0);
 			if (error) {
 				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dqflush_all: dquot %p flush failed", dqp);
@@ -2201,7 +2199,7 @@ xfs_qm_dqreclaim_one(void)
 			 * We flush it delayed write, so don't bother
 			 * releasing the freelist lock.
 			 */
-			error = xfs_qm_dqflush(dqp, XFS_QMOPT_DELWRI);
+			error = xfs_qm_dqflush(dqp, 0);
 			if (error) {
 				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
 			"xfs_qm_dqreclaim: dquot %p flush failed", dqp);
diff --git a/fs/xfs/xfs_quota.h b/fs/xfs/xfs_quota.h
index 21d11d9..fdcab3f 100644
--- a/fs/xfs/xfs_quota.h
+++ b/fs/xfs/xfs_quota.h
@@ -223,15 +223,9 @@ typedef struct xfs_qoff_logformat {
 #define XFS_QMOPT_RES_INOS	0x0800000
 
 /*
- * flags for dqflush and dqflush_all.
- */
-#define XFS_QMOPT_SYNC		0x1000000
-#define XFS_QMOPT_DELWRI	0x4000000
-
-/*
  * flags for dqalloc.
  */
-#define XFS_QMOPT_INHERIT	0x8000000
+#define XFS_QMOPT_INHERIT	0x1000000
 
 /*
  * flags to xfs_trans_mod_dquot.
-- 
1.6.5


* [PATCH 07/10] xfs: remove invalid barrier optimization from xfs_fsync
  2010-02-02 23:24 [PATCH 0/10] Delayed write metadata writeback V4 Dave Chinner
                   ` (5 preceding siblings ...)
  2010-02-02 23:25 ` [PATCH 06/10] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
@ 2010-02-02 23:25 ` Dave Chinner
  2010-02-02 23:25 ` [PATCH 08/10] xfs: move the inode locking outside xfs_fsync() Dave Chinner
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2010-02-02 23:25 UTC (permalink / raw)
  To: xfs; +Cc: Christoph Hellwig

From: Christoph Hellwig <hch@lst.de>

We always need to flush the disk write cache and can't skip it just because
no inode attributes have changed.
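
With the optimisation gone, the tail of xfs_fsync() issues the cache
flush whenever barriers are enabled. A sketch of the resulting logic
(xfs_blkdev_issue_flush() is the helper XFS already uses for this;
treat the exact call as indicative):

	if (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) {
		/* skip only if the log force already ordered the cache */
		if (!log_flushed)
			xfs_blkdev_issue_flush(ip->i_mount->m_ddev_targp);
	}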

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_vnodeops.c |   12 ++----------
 1 files changed, 2 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index fd108b7..43241e2 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -597,7 +597,7 @@ xfs_fsync(
 {
 	xfs_trans_t	*tp;
 	int		error = 0;
-	int		log_flushed = 0, changed = 1;
+	int		log_flushed = 0;
 
 	xfs_itrace_entry(ip);
 
@@ -627,18 +627,10 @@ xfs_fsync(
 		 * disk yet, the inode will be still be pinned.  If it is,
 		 * force the log.
 		 */
-
 		xfs_iunlock(ip, XFS_ILOCK_SHARED);
-
 		if (xfs_ipincount(ip)) {
 			error = _xfs_log_force(ip->i_mount, XFS_LOG_SYNC,
 					       &log_flushed);
-		} else {
-			/*
-			 * If the inode is not pinned and nothing has changed
-			 * we don't need to flush the cache.
-			 */
-			changed = 0;
 		}
 	} else	{
 		/*
@@ -673,7 +665,7 @@ xfs_fsync(
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	}
 
-	if ((ip->i_mount->m_flags & XFS_MOUNT_BARRIER) && changed) {
+	if (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) {
 		/*
 		 * If the log write didn't issue an ordered tag we need
 		 * to flush the disk cache for the data device now.
-- 
1.6.5


* [PATCH 08/10] xfs: move the inode locking outside xfs_fsync()
  2010-02-02 23:24 [PATCH 0/10] Delayed write metadata writeback V4 Dave Chinner
                   ` (6 preceding siblings ...)
  2010-02-02 23:25 ` [PATCH 07/10] xfs: remove invalid barrier optimization from xfs_fsync Dave Chinner
@ 2010-02-02 23:25 ` Dave Chinner
  2010-02-03 11:29   ` Christoph Hellwig
  2010-02-02 23:25 ` [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2 Dave Chinner
  2010-02-02 23:25 ` [PATCH 10/10] xfs: kill xfs_bawrite Dave Chinner
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-02-02 23:25 UTC (permalink / raw)
  To: xfs

We have a need for a delayed write inode flush operation
to be made atomically with an fsync to avoid physically
writing inodes but still keeping inode buffer information
up to date for bulkstat.

Move the inode locking outside xfs_fsync() to allow this to
be done.
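
A sketch of the combined operation this enables (hypothetical caller;
the real consumer is xfs_fs_write_inode() in the next patch, with the
clean-inode and error handling trimmed here):

	xfs_ilock(ip, XFS_ILOCK_SHARED);
	error = xfs_fsync(ip, XFS_ILOCK_SHARED);	/* make inode changes stable */
	if (!error && xfs_iflock_nowait(ip))
		error = xfs_iflush(ip, 0);		/* best-effort delwri flush */
	xfs_iunlock(ip, XFS_ILOCK_SHARED);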

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_file.c |    6 +++++-
 fs/xfs/linux-2.6/xfs_lrw.c  |    5 +++--
 fs/xfs/xfs_vnodeops.c       |   34 ++++++++++++++++------------------
 fs/xfs/xfs_vnodeops.h       |    2 +-
 4 files changed, 25 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index e4caeb2..a2c8321 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -177,9 +177,13 @@ xfs_file_fsync(
 	int			datasync)
 {
 	struct xfs_inode	*ip = XFS_I(dentry->d_inode);
+	int			error;
 
 	xfs_iflags_clear(ip, XFS_ITRUNCATED);
-	return -xfs_fsync(ip);
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = -xfs_fsync(ip, XFS_ILOCK_SHARED);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	return error;
 }
 
 STATIC int
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index c80fa00..508135c 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -754,11 +754,12 @@ write_retry:
 			error = error2;
 		if (need_i_mutex)
 			mutex_lock(&inode->i_mutex);
-		xfs_ilock(xip, iolock);
+		xfs_ilock(xip, iolock | XFS_ILOCK_SHARED);
 
-		error2 = xfs_fsync(xip);
+		error2 = xfs_fsync(xip, XFS_ILOCK_SHARED);
 		if (!error)
 			error = error2;
+		xfs_iunlock(xip, XFS_ILOCK_SHARED);
 	}
 
  out_unlock_internal:
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 43241e2..afcb635 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -590,10 +590,22 @@ xfs_readlink(
  * the I/O lock while flushing the data, and the inode lock while flushing the
  * inode.  The inode lock CANNOT be held while flushing the data, so acquire
  * after we're done with that.
+ *
+ * We always need to make sure that the required inode state is safe on disk.
+ * The inode might be clean but we still might need to force the log because of
+ * committed transactions that haven't hit the disk yet.  Likewise, there could
+ * be unflushed non-transactional changes to the inode core that have to go to
+ * disk and this requires us to issue a synchronous transaction to capture
+ * these changes correctly.
+ *
+ * This code relies on the assumption that if the update_* fields of the inode
+ * are clear and the inode is unpinned then it is clean and no action is
+ * required.
  */
 int
 xfs_fsync(
-	xfs_inode_t	*ip)
+	xfs_inode_t	*ip,
+	int		lock_mode)
 {
 	xfs_trans_t	*tp;
 	int		error = 0;
@@ -604,20 +616,6 @@ xfs_fsync(
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return XFS_ERROR(EIO);
 
-	/*
-	 * We always need to make sure that the required inode state is safe on
-	 * disk.  The inode might be clean but we still might need to force the
-	 * log because of committed transactions that haven't hit the disk yet.
-	 * Likewise, there could be unflushed non-transactional changes to the
-	 * inode core that have to go to disk and this requires us to issue
-	 * a synchronous transaction to capture these changes correctly.
-	 *
-	 * This code relies on the assumption that if the update_* fields
-	 * of the inode are clear and the inode is unpinned then it is clean
-	 * and no action is required.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-
 	if (!ip->i_update_core) {
 		/*
 		 * Timestamps/size haven't changed since last inode flush or
@@ -627,7 +625,6 @@ xfs_fsync(
 		 * disk yet, the inode will be still be pinned.  If it is,
 		 * force the log.
 		 */
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		if (xfs_ipincount(ip)) {
 			error = _xfs_log_force(ip->i_mount, XFS_LOG_SYNC,
 					       &log_flushed);
@@ -637,7 +634,7 @@ xfs_fsync(
 		 * Kick off a transaction to log the inode core to get the
 		 * updates.  The sync transaction will also force the log.
 		 */
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
+		xfs_iunlock(ip, lock_mode);
 		tp = xfs_trans_alloc(ip->i_mount, XFS_TRANS_FSYNC_TS);
 		error = xfs_trans_reserve(tp, 0,
 				XFS_FSYNC_TS_LOG_RES(ip->i_mount), 0, 0, 0);
@@ -662,7 +659,8 @@ xfs_fsync(
 		xfs_trans_set_sync(tp);
 		error = _xfs_trans_commit(tp, 0, &log_flushed);
 
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		if (lock_mode != XFS_ILOCK_EXCL)
+			xfs_ilock_demote(ip, XFS_ILOCK_EXCL);
 	}
 
 	if (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) {
diff --git a/fs/xfs/xfs_vnodeops.h b/fs/xfs/xfs_vnodeops.h
index 774f407..eacd297 100644
--- a/fs/xfs/xfs_vnodeops.h
+++ b/fs/xfs/xfs_vnodeops.h
@@ -21,7 +21,7 @@ int xfs_setattr(struct xfs_inode *ip, struct iattr *vap, int flags);
 #define XFS_ATTR_NOACL		0x08	/* Don't call xfs_acl_chmod */
 
 int xfs_readlink(struct xfs_inode *ip, char *link);
-int xfs_fsync(struct xfs_inode *ip);
+int xfs_fsync(struct xfs_inode *ip, int lock_mode);
 int xfs_release(struct xfs_inode *ip);
 int xfs_inactive(struct xfs_inode *ip);
 int xfs_lookup(struct xfs_inode *dp, struct xfs_name *name,
-- 
1.6.5


* [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-02 23:24 [PATCH 0/10] Delayed write metadata writeback V4 Dave Chinner
                   ` (7 preceding siblings ...)
  2010-02-02 23:25 ` [PATCH 08/10] xfs: move the inode locking outside xfs_fsync() Dave Chinner
@ 2010-02-02 23:25 ` Dave Chinner
  2010-02-03 11:27   ` Christoph Hellwig
  2010-02-02 23:25 ` [PATCH 10/10] xfs: kill xfs_bawrite Dave Chinner
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-02-02 23:25 UTC (permalink / raw)
  To: xfs

When an inode has already been flushed delayed write,
xfs_inode_clean() returns true and hence xfs_fs_write_inode() can
return on a synchronous inode write without having written the
inode. Currently these synchronous writes only come from sync(1),
unmount, a synchronous NFS export and cachefiles, so they should be
relatively rare and outside common performance paths.

Realistically, a synchronous inode write is not necessary here; we
can treat this like fsync where we either force the log if there are
no unlogged changes, or do a sync transaction if there are unlogged
changes. This will result in real synchronous semantics as the fsync
will issue barriers, but may slow down the above configurations as a
result. However, if the inode is not pinned and has no unlogged
changes, then the fsync code is a no-op and hence it may be faster
than the existing code.

The only thing we need to be careful of here is that if the inode has
not yet been flushed to the inode buffer, bulkstat scans will fail to
see it. Hence even on a sync write, we should try to flush the inode to
the backing buffer. This is only a best-effort delwri flush - it
doesn't guarantee that the flush happens but races with other inode
operations should not happen as the inode lock is not dropped
between the fsync operation and the flush.

For the asynchronous write, move the clean check until after we have
locked the inode. With the inode locked and the flush lock held,
we know that if the inode is clean there are no pending changes in the
log and there are no current outstanding delayed writes or IO in
progress, so if it reports clean now it really is clean. This matches
the order of locking and checks in xfs_sync_inode_attr().

Version 2:
- ilock is now held external to the fsync call to close race windows
  between the fsync and delwri flush to the backing buffer.
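
Stripped of error handling, the new xfs_fs_write_inode() flow looks
roughly like this (a sketch only - the clean-inode path also has to
drop the flush lock, which is elided here):

	if (sync) {
		xfs_ilock(ip, XFS_ILOCK_SHARED);
		error = xfs_fsync(ip, XFS_ILOCK_SHARED); /* stable via log force */
	} else if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
		return EAGAIN;				 /* contended: catch it later */

	if (!xfs_ipincount(ip) && xfs_iflock_nowait(ip) &&
	    !xfs_inode_clean(ip))
		error = xfs_iflush(ip, 0);		 /* best-effort delwri flush */
	xfs_iunlock(ip, XFS_ILOCK_SHARED);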

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_super.c |   54 ++++++++++++++++++++++++-----------------
 1 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 226fe20..1257a5f 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1045,35 +1045,45 @@ xfs_fs_write_inode(
 		error = xfs_wait_on_pages(ip, 0, -1);
 		if (error)
 			goto out;
-	}
-
-	/*
-	 * Bypass inodes which have already been cleaned by
-	 * the inode flush clustering code inside xfs_iflush
-	 */
-	if (xfs_inode_clean(ip))
-		goto out;
-
-	/*
-	 * We make this non-blocking if the inode is contended, return
-	 * EAGAIN to indicate to the caller that they did not succeed.
-	 * This prevents the flush path from blocking on inodes inside
-	 * another operation right now, they get caught later by xfs_sync.
-	 */
-	if (sync) {
+		/*
+		 * The fsync operation makes inode changes stable and it
+		 * reduces the IOs we have to do here from two (log and inode)
+		 * to just the log. We still need to do a delwri write of the
+		 * inode after this to flush it to the backing buffer so that
+		 * bulkstat works properly.
+		 */
 		xfs_ilock(ip, XFS_ILOCK_SHARED);
-		xfs_iflock(ip);
-
-		error = xfs_iflush(ip, SYNC_WAIT);
+		error = xfs_fsync(ip, XFS_ILOCK_SHARED);
+		if (error)
+			goto out_unlock;
+		error = EAGAIN;
 	} else {
+		/*
+		 * We make this non-blocking if the inode is contended, return
+		 * EAGAIN to indicate to the caller that they did not succeed.
+		 * This prevents the flush path from blocking on inodes inside
+		 * another operation right now, they get caught later by xfs_sync.
+		 */
 		error = EAGAIN;
 		if (!xfs_ilock_nowait(ip, XFS_ILOCK_SHARED))
 			goto out;
-		if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
-			goto out_unlock;
+	}
+
+	if (xfs_ipincount(ip) || !xfs_iflock_nowait(ip))
+		goto out_unlock;
 
-		error = xfs_iflush(ip, 0);
+	/*
+	 * Now we have the flush lock and the inode is not pinned, we can check
+	 * if the inode is really clean as we know that there are no pending
+	 * transaction completions, it is not waiting on the delayed write
+	 * queue and there is no IO in progress.
+	 */
+	error = 0;
+	if (xfs_inode_clean(ip)) {
+		xfs_ifunlock(ip);
+		goto out_unlock;
 	}
+	error = xfs_iflush(ip, 0);
 
  out_unlock:
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
-- 
1.6.5


* [PATCH 10/10] xfs: kill xfs_bawrite
  2010-02-02 23:24 [PATCH 0/10] Delayed write metadata writeback V4 Dave Chinner
                   ` (8 preceding siblings ...)
  2010-02-02 23:25 ` [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2 Dave Chinner
@ 2010-02-02 23:25 ` Dave Chinner
  2010-02-03 11:19   ` Christoph Hellwig
  9 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-02-02 23:25 UTC (permalink / raw)
  To: xfs

There are no more users of this function left in the XFS code
now that we've switched everything to delayed write flushing.
Remove it.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_buf.c |   19 -------------------
 fs/xfs/linux-2.6/xfs_buf.h |    1 -
 2 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
index 4556a4c..d50df3a 100644
--- a/fs/xfs/linux-2.6/xfs_buf.c
+++ b/fs/xfs/linux-2.6/xfs_buf.c
@@ -1078,25 +1078,6 @@ xfs_bwrite(
 	return error;
 }
 
-int
-xfs_bawrite(
-	void			*mp,
-	struct xfs_buf		*bp)
-{
-	trace_xfs_buf_bawrite(bp, _RET_IP_);
-
-	ASSERT(bp->b_bn != XFS_BUF_DADDR_NULL);
-
-	xfs_buf_delwri_dequeue(bp);
-
-	bp->b_flags &= ~(XBF_READ | XBF_DELWRI | XBF_READ_AHEAD);
-	bp->b_flags |= (XBF_WRITE | XBF_ASYNC | _XBF_RUN_QUEUES);
-
-	bp->b_mount = mp;
-	bp->b_strat = xfs_bdstrat_cb;
-	return xfs_bdstrat_cb(bp);
-}
-
 void
 xfs_bdwrite(
 	void			*mp,
diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
index be45e8c..386e736 100644
--- a/fs/xfs/linux-2.6/xfs_buf.h
+++ b/fs/xfs/linux-2.6/xfs_buf.h
@@ -233,7 +233,6 @@ extern void xfs_buf_unlock(xfs_buf_t *);
 
 /* Buffer Read and Write Routines */
 extern int xfs_bwrite(struct xfs_mount *mp, struct xfs_buf *bp);
-extern int xfs_bawrite(void *mp, xfs_buf_t *bp);
 extern void xfs_bdwrite(void *mp, xfs_buf_t *bp);
 
 extern void xfsbdstrat(struct xfs_mount *, struct xfs_buf *);
-- 
1.6.5


* Re: [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2
  2010-02-02 23:24 ` [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
@ 2010-02-03 11:17   ` Christoph Hellwig
  2010-02-05 21:38   ` Alex Elder
  1 sibling, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 11:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Still looks good with the recent changes,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 06/10] xfs: kill the unused XFS_QMOPT_* flush flags V2
  2010-02-02 23:25 ` [PATCH 06/10] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
@ 2010-02-03 11:17   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 11:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 03, 2010 at 10:25:00AM +1100, Dave Chinner wrote:
> dquots are never flushed asynchronously. Remove the flag and the
> async write support from the flush function. Make the default flush
> a delwri flush to match the inode flush code, which leaves
> XFS_QMOPT_SYNC as the only flag remaining.  Convert that to use
> SYNC_WAIT instead, just like the inode flush code.
> 
> V2:
> - just pass flush flags straight through

Still looks good with that minor change,


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 10/10] xfs: kill xfs_bawrite
  2010-02-02 23:25 ` [PATCH 10/10] xfs: kill xfs_bawrite Dave Chinner
@ 2010-02-03 11:19   ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 11:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 03, 2010 at 10:25:04AM +1100, Dave Chinner wrote:
> There are no more users of this function left in the XFS code
> now that we've switched everything to delayed write flushing.
> Remove it.

Actually we still write the superblock async using xfs_bwrite, but
getting rid of that is a separate project, so


Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-02 23:25 ` [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2 Dave Chinner
@ 2010-02-03 11:27   ` Christoph Hellwig
  2010-02-03 18:07     ` bpm
  2010-02-03 20:56     ` Christoph Hellwig
  0 siblings, 2 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 11:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bpm, xfs

Still not entirely happy with this one.  The first issue is that I think
the barriers in fsync are still too heavy for the normal sync use
case.  I'd be more happy with exporting the body of xfs_fsync without
the cache flushes (and a better name than xfs_fsync) and using that
for write_inode.

That leaves open the NFSD case though.  I'd prefer to have that fixed
if possible.  Ben, any chance you could send your patch to use fsync
to the nfs list ASAP?  I think we'd be even better off to just force
-o wsync and disable ->write_inode entirely for NFS; any chance you
could test such a patch on your setup?


Besides that, the patch is missing the comment from the previous
iteration explaining why we're still doing the delwri iflush for the
sync == 1 case.  I think keeping that one is important to explain the
really weird reason for it.


* Re: [PATCH 08/10] xfs: move the inode locking outside xfs_fsync()
  2010-02-02 23:25 ` [PATCH 08/10] xfs: move the inode locking outside xfs_fsync() Dave Chinner
@ 2010-02-03 11:29   ` Christoph Hellwig
  2010-02-03 23:08     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 11:29 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Feb 03, 2010 at 10:25:02AM +1100, Dave Chinner wrote:
> We have a need for a delayed write inode flush operation
> to be made atomically with an fsync to avoid physically
> writing inodes but still keeping inode buffer information
> up to date for bulkstat.
> 
> Move the inode locking outside xfs_fsync() to allow this to
> be done.

What's the point of the lock_flags argument?  It should always
be ILOCK_SHARED, so instead of passing it in as an argument
I'd rather add an assert to enforce it.

For more comments on if this is the right functionality to use
in write_inode see my comment to the next patch.


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-03 11:27   ` Christoph Hellwig
@ 2010-02-03 18:07     ` bpm
  2010-02-03 20:55       ` Christoph Hellwig
  2010-02-03 20:56     ` Christoph Hellwig
  1 sibling, 1 reply; 29+ messages in thread
From: bpm @ 2010-02-03 18:07 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

Hey Christoph,

On Wed, Feb 03, 2010 at 06:27:53AM -0500, Christoph Hellwig wrote:
> That leaves open the NFSD case though.  I'd prefer to have that fixed
> if possible.  Ben, any chance you could send your patch to use fsync
> to the nfs list ASAP?

I'll make some time to work on it.

> I think we'd be even better off to just force
> -o wsync and disable ->write_inode entirely for NFS, any chance you
> could test such a patch on your setup?

Sure.  IIRC, previous tests were w/ -o wsync and write_inode switched to
fsync passing 1 for fdatasync.  I'll try this too.

-Ben


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-03 18:07     ` bpm
@ 2010-02-03 20:55       ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 20:55 UTC (permalink / raw)
  To: bpm; +Cc: Christoph Hellwig, xfs

On Wed, Feb 03, 2010 at 12:07:06PM -0600, bpm@sgi.com wrote:
> Hey Christoph,
> 
> On Wed, Feb 03, 2010 at 06:27:53AM -0500, Christoph Hellwig wrote:
> > That leaves open the NFSD case though.  I'd prefer to have that fixed
> > if possible.  Ben, any chance you could send your patch to use fsync
> > to the nfs list ASAP?
> 
> I'll make some time to work on it.
> 
> > I think we'd be even better off to just force
> > -o wsync and disable ->write_inode entirely for NFS, any chance you
> > could test such a patch on your setup?
> 
> Sure.  IIRC, previous tests were w/ -o wsync and write_inode switched to
> fsync passing 1 for fdatasync.  I'll try this too.

-o wsync should be very similar to using ->fsync or Dave's new
->write_inode.  What the synchronous transaction does is cause a log
force, which is exactly what fsync does in case the inode is still
pinned (except not quite as optimally, but I'll send a patch for that).


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-03 11:27   ` Christoph Hellwig
  2010-02-03 18:07     ` bpm
@ 2010-02-03 20:56     ` Christoph Hellwig
  2010-02-03 23:02       ` Dave Chinner
  1 sibling, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-03 20:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: bpm, xfs

On Wed, Feb 03, 2010 at 06:27:53AM -0500, Christoph Hellwig wrote:
> Still not entirely happy with this one.  The first issue is that I think
> the barriers in fsync are still too heavy for the normal sync use
> case.  I'd be more happy with exporting the body of xfs_fsync without
> the cache flushes (and a better name than xfs_fsync) and using that
> for write_inode.
> 
> That leaves open the NFSD case though.  I'd prefer to have that fixed
> if possible.  Ben, any chance you could send your patch to use fsync
> to the nfs list ASAP?  I think we'd be even better off to just force
> -o wsync and disable ->write_inode entirely for NFS; any chance you
> could test such a patch on your setup?

Thinking about it, we usually do cause a log buffer write from ->fsync
which means we submit the barrier anyway.  That might be the reason
why you're not seeing the performance hit in your testing.  With that,
I'm okay with the patch as-is for now; we can micro-optimize it later.

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-03 20:56     ` Christoph Hellwig
@ 2010-02-03 23:02       ` Dave Chinner
  2010-02-04 17:36         ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-02-03 23:02 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bpm, xfs

On Wed, Feb 03, 2010 at 03:56:48PM -0500, Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 06:27:53AM -0500, Christoph Hellwig wrote:
> > Still not entirely happy with this one.  The first issue is that I think
> > the barriers in fsync are still too heavy for the normal sync use
> > case.  I'd be more happy with exporting the body of xfs_fsync without
> > the cache flushes (and a better name than xfs_fsync) and using that
> > for write_inode.
> > 
> > That leaves open the NFSD case though.  I'd prefer to have that fixed
> > if possible.  Ben, any chance you could send your patch to use fsync
> > to the nfs list ASAP?  I think we'd be even better off to just force
> > -o wsync and disable ->write_inode entirely for NFS; any chance you
> > could test such a patch on your setup?
> 
> Thinking about it, we usually do cause a log buffer write from ->fsync
> which means we submit the barrier anyway.  That might be the reason
> why you're not seeing the performance hit in your testing.  With that
> I'm okay with the patch as-is for now, we can micro-optimize it later.

OK. I was thinking that an "inode cluster fsync" function might be
appropriate here, i.e. a transaction that takes all the dirty inodes
in the inode cluster and logs/forces them all in one transaction.
That would substantially reduce the number of log writes needed,
I think. I'll look into this over the next few days.
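
Something like the following, purely speculative sketch (none of these
helpers exist yet, and the cluster iterator below is hypothetical):

	/* log all dirty inodes in the cluster with a single sync transaction */
	tp = xfs_trans_alloc(mp, XFS_TRANS_FSYNC_TS);
	error = xfs_trans_reserve(tp, 0, XFS_FSYNC_TS_LOG_RES(mp), 0, 0, 0);
	for_each_dirty_inode_in_cluster(cluster, ip) {	/* hypothetical iterator */
		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
	}
	xfs_trans_set_sync(tp);
	error = xfs_trans_commit(tp, 0);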

Thanks for the reviews, Christoph.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 08/10] xfs: move the inode locking outside xfs_fsync()
  2010-02-03 11:29   ` Christoph Hellwig
@ 2010-02-03 23:08     ` Dave Chinner
  2010-02-04 16:07       ` Christoph Hellwig
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Chinner @ 2010-02-03 23:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Feb 03, 2010 at 06:29:17AM -0500, Christoph Hellwig wrote:
> On Wed, Feb 03, 2010 at 10:25:02AM +1100, Dave Chinner wrote:
> > We have a need for a delayed write inode flush operation
> > to be made atomically with an fsync to avoid physically
> > writing inodes but still keeping inode buffer information
> > up to date for bulkstat.
> > 
> > Move the inode locking outside xfs_fsync() to allow this to
> > be done.
> 
> What's the point of the lock_flags argument?  It should always
> be ILOCK_SHARED, so instead of passing it in as an argument
> I'd rather add an assert to enforce it.

Fair enough. Updated patch below.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: move the inode locking outside xfs_fsync() V2

We have a need for a delayed write inode flush operation
to be made atomically with an fsync to avoid physically
writing inodes but still keeping inode buffer information
up to date for bulkstat.

Move the inode locking outside xfs_fsync() to allow this to
be done.

Version 2
- kill the lock_flags argument and simply assert XFS_ILOCK_SHARED
  in xfs_fsync().

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_file.c |    6 +++++-
 fs/xfs/linux-2.6/xfs_lrw.c  |    3 ++-
 fs/xfs/xfs_vnodeops.c       |   29 +++++++++++++----------------
 3 files changed, 20 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index e4caeb2..94d9d6d 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -177,9 +177,13 @@ xfs_file_fsync(
 	int			datasync)
 {
 	struct xfs_inode	*ip = XFS_I(dentry->d_inode);
+	int			error;
 
 	xfs_iflags_clear(ip, XFS_ITRUNCATED);
-	return -xfs_fsync(ip);
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+	error = -xfs_fsync(ip);
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	return error;
 }
 
 STATIC int
diff --git a/fs/xfs/linux-2.6/xfs_lrw.c b/fs/xfs/linux-2.6/xfs_lrw.c
index c80fa00..d7f1a71 100644
--- a/fs/xfs/linux-2.6/xfs_lrw.c
+++ b/fs/xfs/linux-2.6/xfs_lrw.c
@@ -754,11 +754,12 @@ write_retry:
 			error = error2;
 		if (need_i_mutex)
 			mutex_lock(&inode->i_mutex);
-		xfs_ilock(xip, iolock);
+		xfs_ilock(xip, iolock | XFS_ILOCK_SHARED);
 
 		error2 = xfs_fsync(xip);
 		if (!error)
 			error = error2;
+		xfs_iunlock(xip, XFS_ILOCK_SHARED);
 	}
 
  out_unlock_internal:
diff --git a/fs/xfs/xfs_vnodeops.c b/fs/xfs/xfs_vnodeops.c
index 43241e2..b5689a4 100644
--- a/fs/xfs/xfs_vnodeops.c
+++ b/fs/xfs/xfs_vnodeops.c
@@ -590,6 +590,17 @@ xfs_readlink(
  * the I/O lock while flushing the data, and the inode lock while flushing the
  * inode.  The inode lock CANNOT be held while flushing the data, so acquire
  * after we're done with that.
+ *
+ * We always need to make sure that the required inode state is safe on disk.
+ * The inode might be clean but we still might need to force the log because of
+ * committed transactions that haven't hit the disk yet.  Likewise, there could
+ * be unflushed non-transactional changes to the inode core that have to go to
+ * disk and this requires us to issue a synchronous transaction to capture
+ * these changes correctly.
+ *
+ * This code relies on the assumption that if the update_* fields of the inode
+ * are clear and the inode is unpinned then it is clean and no action is
+ * required.
  */
 int
 xfs_fsync(
@@ -600,24 +611,11 @@ xfs_fsync(
 	int		log_flushed = 0;
 
 	xfs_itrace_entry(ip);
+	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
 
 	if (XFS_FORCED_SHUTDOWN(ip->i_mount))
 		return XFS_ERROR(EIO);
 
-	/*
-	 * We always need to make sure that the required inode state is safe on
-	 * disk.  The inode might be clean but we still might need to force the
-	 * log because of committed transactions that haven't hit the disk yet.
-	 * Likewise, there could be unflushed non-transactional changes to the
-	 * inode core that have to go to disk and this requires us to issue
-	 * a synchronous transaction to capture these changes correctly.
-	 *
-	 * This code relies on the assumption that if the update_* fields
-	 * of the inode are clear and the inode is unpinned then it is clean
-	 * and no action is required.
-	 */
-	xfs_ilock(ip, XFS_ILOCK_SHARED);
-
 	if (!ip->i_update_core) {
 		/*
 		 * Timestamps/size haven't changed since last inode flush or
@@ -627,7 +625,6 @@ xfs_fsync(
 		 * disk yet, the inode will be still be pinned.  If it is,
 		 * force the log.
 		 */
-		xfs_iunlock(ip, XFS_ILOCK_SHARED);
 		if (xfs_ipincount(ip)) {
 			error = _xfs_log_force(ip->i_mount, XFS_LOG_SYNC,
 					       &log_flushed);
@@ -662,7 +659,7 @@ xfs_fsync(
 		xfs_trans_set_sync(tp);
 		error = _xfs_trans_commit(tp, 0, &log_flushed);
 
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
+		xfs_ilock_demote(ip, XFS_ILOCK_EXCL);
 	}
 
 	if (ip->i_mount->m_flags & XFS_MOUNT_BARRIER) {


* Re: [PATCH 08/10] xfs: move the inode locking outside xfs_fsync()
  2010-02-03 23:08     ` Dave Chinner
@ 2010-02-04 16:07       ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-04 16:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Thu, Feb 04, 2010 at 10:08:33AM +1100, Dave Chinner wrote:
> On Wed, Feb 03, 2010 at 06:29:17AM -0500, Christoph Hellwig wrote:
> > On Wed, Feb 03, 2010 at 10:25:02AM +1100, Dave Chinner wrote:
> > > We have a need for a delayed write inode flush operation
> > > to be made atomically with an fsync to avoid physically
> > > writing inodes but still keeping inode buffer information
> > > up to date for bulkstat.
> > > 
> > > Move the inode locking outside xfs_fsync() to allow this to
> > > be done.
> > 
> > What's the point of the lock_flags argument?  It should always
> > be ILOCK_SHARED, so instead of passing it in as an argument
> > I'd rather add an assert to enforce it.
> 
> Fair enough. Updated patch below.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: move the inode locking outside xfs_fsync() V2
> 
> We have a need for a delayed write inode flush operation
> to be made atomically with an fsync to avoid physically
> writing inodes but still keeping inode buffer information
> up to date for bulkstat.
> 
> Move the inode locking outside xfs_fsync() to allow this to
> be done.
> 
> Version 2
> - kill the lock_flags argument and simply assert XFS_ILOCK_SHARED
>   in xfs_fsync().

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2
  2010-02-03 23:02       ` Dave Chinner
@ 2010-02-04 17:36         ` Christoph Hellwig
  0 siblings, 0 replies; 29+ messages in thread
From: Christoph Hellwig @ 2010-02-04 17:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, bpm, xfs

FYI I did some benchmarking on this, and the syncmodes 2 and 5 of
fs_mark, which use sys_sync regress almost 10% on my test setup
with this patch.  The barriers are only a small part of it,
from instrumentation it seems like the constant log forces don't
really help.  Now given that we only get data integrity writes
from sync_filesystem do we really need to bother with catching
all that pending I/O here?  It would be much easier to rely
on ->sync_fs to do that for us once, which is does anyway.


* Re: [PATCH 01/10] xfs: Make inode reclaim states explicit
  2010-02-02 23:24 ` [PATCH 01/10] xfs: Make inode reclaim states explicit Dave Chinner
@ 2010-02-05 19:06   ` Alex Elder
  2010-02-06  0:07     ` Dave Chinner
  0 siblings, 1 reply; 29+ messages in thread
From: Alex Elder @ 2010-02-05 19:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> A.K.A.: don't rely on xfs_iflush() return value in reclaim
> 
> We have gradually been moving checks out of the reclaim code because
> they are duplicated in xfs_iflush(). We've had a history of problems
> in this area, and many of them stem from the overloading of the
> return values from xfs_iflush() and interaction with inode flush
> locking to determine if the inode is safe to reclaim.
> 
> With the desire to move to delayed write flushing of inodes and
> non-blocking inode tree reclaim walks, the overloading of the
> return value of xfs_iflush makes it very difficult to determine
> the correct thing to do next.
> 
> This patch explicitly re-adds the checks to the inode reclaim code,
> removing the reliance on the return value of xfs_iflush() to
> determine what to do next. It also means that we can clearly
> document all the inode states that reclaim must handle and hence
> we can easily see that we handled all the necessary cases.
> 
> This also removes the need for the xfs_inode_clean() check in
> xfs_iflush() as all callers now check this first (safely).

I have a few comments, below.  One is a real bug.  At this
point I'm trusting that your enumeration of all the possible
inode states is correct.  Other than what I indicate below,
this change looks good.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/linux-2.6/xfs_sync.c |   80 ++++++++++++++++++++++++++++++++----------
>  fs/xfs/xfs_inode.c          |   11 +-----
>  fs/xfs/xfs_inode.h          |    1 +
>  3 files changed, 63 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index c9b863e..98b8937 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c
> @@ -706,12 +706,43 @@ __xfs_inode_clear_reclaim_tag(
>  			XFS_INO_TO_AGINO(mp, ip->i_ino), XFS_ICI_RECLAIM_TAG);
>  }
>  
> +/*
> + * Inodes in different states need to be treated differently, and the return
> + * value of xfs_iflush is not sufficient to get this right. The following table
> + * lists the inode states and the reclaim actions necessary for non-blocking
> + * reclaim:
> + *
> + *
> + *	inode state	     iflush ret		required action
> + *      ---------------      ----------         ---------------
> + *	bad			-		reclaim
> + *	shutdown		EIO		unpin and reclaim
> + *	clean, unpinned		0		reclaim
> + *	stale, unpinned		0		reclaim
> + *	clean, pinned(*)	0		unpin and reclaim
> + *	stale, pinned		0		unpin and reclaim
> + *	dirty, async		0		block on flush lock, reclaim
> + *	dirty, sync flush	0		block on flush lock, reclaim
> + *
> + * (*) dgc: I don't think the clean, pinned state is possible but it gets
> + * handled anyway given the order of checks implemented.
> + *
> + * Hence the order of actions after gaining the locks should be:
> + *	bad		=> reclaim
> + *	shutdown	=> unpin and reclaim
> + *	pinned		=> unpin
> + *	stale		=> reclaim
> + *	clean		=> reclaim
> + *	dirty		=> flush, wait and reclaim
> + */
>  STATIC int
>  xfs_reclaim_inode(
>  	struct xfs_inode	*ip,
>  	struct xfs_perag	*pag,
>  	int			sync_mode)
>  {
> +	int	error;
> +
>  	/*
>  	 * The radix tree lock here protects a thread in xfs_iget from racing
>  	 * with us starting reclaim on the inode.  Once we have the
> @@ -729,30 +760,41 @@ xfs_reclaim_inode(
>  	spin_unlock(&ip->i_flags_lock);
>  	write_unlock(&pag->pag_ici_lock);
>  
> -	/*
> -	 * If the inode is still dirty, then flush it out.  If the inode
> -	 * is not in the AIL, then it will be OK to flush it delwri as
> -	 * long as xfs_iflush() does not keep any references to the inode.
> -	 * We leave that decision up to xfs_iflush() since it has the
> -	 * knowledge of whether it's OK to simply do a delwri flush of
> -	 * the inode or whether we need to wait until the inode is
> -	 * pulled from the AIL.
> -	 * We get the flush lock regardless, though, just to make sure
> -	 * we don't free it while it is being flushed.
> -	 */
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
>  	xfs_iflock(ip);
>  
> -	/*
> -	 * In the case of a forced shutdown we rely on xfs_iflush() to
> -	 * wait for the inode to be unpinned before returning an error.
> -	 */
> -	if (!is_bad_inode(VFS_I(ip)) && xfs_iflush(ip, sync_mode) == 0) {
> -		/* synchronize with xfs_iflush_done */
> -		xfs_iflock(ip);
> -		xfs_ifunlock(ip);
> +	if (is_bad_inode(VFS_I(ip)))
> +		goto reclaim;

It looks to me like your code adds a call to xfs_ifunlock(ip)
in the bad inode case that was not there before.  My guess is
that your way is correct, but I'd like to know whether you
agree this differs.  Is this intentional?  Regardless, is
the new xfs_ifunlock() call correct?


> +	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
> +		xfs_iunpin_wait(ip);
> +		goto reclaim;
> +	}
> +	if (xfs_ipincount(ip))
> +		xfs_iunpin_wait(ip);

I notice you avoid an unneeded xfs_iunpin_wait() call
in the case where it's not pinned already.  (Still
there in xfs_iflush() when *not* non-blocking.)
Effect is the same though, but a minor improvement.

> +	if (xfs_iflags_test(ip, XFS_ISTALE))
> +		goto reclaim;
> +	if (xfs_inode_clean(ip))
> +		goto reclaim;
> +
> +	/* Now we have an inode that needs flushing */
> +	error = xfs_iflush(ip, sync_mode);
> +	if (!error) {
> +		switch(sync_mode) {
> +		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
> +		case XFS_IFLUSH_DELWRI:
> +		case XFS_IFLUSH_ASYNC:
> +		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
> +		case XFS_IFLUSH_SYNC:
> +			/* IO issued, synchronise with IO completion */
> +			xfs_iflock(ip);

You must not have run this with DEBUG enabled.
You need a "break;" here.

> +		default:
> +			ASSERT(0);
> +			break;
> +		}
>  	}
>  
> +reclaim:
> +	xfs_ifunlock(ip);
>  	xfs_iunlock(ip, XFS_ILOCK_EXCL);
>  	xfs_ireclaim(ip);
>  	return 0;
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d0d1b5a..8d0666d 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2493,7 +2493,7 @@ __xfs_iunpin_wait(
>  		wait_event(ip->i_ipin_wait, (atomic_read(&ip->i_pincount) == 0));
>  }
>  
> -static inline void
> +void
>  xfs_iunpin_wait(
>  	xfs_inode_t	*ip)
>  {
> @@ -2849,15 +2849,6 @@ xfs_iflush(
>  	mp = ip->i_mount;
>  
>  	/*
> -	 * If the inode isn't dirty, then just release the inode flush lock and
> -	 * do nothing.
> -	 */
> -	if (xfs_inode_clean(ip)) {
> -		xfs_ifunlock(ip);
> -		return 0;
> -	}
> -
> -	/*
>  	 * We can't flush the inode until it is unpinned, so wait for it if we
>  	 * are allowed to block.  We know noone new can pin it, because we are
>  	 * holding the inode lock shared and you need to hold it exclusively to
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index ec1f28c..8b618ea 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -483,6 +483,7 @@ int		xfs_iunlink(struct xfs_trans *, xfs_inode_t *);
>  void		xfs_iext_realloc(xfs_inode_t *, int, int);
>  void		xfs_ipin(xfs_inode_t *);
>  void		xfs_iunpin(xfs_inode_t *);
> +void		xfs_iunpin_wait(xfs_inode_t *);
>  int		xfs_iflush(xfs_inode_t *, uint);
>  void		xfs_ichgtime(xfs_inode_t *, int);
>  void		xfs_lock_inodes(xfs_inode_t **, int, uint);




* Re: [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2
  2010-02-02 23:24 ` [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
  2010-02-03 11:17   ` Christoph Hellwig
@ 2010-02-05 21:38   ` Alex Elder
  1 sibling, 0 replies; 29+ messages in thread
From: Alex Elder @ 2010-02-05 21:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> We currently do background inode flush asynchronously, resulting in
> inodes being written in whatever order the background writeback
> issues them. Not only that, there are also blocking and non-blocking
> asynchronous inode flushes, depending on where the flush comes from.
> 
> This patch completely removes asynchronous inode writeback. It
> removes all the strange writeback modes and replaces them with
> either a synchronous flush or a non-blocking delayed write flush.
> That is, inode flushes will only issue IO directly if they are
> synchronous, and background flushing may do nothing if the operation
> would block (e.g. on a pinned inode or buffer lock).
> 
> Delayed write flushes will now result in the inode buffer sitting in
> the delwri queue of the buffer cache to be flushed by either an AIL
> push or by the xfsbufd timing out the buffer. This will allow
> accumulation of dirty inode buffers in memory and allow optimisation
> of inode cluster writeback at the xfsbufd level where we have much
> greater queue depths than the block layer elevators. We will also
> get adjacent inode cluster buffer IO merging for free when a later
> patch in the series allows sorting of the delayed write buffers
> before dispatch.
> 
> This effectively means that any inode that is written back by
> background writeback will be seen as flush locked during AIL
> pushing, and will result in the buffers being pushed from there.
> This writeback path is currently non-optimal, but the next patch
> in the series will fix that problem.
> 
> A side effect of this delayed write mechanism is that background
> inode reclaim will no longer directly flush inodes, nor can it wait
> on the flush lock. The result is that inode reclaim must leave the
> inode in the reclaimable state until it is clean. Hence attempts to
> reclaim a dirty inode in the background will simply skip the inode
> until it is clean and this allows other mechanisms (i.e. xfsbufd) to
> do more optimal writeback of the dirty buffers. As a result, the
> inode reclaim code has been rewritten so that it no longer relies on
> the ambiguous return values of xfs_iflush() to determine whether it
> is safe to reclaim an inode.
> 
> Portions of this patch are derived from patches by Christoph
> Hellwig.
> 
> Version 2:
> - cleanup reclaim code as suggested by Christoph
> - log background reclaim inode flush errors
> - just pass sync flags to xfs_iflush

I now see that the missing "break;" in the previous patch
doesn't matter as long as this patch is also applied...

This change looks right to me, and it's pretty cool
to see where you're headed with it.  I wasn't quite
following a few weeks back when you and Christoph
were discussing this.

I have a few comments on the comments, below.  Maybe
you could either confirm or correct what I say below.

> Signed-off-by: Dave Chinner <david@fromorbit.com>

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/linux-2.6/xfs_super.c |    4 +-
>  fs/xfs/linux-2.6/xfs_sync.c  |  104 ++++++++++++++++++++++++++++++-----------
>  fs/xfs/xfs_inode.c           |   75 ++----------------------------
>  fs/xfs/xfs_inode.h           |   10 ----
>  fs/xfs/xfs_inode_item.c      |   10 +++-
>  fs/xfs/xfs_mount.c           |   13 +++++-
>  6 files changed, 102 insertions(+), 114 deletions(-)
> 

. . .

> diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
> index 98b8937..a9f6d20 100644
> --- a/fs/xfs/linux-2.6/xfs_sync.c
> +++ b/fs/xfs/linux-2.6/xfs_sync.c

. . .

> @@ -719,21 +720,42 @@ __xfs_inode_clear_reclaim_tag(

Regardless of the state, if the inode has a flush underway
we requeue.

>   *	shutdown		EIO		unpin and reclaim
>   *	clean, unpinned		0		reclaim
>   *	stale, unpinned		0		reclaim
> - *	clean, pinned(*)	0		unpin and reclaim
> - *	stale, pinned		0		unpin and reclaim
> - *	dirty, async		0		block on flush lock, reclaim
> - *	dirty, sync flush	0		block on flush lock, reclaim
> + *	clean, pinned(*)	0		requeue
> + *	stale, pinned		EAGAIN		requeue

Actually, if it's pinned (clean or stale) then we either
unpin and reclaim (if synchronous) or requeue (otherwise),
i.e.:

*    clean, pinned, async(*)    N/A       requeue
*    stale, pinned, async(*)    N/A       requeue
*    clean, pinned, sync flush  0         unpin and reclaim
*    stale, pinned, sync flush  EAGAIN    unpin and reclaim 

> + *	dirty, delwri ok	0		requeue
> + *	dirty, delwri blocked	EAGAIN		requeue
> + *	dirty, sync flush	0		reclaim
>   *
>   * (*) dgc: I don't think the clean, pinned state is possible but it gets
>   * handled anyway given the order of checks implemented.
>   *
> + * As can be seen from the table, the return value of xfs_iflush() is not
> + * sufficient to correctly decide the reclaim action here. The checks in
> + * xfs_iflush() might look like duplicates, but they are not.
> + *
> + * Also, because we get the flush lock first, we know that any inode that has
> + * been flushed delwri has had the flush completed by the time we check that
> + * the inode is clean. The clean inode check needs to be done before flushing
> + * the inode delwri otherwise we would loop forever requeuing clean inodes as
> + * we cannot tell apart a successful delwri flush and a clean inode from the
> + * return value of xfs_iflush().
> + *
> + * Note that because the inode is flushed delayed write by background
> + * writeback, the flush lock may already be held here and waiting on it can
> + * result in very long latencies. Hence for sync reclaims, where we wait on the
> + * flush lock, the caller should push out delayed write inodes first before
> + * trying to reclaim them to minimise the amount of time spent waiting. For
> + * background reclaim, we just requeue the inode for the next pass.
> + *
>   * Hence the order of actions after gaining the locks should be:
>   *	bad		=> reclaim
>   *	shutdown	=> unpin and reclaim
> - *	pinned		=> unpin
> + *	pinned, delwri	=> requeue
> + *	pinned, sync	=> unpin
>   *	stale		=> reclaim
>   *	clean		=> reclaim
> - *	dirty		=> flush, wait and reclaim
> + *	dirty, delwri	=> flush and requeue
> + *	dirty, sync	=> flush, wait and reclaim
>   */
>  STATIC int
>  xfs_reclaim_inode(

. . .


* Re: [PATCH 03/10] xfs: Don't issue buffer IO direct from AIL push V2
  2010-02-02 23:24 ` [PATCH 03/10] xfs: Don't issue buffer IO direct from AIL push V2 Dave Chinner
@ 2010-02-05 22:51   ` Alex Elder
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Elder @ 2010-02-05 22:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> All buffers logged into the AIL are marked as delayed write.
> When the AIL needs to push the buffer out, it issues an async write of the
> buffer. This means that IO patterns are dependent on the order of
> buffers in the AIL.
> 
> Instead of flushing the buffer, promote the buffer in the delayed
> write list so that the next time the xfsbufd is run the buffer will
> be flushed by the xfsbufd. Return the state to the xfsaild that the
> buffer was promoted so that the xfsaild knows that it needs to cause
> the xfsbufd to run to flush the buffers that were promoted.
> 
> Using the xfsbufd for issuing the IO allows us to dispatch all
> buffer IO from the one queue. This means that we can make much more
> enlightened decisions on what order to flush buffers to disk as
> we don't have multiple places issuing IO. Optimisations to xfsbufd
> will be in a future patch.
> 
> Version 2
> - kill XFS_ITEM_FLUSHING as it is now unused.

Looks good.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/linux-2.6/xfs_buf.c    |   29 ++++++++++++
>  fs/xfs/linux-2.6/xfs_buf.h    |    2 +
>  fs/xfs/linux-2.6/xfs_trace.h  |    1 +
>  fs/xfs/quota/xfs_dquot_item.c |   85 +++++------------------------------
>  fs/xfs/quota/xfs_dquot_item.h |    4 --
>  fs/xfs/xfs_buf_item.c         |   64 +++++++++++++++------------
>  fs/xfs/xfs_inode_item.c       |   98 ++++++----------------------------------
>  fs/xfs/xfs_inode_item.h       |    6 ---
>  fs/xfs/xfs_trans.h            |    3 +-
>  fs/xfs/xfs_trans_ail.c        |   13 +++---
>  10 files changed, 102 insertions(+), 203 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index 44e20e5..b306265 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c
> @@ -1778,6 +1778,35 @@ xfs_buf_delwri_dequeue(
>  	trace_xfs_buf_delwri_dequeue(bp, _RET_IP_);
>  }
>  
> +/*
> + * If a delwri buffer needs to be pushed before it has aged out, then promote
> + * it to the head of the delwri queue so that it will be flushed on the next
> + * xfsbufd run. We do this by resetting the queuetime of the buffer to be older
> + * than the age currently needed to flush the buffer. Hence the next time the
> + * xfsbufd sees it, the buffer is guaranteed to be old enough to flush.
> + */
> +void
> +xfs_buf_delwri_promote(
> +	struct xfs_buf	*bp)
> +{
> +	struct xfs_buftarg *btp = bp->b_target;
> +	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10) + 1;
> +
> +	ASSERT(bp->b_flags & XBF_DELWRI);
> +	ASSERT(bp->b_flags & _XBF_DELWRI_Q);
> +
> +	/*
> +	 * Check the buffer age before locking the delayed write queue as we
> +	 * don't need to promote buffers that are already past the flush age.
> +	 */
> +	if (bp->b_queuetime < jiffies - age)
> +		return;
> +	bp->b_queuetime = jiffies - age;
> +	spin_lock(&btp->bt_delwrite_lock);
> +	list_move(&bp->b_list, &btp->bt_delwrite_queue);
> +	spin_unlock(&btp->bt_delwrite_lock);
> +}
> +
>  STATIC void
>  xfs_buf_runall_queues(
>  	struct workqueue_struct	*queue)
> diff --git a/fs/xfs/linux-2.6/xfs_buf.h b/fs/xfs/linux-2.6/xfs_buf.h
> index ea8c198..be45e8c 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.h
> +++ b/fs/xfs/linux-2.6/xfs_buf.h
> @@ -266,6 +266,7 @@ extern int xfs_buf_ispin(xfs_buf_t *);
>  
>  /* Delayed Write Buffer Routines */
>  extern void xfs_buf_delwri_dequeue(xfs_buf_t *);
> +extern void xfs_buf_delwri_promote(xfs_buf_t *);
>  
>  /* Buffer Daemon Setup Routines */
>  extern int xfs_buf_init(void);
> @@ -395,6 +396,7 @@ extern void xfs_free_buftarg(struct xfs_mount *, struct xfs_buftarg *);
>  extern void xfs_wait_buftarg(xfs_buftarg_t *);
>  extern int xfs_setsize_buftarg(xfs_buftarg_t *, unsigned int, unsigned int);
>  extern int xfs_flush_buftarg(xfs_buftarg_t *, int);
> +
>  #ifdef CONFIG_KDB_MODULES
>  extern struct list_head *xfs_get_buftarg_list(void);
>  #endif
> diff --git a/fs/xfs/linux-2.6/xfs_trace.h b/fs/xfs/linux-2.6/xfs_trace.h
> index 1bb09e7..a4574dc 100644
> --- a/fs/xfs/linux-2.6/xfs_trace.h
> +++ b/fs/xfs/linux-2.6/xfs_trace.h
> @@ -483,6 +483,7 @@ DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock);
>  DEFINE_BUF_ITEM_EVENT(xfs_buf_item_unlock_stale);
>  DEFINE_BUF_ITEM_EVENT(xfs_buf_item_committed);
>  DEFINE_BUF_ITEM_EVENT(xfs_buf_item_push);
> +DEFINE_BUF_ITEM_EVENT(xfs_buf_item_pushbuf);
>  DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf);
>  DEFINE_BUF_ITEM_EVENT(xfs_trans_get_buf_recur);
>  DEFINE_BUF_ITEM_EVENT(xfs_trans_getsb);
> diff --git a/fs/xfs/quota/xfs_dquot_item.c b/fs/xfs/quota/xfs_dquot_item.c
> index 1b56437..dda0fb0 100644
> --- a/fs/xfs/quota/xfs_dquot_item.c
> +++ b/fs/xfs/quota/xfs_dquot_item.c
> @@ -212,66 +212,31 @@ xfs_qm_dquot_logitem_pushbuf(
>  	xfs_dquot_t	*dqp;
>  	xfs_mount_t	*mp;
>  	xfs_buf_t	*bp;
> -	uint		dopush;
>  
>  	dqp = qip->qli_dquot;
>  	ASSERT(XFS_DQ_IS_LOCKED(dqp));
>  
>  	/*
> -	 * The qli_pushbuf_flag keeps others from
> -	 * trying to duplicate our effort.
> -	 */
> -	ASSERT(qip->qli_pushbuf_flag != 0);
> -	ASSERT(qip->qli_push_owner == current_pid());
> -
> -	/*
>  	 * If flushlock isn't locked anymore, chances are that the
>  	 * dquot flush completed and the dquot was taken off the AIL.
>  	 * So, just get out.
>  	 */
>  	if (completion_done(&dqp->q_flush)  ||
>  	    ((qip->qli_item.li_flags & XFS_LI_IN_AIL) == 0)) {
> -		qip->qli_pushbuf_flag = 0;
>  		xfs_dqunlock(dqp);
>  		return;
>  	}
>  	mp = dqp->q_mount;
>  	bp = xfs_incore(mp->m_ddev_targp, qip->qli_format.qlf_blkno,
>  		    XFS_QI_DQCHUNKLEN(mp), XBF_TRYLOCK);
> -	if (bp != NULL) {
> -		if (XFS_BUF_ISDELAYWRITE(bp)) {
> -			dopush = ((qip->qli_item.li_flags & XFS_LI_IN_AIL) &&
> -				  !completion_done(&dqp->q_flush));
> -			qip->qli_pushbuf_flag = 0;
> -			xfs_dqunlock(dqp);
> -
> -			if (XFS_BUF_ISPINNED(bp))
> -				xfs_log_force(mp, 0);
> -
> -			if (dopush) {
> -				int	error;
> -#ifdef XFSRACEDEBUG
> -				delay_for_intr();
> -				delay(300);
> -#endif
> -				error = xfs_bawrite(mp, bp);
> -				if (error)
> -					xfs_fs_cmn_err(CE_WARN, mp,
> -	"xfs_qm_dquot_logitem_pushbuf: pushbuf error %d on qip %p, bp %p",
> -							error, qip, bp);
> -			} else {
> -				xfs_buf_relse(bp);
> -			}
> -		} else {
> -			qip->qli_pushbuf_flag = 0;
> -			xfs_dqunlock(dqp);
> -			xfs_buf_relse(bp);
> -		}
> +	xfs_dqunlock(dqp);
> +	if (!bp)
>  		return;
> -	}
> +	if (XFS_BUF_ISDELAYWRITE(bp))
> +		xfs_buf_delwri_promote(bp);
> +	xfs_buf_relse(bp);
> +	return;
>  
> -	qip->qli_pushbuf_flag = 0;
> -	xfs_dqunlock(dqp);
>  }
>  
>  /*
> @@ -289,50 +254,24 @@ xfs_qm_dquot_logitem_trylock(
>  	xfs_dq_logitem_t	*qip)
>  {
>  	xfs_dquot_t		*dqp;
> -	uint			retval;
>  
>  	dqp = qip->qli_dquot;
>  	if (atomic_read(&dqp->q_pincount) > 0)
> -		return (XFS_ITEM_PINNED);
> +		return XFS_ITEM_PINNED;
>  
>  	if (! xfs_qm_dqlock_nowait(dqp))
> -		return (XFS_ITEM_LOCKED);
> +		return XFS_ITEM_LOCKED;
>  
> -	retval = XFS_ITEM_SUCCESS;
>  	if (!xfs_dqflock_nowait(dqp)) {
>  		/*
> -		 * The dquot is already being flushed.	It may have been
> -		 * flushed delayed write, however, and we don't want to
> -		 * get stuck waiting for that to complete.  So, we want to check
> -		 * to see if we can lock the dquot's buffer without sleeping.
> -		 * If we can and it is marked for delayed write, then we
> -		 * hold it and send it out from the push routine.  We don't
> -		 * want to do that now since we might sleep in the device
> -		 * strategy routine.  We also don't want to grab the buffer lock
> -		 * here because we'd like not to call into the buffer cache
> -		 * while holding the AIL lock.
> -		 * Make sure to only return PUSHBUF if we set pushbuf_flag
> -		 * ourselves.  If someone else is doing it then we don't
> -		 * want to go to the push routine and duplicate their efforts.
> +		 * The dquot has already been flushed to the backing buffer.
> +		 * Leave it locked; the pushbuf routine will unlock it.
>  		 */
> -		if (qip->qli_pushbuf_flag == 0) {
> -			qip->qli_pushbuf_flag = 1;
> -			ASSERT(qip->qli_format.qlf_blkno == dqp->q_blkno);
> -#ifdef DEBUG
> -			qip->qli_push_owner = current_pid();
> -#endif
> -			/*
> -			 * The dquot is left locked.
> -			 */
> -			retval = XFS_ITEM_PUSHBUF;
> -		} else {
> -			retval = XFS_ITEM_FLUSHING;
> -			xfs_dqunlock_nonotify(dqp);
> -		}
> +		return XFS_ITEM_PUSHBUF;
>  	}
>  
>  	ASSERT(qip->qli_item.li_flags & XFS_LI_IN_AIL);
> -	return (retval);
> +	return XFS_ITEM_SUCCESS;
>  }
>  
> 
> diff --git a/fs/xfs/quota/xfs_dquot_item.h b/fs/xfs/quota/xfs_dquot_item.h
> index 5a63253..5acae2a 100644
> --- a/fs/xfs/quota/xfs_dquot_item.h
> +++ b/fs/xfs/quota/xfs_dquot_item.h
> @@ -27,10 +27,6 @@ typedef struct xfs_dq_logitem {
>  	xfs_log_item_t		 qli_item;	   /* common portion */
>  	struct xfs_dquot	*qli_dquot;	   /* dquot ptr */
>  	xfs_lsn_t		 qli_flush_lsn;	   /* lsn at last flush */
> -	unsigned short		 qli_pushbuf_flag; /* 1 bit used in push_ail */
> -#ifdef DEBUG
> -	uint64_t		 qli_push_owner;
> -#endif
>  	xfs_dq_logformat_t	 qli_format;	   /* logged structure */
>  } xfs_dq_logitem_t;
>  
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index e0a1158..f3c49e6 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -467,8 +467,10 @@ xfs_buf_item_unpin_remove(
>  /*
>   * This is called to attempt to lock the buffer associated with this
>   * buf log item.  Don't sleep on the buffer lock.  If we can't get
> - * the lock right away, return 0.  If we can get the lock, pull the
> - * buffer from the free list, mark it busy, and return 1.
> + * the lock right away, return XFS_ITEM_LOCKED.  If we can get the
> + * lock, take a reference to the buffer. If this is a delayed write
> + * buffer that needs AIL help to be written back, invoke the pushbuf
> + * routine rather than the normal success path.
>   */
>  STATIC uint
>  xfs_buf_item_trylock(
> @@ -477,24 +479,18 @@ xfs_buf_item_trylock(
>  	xfs_buf_t	*bp;
>  
>  	bp = bip->bli_buf;
> -
> -	if (XFS_BUF_ISPINNED(bp)) {
> +	if (XFS_BUF_ISPINNED(bp))
>  		return XFS_ITEM_PINNED;
> -	}
> -
> -	if (!XFS_BUF_CPSEMA(bp)) {
> +	if (!XFS_BUF_CPSEMA(bp))
>  		return XFS_ITEM_LOCKED;
> -	}
>  
> -	/*
> -	 * Remove the buffer from the free list.  Only do this
> -	 * if it's on the free list.  Private buffers like the
> -	 * superblock buffer are not.
> -	 */
> +	/* Take a reference to the buffer. */
>  	XFS_BUF_HOLD(bp);
>  
>  	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
>  	trace_xfs_buf_item_trylock(bip);
> +	if (XFS_BUF_ISDELAYWRITE(bp))
> +		return XFS_ITEM_PUSHBUF;
>  	return XFS_ITEM_SUCCESS;
>  }
>  
> @@ -626,11 +622,9 @@ xfs_buf_item_committed(
>  }
>  
>  /*
> - * This is called to asynchronously write the buffer associated with this
> - * buf log item out to disk. The buffer will already have been locked by
> - * a successful call to xfs_buf_item_trylock().  If the buffer still has
> - * B_DELWRI set, then get it going out to disk with a call to bawrite().
> - * If not, then just release the buffer.
> + * The buffer is locked, but is not a delayed write buffer. This happens
> + * if we race with IO completion and hence we don't want to try to write it
> + * again. Just release the buffer.
>   */
>  STATIC void
>  xfs_buf_item_push(
> @@ -642,17 +636,29 @@ xfs_buf_item_push(
>  	trace_xfs_buf_item_push(bip);
>  
>  	bp = bip->bli_buf;
> +	ASSERT(!XFS_BUF_ISDELAYWRITE(bp));
> +	xfs_buf_relse(bp);
> +}
>  
> -	if (XFS_BUF_ISDELAYWRITE(bp)) {
> -		int	error;
> -		error = xfs_bawrite(bip->bli_item.li_mountp, bp);
> -		if (error)
> -			xfs_fs_cmn_err(CE_WARN, bip->bli_item.li_mountp,
> -			"xfs_buf_item_push: pushbuf error %d on bip %p, bp %p",
> -					error, bip, bp);
> -	} else {
> -		xfs_buf_relse(bp);
> -	}
> +/*
> + * The buffer is locked and is a delayed write buffer. Promote the buffer
> + * in the delayed write queue, as the caller knows it must invoke
> + * the xfsbufd to get this buffer written. We have to unlock the buffer
> + * to allow the xfsbufd to write it, too.
> + */
> +STATIC void
> +xfs_buf_item_pushbuf(
> +	xfs_buf_log_item_t	*bip)
> +{
> +	xfs_buf_t	*bp;
> +
> +	ASSERT(!(bip->bli_flags & XFS_BLI_STALE));
> +	trace_xfs_buf_item_pushbuf(bip);
> +
> +	bp = bip->bli_buf;
> +	ASSERT(XFS_BUF_ISDELAYWRITE(bp));
> +	xfs_buf_delwri_promote(bp);
> +	xfs_buf_relse(bp);
>  }
>  
>  /* ARGSUSED */
> @@ -677,7 +683,7 @@ static struct xfs_item_ops xfs_buf_item_ops = {
>  	.iop_committed	= (xfs_lsn_t(*)(xfs_log_item_t*, xfs_lsn_t))
>  					xfs_buf_item_committed,
>  	.iop_push	= (void(*)(xfs_log_item_t*))xfs_buf_item_push,
> -	.iop_pushbuf	= NULL,
> +	.iop_pushbuf	= (void(*)(xfs_log_item_t*))xfs_buf_item_pushbuf,
>  	.iop_committing = (void(*)(xfs_log_item_t*, xfs_lsn_t))
>  					xfs_buf_item_committing
>  };
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 207553e..d4dc063 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -602,33 +602,20 @@ xfs_inode_item_trylock(
>  
>  	if (!xfs_iflock_nowait(ip)) {
>  		/*
> -		 * If someone else isn't already trying to push the inode
> -		 * buffer, we get to do it.
> +		 * The inode has already been flushed to the backing buffer.
> +		 * Leave it locked in shared mode; the pushbuf routine will
> +		 * unlock it.
>  		 */
> -		if (iip->ili_pushbuf_flag == 0) {
> -			iip->ili_pushbuf_flag = 1;
> -#ifdef DEBUG
> -			iip->ili_push_owner = current_pid();
> -#endif
> -			/*
> -			 * Inode is left locked in shared mode.
> -			 * Pushbuf routine gets to unlock it.
> -			 */
> -			return XFS_ITEM_PUSHBUF;
> -		} else {
> -			/*
> -			 * We hold the AIL lock, so we must specify the
> -			 * NONOTIFY flag so that we won't double trip.
> -			 */
> -			xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
> -			return XFS_ITEM_FLUSHING;
> -		}
> -		/* NOTREACHED */
> +		return XFS_ITEM_PUSHBUF;
>  	}
>  
>  	/* Stale items should force out the iclog */
>  	if (ip->i_flags & XFS_ISTALE) {
>  		xfs_ifunlock(ip);
> +		/*
> +		 * we hold the AIL lock - notify the unlock routine of this
> +		 * so it doesn't try to get the lock again.
> +		 */
>  		xfs_iunlock(ip, XFS_ILOCK_SHARED|XFS_IUNLOCK_NONOTIFY);
>  		return XFS_ITEM_PINNED;
>  	}
> @@ -746,11 +733,8 @@ xfs_inode_item_committed(
>   * This gets called by xfs_trans_push_ail(), when IOP_TRYLOCK
>   * failed to get the inode flush lock but did get the inode locked SHARED.
>   * Here we're trying to see if the inode buffer is incore, and if so whether it's
> - * marked delayed write. If that's the case, we'll initiate a bawrite on that
> - * buffer to expedite the process.
> - *
> - * We aren't holding the AIL lock (or the flush lock) when this gets called,
> - * so it is inherently race-y.
> + * marked delayed write. If that's the case, we'll promote it and that will
> + * allow the caller to write the buffer by triggering the xfsbufd to run.
>   */
>  STATIC void
>  xfs_inode_item_pushbuf(
> @@ -759,26 +743,16 @@ xfs_inode_item_pushbuf(
>  	xfs_inode_t	*ip;
>  	xfs_mount_t	*mp;
>  	xfs_buf_t	*bp;
> -	uint		dopush;
>  
>  	ip = iip->ili_inode;
> -
>  	ASSERT(xfs_isilocked(ip, XFS_ILOCK_SHARED));
>  
>  	/*
> -	 * The ili_pushbuf_flag keeps others from
> -	 * trying to duplicate our effort.
> -	 */
> -	ASSERT(iip->ili_pushbuf_flag != 0);
> -	ASSERT(iip->ili_push_owner == current_pid());
> -
> -	/*
>  	 * If a flush is not in progress anymore, chances are that the
>  	 * inode was taken off the AIL. So, just get out.
>  	 */
>  	if (completion_done(&ip->i_flush) ||
>  	    ((iip->ili_item.li_flags & XFS_LI_IN_AIL) == 0)) {
> -		iip->ili_pushbuf_flag = 0;
>  		xfs_iunlock(ip, XFS_ILOCK_SHARED);
>  		return;
>  	}
> @@ -787,53 +761,12 @@ xfs_inode_item_pushbuf(
>  	bp = xfs_incore(mp->m_ddev_targp, iip->ili_format.ilf_blkno,
>  		    iip->ili_format.ilf_len, XBF_TRYLOCK);
>  
> -	if (bp != NULL) {
> -		if (XFS_BUF_ISDELAYWRITE(bp)) {
> -			/*
> -			 * We were racing with iflush because we don't hold
> -			 * the AIL lock or the flush lock. However, at this point,
> -			 * we have the buffer, and we know that it's dirty.
> -			 * So, it's possible that iflush raced with us, and
> -			 * this item is already taken off the AIL.
> -			 * If not, we can flush it async.
> -			 */
> -			dopush = ((iip->ili_item.li_flags & XFS_LI_IN_AIL) &&
> -				  !completion_done(&ip->i_flush));
> -			iip->ili_pushbuf_flag = 0;
> -			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> -
> -			trace_xfs_inode_item_push(bp, _RET_IP_);
> -
> -			if (XFS_BUF_ISPINNED(bp))
> -				xfs_log_force(mp, 0);
> -
> -			if (dopush) {
> -				int	error;
> -				error = xfs_bawrite(mp, bp);
> -				if (error)
> -					xfs_fs_cmn_err(CE_WARN, mp,
> -		"xfs_inode_item_pushbuf: pushbuf error %d on iip %p, bp %p",
> -							error, iip, bp);
> -			} else {
> -				xfs_buf_relse(bp);
> -			}
> -		} else {
> -			iip->ili_pushbuf_flag = 0;
> -			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> -			xfs_buf_relse(bp);
> -		}
> -		return;
> -	}
> -	/*
> -	 * We have to be careful about resetting pushbuf flag too early (above).
> -	 * Even though in theory we can do it as soon as we have the buflock,
> -	 * we don't want others to be doing work needlessly. They'll come to
> -	 * this function thinking that pushing the buffer is their
> -	 * responsibility only to find that the buffer is still locked by
> -	 * another doing the same thing
> -	 */
> -	iip->ili_pushbuf_flag = 0;
>  	xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +	if (!bp)
> +		return;
> +	if (XFS_BUF_ISDELAYWRITE(bp))
> +		xfs_buf_delwri_promote(bp);
> +	xfs_buf_relse(bp);
>  	return;
>  }
>  
> @@ -937,7 +870,6 @@ xfs_inode_item_init(
>  	/*
>  	   We have zeroed memory. No need ...
>  	   iip->ili_extents_buf = NULL;
> -	   iip->ili_pushbuf_flag = 0;
>  	 */
>  
>  	iip->ili_format.ilf_type = XFS_LI_INODE;
> diff --git a/fs/xfs/xfs_inode_item.h b/fs/xfs/xfs_inode_item.h
> index cc8df1a..9a46795 100644
> --- a/fs/xfs/xfs_inode_item.h
> +++ b/fs/xfs/xfs_inode_item.h
> @@ -144,12 +144,6 @@ typedef struct xfs_inode_log_item {
>  						      data exts */
>  	struct xfs_bmbt_rec	*ili_aextents_buf; /* array of logged
>  						      attr exts */
> -	unsigned int            ili_pushbuf_flag;  /* one bit used in push_ail */
> -
> -#ifdef DEBUG
> -	uint64_t                ili_push_owner;    /* one who sets pushbuf_flag
> -						      above gets to push the buf */
> -#endif
>  #ifdef XFS_TRANS_DEBUG
>  	int			ili_root_size;
>  	char			*ili_orig_root;
> diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
> index ca64f33..c93e3a1 100644
> --- a/fs/xfs/xfs_trans.h
> +++ b/fs/xfs/xfs_trans.h
> @@ -861,8 +861,7 @@ typedef struct xfs_item_ops {
>  #define	XFS_ITEM_SUCCESS	0
>  #define	XFS_ITEM_PINNED		1
>  #define	XFS_ITEM_LOCKED		2
> -#define	XFS_ITEM_FLUSHING	3
> -#define XFS_ITEM_PUSHBUF	4
> +#define XFS_ITEM_PUSHBUF	3
>  
>  /*
>   * This structure is used to maintain a list of block ranges that have been
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index d7b1af8..e799824 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -253,6 +253,7 @@ xfsaild_push(
>  	int		flush_log, count, stuck;
>  	xfs_mount_t	*mp = ailp->xa_mount;
>  	struct xfs_ail_cursor	*cur = &ailp->xa_cursors;
> +	int		push_xfsbufd = 0;
>  
>  	spin_lock(&ailp->xa_lock);
>  	xfs_trans_ail_cursor_init(ailp, cur);
> @@ -308,6 +309,7 @@ xfsaild_push(
>  			XFS_STATS_INC(xs_push_ail_pushbuf);
>  			IOP_PUSHBUF(lip);
>  			last_pushed_lsn = lsn;
> +			push_xfsbufd = 1;
>  			break;
>  
>  		case XFS_ITEM_PINNED:
> @@ -322,12 +324,6 @@ xfsaild_push(
>  			stuck++;
>  			break;
>  
> -		case XFS_ITEM_FLUSHING:
> -			XFS_STATS_INC(xs_push_ail_flushing);
> -			last_pushed_lsn = lsn;
> -			stuck++;
> -			break;
> -
>  		default:
>  			ASSERT(0);
>  			break;
> @@ -374,6 +370,11 @@ xfsaild_push(
>  		xfs_log_force(mp, 0);
>  	}
>  
> +	if (push_xfsbufd) {
> +		/* we've got delayed write buffers to flush */
> +		wake_up_process(mp->m_ddev_targp->bt_task);
> +	}
> +
>  	if (!count) {
>  		/* We're past our target or empty, so idle */
>  		last_pushed_lsn = 0;
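
The trick here deserves a closer look: xfs_buf_delwri_promote() never
issues IO itself. It simply backdates b_queuetime so the buffer looks
older than the flush age, and the xfsaild then does a single
wake_up_process() of the xfsbufd after its scan (the push_xfsbufd
flag). A standalone model of the aging arithmetic, with illustrative
field and helper names rather than the kernel's:

struct buf {
        unsigned long queuetime;        /* tick when buffer was queued */
};

/* A buffer becomes a flush candidate once it has sat on the delwri
 * queue for more than "age" ticks. */
static int buf_is_old(const struct buf *bp, unsigned long now,
                      unsigned long age)
{
        return now - bp->queuetime > age;
}

/* Backdate the queuetime so the very next scan sees the buffer as old
 * enough to flush. The real function also moves the buffer within the
 * delwri queue under the delwri lock; only the aging trick is
 * modelled here. */
static void buf_promote(struct buf *bp, unsigned long now,
                        unsigned long age)
{
        if (bp->queuetime < now - age)  /* already past the flush age */
                return;
        bp->queuetime = now - age - 1;
}

Batching the wakeup means one xfsbufd run services every buffer
promoted during an AIL scan, which is what lets the sorting in the
next patch pay off.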




* Re: [PATCH 04/10] xfs: Sort delayed write buffers before dispatch
  2010-02-02 23:24 ` [PATCH 04/10] xfs: Sort delayed write buffers before dispatch Dave Chinner
@ 2010-02-05 23:53   ` Alex Elder
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Elder @ 2010-02-05 23:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> Currently when the xfsbufd writes delayed write buffers, it pushes
> them to disk in the order they come off the delayed write list. If
> there are lots of buffers spread widely over the disk, this results
> in overwhelming the elevator sort queues in the block layer and we
> end up losing the possibility of merging adjacent buffers to minimise
> the number of IOs.
> 
> Use the new generic list_sort function to sort the delwri dispatch
> queue before issue to ensure that the buffers are pushed in the most
> friendly order possible to the lower layers.

Looks good.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/linux-2.6/xfs_buf.c |   87 ++++++++++++++++++++++++++++++--------------
>  1 files changed, 60 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/linux-2.6/xfs_buf.c b/fs/xfs/linux-2.6/xfs_buf.c
> index b306265..4556a4c 100644
> --- a/fs/xfs/linux-2.6/xfs_buf.c
> +++ b/fs/xfs/linux-2.6/xfs_buf.c
> @@ -33,6 +33,7 @@
>  #include <linux/migrate.h>
>  #include <linux/backing-dev.h>
>  #include <linux/freezer.h>
> +#include <linux/list_sort.h>
>  
>  #include "xfs_sb.h"
>  #include "xfs_inum.h"
> @@ -1877,14 +1878,42 @@ xfs_buf_delwri_split(
>  
>  }
>  
> +/*
> + * The compare function is more complex than it needs to be because
> + * the return value is only 32 bits and we are doing comparisons
> + * on 64 bit values
> + */
> +static int
> +xfs_buf_cmp(
> +	void		*priv,
> +	struct list_head *a,
> +	struct list_head *b)
> +{
> +	struct xfs_buf	*ap = container_of(a, struct xfs_buf, b_list);
> +	struct xfs_buf	*bp = container_of(b, struct xfs_buf, b_list);
> +	xfs_daddr_t		diff;
> +
> +	diff = ap->b_bn - bp->b_bn;
> +	if (diff < 0)
> +		return -1;
> +	if (diff > 0)
> +		return 1;
> +	return 0;
> +}
> +
> +void
> +xfs_buf_delwri_sort(
> +	xfs_buftarg_t	*target,
> +	struct list_head *list)
> +{
> +	list_sort(NULL, list, xfs_buf_cmp);
> +}
> +
>  STATIC int
>  xfsbufd(
>  	void		*data)
>  {
> -	struct list_head tmp;
> -	xfs_buftarg_t	*target = (xfs_buftarg_t *)data;
> -	int		count;
> -	xfs_buf_t	*bp;
> +	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;
>  
>  	current->flags |= PF_MEMALLOC;
>  
> @@ -1893,6 +1922,8 @@ xfsbufd(
>  	do {
>  		long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
>  		long	tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
> +		int	count = 0;
> +		struct list_head tmp;
>  
>  		if (unlikely(freezing(current))) {
>  			set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
> @@ -1907,11 +1938,10 @@ xfsbufd(
>  		schedule_timeout_interruptible(tout);
>  
>  		xfs_buf_delwri_split(target, &tmp, age);
> -		count = 0;
> +		list_sort(NULL, &tmp, xfs_buf_cmp);
>  		while (!list_empty(&tmp)) {
> -			bp = list_entry(tmp.next, xfs_buf_t, b_list);
> -			ASSERT(target == bp->b_target);
> -
> +			struct xfs_buf *bp;
> +			bp = list_first_entry(&tmp, struct xfs_buf, b_list);
>  			list_del_init(&bp->b_list);
>  			xfs_buf_iostrategy(bp);
>  			count++;
> @@ -1937,42 +1967,45 @@ xfs_flush_buftarg(
>  	xfs_buftarg_t	*target,
>  	int		wait)
>  {
> -	struct list_head tmp;
> -	xfs_buf_t	*bp, *n;
> +	xfs_buf_t	*bp;
>  	int		pincount = 0;
> +	LIST_HEAD(tmp_list);
> +	LIST_HEAD(wait_list);
>  
>  	xfs_buf_runall_queues(xfsconvertd_workqueue);
>  	xfs_buf_runall_queues(xfsdatad_workqueue);
>  	xfs_buf_runall_queues(xfslogd_workqueue);
>  
>  	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> -	pincount = xfs_buf_delwri_split(target, &tmp, 0);
> +	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
>  
>  	/*
> -	 * Dropped the delayed write list lock, now walk the temporary list
> +	 * Dropped the delayed write list lock, now walk the temporary list.
> +	 * All I/O is issued async and then if we need to wait for completion
> +	 * we do that after issuing all the IO.
>  	 */
> -	list_for_each_entry_safe(bp, n, &tmp, b_list) {
> +	list_sort(NULL, &tmp_list, xfs_buf_cmp);
> +	while (!list_empty(&tmp_list)) {
> +		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
>  		ASSERT(target == bp->b_target);
> -		if (wait)
> +		list_del_init(&bp->b_list);
> +		if (wait) {
>  			bp->b_flags &= ~XBF_ASYNC;
> -		else
> -			list_del_init(&bp->b_list);
> -
> +			list_add(&bp->b_list, &wait_list);
> +		}
>  		xfs_buf_iostrategy(bp);
>  	}
>  
> -	if (wait)
> +	if (wait) {
> +		/* Expedite and wait for IO to complete. */
>  		blk_run_address_space(target->bt_mapping);
> +		while (!list_empty(&wait_list)) {
> +			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
>  
> -	/*
> -	 * Remaining list items must be flushed before returning
> -	 */
> -	while (!list_empty(&tmp)) {
> -		bp = list_entry(tmp.next, xfs_buf_t, b_list);
> -
> -		list_del_init(&bp->b_list);
> -		xfs_iowait(bp);
> -		xfs_buf_relse(bp);
> +			list_del_init(&bp->b_list);
> +			xfs_iowait(bp);
> +			xfs_buf_relse(bp);
> +		}
>  	}
>  
>  	return pincount;
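
Two details here reward a closer look. First, the comparator comment:
the obvious one-liner "return ap->b_bn - bp->b_bn;" would truncate a
64 bit difference into the 32 bit int return value and can misreport
the ordering. A standalone demonstration of the failure mode:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        int64_t a = 0;
        int64_t b = INT64_C(1) << 32;   /* block numbers 2^32 apart */

        /* The high bits are lost in the cast: -2^32 truncates to 0,
         * so a naive comparator would call these two buffers equal. */
        printf("naive cmp = %d\n", (int)(a - b));
        return 0;
}

Second, note how xfs_flush_buftarg() now separates issue from wait:
all the sorted IO is dispatched asynchronously before the wait_list
walk blocks in xfs_iowait(), so the elevator sees the whole sorted
stream before anything sleeps.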




* Re: [PATCH 05/10] xfs: Use delay write promotion for dquot flushing
  2010-02-02 23:24 ` [PATCH 05/10] xfs: Use delay write promotion for dquot flushing Dave Chinner
@ 2010-02-05 23:55   ` Alex Elder
  0 siblings, 0 replies; 29+ messages in thread
From: Alex Elder @ 2010-02-05 23:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> xfs_qm_dqflock_pushbuf_wait() does a very similar trick to item
> pushing used to do to flush out delayed write dquot buffers. Change
> it to use the new promotion method rather than an async flush.
> 
> Also, xfs_qm_dqflock_pushbuf_wait() can return without the flush lock
> held, yet the callers make the assumption that after this call the
> flush lock is held. Always return with the flush lock held.

Looks good.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/quota/xfs_dquot.c |   25 ++++++++++---------------
>  1 files changed, 10 insertions(+), 15 deletions(-)
> 
> diff --git a/fs/xfs/quota/xfs_dquot.c b/fs/xfs/quota/xfs_dquot.c
> index f9baeed..1620a56 100644
> --- a/fs/xfs/quota/xfs_dquot.c
> +++ b/fs/xfs/quota/xfs_dquot.c
> @@ -1528,21 +1528,16 @@ xfs_qm_dqflock_pushbuf_wait(
>  	 */
>  	bp = xfs_incore(dqp->q_mount->m_ddev_targp, dqp->q_blkno,
>  		    XFS_QI_DQCHUNKLEN(dqp->q_mount), XBF_TRYLOCK);
> -	if (bp != NULL) {
> -		if (XFS_BUF_ISDELAYWRITE(bp)) {
> -			int	error;
> -
> -			if (XFS_BUF_ISPINNED(bp))
> -				xfs_log_force(dqp->q_mount, 0);
> -			error = xfs_bawrite(dqp->q_mount, bp);
> -			if (error)
> -				xfs_fs_cmn_err(CE_WARN, dqp->q_mount,
> -					"xfs_qm_dqflock_pushbuf_wait: "
> -					"pushbuf error %d on dqp %p, bp %p",
> -					error, dqp, bp);
> -		} else {
> -			xfs_buf_relse(bp);
> -		}
> +	if (!bp)
> +		goto out_lock;
> +
> +	if (XFS_BUF_ISDELAYWRITE(bp)) {
> +		if (XFS_BUF_ISPINNED(bp))
> +			xfs_log_force(dqp->q_mount, 0);
> +		xfs_buf_delwri_promote(bp);
> +		wake_up_process(bp->b_target->bt_task);
>  	}
> +	xfs_buf_relse(bp);
> +out_lock:
>  	xfs_dqflock(dqp);
>  }
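
The out_lock label carries the second half of the fix: every path,
including the buffer-not-incore case, now funnels through
xfs_dqflock() before returning, so callers really do get the flush
lock back. The control flow, reduced to a standalone sketch with stub
helpers (none of these names are the real XFS API):

struct buf { int delwri; };

static struct buf *try_lock_incore_buf(void) { return 0; }
static void promote_and_wake(struct buf *bp) { (void)bp; }
static void buf_release(struct buf *bp) { (void)bp; }
static void flush_lock(void) { }

static void pushbuf_wait(void)
{
        struct buf *bp = try_lock_incore_buf();

        if (!bp)
                goto out_lock;
        if (bp->delwri)
                promote_and_wake(bp);
        buf_release(bp);
out_lock:
        flush_lock();   /* postcondition: lock held on every return */
}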




* Re: [PATCH 01/10] xfs: Make inode reclaim states explicit
  2010-02-05 19:06   ` Alex Elder
@ 2010-02-06  0:07     ` Dave Chinner
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Chinner @ 2010-02-06  0:07 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Feb 05, 2010 at 01:06:20PM -0600, Alex Elder wrote:
> On Wed, 2010-02-03 at 10:24 +1100, Dave Chinner wrote:
> > A.K.A.: don't rely on xfs_iflush() return value in reclaim
> > 
> > We have gradually been moving checks out of the reclaim code because
> > they are duplicated in xfs_iflush(). We've had a history of problems
> > in this area, and many of them stem from the overloading of the
> > return values from xfs_iflush() and interaction with inode flush
> > locking to determine if the inode is safe to reclaim.
> > 
> > With the desire to move to delayed write flushing of inodes and
> > non-blocking inode tree reclaim walks, the overloading of the
> > return value of xfs_iflush makes it very difficult to determine
> > the correct thing to do next.
> > 
> > This patch explicitly re-adds the checks to the inode reclaim code,
> > removing the reliance on the return value of xfs_iflush() to
> > determine what to do next. It also means that we can clearly
> > document all the inode states that reclaim must handle and hence
> > we can easily see that we handled all the necessary cases.
> > 
> > This also removes the need for the xfs_inode_clean() check in
> > xfs_iflush() as all callers now check this first (safely).
> 
> I have a few comments, below.  One is a real bug.  At this
> point I'm trusting that your enumeration of all the possible
> inode states is correct.  Other than what I indicate below,
> this change looks good.

...

> It looks to me like your code adds a call to xfs_ifunlock(ip)
> in the bad inode case that was not there before.  My guess is
> that your way is correct, but I'd like to know whether you
> agree this differs.  Is this intentional?  Regardless, is
> the new xfs_ifunlock() call correct?

Yes, it is correct - the code before was broken.

> > +	if (xfs_iflags_test(ip, XFS_ISTALE))
> > +		goto reclaim;
> > +	if (xfs_inode_clean(ip))
> > +		goto reclaim;
> > +
> > +	/* Now we have an inode that needs flushing */
> > +	error = xfs_iflush(ip, sync_mode);
> > +	if (!error) {
> > +		switch(sync_mode) {
> > +		case XFS_IFLUSH_DELWRI_ELSE_ASYNC:
> > +		case XFS_IFLUSH_DELWRI:
> > +		case XFS_IFLUSH_ASYNC:
> > +		case XFS_IFLUSH_DELWRI_ELSE_SYNC:
> > +		case XFS_IFLUSH_SYNC:
> > +			/* IO issued, synchronise with IO completion */
> > +			xfs_iflock(ip);
> 
> You must not have run this with DEBUG enabled.
> You need a "break;" here.

I always run with DEBUG enabled, but I probably didn't
test this patch by itself in the last iteration because
I don't have unlimited test cycles. This code gets removed
by the next patch, so the fact that it was broken would
not have been picked up by testing. I'll update it
and re-push it to the for-2.6.34 branch.

Cheers,
-- 
Dave Chinner
david@fromorbit.com
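
For anyone following the exchange above: the bug class is plain C
switch fallthrough. Without the break after the xfs_iflock() call,
control runs on into whatever follows the case list, presumably the
assertion a DEBUG build would trip. A trivial standalone illustration:

#include <stdio.h>

int main(void)
{
        int mode = 1;

        switch (mode) {
        case 1:
                printf("case 1\n");
                /* missing break: execution falls through ... */
        case 2:
                printf("... so case 2 runs as well\n");
                break;
        }
        return 0;
}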


end of thread

Thread overview: 29+ messages
2010-02-02 23:24 [PATCH 0/10] Delayed write metadata writeback V4 Dave Chinner
2010-02-02 23:24 ` [PATCH 01/10] xfs: Make inode reclaim states explicit Dave Chinner
2010-02-05 19:06   ` Alex Elder
2010-02-06  0:07     ` Dave Chinner
2010-02-02 23:24 ` [PATCH 02/10] xfs: Use delayed write for inodes rather than async V2 Dave Chinner
2010-02-03 11:17   ` Christoph Hellwig
2010-02-05 21:38   ` Alex Elder
2010-02-02 23:24 ` [PATCH 03/10] xfs: Don't issue buffer IO direct from AIL push V2 Dave Chinner
2010-02-05 22:51   ` Alex Elder
2010-02-02 23:24 ` [PATCH 04/10] xfs: Sort delayed write buffers before dispatch Dave Chinner
2010-02-05 23:53   ` Alex Elder
2010-02-02 23:24 ` [PATCH 05/10] xfs: Use delay write promotion for dquot flushing Dave Chinner
2010-02-05 23:55   ` Alex Elder
2010-02-02 23:25 ` [PATCH 06/10] xfs: kill the unused XFS_QMOPT_* flush flags V2 Dave Chinner
2010-02-03 11:17   ` Christoph Hellwig
2010-02-02 23:25 ` [PATCH 07/10] xfs: remove invalid barrier optimization from xfs_fsync Dave Chinner
2010-02-02 23:25 ` [PATCH 08/10] xfs: move the inode locking outside xfs_fsync() Dave Chinner
2010-02-03 11:29   ` Christoph Hellwig
2010-02-03 23:08     ` Dave Chinner
2010-02-04 16:07       ` Christoph Hellwig
2010-02-02 23:25 ` [PATCH 09/10] xfs: xfs_fs_write_inode() can fail to write inodes synchronously V2 Dave Chinner
2010-02-03 11:27   ` Christoph Hellwig
2010-02-03 18:07     ` bpm
2010-02-03 20:55       ` Christoph Hellwig
2010-02-03 20:56     ` Christoph Hellwig
2010-02-03 23:02       ` Dave Chinner
2010-02-04 17:36         ` Christoph Hellwig
2010-02-02 23:25 ` [PATCH 10/10] xfs: kill xfs_bawrite Dave Chinner
2010-02-03 11:19   ` Christoph Hellwig
