* [PATCH 0/6] xfs: patch queue for Linux 3.2
@ 2011-08-25  7:17 Dave Chinner
  2011-08-25  7:17 ` [PATCH 1/6] xfs: don't serialise direct IO reads on page cache checks Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

These are the previously posted and commented-on patches to fix the
dio locking serialisation issues, a couple of small optimisations,
and the conversion of the xfsbufd thread to a work queue.

* [PATCH 1/6] xfs: don't serialise direct IO reads on page cache checks
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25  7:17 ` [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

There is no need to grab the i_mutex or the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effectively serialises them, as taking the
IO lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effectively it
prevents dispatch of concurrent direct IO reads to the same inode.

Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done. Hence for the normal direct IO case, no exclusive
locking will occur.
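
In outline, the read-side locking then looks roughly like this
(simplified sketch only; the page cache flush and error handling in
the exclusive branch are elided):

	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
		/* rare case: page cache needs invalidating */
		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		if (inode->i_mapping->nrpages) {
			/* flush and invalidate the page cache (elided) */
		}
		/* drop back to shared for the IO itself */
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	}
	/* common case: dispatch the direct IO read under the shared lock */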

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Tested-by: Joern Engel <joern@logfs.org>
Reviewed-by: Alex Elder <aelder@sgi.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_file.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 2fdc6d1..a1dea10 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -317,7 +317,19 @@ xfs_file_aio_read(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (unlikely(ioflags & IO_ISDIRECT)) {
+	/*
+	 * Locking is a bit tricky here. If we take an exclusive lock
+	 * for direct IO, we effectively serialise all new concurrent
+	 * read IO to this file and block it behind IO that is currently in
+	 * progress because IO in progress holds the IO lock shared. We only
+	 * need to hold the lock exclusive to blow away the page cache, so
+	 * only take the lock exclusively if the page cache needs invalidation.
+	 * This allows the normal direct IO case of no page cache pages to
+	 * proceed concurrently without serialisation.
+	 */
+	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
 
 		if (inode->i_mapping->nrpages) {
@@ -330,8 +342,7 @@ xfs_file_aio_read(
 			}
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
-	} else
-		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 
 	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
 
-- 
1.7.5.4

* [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
  2011-08-25  7:17 ` [PATCH 1/6] xfs: don't serialise direct IO reads on page cache checks Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25 21:08   ` Alex Elder
  2011-08-25  7:17 ` [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

For append write workloads, extending the file requires a certain
amount of exclusive locking to be done up front to ensure sanity in
things like zeroing any allocated regions between the old EOF and
the start of the new IO.

For single threads, this typically isn't a problem, and for large
IOs we don't serialise enough for it to be a problem for two
threads on really fast block devices. However for smaller IO and
larger thread counts we have a problem.

Take 4 concurrent sequential, single block sized and aligned IOs.
After the first IO is submitted but before it completes, we end up
with this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^       ^
      |       |
      |       |
      |       |
      |       \- ip->i_new_size
      \- ip->i_size

And the IO is done without exclusive locking because offset <=
ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
grab the IO lock exclusive, because there is a chance we need to do
EOF zeroing. However, there is already an IO in progress that avoids
the need for EOF zeroing because offset <= ip->i_new_size, hence we
could avoid holding the IO lock exclusive for this. After
submission of the second IO, we'd end up in this state:

        IO 1    IO 2    IO 3    IO 4
      +-------+-------+-------+-------+
      ^               ^
      |               |
      |               |
      |               |
      |               \- ip->i_new_size
      \- ip->i_size

There is no need to grab the i_mutex or the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effectively serialises them, as taking the
IO lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effectively it
prevents dispatch of concurrent direct IO writes to the same inode.

And so you can see that for the third concurrent IO, we'd avoid
exclusive locking for the same reason we avoided the exclusive lock
for the second IO.

Fixing this is a bit more complex than it first appears, because we
need to hold a write-submission-local value of ip->i_new_size so that
clearing the value is only done if no other thread has updated it
before our IO completes.
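
The shape of the fix, drawn from the patch below (locking around the
submission-side update and all error handling elided): each write
submission remembers the new size it set, and the post-write update
only clears ip->i_new_size if no later submission has moved it:

	/* at write submission */
	new_size = *pos + *count;
	if (new_size > ip->i_size) {
		if (new_size > ip->i_new_size)
			ip->i_new_size = new_size;
		*new_sizep = new_size;	/* remembered locally by this writer */
	}

	/* after the write returns */
	if (new_size == ip->i_new_size) {
		xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
		if (new_size == ip->i_new_size)
			ip->i_new_size = 0;
		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
	}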

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c |   68 ++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 52 insertions(+), 16 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a1dea10..4c64eb0 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -418,11 +418,13 @@ xfs_aio_write_isize_update(
  */
 STATIC void
 xfs_aio_write_newsize_update(
-	struct xfs_inode	*ip)
+	struct xfs_inode	*ip,
+	xfs_fsize_t		new_size)
 {
-	if (ip->i_new_size) {
+	if (new_size == ip->i_new_size) {
 		xfs_rw_ilock(ip, XFS_ILOCK_EXCL);
-		ip->i_new_size = 0;
+		if (new_size == ip->i_new_size)
+			ip->i_new_size = 0;
 		if (ip->i_d.di_size > ip->i_size)
 			ip->i_d.di_size = ip->i_size;
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
@@ -473,7 +475,7 @@ xfs_file_splice_write(
 	ret = generic_file_splice_write(pipe, outfilp, ppos, count, flags);
 
 	xfs_aio_write_isize_update(inode, ppos, ret);
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_iunlock(ip, XFS_IOLOCK_EXCL);
 	return ret;
 }
@@ -670,6 +672,7 @@ xfs_file_aio_write_checks(
 	struct file		*file,
 	loff_t			*pos,
 	size_t			*count,
+	xfs_fsize_t		*new_sizep,
 	int			*iolock)
 {
 	struct inode		*inode = file->f_mapping->host;
@@ -677,6 +680,8 @@ xfs_file_aio_write_checks(
 	xfs_fsize_t		new_size;
 	int			error = 0;
 
+	*new_sizep = 0;
+restart:
 	error = generic_write_checks(file, pos, count, S_ISBLK(inode->i_mode));
 	if (error) {
 		xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
@@ -684,20 +689,41 @@ xfs_file_aio_write_checks(
 		return error;
 	}
 
-	new_size = *pos + *count;
-	if (new_size > ip->i_size)
-		ip->i_new_size = new_size;
-
 	if (likely(!(file->f_mode & FMODE_NOCMTIME)))
 		file_update_time(file);
 
 	/*
 	 * If the offset is beyond the size of the file, we need to zero any
 	 * blocks that fall between the existing EOF and the start of this
-	 * write.
+	 * write. There is no need to issue zeroing if another in-flight IO ends
+	 * at or before this one. If zeroing is needed and we are currently
+	 * holding the iolock shared, we need to update it to exclusive which
+	 * involves dropping all locks and relocking to maintain correct locking
+	 * order. If we do this, restart the function to ensure all checks and
+	 * values are still valid.
 	 */
-	if (*pos > ip->i_size)
+	if ((ip->i_new_size && *pos > ip->i_new_size) ||
+	    (!ip->i_new_size && *pos > ip->i_size)) {
+		if (*iolock == XFS_IOLOCK_SHARED) {
+			xfs_rw_iunlock(ip, XFS_ILOCK_EXCL | *iolock);
+			*iolock = XFS_IOLOCK_EXCL;
+			xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
+			goto restart;
+		}
 		error = -xfs_zero_eof(ip, *pos, ip->i_size);
+	}
+
+	/*
+	 * If this IO extends beyond EOF, we may need to update ip->i_new_size.
+	 * We have already zeroed space beyond EOF (if necessary).  Only update
+	 * ip->i_new_size if this IO ends beyond any other in-flight writes.
+	 */
+	new_size = *pos + *count;
+	if (new_size > ip->i_size) {
+		if (new_size > ip->i_new_size)
+			ip->i_new_size = new_size;
+		*new_sizep = new_size;
+	}
 
 	xfs_rw_iunlock(ip, XFS_ILOCK_EXCL);
 	if (error)
@@ -744,6 +770,7 @@ xfs_file_dio_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -764,13 +791,20 @@ xfs_file_dio_aio_write(
 	if ((pos & mp->m_blockmask) || ((pos + count) & mp->m_blockmask))
 		unaligned_io = 1;
 
-	if (unaligned_io || mapping->nrpages || pos > ip->i_size)
+	/*
+	 * We don't need to take an exclusive lock unless the page cache needs
+	 * to be invalidated or unaligned IO is being executed. We don't need to
+	 * consider the EOF extension case here because
+	 * xfs_file_aio_write_checks() will relock the inode as necessary for
+	 * EOF zeroing cases and fill out the new inode size as appropriate.
+	 */
+	if (unaligned_io || mapping->nrpages)
 		*iolock = XFS_IOLOCK_EXCL;
 	else
 		*iolock = XFS_IOLOCK_SHARED;
 	xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -809,6 +843,7 @@ xfs_file_buffered_aio_write(
 	unsigned long		nr_segs,
 	loff_t			pos,
 	size_t			ocount,
+	xfs_fsize_t		*new_size,
 	int			*iolock)
 {
 	struct file		*file = iocb->ki_filp;
@@ -822,7 +857,7 @@ xfs_file_buffered_aio_write(
 	*iolock = XFS_IOLOCK_EXCL;
 	xfs_rw_ilock(ip, XFS_ILOCK_EXCL | *iolock);
 
-	ret = xfs_file_aio_write_checks(file, &pos, &count, iolock);
+	ret = xfs_file_aio_write_checks(file, &pos, &count, new_size, iolock);
 	if (ret)
 		return ret;
 
@@ -862,6 +897,7 @@ xfs_file_aio_write(
 	ssize_t			ret;
 	int			iolock;
 	size_t			ocount = 0;
+	xfs_fsize_t		new_size = 0;
 
 	XFS_STATS_INC(xs_write_calls);
 
@@ -881,10 +917,10 @@ xfs_file_aio_write(
 
 	if (unlikely(file->f_flags & O_DIRECT))
 		ret = xfs_file_dio_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 	else
 		ret = xfs_file_buffered_aio_write(iocb, iovp, nr_segs, pos,
-						ocount, &iolock);
+						ocount, &new_size, &iolock);
 
 	xfs_aio_write_isize_update(inode, &iocb->ki_pos, ret);
 
@@ -905,7 +941,7 @@ xfs_file_aio_write(
 	}
 
 out_unlock:
-	xfs_aio_write_newsize_update(ip);
+	xfs_aio_write_newsize_update(ip, new_size);
 	xfs_rw_iunlock(ip, iolock);
 	return ret;
 }
-- 
1.7.5.4

* [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
  2011-08-25  7:17 ` [PATCH 1/6] xfs: don't serialise direct IO reads on page cache checks Dave Chinner
  2011-08-25  7:17 ` [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25 20:56   ` Alex Elder
  2011-08-25  7:17 ` [PATCH 4/6] xfs: reduce the number of log forces from tail pushing Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
~1 million cache hit lookups to ~3000 buffer creates. That's almost
3 orders of magnitude more cache hits than misses, so optimising for
cache hits is quite important. In the cache hit case, we do not need
to allocate a new buffer in case of a cache miss, so we are
effectively hitting the allocator for no good reason for the vast
majority of calls to _xfs_buf_find. 8-way create workloads are
showing similar cache hit/miss ratios.

The result is profiles that look like this:

     samples  pcnt function                        DSO
     _______ _____ _______________________________ _________________

     1036.00 10.0% _xfs_buf_find                   [kernel.kallsyms]
      582.00  5.6% kmem_cache_alloc                [kernel.kallsyms]
      519.00  5.0% __memcpy                        [kernel.kallsyms]
      468.00  4.5% __ticket_spin_lock              [kernel.kallsyms]
      388.00  3.7% kmem_cache_free                 [kernel.kallsyms]
      331.00  3.2% xfs_log_commit_cil              [kernel.kallsyms]


Further, there is a fair bit of work involved in initialising a new
buffer once a cache miss has occurred and we currently do that under
the rbtree spinlock. That increases spinlock hold time on what are
heavily used trees.

To fix this, remove the initialisation of the buffer from
_xfs_buf_find() and only allocate the new buffer once we've had a
cache miss. Initialise the buffer immediately after allocating it in
xfs_buf_get, too, so that it is ready for insert if we get another
cache miss after allocation. This minimises lock hold time and
avoids unnecessary allocator churn. The resulting profiles look
like:

     samples  pcnt function                    DSO
     _______ _____ ___________________________ _________________

     8111.00  9.1% _xfs_buf_find               [kernel.kallsyms]
     4380.00  4.9% __memcpy                    [kernel.kallsyms]
     4341.00  4.8% __ticket_spin_lock          [kernel.kallsyms]
     3401.00  3.8% kmem_cache_alloc            [kernel.kallsyms]
     2856.00  3.2% xfs_log_commit_cil          [kernel.kallsyms]
     2625.00  2.9% __kmalloc                   [kernel.kallsyms]
     2380.00  2.7% kfree                       [kernel.kallsyms]
     2016.00  2.3% kmem_cache_free             [kernel.kallsyms]

Showing a significant reduction in time spent doing allocation and
freeing from slabs (kmem_cache_alloc and kmem_cache_free).
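
The resulting xfs_buf_get() fast path boils down to the following
(simplified from the patch below, with error handling and page
mapping elided):

	bp = _xfs_buf_find(target, bno, num_blocks, flags, NULL);
	if (likely(bp))
		goto found;		/* cache hit: no allocation at all */

	/* cache miss: allocate and initialise outside the rbtree lock */
	new_bp = xfs_buf_allocate(flags);
	_xfs_buf_initialize(new_bp, target, bno, num_blocks, flags);

	/* repeat the lookup, inserting new_bp if we still miss */
	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
	if (bp != new_bp)
		xfs_buf_deallocate(new_bp);	/* raced: use the cached buffer */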

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c |   87 +++++++++++++++++++++++++++++++-----------------------
 1 files changed, 50 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index c57836d..6fffa06 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -171,10 +171,16 @@ STATIC void
 _xfs_buf_initialize(
 	xfs_buf_t		*bp,
 	xfs_buftarg_t		*target,
-	xfs_off_t		range_base,
-	size_t			range_length,
+	xfs_off_t		bno,
+	size_t			num_blocks,
 	xfs_buf_flags_t		flags)
 {
+	xfs_off_t		range_base;
+	size_t			range_length;
+
+	range_base = BBTOB(bno);
+	range_length = BBTOB(num_blocks);
+
 	/*
 	 * We don't want certain flags to appear in b_flags.
 	 */
@@ -423,9 +429,9 @@ _xfs_buf_map_pages(
  */
 xfs_buf_t *
 _xfs_buf_find(
-	xfs_buftarg_t		*btp,	/* block device target		*/
-	xfs_off_t		ioff,	/* starting offset of range	*/
-	size_t			isize,	/* length of range		*/
+	xfs_buftarg_t		*btp,
+	xfs_off_t		bno,
+	size_t			num_blocks,
 	xfs_buf_flags_t		flags,
 	xfs_buf_t		*new_bp)
 {
@@ -436,8 +442,8 @@ _xfs_buf_find(
 	struct rb_node		*parent;
 	xfs_buf_t		*bp;
 
-	range_base = (ioff << BBSHIFT);
-	range_length = (isize << BBSHIFT);
+	range_base = BBTOB(bno);
+	range_length = BBTOB(num_blocks);
 
 	/* Check for IOs smaller than the sector size / not sector aligned */
 	ASSERT(!(range_length < (1 << btp->bt_sshift)));
@@ -445,7 +451,7 @@ _xfs_buf_find(
 
 	/* get tree root */
 	pag = xfs_perag_get(btp->bt_mount,
-				xfs_daddr_to_agno(btp->bt_mount, ioff));
+				xfs_daddr_to_agno(btp->bt_mount, bno));
 
 	/* walk tree */
 	spin_lock(&pag->pag_buf_lock);
@@ -481,8 +487,6 @@ _xfs_buf_find(
 
 	/* No match found */
 	if (new_bp) {
-		_xfs_buf_initialize(new_bp, btp, range_base,
-				range_length, flags);
 		rb_link_node(&new_bp->b_rbnode, parent, rbp);
 		rb_insert_color(&new_bp->b_rbnode, &pag->pag_buf_tree);
 		/* the buffer keeps the perag reference until it is freed */
@@ -525,34 +529,43 @@ found:
 }
 
 /*
- *	Assembles a buffer covering the specified range.
- *	Storage in memory for all portions of the buffer will be allocated,
- *	although backing storage may not be.
+ * Assembles a buffer covering the specified range. The code needs to be
+ * optimised for cache hits, as metadata intensive workloads will see 3 orders
+ * of magnitude more hits than misses.
  */
-xfs_buf_t *
+struct xfs_buf *
 xfs_buf_get(
-	xfs_buftarg_t		*target,/* target for buffer		*/
-	xfs_off_t		ioff,	/* starting offset of range	*/
-	size_t			isize,	/* length of range		*/
+	struct xfs_buftarg	*target,
+	xfs_off_t		bno,
+	size_t			num_blocks,
 	xfs_buf_flags_t		flags)
 {
-	xfs_buf_t		*bp, *new_bp;
+	struct xfs_buf		*bp;
+	struct xfs_buf		*new_bp = NULL;
 	int			error = 0;
 
+	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
+	if (likely(bp))
+		goto found;
+
 	new_bp = xfs_buf_allocate(flags);
 	if (unlikely(!new_bp))
 		return NULL;
 
-	bp = _xfs_buf_find(target, ioff, isize, flags, new_bp);
+	_xfs_buf_initialize(new_bp, target, bno, num_blocks, flags);
+
+	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
+	if (!bp) {
+		xfs_buf_deallocate(new_bp);
+		return NULL;
+	}
+
 	if (bp == new_bp) {
 		error = xfs_buf_allocate_memory(bp, flags);
 		if (error)
 			goto no_buffer;
-	} else {
+	} else
 		xfs_buf_deallocate(new_bp);
-		if (unlikely(bp == NULL))
-			return NULL;
-	}
 
 	if (!(bp->b_flags & XBF_MAPPED)) {
 		error = _xfs_buf_map_pages(bp, flags);
@@ -563,19 +576,19 @@ xfs_buf_get(
 		}
 	}
 
-	XFS_STATS_INC(xb_get);
-
 	/*
-	 * Always fill in the block number now, the mapped cases can do
-	 * their own overlay of this later.
+	 * Now we have a workable buffer, fill in the block number so
+	 * that we can do IO on it.
 	 */
-	bp->b_bn = ioff;
-	bp->b_count_desired = bp->b_buffer_length;
+	bp->b_bn = bno;
 
+found:
+	ASSERT(bp->b_flags & XBF_MAPPED);
+	XFS_STATS_INC(xb_get);
 	trace_xfs_buf_get(bp, flags, _RET_IP_);
 	return bp;
 
- no_buffer:
+no_buffer:
 	if (flags & (XBF_LOCK | XBF_TRYLOCK))
 		xfs_buf_unlock(bp);
 	xfs_buf_rele(bp);
@@ -604,15 +617,15 @@ _xfs_buf_read(
 xfs_buf_t *
 xfs_buf_read(
 	xfs_buftarg_t		*target,
-	xfs_off_t		ioff,
-	size_t			isize,
+	xfs_off_t		bno,
+	size_t			num_blocks,
 	xfs_buf_flags_t		flags)
 {
 	xfs_buf_t		*bp;
 
 	flags |= XBF_READ;
 
-	bp = xfs_buf_get(target, ioff, isize, flags);
+	bp = xfs_buf_get(target, bno, num_blocks, flags);
 	if (bp) {
 		trace_xfs_buf_read(bp, flags, _RET_IP_);
 
@@ -647,13 +660,13 @@ xfs_buf_read(
 void
 xfs_buf_readahead(
 	xfs_buftarg_t		*target,
-	xfs_off_t		ioff,
-	size_t			isize)
+	xfs_off_t		bno,
+	size_t			num_blocks)
 {
 	if (bdi_read_congested(target->bt_bdi))
 		return;
 
-	xfs_buf_read(target, ioff, isize,
+	xfs_buf_read(target, bno, num_blocks,
 		     XBF_TRYLOCK|XBF_ASYNC|XBF_READ_AHEAD|XBF_DONT_BLOCK);
 }
 
@@ -790,7 +803,7 @@ xfs_buf_get_uncached(
 	bp = xfs_buf_allocate(0);
 	if (unlikely(bp == NULL))
 		goto fail;
-	_xfs_buf_initialize(bp, target, 0, len, 0);
+	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);
 
 	error = _xfs_buf_get_pages(bp, page_count, 0);
 	if (error)
-- 
1.7.5.4

* [PATCH 4/6] xfs: reduce the number of log forces from tail pushing
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
                   ` (2 preceding siblings ...)
  2011-08-25  7:17 ` [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25 20:57   ` Alex Elder
  2011-08-25  7:17 ` [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code Dave Chinner
  2011-08-25  7:17 ` [PATCH 6/6] xfs: convert xfsbufd to use a workqueue Dave Chinner
  5 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

The AIL push code will issue a log force on every single push loop
that it exits having encountered pinned items. It doesn't rescan
these pinned items until it revisits the AIL from the start. Hence
we only need to force the log once per walk from the start of the
AIL to the target LSN.

This results in numbers like this:

	xs_push_ail_flush.....         1456
	xs_log_force.........          1485

For an 8-way 50M inode create workload - almost all the log forces
are coming from the AIL pushing code.

Reduce the number of log forces by only forcing the log if the
previous walk found pinned buffers. This reduces the numbers to:

	xs_push_ail_flush.....          665
	xs_log_force.........           682

For the same test.
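
The mechanism is a per-AIL count of pinned items seen during a walk,
checked at the start of the next walk from the head of the AIL. In
sketch form (drawn from the patch below, locking elided):

	/* during the push loop: note pinned items instead of forcing */
	case XFS_ITEM_PINNED:
		XFS_STATS_INC(xs_push_ail_pinned);
		stuck++;
		ailp->xa_log_flush++;
		break;

	/* at the start of the next walk from the beginning of the AIL */
	if (ailp->xa_last_pushed_lsn == 0 && ailp->xa_log_flush &&
	    !list_empty(&ailp->xa_ail)) {
		ailp->xa_log_flush = 0;
		XFS_STATS_INC(xs_push_ail_flush);
		xfs_log_force(mp, XFS_LOG_SYNC);
	}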

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trans_ail.c  |   28 ++++++++++++++++------------
 fs/xfs/xfs_trans_priv.h |    1 +
 2 files changed, 17 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index c15aa29..dd966e0 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -372,12 +372,24 @@ xfs_ail_worker(
 	xfs_lsn_t		lsn;
 	xfs_lsn_t		target;
 	long			tout = 10;
-	int			flush_log = 0;
 	int			stuck = 0;
 	int			count = 0;
 	int			push_xfsbufd = 0;
 
+	/*
+	 * If last time we ran we encountered pinned items, force the log first,
+	 * wait for it and then push again.
+	 */
 	spin_lock(&ailp->xa_lock);
+	if (ailp->xa_last_pushed_lsn == 0 && ailp->xa_log_flush &&
+	    !list_empty(&ailp->xa_ail)) {
+		ailp->xa_log_flush = 0;
+		spin_unlock(&ailp->xa_lock);
+		XFS_STATS_INC(xs_push_ail_flush);
+		xfs_log_force(mp, XFS_LOG_SYNC);
+		spin_lock(&ailp->xa_lock);
+	}
+
 	target = ailp->xa_target;
 	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
 	if (!lip || XFS_FORCED_SHUTDOWN(mp)) {
@@ -391,6 +403,7 @@ xfs_ail_worker(
 
 	XFS_STATS_INC(xs_push_ail);
 
+
 	/*
 	 * While the item we are looking at is below the given threshold
 	 * try to flush it out. We'd like not to stop until we've at least
@@ -435,7 +448,7 @@ xfs_ail_worker(
 		case XFS_ITEM_PINNED:
 			XFS_STATS_INC(xs_push_ail_pinned);
 			stuck++;
-			flush_log = 1;
+			ailp->xa_log_flush++;
 			break;
 
 		case XFS_ITEM_LOCKED:
@@ -480,16 +493,6 @@ xfs_ail_worker(
 	xfs_trans_ail_cursor_done(ailp, &cur);
 	spin_unlock(&ailp->xa_lock);
 
-	if (flush_log) {
-		/*
-		 * If something we need to push out was pinned, then
-		 * push out the log so it will become unpinned and
-		 * move forward in the AIL.
-		 */
-		XFS_STATS_INC(xs_push_ail_flush);
-		xfs_log_force(mp, 0);
-	}
-
 	if (push_xfsbufd) {
 		/* we've got delayed write buffers to flush */
 		wake_up_process(mp->m_ddev_targp->bt_task);
@@ -500,6 +503,7 @@ out_done:
 	if (!count) {
 		/* We're past our target or empty, so idle */
 		ailp->xa_last_pushed_lsn = 0;
+		ailp->xa_log_flush = 0;
 
 		/*
 		 * We clear the XFS_AIL_PUSHING_BIT first before checking
diff --git a/fs/xfs/xfs_trans_priv.h b/fs/xfs/xfs_trans_priv.h
index 212946b..0a6eec6 100644
--- a/fs/xfs/xfs_trans_priv.h
+++ b/fs/xfs/xfs_trans_priv.h
@@ -71,6 +71,7 @@ struct xfs_ail {
 	struct delayed_work	xa_work;
 	xfs_lsn_t		xa_last_pushed_lsn;
 	unsigned long		xa_flags;
+	int			xa_log_flush;
 };
 
 #define XFS_AIL_PUSHING_BIT	0
-- 
1.7.5.4

* [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
                   ` (3 preceding siblings ...)
  2011-08-25  7:17 ` [PATCH 4/6] xfs: reduce the number of log forces from tail pushing Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25 20:57   ` Alex Elder
  2011-08-25  7:17 ` [PATCH 6/6] xfs: convert xfsbufd to use a workqueue Dave Chinner
  5 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

In preparation for converting the xfsbufd to a workqueue, re-arrange
all the delayed write queue manipulation code above the buftarg
sections.  This allows us to avoid the need for forward declarations
of the new xfsbufd worker functions. This is pure code movement; no
changes to the code have been made.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c |  387 +++++++++++++++++++++++++++---------------------------
 1 files changed, 194 insertions(+), 193 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 6fffa06..410de9f 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -1383,199 +1383,6 @@ xfs_buf_iomove(
 }
 
 /*
- *	Handling of buffer targets (buftargs).
- */
-
-/*
- * Wait for any bufs with callbacks that have been submitted but have not yet
- * returned. These buffers will have an elevated hold count, so wait on those
- * while freeing all the buffers only held by the LRU.
- */
-void
-xfs_wait_buftarg(
-	struct xfs_buftarg	*btp)
-{
-	struct xfs_buf		*bp;
-
-restart:
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-		if (atomic_read(&bp->b_hold) > 1) {
-			spin_unlock(&btp->bt_lru_lock);
-			delay(100);
-			goto restart;
-		}
-		/*
-		 * clear the LRU reference count so the bufer doesn't get
-		 * ignored in xfs_buf_rele().
-		 */
-		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(&btp->bt_lru_lock);
-		xfs_buf_rele(bp);
-		spin_lock(&btp->bt_lru_lock);
-	}
-	spin_unlock(&btp->bt_lru_lock);
-}
-
-int
-xfs_buftarg_shrink(
-	struct shrinker		*shrink,
-	struct shrink_control	*sc)
-{
-	struct xfs_buftarg	*btp = container_of(shrink,
-					struct xfs_buftarg, bt_shrinker);
-	struct xfs_buf		*bp;
-	int nr_to_scan = sc->nr_to_scan;
-	LIST_HEAD(dispose);
-
-	if (!nr_to_scan)
-		return btp->bt_lru_nr;
-
-	spin_lock(&btp->bt_lru_lock);
-	while (!list_empty(&btp->bt_lru)) {
-		if (nr_to_scan-- <= 0)
-			break;
-
-		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
-
-		/*
-		 * Decrement the b_lru_ref count unless the value is already
-		 * zero. If the value is already zero, we need to reclaim the
-		 * buffer, otherwise it gets another trip through the LRU.
-		 */
-		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
-			list_move_tail(&bp->b_lru, &btp->bt_lru);
-			continue;
-		}
-
-		/*
-		 * remove the buffer from the LRU now to avoid needing another
-		 * lock round trip inside xfs_buf_rele().
-		 */
-		list_move(&bp->b_lru, &dispose);
-		btp->bt_lru_nr--;
-	}
-	spin_unlock(&btp->bt_lru_lock);
-
-	while (!list_empty(&dispose)) {
-		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
-		list_del_init(&bp->b_lru);
-		xfs_buf_rele(bp);
-	}
-
-	return btp->bt_lru_nr;
-}
-
-void
-xfs_free_buftarg(
-	struct xfs_mount	*mp,
-	struct xfs_buftarg	*btp)
-{
-	unregister_shrinker(&btp->bt_shrinker);
-
-	xfs_flush_buftarg(btp, 1);
-	if (mp->m_flags & XFS_MOUNT_BARRIER)
-		xfs_blkdev_issue_flush(btp);
-
-	kthread_stop(btp->bt_task);
-	kmem_free(btp);
-}
-
-STATIC int
-xfs_setsize_buftarg_flags(
-	xfs_buftarg_t		*btp,
-	unsigned int		blocksize,
-	unsigned int		sectorsize,
-	int			verbose)
-{
-	btp->bt_bsize = blocksize;
-	btp->bt_sshift = ffs(sectorsize) - 1;
-	btp->bt_smask = sectorsize - 1;
-
-	if (set_blocksize(btp->bt_bdev, sectorsize)) {
-		xfs_warn(btp->bt_mount,
-			"Cannot set_blocksize to %u on device %s\n",
-			sectorsize, xfs_buf_target_name(btp));
-		return EINVAL;
-	}
-
-	return 0;
-}
-
-/*
- *	When allocating the initial buffer target we have not yet
- *	read in the superblock, so don't know what sized sectors
- *	are being used is at this early stage.  Play safe.
- */
-STATIC int
-xfs_setsize_buftarg_early(
-	xfs_buftarg_t		*btp,
-	struct block_device	*bdev)
-{
-	return xfs_setsize_buftarg_flags(btp,
-			PAGE_SIZE, bdev_logical_block_size(bdev), 0);
-}
-
-int
-xfs_setsize_buftarg(
-	xfs_buftarg_t		*btp,
-	unsigned int		blocksize,
-	unsigned int		sectorsize)
-{
-	return xfs_setsize_buftarg_flags(btp, blocksize, sectorsize, 1);
-}
-
-STATIC int
-xfs_alloc_delwrite_queue(
-	xfs_buftarg_t		*btp,
-	const char		*fsname)
-{
-	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
-	spin_lock_init(&btp->bt_delwrite_lock);
-	btp->bt_flags = 0;
-	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
-	if (IS_ERR(btp->bt_task))
-		return PTR_ERR(btp->bt_task);
-	return 0;
-}
-
-xfs_buftarg_t *
-xfs_alloc_buftarg(
-	struct xfs_mount	*mp,
-	struct block_device	*bdev,
-	int			external,
-	const char		*fsname)
-{
-	xfs_buftarg_t		*btp;
-
-	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
-
-	btp->bt_mount = mp;
-	btp->bt_dev =  bdev->bd_dev;
-	btp->bt_bdev = bdev;
-	btp->bt_bdi = blk_get_backing_dev_info(bdev);
-	if (!btp->bt_bdi)
-		goto error;
-
-	INIT_LIST_HEAD(&btp->bt_lru);
-	spin_lock_init(&btp->bt_lru_lock);
-	if (xfs_setsize_buftarg_early(btp, bdev))
-		goto error;
-	if (xfs_alloc_delwrite_queue(btp, fsname))
-		goto error;
-	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
-	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
-	register_shrinker(&btp->bt_shrinker);
-	return btp;
-
-error:
-	kmem_free(btp);
-	return NULL;
-}
-
-
-/*
  *	Delayed write buffer handling
  */
 STATIC void
@@ -1781,6 +1588,10 @@ xfsbufd(
 }
 
 /*
+ *	Handling of buffer targets (buftargs).
+ */
+
+/*
  *	Go through all incore buffers, and release buffers if they belong to
  *	the given device. This is used in filesystem error handling to
  *	preserve the consistency of its metadata.
@@ -1837,6 +1648,196 @@ xfs_flush_buftarg(
 	return pincount;
 }
 
+/*
+ * Wait for any bufs with callbacks that have been submitted but have not yet
+ * returned. These buffers will have an elevated hold count, so wait on those
+ * while freeing all the buffers only held by the LRU.
+ */
+void
+xfs_wait_buftarg(
+	struct xfs_buftarg	*btp)
+{
+	struct xfs_buf		*bp;
+
+restart:
+	spin_lock(&btp->bt_lru_lock);
+	while (!list_empty(&btp->bt_lru)) {
+		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
+		if (atomic_read(&bp->b_hold) > 1) {
+			spin_unlock(&btp->bt_lru_lock);
+			delay(100);
+			goto restart;
+		}
+		/*
+		 * clear the LRU reference count so the bufer doesn't get
+		 * ignored in xfs_buf_rele().
+		 */
+		atomic_set(&bp->b_lru_ref, 0);
+		spin_unlock(&btp->bt_lru_lock);
+		xfs_buf_rele(bp);
+		spin_lock(&btp->bt_lru_lock);
+	}
+	spin_unlock(&btp->bt_lru_lock);
+}
+
+int
+xfs_buftarg_shrink(
+	struct shrinker		*shrink,
+	struct shrink_control	*sc)
+{
+	struct xfs_buftarg	*btp = container_of(shrink,
+					struct xfs_buftarg, bt_shrinker);
+	struct xfs_buf		*bp;
+	int nr_to_scan = sc->nr_to_scan;
+	LIST_HEAD(dispose);
+
+	if (!nr_to_scan)
+		return btp->bt_lru_nr;
+
+	spin_lock(&btp->bt_lru_lock);
+	while (!list_empty(&btp->bt_lru)) {
+		if (nr_to_scan-- <= 0)
+			break;
+
+		bp = list_first_entry(&btp->bt_lru, struct xfs_buf, b_lru);
+
+		/*
+		 * Decrement the b_lru_ref count unless the value is already
+		 * zero. If the value is already zero, we need to reclaim the
+		 * buffer, otherwise it gets another trip through the LRU.
+		 */
+		if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
+			list_move_tail(&bp->b_lru, &btp->bt_lru);
+			continue;
+		}
+
+		/*
+		 * remove the buffer from the LRU now to avoid needing another
+		 * lock round trip inside xfs_buf_rele().
+		 */
+		list_move(&bp->b_lru, &dispose);
+		btp->bt_lru_nr--;
+	}
+	spin_unlock(&btp->bt_lru_lock);
+
+	while (!list_empty(&dispose)) {
+		bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
+		list_del_init(&bp->b_lru);
+		xfs_buf_rele(bp);
+	}
+
+	return btp->bt_lru_nr;
+}
+
+void
+xfs_free_buftarg(
+	struct xfs_mount	*mp,
+	struct xfs_buftarg	*btp)
+{
+	unregister_shrinker(&btp->bt_shrinker);
+
+	xfs_flush_buftarg(btp, 1);
+	if (mp->m_flags & XFS_MOUNT_BARRIER)
+		xfs_blkdev_issue_flush(btp);
+
+	kthread_stop(btp->bt_task);
+	kmem_free(btp);
+}
+
+STATIC int
+xfs_setsize_buftarg_flags(
+	xfs_buftarg_t		*btp,
+	unsigned int		blocksize,
+	unsigned int		sectorsize,
+	int			verbose)
+{
+	btp->bt_bsize = blocksize;
+	btp->bt_sshift = ffs(sectorsize) - 1;
+	btp->bt_smask = sectorsize - 1;
+
+	if (set_blocksize(btp->bt_bdev, sectorsize)) {
+		xfs_warn(btp->bt_mount,
+			"Cannot set_blocksize to %u on device %s\n",
+			sectorsize, xfs_buf_target_name(btp));
+		return EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ *	When allocating the initial buffer target we have not yet
+ *	read in the superblock, so don't know what sized sectors
+ *	are being used is at this early stage.  Play safe.
+ */
+STATIC int
+xfs_setsize_buftarg_early(
+	xfs_buftarg_t		*btp,
+	struct block_device	*bdev)
+{
+	return xfs_setsize_buftarg_flags(btp,
+			PAGE_SIZE, bdev_logical_block_size(bdev), 0);
+}
+
+int
+xfs_setsize_buftarg(
+	xfs_buftarg_t		*btp,
+	unsigned int		blocksize,
+	unsigned int		sectorsize)
+{
+	return xfs_setsize_buftarg_flags(btp, blocksize, sectorsize, 1);
+}
+
+STATIC int
+xfs_alloc_delwrite_queue(
+	xfs_buftarg_t		*btp,
+	const char		*fsname)
+{
+	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
+	spin_lock_init(&btp->bt_delwrite_lock);
+	btp->bt_flags = 0;
+	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
+	if (IS_ERR(btp->bt_task))
+		return PTR_ERR(btp->bt_task);
+	return 0;
+}
+
+xfs_buftarg_t *
+xfs_alloc_buftarg(
+	struct xfs_mount	*mp,
+	struct block_device	*bdev,
+	int			external,
+	const char		*fsname)
+{
+	xfs_buftarg_t		*btp;
+
+	btp = kmem_zalloc(sizeof(*btp), KM_SLEEP);
+
+	btp->bt_mount = mp;
+	btp->bt_dev =  bdev->bd_dev;
+	btp->bt_bdev = bdev;
+	btp->bt_bdi = blk_get_backing_dev_info(bdev);
+	if (!btp->bt_bdi)
+		goto error;
+
+	INIT_LIST_HEAD(&btp->bt_lru);
+	spin_lock_init(&btp->bt_lru_lock);
+	if (xfs_setsize_buftarg_early(btp, bdev))
+		goto error;
+	if (xfs_alloc_delwrite_queue(btp, fsname))
+		goto error;
+	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
+	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
+	register_shrinker(&btp->bt_shrinker);
+	return btp;
+
+error:
+	kmem_free(btp);
+	return NULL;
+}
+
+
+
 int __init
 xfs_buf_init(void)
 {
-- 
1.7.5.4

* [PATCH 6/6] xfs: convert xfsbufd to use a workqueue
  2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
                   ` (4 preceding siblings ...)
  2011-08-25  7:17 ` [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code Dave Chinner
@ 2011-08-25  7:17 ` Dave Chinner
  2011-08-25 20:57   ` Alex Elder
  5 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25  7:17 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

There is no reason we need a thread per filesystem to do the
flushing of the delayed write buffer queue. This can be easily
handled by a global concurrency managed workqueue.

Convert the delayed write buffer handling to use workqueues and
workqueue flushes to implement buffer writeback by embedding a
delayed work structure into the struct xfs_buftarg and using that to
control flushing.  This greatly simplifies the process of flushing
and also removes a bunch of duplicated code between buftarg flushing
and delwri buffer writeback.
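
In outline, the conversion embeds a delayed work item in each buftarg
and uses a single global workqueue. A summary of the moving parts in
the patch below:

	/* one global mem-reclaim workqueue for all buftargs */
	xfs_buf_wq = alloc_workqueue("xfsbufd", WQ_MEM_RECLAIM, 8);

	/* a per-target delayed work item replaces the xfsbufd thread */
	INIT_DELAYED_WORK(&btp->bt_delwrite_work, xfs_buf_delwri_work);

	/* queueing the first delwri buffer schedules a delayed flush */
	queue_delayed_work(xfs_buf_wq, &btp->bt_delwrite_work,
			xfs_buf_timer_centisecs * msecs_to_jiffies(10));

	/* a blocking flush just flushes the work item synchronously */
	set_bit(XBT_FORCE_FLUSH, &btp->bt_flags);
	flush_delayed_work_sync(&btp->bt_delwrite_work);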

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf.c       |  165 ++++++++++++++++++++----------------------------
 fs/xfs/xfs_buf.h       |    5 +-
 fs/xfs/xfs_dquot.c     |    1 -
 fs/xfs/xfs_trans_ail.c |    2 +-
 4 files changed, 72 insertions(+), 101 deletions(-)

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 410de9f..b1b8c0c 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -42,9 +42,9 @@
 #include "xfs_trace.h"
 
 static kmem_zone_t *xfs_buf_zone;
-STATIC int xfsbufd(void *);
-STATIC void xfs_buf_delwri_queue(xfs_buf_t *, int);
+STATIC void xfs_buf_delwri_queue(xfs_buf_t *bp, int unlock);
 
+static struct workqueue_struct *xfs_buf_wq;
 static struct workqueue_struct *xfslogd_workqueue;
 struct workqueue_struct *xfsdatad_workqueue;
 struct workqueue_struct *xfsconvertd_workqueue;
@@ -1407,8 +1407,9 @@ xfs_buf_delwri_queue(
 	}
 
 	if (list_empty(dwq)) {
-		/* start xfsbufd as it is about to have something to do */
-		wake_up_process(bp->b_target->bt_task);
+		/* queue a delayed flush as we are about to queue a buffer */
+		queue_delayed_work(xfs_buf_wq, &bp->b_target->bt_delwrite_work,
+			xfs_buf_timer_centisecs * msecs_to_jiffies(10));
 	}
 
 	bp->b_flags |= _XBF_DELWRI_Q;
@@ -1486,13 +1487,13 @@ STATIC int
 xfs_buf_delwri_split(
 	xfs_buftarg_t	*target,
 	struct list_head *list,
-	unsigned long	age)
+	unsigned long	age,
+	int		force)
 {
 	xfs_buf_t	*bp, *n;
 	struct list_head *dwq = &target->bt_delwrite_queue;
 	spinlock_t	*dwlk = &target->bt_delwrite_lock;
 	int		skipped = 0;
-	int		force;
 
 	force = test_and_clear_bit(XBT_FORCE_FLUSH, &target->bt_flags);
 	INIT_LIST_HEAD(list);
@@ -1543,90 +1544,36 @@ xfs_buf_cmp(
 	return 0;
 }
 
-STATIC int
-xfsbufd(
-	void		*data)
-{
-	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;
-
-	current->flags |= PF_MEMALLOC;
-
-	set_freezable();
-
-	do {
-		long	age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
-		long	tout = xfs_buf_timer_centisecs * msecs_to_jiffies(10);
-		struct list_head tmp;
-		struct blk_plug plug;
-
-		if (unlikely(freezing(current))) {
-			set_bit(XBT_FORCE_SLEEP, &target->bt_flags);
-			refrigerator();
-		} else {
-			clear_bit(XBT_FORCE_SLEEP, &target->bt_flags);
-		}
-
-		/* sleep for a long time if there is nothing to do. */
-		if (list_empty(&target->bt_delwrite_queue))
-			tout = MAX_SCHEDULE_TIMEOUT;
-		schedule_timeout_interruptible(tout);
-
-		xfs_buf_delwri_split(target, &tmp, age);
-		list_sort(NULL, &tmp, xfs_buf_cmp);
-
-		blk_start_plug(&plug);
-		while (!list_empty(&tmp)) {
-			struct xfs_buf *bp;
-			bp = list_first_entry(&tmp, struct xfs_buf, b_list);
-			list_del_init(&bp->b_list);
-			xfs_bdstrat_cb(bp);
-		}
-		blk_finish_plug(&plug);
-	} while (!kthread_should_stop());
-
-	return 0;
-}
-
-/*
- *	Handling of buffer targets (buftargs).
- */
-
 /*
- *	Go through all incore buffers, and release buffers if they belong to
- *	the given device. This is used in filesystem error handling to
- *	preserve the consistency of its metadata.
+ * If we are doing a forced flush, then we need to wait for the IO that we
+ * issue to complete.
  */
-int
-xfs_flush_buftarg(
-	xfs_buftarg_t	*target,
-	int		wait)
+static void
+xfs_buf_delwri_work(
+	struct work_struct *work)
 {
-	xfs_buf_t	*bp;
-	int		pincount = 0;
+	struct xfs_buftarg *btp = container_of(to_delayed_work(work),
+					struct xfs_buftarg, bt_delwrite_work);
+	struct xfs_buf	*bp;
+	struct blk_plug	plug;
 	LIST_HEAD(tmp_list);
 	LIST_HEAD(wait_list);
-	struct blk_plug plug;
-
-	xfs_buf_runall_queues(xfsconvertd_workqueue);
-	xfs_buf_runall_queues(xfsdatad_workqueue);
-	xfs_buf_runall_queues(xfslogd_workqueue);
+	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
+	int		force = 0;
 
-	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
-	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
+	if (test_and_clear_bit(XBT_FORCE_FLUSH, &btp->bt_flags)) {
+		force = 1;
+		age = 0;
+	}
 
-	/*
-	 * Dropped the delayed write list lock, now walk the temporary list.
-	 * All I/O is issued async and then if we need to wait for completion
-	 * we do that after issuing all the IO.
-	 */
+	xfs_buf_delwri_split(btp, &tmp_list, age, force);
 	list_sort(NULL, &tmp_list, xfs_buf_cmp);
 
 	blk_start_plug(&plug);
 	while (!list_empty(&tmp_list)) {
 		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
-		ASSERT(target == bp->b_target);
 		list_del_init(&bp->b_list);
-		if (wait) {
+		if (force) {
 			bp->b_flags &= ~XBF_ASYNC;
 			list_add(&bp->b_list, &wait_list);
 		}
@@ -1634,7 +1581,7 @@ xfs_flush_buftarg(
 	}
 	blk_finish_plug(&plug);
 
-	if (wait) {
+	if (force) {
 		/* Wait for IO to complete. */
 		while (!list_empty(&wait_list)) {
 			bp = list_first_entry(&wait_list, struct xfs_buf, b_list);
@@ -1645,7 +1592,39 @@ xfs_flush_buftarg(
 		}
 	}
 
-	return pincount;
+	if (list_empty(&btp->bt_delwrite_queue))
+		return;
+
+	queue_delayed_work(xfs_buf_wq, &btp->bt_delwrite_work,
+			xfs_buf_timer_centisecs * msecs_to_jiffies(10));
+}
+
+/*
+ *	Handling of buffer targets (buftargs).
+ */
+
+/*
+ * Flush all the queued buffer work, then flush any remaining dirty buffers
+ * and wait for them to complete. If there are buffers remaining on the delwri
+ * queue, then they were pinned so couldn't be flushed. Return a value of 1 to
+ * indicate that there were pinned buffers and the caller needs to retry the
+ * flush.
+ */
+int
+xfs_flush_buftarg(
+	xfs_buftarg_t	*target,
+	int		wait)
+{
+	xfs_buf_runall_queues(xfsconvertd_workqueue);
+	xfs_buf_runall_queues(xfsdatad_workqueue);
+	xfs_buf_runall_queues(xfslogd_workqueue);
+
+	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
+	flush_delayed_work_sync(&target->bt_delwrite_work);
+
+	if (!list_empty(&target->bt_delwrite_queue))
+		return 1;
+	return 0;
 }
 
 /*
@@ -1740,7 +1719,6 @@ xfs_free_buftarg(
 	if (mp->m_flags & XFS_MOUNT_BARRIER)
 		xfs_blkdev_issue_flush(btp);
 
-	kthread_stop(btp->bt_task);
 	kmem_free(btp);
 }
 
@@ -1788,20 +1766,6 @@ xfs_setsize_buftarg(
 	return xfs_setsize_buftarg_flags(btp, blocksize, sectorsize, 1);
 }
 
-STATIC int
-xfs_alloc_delwrite_queue(
-	xfs_buftarg_t		*btp,
-	const char		*fsname)
-{
-	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
-	spin_lock_init(&btp->bt_delwrite_lock);
-	btp->bt_flags = 0;
-	btp->bt_task = kthread_run(xfsbufd, btp, "xfsbufd/%s", fsname);
-	if (IS_ERR(btp->bt_task))
-		return PTR_ERR(btp->bt_task);
-	return 0;
-}
-
 xfs_buftarg_t *
 xfs_alloc_buftarg(
 	struct xfs_mount	*mp,
@@ -1824,8 +1788,11 @@ xfs_alloc_buftarg(
 	spin_lock_init(&btp->bt_lru_lock);
 	if (xfs_setsize_buftarg_early(btp, bdev))
 		goto error;
-	if (xfs_alloc_delwrite_queue(btp, fsname))
-		goto error;
+
+	INIT_LIST_HEAD(&btp->bt_delwrite_queue);
+	spin_lock_init(&btp->bt_delwrite_lock);
+	INIT_DELAYED_WORK(&btp->bt_delwrite_work, xfs_buf_delwri_work);
+
 	btp->bt_shrinker.shrink = xfs_buftarg_shrink;
 	btp->bt_shrinker.seeks = DEFAULT_SEEKS;
 	register_shrinker(&btp->bt_shrinker);
@@ -1860,8 +1827,13 @@ xfs_buf_init(void)
 	if (!xfsconvertd_workqueue)
 		goto out_destroy_xfsdatad_workqueue;
 
+	xfs_buf_wq = alloc_workqueue("xfsbufd", WQ_MEM_RECLAIM, 8);
+	if (!xfs_buf_wq)
+		goto out_destroy_xfsconvertd_wq;
 	return 0;
 
+ out_destroy_xfsconvertd_wq:
+	destroy_workqueue(xfsconvertd_workqueue);
  out_destroy_xfsdatad_workqueue:
 	destroy_workqueue(xfsdatad_workqueue);
  out_destroy_xfslogd_workqueue:
@@ -1875,6 +1847,7 @@ xfs_buf_init(void)
 void
 xfs_buf_terminate(void)
 {
+	destroy_workqueue(xfs_buf_wq);
 	destroy_workqueue(xfsconvertd_workqueue);
 	destroy_workqueue(xfsdatad_workqueue);
 	destroy_workqueue(xfslogd_workqueue);
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index 620972b..c1aabfd 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -90,8 +90,7 @@ typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }
 
 typedef enum {
-	XBT_FORCE_SLEEP = 0,
-	XBT_FORCE_FLUSH = 1,
+	XBT_FORCE_FLUSH = 0,
 } xfs_buftarg_flags_t;
 
 typedef struct xfs_buftarg {
@@ -104,7 +103,7 @@ typedef struct xfs_buftarg {
 	size_t			bt_smask;
 
 	/* per device delwri queue */
-	struct task_struct	*bt_task;
+	struct delayed_work	bt_delwrite_work;
 	struct list_head	bt_delwrite_queue;
 	spinlock_t		bt_delwrite_lock;
 	unsigned long		bt_flags;
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index db62959..1fb9d93 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -1446,7 +1446,6 @@ xfs_qm_dqflock_pushbuf_wait(
 		if (xfs_buf_ispinned(bp))
 			xfs_log_force(mp, 0);
 		xfs_buf_delwri_promote(bp);
-		wake_up_process(bp->b_target->bt_task);
 	}
 	xfs_buf_relse(bp);
 out_lock:
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index dd966e0..919a31e 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -495,7 +495,7 @@ xfs_ail_worker(
 
 	if (push_xfsbufd) {
 		/* we've got delayed write buffers to flush */
-		wake_up_process(mp->m_ddev_targp->bt_task);
+		flush_delayed_work(&mp->m_ddev_targp->bt_delwrite_work);
 	}
 
 	/* assume we have more work to do in a short while */
-- 
1.7.5.4

* Re: [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find
  2011-08-25  7:17 ` [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find Dave Chinner
@ 2011-08-25 20:56   ` Alex Elder
  2011-08-25 23:57     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Elder @ 2011-08-25 20:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
> ~1 million cache hit lookups to ~3000 buffer creates. That's almost
> 3 orders of magnitude more cache hits than misses, so optimising for
> cache hits is quite important. In the cache hit case, we do not need
> to allocate a new buffer in case of a cache miss, so we are
> effectively hitting the allocator for no good reason for the vast
> majority of calls to _xfs_buf_find. 8-way create workloads are
> showing similar cache hit/miss ratios.
> 
> The result is profiles that look like this:
> 
>      samples  pcnt function                        DSO
>      _______ _____ _______________________________ _________________
> 
>      1036.00 10.0% _xfs_buf_find                   [kernel.kallsyms]
>       582.00  5.6% kmem_cache_alloc                [kernel.kallsyms]
>       519.00  5.0% __memcpy                        [kernel.kallsyms]
>       468.00  4.5% __ticket_spin_lock              [kernel.kallsyms]
>       388.00  3.7% kmem_cache_free                 [kernel.kallsyms]
>       331.00  3.2% xfs_log_commit_cil              [kernel.kallsyms]
> 
> 
> Further, there is a fair bit of work involved in initialising a new
> buffer once a cache miss has occurred and we currently do that under
> the rbtree spinlock. That increases spinlock hold time on what are
> heavily used trees.
> 
> To fix this, remove the initialisation of the buffer from
> _xfs_buf_find() and only allocate the new buffer once we've had a
> cache miss. Initialise the buffer immediately after allocating it in
> xfs_buf_get, too, so that it is ready for insert if we get another
> cache miss after allocation. This minimises lock hold time and
> avoids unnecessary allocator churn. The resulting profiles look
> like:
> 
>      samples  pcnt function                    DSO
>      _______ _____ ___________________________ _________________
> 
>      8111.00  9.1% _xfs_buf_find               [kernel.kallsyms]
>      4380.00  4.9% __memcpy                    [kernel.kallsyms]
>      4341.00  4.8% __ticket_spin_lock          [kernel.kallsyms]
>      3401.00  3.8% kmem_cache_alloc            [kernel.kallsyms]
>      2856.00  3.2% xfs_log_commit_cil          [kernel.kallsyms]
>      2625.00  2.9% __kmalloc                   [kernel.kallsyms]
>      2380.00  2.7% kfree                       [kernel.kallsyms]
>      2016.00  2.3% kmem_cache_free             [kernel.kallsyms]
> 
> Showing a significant reduction in time spent doing allocation and
> freeing from slabs (kmem_cache_alloc and kmem_cache_free).
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This is a good change, but I found one bug (of omission).
I also have a pretty harmless suggestion, plus suggest
some type changes.

For now I have corrected the bug and implemented the
one suggestion but not the type changes in my own copy
of this patch and am testing with it.  If you are
comfortable with that, I can commit my modified version.
The type changes can go in separately (they might expand
a bit to affect other code anyway).

Otherwise if you fix the bug you can consider this
reviewed by me.

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/xfs_buf.c |   87 +++++++++++++++++++++++++++++++-----------------------
>  1 files changed, 50 insertions(+), 37 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index c57836d..6fffa06 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -171,10 +171,16 @@ STATIC void
>  _xfs_buf_initialize(
>  	xfs_buf_t		*bp,
>  	xfs_buftarg_t		*target,
> -	xfs_off_t		range_base,
> -	size_t			range_length,
> +	xfs_off_t		bno,
> +	size_t			num_blocks,

Since you are now passing block numbers and block counts
rather than byte offsets and counts the types of these
arguments should be changed accordingly.  I believe the
right types are xfs_daddr_t and xfs_filblks_t; the latter
doesn't exactly fit the usage but it's consistent with
how it's used elsewhere.
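
I.e. something roughly like:

	STATIC void
	_xfs_buf_initialize(
		xfs_buf_t		*bp,
		xfs_buftarg_t		*target,
		xfs_daddr_t		bno,
		xfs_filblks_t		num_blocks,
		xfs_buf_flags_t		flags)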

This is the type change I mentioned above.  It applies
in several places below (where I'll just mention them
briefly).

>  	xfs_buf_flags_t		flags)
>  {
> +	xfs_off_t		range_base;
> +	size_t			range_length;
> +
> +	range_base = BBTOB(bno);
> +	range_length = BBTOB(num_blocks);
> +
>  	/*
>  	 * We don't want certain flags to appear in b_flags.
>  	 */
> @@ -423,9 +429,9 @@ _xfs_buf_map_pages(
>   */
>  xfs_buf_t *
>  _xfs_buf_find(
> -	xfs_buftarg_t		*btp,	/* block device target		*/
> -	xfs_off_t		ioff,	/* starting offset of range	*/
> -	size_t			isize,	/* length of range		*/
> +	xfs_buftarg_t		*btp,
> +	xfs_off_t		bno,
> +	size_t			num_blocks,

Type change.  (I know in this case you only changed the name,
but the type was wrong to begin with.)

>  	xfs_buf_flags_t		flags,
>  	xfs_buf_t		*new_bp)
>  {

. . .

> @@ -525,34 +529,43 @@ found:
>  }
>  
>  /*
> - *	Assembles a buffer covering the specified range.
> - *	Storage in memory for all portions of the buffer will be allocated,
> - *	although backing storage may not be.
> + * Assembles a buffer covering the specified range. The code needs to be

Maybe say "is" rather than "needs to be" here.

> + * optimised for cache hits, as metadata intensive workloads will see 3 orders
> + * of magnitude more hits than misses.
>   */
> -xfs_buf_t *
> +struct xfs_buf *
>  xfs_buf_get(
> -	xfs_buftarg_t		*target,/* target for buffer		*/
> -	xfs_off_t		ioff,	/* starting offset of range	*/
> -	size_t			isize,	/* length of range		*/
> +	struct xfs_buftarg	*target,
> +	xfs_off_t		bno,
> +	size_t			num_blocks,

Type change.  (Again, types weren't really right to begin with.)
Fixing this maybe ought to be done more pervasively; the types
for values passed in the num_blocks argument are a mix of __u64,
int and size_t.


>  	xfs_buf_flags_t		flags)
>  {
> -	xfs_buf_t		*bp, *new_bp;
> +	struct xfs_buf		*bp;
> +	struct xfs_buf		*new_bp = NULL;
>  	int			error = 0;
>  
> +	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);

I'd rather an explicit NULL be used above for the last argument.
(I've made this change to my own version of this patch.)

> +	if (likely(bp))
> +		goto found;
> +
>  	new_bp = xfs_buf_allocate(flags);
>  	if (unlikely(!new_bp))
>  		return NULL;
>  
> -	bp = _xfs_buf_find(target, ioff, isize, flags, new_bp);
> +	_xfs_buf_initialize(new_bp, target, bno, num_blocks, flags);
> +

. . .

> @@ -790,7 +803,7 @@ xfs_buf_get_uncached(
>  	bp = xfs_buf_allocate(0);
>  	if (unlikely(bp == NULL))
>  		goto fail;
> -	_xfs_buf_initialize(bp, target, 0, len, 0);
> +	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);
>  
>  	error = _xfs_buf_get_pages(bp, page_count, 0);
>  	if (error)

And the only remaining problem is the bug.  You need to make
a change comparable to what's right here in xfs_buf_get_empty().
I.e., that function needs to pass a block count rather than
a byte length.  (I have made this change in my own copy.)
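
Presumably the fix there mirrors the xfs_buf_get_uncached() hunk just
above, i.e. something like:

-	_xfs_buf_initialize(bp, target, 0, len, 0);
+	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);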





* Re: [PATCH 4/6] xfs: reduce the number of log forces from tail pushing
  2011-08-25  7:17 ` [PATCH 4/6] xfs: reduce the number of log forces from tail pushing Dave Chinner
@ 2011-08-25 20:57   ` Alex Elder
  2011-08-25 23:47     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Elder @ 2011-08-25 20:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The AIL push code will issue a log force on every single push loop
> that it exits having encountered pinned items. It doesn't rescan
> these pinned items until it revisits the AIL from the start. Hence
> we only need to force the log once per walk from the start of the
> AIL to the target LSN.
> 
> This results in numbers like this:
> 
> 	xs_push_ail_flush.....         1456
> 	xs_log_force.........          1485
> 
> For an 8-way 50M inode create workload - almost all the log forces
> are coming from the AIL pushing code.
> 
> Reduce the number of log forces by only forcing the log if the
> previous walk found pinned buffers. This reduces the numbers to:
> 
> 	xs_push_ail_flush.....          665
> 	xs_log_force.........           682
> 
> For the same test.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This looks good to me and if you don't update it
I can take it as-is.  A couple trivial things below
if you decide to update.

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/xfs_trans_ail.c  |   28 ++++++++++++++++------------
>  fs/xfs/xfs_trans_priv.h |    1 +
>  2 files changed, 17 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> index c15aa29..dd966e0 100644
> --- a/fs/xfs/xfs_trans_ail.c
> +++ b/fs/xfs/xfs_trans_ail.c
> @@ -372,12 +372,24 @@ xfs_ail_worker(
>  	xfs_lsn_t		lsn;
>  	xfs_lsn_t		target;
>  	long			tout = 10;
> -	int			flush_log = 0;
>  	int			stuck = 0;
>  	int			count = 0;
>  	int			push_xfsbufd = 0;
>  
> +	/*
> +	 * If last time we ran we encountered pinned items, force the log first,
> +	 * wait for it and then push again.
          * and wait for it before we push it again.
          */

> +	 */
>  	spin_lock(&ailp->xa_lock);
> +	if (ailp->xa_last_pushed_lsn == 0 && ailp->xa_log_flush &&
> +	    !list_empty(&ailp->xa_ail)) {
> +		ailp->xa_log_flush = 0;
> +		spin_unlock(&ailp->xa_lock);
> +		XFS_STATS_INC(xs_push_ail_flush);
> +		xfs_log_force(mp, XFS_LOG_SYNC);
> +		spin_lock(&ailp->xa_lock);
> +	}

This is a definite improvement over the previous version.

> +
>  	target = ailp->xa_target;
>  	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
>  	if (!lip || XFS_FORCED_SHUTDOWN(mp)) {
> @@ -391,6 +403,7 @@ xfs_ail_worker(
>  
>  	XFS_STATS_INC(xs_push_ail);
>  
> +
>  	/*
>  	 * While the item we are looking at is below the given threshold
>  	 * try to flush it out. We'd like not to stop until we've at least

Kill this hunk.

. . .


* Re: [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code
  2011-08-25  7:17 ` [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code Dave Chinner
@ 2011-08-25 20:57   ` Alex Elder
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Elder @ 2011-08-25 20:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> In preparation for converting the xfsbufd to a workqueue, re-arrange
> all the delayed write queue manipulation code above the buftarg
> sections.  This allows us to avoid the need for forward declarations
> of the new xfsbufd worker functions. This is pure code movement; no
> changes to the code have been made.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks good.

Reviewed-by: Alex Elder <aelder@sgi.com>


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] xfs: convert xfsbufd to use a workqueue
  2011-08-25  7:17 ` [PATCH 6/6] xfs: convert xfsbufd to use a workqueue Dave Chinner
@ 2011-08-25 20:57   ` Alex Elder
  2011-08-25 23:46     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Elder @ 2011-08-25 20:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> There is no reason we need a thread per filesystem to do the
> flushing of the delayed write buffer queue. This can be easily
> handled by a global concurrency managed workqueue.
> 
> Convert the delayed write buffer handling to use workqueues and
> workqueue flushes to implement buffer writeback by embedding a
> delayed work structure into the struct xfs_buftarg and using that to
> control flushing.  This greatly simplifies the process of flushing
> and also removes a bunch of duplicated code between buftarg flushing
> and delwri buffer writeback.

I point out one bug below.  I also question one of the
changes and have some suggestions.  I'd like to see this
updated and get another chance to review the result.

					-Alex

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_buf.c       |  165 ++++++++++++++++++++----------------------------
>  fs/xfs/xfs_buf.h       |    5 +-
>  fs/xfs/xfs_dquot.c     |    1 -
>  fs/xfs/xfs_trans_ail.c |    2 +-
>  4 files changed, 72 insertions(+), 101 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index 410de9f..b1b8c0c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c

. . .

> @@ -1407,8 +1407,9 @@ xfs_buf_delwri_queue(
>  	}
>  
>  	if (list_empty(dwq)) {
> -		/* start xfsbufd as it is about to have something to do */
> -		wake_up_process(bp->b_target->bt_task);
> +		/* queue a delayed flush as we are about to queue a buffer */
> +		queue_delayed_work(xfs_buf_wq, &bp->b_target->bt_delwrite_work,
> +			xfs_buf_timer_centisecs * msecs_to_jiffies(10));

I think this should be done *after* the buffer has been
added to the delayed write queue.  I realize there will be
a small delay so it should be fine, but...  It also doesn't
have to be done with bt_delwrite_lock held.

>  	}
>  
>  	bp->b_flags |= _XBF_DELWRI_Q;
> @@ -1486,13 +1487,13 @@ STATIC int
>  xfs_buf_delwri_split(
>  	xfs_buftarg_t	*target,
>  	struct list_head *list,
> -	unsigned long	age)
> +	unsigned long	age,
> +	int		force)
>  {
>  	xfs_buf_t	*bp, *n;
>  	struct list_head *dwq = &target->bt_delwrite_queue;
>  	spinlock_t	*dwlk = &target->bt_delwrite_lock;
>  	int		skipped = 0;
> -	int		force;
>  
>  	force = test_and_clear_bit(XBT_FORCE_FLUSH, &target->bt_flags);

You forgot to delete this line when you made "force" be
an argument to this function.

The (now sole) caller--xfs_buf_delwri_work()--uses the
value of force it computes after this function returns,
so it's clear you want to use the passed-in value.

But given that the caller will have already cleared
the bit, the value of "force" will now be 0.  Which
means that with this code xfs_flush_buftarg() does
not work as it should.

I think if you delete this line all is well, but you
need to test this.

>  	INIT_LIST_HEAD(list);
> @@ -1543,90 +1544,36 @@ xfs_buf_cmp(
>  	return 0;
>  }
>  
> -STATIC int
> -xfsbufd(
> -	void		*data)
> -{
> -	xfs_buftarg_t   *target = (xfs_buftarg_t *)data;

. . .

>  /*
> - *	Go through all incore buffers, and release buffers if they belong to
> - *	the given device. This is used in filesystem error handling to
> - *	preserve the consistency of its metadata.
> + * If we are doing a forced flush, then we need to wait for the IO that we
> + * issue to complete.
>   */
> -int
> -xfs_flush_buftarg(
> -	xfs_buftarg_t	*target,
> -	int		wait)
> +static void
> +xfs_buf_delwri_work(
> +	struct work_struct *work)
>  {
> -	xfs_buf_t	*bp;
> -	int		pincount = 0;
> +	struct xfs_buftarg *btp = container_of(to_delayed_work(work),
> +					struct xfs_buftarg, bt_delwrite_work);
> +	struct xfs_buf	*bp;
> +	struct blk_plug	plug;
>  	LIST_HEAD(tmp_list);
>  	LIST_HEAD(wait_list);
> -	struct blk_plug plug;
> -
> -	xfs_buf_runall_queues(xfsconvertd_workqueue);
> -	xfs_buf_runall_queues(xfsdatad_workqueue);
> -	xfs_buf_runall_queues(xfslogd_workqueue);
> +	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
> +	int		force = 0;
>  
> -	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> -	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
> +	if (test_and_clear_bit(XBT_FORCE_FLUSH, &btp->bt_flags)) {
> +		force = 1;
> +		age = 0;
> +	}

xfs_buf_delwri_split() ignores its "age" argument if "force"
is set.  This function never uses "age" otherwise.  So the
above can just be:

	force = test_and_clear_bit(XBT_FORCE_FLUSH, &btp->bt_flags);

>  
> -	/*
> -	 * Dropped the delayed write list lock, now walk the temporary list.
> -	 * All I/O is issued async and then if we need to wait for completion
> -	 * we do that after issuing all the IO.
> -	 */
> +	xfs_buf_delwri_split(btp, &tmp_list, age, force);
>  	list_sort(NULL, &tmp_list, xfs_buf_cmp);
>  
>  	blk_start_plug(&plug);
>  	while (!list_empty(&tmp_list)) {
>  		bp = list_first_entry(&tmp_list, struct xfs_buf, b_list);
> -		ASSERT(target == bp->b_target);
>  		list_del_init(&bp->b_list);
> -		if (wait) {
> +		if (force) {
>  			bp->b_flags &= ~XBF_ASYNC;
>  			list_add(&bp->b_list, &wait_list);
>  		}

. . .

> @@ -1645,7 +1592,39 @@ xfs_flush_buftarg(
>  		}
>  	}
>  
> -	return pincount;
> +	if (list_empty(&btp->bt_delwrite_queue))
> +		return;
> +
> +	queue_delayed_work(xfs_buf_wq, &btp->bt_delwrite_work,
> +			xfs_buf_timer_centisecs * msecs_to_jiffies(10));
> +}
> +
> +/*
> + *	Handling of buffer targets (buftargs).
> + */
> +
> +/*
> + * Flush all the queued buffer work, then flush any remaining dirty buffers
> + * and wait for them to complete. If there are buffers remaining on the delwri
> + * queue, then they were pinned so couldn't be flushed. Return a value of 1 to
> + * indicate that there were pinned buffers and the caller needs to retry the
> + * flush.
> + */
> +int
> +xfs_flush_buftarg(
> +	xfs_buftarg_t	*target,
> +	int		wait)

Since this function now ignores its "wait" argument,
you could eliminate it, and perhaps get rid of the
one (first) call in xfs_quiesce_fs() that passes 0.

> +{
> +	xfs_buf_runall_queues(xfsconvertd_workqueue);
> +	xfs_buf_runall_queues(xfsdatad_workqueue);
> +	xfs_buf_runall_queues(xfslogd_workqueue);
> +
> +	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> +	flush_delayed_work_sync(&target->bt_delwrite_work);
> +
> +	if (!list_empty(&target->bt_delwrite_queue))
> +		return 1;
> +	return 0;
 
Maybe just:
	return !list_empty(&target->bt_delwrite_queue);

(But I understand why you might prefer what you did.)

>  }
>  
>  /*

. . .

> diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> index db62959..1fb9d93 100644
> --- a/fs/xfs/xfs_dquot.c
> +++ b/fs/xfs/xfs_dquot.c
> @@ -1446,7 +1446,6 @@ xfs_qm_dqflock_pushbuf_wait(
>  		if (xfs_buf_ispinned(bp))
>  			xfs_log_force(mp, 0);
>  		xfs_buf_delwri_promote(bp);
> -		wake_up_process(bp->b_target->bt_task);

Why does this not need to be replaced with a
flush_delayed_work() call?  Was this wake_up_process()
not needed before?  If that's the case it seems like
that change is independent of the switch to work queues
and should be done separately (first).

>  	}
>  	xfs_buf_relse(bp);
>  out_lock:

. . .

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes
  2011-08-25  7:17 ` [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes Dave Chinner
@ 2011-08-25 21:08   ` Alex Elder
  2011-08-26  2:19     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Elder @ 2011-08-25 21:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> For append write workloads, extending the file requires a certain
> amount of exclusive locking to be done up front to ensure sanity in
> things like ensuring that we've zeroed any allocated regions
> between the old EOF and the start of the new IO.
> 
> For single threads, this typically isn't a problem, and for large
> IOs we don't serialise enough for it to be a problem for two
> threads on really fast block devices. However for smaller IO and
> larger thread counts we have a problem.
> 
> Take 4 concurrent sequential, single block sized and aligned IOs.
> After the first IO is submitted but before it completes, we end up
> with this state:
> 
>         IO 1    IO 2    IO 3    IO 4
>       +-------+-------+-------+-------+
>       ^       ^
>       |       |
>       |       |
>       |       |
>       |       \- ip->i_new_size
>       \- ip->i_size
> 
> And the IO is done without exclusive locking because offset <=
> ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
> grab the IO lock exclusive, because there is a chance we need to do
> EOF zeroing. However, there is already an IO in progress that avoids
> the need for IO zeroing because offset <= ip->i_new_size. Hence we
> could avoid holding the IO lock exclusive for this. Hence after
> submission of the second IO, we'd end up in this state:
> 
>         IO 1    IO 2    IO 3    IO 4
>       +-------+-------+-------+-------+
>       ^               ^
>       |               |
>       |               |
>       |               |
>       |               \- ip->i_new_size
>       \- ip->i_size
> 
> There is no need to grab the i_mutex of the IO lock in exclusive
> mode if we don't need to invalidate the page cache. Taking these
> locks on every direct IO effectively serialises them as taking the IO
> lock in exclusive mode has to wait for all shared holders to drop
> the lock. That only happens when IO is complete, so effectively it
> prevents dispatch of concurrent direct IO writes to the same inode.
> 
> And so you can see that for the third concurrent IO, we'd avoid
> exclusive locking for the same reason we avoided the exclusive lock
> for the second IO.
> 
> Fixing this is a bit more complex than that, because we need to hold
> a write-submission local value of ip->i_new_size to that clearing
> the value is only done if no other thread has updated it before our
> IO completes.....
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

This looks good.  What did you do with the little
"If the IO is clearly not beyond the on-disk inode size,
return before we take locks" optimization in xfs_setfilesize()
from the last time you posted this?

Reviewed-by: Alex Elder <aelder@sgi.com>
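
(As an illustrative aside on the locking decision described in the commit
message above: the helper below is a stand-alone sketch with invented names
and flattened arguments, not the patch itself. It only says an appending
direct IO write needs the exclusive IO lock when neither the on-disk size
nor an in-flight append covers the range it would otherwise have to zero.)

#include <stdio.h>

/*
 * Sketch of the submission-time check described above. The helper name
 * and the flattened long long arguments are invented for illustration.
 */
static int need_exclusive_iolock(long long offset, long long i_size,
				 long long i_new_size)
{
	if (offset <= i_size)
		return 0;	/* not extending the file */
	if (offset <= i_new_size)
		return 0;	/* an in-flight append already covers EOF zeroing */
	return 1;		/* may need to zero between EOF and the new IO */
}

int main(void)
{
	long long bsize = 4096, i_size = 0, i_new_size = 0;
	int io;

	/* the four adjacent single-block appends from the diagrams above */
	for (io = 0; io < 4; io++) {
		long long offset = io * bsize;

		printf("IO %d: %s iolock\n", io + 1,
		       need_exclusive_iolock(offset, i_size, i_new_size) ?
		       "exclusive" : "shared");
		if (offset + bsize > i_new_size)
			i_new_size = offset + bsize;	/* as set at submission */
	}
	return 0;
}

Without the i_new_size check, IOs 2-4 would each have taken the exclusive
lock; with it, only a write genuinely beyond both values does.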


_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] xfs: convert xfsbufd to use a workqueue
  2011-08-25 20:57   ` Alex Elder
@ 2011-08-25 23:46     ` Dave Chinner
  2011-08-26  0:18       ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2011-08-25 23:46 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, Aug 25, 2011 at 03:57:19PM -0500, Alex Elder wrote:
> On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > There is no reason we need a thread per filesystem to do the
> > flushing of the delayed write buffer queue. This can be easily
> > handled by a global concurrency managed workqueue.
> > 
> > Convert the delayed write buffer handling to use workqueues and
> > workqueue flushes to implement buffer writeback by embedding a
> > delayed work structure into the struct xfs_buftarg and using that to
> > control flushing.  This greatly simplifies the process of flushing
> > and also removes a bunch of duplicated code between buftarg flushing
> > and delwri buffer writeback.
> 
> I point out one bug below.  I also question one of the
> changes and have some suggestions.  I'd like to see this
> updated and get another chance to review the result.
> 
> 					-Alex
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_buf.c       |  165 ++++++++++++++++++++----------------------------
> >  fs/xfs/xfs_buf.h       |    5 +-
> >  fs/xfs/xfs_dquot.c     |    1 -
> >  fs/xfs/xfs_trans_ail.c |    2 +-
> >  4 files changed, 72 insertions(+), 101 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index 410de9f..b1b8c0c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> 
> . . .
> 
> > @@ -1407,8 +1407,9 @@ xfs_buf_delwri_queue(
> >  	}
> >  
> >  	if (list_empty(dwq)) {
> > -		/* start xfsbufd as it is about to have something to do */
> > -		wake_up_process(bp->b_target->bt_task);
> > +		/* queue a delayed flush as we are about to queue a buffer */
> > +		queue_delayed_work(xfs_buf_wq, &bp->b_target->bt_delwrite_work,
> > +			xfs_buf_timer_centisecs * msecs_to_jiffies(10));
> 
> I think this should be done *after* the buffer has been
> added to the delayed write queue.  I realize there will be
> a small delay so it should be fine, but...  It also doesn't
> have to be done with bt_delwrite_lock held.

The reason it is done there is so we don't need a local variable to
store whether we need to queue work. The fact that the worker must
then first grab the bt_delwrite_lock before it will do anything
means that even if the worker runs immediately, it will block until
we've finished adding this item to the list and dropped the lock,
so it really doesn't matter that we queue the work before adding the
buffer to the list. And the code is actually simpler doing it this way....
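
(A toy user-space model of that ordering argument, with invented names
rather than the kernel API: because the worker must take the same lock the
queueing code already holds, kicking the work off before the list insert
cannot let it observe an empty list.)

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t dwlk = PTHREAD_MUTEX_INITIALIZER;
static int queued_buffers;		/* stands in for bt_delwrite_queue */

static void *delwri_worker(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&dwlk);	/* blocks until the queuer drops dwlk */
	printf("worker sees %d queued buffer(s)\n", queued_buffers);
	pthread_mutex_unlock(&dwlk);
	return NULL;
}

int main(void)
{
	pthread_t worker;

	pthread_mutex_lock(&dwlk);
	/* "queue_delayed_work()" fires before the list insert ... */
	pthread_create(&worker, NULL, delwri_worker, NULL);
	/* ... but the insert still completes before the worker can look */
	queued_buffers = 1;
	pthread_mutex_unlock(&dwlk);

	pthread_join(worker, NULL);
	return 0;
}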

> 
> >  	}
> >  
> >  	bp->b_flags |= _XBF_DELWRI_Q;
> > @@ -1486,13 +1487,13 @@ STATIC int
> >  xfs_buf_delwri_split(
> >  	xfs_buftarg_t	*target,
> >  	struct list_head *list,
> > -	unsigned long	age)
> > +	unsigned long	age,
> > +	int		force)
> >  {
> >  	xfs_buf_t	*bp, *n;
> >  	struct list_head *dwq = &target->bt_delwrite_queue;
> >  	spinlock_t	*dwlk = &target->bt_delwrite_lock;
> >  	int		skipped = 0;
> > -	int		force;
> >  
> >  	force = test_and_clear_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> 
> You forgot to delete this line when you made "force" be
> an argument to this function.

Oops. So I did. I screwed that up when updating it after all the
recent xfs_buf.c changes. Will fix.

> I think if you delete this line all is well, but you
> need to test this.

Of course ;)

> > -	xfs_buf_runall_queues(xfsconvertd_workqueue);
> > -	xfs_buf_runall_queues(xfsdatad_workqueue);
> > -	xfs_buf_runall_queues(xfslogd_workqueue);
> > +	long		age = xfs_buf_age_centisecs * msecs_to_jiffies(10);
> > +	int		force = 0;
> >  
> > -	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> > -	pincount = xfs_buf_delwri_split(target, &tmp_list, 0);
> > +	if (test_and_clear_bit(XBT_FORCE_FLUSH, &btp->bt_flags)) {
> > +		force = 1;
> > +		age = 0;
> > +	}
> 
> xfs_buf_delwri_split() ignores its "age" argument if "force"
> is set.  This function never uses "age" otherwise.  So the
> above can just be:
> 
> 	force = test_and_clear_bit(XBT_FORCE_FLUSH, &btp->bt_flags);

Ok.

> > +/*
> > + * Flush all the queued buffer work, then flush any remaining dirty buffers
> > + * and wait for them to complete. If there are buffers remaining on the delwri
> > + * queue, then they were pinned so couldn't be flushed. Return a value of 1 to
> > + * indicate that there were pinned buffers and the caller needs to retry the
> > + * flush.
> > + */
> > +int
> > +xfs_flush_buftarg(
> > +	xfs_buftarg_t	*target,
> > +	int		wait)
> 
> Since this function now ignores its "wait" argument,
> you could eliminate it, and perhaps get rid of the
> one (first) call in xfs_quiesce_fs() that passes 0.

I'll leave that to another patch.

> 
> > +{
> > +	xfs_buf_runall_queues(xfsconvertd_workqueue);
> > +	xfs_buf_runall_queues(xfsdatad_workqueue);
> > +	xfs_buf_runall_queues(xfslogd_workqueue);
> > +
> > +	set_bit(XBT_FORCE_FLUSH, &target->bt_flags);
> > +	flush_delayed_work_sync(&target->bt_delwrite_work);
> > +
> > +	if (!list_empty(&target->bt_delwrite_queue))
> > +		return 1;
> > +	return 0;
>  
> Maybe just:
> 	return !list_empty(&target->bt_delwrite_queue);
> 
> (But I understand why you might prefer what you did.)

Yeah, I prefer the explicit return values to having to work out
whether it is returning 0 or 1. Call me lazy. ;)

> > diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
> > index db62959..1fb9d93 100644
> > --- a/fs/xfs/xfs_dquot.c
> > +++ b/fs/xfs/xfs_dquot.c
> > @@ -1446,7 +1446,6 @@ xfs_qm_dqflock_pushbuf_wait(
> >  		if (xfs_buf_ispinned(bp))
> >  			xfs_log_force(mp, 0);
> >  		xfs_buf_delwri_promote(bp);
> > -		wake_up_process(bp->b_target->bt_task);
> 
> Why does this not need to be replaced with a
> flush_delayed_work() call?  Was this wake_up_process()
> not needed before?

The wake up that is done here is simply to cause the delwri buffer
queue to be flushed immediately, rather than wait for the next
timeout. The thing is that the xfsbufd can idle with no defined next
wakeup, so this was added to ensure the xfsbufd was woken and the
buffer flushed. xfs_qm_dqflock_pushbuf_wait, however, is
never called in a performance or latency critical path, so this
was purely defensive to ensure the xfsbufd was awake as I wasn't
certain I'd got everything right in the xfsbufd idling code.

With the workqueue changeover, I'm pretty certain that the delayed
workqueue processing will continue until the delwri queue is empty,
so there is no need for a defensive flush of it here - the dquot
buffer will get flushed in a short while now that it is at the head
of the delwri queue without needing to explicitly ensure the queue
is running...
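
(A toy model of that point, again with invented names: a worker that
requeues itself while the queue is non-empty will eventually write a
buffer promoted to the head of the queue without any explicit wake-up
from the promoting code.)

#include <stdio.h>

static int delwri_queue_len = 3;	/* buffers on the delwri queue */

static int delwri_work(void)
{
	printf("flushing a buffer, %d left\n", --delwri_queue_len);
	return delwri_queue_len > 0;	/* requeue only while work remains */
}

int main(void)
{
	/* stands in for the delayed work timer firing repeatedly */
	while (delwri_work())
		;
	printf("queue empty, worker idles until the next queueing\n");
	return 0;
}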

> If that's the case it seems like
> that change is independent of the switch to work queues
> and should be done separately (first).

I don't think it is separable from the workqueue changeover...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 4/6] xfs: reduce the number of log forces from tail pushing
  2011-08-25 20:57   ` Alex Elder
@ 2011-08-25 23:47     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-25 23:47 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, Aug 25, 2011 at 03:57:06PM -0500, Alex Elder wrote:
> On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The AIL push code will issue a log force on every single push loop
> > that it exits having encountered pinned items. It doesn't rescan
> > these pinned items until it revisits the AIL from the start. Hence
> > we only need to force the log once per walk from the start of the
> > AIL to the target LSN.
> > 
> > This results in numbers like this:
> > 
> > 	xs_push_ail_flush.....         1456
> > 	xs_log_force.........          1485
> > 
> > For an 8-way 50M inode create workload - almost all the log forces
> > are coming from the AIL pushing code.
> > 
> > Reduce the number of log forces by only forcing the log if the
> > previous walk found pinned buffers. This reduces the numbers to:
> > 
> > 	xs_push_ail_flush.....          665
> > 	xs_log_force.........           682
> > 
> > For the same test.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> This looks good to me, and if you don't update it
> I can take it as-is.  A couple of trivial things below
> if you decide to update.
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>
> 
> > ---
> >  fs/xfs/xfs_trans_ail.c  |   28 ++++++++++++++++------------
> >  fs/xfs/xfs_trans_priv.h |    1 +
> >  2 files changed, 17 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
> > index c15aa29..dd966e0 100644
> > --- a/fs/xfs/xfs_trans_ail.c
> > +++ b/fs/xfs/xfs_trans_ail.c
> > @@ -372,12 +372,24 @@ xfs_ail_worker(
> >  	xfs_lsn_t		lsn;
> >  	xfs_lsn_t		target;
> >  	long			tout = 10;
> > -	int			flush_log = 0;
> >  	int			stuck = 0;
> >  	int			count = 0;
> >  	int			push_xfsbufd = 0;
> >  
> > +	/*
> > +	 * If last time we ran we encountered pinned items, force the log first,
> > +	 * wait for it and then push again.
>           * and wait for it before we push it again.
>           */
> 

OK.

> > +	 */
> >  	spin_lock(&ailp->xa_lock);
> > +	if (ailp->xa_last_pushed_lsn == 0 && ailp->xa_log_flush &&
> > +	    !list_empty(&ailp->xa_ail)) {
> > +		ailp->xa_log_flush = 0;
> > +		spin_unlock(&ailp->xa_lock);
> > +		XFS_STATS_INC(xs_push_ail_flush);
> > +		xfs_log_force(mp, XFS_LOG_SYNC);
> > +		spin_lock(&ailp->xa_lock);
> > +	}
> 
> This is a definite improvement over the previous version.
> 
> > +
> >  	target = ailp->xa_target;
> >  	lip = xfs_trans_ail_cursor_first(ailp, &cur, ailp->xa_last_pushed_lsn);
> >  	if (!lip || XFS_FORCED_SHUTDOWN(mp)) {
> > @@ -391,6 +403,7 @@ xfs_ail_worker(
> >  
> >  	XFS_STATS_INC(xs_push_ail);
> >  
> > +
> >  	/*
> >  	 * While the item we are looking at is below the given threshold
> >  	 * try to flush it out. We'd like not to stop until we've at least
> 
> Kill this hunk.

Yep, will do.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find
  2011-08-25 20:56   ` Alex Elder
@ 2011-08-25 23:57     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-25 23:57 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, Aug 25, 2011 at 03:56:18PM -0500, Alex Elder wrote:
> On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Stats show that for an 8-way unlink @ ~80,000 unlinks/s we are doing
> > ~1 million cache hit lookups to ~3000 buffer creates. That's almost
> > 3 orders of magnitude more cache hits than misses, so optimising for
> > cache hits is quite important. In the cache hit case, we do not need
> > to allocate a new buffer in case of a cache miss, so we are
> > effectively hitting the allocator for no good reason for the vast
> > majority of calls to _xfs_buf_find. 8-way create workloads are
> > showing similar cache hit/miss ratios.
> > 
> > The result is profiles that look like this:
> > 
> >      samples  pcnt function                        DSO
> >      _______ _____ _______________________________ _________________
> > 
> >      1036.00 10.0% _xfs_buf_find                   [kernel.kallsyms]
> >       582.00  5.6% kmem_cache_alloc                [kernel.kallsyms]
> >       519.00  5.0% __memcpy                        [kernel.kallsyms]
> >       468.00  4.5% __ticket_spin_lock              [kernel.kallsyms]
> >       388.00  3.7% kmem_cache_free                 [kernel.kallsyms]
> >       331.00  3.2% xfs_log_commit_cil              [kernel.kallsyms]
> > 
> > 
> > Further, there is a fair bit of work involved in initialising a new
> > buffer once a cache miss has occurred and we currently do that under
> > the rbtree spinlock. That increases spinlock hold time on what are
> > heavily used trees.
> > 
> > To fix this, remove the initialisation of the buffer from
> > _xfs_buf_find() and only allocate the new buffer once we've had a
> > cache miss. Initialise the buffer immediately after allocating it in
> > xfs_buf_get, too, so that it is ready for insert if we get another
> > cache miss after allocation. This minimises lock hold time and
> > avoids unnecessary allocator churn. The resulting profiles look
> > like:
> > 
> >      samples  pcnt function                    DSO
> >      _______ _____ ___________________________ _________________
> > 
> >      8111.00  9.1% _xfs_buf_find               [kernel.kallsyms]
> >      4380.00  4.9% __memcpy                    [kernel.kallsyms]
> >      4341.00  4.8% __ticket_spin_lock          [kernel.kallsyms]
> >      3401.00  3.8% kmem_cache_alloc            [kernel.kallsyms]
> >      2856.00  3.2% xfs_log_commit_cil          [kernel.kallsyms]
> >      2625.00  2.9% __kmalloc                   [kernel.kallsyms]
> >      2380.00  2.7% kfree                       [kernel.kallsyms]
> >      2016.00  2.3% kmem_cache_free             [kernel.kallsyms]
> > 
> > Showing a significant reduction in time spent doing allocation and
> > freeing from slabs (kmem_cache_alloc and kmem_cache_free).
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> This is a good change, but I found one bug (of omission).
> I also have a pretty harmless suggestion, plus suggest
> some type changes.
> 
> For now I have corrected the bug and implemented the
> one suggestion but not the type changes in my own copy
> of this patch and am testing with it.  If you are
> comfortable with that, I can commit my modified version.
> The type changes can go in separately (they might expand
> a bit to affect other code anyway).
> 
> Otherwise if you fix the bug you can consider this
> reviewed by me.
> 
> Reviewed-by: Alex Elder <aelder@sgi.com>
> 
> > ---
> >  fs/xfs/xfs_buf.c |   87 +++++++++++++++++++++++++++++++-----------------------
> >  1 files changed, 50 insertions(+), 37 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index c57836d..6fffa06 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -171,10 +171,16 @@ STATIC void
> >  _xfs_buf_initialize(
> >  	xfs_buf_t		*bp,
> >  	xfs_buftarg_t		*target,
> > -	xfs_off_t		range_base,
> > -	size_t			range_length,
> > +	xfs_off_t		bno,
> > +	size_t			num_blocks,
> 
> Since you are now passing block numbers and block counts
> rather than byte offsets and counts the types of these
> arguments should be changed accordingly.  I believe the
> right types are xfs_daddr_t and xfs_filblks_t; the latter
> doesn't exactly fit the usage but it's consistent with
> how it's used elsewhere.

I considered changing to xfs_daddr_t, but then thought "that's a
change to the external interface" and "there's more than just this
that needs fixing" so I figured I'd leave that to a
followup patchset. Hence I'd prefer to keep the type/API changes
separate to functional changes

> Fixing this maybe ought to be done more pervasively; the types
> for values passed in the num_blocks argument are a mix of __u64,
> int and size_t.

Exactly.

> >  	xfs_buf_flags_t		flags)
> >  {
> > -	xfs_buf_t		*bp, *new_bp;
> > +	struct xfs_buf		*bp;
> > +	struct xfs_buf		*new_bp = NULL;
> >  	int			error = 0;
> >  
> > +	bp = _xfs_buf_find(target, bno, num_blocks, flags, new_bp);
> 
> I'd rather an explicit NULL be used above for the last argument.
> (I've made this change to my own version of this patch.)

Good point.

> > +	if (likely(bp))
> > +		goto found;
> > +
> >  	new_bp = xfs_buf_allocate(flags);
> >  	if (unlikely(!new_bp))
> >  		return NULL;
> >  
> > -	bp = _xfs_buf_find(target, ioff, isize, flags, new_bp);
> > +	_xfs_buf_initialize(new_bp, target, bno, num_blocks, flags);
> > +
> 
> . . .
> 
> > @@ -790,7 +803,7 @@ xfs_buf_get_uncached(
> >  	bp = xfs_buf_allocate(0);
> >  	if (unlikely(bp == NULL))
> >  		goto fail;
> > -	_xfs_buf_initialize(bp, target, 0, len, 0);
> > +	_xfs_buf_initialize(bp, target, 0, BTOBB(len), 0);
> >  
> >  	error = _xfs_buf_get_pages(bp, page_count, 0);
> >  	if (error)
> 
> And the only remaining problem is the bug.  You need to make
> a change comparable to what's right here in xfs_buf_get_empty().
> I.e., that function needs to pass a block count rather than
> a byte length.  (I have made this change in my own copy.)

Oh, right, I missed the call in xfs_buf_get_empty(). It took me a
minute to work out that you were talking about a bug in
xfs_buf_get_empty() rather than in the above hunk :).

Will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 6/6] xfs: convert xfsbufd to use a workqueue
  2011-08-25 23:46     ` Dave Chinner
@ 2011-08-26  0:18       ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-26  0:18 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Aug 26, 2011 at 09:46:56AM +1000, Dave Chinner wrote:
> On Thu, Aug 25, 2011 at 03:57:19PM -0500, Alex Elder wrote:
> > On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > > +/*
> > > + * Flush all the queued buffer work, then flush any remaining dirty buffers
> > > + * and wait for them to complete. If there are buffers remaining on the delwri
> > > + * queue, then they were pinned so couldn't be flushed. Return a value of 1 to
> > > + * indicate that there were pinned buffers and the caller needs to retry the
> > > + * flush.
> > > + */
> > > +int
> > > +xfs_flush_buftarg(
> > > +	xfs_buftarg_t	*target,
> > > +	int		wait)
> > 
> > Since this function now ignores its "wait" argument,
> > you could eliminate it, and perhaps get rid of the
> > one (first) call in xfs_quiesce_fs() that passes 0.
> 
> I'll leave that to another patch.

Actually, I'll re-instate the existing wait semantics here.

In looking at this again I realised there is a race condition in the
flushing code - if work is already in progress, then the wq flush
won't start new work and hence won't see the force flag at all. So
the code needs changing anyway and the only time we need to set the
FORCE_FLUSH flag is when we are supposed to be waiting. Hence I'll
change it back to doing a non-blocking flush when the wait flag is
not set.

If we want to go to just a blocking flush, then we can change
everything in the one patch.
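
(A toy model of the race just described, with invented names: the worker
samples and clears the force flag when a pass starts, so a flag set after
the pass has begun is not honoured by the pass that the flush ends up
waiting for.)

#include <stdio.h>

static int force_flag;
static int pass_was_forced;

static void delwri_pass_start(void)
{
	pass_was_forced = force_flag;	/* decision taken at pass entry */
	force_flag = 0;
}

static void delwri_pass_finish(void)
{
	printf("pass completed, forced = %s\n",
	       pass_was_forced ? "yes" : "no");
}

int main(void)
{
	delwri_pass_start();	/* a periodic pass is already in progress */
	force_flag = 1;		/* xfs_flush_buftarg() sets XBT_FORCE_FLUSH */
	delwri_pass_finish();	/* the flush only waits for this pass: "no" */
	return 0;
}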

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes
  2011-08-25 21:08   ` Alex Elder
@ 2011-08-26  2:19     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2011-08-26  2:19 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Thu, Aug 25, 2011 at 04:08:03PM -0500, Alex Elder wrote:
> On Thu, 2011-08-25 at 17:17 +1000, Dave Chinner wrote:
> > For append write workloads, extending the file requires a certain
> > amount of exclusive locking to be done up front to ensure sanity in
> > things like ensuring that we've zeroed any allocated regions
> > between the old EOF and the start of the new IO.
> > 
> > For single threads, this typically isn't a problem, and for large
> > IOs we don't serialise enough for it to be a problem for two
> > threads on really fast block devices. However for smaller IO and
> > larger thread counts we have a problem.
> > 
> > Take 4 concurrent sequential, single block sized and aligned IOs.
> > After the first IO is submitted but before it completes, we end up
> > with this state:
> > 
> >         IO 1    IO 2    IO 3    IO 4
> >       +-------+-------+-------+-------+
> >       ^       ^
> >       |       |
> >       |       |
> >       |       |
> >       |       \- ip->i_new_size
> >       \- ip->i_size
> > 
> > And the IO is done without exclusive locking because offset <=
> > ip->i_size. When we submit IO 2, we see offset > ip->i_size, and
> > grab the IO lock exclusive, because there is a chance we need to do
> > EOF zeroing. However, there is already an IO in progress that avoids
> > the need for IO zeroing because offset <= ip->i_new_size. Hence we
> > could avoid holding the IO lock exclusive for this. Hence after
> > submission of the second IO, we'd end up in this state:
> > 
> >         IO 1    IO 2    IO 3    IO 4
> >       +-------+-------+-------+-------+
> >       ^               ^
> >       |               |
> >       |               |
> >       |               |
> >       |               \- ip->i_new_size
> >       \- ip->i_size
> > 
> > There is no need to grab the i_mutex of the IO lock in exclusive
> > mode if we don't need to invalidate the page cache. Taking these
> > locks on every direct IO effectively serialises them as taking the IO
> > lock in exclusive mode has to wait for all shared holders to drop
> > the lock. That only happens when IO is complete, so effectively it
> > prevents dispatch of concurrent direct IO writes to the same inode.
> > 
> > And so you can see that for the third concurrent IO, we'd avoid
> > exclusive locking for the same reason we avoided the exclusive lock
> > for the second IO.
> > 
> > Fixing this is a bit more complex than that, because we need to hold
> > a write-submission local value of ip->i_new_size so that clearing
> > the value is only done if no other thread has updated it before our
> > IO completes.....
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> This looks good.  What did you do with the little
> "If the IO is clearly not beyond the on-disk inode size,
> return before we take locks" optimization in xfs_setfilesize()
> from the last time you posted this?

That's taken care of in Christoph's recent patch set.
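
(For context, the early return Alex is asking about is roughly of this
shape; the helper name and field spelling are assumptions for
illustration, and per the reply the real version landed via Christoph's
series rather than this patch.)

#include <stdio.h>

/* Skip the locked size update when the completed IO clearly does not
 * extend the on-disk inode size. Names are illustrative assumptions. */
static int setfilesize_has_work(long long io_offset, long long io_size,
				long long ondisk_isize)
{
	return io_offset + io_size > ondisk_isize;
}

int main(void)
{
	/* overwrite well inside a 1MiB file: nothing to update */
	printf("%d\n", setfilesize_has_work(4096, 4096, 1 << 20));
	/* appending write: fall through to the locked size update */
	printf("%d\n", setfilesize_has_work(1 << 20, 4096, 1 << 20));
	return 0;
}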

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2011-08-26  2:19 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-25  7:17 [PATCH 0/6] xfs: patch queue for Linux 3.2 Dave Chinner
2011-08-25  7:17 ` [PATCH 1/6] xfs: don't serialise direct IO reads on page cache checks Dave Chinner
2011-08-25  7:17 ` [PATCH 2/6] xfs: don't serialise adjacent concurrent direct IO appending writes Dave Chinner
2011-08-25 21:08   ` Alex Elder
2011-08-26  2:19     ` Dave Chinner
2011-08-25  7:17 ` [PATCH 3/6] xfs: Don't allocate new buffers on every call to _xfs_buf_find Dave Chinner
2011-08-25 20:56   ` Alex Elder
2011-08-25 23:57     ` Dave Chinner
2011-08-25  7:17 ` [PATCH 4/6] xfs: reduce the number of log forces from tail pushing Dave Chinner
2011-08-25 20:57   ` Alex Elder
2011-08-25 23:47     ` Dave Chinner
2011-08-25  7:17 ` [PATCH 5/6] xfs: re-arrange all the xfsbufd delwri queue code Dave Chinner
2011-08-25 20:57   ` Alex Elder
2011-08-25  7:17 ` [PATCH 6/6] xfs: convert xfsbufd to use a workqueue Dave Chinner
2011-08-25 20:57   ` Alex Elder
2011-08-25 23:46     ` Dave Chinner
2011-08-26  0:18       ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.