All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] xfs: delayed allocation @ ENOSPC fixes
@ 2010-03-04  1:46 Dave Chinner
  2010-03-04  1:46 ` [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Dave Chinner @ 2010-03-04  1:46 UTC (permalink / raw)
  To: xfs

The first patch fixes an extreme slowdown when lots of concurrent threads are
trying to flush inodes - xfssyncd can "miss" work that is queued and so can
result in a flush waiting for 30s for the next timeout. Test 225 can take 15
minutes to run without this fix. Withthe fix it consistently takes about 50s.

The second patch prevents stale delayed allocation mappings from being left on
inodes when we discard a page in writeback due to an IO error or an ENOSPC
condition. The stale mappings can cause a BUG() to be triggered during
subsequent direct IO reads.

The last patch increases the size of the reserve block pool to reduce the
possibility of getting ENOSPC conditions during delayed allocation that
would trigger the page tossing in the first place. This requires some QA
tests to be updated - that will follow in a separate patch.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd
  2010-03-04  1:46 [PATCH 0/3] xfs: delayed allocation @ ENOSPC fixes Dave Chinner
@ 2010-03-04  1:46 ` Dave Chinner
  2010-03-04 13:50   ` Christoph Hellwig
  2010-03-04  1:46 ` [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback Dave Chinner
  2010-03-04  1:46 ` [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool Dave Chinner
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-03-04  1:46 UTC (permalink / raw)
  To: xfs

xfssyncd processes a queue of work by detaching the queue and
then iterating over all the work items. It then sleeps for a
time period or until new work comes in. If new work is queued
while xfssyncd is actively processing the detached work queue,
it will not process that new work until after a sleep timeout
or the next work event queued wakes it.

Fix this by checking the work queue again before going to sleep.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/linux-2.6/xfs_sync.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_sync.c b/fs/xfs/linux-2.6/xfs_sync.c
index f9fc154..05cd853 100644
--- a/fs/xfs/linux-2.6/xfs_sync.c
+++ b/fs/xfs/linux-2.6/xfs_sync.c
@@ -607,7 +607,8 @@ xfssyncd(
 	set_freezable();
 	timeleft = xfs_syncd_centisecs * msecs_to_jiffies(10);
 	for (;;) {
-		timeleft = schedule_timeout_interruptible(timeleft);
+		if (list_empty(&mp->m_sync_list))
+			timeleft = schedule_timeout_interruptible(timeleft);
 		/* swsusp */
 		try_to_freeze();
 		if (kthread_should_stop() && list_empty(&mp->m_sync_list))
@@ -627,8 +628,7 @@ xfssyncd(
 			list_add_tail(&mp->m_sync_work.w_list,
 					&mp->m_sync_list);
 		}
-		list_for_each_entry_safe(work, n, &mp->m_sync_list, w_list)
-			list_move(&work->w_list, &tmp);
+		list_splice_init(&mp->m_sync_list, &tmp);
 		spin_unlock(&mp->m_sync_lock);
 
 		list_for_each_entry_safe(work, n, &tmp, w_list) {
-- 
1.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-04  1:46 [PATCH 0/3] xfs: delayed allocation @ ENOSPC fixes Dave Chinner
  2010-03-04  1:46 ` [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd Dave Chinner
@ 2010-03-04  1:46 ` Dave Chinner
  2010-03-04 14:28   ` Christoph Hellwig
  2010-03-10  9:12   ` [PATCH 2/3] " Christoph Hellwig
  2010-03-04  1:46 ` [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool Dave Chinner
  2 siblings, 2 replies; 15+ messages in thread
From: Dave Chinner @ 2010-03-04  1:46 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

We currently use block_invalidatepage() to clean up pages where I/O
fails in ->writepage(). Unfortunately, if the page has delalloc
regions on it, we fail to remove the delalloc regions when we
invalidate the page.  This can result in tripping a BUG() in
xfs_get_blocks() later on if a direct IO read is done on that same
region - the delalloc extent is returned when none is supposed to be
there.

Fix this by truncating away the delalloc regions on the page before
invalidating it. Because they are delalloc, we can do this without
needing a transaction. Indeed - if we get ENOSPC errors, we have to
be able to do this truncation without a transaction as there is
no space left for block reservation (typically why we see a ENOSPC
in writeback).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |  106 +++++++++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_bmap.c           |    6 ++-
 2 files changed, 100 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b493c63..65360b5 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -39,6 +39,7 @@
 #include "xfs_iomap.h"
 #include "xfs_vnodeops.h"
 #include "xfs_trace.h"
+#include "xfs_bmap.h"
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
@@ -893,6 +894,100 @@ xfs_cluster_write(
 	}
 }
 
+STATIC void
+xfs_vm_invalidatepage(
+	struct page		*page,
+	unsigned long		offset)
+{
+	trace_xfs_invalidatepage(page->mapping->host, page, offset);
+	block_invalidatepage(page, offset);
+}
+
+/*
+ * If the page has delalloc buffers on it, we need to punch them out before we
+ * invalidate the page. If we don't, we leave a stale delalloc mapping on the
+ * inode that we can trip over later.
+ *
+ * This is not a performance critical path, so for now just do the punching a
+ * buffer head at a time.
+ */
+static void
+xfs_aops_discard_page(
+	struct page		*page)
+{
+	struct inode		*inode = page->mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct buffer_head	*bh, *head;
+	loff_t			offset = page_offset(page);
+	ssize_t			len = 1 << inode->i_blkbits;
+
+	if (!xfs_is_delayed_page(page, IOMAP_DELAY))
+		goto out_invalidate;
+
+	xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+		"page discard on page %p, inode 0x%llx, offset %llu.",
+			page, ip->i_ino, offset);
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	bh = head = page_buffers(page);
+	do {
+		int		done;
+		xfs_fileoff_t	offset_fsb;
+		xfs_bmbt_irec_t	imap;
+		int		nimaps = 1;
+		int		error;
+
+		if (!buffer_delay(bh))
+			goto next_buffer;
+
+		offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+
+		/*
+		 * Map the range first and check that it is a delalloc extent
+		 * before trying to unmap the range. Otherwise we will be
+		 * trying to remove a real extent (which requires a
+		 * transaction) or a hole, which is probably a bad idea...
+		 */
+		error = xfs_bmapi(NULL, ip, offset_fsb, 1,
+				XFS_BMAPI_ENTIRE,  NULL, 0, &imap,
+				&nimaps, NULL, NULL);
+
+		if (error) {
+			/* something screwed, just bail */
+			xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+			"page discard failed delalloc mapping lookup.");
+			break;
+		}
+		if (!nimaps) {
+			/* nothing there */
+			goto next_buffer;
+		}
+		if (imap.br_startblock != DELAYSTARTBLOCK) {
+			/* been converted, ignore */
+			goto next_buffer;
+		}
+		WARN_ON(imap.br_blockcount == 0);
+
+		error = xfs_bunmapi(NULL, ip, offset_fsb, 1, 0, 1, NULL,
+					NULL, NULL, &done);
+
+		if (error) {
+			/* something screwed, just bail */
+			xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+			"page discard unable to remove delalloc mapping.");
+			break;
+		}
+next_buffer:
+		offset += len;
+
+	} while ((bh = bh->b_this_page) != head);
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_invalidate:
+	xfs_vm_invalidatepage(page, 0);
+	return;
+}
+
 /*
  * Calling this without startio set means we are being asked to make a dirty
  * page ready for freeing it's buffers.  When called with startio set then
@@ -1144,7 +1239,7 @@ error:
 	 */
 	if (err != -EAGAIN) {
 		if (!unmapped)
-			block_invalidatepage(page, 0);
+			xfs_aops_discard_page(page);
 		ClearPageUptodate(page);
 	}
 	return err;
@@ -1554,15 +1649,6 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
-STATIC void
-xfs_vm_invalidatepage(
-	struct page		*page,
-	unsigned long		offset)
-{
-	trace_xfs_invalidatepage(page->mapping->host, page, offset);
-	block_invalidatepage(page, offset);
-}
-
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 95a4813..7afd475 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -5218,7 +5218,8 @@ xfs_bunmapi(
 	if (ifp->if_flags & XFS_IFBROOT) {
 		ASSERT(XFS_IFORK_FORMAT(ip, whichfork) == XFS_DINODE_FMT_BTREE);
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
-		cur->bc_private.b.firstblock = *firstblock;
+		cur->bc_private.b.firstblock = firstblock ?
+						*firstblock : NULLFSBLOCK;
 		cur->bc_private.b.flist = flist;
 		cur->bc_private.b.flags = 0;
 	} else
@@ -5503,7 +5504,8 @@ error0:
 		xfs_trans_log_inode(tp, ip, logflags);
 	if (cur) {
 		if (!error) {
-			*firstblock = cur->bc_private.b.firstblock;
+			if (firstblock)
+				*firstblock = cur->bc_private.b.firstblock;
 			cur->bc_private.b.allocated = 0;
 		}
 		xfs_btree_del_cursor(cur,
-- 
1.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool
  2010-03-04  1:46 [PATCH 0/3] xfs: delayed allocation @ ENOSPC fixes Dave Chinner
  2010-03-04  1:46 ` [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd Dave Chinner
  2010-03-04  1:46 ` [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback Dave Chinner
@ 2010-03-04  1:46 ` Dave Chinner
  2010-03-05 15:45   ` Alex Elder
  2 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-03-04  1:46 UTC (permalink / raw)
  To: xfs

The current default size of the reserved blocks pool is easy to deplete
with certain workloads, in particular workloads that do lots of concurrent
delayed allocation extent conversions.  If enough transactions are running
in parallel and the entire pool is consumed then subsequent calls to
xfs_trans_reserve() will fail with ENOSPC.  Also add a rate limited
warning so we know if this starts happening again.

This is an updated version of an old patch from Lachlan McIlroy.

Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_mount.c |   49 +++++++++++++++++++++++++++++--------------------
 1 files changed, 29 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index c207fef..e79b56b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1097,13 +1097,15 @@ xfs_default_resblks(xfs_mount_t *mp)
 	__uint64_t resblks;
 
 	/*
-	 * We default to 5% or 1024 fsbs of space reserved, whichever is smaller.
-	 * This may drive us straight to ENOSPC on mount, but that implies
-	 * we were already there on the last unmount. Warn if this occurs.
+	 * We default to 5% or 8192 fsbs of space reserved, whichever is
+	 * smaller.  This is intended to cover concurrent allocation
+	 * transactions when we initially hit enospc. These each require a 4
+	 * block reservation. Hence by default we cover roughly 2000 concurrent
+	 * allocation reservations.
 	 */
 	resblks = mp->m_sb.sb_dblocks;
 	do_div(resblks, 20);
-	resblks = min_t(__uint64_t, resblks, 1024);
+	resblks = min_t(__uint64_t, resblks, 8192);
 	return resblks;
 }
 
@@ -1417,6 +1419,9 @@ xfs_mountfs(
 	 * when at ENOSPC. This is needed for operations like create with
 	 * attr, unwritten extent conversion at ENOSPC, etc. Data allocations
 	 * are not allowed to use this reserved space.
+	 *
+	 * This may drive us straight to ENOSPC on mount, but that implies
+	 * we were already there on the last unmount. Warn if this occurs.
 	 */
 	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
 		resblks = xfs_default_resblks(mp);
@@ -1725,26 +1730,30 @@ xfs_mod_incore_sb_unlocked(
 				lcounter += rem;
 			}
 		} else {				/* Taking blocks away */
-
 			lcounter += delta;
+			if (lcounter >= 0) {
+				mp->m_sb.sb_fdblocks = lcounter +
+							XFS_ALLOC_SET_ASIDE(mp);
+				return 0;
+			}
 
-		/*
-		 * If were out of blocks, use any available reserved blocks if
-		 * were allowed to.
-		 */
+			/*
+			 * We are out of blocks, use any available reserved
+			 * blocks if were allowed to.
+			 */
+			if (!rsvd)
+				return XFS_ERROR(ENOSPC);
 
-			if (lcounter < 0) {
-				if (rsvd) {
-					lcounter = (long long)mp->m_resblks_avail + delta;
-					if (lcounter < 0) {
-						return XFS_ERROR(ENOSPC);
-					}
-					mp->m_resblks_avail = lcounter;
-					return 0;
-				} else {	/* not reserved */
-					return XFS_ERROR(ENOSPC);
-				}
+			lcounter = (long long)mp->m_resblks_avail + delta;
+			if (lcounter >= 0) {
+				mp->m_resblks_avail = lcounter;
+				return 0;
 			}
+			printk_once(KERN_WARNING
+				"Filesystem \"%s\": reserve blocks depleted! "
+				"Consider increasing reserve pool size.",
+				mp->m_fsname);
+			return XFS_ERROR(ENOSPC);
 		}
 
 		mp->m_sb.sb_fdblocks = lcounter + XFS_ALLOC_SET_ASIDE(mp);
-- 
1.6.5

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd
  2010-03-04  1:46 ` [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd Dave Chinner
@ 2010-03-04 13:50   ` Christoph Hellwig
  2010-03-04 22:56     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2010-03-04 13:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 04, 2010 at 12:46:23PM +1100, Dave Chinner wrote:
> xfssyncd processes a queue of work by detaching the queue and
> then iterating over all the work items. It then sleeps for a
> time period or until new work comes in. If new work is queued
> while xfssyncd is actively processing the detached work queue,
> it will not process that new work until after a sleep timeout
> or the next work event queued wakes it.
> 
> Fix this by checking the work queue again before going to sleep.

Looks good, as does the list_splice cleanup not mentioned in the
changelog.  But I really wonder if we shouldn't just call the
flushing routine directly instead of going through xfssyncd.

The xfssyncd queueing infrastructure is quite a lot of hairy code
just to save a bit of stack space in two places rather high
up in the callchain.


Signed-off-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-04  1:46 ` [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback Dave Chinner
@ 2010-03-04 14:28   ` Christoph Hellwig
  2010-03-04 22:03     ` Dave Chinner
  2010-03-10  9:12   ` [PATCH 2/3] " Christoph Hellwig
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2010-03-04 14:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, Mar 04, 2010 at 12:46:24PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently use block_invalidatepage() to clean up pages where I/O
> fails in ->writepage(). Unfortunately, if the page has delalloc
> regions on it, we fail to remove the delalloc regions when we
> invalidate the page.  This can result in tripping a BUG() in
> xfs_get_blocks() later on if a direct IO read is done on that same
> region - the delalloc extent is returned when none is supposed to be
> there.
> 
> Fix this by truncating away the delalloc regions on the page before
> invalidating it. Because they are delalloc, we can do this without
> needing a transaction. Indeed - if we get ENOSPC errors, we have to
> be able to do this truncation without a transaction as there is
> no space left for block reservation (typically why we see a ENOSPC
> in writeback).

The code looks good to me, but it could use a bit more documentation
from the page description above inside the code.  

> +static void
> +xfs_aops_discard_page(

This seems like a prime candidate for STATIC to make sure the
over-eager compiler doesn't inline it into xfs_page_state_convert.

> -		cur->bc_private.b.firstblock = *firstblock;
> +		cur->bc_private.b.firstblock = firstblock ?
> +						*firstblock : NULLFSBLOCK;
>  		cur->bc_private.b.flist = flist;
>  		cur->bc_private.b.flags = 0;
>  	} else
> @@ -5503,7 +5504,8 @@ error0:
>  		xfs_trans_log_inode(tp, ip, logflags);
>  	if (cur) {
>  		if (!error) {
> -			*firstblock = cur->bc_private.b.firstblock;
> +			if (firstblock)
> +				*firstblock = cur->bc_private.b.firstblock;

I don't think it's a good idea to throw this change in as part of
this patch.  I'd rather throw in the xfs_bmap_init and two
useless arguments for now and clean up the calling conventions later.
For which is more than enough opportunity, e.g. there's a totally
unused delta argument in xfs_bunmapi.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-04 14:28   ` Christoph Hellwig
@ 2010-03-04 22:03     ` Dave Chinner
  2010-03-05  2:00       ` [PATCH v2] " Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-03-04 22:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Mar 04, 2010 at 09:28:51AM -0500, Christoph Hellwig wrote:
> On Thu, Mar 04, 2010 at 12:46:24PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > We currently use block_invalidatepage() to clean up pages where I/O
> > fails in ->writepage(). Unfortunately, if the page has delalloc
> > regions on it, we fail to remove the delalloc regions when we
> > invalidate the page.  This can result in tripping a BUG() in
> > xfs_get_blocks() later on if a direct IO read is done on that same
> > region - the delalloc extent is returned when none is supposed to be
> > there.
> > 
> > Fix this by truncating away the delalloc regions on the page before
> > invalidating it. Because they are delalloc, we can do this without
> > needing a transaction. Indeed - if we get ENOSPC errors, we have to
> > be able to do this truncation without a transaction as there is
> > no space left for block reservation (typically why we see a ENOSPC
> > in writeback).
> 
> The code looks good to me, but it could use a bit more documentation
> from the page description above inside the code.  

Good point. I'll add some comments to it...

> > +static void
> > +xfs_aops_discard_page(
> 
> This seems like a prime candidate for STATIC to make sure the
> over-eager compiler doesn't inline it into xfs_page_state_convert.

Will do.

> > -		cur->bc_private.b.firstblock = *firstblock;
> > +		cur->bc_private.b.firstblock = firstblock ?
> > +						*firstblock : NULLFSBLOCK;
> >  		cur->bc_private.b.flist = flist;
> >  		cur->bc_private.b.flags = 0;
> >  	} else
> > @@ -5503,7 +5504,8 @@ error0:
> >  		xfs_trans_log_inode(tp, ip, logflags);
> >  	if (cur) {
> >  		if (!error) {
> > -			*firstblock = cur->bc_private.b.firstblock;
> > +			if (firstblock)
> > +				*firstblock = cur->bc_private.b.firstblock;
> 
> I don't think it's a good idea to throw this change in as part of
> this patch.  I'd rather throw in the xfs_bmap_init and two
> useless arguments for now and clean up the calling conventions later.
> For which is more than enough opportunity, e.g. there's a totally
> unused delta argument in xfs_bunmapi.

I was undecided about which method to use. In the end I went with
the above because it was easier to check that correct behaviour was
occurring. i.e. if firstblock get used, then I'd see a panic. I'll
switch it to the variable method, and try to find the right
combination of ASSERT() calls to make sure it hasn't been used.

Thanks Christoph.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd
  2010-03-04 13:50   ` Christoph Hellwig
@ 2010-03-04 22:56     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2010-03-04 22:56 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Thu, Mar 04, 2010 at 08:50:50AM -0500, Christoph Hellwig wrote:
> On Thu, Mar 04, 2010 at 12:46:23PM +1100, Dave Chinner wrote:
> > xfssyncd processes a queue of work by detaching the queue and
> > then iterating over all the work items. It then sleeps for a
> > time period or until new work comes in. If new work is queued
> > while xfssyncd is actively processing the detached work queue,
> > it will not process that new work until after a sleep timeout
> > or the next work event queued wakes it.
> > 
> > Fix this by checking the work queue again before going to sleep.
> 
> Looks good, as does the list_splice cleanup not mentioned in the
> changelog.  But I really wonder if we shouldn't just call the
> flushing routine directly instead of going through xfssyncd.
> 
> The xfssyncd queueing infrastructure is quite a lot of hairy code
> just to save a bit of stack space in two places rather high
> up in the callchain.

Yeah, there is a lot cruft in that code. A good amount of cleanup
could be done here, but I'm not going to tackle that today.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-04 22:03     ` Dave Chinner
@ 2010-03-05  2:00       ` Dave Chinner
  2010-03-05  9:21         ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-03-05  2:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs


From: Dave Chinner <dchinner@redhat.com>

We currently use block_invalidatepage() to clean up pages where I/O
fails in ->writepage(). Unfortunately, if the page has delalloc
regions on it, we fail to remove the delalloc regions when we
invalidate the page.  This can result in tripping a BUG() in
xfs_get_blocks() later on if a direct IO read is done on that same
region - the delalloc extent is returned when none is supposed to be
there.

Fix this by truncating away the delalloc regions on the page before
invalidating it. Because they are delalloc, we can do this without
needing a transaction. Indeed - if we get ENOSPC errors, we have to
be able to do this truncation without a transaction as there is
no space left for block reservation (typically why we see a ENOSPC
in writeback).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/linux-2.6/xfs_aops.c |  124 +++++++++++++++++++++++++++++++++++++++----
 1 files changed, 114 insertions(+), 10 deletions(-)

diff --git a/fs/xfs/linux-2.6/xfs_aops.c b/fs/xfs/linux-2.6/xfs_aops.c
index b493c63..e0eb7f1 100644
--- a/fs/xfs/linux-2.6/xfs_aops.c
+++ b/fs/xfs/linux-2.6/xfs_aops.c
@@ -39,6 +39,7 @@
 #include "xfs_iomap.h"
 #include "xfs_vnodeops.h"
 #include "xfs_trace.h"
+#include "xfs_bmap.h"
 #include <linux/mpage.h>
 #include <linux/pagevec.h>
 #include <linux/writeback.h>
@@ -893,6 +894,118 @@ xfs_cluster_write(
 	}
 }
 
+STATIC void
+xfs_vm_invalidatepage(
+	struct page		*page,
+	unsigned long		offset)
+{
+	trace_xfs_invalidatepage(page->mapping->host, page, offset);
+	block_invalidatepage(page, offset);
+}
+
+/*
+ * If the page has delalloc buffers on it, we need to punch them out before we
+ * invalidate the page. If we don't, we leave a stale delalloc mapping on the
+ * inode that can trip a BUG() in xfs_get_blocks() later on if a direct IO read
+ * is done on that same region - the delalloc extent is returned when none is
+ * supposed to be there.
+ *
+ * We prevent this by truncating away the delalloc regions on the page before
+ * invalidating it. Because they are delalloc, we can do this without needing a
+ * transaction. Indeed - if we get ENOSPC errors, we have to be able to do this
+ * truncation without a transaction as there is no space left for block
+ * reservation (typically why we see a ENOSPC in writeback).
+ *
+ * This is not a performance critical path, so for now just do the punching a
+ * buffer head at a time.
+ */
+STATIC void
+xfs_aops_discard_page(
+	struct page		*page)
+{
+	struct inode		*inode = page->mapping->host;
+	struct xfs_inode	*ip = XFS_I(inode);
+	struct buffer_head	*bh, *head;
+	loff_t			offset = page_offset(page);
+	ssize_t			len = 1 << inode->i_blkbits;
+
+	if (!xfs_is_delayed_page(page, IOMAP_DELAY))
+		goto out_invalidate;
+
+	xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+		"page discard on page %p, inode 0x%llx, offset %llu.",
+			page, ip->i_ino, offset);
+
+	xfs_ilock(ip, XFS_ILOCK_EXCL);
+	bh = head = page_buffers(page);
+	do {
+		int		done;
+		xfs_fileoff_t	offset_fsb;
+		xfs_bmbt_irec_t	imap;
+		int		nimaps = 1;
+		int		error;
+		xfs_fsblock_t	firstblock;
+		xfs_bmap_free_t flist;
+
+		if (!buffer_delay(bh))
+			goto next_buffer;
+
+		offset_fsb = XFS_B_TO_FSBT(ip->i_mount, offset);
+
+		/*
+		 * Map the range first and check that it is a delalloc extent
+		 * before trying to unmap the range. Otherwise we will be
+		 * trying to remove a real extent (which requires a
+		 * transaction) or a hole, which is probably a bad idea...
+		 */
+		error = xfs_bmapi(NULL, ip, offset_fsb, 1,
+				XFS_BMAPI_ENTIRE,  NULL, 0, &imap,
+				&nimaps, NULL, NULL);
+
+		if (error) {
+			/* something screwed, just bail */
+			xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+			"page discard failed delalloc mapping lookup.");
+			break;
+		}
+		if (!nimaps) {
+			/* nothing there */
+			goto next_buffer;
+		}
+		if (imap.br_startblock != DELAYSTARTBLOCK) {
+			/* been converted, ignore */
+			goto next_buffer;
+		}
+		WARN_ON(imap.br_blockcount == 0);
+
+		/*
+		 * Note: while we initialise the firstblock/flist pair, they
+		 * should never be used because blocks shoul dnever be
+		 * allocated or freed for a delalloc extent and hence we need
+		 * don't cancel or finish them after the xfs_bunmapi() call.
+		 */
+		xfs_bmap_init(&flist, &firstblock);
+		error = xfs_bunmapi(NULL, ip, offset_fsb, 1, 0, 1, &firstblock,
+					&flist, NULL, &done);
+
+		ASSERT(!flist.xbf_count && !flist.xbf_first);
+		if (error) {
+			/* something screwed, just bail */
+			xfs_fs_cmn_err(CE_ALERT, ip->i_mount,
+			"page discard unable to remove delalloc mapping.");
+			break;
+		}
+next_buffer:
+		offset += len;
+
+	} while ((bh = bh->b_this_page) != head);
+
+	xfs_iunlock(ip, XFS_ILOCK_EXCL);
+out_invalidate:
+	xfs_vm_invalidatepage(page, 0);
+	return;
+}
+
 /*
  * Calling this without startio set means we are being asked to make a dirty
  * page ready for freeing it's buffers.  When called with startio set then
@@ -1144,7 +1257,7 @@ error:
 	 */
 	if (err != -EAGAIN) {
 		if (!unmapped)
-			block_invalidatepage(page, 0);
+			xfs_aops_discard_page(page);
 		ClearPageUptodate(page);
 	}
 	return err;
@@ -1554,15 +1667,6 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
-STATIC void
-xfs_vm_invalidatepage(
-	struct page		*page,
-	unsigned long		offset)
-{
-	trace_xfs_invalidatepage(page->mapping->host, page, offset);
-	block_invalidatepage(page, offset);
-}
-
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH v2] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-05  2:00       ` [PATCH v2] " Dave Chinner
@ 2010-03-05  9:21         ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2010-03-05  9:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Fri, Mar 05, 2010 at 01:00:42PM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> We currently use block_invalidatepage() to clean up pages where I/O
> fails in ->writepage(). Unfortunately, if the page has delalloc
> regions on it, we fail to remove the delalloc regions when we
> invalidate the page.  This can result in tripping a BUG() in
> xfs_get_blocks() later on if a direct IO read is done on that same
> region - the delalloc extent is returned when none is supposed to be
> there.
> 
> Fix this by truncating away the delalloc regions on the page before
> invalidating it. Because they are delalloc, we can do this without
> needing a transaction. Indeed - if we get ENOSPC errors, we have to
> be able to do this truncation without a transaction as there is
> no space left for block reservation (typically why we see a ENOSPC
> in writeback).

Looks good.  I also like the new assert to ensure no blocks actually
get freed.

Reviewed-by: Christoph Hellwig <hch@lst.de>

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool
  2010-03-04  1:46 ` [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool Dave Chinner
@ 2010-03-05 15:45   ` Alex Elder
  2010-03-05 23:26     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Alex Elder @ 2010-03-05 15:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Thu, 2010-03-04 at 12:46 +1100, Dave Chinner wrote:
> The current default size of the reserved blocks pool is easy to deplete
> with certain workloads, in particular workloads that do lots of concurrent
> delayed allocation extent conversions.  If enough transactions are running
> in parallel and the entire pool is consumed then subsequent calls to
> xfs_trans_reserve() will fail with ENOSPC.  Also add a rate limited
> warning so we know if this starts happening again.
> 
> This is an updated version of an old patch from Lachlan McIlroy.

Looks good.  The comment and code rearrangements are an
improvement.

I have also reviewed the other two patches in the series
(including the updated patch 2) and they too look good.

> Signed-off-by: Dave Chinner <david@fromorbit.com>
         So is it got to be fromorbit or redhat?
         (You used both in this series.)

Reviewed-by: Alex Elder <aelder@sgi.com>

> ---
>  fs/xfs/xfs_mount.c |   49 +++++++++++++++++++++++++++++--------------------
>  1 files changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index c207fef..e79b56b 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1097,13 +1097,15 @@ xfs_default_resblks(xfs_mount_t *mp)
>  	__uint64_t resblks;
>  
>  	/*
> -	 * We default to 5% or 1024 fsbs of space reserved, whichever is smaller.
> -	 * This may drive us straight to ENOSPC on mount, but that implies
> -	 * we were already there on the last unmount. Warn if this occurs.
> +	 * We default to 5% or 8192 fsbs of space reserved, whichever is
> +	 * smaller.  This is intended to cover concurrent allocation
> +	 * transactions when we initially hit enospc. These each require a 4
> +	 * block reservation. Hence by default we cover roughly 2000 concurrent
> +	 * allocation reservations.
>  	 */
>  	resblks = mp->m_sb.sb_dblocks;
>  	do_div(resblks, 20);
> -	resblks = min_t(__uint64_t, resblks, 1024);
> +	resblks = min_t(__uint64_t, resblks, 8192);
>  	return resblks;
>  }
>  
> @@ -1417,6 +1419,9 @@ xfs_mountfs(
>  	 * when at ENOSPC. This is needed for operations like create with
>  	 * attr, unwritten extent conversion at ENOSPC, etc. Data allocations
>  	 * are not allowed to use this reserved space.
> +	 *
> +	 * This may drive us straight to ENOSPC on mount, but that implies
> +	 * we were already there on the last unmount. Warn if this occurs.
>  	 */
>  	if (!(mp->m_flags & XFS_MOUNT_RDONLY)) {
>  		resblks = xfs_default_resblks(mp);
> @@ -1725,26 +1730,30 @@ xfs_mod_incore_sb_unlocked(
>  				lcounter += rem;
>  			}
>  		} else {				/* Taking blocks away */
> -
>  			lcounter += delta;
> +			if (lcounter >= 0) {
> +				mp->m_sb.sb_fdblocks = lcounter +
> +							XFS_ALLOC_SET_ASIDE(mp);
> +				return 0;
> +			}
>  
> -		/*
> -		 * If were out of blocks, use any available reserved blocks if
> -		 * were allowed to.
> -		 */

I was just looking at this code yesterday, and got confused
by the indentation of this comment.  You beat me to fixing it.
I also think your rearranging of the logic below is an improvement.

> +			/*
> +			 * We are out of blocks, use any available reserved
> +			 * blocks if were allowed to.
> +			 */
> +			if (!rsvd)
> +				return XFS_ERROR(ENOSPC);
>  
> -			if (lcounter < 0) {
> -				if (rsvd) {
> -					lcounter = (long long)mp->m_resblks_avail + delta;
> -					if (lcounter < 0) {
> -						return XFS_ERROR(ENOSPC);
> -					}
> -					mp->m_resblks_avail = lcounter;
> -					return 0;
> -				} else {	/* not reserved */
> -					return XFS_ERROR(ENOSPC);
> -				}
> +			lcounter = (long long)mp->m_resblks_avail + delta;
> +			if (lcounter >= 0) {
> +				mp->m_resblks_avail = lcounter;
> +				return 0;
>  			}
> +			printk_once(KERN_WARNING
> +				"Filesystem \"%s\": reserve blocks depleted! "
> +				"Consider increasing reserve pool size.",
> +				mp->m_fsname);
> +			return XFS_ERROR(ENOSPC);
>  		}
>  
>  		mp->m_sb.sb_fdblocks = lcounter + XFS_ALLOC_SET_ASIDE(mp);



_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool
  2010-03-05 15:45   ` Alex Elder
@ 2010-03-05 23:26     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2010-03-05 23:26 UTC (permalink / raw)
  To: Alex Elder; +Cc: xfs

On Fri, Mar 05, 2010 at 09:45:53AM -0600, Alex Elder wrote:
> On Thu, 2010-03-04 at 12:46 +1100, Dave Chinner wrote:
> > The current default size of the reserved blocks pool is easy to deplete
> > with certain workloads, in particular workloads that do lots of concurrent
> > delayed allocation extent conversions.  If enough transactions are running
> > in parallel and the entire pool is consumed then subsequent calls to
> > xfs_trans_reserve() will fail with ENOSPC.  Also add a rate limited
> > warning so we know if this starts happening again.
> > 
> > This is an updated version of an old patch from Lachlan McIlroy.
> 
> Looks good.  The comment and code rearrangements are an
> improvement.
> 
> I have also reviewed the other two patches in the series
> (including the updated patch 2) and they too look good.
> 
> > Signed-off-by: Dave Chinner <david@fromorbit.com>
>          So is it got to be fromorbit or redhat?
>          (You used both in this series.)

What each individual patch says.

It depends on the history of the patch to what the sign-off I'll use
on it. This one I pulled from a patch series I had locally that
hadn't been touched for months (i.e. not new work). I simply updated
it for the recent resblks changes....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-04  1:46 ` [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback Dave Chinner
  2010-03-04 14:28   ` Christoph Hellwig
@ 2010-03-10  9:12   ` Christoph Hellwig
  2010-03-10 12:52     ` Dave Chinner
  1 sibling, 1 reply; 15+ messages in thread
From: Christoph Hellwig @ 2010-03-10  9:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

Btw, I can trigger the new warnings from this patch in xfsqa after
a few full XFSQA runs:

[39829.737139] Filesystem "vdb6": page discard on page c21d7480, inode
0x3f8, offset 0.
[39829.739488] Filesystem "vdb6": page discard failed delalloc mapping
lookup.
[39829.744005] Filesystem "vdb6": page discard on page c206ba80, inode
0x3f9, offset 0.
[39829.746496] Filesystem "vdb6": page discard failed delalloc mapping
lookup.
[39829.748655] Filesystem "vdb6": page discard on page c20e02a0, inode
0x3fa, offset 0.
[39829.751000] Filesystem "vdb6": page discard failed delalloc mapping
lookup.


going on for a few pages of kernel messages.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-10  9:12   ` [PATCH 2/3] " Christoph Hellwig
@ 2010-03-10 12:52     ` Dave Chinner
  2010-03-10 18:18       ` Christoph Hellwig
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2010-03-10 12:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: xfs

On Wed, Mar 10, 2010 at 04:12:37AM -0500, Christoph Hellwig wrote:
> Btw, I can trigger the new warnings from this patch in xfsqa after
> a few full XFSQA runs:
> 
> [39829.737139] Filesystem "vdb6": page discard on page c21d7480, inode
> 0x3f8, offset 0.
> [39829.739488] Filesystem "vdb6": page discard failed delalloc mapping
> lookup.
> [39829.744005] Filesystem "vdb6": page discard on page c206ba80, inode
> 0x3f9, offset 0.
> [39829.746496] Filesystem "vdb6": page discard failed delalloc mapping
> lookup.
> [39829.748655] Filesystem "vdb6": page discard on page c20e02a0, inode
> 0x3fa, offset 0.
> [39829.751000] Filesystem "vdb6": page discard failed delalloc mapping
> lookup.

The "page discard on page ...." are expected, but the failed mapping
warnings are not - that means xfs_bmapi() returned an error while
the inode still has delayed allocation buffers in memory. Did this
happen during a forced shutdown?  What test is running when this
occurs?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback
  2010-03-10 12:52     ` Dave Chinner
@ 2010-03-10 18:18       ` Christoph Hellwig
  0 siblings, 0 replies; 15+ messages in thread
From: Christoph Hellwig @ 2010-03-10 18:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, xfs

On Wed, Mar 10, 2010 at 11:52:19PM +1100, Dave Chinner wrote:
> The "page discard on page ...." are expected, but the failed mapping
> warnings are not - that means xfs_bmapi() returned an error while
> the inode still has delayed allocation buffers in memory. Did this
> happen during a forced shutdown?  What test is running when this
> occurs?

This is in test 137 which shuts down the filesystem.  I guess the
warning is indeed due to a shutdown filesystem and we should probably
not bother to warn for that case.  I'll give the obvious patch a try.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2010-03-10 18:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-03-04  1:46 [PATCH 0/3] xfs: delayed allocation @ ENOSPC fixes Dave Chinner
2010-03-04  1:46 ` [PATCH 1/3] xfs: check for more work before sleeping in xfssyncd Dave Chinner
2010-03-04 13:50   ` Christoph Hellwig
2010-03-04 22:56     ` Dave Chinner
2010-03-04  1:46 ` [PATCH 2/3] xfs: truncate delalloc extents when IO fails in writeback Dave Chinner
2010-03-04 14:28   ` Christoph Hellwig
2010-03-04 22:03     ` Dave Chinner
2010-03-05  2:00       ` [PATCH v2] " Dave Chinner
2010-03-05  9:21         ` Christoph Hellwig
2010-03-10  9:12   ` [PATCH 2/3] " Christoph Hellwig
2010-03-10 12:52     ` Dave Chinner
2010-03-10 18:18       ` Christoph Hellwig
2010-03-04  1:46 ` [PATCH 3/3] xfs: Increase the default size of the reserved blocks pool Dave Chinner
2010-03-05 15:45   ` Alex Elder
2010-03-05 23:26     ` Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.