* [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption....
@ 2014-03-21 10:11 Dave Chinner
  2014-03-21 10:11 ` [PATCH 1/6] xfs: kill buffers over failed write ranges properly Dave Chinner
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

Hi folks,

This patch series mostly shuts a can of worms that Al opened when he
found the cause of the generic/263 fsx failures. The fix for that is
patch 6 of this series, but, well, there are a bunch of other
problems that need to be fixed before making that change.

Basically, the direct IO block mapping behaviour was covering up a
bunch of other bugs in the delayed allocation extent/page cache
state coherency mappings. Essentially, we punch out the page cache
in quite a few places without first cleaning up delayed allocation
extents over that range and that exposes all sorts of nasty issues
once the direct IO mapping changes are made.  All of these are
existing problems, most of them are very unlikely to be seen in the
wild.
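
As a rough sketch of the invariant the later patches enforce (the
helpers are real, but the context and locking here are simplified
for illustration), removing page cache over a delalloc-backed range
needs to look like this:

	/*
	 * Punch out the delayed allocation extents first, then the
	 * page cache, so the extent tree and the page cache stay
	 * coherent with each other.
	 */
	xfs_ilock(ip, XFS_ILOCK_EXCL);
	error = xfs_bmap_punch_delalloc_range(ip,
			XFS_B_TO_FSB(mp, start),
			XFS_B_TO_FSB(mp, end - start));
	xfs_iunlock(ip, XFS_ILOCK_EXCL);
	truncate_pagecache_range(VFS_I(ip), start, end - 1);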

This patch set passes xfstests on a 4k block size/4k page size
config without problems. However, there is still a fsx failure in
generic/127 on 1k block size/4k page size configurations that I
haven't yet tracked down. That test was failing occasionally before
this patch set as well, so it may be a completely unrelated problem.

The sad fact of this patchset is that it is mostly playing whack-a-mole
with visible symptoms of bugs.  It drives home the fact that
bufferheads and the keeping of internal filesystem state attached to
the page cache simply isn't a verifiable architecture.  After
spending several days of doing nothing else but tracking down these
inconsistencies, I can only conclude that the code is complex,
fragile, and extremely difficult to verify as behaving correctly.
As such, I doubt that the fixes are entirely correct, so I'm left
with using fsx and fsstress to tell me if I've broken anything.

Eyeballs appreciated, as are test results.

Cheers,

Dave.


* [PATCH 1/6] xfs: kill buffers over failed write ranges properly
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-03-21 10:11 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

When a write fails, if we don't clear the delalloc flags from the
buffers over the failed range, they can persist beyond EOF and cause
problems. Writeback will see the pages in the page cache, see they
are dirty and continually retry the write, assuming that the page
beyond EOF is just racing with a truncate. The page will eventually
be released due to some other operation (e.g. direct IO), and it
will not pass through invalidation because it is dirty. Hence it
will be released with buffer_delay set on it, and trigger warnings
in xfs_vm_releasepage() and an assert failure in
xfs_file_aio_write_direct() because invalidation failed and we didn't
write the correct amount.

This causes failures on block size < page size filesystems in fsx
and fsstress workloads run by xfstests.

Fix it by completely trashing any state on the buffer that could be
used to imply that it contains valid data when the delalloc range
over the buffer is punched out during the failed write handling.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 98016b3..e810243 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1566,6 +1566,16 @@ xfs_vm_write_failed(
 
 		xfs_vm_kill_delalloc_range(inode, block_offset,
 					   block_offset + bh->b_size);
+
+		/*
+		 * This buffer does not contain data anymore. Make sure anyone
+		 * who finds it knows that for certain.
+		 */
+		clear_buffer_delay(bh);
+		clear_buffer_uptodate(bh);
+		clear_buffer_mapped(bh);
+		clear_buffer_new(bh);
+		clear_buffer_dirty(bh);
 	}
 
 }
-- 
1.9.0


* [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
  2014-03-21 10:11 ` [PATCH 1/6] xfs: kill buffers over failed write ranges properly Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-03-29 15:14   ` Brian Foster
  2014-03-21 10:11 ` [PATCH 3/6] xfs: xfs_vm_write_end truncates too much on failure Dave Chinner
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

If we fail a write beyond EOF and have to handle it in
xfs_vm_write_begin(), we truncate the inode back to the current inode
size. This doesn't take into account the fact that we may have
already made successful writes to the same page (in the case of block
size < page size) and hence we can truncate the page cache away from
blocks with valid data in them. If these blocks are delayed
allocation blocks, we now have a mismatch between the page cache and
the extent tree, and this will trigger - at minimum - a delayed
block count mismatch assert when the inode is evicted from the cache.
We can also trip over it when block mapping for direct IO - this is
the most common symptom seen from fsx and fsstress when run from
xfstests.

Fix it by only truncating away the exact range we are updating state
for in this write_begin call.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index e810243..6b4ecc8 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
 	status = __block_write_begin(page, pos, len, xfs_get_blocks);
 	if (unlikely(status)) {
 		struct inode	*inode = mapping->host;
+		size_t		isize = i_size_read(inode);
 
 		xfs_vm_write_failed(inode, page, pos, len);
 		unlock_page(page);
 
-		if (pos + len > i_size_read(inode))
-			truncate_pagecache(inode, i_size_read(inode));
+		/*
+		 * If the write is beyond EOF, we only want to kill blocks
+		 * allocated in this write, not blocks that were previously
+		 * written successfully.
+		 */
+		if (pos + len > isize)
+			truncate_pagecache_range(inode, pos, pos + len);
 
 		page_cache_release(page);
 		page = NULL;
-- 
1.9.0


* [PATCH 3/6] xfs: xfs_vm_write_end truncates too much on failure
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
  2014-03-21 10:11 ` [PATCH 1/6] xfs: kill buffers over failed write ranges properly Dave Chinner
  2014-03-21 10:11 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-03-21 10:11 ` [PATCH 4/6] xfs: zeroing space needs to punch delalloc blocks Dave Chinner
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

Similar to the write_begin problem, xfs_vm_write_end() will truncate
back to the old EOF, potentially removing page cache from over the
top of delalloc blocks with valid data in them. Fix this by
truncating back to just the start of the failed write.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6b4ecc8..8bbd8ba 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1631,9 +1631,12 @@ xfs_vm_write_begin(
 }
 
 /*
- * On failure, we only need to kill delalloc blocks beyond EOF because they
- * will never be written. For blocks within EOF, generic_write_end() zeros them
- * so they are safe to leave alone and be written with all the other valid data.
+ * On failure, we only need to kill delalloc blocks beyond EOF in the range of
+ * this specific write because they will never be written. Previous writes
+ * beyond EOF where block allocation succeeded do not need to be trashed, so
+ * only new blocks from this write should be trashed. For blocks within
+ * EOF, generic_write_end() zeros them so they are safe to leave alone and be
+ * written with all the other valid data.
  */
 STATIC int
 xfs_vm_write_end(
@@ -1656,8 +1659,11 @@ xfs_vm_write_end(
 		loff_t		to = pos + len;
 
 		if (to > isize) {
-			truncate_pagecache(inode, isize);
+			/* only kill blocks in this write beyond EOF */
+			if (pos > isize)
+				isize = pos;
 			xfs_vm_kill_delalloc_range(inode, isize, to);
+			truncate_pagecache_range(inode, isize, to);
 		}
 	}
 	return ret;
-- 
1.9.0


* [PATCH 4/6] xfs: zeroing space needs to punch delalloc blocks
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
                   ` (2 preceding siblings ...)
  2014-03-21 10:11 ` [PATCH 3/6] xfs: xfs_vm_write_end truncates too much on failure Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-03-21 10:11 ` [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation Dave Chinner
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

When we are zeroing space and it is covered by a delalloc range, we
need to punch the delalloc range out before we truncate the page
cache. Failing to do so leaves an inconsistency between the page
cache and the extent tree, which we later trip over when doing
direct IO over the same range.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 13 ++++++++++++-
 fs/xfs/xfs_trace.h     |  1 +
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 01f6a64..3235b74 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1418,6 +1418,8 @@ xfs_zero_file_space(
 	xfs_off_t		end_boundary;
 	int			error;
 
+	trace_xfs_zero_file_space(ip);
+
 	granularity = max_t(uint, 1 << mp->m_sb.sb_blocklog, PAGE_CACHE_SIZE);
 
 	/*
@@ -1432,9 +1434,18 @@ xfs_zero_file_space(
 	ASSERT(end_boundary <= offset + len);
 
 	if (start_boundary < end_boundary - 1) {
-		/* punch out the page cache over the conversion range */
+		/*
+		 * punch out delayed allocation blocks and the page cache over
+		 * the conversion range
+		 */
+		xfs_ilock(ip, XFS_ILOCK_EXCL);
+		error = xfs_bmap_punch_delalloc_range(ip,
+				XFS_B_TO_FSB(mp, start_boundary),
+				XFS_B_TO_FSB(mp, end_boundary - start_boundary));
+		xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		truncate_pagecache_range(VFS_I(ip), start_boundary,
 					 end_boundary - 1);
+
 		/* convert the blocks */
 		error = xfs_alloc_file_space(ip, start_boundary,
 					end_boundary - start_boundary - 1,
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index a4ae41c..65d8c79 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -603,6 +603,7 @@ DEFINE_INODE_EVENT(xfs_readlink);
 DEFINE_INODE_EVENT(xfs_inactive_symlink);
 DEFINE_INODE_EVENT(xfs_alloc_file_space);
 DEFINE_INODE_EVENT(xfs_free_file_space);
+DEFINE_INODE_EVENT(xfs_zero_file_space);
 DEFINE_INODE_EVENT(xfs_collapse_file_space);
 DEFINE_INODE_EVENT(xfs_readdir);
 #ifdef CONFIG_XFS_POSIX_ACL
-- 
1.9.0


* [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
                   ` (3 preceding siblings ...)
  2014-03-21 10:11 ` [PATCH 4/6] xfs: zeroing space needs to punch delalloc blocks Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-04-04 13:37   ` Brian Foster
  2014-03-21 10:11 ` [PATCH 6/6] xfs: don't map ranges that span EOF for direct IO Dave Chinner
  2014-03-31 17:22 ` [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Brian Foster
  6 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

When we punch a hole in a delalloc extent, we split the indirect
block reservation between the two new extents. If we repeatedly
punch holes in a large delalloc extent, that reservation will
eventually run out and we'll assert fail in xfs_bunmapi() because
the indirect block reservation for the delalloc extent is zero. This
is caused by doing a large delalloc write, then zeroing multiple
ranges of that write using fallocate to punch lots of holes in the
delayed allocation range.
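
An untested reproduction sketch (sizes arbitrary, using xfs_io's
zero-range command) would be something along these lines:

	# large delalloc write, then zero-range lots of holes into it
	xfs_io -f -c "pwrite 0 64m" \
		  -c "zero 4m 1m" \
		  -c "zero 8m 1m" \
		  -c "zero 12m 1m" \
		  /mnt/testfile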

To avoid this problem, if we split the reservation and require more
indirect blocks for the two new extents than we had for the old
reservation, steal the additional blocks from the hole we punched in
the extent. In most cases we only need a single extra block, so even
if we punch only single block holes we can still retain sufficient
indirect block reservations to avoid problems.
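
As an illustrative example (the numbers are made up): take a delalloc
extent carrying an indlen reservation of 3 blocks and punch a hole in
the middle of it. xfs_bmap_worst_indlen() may now want 2 blocks for
each of the two remaining extents, so the new reservations total 4
blocks against the old 3. Rather than alternately whittling one side
down towards zero, we steal the 1 block shortfall from the data
reservation of the range being punched out.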

In doing this "stealing", we need to change where we account for the
delalloc blocks being freed. The block count held on the inode does
not take into account the indirect block reservation, so we still
need to do that before we free the extent. However, the accounting
for free space in the superblock needs to be done after we've stolen
the blocks from the freed extent so that they are accounted for
correctly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap.c | 65 ++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
index 5b6092e..4bf6a0e 100644
--- a/fs/xfs/xfs_bmap.c
+++ b/fs/xfs/xfs_bmap.c
@@ -4945,7 +4945,27 @@ xfs_bmap_del_extent(
 			temp2 = xfs_bmap_worst_indlen(ip, temp2);
 			new.br_startblock = nullstartblock((int)temp2);
 			da_new = temp + temp2;
+
+			/*
+			 * Note: if we have an odd number of blocks reserved,
+			 * then if we keep splitting the delalloc extent like
+			 * this we end up with a delalloc indlen reservation of
+			 * zero for one of the two extents. Hence if we end
+			 * up with the new indlen reservations being larger than
+			 * the old one, steal blocks from the data reservation
+			 * we just punched out. Otherwise, just reduce the
+			 * remaining indlen reservations alternately and hope
+			 * next time we come here the range getting removed is
+			 * large enough to fix this all up.
+			 */
 			while (da_new > da_old) {
+				if (del->br_blockcount) {
+					/* steal a block */
+					da_new--;
+					del->br_blockcount--;
+					continue;
+				}
+
 				if (temp) {
 					temp--;
 					da_new--;
@@ -5255,24 +5275,6 @@ xfs_bunmapi(
 		}
 		if (wasdel) {
 			ASSERT(startblockval(del.br_startblock) > 0);
-			/* Update realtime/data freespace, unreserve quota */
-			if (isrt) {
-				xfs_filblks_t rtexts;
-
-				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
-				do_div(rtexts, mp->m_sb.sb_rextsize);
-				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
-						(int64_t)rtexts, 0);
-				(void)xfs_trans_reserve_quota_nblks(NULL,
-					ip, -((long)del.br_blockcount), 0,
-					XFS_QMOPT_RES_RTBLKS);
-			} else {
-				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
-						(int64_t)del.br_blockcount, 0);
-				(void)xfs_trans_reserve_quota_nblks(NULL,
-					ip, -((long)del.br_blockcount), 0,
-					XFS_QMOPT_RES_REGBLKS);
-			}
 			ip->i_delayed_blks -= del.br_blockcount;
 			if (cur)
 				cur->bc_private.b.flags |=
@@ -5302,6 +5304,33 @@ xfs_bunmapi(
 		}
 		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur, &del,
 				&tmp_logflags, whichfork);
+		/*
+		 * xfs_bmap_del_extent may hand delayed alloc blocks back to the
+		 * indirect block reservations to keep extent split reservations
+		 * sane. Hence we should only decrement the delayed block count
+		 * on the inode once we know exactly the amount of delalloc
+		 * space we actually removed from the inode.
+		 */
+		if (wasdel && del.br_blockcount) {
+			/* Update realtime/data freespace, unreserve quota */
+			if (isrt) {
+				xfs_filblks_t rtexts;
+
+				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
+				do_div(rtexts, mp->m_sb.sb_rextsize);
+				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
+						(int64_t)rtexts, 0);
+				(void)xfs_trans_reserve_quota_nblks(NULL,
+					ip, -((long)del.br_blockcount), 0,
+					XFS_QMOPT_RES_RTBLKS);
+			} else {
+				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
+						(int64_t)del.br_blockcount, 0);
+				(void)xfs_trans_reserve_quota_nblks(NULL,
+					ip, -((long)del.br_blockcount), 0,
+					XFS_QMOPT_RES_REGBLKS);
+			}
+		}
 		logflags |= tmp_logflags;
 		if (error)
 			goto error0;
-- 
1.9.0


* [PATCH 6/6] xfs: don't map ranges that span EOF for direct IO
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
                   ` (4 preceding siblings ...)
  2014-03-21 10:11 ` [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation Dave Chinner
@ 2014-03-21 10:11 ` Dave Chinner
  2014-03-31 17:22 ` [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Brian Foster
  6 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-03-21 10:11 UTC (permalink / raw)
  To: xfs; +Cc: Viro, viro, Al

From: Dave Chinner <dchinner@redhat.com>

Al Viro tracked down the problem that has caused generic/263 to fail
on XFS since the test was introduced. It is caused by
xfs_get_blocks() mapping a single extent that spans EOF without
marking it buffer_new(), so the direct IO code does not zero
the tail of the block at the new EOF. This is a long-standing bug
that has been around for many, many years.

Because xfs_get_blocks() starts the map before EOF, it can't set
buffer_new(), because that causes the direct IO code to also zero
unaligned sectors at the head of the IO. This would overwrite valid
data with zeros, and hence we cannot validly return a single extent
that spans EOF to direct IO.

Fix this by detecting a mapping that spans EOF and truncating it down
to EOF. This results in the direct IO code doing the right thing
for unaligned data blocks before EOF, and then returning to get
another mapping for the region beyond EOF, which XFS treats correctly
by setting buffer_new() on it. This makes direct IO behave correctly
w.r.t. tail block zeroing beyond EOF, and fsx is happy about that.
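
As an illustrative example (numbers made up, 4k block size): with
i_size at 6k, a direct IO mapping request at offset 0 that could
return a 16k extent is now clipped to roundup(6k - 0, 4k) = 8k, so
the first mapping ends with the block that spans EOF. The next
get_blocks call, for the range beyond 8k, returns a mapping for newly
allocated blocks and hence is marked buffer_new(), so its sub-block
regions are zeroed correctly.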

Again, thanks to Al Viro for finding what I couldn't.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 8bbd8ba..d5702f0 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1344,6 +1344,14 @@ __xfs_get_blocks(
 	/*
 	 * If this is O_DIRECT or the mpage code calling tell them how large
 	 * the mapping is, so that we can avoid repeated get_blocks calls.
+	 *
+	 * If the mapping spans EOF, then we have to break the mapping up as the
+	 * mapping for blocks beyond EOF must be marked new so that sub block
+	 * regions can be correctly zeroed. We can't do this for mappings within
+	 * EOF unless the mapping was just allocated or is unwritten, otherwise
+	 * the callers would overwrite existing data with zeros. Hence we have
+	 * to split the mapping into a range up to and including EOF, and a
+	 * second mapping for beyond EOF.
 	 */
 	if (direct || size > (1 << inode->i_blkbits)) {
 		xfs_off_t		mapping_size;
@@ -1354,6 +1362,12 @@ __xfs_get_blocks(
 		ASSERT(mapping_size > 0);
 		if (mapping_size > size)
 			mapping_size = size;
+		if (offset < i_size_read(inode) &&
+		    offset + mapping_size >= i_size_read(inode)) {
+			/* limit mapping to block that spans EOF */
+			mapping_size = roundup(i_size_read(inode) - offset,
+					       1 << inode->i_blkbits);
+		}
 		if (mapping_size > LONG_MAX)
 			mapping_size = LONG_MAX;
 
-- 
1.9.0


* Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
  2014-03-21 10:11 ` [PATCH 2/6] xfs: write failure beyond EOF truncates too much data Dave Chinner
@ 2014-03-29 15:14   ` Brian Foster
  2014-04-04 15:26     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-03-29 15:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al, Viro, viro, xfs

On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we fail a write beyond EOF and have to handle it in
> xfs_vm_write_begin(), we truncate the inode back to the current inode
> size. This doesn't take into account the fact that we may have
> already made successful writes to the same page (in the case of block
> size < page size) and hence we can truncate the page cache away from
> blocks with valid data in them. If these blocks are delayed
> allocation blocks, we now have a mismatch between the page cache and
> the extent tree, and this will trigger - at minimum - a delayed
> block count mismatch assert when the inode is evicted from the cache.
> We can also trip over it when block mapping for direct IO - this is
> the most common symptom seen from fsx and fsstress when run from
> xfstests.
> 
> Fix it by only truncating away the exact range we are updating state
> for in this write_begin call.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index e810243..6b4ecc8 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
>  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
>  	if (unlikely(status)) {
>  		struct inode	*inode = mapping->host;
> +		size_t		isize = i_size_read(inode);
>  
>  		xfs_vm_write_failed(inode, page, pos, len);
>  		unlock_page(page);
>  
> -		if (pos + len > i_size_read(inode))
> -			truncate_pagecache(inode, i_size_read(inode));

From what I can see, we have a write_begin/copy/write_end sequence per
page and the inode size is updated down in the write_end path. If the
write fails at write_begin time, then we haven't copied any data and the
inode size hasn't changed.

The intent of the original code looks like it wants to break off any
blocks that might have been set up beyond EOF before the write happened
to fail. Could you elaborate on how this can kill blocks that might have
been written successfully? Doesn't this only affect post-EOF metadata?

> +		/*
> +		 * If the write is beyond EOF, we only want to kill blocks
> +		 * allocated in this write, not blocks that were previously
> +		 * written successfully.
> +		 */
> +		if (pos + len > isize)
> +			truncate_pagecache_range(inode, pos, pos + len);

So the cache truncate now covers the range of the write. What happens if
part of the write is an overwrite? This seems similar to the problem
you've described above... should the truncate start at isize instead?

Brian

>  
>  		page_cache_release(page);
>  		page = NULL;
> -- 
> 1.9.0
> 

* Re: [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption....
  2014-03-21 10:11 [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Dave Chinner
                   ` (5 preceding siblings ...)
  2014-03-21 10:11 ` [PATCH 6/6] xfs: don't map ranges that span EOF for direct IO Dave Chinner
@ 2014-03-31 17:22 ` Brian Foster
  2014-03-31 20:17   ` Dave Chinner
  6 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-03-31 17:22 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al, Viro, viro, xfs

On Fri, Mar 21, 2014 at 09:11:44PM +1100, Dave Chinner wrote:
> Hi folks,
> 
> This patch series mostly shuts a can of worms that Al opened when he
> found the cause of the generic/263 fsx failures. The fix for that is
> patch 6 of this series, but, well, there are a bunch of other
> problems that need to be fixed before making that change.
> 
> Basically, the direct IO block mapping behaviour was covering up a
> bunch of other bugs in the delayed allocation extent/page cache
> state coherency mappings. Essentially, we punch out the page cache
> in quite a few places without first cleaning up delayed allocation
> extents over that range and that exposes all sorts of nasty issues
> once the direct IO mapping changes are made.  All of these are
> existing problems, most of them are very unlikely to be seen in the
> wild.
> 
> This patch set passes xfstests on a 4k block size/4k page size
> config without problems. However, there is still a fsx failure in
> generic/127 on 1k block size/4k page size configurations that I
> haven't yet tracked down. That test was failing occasionally before
> this patch set as well, so it may be a completely unrelated problem.
> 
> The sad fact of this patchset is that it is mostly playing whack-a-mole
> with visible symptoms of bugs.  It drives home the fact that
> bufferheads and the keeping of internal filesystem state attached to
> the page cache simply isn't a verifiable architecture.  After
> spending several days of doing nothing else but tracking down these
> inconsistencies, I can only conclude that the code is complex,
> fragile, and extremely difficult to verify as behaving correctly.
> As such, I doubt that the fixes are entirely correct, so I'm left
> with using fsx and fsstress to tell me if I've broken anything.
> 
> Eyeballs appreciated, as are test results.
> 

I had xfstests running against this (on for-next) over the weekend
and it hit the following bug on xfs/297:

[ 6408.168767] kernel BUG at fs/xfs/xfs_aops.c:1336!
[ 6408.169542] invalid opcode: 0000 [#1] SMP 
[ 6408.169542] Modules linked in: loop xfs libcrc32c ip6t_rpfilter ip6t_REJECT xt_conntrack cfg80211 rfkill ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ppdev snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device microcode snd_pcm serio_raw virtio_balloon virtio_console snd_timer snd parport_pc parport soundcore i2c_piix4 nfsd auth_rpcgss nfs_acl lockd sunrpc virtio_blk virtio_net qxl drm_kms_helper ttm drm i2c_core ata_generic virtio_pci virtio_ring virtio pata_acpi [last unloaded: scsi_debug]
[ 6408.169542] CPU: 0 PID: 28956 Comm: fsstress Not tainted 3.14.0-rc1+ #11
[ 6408.169542] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[ 6408.169542] task: ffff880074860000 ti: ffff880074868000 task.ti: ffff880074868000
[ 6408.169542] RIP: 0010:[<ffffffffa041a4b0>]  [<ffffffffa041a4b0>] __xfs_get_blocks+0x7f0/0x800 [xfs]
[ 6408.169542] RSP: 0018:ffff880074869a00  EFLAGS: 00010202
[ 6408.169542] RAX: ffffffffffffffff RBX: 00000000000fa000 RCX: 00000000000000fa
[ 6408.169542] RDX: ffff8800d5686cc0 RSI: 0000000000000001 RDI: 0000000000000246
[ 6408.169542] RBP: ffff880074869a78 R08: 0000000000000113 R09: 0000000000000000
[ 6408.169542] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000019000
[ 6408.169542] R13: ffff8800a38b9350 R14: ffff8800aa62e7b0 R15: ffff880074869b80
[ 6408.169542] FS:  00007f3ddd3bd740(0000) GS:ffff88011ae00000(0000) knlGS:0000000000000000
[ 6408.169542] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6408.169542] CR2: 00007f3dd801c000 CR3: 0000000077977000 CR4: 00000000000006f0
[ 6408.169542] Stack:
[ 6408.169542]  00007f3d00000000 00007f3d00000008 ffffff01000f18cd 00000000000000fa
[ 6408.169542]  ffff8800a38b9080 00000001aab24840 00000000000000c3 ffffffffffffffff
[ 6408.169542]  000000000000003d ffff880000000000 0000000000000e00 0000000000000003
[ 6408.169542] Call Trace:
[ 6408.169542]  [<ffffffffa041a4f4>] xfs_get_blocks_direct+0x14/0x20 [xfs]
[ 6408.169542]  [<ffffffff81263194>] do_blockdev_direct_IO+0x10e4/0x2ad0
[ 6408.169542]  [<ffffffff811a1355>] ? find_get_pages_tag+0x25/0x310
[ 6408.169542]  [<ffffffffa041a4e0>] ? xfs_get_blocks+0x20/0x20 [xfs]
[ 6408.169542]  [<ffffffff81264bd5>] __blockdev_direct_IO+0x55/0x60
[ 6408.169542]  [<ffffffffa041a4e0>] ? xfs_get_blocks+0x20/0x20 [xfs]
[ 6408.169542]  [<ffffffffa0418512>] xfs_vm_direct_IO+0x152/0x170 [xfs]
[ 6408.169542]  [<ffffffffa041a4e0>] ? xfs_get_blocks+0x20/0x20 [xfs]
[ 6408.169542]  [<ffffffff811a365d>] generic_file_aio_read+0x6fd/0x760
[ 6408.169542]  [<ffffffff8178219e>] ? mutex_unlock+0xe/0x10
[ 6408.169542]  [<ffffffff811ce998>] ? unmap_mapping_range+0x88/0x170
[ 6408.169542]  [<ffffffff810f18cd>] ? trace_hardirqs_on+0xd/0x10
[ 6408.169542]  [<ffffffffa042847c>] xfs_file_aio_read+0x14c/0x3b0 [xfs]
[ 6408.169542]  [<ffffffff8121e85a>] do_sync_read+0x5a/0x90
[ 6408.169542]  [<ffffffff8121ef1e>] vfs_read+0x9e/0x170
[ 6408.169542]  [<ffffffff8121fa1c>] SyS_read+0x4c/0xa0
[ 6408.169542]  [<ffffffff8114a2cc>] ? __audit_syscall_entry+0x9c/0xf0
[ 6408.169542]  [<ffffffff8178eda9>] system_call_fastpath+0x16/0x1b
[ 6408.169542] Code: 85 2e fd ff ff e9 10 ff ff ff 48 c7 c7 e0 11 c5 81 4c 89 55 90 e8 81 47 cd e0 85 c0 4c 8b 55 90 0f 85 e2 fd ff ff e9 7a ff ff ff <0f> 0b 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 
[ 6408.200988] RIP  [<ffffffffa041a4b0>] __xfs_get_blocks+0x7f0/0x800 [xfs]
[ 6408.200988]  RSP <ffff880074869a00>
[ 6408.203708] ---[ end trace 40a923b54ddca373 ]---

Brian

> Cheers,
> 
> Dave.
> 

* Re: [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption....
  2014-03-31 17:22 ` [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption Brian Foster
@ 2014-03-31 20:17   ` Dave Chinner
  2014-04-01 11:54     ` Brian Foster
  0 siblings, 1 reply; 15+ messages in thread
From: Dave Chinner @ 2014-03-31 20:17 UTC (permalink / raw)
  To: Brian Foster; +Cc: Al, Viro, viro, xfs

On Mon, Mar 31, 2014 at 01:22:43PM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:44PM +1100, Dave Chinner wrote:
> > Hi folks,
> > 
> > This patch series mostly shuts a can of worms that Al opened when he
> > found the cause of the generic/263 fsx failures. The fix for that is
> > patch 6 of this series, but, well, there are a bunch of other
> > problems that need to be fixed before making that change.
> > 
> > Basically, the direct IO block mapping behaviour was covering up a
> > bunch of other bugs in the delayed allocation extent/page cache
> > state coherency mappings. Essentially, we punch out the page cache
> > in quite a few places without first cleaning up delayed allocation
> > extents over that range and that exposes all sorts of nasty issues
> > once the direct IO mapping changes are made.  All of these are
> > existing problems, most of them are very unlikely to be seen in the
> > wild.
> > 
> > This patch set passes xfstests on a 4k block size/4k page size
> > config without problems. However, there is still a fsx failure in
> > generic/127 on 1k block size/4k page size configurations that I
> > haven't yet tracked down. That test was failing occasionally before
> > this patch set as well, so it may be a completely unrelated problem.
> > 
> > The sad fact of this patchset is that it is mostly playing whack-a-mole
> > with visible symptoms of bugs.  It drives home the fact that
> > bufferheads and the keeping of internal filesystem state attached to
> > the page cache simply isn't a verifiable architecture.  After
> > spending several days of doing nothing else but tracking down these
> > inconsistencies, I can only conclude that the code is complex,
> > fragile, and extremely difficult to verify as behaving correctly.
> > As such, I doubt that the fixes are entirely correct, so I'm left
> > with using fsx and fsstress to tell me if I've broken anything.
> > 
> > Eyeballs appreciated, as are test results.
> > 
> 
> I had xfstests running against this (on for-next) over the weekend
> and it hit the following bug on xfs/297:
> 
> [ 6408.168767] kernel BUG at fs/xfs/xfs_aops.c:1336!
> [ 6408.169542] invalid opcode: 0000 [#1] SMP 

Ok, so that's found another stale delalloc range where there
shouldn't be. I know there were still problems when I left because
generic/127 was failing on 1k block size filesystems, but I haven't
yet had a chance to get back to determine if the bug was the broken
code in xfs_check_page_type() that Dan Carpenter noticed. Were you
running with that fix?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC, PATCH 0/6] xfs: delalloc, DIO and corruption....
  2014-03-31 20:17   ` Dave Chinner
@ 2014-04-01 11:54     ` Brian Foster
  0 siblings, 0 replies; 15+ messages in thread
From: Brian Foster @ 2014-04-01 11:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, Viro, viro, Al

On Tue, Apr 01, 2014 at 07:17:57AM +1100, Dave Chinner wrote:
> On Mon, Mar 31, 2014 at 01:22:43PM -0400, Brian Foster wrote:
> > On Fri, Mar 21, 2014 at 09:11:44PM +1100, Dave Chinner wrote:
> > > Hi folks,
> > > 
> > > This patch series mostly shuts a can of worms that Al opened when he
> > > found the cause of the generic/263 fsx failures. The fix for that is
> > > patch 6 of this series, but, well, there are a bunch of other
> > > problems that need to be fixed before making that change.
> > > 
> > > Basically, the direct IO block mapping behaviour was covering up a
> > > bunch of other bugs in the delayed allocation extent/page cache
> > > state coherency mappings. Essentially, we punch out the page cache
> > > in quite a few places without first cleaning up delayed allocation
> > > extents over that range and that exposes all sorts of nasty issues
> > > once the direct IO mapping changes are made.  All of these are
> > > existing problems, most of them are very unlikely to be seen in the
> > > wild.
> > > 
> > > This patch set passes xfstests on a 4k block size/4k page size
> > > config without problems. However, there is still a fsx failure in
> > > generic/127 on 1k block size/4k page size configurations that I
> > > haven't yet tracked down. That test was failing occasionally before
> > > this patch set as well, so it may be a completely unrelated problem.
> > > 
> > > The sad fact of this patchset is that it is mostly playing whack-a-mole
> > > with visible symptoms of bugs.  It drives home the fact that
> > > bufferheads and the keeping of internal filesystem state attached to
> > > the page cache simply isn't a verifiable architecture.  After
> > > spending several days of doing nothing else but tracking down these
> > > inconsistencies, I can only conclude that the code is complex,
> > > fragile, and extremely difficult to verify as behaving correctly.
> > > As such, I doubt that the fixes are entirely correct, so I'm left
> > > with using fsx and fsstress to tell me if I've broken anything.
> > > 
> > > Eyeballs appreciated, as are test results.
> > > 
> > 
> > I had xfstests running against this (on for-next) over the weekend
> > and it hit the following bug on xfs/297:
> > 
> > [ 6408.168767] kernel BUG at fs/xfs/xfs_aops.c:1336!
> > [ 6408.169542] invalid opcode: 0000 [#1] SMP 
> 
> Ok, so that's found another stale delalloc range where there
> shouldn't be. I know there were still problems when I left because
> generic/127 was failing on 1k block size filesystems, but I haven't
> yet had a chance to get back to determine if the bug was the broken
> code in xfs_check_page_type() that Dan Carpenter noticed. Were you
> running with that fix?
> 

Ah, good point. I was running with the check_page_type() rework, but not
the most recent fix. I'll plan to test again with that included.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

* Re: [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation
  2014-03-21 10:11 ` [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation Dave Chinner
@ 2014-04-04 13:37   ` Brian Foster
  2014-04-04 21:31     ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-04-04 13:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Al, Viro, viro, xfs

On Fri, Mar 21, 2014 at 09:11:49PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When we punch a hole in a delalloc extent, we split the indirect
> block reservation between the two new extents. If we repeatedly
> punch holes in a large delalloc extent, that reservation will
> eventually run out and we'll assert fail in xfs_bunmapi() because
> the indirect block reservation for the delalloc extent is zero. This
> is caused by doing a large delalloc write, then zeroing multiple
> ranges of that write using fallocate to punch lots of holes in the
> delayed allocation range.
> 
> To avoid this problem, if we split the reservation and require more
> indirect blocks for the two new extents than we had for the old
> reservation, steal the additional blocks from the hole we punched in
> the extent. In most cases we only need a single extra block, so even
> if we punch only single block holes we can still retain sufficient
> indirect block reservations to avoid problems.
> 
> In doing this "stealing", we need to change where we account for the
> delalloc blocks being freed. The block count held on the inode does
> not take into account the indirect block reservation, so we still
> need to do that before we free the extent. However, the accounting
> for free space in the superblock needs to be done after we've stolen
> the blocks from the freed extent so that they are accounted for
> correctly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap.c | 65 ++++++++++++++++++++++++++++++++++++++++---------------
>  1 file changed, 47 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> index 5b6092e..4bf6a0e 100644
> --- a/fs/xfs/xfs_bmap.c
> +++ b/fs/xfs/xfs_bmap.c
> @@ -4945,7 +4945,27 @@ xfs_bmap_del_extent(
>  			temp2 = xfs_bmap_worst_indlen(ip, temp2);
>  			new.br_startblock = nullstartblock((int)temp2);
>  			da_new = temp + temp2;
> +
> +			/*
> +			 * Note: if we have an odd number of blocks reserved,
> +			 * then if we keep splitting the delalloc extent like
> +			 * this we end up with a delalloc indlen reservation of
> +			 * zero for one of the two extents. Hence if we end
> +			 * up with the new indlen reservations being larger than
> +			 * the old one, steal blocks from the data reservation
> +			 * we just punched out. Otherwise, just reduce the
> +			 * remaining indlen reservations alternately and hope
> +			 * next time we come here the range getting removed is
> +			 * large enough to fix this all up.
> +			 */
>  			while (da_new > da_old) {
> +				if (del->br_blockcount) {
> +					/* steal a block */
> +					da_new--;
> +					del->br_blockcount--;
> +					continue;
> +				}
> +
>  				if (temp) {
>  					temp--;
>  					da_new--;
> @@ -5255,24 +5275,6 @@ xfs_bunmapi(
>  		}
>  		if (wasdel) {
>  			ASSERT(startblockval(del.br_startblock) > 0);
> -			/* Update realtime/data freespace, unreserve quota */
> -			if (isrt) {
> -				xfs_filblks_t rtexts;
> -
> -				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
> -				do_div(rtexts, mp->m_sb.sb_rextsize);
> -				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
> -						(int64_t)rtexts, 0);
> -				(void)xfs_trans_reserve_quota_nblks(NULL,
> -					ip, -((long)del.br_blockcount), 0,
> -					XFS_QMOPT_RES_RTBLKS);
> -			} else {
> -				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
> -						(int64_t)del.br_blockcount, 0);
> -				(void)xfs_trans_reserve_quota_nblks(NULL,
> -					ip, -((long)del.br_blockcount), 0,
> -					XFS_QMOPT_RES_REGBLKS);
> -			}
>  			ip->i_delayed_blks -= del.br_blockcount;
>  			if (cur)
>  				cur->bc_private.b.flags |=
> @@ -5302,6 +5304,33 @@ xfs_bunmapi(
>  		}
>  		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur, &del,
>  				&tmp_logflags, whichfork);
> +		/*
> +		 * xfs_bmap_del_extent may hand delayed alloc blocks back to the
> +		 * indirect block reservations to keep extent split reservations
> +		 * sane. Hence we should only decrement the delayed block count
> +		 * on the inode once we know exactly the amount of delalloc
> +		 * space we actually removed from the inode.
> +		 */
> +		if (wasdel && del.br_blockcount) {
> +			/* Update realtime/data freespace, unreserve quota */
> +			if (isrt) {
> +				xfs_filblks_t rtexts;
> +
> +				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
> +				do_div(rtexts, mp->m_sb.sb_rextsize);
> +				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
> +						(int64_t)rtexts, 0);
> +				(void)xfs_trans_reserve_quota_nblks(NULL,
> +					ip, -((long)del.br_blockcount), 0,
> +					XFS_QMOPT_RES_RTBLKS);
> +			} else {
> +				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
> +						(int64_t)del.br_blockcount, 0);
> +				(void)xfs_trans_reserve_quota_nblks(NULL,
> +					ip, -((long)del.br_blockcount), 0,
> +					XFS_QMOPT_RES_REGBLKS);
> +			}
> +		}

In xfs_bmapi_reserve_delalloc(), we account alen against the quota,
calculate indlen, account both against the sb counters, add alen to
i_delayed_blks and continue on...

In xfs_bunmapi(), we remove br_blockcount from the sb counters and
unreserve from the quota then call into xfs_bmap_del_extent(). The
latter takes care of removing the indlen blocks from the sb counters if
necessary.

With these changes, we move the accounting after the del_extent() call
and allow the latter to steal from br_blockcount for indlen. This seems
like it works for the sb counters. We also adjust i_delayed_blks against
the original extent length before it can be modified. The quota
reservation looks like it could become inconsistent, however. E.g., some
blocks previously accounted against the quota (alen) are now moved over
to indlen. If those indlen blocks happen to be cleaned up through
del_extent(), they'd only be replenished to the sb counters (near the
end of del_extent()). Perhaps we should leave the quota unreserve prior
to the del_extent() call or sample the original br_blockcount and use it
post del_extent()..?
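
Something like the following, perhaps (an untested sketch of the
second option; orig_blockcount is a local introduced here purely for
illustration):

		xfs_filblks_t	orig_blockcount = del.br_blockcount;

		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur,
				&del, &tmp_logflags, whichfork);
		...
		if (wasdel && orig_blockcount) {
			/*
			 * Unreserve quota for the full delalloc data
			 * reservation we started with, including any
			 * blocks xfs_bmap_del_extent() stole for indlen,
			 * so the quota counts stay consistent.
			 */
			(void)xfs_trans_reserve_quota_nblks(NULL, ip,
				-((long)orig_blockcount), 0,
				isrt ? XFS_QMOPT_RES_RTBLKS :
				       XFS_QMOPT_RES_REGBLKS);
		}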

Brian

>  		logflags |= tmp_logflags;
>  		if (error)
>  			goto error0;
> -- 
> 1.9.0
> 

* Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
  2014-03-29 15:14   ` Brian Foster
@ 2014-04-04 15:26     ` Brian Foster
  2014-04-04 21:26       ` Dave Chinner
  0 siblings, 1 reply; 15+ messages in thread
From: Brian Foster @ 2014-04-04 15:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs, Viro, viro, Al

On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > If we fail a write beyond EOF and have to handle it in
> > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > size. This doesn't take into account the fact that we may have
> > already made successful writes to the same page (in the case of block
> > size < page size) and hence we can truncate the page cache away from
> > blocks with valid data in them. If these blocks are delayed
> > allocation blocks, we now have a mismatch between the page cache and
> > the extent tree, and this will trigger - at minimum - a delayed
> > block count mismatch assert when the inode is evicted from the cache.
> > We can also trip over it when block mapping for direct IO - this is
> > the most common symptom seen from fsx and fsstress when run from
> > xfstests.
> > 
> > Fix it by only truncating away the exact range we are updating state
> > for in this write_begin call.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_aops.c | 10 ++++++++--
> >  1 file changed, 8 insertions(+), 2 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index e810243..6b4ecc8 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> >  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> >  	if (unlikely(status)) {
> >  		struct inode	*inode = mapping->host;
> > +		size_t		isize = i_size_read(inode);
> >  
> >  		xfs_vm_write_failed(inode, page, pos, len);
> >  		unlock_page(page);
> >  
> > -		if (pos + len > i_size_read(inode))
> > -			truncate_pagecache(inode, i_size_read(inode));
> 
> From what I can see, we have a write_begin/copy/write_end sequence per
> page and the inode size is updated down in the write_end path. If the
> write fails at write_begin time, then we haven't copied any data and the
> inode size hasn't changed.
> 
> The intent of the original code looks like it wants to break off any
> blocks that might have been set up beyond EOF before the write happened
> to fail. Could you elaborate on how this can kill blocks that might have
> been written successfully? Doesn't this only affect post-EOF metadata?
> 
> > +		/*
> > +		 * If the write is beyond EOF, we only want to kill blocks
> > +		 * allocated in this write, not blocks that were previously
> > +		 * written successfully.
> > +		 */
> > +		if (pos + len > isize)
> > +			truncate_pagecache_range(inode, pos, pos + len);
> 
> So the cache truncate now covers the range of the write. What happens if
> > part of the write is an overwrite? This seems similar to the problem
> you've described above... should the truncate start at isize instead?
> 

I ran an experiment on this to confirm my suspicion here. I added a
quick little hack to fail any write_begin (set status=-1) for which pos
!= 0. With that in place:

xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000800

xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
(fails)

# hexdump /mnt/file 
0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
*
0000400 0000 0000 0000 0000 0000 0000 0000 0000
*
0000800

The failed extending write trashes the data over the previously valid
region.

Brian

> Brian
> 
> >  
> >  		page_cache_release(page);
> >  		page = NULL;
> > -- 
> > 1.9.0
> > 

* Re: [PATCH 2/6] xfs: write failure beyond EOF truncates too much data
  2014-04-04 15:26     ` Brian Foster
@ 2014-04-04 21:26       ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-04-04 21:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs, Viro, viro, Al

On Fri, Apr 04, 2014 at 11:26:36AM -0400, Brian Foster wrote:
> On Sat, Mar 29, 2014 at 11:14:19AM -0400, Brian Foster wrote:
> > On Fri, Mar 21, 2014 at 09:11:46PM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > If we fail a write beyond EOF and have to handle it in
> > > xfs_vm_write_begin(), we truncate the inode back to the current inode
> > > size. This doesn't take into account the fact that we may have
> > > already made successful writes to the same page (in the case of block
> > > size < page size) and hence we can truncate the page cache away from
> > > blocks with valid data in them. If these blocks are delayed
> > > allocation blocks, we now have a mismatch between the page cache and
> > > the extent tree, and this will trigger - at minimum - a delayed
> > > block count mismatch assert when the inode is evicted from the cache.
> > > We can also trip over it when block mapping for direct IO - this is
> > > the most common symptom seen from fsx and fsstress when run from
> > > xfstests.
> > > 
> > > Fix it by only truncating away the exact range we are updating state
> > > for in this write_begin call.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_aops.c | 10 ++++++++--
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > > index e810243..6b4ecc8 100644
> > > --- a/fs/xfs/xfs_aops.c
> > > +++ b/fs/xfs/xfs_aops.c
> > > @@ -1609,12 +1609,18 @@ xfs_vm_write_begin(
> > >  	status = __block_write_begin(page, pos, len, xfs_get_blocks);
> > >  	if (unlikely(status)) {
> > >  		struct inode	*inode = mapping->host;
> > > +		size_t		isize = i_size_read(inode);
> > >  
> > >  		xfs_vm_write_failed(inode, page, pos, len);
> > >  		unlock_page(page);
> > >  
> > > -		if (pos + len > i_size_read(inode))
> > > -			truncate_pagecache(inode, i_size_read(inode));
> > 
> > From what I can see, we have a write_begin/copy/write_end sequence per
> > page and the inode size is updated down in the write_end path. If the
> > write fails at write_begin time, then we haven't copied any data and the
> > inode size hasn't changed.
> > 
> > The intent of the original code looks like it wants to break off any
> > blocks that might have been set up beyond EOF before the write happened
> > to fail. Could you elaborate on how this can kill blocks that might have
> > been written successfully? Doesn't this only affect post-EOF metadata?
> > 
> > > +		/*
> > > +		 * If the write is beyond EOF, we only want to kill blocks
> > > +		 * allocated in this write, not blocks that were previously
> > > +		 * written successfully.
> > > +		 */
> > > +		if (pos + len > isize)
> > > +			truncate_pagecache_range(inode, pos, pos + len);
> > 
> > So the cache truncate now covers the range of the write. What happens if
> > part of the write is an overwrite? This seems similar to the problem
> > you've described above... should the truncate start at isize instead?
> > 
> 
> I ran an experiment on this to confirm my suspicion here. I added a
> quick little hack to fail any write_begin (set status=-1) for which pos
> != 0. With that in place:
> 
> xfs_io -fc "pwrite -b 2k 0 2k" /mnt/file
> 
> # hexdump /mnt/file 
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0000800
> 
> xfs_io -c "pwrite -b 2k 1k 2k" /mnt/file
> (fails)
> 
> # hexdump /mnt/file 
> 0000000 cdcd cdcd cdcd cdcd cdcd cdcd cdcd cdcd
> *
> 0000400 0000 0000 0000 0000 0000 0000 0000 0000
> *
> 0000800
> 
> The failed extending write trashes the data over the previously valid
> region.

Good catch, Brian! That is indeed a problem with the fix I made. Not
intentional, as the comment probably led you to believe. I did say
I wasn't entirely sure all the fixes were correct, and I greatly
appreciate your diligence here.

The fix should probably be:

		/*
		 * If the write is beyond EOF, we only want to kill blocks
		 * allocated in this write, not blocks that were previously
		 * written successfully.
		 */
		if (pos + len > isize) {
			ssize_t		start = max_t(ssize_t, pos, isize);

			truncate_pagecache_range(inode, start, pos + len);
		}

So if the write overlaps EOF we don't truncate away data from before
EOF. It also has the effect of not truncating data in previous
writes between isize and pos, which was the bug I was trying to
fix...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 5/6] xfs: splitting delalloc extents can run out of reservation
  2014-04-04 13:37   ` Brian Foster
@ 2014-04-04 21:31     ` Dave Chinner
  0 siblings, 0 replies; 15+ messages in thread
From: Dave Chinner @ 2014-04-04 21:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: Al, Viro, viro, xfs

On Fri, Apr 04, 2014 at 09:37:01AM -0400, Brian Foster wrote:
> On Fri, Mar 21, 2014 at 09:11:49PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When we punch a hole in a delalloc extent, we split the indirect
> > block reservation between the two new extents. If we repeatedly
> > punch holes in a large delalloc extent, that reservation will
> > eventually run out and we'll assert fail in xfs_bunmapi() because
> > the indirect block reservation for the delalloc extent is zero. This
> > is caused by doing a large delalloc write, then zeroing multiple
> > ranges of that write using fallocate to punch lots of holes in the
> > delayed allocation range.
> > 
> > To avoid this problem, if we split the reservation and require more
> > indirect blocks for the two new extents than we had for the old
> > reservation, steal the additional blocks from the hole we punched in
> > the extent. In most cases we only need a single extra block, so even
> > if we punch only single block holes we can still retain sufficient
> > indirect block reservations to avoid problems.
> > 
> > In doing this "stealing", we need to change where we account for the
> > delalloc blocks being freed. The block count held on the inode does
> > not take into account the indirect block reservation, so we still
> > need to do that before we free the extent. However, the accounting
> > for free space in the superblock needs to be done after we've stolen
> > the blocks from the freed extent so that they are accounted for
> > correctly.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap.c | 65 ++++++++++++++++++++++++++++++++++++++++---------------
> >  1 file changed, 47 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap.c b/fs/xfs/xfs_bmap.c
> > index 5b6092e..4bf6a0e 100644
> > --- a/fs/xfs/xfs_bmap.c
> > +++ b/fs/xfs/xfs_bmap.c
> > @@ -4945,7 +4945,27 @@ xfs_bmap_del_extent(
> >  			temp2 = xfs_bmap_worst_indlen(ip, temp2);
> >  			new.br_startblock = nullstartblock((int)temp2);
> >  			da_new = temp + temp2;
> > +
> > +			/*
> > +			 * Note: if we have an odd number of blocks reserved,
> > +			 * then if we keep splitting the delalloc extent like
> > +			 * this we end up with a delalloc indlen reservation of
> > +			 * zero for one of the two extents. Hence if we end
> > +			 * up with the new indlen reservations being larger than
> > +			 * the old one, steal blocks from the data reservation
> > +			 * we just punched out. Otherwise, just reduce the
> > +			 * remaining indlen reservations alternately and hope
> > +			 * next time we come here the range getting removed is
> > +			 * large enough to fix this all up.
> > +			 */
> >  			while (da_new > da_old) {
> > +				if (del->br_blockcount) {
> > +					/* steal a block */
> > +					da_new--;
> > +					del->br_blockcount--;
> > +					continue;
> > +				}
> > +
> >  				if (temp) {
> >  					temp--;
> >  					da_new--;
> > @@ -5255,24 +5275,6 @@ xfs_bunmapi(
> >  		}
> >  		if (wasdel) {
> >  			ASSERT(startblockval(del.br_startblock) > 0);
> > -			/* Update realtime/data freespace, unreserve quota */
> > -			if (isrt) {
> > -				xfs_filblks_t rtexts;
> > -
> > -				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
> > -				do_div(rtexts, mp->m_sb.sb_rextsize);
> > -				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
> > -						(int64_t)rtexts, 0);
> > -				(void)xfs_trans_reserve_quota_nblks(NULL,
> > -					ip, -((long)del.br_blockcount), 0,
> > -					XFS_QMOPT_RES_RTBLKS);
> > -			} else {
> > -				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
> > -						(int64_t)del.br_blockcount, 0);
> > -				(void)xfs_trans_reserve_quota_nblks(NULL,
> > -					ip, -((long)del.br_blockcount), 0,
> > -					XFS_QMOPT_RES_REGBLKS);
> > -			}
> >  			ip->i_delayed_blks -= del.br_blockcount;
> >  			if (cur)
> >  				cur->bc_private.b.flags |=
> > @@ -5302,6 +5304,33 @@ xfs_bunmapi(
> >  		}
> >  		error = xfs_bmap_del_extent(ip, tp, &lastx, flist, cur, &del,
> >  				&tmp_logflags, whichfork);
> > +		/*
> > +		 * xfs_bmap_del_extent may hand delayed alloc blocks back to the
> > +		 * indirect block reservations to keep extent split reservations
> > +		 * sane. Hence we should only decrement the delayed block count
> > +		 * on the inode once we know exactly the amount of delalloc
> > +		 * space we actually removed from the inode.
> > +		 */
> > +		if (wasdel && del.br_blockcount) {
> > +			/* Update realtime/data freespace, unreserve quota */
> > +			if (isrt) {
> > +				xfs_filblks_t rtexts;
> > +
> > +				rtexts = XFS_FSB_TO_B(mp, del.br_blockcount);
> > +				do_div(rtexts, mp->m_sb.sb_rextsize);
> > +				xfs_mod_incore_sb(mp, XFS_SBS_FREXTENTS,
> > +						(int64_t)rtexts, 0);
> > +				(void)xfs_trans_reserve_quota_nblks(NULL,
> > +					ip, -((long)del.br_blockcount), 0,
> > +					XFS_QMOPT_RES_RTBLKS);
> > +			} else {
> > +				xfs_icsb_modify_counters(mp, XFS_SBS_FDBLOCKS,
> > +						(int64_t)del.br_blockcount, 0);
> > +				(void)xfs_trans_reserve_quota_nblks(NULL,
> > +					ip, -((long)del.br_blockcount), 0,
> > +					XFS_QMOPT_RES_REGBLKS);
> > +			}
> > +		}
> 
> In xfs_bmapi_reserve_delalloc(), we account alen against the quota,
> calculate indlen, account both against the sb counters, add alen to
> i_delayed_blks and continue on...

*nod*

> In xfs_bunmap(), we remove br_blockcount from the sb counters and
> unreserve from the quota then call into xfs_bmap_del_extent(). The
> latter takes care of removing the indlen blocks from the sb counters if
> necessary.

*nod*

> With these changes, we move the accounting after the del_extent() call
> and allow the latter to steal from br_blockcount for indlen. This seems
> like it works for the sb counters.

*nod*

> We also adjust i_delayed_blks against
> the original extent length before it can be modified. The quota
> reservation looks like it could become inconsistent, however. E.g., some
> blocks previously accounted against the quota (alen) are now moved over
> to indlen.

Yes, you are right - it will cause inconsistency. That's a lesser
evil than not having a reservation at all, but I can probably fix
that up.

> If those indlen blocks happen to be cleaned up through
> del_extent(), they'd only be replenished to the sb counters (near the
> end of del_extent()). Perhaps we should leave the quota unreserve prior
> to the del_extent() call or sample the original br_blockcount and use it
> post del_extent()..?

That's a possibility. I really only looked at fixing the reservation
issue and keeping the sb counters correct, not the quota aspect of
it. I'll go back and see what I can come up with that solves this
problem, too.

Thanks!

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

