* [PATCH 0/6] xfs: direct IO invalidation and related fixes
@ 2014-08-21  5:09 Dave Chinner
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

Hi folks,

This patch set started when I was testing Chris Mason's direct IO
read invalidation fix (patch 2 of the series). I hit a problem
testing it on a 1k block size filesystem, and that opened a can of
bugs. Luckily, the can of bugs fell into the fire, and so the bug
hunt wasn't long and drawn out.

In the end, the class of bugs that was uncovered is prevented from
occurring by the first patch (hmmm, not sure the title is accurate
anymore) which then means it is safe to fix the invalidation bugs
in the direct IO code (Chris fixed read, I fixed write). With these
issues fixed, we can trim the range of invalidation to just that of
the direct IO in progress and that finally works without causing
random regressions.

This then left fsx tripping over issues with collapse range calls
that Brian had already found the cause of, so this series adds his
patch to change the inode logging and my temporary workaround of
flushing the entire file before running the collapse range
operation.

With these 6 patches, the troublesome fsx configuration from
generic/263 goes from assert failing on the 1192nd operation to
running for over 60 million operations before tripping over another
collapse range issue. This was repeated on multiple test machines,
with a ramdisk based test getting 61,350,000 ops, a 2p VM
getting 63,700,000 ops and a 1p VM getting 65,450,000 ops
before failing.

The series does result in generic/247 now occasionally tripping an
invalidation failure, but that test intentionally races direct IO
writes against mmap writes, so this is an expected failure when the
code is working correctly: page faults cannot be synchronised
against any other IO operation.

Hence I think this patch series fixes the root cause of another
long-standing bufferhead coherency issue that we had previously
covered up.
Please test and review - I want to send this patchset to Linus for
3.17-rc3 if possible...

-Dave.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs


* [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21 12:48   ` Brian Foster
                     ` (2 more replies)
  2014-08-21  5:09 ` [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
                   ` (4 subsequent siblings)
  5 siblings, 3 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Dave Chinner <dchinner@redhat.com>

generic/263 is failing fsx at this point with a page spanning
EOF that cannot be invalidated. The operations are:

1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)

where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
write attempts to invalidate the cached page over this range, it
fails with -EBUSY and so we fire this assert:

XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676

because the kernel is trying to fall back to buffered IO on the
direct IO path (which XFS does not do).

The real question is this: Why can't that page be invalidated after
it has been written to disk and cleaned?

Well, there's data on the first two buffers in the page (1k block
size, 4k page), but the third buffer on the page (i.e. beyond EOF)
is failing drop_buffers because it's bh->b_state == 0x3, which is
BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
what?

OK, set_buffer_dirty() is called on all buffers from
__set_page_dirty_buffers(), regardless of whether the buffer is
beyond EOF or not, which means that when we get to ->writepage,
we have buffers marked dirty beyond EOF that we need to clean.
So, we need to implement our own .set_page_dirty method that
doesn't dirty buffers beyond EOF.

This is messy because the buffer code is not meant to be shared
and it has interesting locking issues on the buffer dirty bits.
So just copy and paste it and then modify it to suit what we need.

cc: <stable@kernel.org>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 58 insertions(+)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 11e9b4c..2a316ad 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1753,11 +1753,69 @@ xfs_vm_readpages(
 	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
 }
 
+/*
+ * This is basically a copy of __set_page_dirty_buffers() with one
+ * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
+ * dirty, we'll never be able to clean them because we don't write buffers
+ * beyond EOF, and that means we can't invalidate pages that span EOF
+ * that have been marked dirty. Further, the dirty state can leak into
+ * the file interior if the file is extended, resulting in all sorts of
+ * bad things happening as the state does not match the underlying data.
+ */
+STATIC int
+xfs_vm_set_page_dirty(
+	struct page		*page)
+{
+	struct address_space	*mapping = page_mapping(page);
+	struct inode		*inode = mapping->host;
+	loff_t			end_offset;
+	loff_t			offset;
+	int			newly_dirty;
+
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	end_offset = i_size_read(inode);
+	offset = end_offset & PAGE_CACHE_MASK;
+
+	spin_lock(&mapping->private_lock);
+	if (page_has_buffers(page)) {
+		struct buffer_head *head = page_buffers(page);
+		struct buffer_head *bh = head;
+
+		do {
+			if (offset < end_offset)
+				set_buffer_dirty(bh);
+			bh = bh->b_this_page;
+			offset += 1 << inode->i_blkbits;
+		} while (bh != head);
+	}
+	newly_dirty = !TestSetPageDirty(page);
+	spin_unlock(&mapping->private_lock);
+
+	if (newly_dirty) {
+		/* sigh - __set_page_dirty() is static, so copy it here, too */
+		unsigned long flags;
+
+		spin_lock_irqsave(&mapping->tree_lock, flags);
+		if (page->mapping) {	/* Race with truncate? */
+			WARN_ON_ONCE(!PageUptodate(page));
+			account_page_dirtied(page, mapping);
+			radix_tree_tag_set(&mapping->page_tree,
+					page_index(page), PAGECACHE_TAG_DIRTY);
+		}
+		spin_unlock_irqrestore(&mapping->tree_lock, flags);
+		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+	}
+	return newly_dirty;
+}
+
 const struct address_space_operations xfs_address_space_operations = {
 	.readpage		= xfs_vm_readpage,
 	.readpages		= xfs_vm_readpages,
 	.writepage		= xfs_vm_writepage,
 	.writepages		= xfs_vm_writepages,
+	.set_page_dirty		= xfs_vm_set_page_dirty,
 	.releasepage		= xfs_vm_releasepage,
 	.invalidatepage		= xfs_vm_invalidatepage,
 	.write_begin		= xfs_vm_write_begin,
-- 
2.0.0
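The per-buffer EOF check that the patch adds can be modelled in userspace. The sketch below is an illustration only, not kernel code; the block size, page size and offsets come from the fsx trace in the commit message:

```c
#include <assert.h>

/*
 * Minimal model of the intended xfs_vm_set_page_dirty() behaviour:
 * given the file offset of a page, only buffers that start before EOF
 * get marked dirty. Returns how many of the page's buffers would be
 * dirtied. All names here are illustrative, not kernel symbols.
 */
static int buffers_dirtied(long long page_start, long long i_size,
			   int block_size, int page_size)
{
	int dirty = 0;
	long long offset;

	for (offset = page_start; offset < page_start + page_size;
	     offset += block_size) {
		if (offset < i_size)	/* buffer starts inside the file */
			dirty++;
	}
	return dirty;
}
```

With 1k blocks and 4k pages, and EOF at 0x5e56a (the mapwrite in op 1190 ends at byte 0x5e569), the EOF page at 0x5e000 gets only its first two buffers dirtied; the buffers wholly beyond EOF stay clean, so a later invalidation of the page can succeed in drop_buffers().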


* [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21 13:08   ` Brian Foster
  2014-08-21  5:09 ` [PATCH 3/6] " Dave Chinner
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Chris Mason <clm@fb.com>

xfs is using truncate_pagecache_range to invalidate the page cache
during DIO reads.  This is different from the other filesystems, which
only invalidate pages during DIO writes.

truncate_pagecache_range is meant to be used when we are freeing the
underlying data structs from disk, so it will zero any partial
ranges in the page.  This means a DIO read can zero out part of the
page cache page, and it is possible the page will stay in cache.

Buffered reads will find an up-to-date page with zeros instead of
the data actually on disk.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

[dchinner: catch error and warn if it fails. Comment.]

cc: stable@vger.kernel.org
Signed-off-by: Chris Mason <clm@fb.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Dave Chinner <david@fromorbit.com>
---
 fs/xfs/xfs_file.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 076b170..827cfb2 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -296,7 +296,16 @@ xfs_file_read_iter(
 				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
 				return ret;
 			}
-			truncate_pagecache_range(VFS_I(ip), pos, -1);
+
+			/*
+			 * Invalidate whole pages. This can return an error if
+			 * we fail to invalidate a page, but this should never
+			 * happen on XFS. Warn if it does fail.
+			 */
+			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+						pos >> PAGE_CACHE_SHIFT, -1);
+			WARN_ON_ONCE(ret);
+			ret = 0;
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
 	}
-- 
2.0.0


* [PATCH 3/6] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
  2014-08-21  5:09 ` [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21 13:09   ` Brian Foster
  2014-08-21  5:09 ` [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Dave Chinner <dchinner@redhat.com>

Similar to direct IO reads, direct IO writes are using
truncate_pagecache_range to invalidate the page cache. This is
incorrect due to the sub-block zeroing in the page cache that
truncate_pagecache_range() triggers.

This patch fixes things by using invalidate_inode_pages2_range
instead.  It preserves the page cache invalidation, but won't zero
any pages.

cc: stable@vger.kernel.org
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 827cfb2..19917fa 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -644,7 +644,15 @@ xfs_file_dio_aio_write(
 						    pos, -1);
 		if (ret)
 			goto out;
-		truncate_pagecache_range(VFS_I(ip), pos, -1);
+		/*
+		 * Invalidate whole pages. This can return an error if
+		 * we fail to invalidate a page, but this should never
+		 * happen on XFS. Warn if it does fail.
+		 */
+		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
+						pos >> PAGE_CACHE_SHIFT, -1);
+		WARN_ON_ONCE(ret);
+		ret = 0;
 	}
 
 	/*
-- 
2.0.0


* [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2014-08-21  5:09 ` [PATCH 3/6] " Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21 13:09   ` Brian Foster
  2014-08-21  5:09 ` [PATCH 5/6] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
  2014-08-21  5:09 ` [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
  5 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Dave Chinner <dchinner@redhat.com>

Now that we are no longer dirtying buffers beyond EOF and are using
invalidation correctly, we can finally reduce the ranges of
writeback and invalidation used by direct IO to match those of the
IO being issued.

Bring the writeback and invalidation ranges back to match the
generic direct IO code - this will greatly reduce the perturbation
of cached data when direct IO and buffered IO are mixed, but still
provide the same buffered vs direct IO coherency behaviour we
currently have.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_file.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 19917fa..de5368c 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -291,7 +291,7 @@ xfs_file_read_iter(
 		if (inode->i_mapping->nrpages) {
 			ret = filemap_write_and_wait_range(
 							VFS_I(ip)->i_mapping,
-							pos, -1);
+							pos, pos + size - 1);
 			if (ret) {
 				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
 				return ret;
@@ -303,7 +303,8 @@ xfs_file_read_iter(
 			 * happen on XFS. Warn if it does fail.
 			 */
 			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-						pos >> PAGE_CACHE_SHIFT, -1);
+					pos >> PAGE_CACHE_SHIFT,
+					(pos + size - 1) >> PAGE_CACHE_SHIFT);
 			WARN_ON_ONCE(ret);
 			ret = 0;
 		}
@@ -641,7 +642,7 @@ xfs_file_dio_aio_write(
 
 	if (mapping->nrpages) {
 		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						    pos, -1);
+						    pos, pos + count - 1);
 		if (ret)
 			goto out;
 		/*
@@ -650,7 +651,8 @@ xfs_file_dio_aio_write(
 		 * happen on XFS. Warn if it does fail.
 		 */
 		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-						pos >> PAGE_CACHE_SHIFT, -1);
+					pos >> PAGE_CACHE_SHIFT,
+					(pos + count - 1) >> PAGE_CACHE_SHIFT);
 		WARN_ON_ONCE(ret);
 		ret = 0;
 	}
-- 
2.0.0
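The ranged invalidation converts a byte range into inclusive page indices. A userspace sketch of that arithmetic, assuming a 4k page size for illustration (PAGE_CACHE_SHIFT is a kernel constant, so a stand-in shift is used here):

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* 4k pages, assumed for illustration */

/* first page index touched by a byte range starting at pos */
static long long first_page(long long pos)
{
	return pos >> SKETCH_PAGE_SHIFT;
}

/* last (inclusive) page index touched by count bytes starting at pos */
static long long last_page(long long pos, long long count)
{
	return (pos + count - 1) >> SKETCH_PAGE_SHIFT;
}
```

A 4096-byte direct IO at pos 4096 touches only page index 1, so invalidate_inode_pages2_range() no longer has to sweep every cached page from pos to the end of the file as the old `(start, -1)` calls did.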


* [PATCH 5/6] xfs: don't log inode unless extent shift makes extent modifications
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
                   ` (3 preceding siblings ...)
  2014-08-21  5:09 ` [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21  5:09 ` [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
  5 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Brian Foster <bfoster@redhat.com>

The file collapse mechanism uses xfs_bmap_shift_extents() to collapse
all subsequent extents down into the specified, previously punched out,
region. This function performs some validation, such as whether a
sufficient hole exists in the target region of the collapse, then shifts
the remaining extents downward.

The exit path of the function currently logs the inode unconditionally.
While we must log the inode (and abort) if an error occurs and the
transaction is dirty, the initial validation paths can generate errors
before the transaction has been dirtied. This creates an unnecessary
filesystem shutdown scenario, as the caller will cancel a transaction
that has been marked dirty.

Modify xfs_bmap_shift_extents() to OR the logflags bits as modifications
are made to the inode bmap. Only log the inode in the exit path if
logflags has been set. This ensures we only have to cancel a dirty
transaction if modifications have been made and prevents an unnecessary
filesystem shutdown otherwise.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 3a6a700..e5c2518 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5422,7 +5422,7 @@ xfs_bmap_shift_extents(
 	struct xfs_bmap_free	*flist,
 	int			num_exts)
 {
-	struct xfs_btree_cur		*cur;
+	struct xfs_btree_cur		*cur = NULL;
 	struct xfs_bmbt_rec_host	*gotp;
 	struct xfs_bmbt_irec            got;
 	struct xfs_bmbt_irec		left;
@@ -5433,7 +5433,7 @@ xfs_bmap_shift_extents(
 	int				error = 0;
 	int				i;
 	int				whichfork = XFS_DATA_FORK;
-	int				logflags;
+	int				logflags = 0;
 	xfs_filblks_t			blockcount = 0;
 	int				total_extents;
 
@@ -5476,16 +5476,11 @@ xfs_bmap_shift_extents(
 		}
 	}
 
-	/* We are going to change core inode */
-	logflags = XFS_ILOG_CORE;
 	if (ifp->if_flags & XFS_IFBROOT) {
 		cur = xfs_bmbt_init_cursor(mp, tp, ip, whichfork);
 		cur->bc_private.b.firstblock = *firstblock;
 		cur->bc_private.b.flist = flist;
 		cur->bc_private.b.flags = 0;
-	} else {
-		cur = NULL;
-		logflags |= XFS_ILOG_DEXT;
 	}
 
 	/*
@@ -5543,11 +5538,14 @@ xfs_bmap_shift_extents(
 			blockcount = left.br_blockcount +
 				got.br_blockcount;
 			xfs_iext_remove(ip, *current_ext, 1, 0);
+			logflags |= XFS_ILOG_CORE;
 			if (cur) {
 				error = xfs_btree_delete(cur, &i);
 				if (error)
 					goto del_cursor;
 				XFS_WANT_CORRUPTED_GOTO(i == 1, del_cursor);
+			} else {
+				logflags |= XFS_ILOG_DEXT;
 			}
 			XFS_IFORK_NEXT_SET(ip, whichfork,
 				XFS_IFORK_NEXTENTS(ip, whichfork) - 1);
@@ -5573,6 +5571,7 @@ xfs_bmap_shift_extents(
 			got.br_startoff = startoff;
 		}
 
+		logflags |= XFS_ILOG_CORE;
 		if (cur) {
 			error = xfs_bmbt_update(cur, got.br_startoff,
 						got.br_startblock,
@@ -5580,6 +5579,8 @@ xfs_bmap_shift_extents(
 						got.br_state);
 			if (error)
 				goto del_cursor;
+		} else {
+			logflags |= XFS_ILOG_DEXT;
 		}
 
 		(*current_ext)++;
@@ -5595,6 +5596,7 @@ del_cursor:
 		xfs_btree_del_cursor(cur,
 			error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
 
-	xfs_trans_log_inode(tp, ip, logflags);
+	if (logflags)
+		xfs_trans_log_inode(tp, ip, logflags);
 	return error;
 }
-- 
2.0.0
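The fix follows a common accumulate-then-log pattern: collect flag bits only when a modification is actually made, and log once on exit only if any bit is set. A compact userspace model of the exit-path decision (the flag values and names below are illustrative stand-ins, not the real XFS_ILOG_* constants):

```c
#include <assert.h>

#define ILOG_CORE 0x1	/* stand-in for XFS_ILOG_CORE */
#define ILOG_DEXT 0x2	/* stand-in for XFS_ILOG_DEXT */

/*
 * Model of the exit path: returns the flags that would be logged, or 0
 * when no modification was made and a clean transaction must not touch
 * the inode (so the caller can cancel it without a shutdown).
 */
static int flags_to_log(int made_modification, int have_btree_cursor)
{
	int logflags = 0;

	if (made_modification) {
		logflags |= ILOG_CORE;
		if (!have_btree_cursor)	/* extents held inline in the fork */
			logflags |= ILOG_DEXT;
	}
	return logflags;	/* caller logs only when nonzero */
}
```

An early validation failure returns 0, so xfs_trans_log_inode() is skipped and the transaction stays clean and cancellable.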


* [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged
  2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
                   ` (4 preceding siblings ...)
  2014-08-21  5:09 ` [PATCH 5/6] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
@ 2014-08-21  5:09 ` Dave Chinner
  2014-08-21 13:09   ` Brian Foster
  5 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-21  5:09 UTC (permalink / raw)
  To: xfs; +Cc: clm

From: Dave Chinner <dchinner@redhat.com>

If we have delalloc extents on a file before we run a collapse range
operation, we sync the range that we are going to collapse to
convert delalloc extents in that region to real extents to simplify
the shift operation.

However, the shift operation then assumes that the extent list is
not going to change as it iterates over the extent list moving
things about. Unfortunately, this isn't true because we can't hold
the ILOCK over all the operations. We can prevent new IO from
modifying the extent list by holding the IOLOCK, but that doesn't
prevent writeback from running....

And when writeback runs, it can convert delalloc extents in the
range of the file prior to the region being collapsed, and this
changes the indexes of all the extents in the file. That causes the
collapse range operation to Go Bad.

The right fix is to rewrite the extent shift operation not to be
dependent on the extent list not changing across the entire
operation, but this is a fairly significant piece of work to do.
Hence, as a short-term workaround for the problem, sync the entire
file before starting a collapse operation to remove all delalloc
ranges from the file and so avoid the problem of concurrent
writeback changing the extent list.

Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c53cc03..035041d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1460,6 +1460,19 @@ xfs_collapse_file_space(
 	start_fsb = XFS_B_TO_FSB(mp, offset + len);
 	shift_fsb = XFS_B_TO_FSB(mp, len);
 
+	/*
+	 * writeback the entire file to prevent concurrent writeback of ranges
+	 * outside the collapsing region from changing the extent list.
+	 *
+	 * XXX: This is a temporary fix until the extent shift loop below is
+	 * converted to use offsets and lookups within the ILOCK rather than
+	 * carrying around the index into the extent list for the next
+	 * iteration.
+	 */
+	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
+	if (error)
+		return error;
+
 	error = xfs_free_file_space(ip, offset, len);
 	if (error)
 		return error;
-- 
2.0.0


* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
@ 2014-08-21 12:48   ` Brian Foster
  2014-08-21 22:38     ` Dave Chinner
  2014-08-21 13:08   ` Christoph Hellwig
  2014-08-21 19:56   ` Jan Kara
  2 siblings, 1 reply; 20+ messages in thread
From: Brian Foster @ 2014-08-21 12:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 03:09:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/263 is failing fsx at this point with a page spanning
> EOF that cannot be invalidated. The operations are:
> 
> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> 
> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> write attempts to invalidate the cached page over this range, it
> fails with -EBUSY and so we fire this assert:
> 
> XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> 
> because the kernel is trying to fall back to buffered IO on the
> direct IO path (which XFS does not do).
> 
> The real question is this: Why can't that page be invalidated after
> it has been written to disk and cleaned?
> 
> Well, there's data on the first two buffers in the page (1k block
> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> is failing drop_buffers because it's bh->b_state == 0x3, which is
> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> what?
> 
> OK, set_buffer_dirty() is called on all buffers from
> __set_page_dirty_buffers(), regardless of whether the buffer is
> beyond EOF or not, which means that when we get to ->writepage,
> we have buffers marked dirty beyond EOF that we need to clean.
> So, we need to implement our own .set_page_dirty method that
> doesn't dirty buffers beyond EOF.
> 
> This is messy because the buffer code is not meant to be shared
> and it has interesting locking issues on the buffer dirty bits.
> So just copy and paste it and then modify it to suit what we need.
> 
> cc: <stable@kernel.org>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 11e9b4c..2a316ad 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1753,11 +1753,69 @@ xfs_vm_readpages(
>  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
>  }
>  
> +/*
> + * This is basically a copy of __set_page_dirty_buffers() with one
> + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> + * dirty, we'll never be able to clean them because we don't write buffers
> + * beyond EOF, and that means we can't invalidate pages that span EOF
> + * that have been marked dirty. Further, the dirty state can leak into
> + * the file interior if the file is extended, resulting in all sorts of
> + * bad things happening as the state does not match the underlying data.
> + */
> +STATIC int
> +xfs_vm_set_page_dirty(
> +	struct page		*page)
> +{
> +	struct address_space	*mapping = page_mapping(page);

This breaks xfs as a kernel module:

$ make -j 8 M=fs/xfs
  Building modules, stage 2.
  MODPOST 1 modules
WARNING: "page_mapping" [fs/xfs/xfs.ko] undefined!
  ...

I suppose we could export that symbol, but why wouldn't we just propose
this change to __set_page_dirty_buffers()?

Brian

> +	struct inode		*inode = mapping->host;
> +	loff_t			end_offset;
> +	loff_t			offset;
> +	int			newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	end_offset = i_size_read(inode);
> +	offset = end_offset & PAGE_CACHE_MASK;
> +
> +	spin_lock(&mapping->private_lock);
> +	if (page_has_buffers(page)) {
> +		struct buffer_head *head = page_buffers(page);
> +		struct buffer_head *bh = head;
> +
> +		do {
> +			if (offset < end_offset)
> +				set_buffer_dirty(bh);
> +			bh = bh->b_this_page;
> +			offset += 1 << inode->i_blkbits;
> +		} while (bh != head);
> +	}
> +	newly_dirty = !TestSetPageDirty(page);
> +	spin_unlock(&mapping->private_lock);
> +
> +	if (newly_dirty) {
> +		/* sigh - __set_page_dirty() is static, so copy it here, too */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		if (page->mapping) {	/* Race with truncate? */
> +			WARN_ON_ONCE(!PageUptodate(page));
> +			account_page_dirtied(page, mapping);
> +			radix_tree_tag_set(&mapping->page_tree,
> +					page_index(page), PAGECACHE_TAG_DIRTY);
> +		}
> +		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +	}
> +	return newly_dirty;
> +}
> +
>  const struct address_space_operations xfs_address_space_operations = {
>  	.readpage		= xfs_vm_readpage,
>  	.readpages		= xfs_vm_readpages,
>  	.writepage		= xfs_vm_writepage,
>  	.writepages		= xfs_vm_writepages,
> +	.set_page_dirty		= xfs_vm_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
>  	.write_begin		= xfs_vm_write_begin,
> -- 
> 2.0.0
> 


* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
  2014-08-21 12:48   ` Brian Foster
@ 2014-08-21 13:08   ` Christoph Hellwig
  2014-08-21 13:54     ` Anton Altaparmakov
  2014-08-21 15:21     ` Chris Mason
  2014-08-21 19:56   ` Jan Kara
  2 siblings, 2 replies; 20+ messages in thread
From: Christoph Hellwig @ 2014-08-21 13:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-fsdevel, clm, xfs

On Thu, Aug 21, 2014 at 03:09:09PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/263 is failing fsx at this point with a page spanning
> EOF that cannot be invalidated. The operations are:
> 
> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> 
> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> write attempts to invalidate the cached page over this range, it
> fails with -EBUSY and so we fire this assert:
> 
> XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> 
> because the kernel is trying to fall back to buffered IO on the
> direct IO path (which XFS does not do).
> 
> The real question is this: Why can't that page be invalidated after
> it has been written to disk and cleaned?
> 
> Well, there's data on the first two buffers in the page (1k block
> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> is failing drop_buffers because it's bh->b_state == 0x3, which is
> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> what?
> 
> OK, set_buffer_dirty() is called on all buffers from
> __set_page_dirty_buffers(), regardless of whether the buffer is
> beyond EOF or not, which means that when we get to ->writepage,
> we have buffers marked dirty beyond EOF that we need to clean.
> So, we need to implement our own .set_page_dirty method that
> doesn't dirty buffers beyond EOF.

Shouldn't this be fixed in __set_page_dirty_buffers() itself?  This
doesn't seem to be an XFS-specific issue.


* Re: [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-21  5:09 ` [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
@ 2014-08-21 13:08   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2014-08-21 13:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 03:09:10PM +1000, Dave Chinner wrote:
> From: Chris Mason <clm@fb.com>
> 
> xfs is using truncate_pagecache_range to invalidate the page cache
> during DIO reads.  This is different from the other filesystems, which
> only invalidate pages during DIO writes.
> 
> truncate_pagecache_range is meant to be used when we are freeing the
> underlying data structs from disk, so it will zero any partial
> ranges in the page.  This means a DIO read can zero out part of the
> page cache page, and it is possible the page will stay in cache.
> 
> Buffered reads will find an up-to-date page with zeros instead of
> the data actually on disk.
> 
> This patch fixes things by using invalidate_inode_pages2_range
> instead.  It preserves the page cache invalidation, but won't zero
> any pages.
> 
> [dchinner: catch error and warn if it fails. Comment.]
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Chris Mason <clm@fb.com>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Dave Chinner <david@fromorbit.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 11 ++++++++++-
>  1 file changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 076b170..827cfb2 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -296,7 +296,16 @@ xfs_file_read_iter(
>  				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
>  				return ret;
>  			}
> -			truncate_pagecache_range(VFS_I(ip), pos, -1);
> +
> +			/*
> +			 * Invalidate whole pages. This can return an error if
> +			 * we fail to invalidate a page, but this should never
> +			 * happen on XFS. Warn if it does fail.
> +			 */
> +			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> +						pos >> PAGE_CACHE_SHIFT, -1);
> +			WARN_ON_ONCE(ret);
> +			ret = 0;
>  		}
>  		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
>  	}
> -- 
> 2.0.0
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 3/6] xfs: don't zero partial page cache pages during O_DIRECT writes
  2014-08-21  5:09 ` [PATCH 3/6] " Dave Chinner
@ 2014-08-21 13:09   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2014-08-21 13:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 03:09:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Similar to direct IO reads, direct IO writes are using
> truncate_pagecache_range to invalidate the page cache. This is
> incorrect due to the sub-block zeroing in the page cache that
> truncate_pagecache_range() triggers.
> 
> This patch fixes things by using invalidate_inode_pages2_range
> instead.  It preserves the page cache invalidation, but won't zero
> any pages.
> 
> cc: stable@vger.kernel.org
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 827cfb2..19917fa 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -644,7 +644,15 @@ xfs_file_dio_aio_write(
>  						    pos, -1);
>  		if (ret)
>  			goto out;
> -		truncate_pagecache_range(VFS_I(ip), pos, -1);
> +		/*
> +		 * Invalidate whole pages. This can return an error if
> +		 * we fail to invalidate a page, but this should never
> +		 * happen on XFS. Warn if it does fail.
> +		 */
> +		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> +						pos >> PAGE_CACHE_SHIFT, -1);
> +		WARN_ON_ONCE(ret);
> +		ret = 0;
>  	}
>  
>  	/*
> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO
  2014-08-21  5:09 ` [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
@ 2014-08-21 13:09   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2014-08-21 13:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 03:09:12PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Now that we are not doing silly things with dirtying buffers beyond EOF
> and are using invalidation correctly, we can finally reduce the ranges of
> writeback and invalidation used by direct IO to match that of the IO
> being issued.
> 
> Bring the writeback and invalidation ranges back to match the
> generic direct IO code - this will greatly reduce the perturbation
> of cached data when direct IO and buffered IO are mixed, but still
> provide the same buffered vs direct IO coherency behaviour we
> currently have.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_file.c | 10 ++++++----
>  1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 19917fa..de5368c 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -291,7 +291,7 @@ xfs_file_read_iter(
>  		if (inode->i_mapping->nrpages) {
>  			ret = filemap_write_and_wait_range(
>  							VFS_I(ip)->i_mapping,
> -							pos, -1);
> +							pos, pos + size - 1);
>  			if (ret) {
>  				xfs_rw_iunlock(ip, XFS_IOLOCK_EXCL);
>  				return ret;
> @@ -303,7 +303,8 @@ xfs_file_read_iter(
>  			 * happen on XFS. Warn if it does fail.
>  			 */
>  			ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> -						pos >> PAGE_CACHE_SHIFT, -1);
> +					pos >> PAGE_CACHE_SHIFT,
> +					(pos + size - 1) >> PAGE_CACHE_SHIFT);
>  			WARN_ON_ONCE(ret);
>  			ret = 0;
>  		}
> @@ -641,7 +642,7 @@ xfs_file_dio_aio_write(
>  
>  	if (mapping->nrpages) {
>  		ret = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						    pos, -1);
> +						    pos, pos + count - 1);
>  		if (ret)
>  			goto out;
>  		/*
> @@ -650,7 +651,8 @@ xfs_file_dio_aio_write(
>  		 * happen on XFS. Warn if it does fail.
>  		 */
>  		ret = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> -						pos >> PAGE_CACHE_SHIFT, -1);
> +					pos >> PAGE_CACHE_SHIFT,
> +					(pos + count - 1) >> PAGE_CACHE_SHIFT);
>  		WARN_ON_ONCE(ret);
>  		ret = 0;
>  	}
> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged
  2014-08-21  5:09 ` [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
@ 2014-08-21 13:09   ` Brian Foster
  0 siblings, 0 replies; 20+ messages in thread
From: Brian Foster @ 2014-08-21 13:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 03:09:14PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> If we have delalloc extents on a file before we run a collapse range
> operation, we sync the range that we are going to collapse to
> convert delalloc extents in that region to real extents to simplify
> the shift operation.
> 
> However, the shift operation then assumes that the extent list is
> not going to change as it iterates over the extent list moving
> things about. Unfortunately, this isn't true because we can't hold
> the ILOCK over all the operations. We can prevent new IO from
> modifying the extent list by holding the IOLOCK, but that doesn't
> prevent writeback from running....
> 
> And when writeback runs, it can convert delalloc extents in the
> range of the file prior to the region being collapsed, and this
> changes the indexes of all the extents in the file. That causes the
> collapse range operation to Go Bad.
> 
> The right fix is to rewrite the extent shift operation not to be
> dependent on the extent list not changing across the entire
> operation, but this is a fairly significant piece of work to do.
> Hence, as a short-term workaround for the problem, sync the entire
> file before starting a collapse operation to remove all delalloc
> ranges from the file and so avoid the problem of concurrent
> writeback changing the extent list.
> 
> Diagnosed-and-Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_bmap_util.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c53cc03..035041d 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1460,6 +1460,19 @@ xfs_collapse_file_space(
>  	start_fsb = XFS_B_TO_FSB(mp, offset + len);
>  	shift_fsb = XFS_B_TO_FSB(mp, len);
>  
> +	/*
> +	 * writeback the entire file to prevent concurrent writeback of ranges
> +	 * outside the collapsing region from changing the extent list.
> +	 *
> +	 * XXX: This is a temporary fix until the extent shift loop below is
> +	 * converted to use offsets and lookups within the ILOCK rather than
> +	 * carrying around the index into the extent list for the next
> +	 * iteration.
> +	 */
> +	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
> +	if (error)
> +		return error;
> +
>  	error = xfs_free_file_space(ip, offset, len);
>  	if (error)
>  		return error;
> -- 
> 2.0.0
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21 13:08   ` Christoph Hellwig
@ 2014-08-21 13:54     ` Anton Altaparmakov
  2014-08-21 15:21     ` Chris Mason
  1 sibling, 0 replies; 20+ messages in thread
From: Anton Altaparmakov @ 2014-08-21 13:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-fsdevel, clm, xfs

Hi,

On 21 Aug 2014, at 14:08, Christoph Hellwig <hch@infradead.org> wrote:
> On Thu, Aug 21, 2014 at 03:09:09PM +1000, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>> 
>> generic/263 is failing fsx at this point with a page spanning
>> EOF that cannot be invalidated. The operations are:
>> 
>> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
>> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
>> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
>> 
>> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
>> write attempts to invalidate the cached page over this range, it
>> fails with -EBUSY and so we fire this assert:
>> 
>> XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
>> 
>> because the kernel is trying to fall back to buffered IO on the
>> direct IO path (which XFS does not do).
>> 
>> The real question is this: Why can't that page be invalidated after
>> it has been written to disk and cleaned?
>> 
>> Well, there's data on the first two buffers in the page (1k block
>> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
>> is failing drop_buffers because it's bh->b_state == 0x3, which is
>> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
>> what?
>> 
>> OK, set_buffer_dirty() is called on all buffers from
>> __set_page_buffers_dirty(), regardless of whether the buffer is
>> beyond EOF or not, which means that when we get to ->writepage,
>> we have buffers marked dirty beyond EOF that we need to clean.
>> So, we need to implement our own .set_page_dirty method that
>> doesn't dirty buffers beyond EOF.
> 
> Shouldn't this be fixed in __set_page_buffers_dirty itself?  This
> doesn't seem an XFS-specific issue.

It is perfectly normal to have dirty buffers beyond EOF.  The file system should just clean such buffers so the page then becomes reclaimable (or if the file system is using generic functionality this would be done by the generic functions but XFS is implementing writepage(s) itself and missing this bit).

The generic function you want to see is fs/buffer.c::__block_write_full_page() which does:

                if (block > last_block) {
                        /*
                         * mapped buffers outside i_size will occur, because
                         * this page can be outside i_size when there is a
                         * truncate in progress.
                         */
                        /*
                         * The buffer was zeroed by block_write_full_page()
                         */
                        clear_buffer_dirty(bh);
                        set_buffer_uptodate(bh);

Why do you want to add complexity to __set_page_buffers_dirty() given that they are no big deal and we need to be able to cope with their existence due to concurrent truncate anyway?

Best regards,

	Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
University of Cambridge Information Services, Roger Needham Building
7 JJ Thomson Avenue, Cambridge, CB3 0RB, UK


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21 13:08   ` Christoph Hellwig
  2014-08-21 13:54     ` Anton Altaparmakov
@ 2014-08-21 15:21     ` Chris Mason
  1 sibling, 0 replies; 20+ messages in thread
From: Chris Mason @ 2014-08-21 15:21 UTC (permalink / raw)
  To: Christoph Hellwig, Dave Chinner; +Cc: linux-fsdevel, xfs



On 08/21/2014 09:08 AM, Christoph Hellwig wrote:
> On Thu, Aug 21, 2014 at 03:09:09PM +1000, Dave Chinner wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> generic/263 is failing fsx at this point with a page spanning
>> EOF that cannot be invalidated. The operations are:
>>
>> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
>> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
>> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
>>
>> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
>> write attempts to invalidate the cached page over this range, it
>> fails with -EBUSY and so we fire this assert:
>>
>> XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
>>
>> because the kernel is trying to fall back to buffered IO on the
>> direct IO path (which XFS does not do).
>>
>> The real question is this: Why can't that page be invalidated after
>> it has been written to disk and cleaned?
>>
>> Well, there's data on the first two buffers in the page (1k block
>> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
>> is failing drop_buffers because it's bh->b_state == 0x3, which is
>> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
>> what?
>>
>> OK, set_buffer_dirty() is called on all buffers from
>> __set_page_buffers_dirty(), regardless of whether the buffer is
>> beyond EOF or not, which means that when we get to ->writepage,
>> we have buffers marked dirty beyond EOF that we need to clean.
>> So, we need to implement our own .set_page_dirty method that
>> doesn't dirty buffers beyond EOF.
> 
> Shouldn't this be fixed in __set_page_buffers_dirty itself?  This
> doesn't seem an XFS-specific issue.
> 

block_write_full_page() is invalidating buffers past eof.  Probably
because we used to be able to dirty buffers without holding the page
lock, and it's much easier to trust i_size at writepage time.

I think we have the page locked for all the dirties now, so this isn't
as important as in the past?

-chris


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
  2014-08-21 12:48   ` Brian Foster
  2014-08-21 13:08   ` Christoph Hellwig
@ 2014-08-21 19:56   ` Jan Kara
  2014-08-21 22:33     ` Dave Chinner
  2 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2014-08-21 19:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, xfs

On Thu 21-08-14 15:09:09, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> generic/263 is failing fsx at this point with a page spanning
> EOF that cannot be invalidated. The operations are:
> 
> 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> 
> where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> write attempts to invalidate the cached page over this range, it
> fails with -EBUSY and so we fire this assert:
> 
> XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> 
> because the kernel is trying to fall back to buffered IO on the
> direct IO path (which XFS does not do).
> 
> The real question is this: Why can't that page be invalidated after
> it has been written to disk and cleaned?
> 
> Well, there's data on the first two buffers in the page (1k block
> size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> is failing drop_buffers because it's bh->b_state == 0x3, which is
> BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> what?
> 
> OK, set_buffer_dirty() is called on all buffers from
> __set_page_buffers_dirty(), regardless of whether the buffer is
> beyond EOF or not, which means that when we get to ->writepage,
> we have buffers marked dirty beyond EOF that we need to clean.
> So, we need to implement our own .set_page_dirty method that
> doesn't dirty buffers beyond EOF.
> 
> This is messy because the buffer code is not meant to be shared
> and it has interesting locking issues on the buffer dirty bits.
> So just copy and paste it and then modify it to suit what we need.
  Well, I'm not sure this is the cleanest way to fix your problem. The
thing is that inode size can change (decrease) after set_page_dirty() has
run and writeback can find the page before truncate_inode_pages() calls
do_invalidatepage() on the last partial page. Now I agree that given
truncate and direct IO are both synchronized using IOLOCK, this change
still fixes your problem. I just wanted to point out that your change
doesn't really make sure we won't see dirty buffers in a tail page beyond
i_size.

As Anton has pointed out other filesystems solve the same issue by clearing
the dirty bits beyond EOF in their writepage() function. Also since we
zero-out the tail of the page in writepage() (even in XFS as I checked),
this kind of makes sense to me and has smaller overhead than special
set_page_dirty()...

								Honza

> cc: <stable@kernel.org>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_aops.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 58 insertions(+)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 11e9b4c..2a316ad 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1753,11 +1753,69 @@ xfs_vm_readpages(
>  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
>  }
>  
> +/*
> + * This is basically a copy of __set_page_dirty_buffers() with one
> + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> + * dirty, we'll never be able to clean them because we don't write buffers
> + * beyond EOF, and that means we can't invalidate pages that span EOF
> + * that have been marked dirty. Further, the dirty state can leak into
> + * the file interior if the file is extended, resulting in all sorts of
> + * bad things happening as the state does not match the underlying data.
> + */
> +STATIC int
> +xfs_vm_set_page_dirty(
> +	struct page		*page)
> +{
> +	struct address_space	*mapping = page_mapping(page);
> +	struct inode		*inode = mapping->host;
> +	loff_t			end_offset;
> +	loff_t			offset;
> +	int			newly_dirty;
> +
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	end_offset = i_size_read(inode);
> +	offset = end_offset & PAGE_CACHE_MASK;
> +
> +	spin_lock(&mapping->private_lock);
> +	if (page_has_buffers(page)) {
> +		struct buffer_head *head = page_buffers(page);
> +		struct buffer_head *bh = head;
> +
> +		do {
> +			if (offset < end_offset)
> +				set_buffer_dirty(bh);
> +			bh = bh->b_this_page;
> +			offset += 1 << inode->i_blkbits;
> +		} while (bh != head);
> +	}
> +	newly_dirty = !TestSetPageDirty(page);
> +	spin_unlock(&mapping->private_lock);
> +
> +	if (newly_dirty) {
> +		/* sigh - __set_page_dirty() is static, so copy it here, too */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&mapping->tree_lock, flags);
> +		if (page->mapping) {	/* Race with truncate? */
> +			WARN_ON_ONCE(!PageUptodate(page));
> +			account_page_dirtied(page, mapping);
> +			radix_tree_tag_set(&mapping->page_tree,
> +					page_index(page), PAGECACHE_TAG_DIRTY);
> +		}
> +		spin_unlock_irqrestore(&mapping->tree_lock, flags);
> +		__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +	}
> +	return newly_dirty;
> +}
> +
>  const struct address_space_operations xfs_address_space_operations = {
>  	.readpage		= xfs_vm_readpage,
>  	.readpages		= xfs_vm_readpages,
>  	.writepage		= xfs_vm_writepage,
>  	.writepages		= xfs_vm_writepages,
> +	.set_page_dirty		= xfs_vm_set_page_dirty,
>  	.releasepage		= xfs_vm_releasepage,
>  	.invalidatepage		= xfs_vm_invalidatepage,
>  	.write_begin		= xfs_vm_write_begin,
> -- 
> 2.0.0
> 
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21 19:56   ` Jan Kara
@ 2014-08-21 22:33     ` Dave Chinner
  2014-08-26 16:06       ` Jan Kara
  0 siblings, 1 reply; 20+ messages in thread
From: Dave Chinner @ 2014-08-21 22:33 UTC (permalink / raw)
  To: Jan Kara; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 09:56:32PM +0200, Jan Kara wrote:
> On Thu 21-08-14 15:09:09, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > generic/263 is failing fsx at this point with a page spanning
> > EOF that cannot be invalidated. The operations are:
> > 
> > 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> > 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> > 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> > 
> > where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> > write attempts to invalidate the cached page over this range, it
> > fails with -EBUSY and so we fire this assert:
> > 
> > XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> > 
> > because the kernel is trying to fall back to buffered IO on the
> > direct IO path (which XFS does not do).
> > 
> > The real question is this: Why can't that page be invalidated after
> > it has been written to disk and cleaned?
> > 
> > Well, there's data on the first two buffers in the page (1k block
> > size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> > is failing drop_buffers because it's bh->b_state == 0x3, which is
> > BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> > what?
> > 
> > OK, set_buffer_dirty() is called on all buffers from
> > __set_page_buffers_dirty(), regardless of whether the buffer is
> > beyond EOF or not, which means that when we get to ->writepage,
> > we have buffers marked dirty beyond EOF that we need to clean.
> > So, we need to implement our own .set_page_dirty method that
> > doesn't dirty buffers beyond EOF.
> > 
> > This is messy because the buffer code is not meant to be shared
> > and it has interesting locking issues on the buffer dirty bits.
> > So just copy and paste it and then modify it to suit what we need.
>   Well, I'm not sure this is the cleanest way to fix your problem. The
> thing is that inode size can change (decrease) after set_page_dirty() has
> run and writeback can find the page before truncate_inode_pages() calls
> do_invalidatepage() on the last partial page.  Now I agree that given
> truncate and direct IO are both synchronized using IOLOCK, this change
> still fixes your problem. I just wanted to point out that your change
> doesn't really make sure we won't see dirty buffers in a tail page beyond
> i_size.

Truncate is not the problem.  The issue is that we can't
*invalidate* pages that have dirty buffers, and I soon realised that
we don't see this problem with truncate because *truncate ignores
invalidation failures*. So now I went looking for how other code
handled this.

> As Anton has pointed out other filesystems solve the same issue by clearing
> the dirty bits beyond EOF in their writepage() function. Also since we
> zero-out the tail of the page in writepage() (even in XFS as I checked),
> this kind of makes sense to me and has smaller overhead than special
> set_page_dirty()...

Yes, and that's where I started - block_write_full_page and so
I ended up with this first:

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 2a316ad..a9d6afc 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1057,8 +1073,24 @@ xfs_vm_writepage(
 	do {
 		int new_ioend = 0;
 
-		if (offset >= end_offset)
-			break;
+		/*
+		 * If the buffer is fully beyond EOF, we need to mark it clean
+		 * otherwise we'll leave stale dirty buffers in memory. See the
+		 * comment above in the EOF handling about Charlie Foxtrots.
+		 */
+		if (offset >= end_offset) {
+			clear_buffer_dirty(bh);
+			ASSERT(!buffer_delay(bh));
+			ASSERT(!buffer_mapped(bh));
+
+			/*
+			 * Page is not uptodate until buffers under IO have been
+			 * fully processed.
+			 */
+			uptodate = 0;
+			continue;
+		}
+
 		if (!buffer_uptodate(bh))
 			uptodate = 0;
 

But this doesn't solve the invalidation failures - it merely covers
up a symptom of the underlying problem. i.e. fsx went from
failing invalidate on the EOF page at operation #1192 to failing
invalidate on the EOF page at operation #9378.

Then I realised that flush/wait/invalidate is actually racy against
mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
is not a solution to the problem - it can still occur. So I added
this check to releasepage() to catch stale dirty buffers beyond EOF
preventing invalidation from succeeding:

@@ -1209,9 +1245,16 @@ xfs_vm_writepages(
 
 /*
  * Called to move a page into cleanable state - and from there
- * to be released. The page should already be clean. We always
+ * to be released. We always
  * have buffer heads in this call.
  *
+ * The page should already be clean, except in the case where EOF falls within
+ * the page and then we can have dirty buffers beyond EOF that nobody can
+ * actually clean. These dirty buffers will cause invalidation failures, but
+ * because they can't be written they should not prevent us from tossing the page
+ * out of cache. Hence if we span EOF, walk the buffers on the page and make
+ * sure they are in a state where try_to_free_buffers() will succeed.
+ *
  * Returns 1 if the page is ok to release, 0 otherwise.
  */
 STATIC int
@@ -1219,7 +1262,10 @@ xfs_vm_releasepage(
 	struct page		*page,
 	gfp_t			gfp_mask)
 {
+	struct inode		*inode = page->mapping->host;
 	int			delalloc, unwritten;
+	loff_t			end_offset, offset;
+	pgoff_t                 end_index;
 
 	trace_xfs_releasepage(page->mapping->host, page, 0, 0);
 
@@ -1230,5 +1276,21 @@ xfs_vm_releasepage(
 	if (WARN_ON_ONCE(unwritten))
 		return 0;
 
+	end_offset = i_size_read(inode);
+	end_index = end_offset >> PAGE_CACHE_SHIFT;
+	if (page->index >= end_index) {
+		struct buffer_head *head = page_buffers(page);
+		struct buffer_head *bh;
+
+		offset = end_index << PAGE_CACHE_SHIFT;
+		bh = head;
+		do {
+			if (offset > end_offset)
+				clear_buffer_dirty(bh);
+			bh = bh->b_this_page;
+			offset += 1 << inode->i_blkbits;
+		} while (bh != head);
+	}
+
 	return try_to_free_buffers(page);
 }

Which then moved the fsx invalidation failure out to somewhere
around 105,000 operations.

At which point I realised that I'm playing whack-a-mole with a
fundamental problem: buffers beyond EOF cannot be written, so
dirtying them in the first place is just fundamentally wrong. In XFS we'll
zero them on disk during an extend operation (either in write,
truncate or prealloc), so we're not going to leak stale data by not
marking them dirty.  They may not even be allocated, so we can't
assume that we can write them. So, rather than trying to handle this
dirty-buffers-beyond-EOF case in every situation where we might trip
over it, let's just prevent it from happening in the first place.

That's where the .set_page_dirty() method came about. The "mmap
dirties buffers beyond the EOF" problem is gone. Now, truncate might
have a similar problem with leaving dirty buffers beyond EOF as you
suggest, but I just can't seem to trip over that problem and it
hasn't shown up in the ~500 million fsx ops and ~30 hours of
fsstress that I've run in the past few days. So without being able
to reproduce a problem, I'm extremely wary of trying to "fix" it.

Quite frankly, preventing data corruption is far, far more important
than optimisation and efficiency....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21 12:48   ` Brian Foster
@ 2014-08-21 22:38     ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-21 22:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: clm, xfs

On Thu, Aug 21, 2014 at 08:48:24AM -0400, Brian Foster wrote:
> On Thu, Aug 21, 2014 at 03:09:09PM +1000, Dave Chinner wrote:
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1753,11 +1753,69 @@ xfs_vm_readpages(
> >  	return mpage_readpages(mapping, pages, nr_pages, xfs_get_blocks);
> >  }
> >  
> > +/*
> > + * This is basically a copy of __set_page_dirty_buffers() with one
> > + * small tweak: buffers beyond EOF do not get marked dirty. If we mark them
> > + * dirty, we'll never be able to clean them because we don't write buffers
> > + * beyond EOF, and that means we can't invalidate pages that span EOF
> > + * that have been marked dirty. Further, the dirty state can leak into
> > + * the file interior if the file is extended, resulting in all sorts of
> > + * bad things happening as the state does not match the underlying data.
> > + */
> > +STATIC int
> > +xfs_vm_set_page_dirty(
> > +	struct page		*page)
> > +{
> > +	struct address_space	*mapping = page_mapping(page);
> 
> This breaks xfs as a kernel module:
> 
> $ make -j 8 M=fs/xfs
>   Building modules, stage 2.
>   MODPOST 1 modules
> WARNING: "page_mapping" [fs/xfs/xfs.ko] undefined!
>   ...

Oh, that should just be:

+	struct address_space	*mapping = page->mapping;

> I suppose we could export that symbol, but why wouldn't we just propose
> this change to __set_page_dirty_buffers()?

I'm not going to risk breaking other filesystems that have implicit
dependencies on buffers beyond EOF being dirtied.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-21 22:33     ` Dave Chinner
@ 2014-08-26 16:06       ` Jan Kara
  2014-08-26 21:38         ` Dave Chinner
  0 siblings, 1 reply; 20+ messages in thread
From: Jan Kara @ 2014-08-26 16:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: clm, Jan Kara, xfs

On Fri 22-08-14 08:33:32, Dave Chinner wrote:
> On Thu, Aug 21, 2014 at 09:56:32PM +0200, Jan Kara wrote:
> > On Thu 21-08-14 15:09:09, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > generic/263 is failing fsx at this point with a page spanning
> > > EOF that cannot be invalidated. The operations are:
> > > 
> > > 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> > > 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> > > 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> > > 
> > > where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> > > write attempts to invalidate the cached page over this range, it
> > > fails with -EBUSY and so we fire this assert:
> > > 
> > > XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> > > 
> > > because the kernel is trying to fall back to buffered IO on the
> > > direct IO path (which XFS does not do).
> > > 
> > > The real question is this: Why can't that page be invalidated after
> > > it has been written to disk and cleaned?
> > > 
> > > Well, there's data on the first two buffers in the page (1k block
> > > size, 4k page), but the third buffer on the page (i.e. beyond EOF)
> > > is failing drop_buffers because it's bh->b_state == 0x3, which is
> > > BH_Uptodate | BH_Dirty.  IOWs, there's dirty buffers beyond EOF. Say
> > > what?
> > > 
> > > OK, set_buffer_dirty() is called on all buffers from
> > > __set_page_dirty_buffers(), regardless of whether the buffer is
> > > beyond EOF or not, which means that when we get to ->writepage,
> > > we have buffers marked dirty beyond EOF that we need to clean.
> > > So, we need to implement our own .set_page_dirty method that
> > > doesn't dirty buffers beyond EOF.
> > > 
> > > This is messy because the buffer code is not meant to be shared
> > > and it has interesting locking issues on the buffer dirty bits.
> > > So just copy and paste it and then modify it to suit what we need.
> >   Well, I'm not sure this is the cleanest way to fix your problem. The
> > thing is that inode size can change (decrease) after set_page_dirty() has
> > run and writeback can find the page before truncate_inode_pages() calls
> > do_invalidatepage() on the last partial page.  Now I agree that given
> > truncate and direct IO are both synchronized using IOLOCK, this change
> > still fixes your problem. I just wanted to point out that your change
> > doesn't really make sure we won't see dirty buffers in a tail page beyond
> > i_size.
> 
> Truncate is not the problem.  The issue is that we can't
> *invalidate* pages that have dirty buffers, and I soon realised that
> we don't see this problem with truncate because *truncate ignores
> invalidation failures*. So now I went looking for how other code
> handled this.
> 
> > As Anton has pointed out other filesystems solve the same issue by clearing
> > the dirty bits beyond EOF in their writepage() function. Also since we
> > zero-out the tail of the page in writepage() (even in XFS as I checked),
> > this kind of makes sense to me and has smaller overhead than special
> > set_page_dirty()...
> 
> Yes, and that's where I started - block_write_full_page and so
> I ended up with this first:
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 2a316ad..a9d6afc 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -1057,8 +1073,24 @@ xfs_vm_writepage(
>  	do {
>  		int new_ioend = 0;
>  
> -		if (offset >= end_offset)
> +		/*
> +		 * If the buffer is fully beyond EOF, we need to mark it clean
> +		 * otherwise we'll leave stale dirty buffers in memory. See the
> +		 * comment above in the EOF handling about Charlie Foxtrots.
> +		 */
> +		if (offset >= end_offset) {
> +			clear_buffer_dirty(bh);
> +			ASSERT(!buffer_delay(bh));
> +			ASSERT(!buffer_mapped(bh));
> +
> +			/*
> +			 * Page is not uptodate until buffers under IO have been
> +			 * fully processed.
> +			 */
> +			uptodate = 0;
> +			continue;
> -			break;
> +		}
> +
>  		if (!buffer_uptodate(bh))
>  			uptodate = 0;
>  
> 
> But this doesn't solve the invalidation failures - it merely covers
> up a symptom of the underlying problem. i.e. fsx went from
> failing invalidate on the EOF page at operation #1192 to failing
> invalidate on the EOF page at operation #9378.
>
> Then I realised that flush/wait/invalidate is actually racy against
> mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
> is not a solution to the problem - it can still occur. So I added
  Hum, I have to say I don't quite understand this. I agree with you that
clearing dirty bits in .writepage() is racy wrt mmap. However I don't see
how fsx can ever trigger this. For flush/wait/invalidate to fail, someone
would have to write to the mapping somewhere between flush and invalidate.
However fsx is singlethreaded so I don't see how that can happen.

> this check to releasepage() to catch stale dirty buffers beyond EOF
> preventing invalidation from succeeding:
> 
> @@ -1209,9 +1245,16 @@ xfs_vm_writepages(
>  
>  /*
>   * Called to move a page into cleanable state - and from there
> - * to be released. The page should already be clean. We always
> + * to be released. We always
>   * have buffer heads in this call.
>   *
> + * The page should already be clean, except in the case where EOF falls within
> + * the page and then we can have dirty buffers beyond EOF that nobody can
> + * actually clean. These dirty buffers will cause invalidation failures, but
> > + * because they can't be written they should not prevent us from tossing the page
> + * out of cache. Hence if we span EOF, walk the buffers on the page and make
> + * sure they are in a state where try_to_free_buffers() will succeed.
> + *
>   * Returns 1 if the page is ok to release, 0 otherwise.
>   */
>  STATIC int
> @@ -1219,7 +1262,10 @@ xfs_vm_releasepage(
>  	struct page		*page,
>  	gfp_t			gfp_mask)
>  {
> +	struct inode		*inode = page->mapping->host;
>  	int			delalloc, unwritten;
> +	loff_t			end_offset, offset;
> +	pgoff_t                 end_index;
>  
>  	trace_xfs_releasepage(page->mapping->host, page, 0, 0);
>  
> @@ -1230,5 +1276,21 @@ xfs_vm_releasepage(
>  	if (WARN_ON_ONCE(unwritten))
>  		return 0;
>  
> +	end_offset = i_size_read(inode);
> +	end_index = end_offset >> PAGE_CACHE_SHIFT;
> +	if (page->index >= end_index) {
> +		struct buffer_head *head = page_buffers(page);
> +		struct buffer_head *bh;
> +
> +		offset = end_index << PAGE_CACHE_SHIFT;
  That should be page->index instead of end_index, I suppose. However it
shouldn't make a difference wrt your experiment.

> +		bh = head;
> +		do {
> +			if (offset > end_offset)
> +				clear_buffer_dirty(bh);
> +			bh = bh->b_this_page;
> +			offset += 1 << inode->i_blkbits;
> +		} while (bh != head);
> +	}
> +
>  	return try_to_free_buffers(page);
>  }
> 
> Which then moved the fsx invalidation failure out to somewhere
> around 105,000 operations.
> 
> At which point I realised that I'm playing whack-a-mole with a
> fundamental problem: buffers beyond EOF cannot be written, so
> dirtying them in the first place is just fundamentally wrong. In XFS we'll
> zero them on disk during an extend operation (either in write,
> truncate or prealloc), so we're not going to leak stale data by not
> marking them dirty.  They may not even be allocated, so we can't
> assume that we can write them. So, rather than trying to handle this
> dirty-buffers-beyond-EOF case in every situation where we might trip
> over it, let's just prevent it from happening in the first place.
  Sure I understand that. I'm now trying to understand why no one else sees
the problem (we do run xfstests for ext4 and blocksize < page size) and
whether there's something we might want to do for other filesystems as
well. I wonder if there isn't something special going on with XFS
redirtying the page because xfs_imap_valid() returned EAGAIN but looking
into the code I don't see how that could happen.

> That's where the .set_page_dirty() method came about. The "mmap
> dirties buffers beyond the EOF" problem is gone. Now, truncate might
> have a similar problem with leaving dirty buffers beyond EOF as you
> suggest, but I just can't seem to trip over that problem and it
> hasn't shown up in the ~500 million fsx ops and ~30 hours of
> fsstress that I've run in the past few days. So without being able
> to reproduce a problem, I'm extremely wary of trying to "fix" it.
  So truncate is introducing dirty pages beyond EOF only temporarily -
between the moment i_size is set and the moment page is invalidated. Since
direct IO is synchronized with truncate, it cannot see them and you are
fine. Only stuff like writeback or page reclaim can see pages in such
state...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR



* Re: [PATCH 1/6] xfs: mmap write/read leaves bad state on pages
  2014-08-26 16:06       ` Jan Kara
@ 2014-08-26 21:38         ` Dave Chinner
  0 siblings, 0 replies; 20+ messages in thread
From: Dave Chinner @ 2014-08-26 21:38 UTC (permalink / raw)
  To: Jan Kara; +Cc: clm, xfs

On Tue, Aug 26, 2014 at 06:06:30PM +0200, Jan Kara wrote:
> On Fri 22-08-14 08:33:32, Dave Chinner wrote:
> > On Thu, Aug 21, 2014 at 09:56:32PM +0200, Jan Kara wrote:
> > > On Thu 21-08-14 15:09:09, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > generic/263 is failing fsx at this point with a page spanning
> > > > EOF that cannot be invalidated. The operations are:
> > > > 
> > > > 1190 mapwrite   0x52c00 thru    0x5e569 (0xb96a bytes)
> > > > 1191 mapread    0x5c000 thru    0x5d636 (0x1637 bytes)
> > > > 1192 write      0x5b600 thru    0x771ff (0x1bc00 bytes)
> > > > 
> > > > where 1190 extends EOF from 0x54000 to 0x5e569. When the direct IO
> > > > write attempts to invalidate the cached page over this range, it
> > > > fails with -EBUSY and so we fire this assert:
> > > > 
> > > > XFS: Assertion failed: ret < 0 || ret == count, file: fs/xfs/xfs_file.c, line: 676
> > > > 
> > > > because the kernel is trying to fall back to buffered IO on the
> > > > direct IO path (which XFS does not do).
> > > > 
> > > > The real question is this: Why can't that page be invalidated after
> > > > it has been written to disk and cleaned?
....
> > > As Anton has pointed out other filesystems solve the same issue by clearing
> > > the dirty bits beyond EOF in their writepage() function. Also since we
> > > zero-out the tail of the page in writepage() (even in XFS as I checked),
> > > this kind of makes sense to me and has smaller overhead than special
> > > set_page_dirty()...
> > 
> > Yes, and that's where I started - block_write_full_page and so
> > I ended up with this first:
> > 
> > diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> > index 2a316ad..a9d6afc 100644
> > --- a/fs/xfs/xfs_aops.c
> > +++ b/fs/xfs/xfs_aops.c
> > @@ -1057,8 +1073,24 @@ xfs_vm_writepage(
> >  	do {
> >  		int new_ioend = 0;
> >  
> > -		if (offset >= end_offset)
> > +		/*
> > +		 * If the buffer is fully beyond EOF, we need to mark it clean
> > +		 * otherwise we'll leave stale dirty buffers in memory. See the
> > +		 * comment above in the EOF handling about Charlie Foxtrots.
> > +		 */
> > +		if (offset >= end_offset) {
> > +			clear_buffer_dirty(bh);
> > +			ASSERT(!buffer_delay(bh));
> > +			ASSERT(!buffer_mapped(bh));
> > +
> > +			/*
> > +			 * Page is not uptodate until buffers under IO have been
> > +			 * fully processed.
> > +			 */
> > +			uptodate = 0;
> > +			continue;
> > -			break;
> > +		}
> > +
> >  		if (!buffer_uptodate(bh))
> >  			uptodate = 0;
> >  
> > 
> > But this doesn't solve the invalidation failures - it merely covers
> > up a symptom of the underlying problem. i.e. fsx went from
> > failing invalidate on the EOF page at operation #1192 to failing
> > invalidate on the EOF page at operation #9378.
> >
> > Then I realised that flush/wait/invalidate is actually racy against
> > mmap, so writepage clearing dirty buffer flags on buffers beyond EOF
> > is not a solution to the problem - it can still occur. So I added
>   Hum, I have to say I don't quite understand this. I agree with you that
> clearing dirty bits in .writepage() is racy wrt mmap. However I don't see
> how fsx can ever trigger this. For flush/wait/invalidate to fail, someone
> would have to write to the mapping somewhere between flush and invalidate.
> However fsx is singlethreaded so I don't see how that can happen.

Right, I was talking about the general problem, not necessarily the
specific fsx problem I was seeing. i.e. clearing dirty buffers
beyond EOF in writepage is not a robust solution, so regardless of
whatever other problems there are, it's not a solution we can rely
on.

....

> > At which point I realised that I'm playing whack-a-mole with a
> > fundamental problem: buffers beyond EOF cannot be written, so
> > dirtying them in the first place is just fundamentally wrong. In XFS we'll
> > zero them on disk during an extend operation (either in write,
> > truncate or prealloc), so we're not going to leak stale data by not
> > marking them dirty.  They may not even be allocated, so we can't
> > assume that we can write them. So, rather than trying to handle this
> > dirty-buffers-beyond-EOF case in every situation where we might trip
> > over it, let's just prevent it from happening in the first place.
>   Sure I understand that. I'm now trying to understand why no one else sees
> the problem (we do run xfstests for ext4 and blocksize < page size) and
> whether there's something we might want to do for other filesystems as
> well.

Right, I don't understand where it's coming from, either.

I know there are still problems with stray delalloc blocks even
after making this change because xfs/297 is still tripping over
delayed allocation blocks where there should be none during direct
IO. i.e. this avoids the EOF issues that the direct IO invalidation
changes expose, but I still haven't been able to find the underlying
cause that is still triggering problems within EOF.

This is the disadvantage of smart people writing smart code - most
people (me included) are simply too dumb to be able to debug it. And
so when shit goes wrong, we're left to wonder where the paddle was
hidden...

> I wonder if there isn't something special going on with XFS
> redirtying the page because xfs_imap_valid() returned EAGAIN but looking
> into the code I don't see how that could happen.

To tell the truth, I'm fed up with having to deal repeatedly with
bufferhead coherency problems and not being able to find the root
cause. We just keep finding problem after problem after problem and
there's no end in sight. So I'm much more inclined to spend time
removing bufferheads from XFS than I am to try to find a needle in a
very large haystack....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com



end of thread, other threads:[~2014-08-26 21:38 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-21  5:09 [PATCH 0/6] xfs: direct IO invalidation and related fixes Dave Chinner
2014-08-21  5:09 ` [PATCH 1/6] xfs: mmap write/read leaves bad state on pages Dave Chinner
2014-08-21 12:48   ` Brian Foster
2014-08-21 22:38     ` Dave Chinner
2014-08-21 13:08   ` Christoph Hellwig
2014-08-21 13:54     ` Anton Altaparmakov
2014-08-21 15:21     ` Chris Mason
2014-08-21 19:56   ` Jan Kara
2014-08-21 22:33     ` Dave Chinner
2014-08-26 16:06       ` Jan Kara
2014-08-26 21:38         ` Dave Chinner
2014-08-21  5:09 ` [PATCH 2/6] xfs: don't zero partial page cache pages during O_DIRECT writes Dave Chinner
2014-08-21 13:08   ` Brian Foster
2014-08-21  5:09 ` [PATCH 3/6] " Dave Chinner
2014-08-21 13:09   ` Brian Foster
2014-08-21  5:09 ` [PATCH 4/6] xfs: use ranged writeback and invalidation for direct IO Dave Chinner
2014-08-21 13:09   ` Brian Foster
2014-08-21  5:09 ` [PATCH 5/6] xfs: don't log inode unless extent shift makes extent modifications Dave Chinner
2014-08-21  5:09 ` [PATCH 6/6] xfs: xfs_file_collapse_range is delalloc challenged Dave Chinner
2014-08-21 13:09   ` Brian Foster
