All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: skip free eofblocks if inode is under written back
@ 2017-08-30  4:26 Eryu Guan
  2017-08-30  7:29 ` Dave Chinner
  0 siblings, 1 reply; 6+ messages in thread
From: Eryu Guan @ 2017-08-30  4:26 UTC (permalink / raw)
  To: linux-xfs; +Cc: Eryu Guan

xfs_vm_writepages() saves a cached imap in a writeback context
structure and passes it to xfs_do_writepage() for every page in this
writeback, so xfs_writepage_map() (called by xfs_do_writepage())
doesn't have to lookup & recalculate the mapping for every buffer it
writes if the cached imap is considered as valid.

But there's a chance that the cached imap becomes stale (doesn't
match the real extent entry) due to the removing of speculative
preallocated blocks, in this case xfs_imap_valid() still reports
imap as valid, because the buffer it's writing is still within the
cached imap range. This could result in fs corruption and data
corruption (data written to wrong block).

For example, the following sequences make it possible (assuming 4k
block size XFS):

  1. delalloc write 1072 blocks, with speculative preallocation,
  2032 blocks got allocated

  2. started a WB_SYNC_NONE writeback (so it wrote all
  PAGECACHE_TAG_DIRTY pages, additional dirty page from step 4 can
  be picked up, this could be a background writeback, or a bare
  SYNC_FILE_RANGE_WRITE sync_file_range() call) on this inode,
  filled in br_startblock, converted delayed allocation to real
  allocation, but br_blockcount was unchanged, still 2032, and this
  imap got cached in writeback context structure for writing
  subsequent pages

  3. close file, xfs_release() called xfs_free_eofblocks() to remove
  any speculative allocated blocks, br_blockcount changed to 1072,
  but cached imap still had 2032,

  4. another delalloc append write one more block to this file, a
  new extent was allocated, with br_startoff = 1072 (because
  previous extent was converted to a real allocation, new delalloc
  extent can't be merged with previous one), dirty the associated
  page and close file

  5. xfs_vm_writepages() dealt with the newly dirtied page (index
  1072), the cached imap was still considered as valid (as 0 <= 1072
  < 2032), so the buffer associated with the page was written to
  wrong block, leaving the new delalloc extent unprocessed and
  leaking i_delayed_blks, which caused sb_fdblocks inconsistency

This is not easy to reproduce, Zorro Lang once hit it by running
generic/224 on ppc64 host. But with the lustre-racer test[1], it can
be reproduced within minutes if you have good luck. Changing
RACER_PROGS in racer.sh to only contain "file_create file_rm
dir_create file_concat" seems to make it eaiser.

Fix it by simply skipping free eofblocks in xfs_release() if the
inode is currently under written back to avoid the imap change in
the middle of a writeback.

[1] https://git.hpdd.intel.com/?p=fs/lustre-release.git;a=tree;f=lustre/tests/racer;hb=HEAD

Signed-off-by: Eryu Guan <eguan@redhat.com>
---

I've tested this patch with lustre-racer, and fstests with various configs (v4,
v5, different block sizes, reflink or not, rmapbt or not), and I didn't see any
new failures. And I'll work on a new fstests case to cover this bug.

 fs/xfs/xfs_inode.c | 18 +++++++++++++++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index ff48f0096810..f9e09f6c6d36 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1656,9 +1656,10 @@ xfs_release(
 	xfs_inode_t	*ip)
 {
 	xfs_mount_t	*mp = ip->i_mount;
+	struct inode	*inode = VFS_I(ip);
 	int		error;
 
-	if (!S_ISREG(VFS_I(ip)->i_mode) || (VFS_I(ip)->i_mode == 0))
+	if (!S_ISREG(inode->i_mode) || (inode->i_mode == 0))
 		return 0;
 
 	/* If this is a read-only mount, don't do this (would generate I/O) */
@@ -1682,14 +1683,14 @@ xfs_release(
 		if (truncated) {
 			xfs_iflags_clear(ip, XFS_IDIRTY_RELEASE);
 			if (ip->i_delayed_blks > 0) {
-				error = filemap_flush(VFS_I(ip)->i_mapping);
+				error = filemap_flush(inode->i_mapping);
 				if (error)
 					return error;
 			}
 		}
 	}
 
-	if (VFS_I(ip)->i_nlink == 0)
+	if (inode->i_nlink == 0)
 		return 0;
 
 	if (xfs_can_free_eofblocks(ip, false)) {
@@ -1710,6 +1711,17 @@ xfs_release(
 		 */
 		if (xfs_iflags_test(ip, XFS_IDIRTY_RELEASE))
 			return 0;
+
+		/*
+		 * Skip if inode is under written back, because removing past
+		 * EOF blocks changes imap, shortens br_blockcount, which could
+		 * confuse the cached imap in xfs_vm_writepages() and
+		 * potentially result in writing data to wrong block in
+		 * writeback.
+		 */
+		if (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_WRITEBACK))
+			return 0;
+
 		/*
 		 * If we can't get the iolock just skip truncating the blocks
 		 * past EOF because we could deadlock with the mmap_sem
-- 
2.13.5


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: skip free eofblocks if inode is under written back
  2017-08-30  4:26 [PATCH] xfs: skip free eofblocks if inode is under written back Eryu Guan
@ 2017-08-30  7:29 ` Dave Chinner
  2017-08-30  7:51   ` Christoph Hellwig
  2017-08-30  9:12   ` Eryu Guan
  0 siblings, 2 replies; 6+ messages in thread
From: Dave Chinner @ 2017-08-30  7:29 UTC (permalink / raw)
  To: Eryu Guan; +Cc: linux-xfs

On Wed, Aug 30, 2017 at 12:26:28PM +0800, Eryu Guan wrote:
> xfs_vm_writepages() saves a cached imap in a writeback context
> structure and passes it to xfs_do_writepage() for every page in this
> writeback, so xfs_writepage_map() (called by xfs_do_writepage())
> doesn't have to lookup & recalculate the mapping for every buffer it
> writes if the cached imap is considered as valid.
> 
> But there's a chance that the cached imap becomes stale (doesn't
> match the real extent entry) due to the removing of speculative
> preallocated blocks, in this case xfs_imap_valid() still reports
> imap as valid, because the buffer it's writing is still within the
> cached imap range. This could result in fs corruption and data
> corruption (data written to wrong block).
> 
> For example, the following sequences make it possible (assuming 4k
> block size XFS):
> 
>   1. delalloc write 1072 blocks, with speculative preallocation,
>   2032 blocks got allocated
> 
>   2. started a WB_SYNC_NONE writeback (so it wrote all
>   PAGECACHE_TAG_DIRTY pages, additional dirty page from step 4 can
>   be picked up, this could be a background writeback, or a bare
>   SYNC_FILE_RANGE_WRITE sync_file_range() call) on this inode,
>   filled in br_startblock, converted delayed allocation to real
>   allocation, but br_blockcount was unchanged, still 2032, and this
>   imap got cached in writeback context structure for writing
>   subsequent pages

Excellent work in tracking this down, Eryu! This has been a
longstanding issue - we've been caching the writeback map for
multi-page writeback since, well, longer than I've been working on
XFS....

First thing - checking for PAGECACHE_TAG_WRITEBACK in xfs_release()
is racy as writeback can start between the check and the EOF blocks
being trimmed in xfs_free_eofblocks(), so it's not really a solution
to the problem.

The root cause of the problem is that xfs_imap_valid() code doesn't
return false when the extent tree is modified and the map is
rendered potentially incorrect. If this was to notice we changed the
extent tree, it would trigger the writeback code to look up a new
map.

Hmmmm, that means the scope is probably wider than
xfs_free_eofblocks() - that's a xfs_bunmapi call, but we've also got
to consider that other IO operations might change the extent map,
too.

Quick patch below, doesn't work on reflink filesystems yet because
the way it handles cached COW mapping invalidation is unnecessarily
special and so doesn't work. Can you test it and see if it fixes the
problem?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: invalidate writepage mappings on extent modification

From: Dave Chinner <dchinner@redhat.com>

Pretty sure this breaks COW writeback because it's a separate
mapping path in xfs_writepage_map() and can't handle random
invalidations. More factoring is needed there to unify the
xfs_map_cow() with xfs_map_blocks() to prevent this:

XFS: Assertion failed: type != XFS_IO_COW, file: fs/xfs/xfs_aops.c, line: 357

Signed-Off-By: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  7 ++++++-
 fs/xfs/xfs_aops.c        | 12 ++++++++++++
 fs/xfs/xfs_inode.h       |  6 ++++++
 fs/xfs/xfs_iomap.c       | 34 ++++++++++++++++++++++++++++++++++
 4 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 9b877024c804..7ec44a39a6b6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4156,7 +4156,6 @@ xfs_bmapi_reserve_delalloc(
 	if (error)
 		goto out_unreserve_blocks;
 
-
 	ip->i_delayed_blks += alen;
 
 	got->br_startoff = aoff;
@@ -5485,6 +5484,12 @@ __xfs_bunmapi(
 	bno = start + len - 1;
 
 	/*
+	 * We're about to make an extent map modification. Ensure any cached
+	 * writeback mapping is invalidated before it is used next.
+	 */
+	xfs_iflags_set(ip, XFS_IWBMAPINVALID);
+
+	/*
 	 * Check to see if the given block number is past the end of the
 	 * file, back up to the last block if so...
 	 */
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6bf120bb1a17..2398bdc554c0 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -376,6 +376,12 @@ xfs_map_blocks(
 	 */
 	if (nimaps && type == XFS_IO_OVERWRITE)
 		xfs_reflink_trim_irec_to_next_cow(ip, offset_fsb, imap);
+
+	/*
+	 * We just got a new mapping for page writeback, so clear the writeback
+	 * map invalid flag before we drop the ilock.
+	 */
+	xfs_iflags_clear(ip, XFS_IWBMAPINVALID);
 	xfs_iunlock(ip, XFS_ILOCK_SHARED);
 
 	if (error)
@@ -408,6 +414,12 @@ xfs_imap_valid(
 	struct xfs_bmbt_irec	*imap,
 	xfs_off_t		offset)
 {
+	/*
+	 * Extent list modification invalidates the currently cached map.
+	 */
+	if (xfs_iflags_test(XFS_I(inode), XFS_IWBMAPINVALID))
+		return false;
+
 	offset >>= inode->i_blkbits;
 
 	return offset >= imap->br_startoff &&
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 0ee453de239a..b08921b6b20c 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -234,6 +234,12 @@ static inline bool xfs_is_reflink_inode(struct xfs_inode *ip)
 #define XFS_IRECOVERY		(1 << 11)
 
 /*
+ * If the extent map of an inode is being modified, then any cached
+ * map in the writeback code needs to be invalidated
+ */
+#define XFS_IWBMAPINVALID		(1 << 12)
+
+/*
  * Per-lifetime flags need to be reset when re-using a reclaimable inode during
  * inode lookup. This prevents unintended behaviour on the new inode from
  * ocurring.
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 1b625d050441..c10f2a1aa0d8 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -260,6 +260,12 @@ xfs_iomap_write_direct(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
+	 * We're about to make an extent map modification.  Ensure any cached
+	 * writeback mapping is invalidated before it is used next.
+	 */
+	xfs_iflags_set(ip, XFS_IWBMAPINVALID);
+
+	/*
 	 * From this point onwards we overwrite the imap pointer that the
 	 * caller gave to us.
 	 */
@@ -562,6 +568,13 @@ xfs_file_iomap_begin_delay(
 		if (xfs_is_reflink_inode(ip)) {
 			bool		shared;
 
+			/*
+			 * We're about to make an extent map modification.
+			 * Ensure any cached writeback mapping is invalidated
+			 * before it is used next.
+			 */
+			xfs_iflags_set(ip, XFS_IWBMAPINVALID);
+
 			end_fsb = min(XFS_B_TO_FSB(mp, offset + count),
 					maxbytes_fsb);
 			xfs_trim_extent(&got, offset_fsb, end_fsb - offset_fsb);
@@ -612,6 +625,12 @@ xfs_file_iomap_begin_delay(
 	}
 
 retry:
+	/*
+	 * We're about to make an extent map modification.  Ensure any cached
+	 * writeback mapping is invalidated before it is used next.
+	 */
+	xfs_iflags_set(ip, XFS_IWBMAPINVALID);
+
 	error = xfs_bmapi_reserve_delalloc(ip, XFS_DATA_FORK, offset_fsb,
 			end_fsb - offset_fsb, prealloc_blocks, &got, &idx, eof);
 	switch (error) {
@@ -792,6 +811,13 @@ xfs_iomap_write_allocate(
 			if (error)
 				goto error0;
 
+			/*
+			 * We just got a new mapping for page writeback, so
+			 * clear the writeback map invalid flag before we drop
+			 * the ilock.
+			 */
+			xfs_iflags_clear(ip, XFS_IWBMAPINVALID);
+
 			xfs_iunlock(ip, XFS_ILOCK_EXCL);
 		}
 
@@ -1054,6 +1080,14 @@ xfs_file_iomap_begin(
 			error = -EAGAIN;
 			goto out_unlock;
 		}
+
+		/*
+		 * We're about to make an extent map modification.  Ensure any
+		 * cached writeback mapping is invalidated before it is used
+		 * next.
+		 */
+		xfs_iflags_set(ip, XFS_IWBMAPINVALID);
+
 		/*
 		 * We cap the maximum length we map here to MAX_WRITEBACK_PAGES
 		 * pages to keep the chunks of work done where somewhat symmetric

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: skip free eofblocks if inode is under written back
  2017-08-30  7:29 ` Dave Chinner
@ 2017-08-30  7:51   ` Christoph Hellwig
  2017-08-30  8:10     ` Dave Chinner
  2017-08-30  9:12   ` Eryu Guan
  1 sibling, 1 reply; 6+ messages in thread
From: Christoph Hellwig @ 2017-08-30  7:51 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Eryu Guan, linux-xfs

On Wed, Aug 30, 2017 at 05:29:50PM +1000, Dave Chinner wrote:
> Quick patch below, doesn't work on reflink filesystems yet because
> the way it handles cached COW mapping invalidation is unnecessarily
> special and so doesn't work. Can you test it and see if it fixes the
> problem?

FYI, I thought about this problem a while ago in the context of
rcu lookups of the extent tree (not, I'm not doing those quite yet,
but I though of the possibilities once we've got a btree!) and
then my answer was that we'd really need a seqlock around uses of
the imaps.  I didn't think far enough that we might need something
like that even now..

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: skip free eofblocks if inode is under written back
  2017-08-30  7:51   ` Christoph Hellwig
@ 2017-08-30  8:10     ` Dave Chinner
  0 siblings, 0 replies; 6+ messages in thread
From: Dave Chinner @ 2017-08-30  8:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Eryu Guan, linux-xfs

On Wed, Aug 30, 2017 at 12:51:09AM -0700, Christoph Hellwig wrote:
> On Wed, Aug 30, 2017 at 05:29:50PM +1000, Dave Chinner wrote:
> > Quick patch below, doesn't work on reflink filesystems yet because
> > the way it handles cached COW mapping invalidation is unnecessarily
> > special and so doesn't work. Can you test it and see if it fixes the
> > problem?
> 
> FYI, I thought about this problem a while ago in the context of
> rcu lookups of the extent tree (not, I'm not doing those quite yet,
> but I though of the possibilities once we've got a btree!) and
> then my answer was that we'd really need a seqlock around uses of
> the imaps.  I didn't think far enough that we might need something
> like that even now..

Now that you mention it, a seqlock is probably a pretty good
fit for this. I'll have a look at it tomorrow....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: skip free eofblocks if inode is under written back
  2017-08-30  7:29 ` Dave Chinner
  2017-08-30  7:51   ` Christoph Hellwig
@ 2017-08-30  9:12   ` Eryu Guan
  2017-08-30 14:26     ` Brian Foster
  1 sibling, 1 reply; 6+ messages in thread
From: Eryu Guan @ 2017-08-30  9:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Aug 30, 2017 at 05:29:50PM +1000, Dave Chinner wrote:
> On Wed, Aug 30, 2017 at 12:26:28PM +0800, Eryu Guan wrote:
> > xfs_vm_writepages() saves a cached imap in a writeback context
> > structure and passes it to xfs_do_writepage() for every page in this
> > writeback, so xfs_writepage_map() (called by xfs_do_writepage())
> > doesn't have to lookup & recalculate the mapping for every buffer it
> > writes if the cached imap is considered as valid.
> > 
> > But there's a chance that the cached imap becomes stale (doesn't
> > match the real extent entry) due to the removing of speculative
> > preallocated blocks, in this case xfs_imap_valid() still reports
> > imap as valid, because the buffer it's writing is still within the
> > cached imap range. This could result in fs corruption and data
> > corruption (data written to wrong block).
> > 
> > For example, the following sequences make it possible (assuming 4k
> > block size XFS):
> > 
> >   1. delalloc write 1072 blocks, with speculative preallocation,
> >   2032 blocks got allocated
> > 
> >   2. started a WB_SYNC_NONE writeback (so it wrote all
> >   PAGECACHE_TAG_DIRTY pages, additional dirty page from step 4 can
> >   be picked up, this could be a background writeback, or a bare
> >   SYNC_FILE_RANGE_WRITE sync_file_range() call) on this inode,
> >   filled in br_startblock, converted delayed allocation to real
> >   allocation, but br_blockcount was unchanged, still 2032, and this
> >   imap got cached in writeback context structure for writing
> >   subsequent pages
> 
> Excellent work in tracking this down, Eryu! This has been a
> longstanding issue - we've been caching the writeback map for
> multi-page writeback since, well, longer than I've been working on
> XFS....

If I read the git history correctly, previously the imap was only cached
for buffer heads within the same page, commit fbcc02561359 ("xfs:
Introduce writeback context for writepages") made the cached imap live
longer, across the whole xfs_vm_writepages() call, which made the window
wider.

But I can't reproduce the bug until commit bb18782aa47d ("xfs: build bios
directly in xfs_add_to_ioend"), I didn't dig into it deeply, perhaps it
made the window even wider?

> 
> First thing - checking for PAGECACHE_TAG_WRITEBACK in xfs_release()
> is racy as writeback can start between the check and the EOF blocks
> being trimmed in xfs_free_eofblocks(), so it's not really a solution
> to the problem.

Ah, you're right, I missed that. 

> 
> The root cause of the problem is that xfs_imap_valid() code doesn't
> return false when the extent tree is modified and the map is
> rendered potentially incorrect. If this was to notice we changed the
> extent tree, it would trigger the writeback code to look up a new
> map.
> 
> Hmmmm, that means the scope is probably wider than
> xfs_free_eofblocks() - that's a xfs_bunmapi call, but we've also got
> to consider that other IO operations might change the extent map,
> too.

Agreed, I did think about a way to notify writeback code an extent
change was made, but I didn't know the relevant code well enough to do
this nicely. And I thought free eofblocks in xfs_release() was the only
place that could race with writeback to cause problems, so I thought
simply skipping free eoflblocks was easier.

> 
> Quick patch below, doesn't work on reflink filesystems yet because
> the way it handles cached COW mapping invalidation is unnecessarily
> special and so doesn't work.

I've hit this problem and the same assert failure when I was testing
another version of fix, the usage of xfs_map_cow() cut the imap
validation work flow :)

(The basic idea of another version of my fix is comparing timestamp
saved in struct xfs_inode at free eofblocks time with timestamp saved in
struct xfs_writepage_ctx, if timestamp in xfs_inode is newer than that
in xfs_writepage_ctx, invalidate the cached imap and map it again. But
that still only focused on free of eofblocks, ignored other paths that
might change the map.)

> Can you test it and see if it fixes the
> problem?

Yes, patch works for me, lustre-racer test survived 50 iterations,
previously it would fail within 5 runs.

Thanks!

Eryu

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] xfs: skip free eofblocks if inode is under written back
  2017-08-30  9:12   ` Eryu Guan
@ 2017-08-30 14:26     ` Brian Foster
  0 siblings, 0 replies; 6+ messages in thread
From: Brian Foster @ 2017-08-30 14:26 UTC (permalink / raw)
  To: Eryu Guan; +Cc: Dave Chinner, linux-xfs

On Wed, Aug 30, 2017 at 05:12:06PM +0800, Eryu Guan wrote:
> On Wed, Aug 30, 2017 at 05:29:50PM +1000, Dave Chinner wrote:
> > On Wed, Aug 30, 2017 at 12:26:28PM +0800, Eryu Guan wrote:
> > > xfs_vm_writepages() saves a cached imap in a writeback context
> > > structure and passes it to xfs_do_writepage() for every page in this
> > > writeback, so xfs_writepage_map() (called by xfs_do_writepage())
> > > doesn't have to lookup & recalculate the mapping for every buffer it
> > > writes if the cached imap is considered as valid.
> > > 
> > > But there's a chance that the cached imap becomes stale (doesn't
> > > match the real extent entry) due to the removing of speculative
> > > preallocated blocks, in this case xfs_imap_valid() still reports
> > > imap as valid, because the buffer it's writing is still within the
> > > cached imap range. This could result in fs corruption and data
> > > corruption (data written to wrong block).
> > > 
> > > For example, the following sequences make it possible (assuming 4k
> > > block size XFS):
> > > 
> > >   1. delalloc write 1072 blocks, with speculative preallocation,
> > >   2032 blocks got allocated
> > > 
> > >   2. started a WB_SYNC_NONE writeback (so it wrote all
> > >   PAGECACHE_TAG_DIRTY pages, additional dirty page from step 4 can
> > >   be picked up, this could be a background writeback, or a bare
> > >   SYNC_FILE_RANGE_WRITE sync_file_range() call) on this inode,
> > >   filled in br_startblock, converted delayed allocation to real
> > >   allocation, but br_blockcount was unchanged, still 2032, and this
> > >   imap got cached in writeback context structure for writing
> > >   subsequent pages
> > 

Thanks Eryu, nice work indeed. :) FWIW, I have patch (appended) that
helps detect this problem consistently in DEBUG mode by essentially
revalidating the ioend at submit time. I suppose it could be updated to
validate the bio target as well, but care to validate that independently
and possibly include it in your next post?

> > Excellent work in tracking this down, Eryu! This has been a
> > longstanding issue - we've been caching the writeback map for
> > multi-page writeback since, well, longer than I've been working on
> > XFS....
> 
> If I read the git history correctly, previously the imap was only cached
> for buffer heads within the same page, commit fbcc02561359 ("xfs:
> Introduce writeback context for writepages") made the cached imap live
> longer, across the whole xfs_vm_writepages() call, which made the window
> wider.
>

The difference here is that the previous code used xfs_cluster_write()
which implicitly limited the scope of the cached mapping to within EOF.
By implicitly, I mean that we only flushed up to EOF at the time the
mapping was created. I think this aspect of the problem is sort of a
regression as of the writeback rework and could be addressed by simply
trimming the cached mapping to EOF (I had previously run a test that
allowed the lustre-racer test to survive a weekend without any issues
with such a tweak). Note that the previous code also limited the
lifetime of the cached mapping to a single writeback "cycle," fwiw.

Given the nature of the above, could we try to independently verify
whether there is also an issue that might go further back than that?
Even if these are both fixed by the same patch, it would be good to have
independent analysis of the separate issues in the commit logs. For
example, what happens if writeback maps a largish delalloc extent, then
stalls (i.e., introduce an artificial delay), and then some other task
comes in and truncates off the last block of the just allocated extent
and then does another buffered write to place a delalloc block at that
offset? Note that hole punch probably won't work here because it waits
for writeback. Also note that this may be further complicated by the
fact that some writeback cases tag the pages to writeback up front.

> But I can't reproduce the bug until commit bb18782aa47d ("xfs: build bios
> directly in xfs_add_to_ioend"), I didn't dig into it deeply, perhaps it
> made the window even wider?
> 

Perhaps doing bio allocation at page processing time rather than submit
time is enough to stretch the window the mapping has to remain valid..?
I believe we've always done ioend allocation at this time, but those
look like they come from a memory pool and so may return quickly.

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 6bf120b..71b55b0 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -457,6 +457,47 @@ static inline int xfs_bio_add_buffer(struct bio *bio, struct buffer_head *bh)
 	return bio_add_page(bio, bh->b_page, bh->b_size, bh_offset(bh));
 }
 
+#ifdef DEBUG
+static int
+xfs_validate_ioend(
+	struct xfs_ioend	*ioend)
+{
+	struct xfs_inode	*ip = XFS_I(ioend->io_inode);
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_DATA_FORK);
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_fileoff_t		offset = XFS_B_TO_FSB(mp, ioend->io_offset);
+	xfs_filblks_t		count = XFS_B_TO_FSB(mp, ioend->io_size);
+	xfs_extnum_t		idx;
+	struct xfs_bmbt_irec	got;
+	int			error;
+
+	xfs_ilock(ip, XFS_ILOCK_SHARED);
+
+	error = -EFSCORRUPTED;
+	while (count) {
+		if (!xfs_iext_lookup_extent(ip, ifp, offset, &idx, &got))
+			goto out;
+		if (isnullstartblock(got.br_startblock))
+			goto out;
+		offset += got.br_blockcount;
+		count -= min_t(xfs_filblks_t, count, got.br_blockcount);
+	}
+
+	error = 0;
+out:
+	xfs_iunlock(ip, XFS_ILOCK_SHARED);
+	if (error) {
+		xfs_alert(mp,
+		"I/O submission to invalid extent (ino 0x%llx offset 0x%llx).",
+			  ip->i_ino, ioend->io_offset);
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+	}
+	return error;
+}
+#else
+#define xfs_validate_ioend(ioend)	(0)
+#endif
+
 /*
  * Submit the bio for an ioend. We are passed an ioend with a bio attached to
  * it, and we submit that bio. The ioend may be used for multiple bio
@@ -506,6 +547,10 @@ xfs_submit_ioend(
 		return status;
 	}
 
+	status = xfs_validate_ioend(ioend);
+	if (status)
+		return status;
+
 	ioend->io_bio->bi_write_hint = ioend->io_inode->i_write_hint;
 	submit_bio(ioend->io_bio);
 	return 0;

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-08-30 14:26 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-30  4:26 [PATCH] xfs: skip free eofblocks if inode is under written back Eryu Guan
2017-08-30  7:29 ` Dave Chinner
2017-08-30  7:51   ` Christoph Hellwig
2017-08-30  8:10     ` Dave Chinner
2017-08-30  9:12   ` Eryu Guan
2017-08-30 14:26     ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.