linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] xfs: various fixes for 4.20
@ 2018-11-19 21:04 Dave Chinner
  2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
                   ` (7 more replies)
  0 siblings, 8 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This series is an aggregation of fixes I've been building up over
the past few weeks. They are a result of bug reports, problems
analysing extremely large metadump image files, trying to add 64k
block size support to x86-64, and adding dedup/clone/copy_file_range
support to fsx.

I've posted a couple of these before, these versions contain the
updates made as a result of review comments that were made.

THere are no new regressions from these patches in fstests, and they
fix some of the new fsx failures that we are seeing with it's new
functionality.

Cheers,

Dave.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/7] xfs: zero length symlinks are not valid
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:12   ` Christoph Hellwig
  2018-11-20 13:44   ` Brian Foster
  2018-11-19 21:04 ` [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Dave Chinner
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A log recovery failure has been reproduced where a symlink inode has
a zero length in extent form. It was caused by a shutdown during a
combined fstress+fsmark workload.

The underlying problem is the issue in xfs_inactive_symlink(): the
inode is unlocked between the symlink inactivation/truncation and
the inode being freed. This opens a window for the inode to be
written to disk before it xfs_ifree() removes it from the unlinked
list, marks it free in the inobt and zeros the mode.

For shortform inodes, the fix is simple. xfs_ifree() clears the data
fork state, so there's no need to do it in xfs_inactive_symlink().
This means the shortform fork verifier will not see a zero length
data fork as it mirrors the inode size through to xfs_ifree()), and
hence if the inode gets written back and the fork verifiers are run
they will still see a fork that matches the on-disk inode size.

For extent form (remote) symlinks, it is a little more tricky. Here
we explicitly set the inode size to zero, so the above race can lead
to zero length symlinks on disk. Because the inode is unlinked at
this point (i.e. on the unlinked list) and unreferenced, it can
never be seen again by a user. Hence when we set the inode size to
zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
inodes to be of zero length, and so this avoids all the problems of
zero length symlinks ever hitting the disk. It also avoids the
problem of needing to handle zero length symlink inodes in log
recovery to replay the extent free intents and the remaining
deferops to free the extents the symlink used.

Also add a couple of asserts to warn us if zero length symlinks end
up in either the symlink create or inactivation paths.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_symlink_remote.c | 14 +++++++++----
 fs/xfs/xfs_symlink.c               | 33 +++++++++++++++---------------
 2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 95374ab2dee7..77d80106f989 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -199,7 +199,10 @@ xfs_symlink_local_to_remote(
 					ifp->if_bytes - 1);
 }
 
-/* Verify the consistency of an inline symlink. */
+/*
+ * Verify the in-memory consistency of an inline symlink data fork. This
+ * does not do on-disk format checks.
+ */
 xfs_failaddr_t
 xfs_symlink_shortform_verify(
 	struct xfs_inode	*ip)
@@ -215,9 +218,12 @@ xfs_symlink_shortform_verify(
 	size = ifp->if_bytes;
 	endp = sfp + size;
 
-	/* Zero length symlinks can exist while we're deleting a remote one. */
-	if (size == 0)
-		return NULL;
+	/*
+	 * Zero length symlinks should never occur in memory as they are
+	 * never alllowed to exist on disk.
+	 */
+	if (!size)
+		return __this_address;
 
 	/* No negative sizes or overly long symlink targets. */
 	if (size < 0 || size > XFS_SYMLINK_MAXLEN)
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index a3e98c64b6e3..b2c1177c717f 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -192,6 +192,7 @@ xfs_symlink(
 	pathlen = strlen(target_path);
 	if (pathlen >= XFS_SYMLINK_MAXLEN)      /* total string too long */
 		return -ENAMETOOLONG;
+	ASSERT(pathlen > 0);
 
 	udqp = gdqp = NULL;
 	prid = xfs_get_initial_prid(dp);
@@ -378,6 +379,12 @@ xfs_symlink(
 
 /*
  * Free a symlink that has blocks associated with it.
+ *
+ * Note: zero length symlinks are not allowed to exist. When we set the size to
+ * zero, also change it to a regular file so that it does not get written to
+ * disk as a zero length symlink. The inode is on the unlinked list already, so
+ * userspace cannot find this inode anymore, so this change is not user visible
+ * but allows us to catch corrupt zero-length symlinks in the verifiers.
  */
 STATIC int
 xfs_inactive_symlink_rmt(
@@ -412,13 +419,14 @@ xfs_inactive_symlink_rmt(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
-	 * Lock the inode, fix the size, and join it to the transaction.
-	 * Hold it so in the normal path, we still have it locked for
-	 * the second transaction.  In the error paths we need it
+	 * Lock the inode, fix the size, turn it into a regular file and join it
+	 * to the transaction.  Hold it so in the normal path, we still have it
+	 * locked for the second transaction.  In the error paths we need it
 	 * held so the cancel won't rele it, see below.
 	 */
 	size = (int)ip->i_d.di_size;
 	ip->i_d.di_size = 0;
+	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	/*
 	 * Find the block(s) so we can inval and unmap them.
@@ -494,17 +502,10 @@ xfs_inactive_symlink(
 		return -EIO;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
-	/*
-	 * Zero length symlinks _can_ exist.
-	 */
 	pathlen = (int)ip->i_d.di_size;
-	if (!pathlen) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		return 0;
-	}
+	ASSERT(pathlen);
 
-	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
+	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
 		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
 			 __func__, (unsigned long long)ip->i_ino, pathlen);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -512,12 +513,12 @@ xfs_inactive_symlink(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Inline fork state gets removed by xfs_difree() so we have nothing to
+	 * do here in that case.
+	 */
 	if (ip->i_df.if_flags & XFS_IFINLINE) {
-		if (ip->i_df.if_bytes > 0) 
-			xfs_idata_realloc(ip, -(ip->i_df.if_bytes),
-					  XFS_DATA_FORK);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		ASSERT(ip->i_df.if_bytes == 0);
 		return 0;
 	}
 
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/7] xfs: uncached buffer tracing needs to print bno
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
  2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:12   ` Christoph Hellwig
  2018-11-20 22:46   ` Darrick J. Wong
  2018-11-19 21:04 ` [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Dave Chinner
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Useless:

xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...

Useful:


xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_trace.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3043e5ed6495..8a6532aae779 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -280,7 +280,10 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
 	),
 	TP_fast_assign(
 		__entry->dev = bp->b_target->bt_dev;
-		__entry->bno = bp->b_bn;
+		if (bp->b_bn == XFS_BUF_DADDR_NULL)
+			__entry->bno = bp->b_maps[0].bm_bn;
+		else
+			__entry->bno = bp->b_bn;
 		__entry->nblks = bp->b_length;
 		__entry->hold = atomic_read(&bp->b_hold);
 		__entry->pincount = atomic_read(&bp->b_pin_count);
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
  2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
  2018-11-19 21:04 ` [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:13   ` Christoph Hellwig
  2018-11-20 22:48   ` Darrick J. Wong
  2018-11-19 21:04 ` [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When retrying a failed inode or dquot buffer,
xfs_buf_resubmit_failed_buffers() clears all the failed flags from
the inde/dquot log items. In doing so, it also drops all the
reference counts on the buffer that the failed log items hold. This
means it can drop all the active references on the buffer and hence
free the buffer before it queues it for write again.

Putting the buffer on the delwri queue takes a reference to the
buffer (so that it hangs around until it has been written and
completed), but this goes bang if the buffer has already been freed.

Hence we need to add the buffer to the delwri queue before we remove
the failed flags from the log items attached to the buffer to ensure
it always remains referenced during the resubmit process.

Reported-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_buf_item.c | 28 +++++++++++++++++++++-------
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
index 12d8455bfbb2..010db5f8fb00 100644
--- a/fs/xfs/xfs_buf_item.c
+++ b/fs/xfs/xfs_buf_item.c
@@ -1233,9 +1233,23 @@ xfs_buf_iodone(
 }
 
 /*
- * Requeue a failed buffer for writeback
+ * Requeue a failed buffer for writeback.
  *
- * Return true if the buffer has been re-queued properly, false otherwise
+ * We clear the log item failed state here as well, but we have to be careful
+ * about reference counts because the only active reference counts on the buffer
+ * may be the failed log items. Hence if we clear the log item failed state
+ * before queuing the buffer for IO we can release all active references to
+ * the buffer and free it, leading to use after free problems in
+ * xfs_buf_delwri_queue. It makes no difference to the buffer or log items which
+ * order we process them in - the buffer is locked, and we own the buffer list
+ * so nothing on them is going to change while we are performing this action.
+ *
+ * Hence we can safely queue the buffer for IO before we clear the failed log
+ * item state, therefore  always having an active reference to the buffer and
+ * avoiding the transient zero-reference state that leads to use-after-free.
+ *
+ * Return true if the buffer was added to the buffer list, false if it was
+ * already on the buffer list.
  */
 bool
 xfs_buf_resubmit_failed_buffers(
@@ -1243,16 +1257,16 @@ xfs_buf_resubmit_failed_buffers(
 	struct list_head	*buffer_list)
 {
 	struct xfs_log_item	*lip;
+	bool			ret;
+
+	ret = xfs_buf_delwri_queue(bp, buffer_list);
 
 	/*
-	 * Clear XFS_LI_FAILED flag from all items before resubmit
-	 *
-	 * XFS_LI_FAILED set/clear is protected by ail_lock, caller  this
+	 * XFS_LI_FAILED set/clear is protected by ail_lock, caller of this
 	 * function already have it acquired
 	 */
 	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
 		xfs_clear_li_failed(lip);
 
-	/* Add this buffer back to the delayed write list */
-	return xfs_buf_delwri_queue(bp, buffer_list);
+	return ret;
 }
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
                   ` (2 preceding siblings ...)
  2018-11-19 21:04 ` [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:14   ` Christoph Hellwig
  2018-11-20 22:49   ` Darrick J. Wong
  2018-11-19 21:04 ` [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The last AG may be very small comapred to all other AGs, and hence
AG reservations based on the superblock AG size may actually consume
more space than the AG actually has. This results on assert failures
like:

XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
[   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
[   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
[   48.934939]  xfs_mountfs+0x5b3/0x920
[   48.935804]  xfs_fs_fill_super+0x462/0x640
[   48.936784]  ? xfs_test_remount_options+0x60/0x60
[   48.937908]  mount_bdev+0x178/0x1b0
[   48.938751]  mount_fs+0x36/0x170
[   48.939533]  vfs_kern_mount.part.43+0x54/0x130
[   48.940596]  do_mount+0x20e/0xcb0
[   48.941396]  ? memdup_user+0x3e/0x70
[   48.942249]  ksys_mount+0xba/0xd0
[   48.943046]  __x64_sys_mount+0x21/0x30
[   48.943953]  do_syscall_64+0x54/0x170
[   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Hence we need to ensure the finobt per-ag space reservations take
into account the size of the last AG rather than treat it like all
the other full size AGs.

Note that both refcountbt and rmapbt already take the size of the AG
into account via reading the AGF length directly.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc_btree.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 86c50208a143..7fbf8af0b159 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -538,15 +538,18 @@ xfs_inobt_rec_check_count(
 
 static xfs_extlen_t
 xfs_inobt_max_size(
-	struct xfs_mount	*mp)
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
 {
+	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
+
 	/* Bail out if we're uninitialized, which can happen in mkfs. */
 	if (mp->m_inobt_mxr[0] == 0)
 		return 0;
 
 	return xfs_btree_calc_size(mp->m_inobt_mnr,
-		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
-				XFS_INODES_PER_CHUNK);
+				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
+					XFS_INODES_PER_CHUNK);
 }
 
 static int
@@ -594,7 +597,7 @@ xfs_finobt_calc_reserves(
 	if (error)
 		return error;
 
-	*ask += xfs_inobt_max_size(mp);
+	*ask += xfs_inobt_max_size(mp, agno);
 	*used += tree_len;
 	return 0;
 }
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
                   ` (3 preceding siblings ...)
  2018-11-19 21:04 ` [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:18   ` Christoph Hellwig
  2018-11-20 22:53   ` Darrick J. Wong
  2018-11-19 21:04 ` [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Dave Chinner
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The extent shifting code uses a flush and invalidate mechainsm prior
to shifting extents around. This is similar to what
xfs_free_file_space() does, but it doesn't take into account things
like page cache vs block size differences, and it will fail if there
is a page that it currently busy.

xfs_flush_unmap_range() handles all of these cases, so just convert
xfs_prepare_shift() to us that mechanism rather than having it's own
special sauce.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 5d263dfdb3bc..167ff4297e5c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1195,13 +1195,7 @@ xfs_prepare_shift(
 	 * Writeback and invalidate cache for the remainder of the file as we're
 	 * about to shift down every extent from offset to EOF.
 	 */
-	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset, -1);
-	if (error)
-		return error;
-	error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
-					offset >> PAGE_SHIFT, -1);
-	if (error)
-		return error;
+	error = xfs_flush_unmap_range(ip, offset, XFS_ISIZE(ip));
 
 	/*
 	 * Clean out anything hanging around in the cow fork now that
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
                   ` (4 preceding siblings ...)
  2018-11-19 21:04 ` [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:20   ` Christoph Hellwig
  2018-11-21 18:09   ` Darrick J. Wong
  2018-11-19 21:04 ` [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Dave Chinner
  2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Just tried to punch a 40T sparse file when enospc was triggered due
to extent size hints consuming more space than expected. It failed
with:

# sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
fallocate: No space left on device
#

because the writeback error of ENOSPC was being reported. We're
going to free that space, so we don't care if there was a ENOSPC
writeback error. So ignore ENOSPC errors and punch anyway.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 167ff4297e5c..cc7a0d47c529 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1060,8 +1060,13 @@ xfs_flush_unmap_range(
 	start = round_down(offset, rounding);
 	end = round_up(offset + len, rounding) - 1;
 
+	/*
+	 * We're going to trash the data in this range, so if writeback reports
+	 * an enospc error, don't let it stop us from /freeing the space/ in the
+	 * range to alleviate the ENOSPC condition.
+	 */
 	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
-	if (error)
+	if (error && error != -ENOSPC)
 		return error;
 	truncate_pagecache_range(inode, start, end);
 	return 0;
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
                   ` (5 preceding siblings ...)
  2018-11-19 21:04 ` [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Dave Chinner
@ 2018-11-19 21:04 ` Dave Chinner
  2018-11-20  8:32   ` Christoph Hellwig
  2018-11-20 22:56   ` Darrick J. Wong
  2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
  7 siblings, 2 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-19 21:04 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

On a sub-page block size filesystem, fsx is failing with a data
corruption after a series of operations involving copying a file
with the destination offset beyond EOF of the destination of the file:

8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00

The second copy here is beyond EOF, and it is to sub-page (4k) but
block aligned (1k) offset. The clone runs the EOF zeroing, landing
in a pre-existing post-eof delalloc extent. This zeroes the post-eof
extents in the page cache just fine, dirtying the pages correctly.

The problem is that xfs_reflink_remap_prep() now truncates the page
cache over the range that it is copying it to, and rounds that down
to cover the entire start page. This removes the dirty page over the
delalloc extent from the page cache without having written it back.
Hence later, when the page cache is flushed, the page at offset
0x6f000 has not been written back and hence exposes stale data,
which fsx trips over less than 10 operations later.

Fix this by changing xfs_reflink_remap_prep() to use
xfs_flush_unmap_range().

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_bmap_util.c |  2 +-
 fs/xfs/xfs_bmap_util.h |  3 +++
 fs/xfs/xfs_reflink.c   | 17 +++++++++++++----
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index cc7a0d47c529..3e66cf0520a9 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1042,7 +1042,7 @@ xfs_unmap_extent(
 	goto out_unlock;
 }
 
-static int
+int
 xfs_flush_unmap_range(
 	struct xfs_inode	*ip,
 	xfs_off_t		offset,
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 87363d136bb6..7a78229cf1a7 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -80,4 +80,7 @@ int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 			  int whichfork, xfs_extnum_t *nextents,
 			  xfs_filblks_t *count);
 
+int	xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset,
+			      xfs_off_t len);
+
 #endif	/* __XFS_BMAP_UTIL_H__ */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ecdb086bc23e..a41590b7c229 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1351,10 +1351,19 @@ xfs_reflink_remap_prep(
 	if (ret)
 		goto out_unlock;
 
-	/* Zap any page cache for the destination file's range. */
-	truncate_inode_pages_range(&inode_out->i_data,
-			round_down(pos_out, PAGE_SIZE),
-			round_up(pos_out + *len, PAGE_SIZE) - 1);
+	/*
+	 * If pos_out > EOF, we make have dirtied blocks between EOF and
+	 * pos_out. In that case, we need to extend the flush and unmap to cover
+	 * from EOF to the end of the copy length.
+	 */
+	if (pos_out > XFS_ISIZE(dest)) {
+		loff_t	flen = *len + (pos_out - XFS_ISIZE(dest));
+		ret = xfs_flush_unmap_range(dest, XFS_ISIZE(dest), flen);
+	} else {
+		ret = xfs_flush_unmap_range(dest, pos_out, *len);
+	}
+	if (ret)
+		goto out_unlock;
 
 	return 1;
 out_unlock:
-- 
2.19.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
                   ` (6 preceding siblings ...)
  2018-11-19 21:04 ` [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Dave Chinner
@ 2018-11-20  6:36 ` Dave Chinner
  2018-11-20 13:45   ` Brian Foster
                     ` (2 more replies)
  7 siblings, 3 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-20  6:36 UTC (permalink / raw)
  To: linux-xfs


From: Dave Chinner <dchinner@redhat.com>

Long saga. There have been days spent following this through dead end
after dead end in multi-GB event traces. This morning, after writing
a trace-cmd wrapper that enabled me to be more selective about XFS
trace points, I discovered that I could get just enough essential
tracepoints enabled that there was a 50:50 chance the fsx config
would fail at ~115k ops. If it didn't fail at op 115547, I stopped
fsx at op 115548 anyway.

That gave me two traces - one where the problem manifested, and one
where it didn't. After refining the traces to have the necessary
information, I found that in the failing case there was a real
extent in the COW fork compared to an unwritten extent in the
working case.

Walking back through the two traces to the point where the CWO fork
extents actually diverged, I found that the bad case had an extra
unwritten extent in it. This is likely because the bug it led me to
had triggered multiple times in those 115k ops, leaving stray
COW extents around. What I saw was a COW delalloc conversion to an
unwritten extent (as they should always be through
xfs_iomap_write_allocate()) resulted in a /written extent/:

xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex

Basically, Cow fork before:

	0 1            32          52
	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
	   PREV		RIGHT

COW delalloc conversion allocates:

	  1	       32
	  +uuuuuuuuuuuu+
	  NEW

And the result according to the xfs_bmap_post_update trace was:

	0 1            32          52
	+H+wwwwwwwwwwwwwwwwwwwwwwww+
	   PREV

Which is clearly wrong - it should be a merged unwritten extent,
not an unwritten extent.

That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
the bug.

It takes the old delalloc extent (PREV) and adds the length of the
RIGHT extent to it, takes the start block from NEW, removes the
RIGHT extent and then updates PREV with the new extent.

What it fails to do is update PREV.br_state. For delalloc, this is
always XFS_EXT_NORM, while in this case we are converting the
delayed allocation to unwritten, so it needs to be updated to
XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
the resultant extent is always written.

And that's the bug I've been chasing for a week - a bmap btree bug,
not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
with the recent in core extent tree scalability enhancements.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2d78fd53e822..39eaa2b86060 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1694,10 +1694,13 @@ xfs_bmap_add_extent_delay_real(
 	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
 		/*
 		 * Filling in all of a previously delayed allocation extent.
-		 * The right neighbor is contiguous, the left is not.
+		 * The right neighbor is contiguous, the left is not. Take care
+		 * with delay -> unwritten extent allocation here because the
+		 * delalloc record we are overwriting is always written.
 		 */
 		PREV.br_startblock = new->br_startblock;
 		PREV.br_blockcount += RIGHT.br_blockcount;
+		PREV.br_state = new->br_state;
 
 		xfs_iext_next(ifp, &bma->icur);
 		xfs_iext_remove(bma->ip, &bma->icur, state);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] xfs: zero length symlinks are not valid
  2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
@ 2018-11-20  8:12   ` Christoph Hellwig
  2018-11-20 13:44   ` Brian Foster
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

>  	/*
> -	 * Lock the inode, fix the size, and join it to the transaction.
> -	 * Hold it so in the normal path, we still have it locked for
> -	 * the second transaction.  In the error paths we need it
> +	 * Lock the inode, fix the size, turn it into a regular file and join it
> +	 * to the transaction.  Hold it so in the normal path, we still have it
> +	 * locked for the second transaction.  In the error paths we need it
>  	 * held so the cancel won't rele it, see below.
>  	 */

Most of this comment seems rather pointless as it explains the how and
not the why.  Additionally it is out of data (we don't hold inodes for
transactions anymore), and partially below the code it claims to
document.  Given that you explain the why pretty well in the top of the
function document I'd drop this entirely.

> +	/*
> +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> +	 * do here in that case.
> +	 */
>  	if (ip->i_df.if_flags & XFS_IFINLINE) {
> -		if (ip->i_df.if_bytes > 0) 
> -			xfs_idata_realloc(ip, -(ip->i_df.if_bytes),
> -					  XFS_DATA_FORK);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		ASSERT(ip->i_df.if_bytes == 0);
>  		return 0;
>  	}

The ilock here is rather pointless now that we do nothing underneath
it but checking two different integral values.

Something like this additional cleanup might be nice now, with
the function split being rather pointless now:

diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index b2c1177c717f..8e2e12e34098 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -386,29 +386,41 @@ xfs_symlink(
  * userspace cannot find this inode anymore, so this change is not user visible
  * but allows us to catch corrupt zero-length symlinks in the verifiers.
  */
-STATIC int
-xfs_inactive_symlink_rmt(
-	struct xfs_inode *ip)
+
+int
+xfs_inactive_symlink(
+	struct xfs_inode	*ip)
 {
-	xfs_buf_t	*bp;
-	int		done;
-	int		error;
-	int		i;
-	xfs_mount_t	*mp;
-	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
-	int		nmaps;
-	int		size;
-	xfs_trans_t	*tp;
-
-	mp = ip->i_mount;
-	ASSERT(ip->i_df.if_flags & XFS_IFEXTENTS);
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_buf		*bp;
+	int			done;
+	int			error;
+	int			i;
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	int			nmaps;
+	int			size;
+	struct xfs_trans	*tp;
+
+	trace_xfs_inactive_symlink(ip);
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	ASSERT(ip->i_d.di_size > 0 && ip->i_d.di_size <= XFS_SYMLINK_MAXLEN);
+
 	/*
-	 * We're freeing a symlink that has some
-	 * blocks allocated to it.  Free the
-	 * blocks here.  We know that we've got
-	 * either 1 or 2 extents and that we can
-	 * free them all in one bunmapi call.
+	 * Inline fork state gets removed by xfs_difree() so we have nothing to
+	 * do here in that case.
 	 */
+	if (ip->i_df.if_flags & XFS_IFINLINE)
+		return 0;
+
+	/*
+	 * We're freeing a symlink that has some blocks allocated to it.  Free
+	 * the blocks here.  We know that we've got either 1 or 2 extents and
+	 * that we can free them all in one bunmapi call.
+	 */
+	ASSERT(ip->i_df.if_flags & XFS_IFEXTENTS);
 	ASSERT(ip->i_d.di_nextents > 0 && ip->i_d.di_nextents <= 2);
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
@@ -419,10 +431,7 @@ xfs_inactive_symlink_rmt(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
-	 * Lock the inode, fix the size, turn it into a regular file and join it
-	 * to the transaction.  Hold it so in the normal path, we still have it
-	 * locked for the second transaction.  In the error paths we need it
-	 * held so the cancel won't rele it, see below.
+	 * Fix the size, turn it into a regular file in the same transaction.
 	 */
 	size = (int)ip->i_d.di_size;
 	ip->i_d.di_size = 0;
@@ -485,45 +494,3 @@ xfs_inactive_symlink_rmt(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-
-/*
- * xfs_inactive_symlink - free a symlink
- */
-int
-xfs_inactive_symlink(
-	struct xfs_inode	*ip)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	int			pathlen;
-
-	trace_xfs_inactive_symlink(ip);
-
-	if (XFS_FORCED_SHUTDOWN(mp))
-		return -EIO;
-
-	xfs_ilock(ip, XFS_ILOCK_EXCL);
-	pathlen = (int)ip->i_d.di_size;
-	ASSERT(pathlen);
-
-	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
-		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
-			 __func__, (unsigned long long)ip->i_ino, pathlen);
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		ASSERT(0);
-		return -EFSCORRUPTED;
-	}
-
-	/*
-	 * Inline fork state gets removed by xfs_difree() so we have nothing to
-	 * do here in that case.
-	 */
-	if (ip->i_df.if_flags & XFS_IFINLINE) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		return 0;
-	}
-
-	xfs_iunlock(ip, XFS_ILOCK_EXCL);
-
-	/* remove the remote symlink */
-	return xfs_inactive_symlink_rmt(ip);
-}

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] xfs: uncached buffer tracing needs to print bno
  2018-11-19 21:04 ` [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Dave Chinner
@ 2018-11-20  8:12   ` Christoph Hellwig
  2018-11-20 22:46   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
  2018-11-19 21:04 ` [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Dave Chinner
@ 2018-11-20  8:13   ` Christoph Hellwig
  2018-11-20 22:48   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks fine,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-19 21:04 ` [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
@ 2018-11-20  8:14   ` Christoph Hellwig
  2018-11-20 22:49   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:14 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache
  2018-11-19 21:04 ` [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
@ 2018-11-20  8:18   ` Christoph Hellwig
  2018-11-20 22:53   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

Looks good,

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-19 21:04 ` [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Dave Chinner
@ 2018-11-20  8:20   ` Christoph Hellwig
  2018-11-20  9:50     ` Dave Chinner
  2018-11-21 18:09   ` Darrick J. Wong
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:20 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Just tried to punch a 40T sparse file when enospc was triggered due
> to extent size hints consuming more space than expected. It failed
> with:
> 
> # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> fallocate: No space left on device
> #
> 
> because the writeback error of ENOSPC was being reported. We're
> going to free that space, so we don't care if there was a ENOSPC
> writeback error. So ignore ENOSPC errors and punch anyway.

How do even get -ENOSPC back from writeback?  It seems like we have
a much worse root cause lingering here.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep
  2018-11-19 21:04 ` [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Dave Chinner
@ 2018-11-20  8:32   ` Christoph Hellwig
  2018-11-20 22:56   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20  8:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

> -	truncate_inode_pages_range(&inode_out->i_data,
> -			round_down(pos_out, PAGE_SIZE),
> -			round_up(pos_out + *len, PAGE_SIZE) - 1);
> +	/*
> +	 * If pos_out > EOF, we make have dirtied blocks between EOF and

/make/may/ ?

Otherwise looks fine:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-20  8:20   ` Christoph Hellwig
@ 2018-11-20  9:50     ` Dave Chinner
  2018-11-20 16:28       ` Christoph Hellwig
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2018-11-20  9:50 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 12:20:40AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Just tried to punch a 40T sparse file when enospc was triggered due
> > to extent size hints consuming more space than expected. It failed
> > with:
> > 
> > # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> > fallocate: No space left on device
> > #
> > 
> > because the writeback error of ENOSPC was being reported. We're
> > going to free that space, so we don't care if there was a ENOSPC
> > writeback error. So ignore ENOSPC errors and punch anyway.
> 
> How do even get -ENOSPC back from writeback?  It seems like we have
> a much worse root cause lingering here.

It hammered ENOSPC pretty hard - I think it had consumed the entire
reserve pool and that can causes allocation transaction reservations
in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
reservation for the data extents being allocated.

This doesn't happen very often in the real world, but if it does we
need operations like punch to work to be able to free space and
get us out of that hole...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] xfs: zero length symlinks are not valid
  2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
  2018-11-20  8:12   ` Christoph Hellwig
@ 2018-11-20 13:44   ` Brian Foster
  2018-11-20 21:19     ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Brian Foster @ 2018-11-20 13:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:53AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A log recovery failure has been reproduced where a symlink inode has
> a zero length in extent form. It was caused by a shutdown during a
> combined fstress+fsmark workload.
> 
> The underlying problem is the issue in xfs_inactive_symlink(): the
> inode is unlocked between the symlink inactivation/truncation and
> the inode being freed. This opens a window for the inode to be
> written to disk before it xfs_ifree() removes it from the unlinked
> list, marks it free in the inobt and zeros the mode.
> 
> For shortform inodes, the fix is simple. xfs_ifree() clears the data
> fork state, so there's no need to do it in xfs_inactive_symlink().
> This means the shortform fork verifier will not see a zero length
> data fork as it mirrors the inode size through to xfs_ifree()), and
> hence if the inode gets written back and the fork verifiers are run
> they will still see a fork that matches the on-disk inode size.
> 
> For extent form (remote) symlinks, it is a little more tricky. Here
> we explicitly set the inode size to zero, so the above race can lead
> to zero length symlinks on disk. Because the inode is unlinked at
> this point (i.e. on the unlinked list) and unreferenced, it can
> never be seen again by a user. Hence when we set the inode size to
> zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
> inodes to be of zero length, and so this avoids all the problems of
> zero length symlinks ever hitting the disk. It also avoids the
> problem of needing to handle zero length symlink inodes in log
> recovery to replay the extent free intents and the remaining
> deferops to free the extents the symlink used.
> 
> Also add a couple of asserts to warn us if zero length symlinks end
> up in either the symlink create or inactivation paths.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Hmm, I saw this and thought this was something we had already fixed.
Looking back, I see this was actually posted[1] months ago and there was
a fairly nuanced discussion. On a quick skim, that appears to have
concluded with the patch being mostly sane, but requiring a couple minor
tweaks and commit log updates. This patch looks exactly the same to me,
however. Hm?

Brian

[1] https://marc.info/?l=linux-xfs&m=152930146405930&w=2

>  fs/xfs/libxfs/xfs_symlink_remote.c | 14 +++++++++----
>  fs/xfs/xfs_symlink.c               | 33 +++++++++++++++---------------
>  2 files changed, 27 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
> index 95374ab2dee7..77d80106f989 100644
> --- a/fs/xfs/libxfs/xfs_symlink_remote.c
> +++ b/fs/xfs/libxfs/xfs_symlink_remote.c
> @@ -199,7 +199,10 @@ xfs_symlink_local_to_remote(
>  					ifp->if_bytes - 1);
>  }
>  
> -/* Verify the consistency of an inline symlink. */
> +/*
> + * Verify the in-memory consistency of an inline symlink data fork. This
> + * does not do on-disk format checks.
> + */
>  xfs_failaddr_t
>  xfs_symlink_shortform_verify(
>  	struct xfs_inode	*ip)
> @@ -215,9 +218,12 @@ xfs_symlink_shortform_verify(
>  	size = ifp->if_bytes;
>  	endp = sfp + size;
>  
> -	/* Zero length symlinks can exist while we're deleting a remote one. */
> -	if (size == 0)
> -		return NULL;
> +	/*
> +	 * Zero length symlinks should never occur in memory as they are
> +	 * never alllowed to exist on disk.
> +	 */
> +	if (!size)
> +		return __this_address;
>  
>  	/* No negative sizes or overly long symlink targets. */
>  	if (size < 0 || size > XFS_SYMLINK_MAXLEN)
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index a3e98c64b6e3..b2c1177c717f 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -192,6 +192,7 @@ xfs_symlink(
>  	pathlen = strlen(target_path);
>  	if (pathlen >= XFS_SYMLINK_MAXLEN)      /* total string too long */
>  		return -ENAMETOOLONG;
> +	ASSERT(pathlen > 0);
>  
>  	udqp = gdqp = NULL;
>  	prid = xfs_get_initial_prid(dp);
> @@ -378,6 +379,12 @@ xfs_symlink(
>  
>  /*
>   * Free a symlink that has blocks associated with it.
> + *
> + * Note: zero length symlinks are not allowed to exist. When we set the size to
> + * zero, also change it to a regular file so that it does not get written to
> + * disk as a zero length symlink. The inode is on the unlinked list already, so
> + * userspace cannot find this inode anymore, so this change is not user visible
> + * but allows us to catch corrupt zero-length symlinks in the verifiers.
>   */
>  STATIC int
>  xfs_inactive_symlink_rmt(
> @@ -412,13 +419,14 @@ xfs_inactive_symlink_rmt(
>  	xfs_trans_ijoin(tp, ip, 0);
>  
>  	/*
> -	 * Lock the inode, fix the size, and join it to the transaction.
> -	 * Hold it so in the normal path, we still have it locked for
> -	 * the second transaction.  In the error paths we need it
> +	 * Lock the inode, fix the size, turn it into a regular file and join it
> +	 * to the transaction.  Hold it so in the normal path, we still have it
> +	 * locked for the second transaction.  In the error paths we need it
>  	 * held so the cancel won't rele it, see below.
>  	 */
>  	size = (int)ip->i_d.di_size;
>  	ip->i_d.di_size = 0;
> +	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  	/*
>  	 * Find the block(s) so we can inval and unmap them.
> @@ -494,17 +502,10 @@ xfs_inactive_symlink(
>  		return -EIO;
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -
> -	/*
> -	 * Zero length symlinks _can_ exist.
> -	 */
>  	pathlen = (int)ip->i_d.di_size;
> -	if (!pathlen) {
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		return 0;
> -	}
> +	ASSERT(pathlen);
>  
> -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
>  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
>  			 __func__, (unsigned long long)ip->i_ino, pathlen);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -512,12 +513,12 @@ xfs_inactive_symlink(
>  		return -EFSCORRUPTED;
>  	}
>  
> +	/*
> +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> +	 * do here in that case.
> +	 */
>  	if (ip->i_df.if_flags & XFS_IFINLINE) {
> -		if (ip->i_df.if_bytes > 0) 
> -			xfs_idata_realloc(ip, -(ip->i_df.if_bytes),
> -					  XFS_DATA_FORK);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		ASSERT(ip->i_df.if_bytes == 0);
>  		return 0;
>  	}
>  
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
@ 2018-11-20 13:45   ` Brian Foster
  2018-11-20 16:33     ` Christoph Hellwig
  2018-11-20 16:32   ` Christoph Hellwig
  2018-11-20 22:58   ` Darrick J. Wong
  2 siblings, 1 reply; 34+ messages in thread
From: Brian Foster @ 2018-11-20 13:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 05:36:05PM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Long saga. There have been days spent following this through dead end
> after dead end in multi-GB event traces. This morning, after writing
> a trace-cmd wrapper that enabled me to be more selective about XFS
> trace points, I discovered that I could get just enough essential
> tracepoints enabled that there was a 50:50 chance the fsx config
> would fail at ~115k ops. If it didn't fail at op 115547, I stopped
> fsx at op 115548 anyway.
> 
> That gave me two traces - one where the problem manifested, and one
> where it didn't. After refining the traces to have the necessary
> information, I found that in the failing case there was a real
> extent in the COW fork compared to an unwritten extent in the
> working case.
> 
> Walking back through the two traces to the point where the CWO fork
> extents actually diverged, I found that the bad case had an extra
> unwritten extent in it. This is likely because the bug it led me to
> had triggered multiple times in those 115k ops, leaving stray
> COW extents around. What I saw was a COW delalloc conversion to an
> unwritten extent (as they should always be through
> xfs_iomap_write_allocate()) resulted in a /written extent/:
> 
> xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
> xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
> xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
> xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
> 
> Basically, Cow fork before:
> 
> 	0 1            32          52
> 	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
> 	   PREV		RIGHT
> 
> COW delalloc conversion allocates:
> 
> 	  1	       32
> 	  +uuuuuuuuuuuu+
> 	  NEW
> 
> And the result according to the xfs_bmap_post_update trace was:
> 
> 	0 1            32          52
> 	+H+wwwwwwwwwwwwwwwwwwwwwwww+
> 	   PREV
> 
> Which is clearly wrong - it should be a merged unwritten extent,
> not an unwritten extent.
> 
> That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
> case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
> the bug.
> 
> It takes the old delalloc extent (PREV) and adds the length of the
> RIGHT extent to it, takes the start block from NEW, removes the
> RIGHT extent and then updates PREV with the new extent.
> 
> What it fails to do is update PREV.br_state. For delalloc, this is
> always XFS_EXT_NORM, while in this case we are converting the
> delayed allocation to unwritten, so it needs to be updated to
> XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
> the resultant extent is always written.
> 
> And that's the bug I've been chasing for a week - a bmap btree bug,
> not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
> with the recent in core extent tree scalability enhancements.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 2d78fd53e822..39eaa2b86060 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1694,10 +1694,13 @@ xfs_bmap_add_extent_delay_real(
>  	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
>  		/*
>  		 * Filling in all of a previously delayed allocation extent.
> -		 * The right neighbor is contiguous, the left is not.
> +		 * The right neighbor is contiguous, the left is not. Take care
> +		 * with delay -> unwritten extent allocation here because the
> +		 * delalloc record we are overwriting is always written.
>  		 */
>  		PREV.br_startblock = new->br_startblock;
>  		PREV.br_blockcount += RIGHT.br_blockcount;
> +		PREV.br_state = new->br_state;
>  
>  		xfs_iext_next(ifp, &bma->icur);
>  		xfs_iext_remove(bma->ip, &bma->icur, state);

Fix looks sane to me, though I'm a little curious why this doesn't
follow the "contiguous extent extension" pattern that most of the other
contig cases seem to follow. For example, something like the following
for the right contig case:

                old = RIGHT;
                RIGHT.br_startoff = new->br_startoff;
                RIGHT.br_startblock = new->br_startblock;
                RIGHT.br_blockcount += new.br_blockcount;

                xfs_iext_remove(bma->ip, &bma->icur, state); /* PREV */
                xfs_iext_next(ifp, &bma->icur);
                xfs_iext_update_extent(bma->ip, state, &bma->icur, &RIGHT);

... and change the subsequent btree update to use old/RIGHT. Maybe
Christoph has thoughts on that.

Brian

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-20  9:50     ` Dave Chinner
@ 2018-11-20 16:28       ` Christoph Hellwig
  2018-11-20 21:00         ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20 16:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Christoph Hellwig, linux-xfs

On Tue, Nov 20, 2018 at 08:50:42PM +1100, Dave Chinner wrote:
> > > because the writeback error of ENOSPC was being reported. We're
> > > going to free that space, so we don't care if there was a ENOSPC
> > > writeback error. So ignore ENOSPC errors and punch anyway.
> > 
> > How do even get -ENOSPC back from writeback?  It seems like we have
> > a much worse root cause lingering here.
> 
> It hammered ENOSPC pretty hard - I think it had consumed the entire
> reserve pool and that can causes allocation transaction reservations
> in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
> reservation for the data extents being allocated.

Well, that means we do have a problem somewhere in our accounting,
as writeback should not run into ENOSPC.

> This doesn't happen very often in the real world, but if it does we
> need operations like punch to work to be able to free space and
> get us out of that hole...

I'm ok(-ish) with the patch, but I wish we could also sort out the
root cause..

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
  2018-11-20 13:45   ` Brian Foster
@ 2018-11-20 16:32   ` Christoph Hellwig
  2018-11-20 22:58   ` Darrick J. Wong
  2 siblings, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20 16:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 05:36:05PM +1100, Dave Chinner wrote:
> @@ -1694,10 +1694,13 @@ xfs_bmap_add_extent_delay_real(
>  	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
>  		/*
>  		 * Filling in all of a previously delayed allocation extent.
> -		 * The right neighbor is contiguous, the left is not.
> +		 * The right neighbor is contiguous, the left is not. Take care
> +		 * with delay -> unwritten extent allocation here because the
> +		 * delalloc record we are overwriting is always written.
>  		 */
>  		PREV.br_startblock = new->br_startblock;
>  		PREV.br_blockcount += RIGHT.br_blockcount;
> +		PREV.br_state = new->br_state;
>  
>  		xfs_iext_next(ifp, &bma->icur);
>  		xfs_iext_remove(bma->ip, &bma->icur, state);

Looks good, this also fixes my always_cow generic/091 failures:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-20 13:45   ` Brian Foster
@ 2018-11-20 16:33     ` Christoph Hellwig
  2018-11-20 21:08       ` Dave Chinner
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2018-11-20 16:33 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Nov 20, 2018 at 08:45:38AM -0500, Brian Foster wrote:
> >  		 * Filling in all of a previously delayed allocation extent.
> > -		 * The right neighbor is contiguous, the left is not.
> > +		 * The right neighbor is contiguous, the left is not. Take care
> > +		 * with delay -> unwritten extent allocation here because the
> > +		 * delalloc record we are overwriting is always written.
> >  		 */
> >  		PREV.br_startblock = new->br_startblock;
> >  		PREV.br_blockcount += RIGHT.br_blockcount;
> > +		PREV.br_state = new->br_state;
> >  
> >  		xfs_iext_next(ifp, &bma->icur);
> >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> 
> Fix looks sane to me, though I'm a little curious why this doesn't
> follow the "contiguous extent extension" pattern that most of the other
> contig cases seem to follow. For example, something like the following
> for the right contig case:
> 
>                 old = RIGHT;
>                 RIGHT.br_startoff = new->br_startoff;
>                 RIGHT.br_startblock = new->br_startblock;
>                 RIGHT.br_blockcount += new.br_blockcount;
> 
>                 xfs_iext_remove(bma->ip, &bma->icur, state); /* PREV */
>                 xfs_iext_next(ifp, &bma->icur);
>                 xfs_iext_update_extent(bma->ip, state, &bma->icur, &RIGHT);
> 
> ... and change the subsequent btree update to use old/RIGHT. Maybe
> Christoph has thoughts on that.

Not sure the above looks much better.  If I had to do something
different I'd apply something like this (untested) on top of Daves patch:

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 91b94c5c0cae..efe9e554fd0d 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -1698,14 +1698,12 @@ xfs_bmap_add_extent_delay_real(
 		 * with delay -> unwritten extent allocation here because the
 		 * delalloc record we are overwriting is always written.
 		 */
-		PREV.br_startblock = new->br_startblock;
-		PREV.br_blockcount += RIGHT.br_blockcount;
-		PREV.br_state = new->br_state;
+		new->br_blockcount += RIGHT.br_blockcount;
 
 		xfs_iext_next(ifp, &bma->icur);
 		xfs_iext_remove(bma->ip, &bma->icur, state);
 		xfs_iext_prev(ifp, &bma->icur);
-		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
+		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_DEXT;

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-20 16:28       ` Christoph Hellwig
@ 2018-11-20 21:00         ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-20 21:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:28:33AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 08:50:42PM +1100, Dave Chinner wrote:
> > > > because the writeback error of ENOSPC was being reported. We're
> > > > going to free that space, so we don't care if there was a ENOSPC
> > > > writeback error. So ignore ENOSPC errors and punch anyway.
> > > 
> > > How do even get -ENOSPC back from writeback?  It seems like we have
> > > a much worse root cause lingering here.
> > 
> > It hammered ENOSPC pretty hard - I think it had consumed the entire
> > reserve pool and that can causes allocation transaction reservations
> > in xfs_iomap_write_allocate() to return ENOSPC even if we've got a
> > reservation for the data extents being allocated.
> 
> Well, that means we do have a problem somewhere in our accounting,
> as writeback should not run into ENOSPC.

Unwritten extent conversion consumes unreserved space when it splits
the BMBT and does all the other btree updates. If we do enough of
them at ENOSPC, we can consume the entire reservation pool.

In this case, I was creating a 25 million extent file using extent
size hints to try to get the number of extents down. It actually
resulted in the number of extents going up, because most of the
extents weren't fully written. IOWs, there was a /lot/ of partial
unwritten extent conversion going on in a very big extent tree.

> > This doesn't happen very often in the real world, but if it does we
> > need operations like punch to work to be able to free space and
> > get us out of that hole...
> 
> I'm ok(-ish) with the patch, but I wish we could also sort out the
> root cause..

Yeah, it's not great, but it did mean I didn't have to mkfs the
filesystem to get out of trouble. mount/unmount doesn't fix a
depleted reserve pool, only freeing space will do that, and we
should be able to punch space out of a file when at ENOSPC...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-20 16:33     ` Christoph Hellwig
@ 2018-11-20 21:08       ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-20 21:08 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Brian Foster, linux-xfs

On Tue, Nov 20, 2018 at 08:33:49AM -0800, Christoph Hellwig wrote:
> On Tue, Nov 20, 2018 at 08:45:38AM -0500, Brian Foster wrote:
> > >  		 * Filling in all of a previously delayed allocation extent.
> > > -		 * The right neighbor is contiguous, the left is not.
> > > +		 * The right neighbor is contiguous, the left is not. Take care
> > > +		 * with delay -> unwritten extent allocation here because the
> > > +		 * delalloc record we are overwriting is always written.
> > >  		 */
> > >  		PREV.br_startblock = new->br_startblock;
> > >  		PREV.br_blockcount += RIGHT.br_blockcount;
> > > +		PREV.br_state = new->br_state;
> > >  
> > >  		xfs_iext_next(ifp, &bma->icur);
> > >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> > 
> > Fix looks sane to me, though I'm a little curious why this doesn't
> > follow the "contiguous extent extension" pattern that most of the other
> > contig cases seem to follow. For example, something like the following
> > for the right contig case:
> > 
> >                 old = RIGHT;
> >                 RIGHT.br_startoff = new->br_startoff;
> >                 RIGHT.br_startblock = new->br_startblock;
> >                 RIGHT.br_blockcount += new.br_blockcount;
> > 
> >                 xfs_iext_remove(bma->ip, &bma->icur, state); /* PREV */
> >                 xfs_iext_next(ifp, &bma->icur);
> >                 xfs_iext_update_extent(bma->ip, state, &bma->icur, &RIGHT);
> > 
> > ... and change the subsequent btree update to use old/RIGHT. Maybe
> > Christoph has thoughts on that.
> 
> Not sure the above looks much better.  If I had to do something
> different I'd apply something like this (untested) on top of Daves patch:
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 91b94c5c0cae..efe9e554fd0d 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1698,14 +1698,12 @@ xfs_bmap_add_extent_delay_real(
>  		 * with delay -> unwritten extent allocation here because the
>  		 * delalloc record we are overwriting is always written.
>  		 */
> -		PREV.br_startblock = new->br_startblock;
> -		PREV.br_blockcount += RIGHT.br_blockcount;
> -		PREV.br_state = new->br_state;
> +		new->br_blockcount += RIGHT.br_blockcount;
>  
>  		xfs_iext_next(ifp, &bma->icur);
>  		xfs_iext_remove(bma->ip, &bma->icur, state);
>  		xfs_iext_prev(ifp, &bma->icur);
> -		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> +		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_DEXT;

And that's buggy because it doesn't update the BMBT record
correctly - it uses PREV to overwrite RIGHT, and so this will write
the delalloc record into the BMBT, not the new merged unwritten
extent...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] xfs: zero length symlinks are not valid
  2018-11-20 13:44   ` Brian Foster
@ 2018-11-20 21:19     ` Dave Chinner
  2018-11-21 12:01       ` Brian Foster
  0 siblings, 1 reply; 34+ messages in thread
From: Dave Chinner @ 2018-11-20 21:19 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:44:39AM -0500, Brian Foster wrote:
> On Tue, Nov 20, 2018 at 08:04:53AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > A log recovery failure has been reproduced where a symlink inode has
> > a zero length in extent form. It was caused by a shutdown during a
> > combined fstress+fsmark workload.
> > 
> > The underlying problem is the issue in xfs_inactive_symlink(): the
> > inode is unlocked between the symlink inactivation/truncation and
> > the inode being freed. This opens a window for the inode to be
> > written to disk before it xfs_ifree() removes it from the unlinked
> > list, marks it free in the inobt and zeros the mode.
> > 
> > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > fork state, so there's no need to do it in xfs_inactive_symlink().
> > This means the shortform fork verifier will not see a zero length
> > data fork as it mirrors the inode size through to xfs_ifree()), and
> > hence if the inode gets written back and the fork verifiers are run
> > they will still see a fork that matches the on-disk inode size.
> > 
> > For extent form (remote) symlinks, it is a little more tricky. Here
> > we explicitly set the inode size to zero, so the above race can lead
> > to zero length symlinks on disk. Because the inode is unlinked at
> > this point (i.e. on the unlinked list) and unreferenced, it can
> > never be seen again by a user. Hence when we set the inode size to
> > zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
> > inodes to be of zero length, and so this avoids all the problems of
> > zero length symlinks ever hitting the disk. It also avoids the
> > problem of needing to handle zero length symlink inodes in log
> > recovery to replay the extent free intents and the remaining
> > deferops to free the extents the symlink used.
> > 
> > Also add a couple of asserts to warn us if zero length symlinks end
> > up in either the symlink create or inactivation paths.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> 
> Hmm, I saw this and thought this was something we had already fixed.
> Looking back, I see this was actually posted[1] months ago and there was
> a fairly nuanced discussion. On a quick skim, that appears to have
> concluded with the patch being mostly sane, but requiring a couple minor
> tweaks and commit log updates. This patch looks exactly the same to me,
> however. Hm?

Hmmm, I did all that, months ago. The code barely changed, it was
all just commit message updates. maybe I picked up the wrong version
of the patch - it had been sitting around for a while. I had even
gone back and checked the discussion before concluding that "it 
doesn't need code changes" before adding it to this stack...

I'll go back and see if I can find a more recent version...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/7] xfs: uncached buffer tracing needs to print bno
  2018-11-19 21:04 ` [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Dave Chinner
  2018-11-20  8:12   ` Christoph Hellwig
@ 2018-11-20 22:46   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:54AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Useless:
> 
> xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_submit:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_hold:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_iowait:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_iodone:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_iowait_done:  dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_rele:         dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> 
> Useful:
> 
> 
> xfs_buf_get_uncached: dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_unlock:       dev 253:32 bno 0xffffffffffffffff nblks 0x1 ...
> xfs_buf_submit:       dev 253:32 bno 0x200b5 nblks 0x1 ...
> xfs_buf_hold:         dev 253:32 bno 0x200b5 nblks 0x1 ...
> xfs_buf_iowait:       dev 253:32 bno 0x200b5 nblks 0x1 ...
> xfs_buf_iodone:       dev 253:32 bno 0x200b5 nblks 0x1 ...
> xfs_buf_iowait_done:  dev 253:32 bno 0x200b5 nblks 0x1 ...
> xfs_buf_rele:         dev 253:32 bno 0x200b5 nblks 0x1 ...
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_trace.h | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 3043e5ed6495..8a6532aae779 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -280,7 +280,10 @@ DECLARE_EVENT_CLASS(xfs_buf_class,
>  	),
>  	TP_fast_assign(
>  		__entry->dev = bp->b_target->bt_dev;
> -		__entry->bno = bp->b_bn;
> +		if (bp->b_bn == XFS_BUF_DADDR_NULL)
> +			__entry->bno = bp->b_maps[0].bm_bn;
> +		else
> +			__entry->bno = bp->b_bn;
>  		__entry->nblks = bp->b_length;
>  		__entry->hold = atomic_read(&bp->b_hold);
>  		__entry->pincount = atomic_read(&bp->b_pin_count);
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers
  2018-11-19 21:04 ` [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Dave Chinner
  2018-11-20  8:13   ` Christoph Hellwig
@ 2018-11-20 22:48   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:55AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When retrying a failed inode or dquot buffer,
> xfs_buf_resubmit_failed_buffers() clears all the failed flags from
> the inde/dquot log items. In doing so, it also drops all the
> reference counts on the buffer that the failed log items hold. This
> means it can drop all the active references on the buffer and hence
> free the buffer before it queues it for write again.
> 
> Putting the buffer on the delwri queue takes a reference to the
> buffer (so that it hangs around until it has been written and
> completed), but this goes bang if the buffer has already been freed.
> 
> Hence we need to add the buffer to the delwri queue before we remove
> the failed flags from the log items attached to the buffer to ensure
> it always remains referenced during the resubmit process.
> 
> Reported-by: Josef Bacik <josef@toxicpanda.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_buf_item.c | 28 +++++++++++++++++++++-------
>  1 file changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_buf_item.c b/fs/xfs/xfs_buf_item.c
> index 12d8455bfbb2..010db5f8fb00 100644
> --- a/fs/xfs/xfs_buf_item.c
> +++ b/fs/xfs/xfs_buf_item.c
> @@ -1233,9 +1233,23 @@ xfs_buf_iodone(
>  }
>  
>  /*
> - * Requeue a failed buffer for writeback
> + * Requeue a failed buffer for writeback.
>   *
> - * Return true if the buffer has been re-queued properly, false otherwise
> + * We clear the log item failed state here as well, but we have to be careful
> + * about reference counts because the only active reference counts on the buffer
> + * may be the failed log items. Hence if we clear the log item failed state
> + * before queuing the buffer for IO we can release all active references to
> + * the buffer and free it, leading to use after free problems in
> + * xfs_buf_delwri_queue. It makes no difference to the buffer or log items which
> + * order we process them in - the buffer is locked, and we own the buffer list
> + * so nothing on them is going to change while we are performing this action.
> + *
> + * Hence we can safely queue the buffer for IO before we clear the failed log
> + * item state, therefore  always having an active reference to the buffer and
> + * avoiding the transient zero-reference state that leads to use-after-free.
> + *
> + * Return true if the buffer was added to the buffer list, false if it was
> + * already on the buffer list.
>   */
>  bool
>  xfs_buf_resubmit_failed_buffers(
> @@ -1243,16 +1257,16 @@ xfs_buf_resubmit_failed_buffers(
>  	struct list_head	*buffer_list)
>  {
>  	struct xfs_log_item	*lip;
> +	bool			ret;
> +
> +	ret = xfs_buf_delwri_queue(bp, buffer_list);
>  
>  	/*
> -	 * Clear XFS_LI_FAILED flag from all items before resubmit
> -	 *
> -	 * XFS_LI_FAILED set/clear is protected by ail_lock, caller  this
> +	 * XFS_LI_FAILED set/clear is protected by ail_lock, caller of this
>  	 * function already have it acquired
>  	 */
>  	list_for_each_entry(lip, &bp->b_li_list, li_bio_list)
>  		xfs_clear_li_failed(lip);
>  
> -	/* Add this buffer back to the delayed write list */
> -	return xfs_buf_delwri_queue(bp, buffer_list);
> +	return ret;
>  }
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt
  2018-11-19 21:04 ` [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
  2018-11-20  8:14   ` Christoph Hellwig
@ 2018-11-20 22:49   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:56AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The last AG may be very small comapred to all other AGs, and hence
> AG reservations based on the superblock AG size may actually consume
> more space than the AG actually has. This results on assert failures
> like:
> 
> XFS: Assertion failed: xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved + xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <= pag->pagf_freeblks + pag->pagf_flcount, file: fs/xfs/libxfs/xfs_ag_resv.c, line: 319
> [   48.932891]  xfs_ag_resv_init+0x1bd/0x1d0
> [   48.933853]  xfs_fs_reserve_ag_blocks+0x37/0xb0
> [   48.934939]  xfs_mountfs+0x5b3/0x920
> [   48.935804]  xfs_fs_fill_super+0x462/0x640
> [   48.936784]  ? xfs_test_remount_options+0x60/0x60
> [   48.937908]  mount_bdev+0x178/0x1b0
> [   48.938751]  mount_fs+0x36/0x170
> [   48.939533]  vfs_kern_mount.part.43+0x54/0x130
> [   48.940596]  do_mount+0x20e/0xcb0
> [   48.941396]  ? memdup_user+0x3e/0x70
> [   48.942249]  ksys_mount+0xba/0xd0
> [   48.943046]  __x64_sys_mount+0x21/0x30
> [   48.943953]  do_syscall_64+0x54/0x170
> [   48.944835]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> Hence we need to ensure the finobt per-ag space reservations take
> into account the size of the last AG rather than treat it like all
> the other full size AGs.
> 
> Note that both refcountbt and rmapbt already take the size of the AG
> into account via reading the AGF length directly.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok now,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 86c50208a143..7fbf8af0b159 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -538,15 +538,18 @@ xfs_inobt_rec_check_count(
>  
>  static xfs_extlen_t
>  xfs_inobt_max_size(
> -	struct xfs_mount	*mp)
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
>  {
> +	xfs_agblock_t		agblocks = xfs_ag_block_count(mp, agno);
> +
>  	/* Bail out if we're uninitialized, which can happen in mkfs. */
>  	if (mp->m_inobt_mxr[0] == 0)
>  		return 0;
>  
>  	return xfs_btree_calc_size(mp->m_inobt_mnr,
> -		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
> -				XFS_INODES_PER_CHUNK);
> +				(uint64_t)agblocks * mp->m_sb.sb_inopblock /
> +					XFS_INODES_PER_CHUNK);
>  }
>  
>  static int
> @@ -594,7 +597,7 @@ xfs_finobt_calc_reserves(
>  	if (error)
>  		return error;
>  
> -	*ask += xfs_inobt_max_size(mp);
> +	*ask += xfs_inobt_max_size(mp, agno);
>  	*used += tree_len;
>  	return 0;
>  }
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache
  2018-11-19 21:04 ` [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
  2018-11-20  8:18   ` Christoph Hellwig
@ 2018-11-20 22:53   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:57AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The extent shifting code uses a flush and invalidate mechainsm prior
> to shifting extents around. This is similar to what
> xfs_free_file_space() does, but it doesn't take into account things
> like page cache vs block size differences, and it will fail if there
> is a page that it currently busy.
> 
> xfs_flush_unmap_range() handles all of these cases, so just convert
> xfs_prepare_shift() to us that mechanism rather than having it's own
> special sauce.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_bmap_util.c | 8 +-------
>  1 file changed, 1 insertion(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 5d263dfdb3bc..167ff4297e5c 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1195,13 +1195,7 @@ xfs_prepare_shift(
>  	 * Writeback and invalidate cache for the remainder of the file as we're
>  	 * about to shift down every extent from offset to EOF.
>  	 */
> -	error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping, offset, -1);
> -	if (error)
> -		return error;
> -	error = invalidate_inode_pages2_range(VFS_I(ip)->i_mapping,
> -					offset >> PAGE_SHIFT, -1);
> -	if (error)
> -		return error;
> +	error = xfs_flush_unmap_range(ip, offset, XFS_ISIZE(ip));
>  
>  	/*
>  	 * Clean out anything hanging around in the cow fork now that
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep
  2018-11-19 21:04 ` [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Dave Chinner
  2018-11-20  8:32   ` Christoph Hellwig
@ 2018-11-20 22:56   ` Darrick J. Wong
  1 sibling, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:56 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:59AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> On a sub-page block size filesystem, fsx is failing with a data
> corruption after a series of operations involving copying a file
> with the destination offset beyond EOF of the destination of the file:
> 
> 8093(157 mod 256): TRUNCATE DOWN        from 0x7a120 to 0x50000 ******WWWW
> 8094(158 mod 256): INSERT 0x25000 thru 0x25fff  (0x1000 bytes)
> 8095(159 mod 256): COPY 0x18000 thru 0x1afff    (0x3000 bytes) to 0x2f400
> 8096(160 mod 256): WRITE    0x5da00 thru 0x651ff        (0x7800 bytes) HOLE
> 8097(161 mod 256): COPY 0x2000 thru 0x5fff      (0x4000 bytes) to 0x6fc00
> 
> The second copy here is beyond EOF, and it is to sub-page (4k) but
> block aligned (1k) offset. The clone runs the EOF zeroing, landing
> in a pre-existing post-eof delalloc extent. This zeroes the post-eof
> extents in the page cache just fine, dirtying the pages correctly.
> 
> The problem is that xfs_reflink_remap_prep() now truncates the page
> cache over the range that it is copying it to, and rounds that down
> to cover the entire start page. This removes the dirty page over the
> delalloc extent from the page cache without having written it back.
> Hence later, when the page cache is flushed, the page at offset
> 0x6f000 has not been written back and hence exposes stale data,
> which fsx trips over less than 10 operations later.
> 
> Fix this by changing xfs_reflink_remap_prep() to use
> xfs_flush_unmap_range().
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c |  2 +-
>  fs/xfs/xfs_bmap_util.h |  3 +++
>  fs/xfs/xfs_reflink.c   | 17 +++++++++++++----
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index cc7a0d47c529..3e66cf0520a9 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1042,7 +1042,7 @@ xfs_unmap_extent(
>  	goto out_unlock;
>  }
>  
> -static int
> +int
>  xfs_flush_unmap_range(
>  	struct xfs_inode	*ip,
>  	xfs_off_t		offset,
> diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
> index 87363d136bb6..7a78229cf1a7 100644
> --- a/fs/xfs/xfs_bmap_util.h
> +++ b/fs/xfs/xfs_bmap_util.h
> @@ -80,4 +80,7 @@ int xfs_bmap_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
>  			  int whichfork, xfs_extnum_t *nextents,
>  			  xfs_filblks_t *count);
>  
> +int	xfs_flush_unmap_range(struct xfs_inode *ip, xfs_off_t offset,
> +			      xfs_off_t len);
> +
>  #endif	/* __XFS_BMAP_UTIL_H__ */
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index ecdb086bc23e..a41590b7c229 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1351,10 +1351,19 @@ xfs_reflink_remap_prep(
>  	if (ret)
>  		goto out_unlock;
>  
> -	/* Zap any page cache for the destination file's range. */
> -	truncate_inode_pages_range(&inode_out->i_data,
> -			round_down(pos_out, PAGE_SIZE),
> -			round_up(pos_out + *len, PAGE_SIZE) - 1);
> +	/*
> +	 * If pos_out > EOF, we make have dirtied blocks between EOF and

Looks ok, will s/make/may/ on the way in (unless you send a new
version);

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> +	 * pos_out. In that case, we need to extend the flush and unmap to cover
> +	 * from EOF to the end of the copy length.
> +	 */
> +	if (pos_out > XFS_ISIZE(dest)) {
> +		loff_t	flen = *len + (pos_out - XFS_ISIZE(dest));
> +		ret = xfs_flush_unmap_range(dest, XFS_ISIZE(dest), flen);
> +	} else {
> +		ret = xfs_flush_unmap_range(dest, pos_out, *len);
> +	}
> +	if (ret)
> +		goto out_unlock;
>  
>  	return 1;
>  out_unlock:
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong
  2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
  2018-11-20 13:45   ` Brian Foster
  2018-11-20 16:32   ` Christoph Hellwig
@ 2018-11-20 22:58   ` Darrick J. Wong
  2 siblings, 0 replies; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-20 22:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 05:36:05PM +1100, Dave Chinner wrote:
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Long saga. There have been days spent following this through dead end
> after dead end in multi-GB event traces. This morning, after writing
> a trace-cmd wrapper that enabled me to be more selective about XFS
> trace points, I discovered that I could get just enough essential
> tracepoints enabled that there was a 50:50 chance the fsx config
> would fail at ~115k ops. If it didn't fail at op 115547, I stopped
> fsx at op 115548 anyway.
> 
> That gave me two traces - one where the problem manifested, and one
> where it didn't. After refining the traces to have the necessary
> information, I found that in the failing case there was a real
> extent in the COW fork compared to an unwritten extent in the
> working case.
> 
> Walking back through the two traces to the point where the CWO fork
> extents actually diverged, I found that the bad case had an extra
> unwritten extent in it. This is likely because the bug it led me to
> had triggered multiple times in those 115k ops, leaving stray
> COW extents around. What I saw was a COW delalloc conversion to an
> unwritten extent (as they should always be through
> xfs_iomap_write_allocate()) resulted in a /written extent/:
> 
> xfs_writepage:        dev 259:0 ino 0x83 pgoff 0x17000 size 0x79a00 offset 0 length 0
> xfs_iext_remove:      dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/2 offset 32 block 152 count 20 flag 1 caller xfs_bmap_add_extent_delay_real
> xfs_bmap_pre_update:  dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 4503599627239429 count 31 flag 0 caller xfs_bmap_add_extent_delay_real
> xfs_bmap_post_update: dev 259:0 ino 0x83 state RC|LF|RF|COW cur 0xffff888247b899c0/1 offset 1 block 121 count 51 flag 0 caller xfs_bmap_add_ex
> 
> Basically, Cow fork before:
> 
> 	0 1            32          52
> 	+H+DDDDDDDDDDDD+UUUUUUUUUUU+
> 	   PREV		RIGHT
> 
> COW delalloc conversion allocates:
> 
> 	  1	       32
> 	  +uuuuuuuuuuuu+
> 	  NEW
> 
> And the result according to the xfs_bmap_post_update trace was:
> 
> 	0 1            32          52
> 	+H+wwwwwwwwwwwwwwwwwwwwwwww+
> 	   PREV
> 
> Which is clearly wrong - it should be a merged unwritten extent,
> not an unwritten extent.
> 
> That lead me to look at the LEFT_FILLING|RIGHT_FILLING|RIGHT_CONTIG
> case in xfs_bmap_add_extent_delay_real(), and sure enough, there's
> the bug.
> 
> It takes the old delalloc extent (PREV) and adds the length of the
> RIGHT extent to it, takes the start block from NEW, removes the
> RIGHT extent and then updates PREV with the new extent.
> 
> What it fails to do is update PREV.br_state. For delalloc, this is
> always XFS_EXT_NORM, while in this case we are converting the
> delayed allocation to unwritten, so it needs to be updated to
> XFS_EXT_UNWRITTEN. This LF|RF|RC case does not do this, and so
> the resultant extent is always written.
> 
> And that's the bug I've been chasing for a week - a bmap btree bug,
> not a reflink/dedupe/copy_file_range bug, but a BMBT bug introduced
> with the recent in core extent tree scalability enhancements.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, seems to fix the fsx failures I know how to reproduce... :P

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 2d78fd53e822..39eaa2b86060 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -1694,10 +1694,13 @@ xfs_bmap_add_extent_delay_real(
>  	case BMAP_LEFT_FILLING | BMAP_RIGHT_FILLING | BMAP_RIGHT_CONTIG:
>  		/*
>  		 * Filling in all of a previously delayed allocation extent.
> -		 * The right neighbor is contiguous, the left is not.
> +		 * The right neighbor is contiguous, the left is not. Take care
> +		 * with delay -> unwritten extent allocation here because the
> +		 * delalloc record we are overwriting is always written.
>  		 */
>  		PREV.br_startblock = new->br_startblock;
>  		PREV.br_blockcount += RIGHT.br_blockcount;
> +		PREV.br_state = new->br_state;
>  
>  		xfs_iext_next(ifp, &bma->icur);
>  		xfs_iext_remove(bma->ip, &bma->icur, state);

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/7] xfs: zero length symlinks are not valid
  2018-11-20 21:19     ` Dave Chinner
@ 2018-11-21 12:01       ` Brian Foster
  0 siblings, 0 replies; 34+ messages in thread
From: Brian Foster @ 2018-11-21 12:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 21, 2018 at 08:19:18AM +1100, Dave Chinner wrote:
> On Tue, Nov 20, 2018 at 08:44:39AM -0500, Brian Foster wrote:
> > On Tue, Nov 20, 2018 at 08:04:53AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > A log recovery failure has been reproduced where a symlink inode has
> > > a zero length in extent form. It was caused by a shutdown during a
> > > combined fstress+fsmark workload.
> > > 
> > > The underlying problem is the issue in xfs_inactive_symlink(): the
> > > inode is unlocked between the symlink inactivation/truncation and
> > > the inode being freed. This opens a window for the inode to be
> > > written to disk before it xfs_ifree() removes it from the unlinked
> > > list, marks it free in the inobt and zeros the mode.
> > > 
> > > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > > fork state, so there's no need to do it in xfs_inactive_symlink().
> > > This means the shortform fork verifier will not see a zero length
> > > data fork as it mirrors the inode size through to xfs_ifree()), and
> > > hence if the inode gets written back and the fork verifiers are run
> > > they will still see a fork that matches the on-disk inode size.
> > > 
> > > For extent form (remote) symlinks, it is a little more tricky. Here
> > > we explicitly set the inode size to zero, so the above race can lead
> > > to zero length symlinks on disk. Because the inode is unlinked at
> > > this point (i.e. on the unlinked list) and unreferenced, it can
> > > never be seen again by a user. Hence when we set the inode size to
> > > zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
> > > inodes to be of zero length, and so this avoids all the problems of
> > > zero length symlinks ever hitting the disk. It also avoids the
> > > problem of needing to handle zero length symlink inodes in log
> > > recovery to replay the extent free intents and the remaining
> > > deferops to free the extents the symlink used.
> > > 
> > > Also add a couple of asserts to warn us if zero length symlinks end
> > > up in either the symlink create or inactivation paths.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > 
> > Hmm, I saw this and thought this was something we had already fixed.
> > Looking back, I see this was actually posted[1] months ago and there was
> > a fairly nuanced discussion. On a quick skim, that appears to have
> > concluded with the patch being mostly sane, but requiring a couple minor
> > tweaks and commit log updates. This patch looks exactly the same to me,
> > however. Hm?
> 
> Hmmm, I did all that, months ago. The code barely changed, it was
> all just commit message updates. maybe I picked up the wrong version
> of the patch - it had been sitting around for a while. I had even
> gone back and checked the discussion before concluding that "it 
> doesn't need code changes" before adding it to this stack...
> 
> I'll go back and see if I can find a more recent version...
> 

It was mostly commit log changes except for an
s/xfs_difree()/xfs_ifree()/ needed in the code comment. IIRC, there was
additional context worth describing for log recovery in the commit log.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-19 21:04 ` [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Dave Chinner
  2018-11-20  8:20   ` Christoph Hellwig
@ 2018-11-21 18:09   ` Darrick J. Wong
  2018-11-22  2:31     ` Dave Chinner
  1 sibling, 1 reply; 34+ messages in thread
From: Darrick J. Wong @ 2018-11-21 18:09 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Just tried to punch a 40T sparse file when enospc was triggered due
> to extent size hints consuming more space than expected. It failed
> with:
> 
> # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> fallocate: No space left on device
> #
> 
> because the writeback error of ENOSPC was being reported. We're
> going to free that space, so we don't care if there was a ENOSPC
> writeback error. So ignore ENOSPC errors and punch anyway.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index 167ff4297e5c..cc7a0d47c529 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1060,8 +1060,13 @@ xfs_flush_unmap_range(
>  	start = round_down(offset, rounding);
>  	end = round_up(offset + len, rounding) - 1;
>  
> +	/*
> +	 * We're going to trash the data in this range, so if writeback reports
> +	 * an enospc error, don't let it stop us from /freeing the space/ in the
> +	 * range to alleviate the ENOSPC condition.
> +	 */
>  	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> -	if (error)
> +	if (error && error != -ENOSPC)

I think there's a bug here -- COLLAPSE_RANGE and INSERT_RANGE call
xfs_prepare_shift to writeback and invalidate the pagecache from a
selected offset to EOF, and _prepare_shift calls _flush_unmap_range.

So if we have a dirty 100 block file and we ask xfs to collapse 3 blocks
at block 17, it'll _prepare_shift blocks 20-99 before shifting them down
by 3 blocks.  If we instead asked to insert 3 blocks at block 17, it'll
_prepare_shift blocks 17-99 before shifting them up by 3 blocks.

Unlike the punch/reflink cases, the _prepare_shift ranges aren't doomed,
so the user might want to know if the write fails.

--D

>  		return error;
>  	truncate_pagecache_range(inode, start, end);
>  	return 0;
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes
  2018-11-21 18:09   ` Darrick J. Wong
@ 2018-11-22  2:31     ` Dave Chinner
  0 siblings, 0 replies; 34+ messages in thread
From: Dave Chinner @ 2018-11-22  2:31 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Nov 21, 2018 at 10:09:36AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 20, 2018 at 08:04:58AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Just tried to punch a 40T sparse file when enospc was triggered due
> > to extent size hints consuming more space than expected. It failed
> > with:
> > 
> > # sudo xfs_io -c "fpunch 0 40t" /mnt/fast/xfs-arekm.img
> > fallocate: No space left on device
> > #
> > 
> > because the writeback error of ENOSPC was being reported. We're
> > going to free that space, so we don't care if there was a ENOSPC
> > writeback error. So ignore ENOSPC errors and punch anyway.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 7 ++++++-
> >  1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index 167ff4297e5c..cc7a0d47c529 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1060,8 +1060,13 @@ xfs_flush_unmap_range(
> >  	start = round_down(offset, rounding);
> >  	end = round_up(offset + len, rounding) - 1;
> >  
> > +	/*
> > +	 * We're going to trash the data in this range, so if writeback reports
> > +	 * an enospc error, don't let it stop us from /freeing the space/ in the
> > +	 * range to alleviate the ENOSPC condition.
> > +	 */
> >  	error = filemap_write_and_wait_range(inode->i_mapping, start, end);
> > -	if (error)
> > +	if (error && error != -ENOSPC)
> 
> I think there's a bug here -- COLLAPSE_RANGE and INSERT_RANGE call
> xfs_prepare_shift to writeback and invalidate the pagecache from a
> selected offset to EOF, and _prepare_shift calls _flush_unmap_range.
> 
> So if we have a dirty 100 block file and we ask xfs to collapse 3 blocks
> at block 17, it'll _prepare_shift blocks 20-99 before shifting them down
> by 3 blocks.  If we instead asked to insert 3 blocks at block 17, it'll
> _prepare_shift blocks 17-99 before shifting them up by 3 blocks.
> 
> Unlike the punch/reflink cases, the _prepare_shift ranges aren't doomed,
> so the user might want to know if the write fails.

Hmmm, yeah, the _prepare_shift changes came after this patch, so I
didn't consider that when originally writing this patch. Drop it for
now, will have to rethink what to do here....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2018-11-22 13:29 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-19 21:04 [PATCH 0/7] xfs: various fixes for 4.20 Dave Chinner
2018-11-19 21:04 ` [PATCH 1/7] xfs: zero length symlinks are not valid Dave Chinner
2018-11-20  8:12   ` Christoph Hellwig
2018-11-20 13:44   ` Brian Foster
2018-11-20 21:19     ` Dave Chinner
2018-11-21 12:01       ` Brian Foster
2018-11-19 21:04 ` [PATCH 2/7] xfs: uncached buffer tracing needs to print bno Dave Chinner
2018-11-20  8:12   ` Christoph Hellwig
2018-11-20 22:46   ` Darrick J. Wong
2018-11-19 21:04 ` [PATCH 3/7] xfs: fix transient reference count error in xfs_buf_resubmit_failed_buffers Dave Chinner
2018-11-20  8:13   ` Christoph Hellwig
2018-11-20 22:48   ` Darrick J. Wong
2018-11-19 21:04 ` [PATCH 4/7] xfs: finobt AG reserves don't consider last AG can be a runt Dave Chinner
2018-11-20  8:14   ` Christoph Hellwig
2018-11-20 22:49   ` Darrick J. Wong
2018-11-19 21:04 ` [PATCH 5/7] xfs: extent shifting doesn't fully invalidate page cache Dave Chinner
2018-11-20  8:18   ` Christoph Hellwig
2018-11-20 22:53   ` Darrick J. Wong
2018-11-19 21:04 ` [PATCH 6/7] xfs: don't ENOSPC on writeback when punching holes Dave Chinner
2018-11-20  8:20   ` Christoph Hellwig
2018-11-20  9:50     ` Dave Chinner
2018-11-20 16:28       ` Christoph Hellwig
2018-11-20 21:00         ` Dave Chinner
2018-11-21 18:09   ` Darrick J. Wong
2018-11-22  2:31     ` Dave Chinner
2018-11-19 21:04 ` [PATCH 7/7] xfs: flush removing page cache in xfs_reflink_remap_prep Dave Chinner
2018-11-20  8:32   ` Christoph Hellwig
2018-11-20 22:56   ` Darrick J. Wong
2018-11-20  6:36 ` [PATCH 8/7] xfs: delalloc -> unwritten COW fork allocation can go wrong Dave Chinner
2018-11-20 13:45   ` Brian Foster
2018-11-20 16:33     ` Christoph Hellwig
2018-11-20 21:08       ` Dave Chinner
2018-11-20 16:32   ` Christoph Hellwig
2018-11-20 22:58   ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).