All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] xfs: symlink and inode writeback issues
@ 2018-06-18  5:57 Dave Chinner
  2018-06-18  5:57 ` [PATCH 1/2] xfs: zero length symlinks are not valid Dave Chinner
  2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2018-06-18  5:57 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

The first patch is the replacement fix for the symlink verification
problem I originally posted as "xfs: symlinks can be zero length
during log recovery" a few days ago. This caused the symlink data
fork verifier to fire during inode writeback, and in tracking that
down it became clear that zero length symlinks are not valid,
despite repeated comments in the code that "they can happen". This
new patch prevents zero length symlink inodes from ever getting to
disk, and prevents transient states in memory where the symlink data
fork is zero length.

The second patch is a fix for a problem discovered by the original
symlink verifier patch - when it fired in inode writeback it would
do a double flush lock release on the inode, triggering an assert.
nothing good can come from that, so I tracked down the dark, dusty
origins of the crappy error handling in the inode writeback code and
fixed it, too.

Comments, thoughts?

Cheers,

Dave.


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-18  5:57 [PATCH 0/2] xfs: symlink and inode writeback issues Dave Chinner
@ 2018-06-18  5:57 ` Dave Chinner
  2018-06-18 13:24   ` Brian Foster
  2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2018-06-18  5:57 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

A log recovery failure has been reproduced where a symlink inode has
a zero length in extent form. It was caused by a shutdown during a
combined fstress+fsmark workload.

The underlying problem is the issue in xfs_inactive_symlink(): the
inode is unlocked between the symlink inactivation/truncation and
the inode being freed. This opens a window for the inode to be
written to disk before it xfs_ifree() removes it from the unlinked
list, marks it free in the inobt and zeros the mode.

For shortform inodes, the fix is simple. xfs_ifree() clears the data
fork state, so there's no need to do it in xfs_inactive_symlink().
This means the shortform fork verifier will not see a zero length
data fork as it mirrors the inode size through to xfs_ifree()), and
hence if the inode gets written back and the fork verifiers are run
they will still see a fork that matches the on-disk inode size.

For extent form (remote) symlinks, it is a little more tricky. Here
we explicitly set the inode size to zero, so the above race can lead
to zero length symlinks on disk. Because the inode is unlinked at
this point (i.e. on the unlinked list) and unreferenced, it can
never be seen again by a user. Hence when we set the inode size to
zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
inodes to be of zero length, and so this avoids all the problems of
zero length symlinks ever hitting the disk. It also avoids the
problem of needing to handle zero length symlink inodes in log
recovery to replay the extent free intents and the remaining
deferops to free the extents the symlink used.

Also add a couple of asserts to warn us if zero length symlinks end
up in either the symlink create or inactivation paths.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_symlink_remote.c | 14 +++++++++----
 fs/xfs/xfs_symlink.c               | 33 +++++++++++++++---------------
 2 files changed, 27 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index 95374ab2dee7..77d80106f989 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -199,7 +199,10 @@ xfs_symlink_local_to_remote(
 					ifp->if_bytes - 1);
 }
 
-/* Verify the consistency of an inline symlink. */
+/*
+ * Verify the in-memory consistency of an inline symlink data fork. This
+ * does not do on-disk format checks.
+ */
 xfs_failaddr_t
 xfs_symlink_shortform_verify(
 	struct xfs_inode	*ip)
@@ -215,9 +218,12 @@ xfs_symlink_shortform_verify(
 	size = ifp->if_bytes;
 	endp = sfp + size;
 
-	/* Zero length symlinks can exist while we're deleting a remote one. */
-	if (size == 0)
-		return NULL;
+	/*
+	 * Zero length symlinks should never occur in memory as they are
+	 * never alllowed to exist on disk.
+	 */
+	if (!size)
+		return __this_address;
 
 	/* No negative sizes or overly long symlink targets. */
 	if (size < 0 || size > XFS_SYMLINK_MAXLEN)
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 3783afcb68d2..e0c2d460ba67 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -194,6 +194,7 @@ xfs_symlink(
 	pathlen = strlen(target_path);
 	if (pathlen >= XFS_SYMLINK_MAXLEN)      /* total string too long */
 		return -ENAMETOOLONG;
+	ASSERT(pathlen > 0);
 
 	udqp = gdqp = NULL;
 	prid = xfs_get_initial_prid(dp);
@@ -395,6 +396,12 @@ xfs_symlink(
 
 /*
  * Free a symlink that has blocks associated with it.
+ *
+ * Note: zero length symlinks are not allowed to exist. When we set the size to
+ * zero, also change it to a regular file so that it does not get written to
+ * disk as a zero length symlink. The inode is on the unlinked list already, so
+ * userspace cannot find this inode anymore, so this change is not user visible
+ * but allows us to catch corrupt zero-length symlinks in the verifiers.
  */
 STATIC int
 xfs_inactive_symlink_rmt(
@@ -431,13 +438,14 @@ xfs_inactive_symlink_rmt(
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
-	 * Lock the inode, fix the size, and join it to the transaction.
-	 * Hold it so in the normal path, we still have it locked for
-	 * the second transaction.  In the error paths we need it
+	 * Lock the inode, fix the size, turn it into a regular file and join it
+	 * to the transaction.  Hold it so in the normal path, we still have it
+	 * locked for the second transaction.  In the error paths we need it
 	 * held so the cancel won't rele it, see below.
 	 */
 	size = (int)ip->i_d.di_size;
 	ip->i_d.di_size = 0;
+	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	/*
 	 * Find the block(s) so we can inval and unmap them.
@@ -523,17 +531,10 @@ xfs_inactive_symlink(
 		return -EIO;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
-
-	/*
-	 * Zero length symlinks _can_ exist.
-	 */
 	pathlen = (int)ip->i_d.di_size;
-	if (!pathlen) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		return 0;
-	}
+	ASSERT(pathlen);
 
-	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
+	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
 		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
 			 __func__, (unsigned long long)ip->i_ino, pathlen);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
@@ -541,12 +542,12 @@ xfs_inactive_symlink(
 		return -EFSCORRUPTED;
 	}
 
+	/*
+	 * Inline fork state gets removed by xfs_difree() so we have nothing to
+	 * do here in that case.
+	 */
 	if (ip->i_df.if_flags & XFS_IFINLINE) {
-		if (ip->i_df.if_bytes > 0) 
-			xfs_idata_realloc(ip, -(ip->i_df.if_bytes),
-					  XFS_DATA_FORK);
 		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		ASSERT(ip->i_df.if_bytes == 0);
 		return 0;
 	}
 
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
  2018-06-18  5:57 [PATCH 0/2] xfs: symlink and inode writeback issues Dave Chinner
  2018-06-18  5:57 ` [PATCH 1/2] xfs: zero length symlinks are not valid Dave Chinner
@ 2018-06-18  5:57 ` Dave Chinner
  2018-06-18 13:24   ` Brian Foster
  2018-06-19  5:05   ` Darrick J. Wong
  1 sibling, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2018-06-18  5:57 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

When a corrupt inode is detected during xfs_iflush_cluster, we can
get a shutdown ASSERT failure like this:

XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
XFS (pmem1): Unmount and run xfs_repair
XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
XFS (pmem1): Please umount the filesystem and rectify the problem(s)
XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
.....
Call Trace:
 xfs_iflush_abort+0x10a/0x110
 xfs_iflush+0xf3/0x390
 xfs_inode_item_push+0x126/0x1e0
 xfsaild+0x2c5/0x890
 kthread+0x11c/0x140
 ret_from_fork+0x24/0x30

Essentially, xfs_iflush_abort() has been called twice on the
original inode that that was flushed. This happens because the
inode has been flushed to teh buffer successfully via
xfs_iflush_int(), and so when another inode is detected as corrupt
in xfs_iflush_cluster, the buffer is marked stale and EIO, and
iodone callbacks are run on it.

Running the iodone callbacks walks across the original inode and
calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
to xfs_iflush(), it runs the error path for that function, and that
calls xfs_iflush_abort() on the inode a second time, leading to the
above assert failure as the inode is not flush locked anymore.

This bug has been there a long time.

The simple fix would be to just avoid calling xfs_iflush_abort() in
xfs_iflush() if we've got a failure from xfs_iflush_cluster().
However, xfs_iflush_cluster() has magic delwri buffer handling that
means it may or may not have run IO completion on the buffer, and
hence sometimes we have to call xfs_iflush_abort() from
xfs_iflush(), and sometimes we shouldn't.

After reading through all the error paths and the delwri buffer
code, it's clear that the error handling in xfs_iflush_cluster() is
unnecessary. If the buffer is delwri, it leaves it on the delwri
list so that when the delwri list is submitted it sees a shutdown
fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
when it's not a delwri buffer. Further, marking a buffer stale
clears the _XBF_DELWRI_Q flag on the buffer, which means when
submission of the buffer occurs, it just skips over it and releases
it.

IOWs, the error handling in xfs_iflush_cluster doesn't need to care
if the buffer is already on a the delwri queue or not - it just
needs to mark the buffer stale, EIO and run completions. That means
we can just use the easy fix for xfs_iflush() to avoid the double
abort.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_inode.c | 57 +++++++++++++++++-----------------------------
 1 file changed, 21 insertions(+), 36 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4a2e5e13c569..4a9b90cfb8c3 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3236,7 +3236,6 @@ xfs_iflush_cluster(
 	struct xfs_inode	*cip;
 	int			nr_found;
 	int			clcount = 0;
-	int			bufwasdelwri;
 	int			i;
 
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
@@ -3360,37 +3359,22 @@ xfs_iflush_cluster(
 	 * inode buffer and shut down the filesystem.
 	 */
 	rcu_read_unlock();
-	/*
-	 * Clean up the buffer.  If it was delwri, just release it --
-	 * brelse can handle it with no problems.  If not, shut down the
-	 * filesystem before releasing the buffer.
-	 */
-	bufwasdelwri = (bp->b_flags & _XBF_DELWRI_Q);
-	if (bufwasdelwri)
-		xfs_buf_relse(bp);
-
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 
-	if (!bufwasdelwri) {
-		/*
-		 * Just like incore_relse: if we have b_iodone functions,
-		 * mark the buffer as an error and call them.  Otherwise
-		 * mark it as stale and brelse.
-		 */
-		if (bp->b_iodone) {
-			bp->b_flags &= ~XBF_DONE;
-			xfs_buf_stale(bp);
-			xfs_buf_ioerror(bp, -EIO);
-			xfs_buf_ioend(bp);
-		} else {
-			xfs_buf_stale(bp);
-			xfs_buf_relse(bp);
-		}
-	}
-
 	/*
-	 * Unlocks the flush lock
+	 * We'll always have an inode attached to the buffer for completion
+	 * process by the time we are called from xfs_iflush(). Hence we have
+	 * always need to do IO completion processing to abort the inodes
+	 * attached to the buffer.  handle them just like the shutdown case in
+	 * xfs_buf_submit().
 	 */
+	ASSERT(bp->b_iodone);
+	bp->b_flags &= ~XBF_DONE;
+	xfs_buf_stale(bp);
+	xfs_buf_ioerror(bp, -EIO);
+	xfs_buf_ioend(bp);
+
+	/* abort the corrupt inode, as it was not attached to the buffer */
 	xfs_iflush_abort(cip, false);
 	kmem_free(cilist);
 	xfs_perag_put(pag);
@@ -3486,12 +3470,17 @@ xfs_iflush(
 		xfs_log_force(mp, 0);
 
 	/*
-	 * inode clustering:
-	 * see if other inodes can be gathered into this write
+	 * inode clustering: try to gather other inodes into this write
+	 *
+	 * Note: Any error during clustering will result in the filesystem
+	 * being shut down and completion callbacks run on the cluster buffer.
+	 * As we have already flushed and attached this inode to the buffer,
+	 * it has already been aborted and released by xfs_iflush_cluster() and
+	 * so we have no further error handling to do here.
 	 */
 	error = xfs_iflush_cluster(ip, bp);
 	if (error)
-		goto cluster_corrupt_out;
+		return error;
 
 	*bpp = bp;
 	return 0;
@@ -3500,12 +3489,8 @@ xfs_iflush(
 	if (bp)
 		xfs_buf_relse(bp);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-cluster_corrupt_out:
-	error = -EFSCORRUPTED;
 abort_out:
-	/*
-	 * Unlocks the flush lock
-	 */
+	/* abort the corrupt inode, as it was not attached to the buffer */
 	xfs_iflush_abort(ip, false);
 	return error;
 }
-- 
2.17.0


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-18  5:57 ` [PATCH 1/2] xfs: zero length symlinks are not valid Dave Chinner
@ 2018-06-18 13:24   ` Brian Foster
  2018-06-18 22:42     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-18 13:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 18, 2018 at 03:57:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> A log recovery failure has been reproduced where a symlink inode has
> a zero length in extent form. It was caused by a shutdown during a
> combined fstress+fsmark workload.
> 
> The underlying problem is the issue in xfs_inactive_symlink(): the
> inode is unlocked between the symlink inactivation/truncation and
> the inode being freed. This opens a window for the inode to be
> written to disk before it xfs_ifree() removes it from the unlinked
> list, marks it free in the inobt and zeros the mode.
> 
> For shortform inodes, the fix is simple. xfs_ifree() clears the data
> fork state, so there's no need to do it in xfs_inactive_symlink().
> This means the shortform fork verifier will not see a zero length
> data fork as it mirrors the inode size through to xfs_ifree()), and
> hence if the inode gets written back and the fork verifiers are run
> they will still see a fork that matches the on-disk inode size.
> 
> For extent form (remote) symlinks, it is a little more tricky. Here
> we explicitly set the inode size to zero, so the above race can lead
> to zero length symlinks on disk. Because the inode is unlinked at
> this point (i.e. on the unlinked list) and unreferenced, it can
> never be seen again by a user. Hence when we set the inode size to

Perhaps we don't care, but what about bulkstat? The inode free is
imminent, so it probably doesn't matter that much.

> zeor, also change the type to S_IFREG. xfs_ifree() expects S_IFREG
> inodes to be of zero length, and so this avoids all the problems of
> zero length symlinks ever hitting the disk. It also avoids the
> problem of needing to handle zero length symlink inodes in log
> recovery to replay the extent free intents and the remaining
> deferops to free the extents the symlink used.
> 
> Also add a couple of asserts to warn us if zero length symlinks end
> up in either the symlink create or inactivation paths.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_symlink_remote.c | 14 +++++++++----
>  fs/xfs/xfs_symlink.c               | 33 +++++++++++++++---------------
>  2 files changed, 27 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
> index 95374ab2dee7..77d80106f989 100644
> --- a/fs/xfs/libxfs/xfs_symlink_remote.c
> +++ b/fs/xfs/libxfs/xfs_symlink_remote.c
> @@ -199,7 +199,10 @@ xfs_symlink_local_to_remote(
>  					ifp->if_bytes - 1);
>  }
>  
> -/* Verify the consistency of an inline symlink. */
> +/*
> + * Verify the in-memory consistency of an inline symlink data fork. This
> + * does not do on-disk format checks.
> + */
>  xfs_failaddr_t
>  xfs_symlink_shortform_verify(
>  	struct xfs_inode	*ip)
> @@ -215,9 +218,12 @@ xfs_symlink_shortform_verify(
>  	size = ifp->if_bytes;
>  	endp = sfp + size;
>  
> -	/* Zero length symlinks can exist while we're deleting a remote one. */
> -	if (size == 0)
> -		return NULL;
> +	/*
> +	 * Zero length symlinks should never occur in memory as they are
> +	 * never alllowed to exist on disk.
> +	 */
> +	if (!size)
> +		return __this_address;
>  
>  	/* No negative sizes or overly long symlink targets. */
>  	if (size < 0 || size > XFS_SYMLINK_MAXLEN)
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 3783afcb68d2..e0c2d460ba67 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -194,6 +194,7 @@ xfs_symlink(
>  	pathlen = strlen(target_path);
>  	if (pathlen >= XFS_SYMLINK_MAXLEN)      /* total string too long */
>  		return -ENAMETOOLONG;
> +	ASSERT(pathlen > 0);
>  
>  	udqp = gdqp = NULL;
>  	prid = xfs_get_initial_prid(dp);
> @@ -395,6 +396,12 @@ xfs_symlink(
>  
>  /*
>   * Free a symlink that has blocks associated with it.
> + *
> + * Note: zero length symlinks are not allowed to exist. When we set the size to
> + * zero, also change it to a regular file so that it does not get written to
> + * disk as a zero length symlink. The inode is on the unlinked list already, so
> + * userspace cannot find this inode anymore, so this change is not user visible
> + * but allows us to catch corrupt zero-length symlinks in the verifiers.
>   */
>  STATIC int
>  xfs_inactive_symlink_rmt(
> @@ -431,13 +438,14 @@ xfs_inactive_symlink_rmt(
>  	xfs_trans_ijoin(tp, ip, 0);
>  
>  	/*
> -	 * Lock the inode, fix the size, and join it to the transaction.
> -	 * Hold it so in the normal path, we still have it locked for
> -	 * the second transaction.  In the error paths we need it
> +	 * Lock the inode, fix the size, turn it into a regular file and join it
> +	 * to the transaction.  Hold it so in the normal path, we still have it
> +	 * locked for the second transaction.  In the error paths we need it
>  	 * held so the cancel won't rele it, see below.
>  	 */
>  	size = (int)ip->i_d.di_size;
>  	ip->i_d.di_size = 0;
> +	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
>  	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
>  	/*
>  	 * Find the block(s) so we can inval and unmap them.
> @@ -523,17 +531,10 @@ xfs_inactive_symlink(
>  		return -EIO;
>  
>  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> -
> -	/*
> -	 * Zero length symlinks _can_ exist.
> -	 */
>  	pathlen = (int)ip->i_d.di_size;
> -	if (!pathlen) {
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		return 0;
> -	}
> +	ASSERT(pathlen);

Ok, but I still feel like I'm missing something here. The commit log
explains how symlink inode inactivation creates this transient zero-size
symlink state. That's fine, but why do we have code to check for such
symlinks _in_ the symlink activation path?

Is the assumption that this is just an overzealous/unnecessary check, or
am I missing something? If the former, an extra sentence in the last
paragraph or so of the commit log to explain why we remove this would be
nice.

>  
> -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
>  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
>  			 __func__, (unsigned long long)ip->i_ino, pathlen);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> @@ -541,12 +542,12 @@ xfs_inactive_symlink(
>  		return -EFSCORRUPTED;
>  	}
>  
> +	/*
> +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> +	 * do here in that case.
> +	 */

It looks like that actually happens in xfs_ifree() (not xfs_difree()).

Otherwise this seems a reasonable approach to me.

Brian

>  	if (ip->i_df.if_flags & XFS_IFINLINE) {
> -		if (ip->i_df.if_bytes > 0) 
> -			xfs_idata_realloc(ip, -(ip->i_df.if_bytes),
> -					  XFS_DATA_FORK);
>  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		ASSERT(ip->i_df.if_bytes == 0);
>  		return 0;
>  	}
>  
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
  2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
@ 2018-06-18 13:24   ` Brian Foster
  2018-06-19  5:05   ` Darrick J. Wong
  1 sibling, 0 replies; 18+ messages in thread
From: Brian Foster @ 2018-06-18 13:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 18, 2018 at 03:57:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a corrupt inode is detected during xfs_iflush_cluster, we can
> get a shutdown ASSERT failure like this:
> 
> XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
> XFS (pmem1): Unmount and run xfs_repair
> XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
> XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
> XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
> XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
> XFS (pmem1): Please umount the filesystem and rectify the problem(s)
> XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
> .....
> Call Trace:
>  xfs_iflush_abort+0x10a/0x110
>  xfs_iflush+0xf3/0x390
>  xfs_inode_item_push+0x126/0x1e0
>  xfsaild+0x2c5/0x890
>  kthread+0x11c/0x140
>  ret_from_fork+0x24/0x30
> 
> Essentially, xfs_iflush_abort() has been called twice on the
> original inode that that was flushed. This happens because the
> inode has been flushed to teh buffer successfully via
> xfs_iflush_int(), and so when another inode is detected as corrupt
> in xfs_iflush_cluster, the buffer is marked stale and EIO, and
> iodone callbacks are run on it.
> 
> Running the iodone callbacks walks across the original inode and
> calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
> to xfs_iflush(), it runs the error path for that function, and that
> calls xfs_iflush_abort() on the inode a second time, leading to the
> above assert failure as the inode is not flush locked anymore.
> 
> This bug has been there a long time.
> 
> The simple fix would be to just avoid calling xfs_iflush_abort() in
> xfs_iflush() if we've got a failure from xfs_iflush_cluster().
> However, xfs_iflush_cluster() has magic delwri buffer handling that
> means it may or may not have run IO completion on the buffer, and
> hence sometimes we have to call xfs_iflush_abort() from
> xfs_iflush(), and sometimes we shouldn't.
> 
> After reading through all the error paths and the delwri buffer
> code, it's clear that the error handling in xfs_iflush_cluster() is
> unnecessary. If the buffer is delwri, it leaves it on the delwri
> list so that when the delwri list is submitted it sees a shutdown
> fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
> and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
> when it's not a delwri buffer. Further, marking a buffer stale
> clears the _XBF_DELWRI_Q flag on the buffer, which means when
> submission of the buffer occurs, it just skips over it and releases
> it.
> 
> IOWs, the error handling in xfs_iflush_cluster doesn't need to care
> if the buffer is already on a the delwri queue or not - it just
> needs to mark the buffer stale, EIO and run completions. That means
> we can just use the easy fix for xfs_iflush() to avoid the double
> abort.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---

Looks good to me:

Reviewed-by: Brian Foster <bfoster@redhat.com>

>  fs/xfs/xfs_inode.c | 57 +++++++++++++++++-----------------------------
>  1 file changed, 21 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4a2e5e13c569..4a9b90cfb8c3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3236,7 +3236,6 @@ xfs_iflush_cluster(
>  	struct xfs_inode	*cip;
>  	int			nr_found;
>  	int			clcount = 0;
> -	int			bufwasdelwri;
>  	int			i;
>  
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> @@ -3360,37 +3359,22 @@ xfs_iflush_cluster(
>  	 * inode buffer and shut down the filesystem.
>  	 */
>  	rcu_read_unlock();
> -	/*
> -	 * Clean up the buffer.  If it was delwri, just release it --
> -	 * brelse can handle it with no problems.  If not, shut down the
> -	 * filesystem before releasing the buffer.
> -	 */
> -	bufwasdelwri = (bp->b_flags & _XBF_DELWRI_Q);
> -	if (bufwasdelwri)
> -		xfs_buf_relse(bp);
> -
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  
> -	if (!bufwasdelwri) {
> -		/*
> -		 * Just like incore_relse: if we have b_iodone functions,
> -		 * mark the buffer as an error and call them.  Otherwise
> -		 * mark it as stale and brelse.
> -		 */
> -		if (bp->b_iodone) {
> -			bp->b_flags &= ~XBF_DONE;
> -			xfs_buf_stale(bp);
> -			xfs_buf_ioerror(bp, -EIO);
> -			xfs_buf_ioend(bp);
> -		} else {
> -			xfs_buf_stale(bp);
> -			xfs_buf_relse(bp);
> -		}
> -	}
> -
>  	/*
> -	 * Unlocks the flush lock
> +	 * We'll always have an inode attached to the buffer for completion
> +	 * process by the time we are called from xfs_iflush(). Hence we have
> +	 * always need to do IO completion processing to abort the inodes
> +	 * attached to the buffer.  handle them just like the shutdown case in
> +	 * xfs_buf_submit().
>  	 */
> +	ASSERT(bp->b_iodone);
> +	bp->b_flags &= ~XBF_DONE;
> +	xfs_buf_stale(bp);
> +	xfs_buf_ioerror(bp, -EIO);
> +	xfs_buf_ioend(bp);
> +
> +	/* abort the corrupt inode, as it was not attached to the buffer */
>  	xfs_iflush_abort(cip, false);
>  	kmem_free(cilist);
>  	xfs_perag_put(pag);
> @@ -3486,12 +3470,17 @@ xfs_iflush(
>  		xfs_log_force(mp, 0);
>  
>  	/*
> -	 * inode clustering:
> -	 * see if other inodes can be gathered into this write
> +	 * inode clustering: try to gather other inodes into this write
> +	 *
> +	 * Note: Any error during clustering will result in the filesystem
> +	 * being shut down and completion callbacks run on the cluster buffer.
> +	 * As we have already flushed and attached this inode to the buffer,
> +	 * it has already been aborted and released by xfs_iflush_cluster() and
> +	 * so we have no further error handling to do here.
>  	 */
>  	error = xfs_iflush_cluster(ip, bp);
>  	if (error)
> -		goto cluster_corrupt_out;
> +		return error;
>  
>  	*bpp = bp;
>  	return 0;
> @@ -3500,12 +3489,8 @@ xfs_iflush(
>  	if (bp)
>  		xfs_buf_relse(bp);
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -cluster_corrupt_out:
> -	error = -EFSCORRUPTED;
>  abort_out:
> -	/*
> -	 * Unlocks the flush lock
> -	 */
> +	/* abort the corrupt inode, as it was not attached to the buffer */
>  	xfs_iflush_abort(ip, false);
>  	return error;
>  }
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-18 13:24   ` Brian Foster
@ 2018-06-18 22:42     ` Dave Chinner
  2018-06-19 11:54       ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2018-06-18 22:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Jun 18, 2018 at 09:24:11AM -0400, Brian Foster wrote:
> On Mon, Jun 18, 2018 at 03:57:10PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > A log recovery failure has been reproduced where a symlink inode has
> > a zero length in extent form. It was caused by a shutdown during a
> > combined fstress+fsmark workload.
> > 
> > The underlying problem is the issue in xfs_inactive_symlink(): the
> > inode is unlocked between the symlink inactivation/truncation and
> > the inode being freed. This opens a window for the inode to be
> > written to disk before it xfs_ifree() removes it from the unlinked
> > list, marks it free in the inobt and zeros the mode.
> > 
> > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > fork state, so there's no need to do it in xfs_inactive_symlink().
> > This means the shortform fork verifier will not see a zero length
> > data fork as it mirrors the inode size through to xfs_ifree()), and
> > hence if the inode gets written back and the fork verifiers are run
> > they will still see a fork that matches the on-disk inode size.
> > 
> > For extent form (remote) symlinks, it is a little more tricky. Here
> > we explicitly set the inode size to zero, so the above race can lead
> > to zero length symlinks on disk. Because the inode is unlinked at
> > this point (i.e. on the unlinked list) and unreferenced, it can
> > never be seen again by a user. Hence when we set the inode size to
> 
> Perhaps we don't care, but what about bulkstat? The inode free is
> imminent, so it probably doesn't matter that much.

Bulkstat users have to deal with all sorts of TOCTOU races because
we do the inode lookup after unlocking the AGI. Hence the inodes
returned are not guaranteed to be allocated or even accessible,
especially if they return nlink == 0...

So, yeah, I don't think it matters for bulkstat.

> > @@ -523,17 +531,10 @@ xfs_inactive_symlink(
> >  		return -EIO;
> >  
> >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > -
> > -	/*
> > -	 * Zero length symlinks _can_ exist.
> > -	 */
> >  	pathlen = (int)ip->i_d.di_size;
> > -	if (!pathlen) {
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		return 0;
> > -	}
> > +	ASSERT(pathlen);
> 
> Ok, but I still feel like I'm missing something here. The commit log
> explains how symlink inode inactivation creates this transient zero-size
> symlink state. That's fine, but why do we have code to check for such
> symlinks _in_ the symlink activation path?

Ah, I put that in the cover letter - the VFS is supposed to ensure
we don't get zero length symlinks passed to us, and the verifiers
are supposed to catch it coming off disk. This is mostly just a
bounds check to document that we don't expect zero length symlinks
at any time, especially in the unlink path. I put the same assert
check in the xfs_symlink() create path, too, to ensure we don't get
passed a zero length from the VFS...

> Is the assumption that this is just an overzealous/unnecessary check, or
> am I missing something? If the former, an extra sentence in the last
> paragraph or so of the commit log to explain why we remove this would be
> nice.

It's a "should never occur" situation but, well, paranoia seeing as
we've got this wrong for so long...

> > -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> >  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
> >  			 __func__, (unsigned long long)ip->i_ino, pathlen);
> >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > @@ -541,12 +542,12 @@ xfs_inactive_symlink(
> >  		return -EFSCORRUPTED;
> >  	}
> >  
> > +	/*
> > +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> > +	 * do here in that case.
> > +	 */
> 
> It looks like that actually happens in xfs_ifree() (not xfs_difree()).

Yup, will fix.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure
  2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
  2018-06-18 13:24   ` Brian Foster
@ 2018-06-19  5:05   ` Darrick J. Wong
  1 sibling, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2018-06-19  5:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Mon, Jun 18, 2018 at 03:57:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> When a corrupt inode is detected during xfs_iflush_cluster, we can
> get a shutdown ASSERT failure like this:
> 
> XFS (pmem1): Metadata corruption detected at xfs_symlink_shortform_verify+0x5c/0xa0, inode 0x86627 data fork
> XFS (pmem1): Unmount and run xfs_repair
> XFS (pmem1): xfs_do_force_shutdown(0x8) called from line 3372 of file fs/xfs/xfs_inode.c.  Return address = ffffffff814f4116
> XFS (pmem1): Corruption of in-memory data detected.  Shutting down filesystem
> XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8a88
> XFS (pmem1): xfs_do_force_shutdown(0x1) called from line 222 of file fs/xfs/libxfs/xfs_defer.c.  Return address = ffffffff814a8ef9
> XFS (pmem1): Please umount the filesystem and rectify the problem(s)
> XFS: Assertion failed: xfs_isiflocked(ip), file: fs/xfs/xfs_inode.h, line: 258
> .....
> Call Trace:
>  xfs_iflush_abort+0x10a/0x110
>  xfs_iflush+0xf3/0x390
>  xfs_inode_item_push+0x126/0x1e0
>  xfsaild+0x2c5/0x890
>  kthread+0x11c/0x140
>  ret_from_fork+0x24/0x30
> 
> Essentially, xfs_iflush_abort() has been called twice on the
> original inode that that was flushed. This happens because the
> inode has been flushed to teh buffer successfully via
> xfs_iflush_int(), and so when another inode is detected as corrupt
> in xfs_iflush_cluster, the buffer is marked stale and EIO, and
> iodone callbacks are run on it.
> 
> Running the iodone callbacks walks across the original inode and
> calls xfs_iflush_abort() on it. When xfs_iflush_cluster() returns
> to xfs_iflush(), it runs the error path for that function, and that
> calls xfs_iflush_abort() on the inode a second time, leading to the
> above assert failure as the inode is not flush locked anymore.
> 
> This bug has been there a long time.
> 
> The simple fix would be to just avoid calling xfs_iflush_abort() in
> xfs_iflush() if we've got a failure from xfs_iflush_cluster().
> However, xfs_iflush_cluster() has magic delwri buffer handling that
> means it may or may not have run IO completion on the buffer, and
> hence sometimes we have to call xfs_iflush_abort() from
> xfs_iflush(), and sometimes we shouldn't.
> 
> After reading through all the error paths and the delwri buffer
> code, it's clear that the error handling in xfs_iflush_cluster() is
> unnecessary. If the buffer is delwri, it leaves it on the delwri
> list so that when the delwri list is submitted it sees a shutdown
> fliesystem in xfs_buf_submit() and that marks the buffer stale, EIO
> and runs IO completion. i.e. exactly what xfs+iflush_cluster() does
> when it's not a delwri buffer. Further, marking a buffer stale
> clears the _XBF_DELWRI_Q flag on the buffer, which means when
> submission of the buffer occurs, it just skips over it and releases
> it.
> 
> IOWs, the error handling in xfs_iflush_cluster doesn't need to care
> if the buffer is already on a the delwri queue or not - it just
> needs to mark the buffer stale, EIO and run completions. That means
> we can just use the easy fix for xfs_iflush() to avoid the double
> abort.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Looks ok, will start working on a -fixes branch...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/xfs_inode.c | 57 +++++++++++++++++-----------------------------
>  1 file changed, 21 insertions(+), 36 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4a2e5e13c569..4a9b90cfb8c3 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3236,7 +3236,6 @@ xfs_iflush_cluster(
>  	struct xfs_inode	*cip;
>  	int			nr_found;
>  	int			clcount = 0;
> -	int			bufwasdelwri;
>  	int			i;
>  
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> @@ -3360,37 +3359,22 @@ xfs_iflush_cluster(
>  	 * inode buffer and shut down the filesystem.
>  	 */
>  	rcu_read_unlock();
> -	/*
> -	 * Clean up the buffer.  If it was delwri, just release it --
> -	 * brelse can handle it with no problems.  If not, shut down the
> -	 * filesystem before releasing the buffer.
> -	 */
> -	bufwasdelwri = (bp->b_flags & _XBF_DELWRI_Q);
> -	if (bufwasdelwri)
> -		xfs_buf_relse(bp);
> -
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
>  
> -	if (!bufwasdelwri) {
> -		/*
> -		 * Just like incore_relse: if we have b_iodone functions,
> -		 * mark the buffer as an error and call them.  Otherwise
> -		 * mark it as stale and brelse.
> -		 */
> -		if (bp->b_iodone) {
> -			bp->b_flags &= ~XBF_DONE;
> -			xfs_buf_stale(bp);
> -			xfs_buf_ioerror(bp, -EIO);
> -			xfs_buf_ioend(bp);
> -		} else {
> -			xfs_buf_stale(bp);
> -			xfs_buf_relse(bp);
> -		}
> -	}
> -
>  	/*
> -	 * Unlocks the flush lock
> +	 * We'll always have an inode attached to the buffer for completion
> +	 * process by the time we are called from xfs_iflush(). Hence we have
> +	 * always need to do IO completion processing to abort the inodes
> +	 * attached to the buffer.  handle them just like the shutdown case in
> +	 * xfs_buf_submit().
>  	 */
> +	ASSERT(bp->b_iodone);
> +	bp->b_flags &= ~XBF_DONE;
> +	xfs_buf_stale(bp);
> +	xfs_buf_ioerror(bp, -EIO);
> +	xfs_buf_ioend(bp);
> +
> +	/* abort the corrupt inode, as it was not attached to the buffer */
>  	xfs_iflush_abort(cip, false);
>  	kmem_free(cilist);
>  	xfs_perag_put(pag);
> @@ -3486,12 +3470,17 @@ xfs_iflush(
>  		xfs_log_force(mp, 0);
>  
>  	/*
> -	 * inode clustering:
> -	 * see if other inodes can be gathered into this write
> +	 * inode clustering: try to gather other inodes into this write
> +	 *
> +	 * Note: Any error during clustering will result in the filesystem
> +	 * being shut down and completion callbacks run on the cluster buffer.
> +	 * As we have already flushed and attached this inode to the buffer,
> +	 * it has already been aborted and released by xfs_iflush_cluster() and
> +	 * so we have no further error handling to do here.
>  	 */
>  	error = xfs_iflush_cluster(ip, bp);
>  	if (error)
> -		goto cluster_corrupt_out;
> +		return error;
>  
>  	*bpp = bp;
>  	return 0;
> @@ -3500,12 +3489,8 @@ xfs_iflush(
>  	if (bp)
>  		xfs_buf_relse(bp);
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> -cluster_corrupt_out:
> -	error = -EFSCORRUPTED;
>  abort_out:
> -	/*
> -	 * Unlocks the flush lock
> -	 */
> +	/* abort the corrupt inode, as it was not attached to the buffer */
>  	xfs_iflush_abort(ip, false);
>  	return error;
>  }
> -- 
> 2.17.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-18 22:42     ` Dave Chinner
@ 2018-06-19 11:54       ` Brian Foster
  2018-06-19 15:48         ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-19 11:54 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> On Mon, Jun 18, 2018 at 09:24:11AM -0400, Brian Foster wrote:
> > On Mon, Jun 18, 2018 at 03:57:10PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > A log recovery failure has been reproduced where a symlink inode has
> > > a zero length in extent form. It was caused by a shutdown during a
> > > combined fstress+fsmark workload.
> > > 
> > > The underlying problem is the issue in xfs_inactive_symlink(): the
> > > inode is unlocked between the symlink inactivation/truncation and
> > > the inode being freed. This opens a window for the inode to be
> > > written to disk before it xfs_ifree() removes it from the unlinked
> > > list, marks it free in the inobt and zeros the mode.
> > > 
> > > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > > fork state, so there's no need to do it in xfs_inactive_symlink().
> > > This means the shortform fork verifier will not see a zero length
> > > data fork as it mirrors the inode size through to xfs_ifree()), and
> > > hence if the inode gets written back and the fork verifiers are run
> > > they will still see a fork that matches the on-disk inode size.
> > > 
> > > For extent form (remote) symlinks, it is a little more tricky. Here
> > > we explicitly set the inode size to zero, so the above race can lead
> > > to zero length symlinks on disk. Because the inode is unlinked at
> > > this point (i.e. on the unlinked list) and unreferenced, it can
> > > never be seen again by a user. Hence when we set the inode size to
> > 
> > Perhaps we don't care, but what about bulkstat? The inode free is
> > imminent, so it probably doesn't matter that much.
> 
> Bulkstat users have to deal with all sorts of TOCTOU races because
> we do the inode lookup after unlocking the AGI. Hence the inodes
> returned are not guaranteed to be allocated or even accessible,
> especially if they return nlink == 0...
> 
> So, yeah, I don't think it matters for bulkstat.
> 

Ack.

> > > @@ -523,17 +531,10 @@ xfs_inactive_symlink(
> > >  		return -EIO;
> > >  
> > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > -
> > > -	/*
> > > -	 * Zero length symlinks _can_ exist.
> > > -	 */
> > >  	pathlen = (int)ip->i_d.di_size;
> > > -	if (!pathlen) {
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -		return 0;
> > > -	}
> > > +	ASSERT(pathlen);
> > 
> > Ok, but I still feel like I'm missing something here. The commit log
> > explains how symlink inode inactivation creates this transient zero-size
> > symlink state. That's fine, but why do we have code to check for such
> > symlinks _in_ the symlink activation path?
> 
> Ah, I put that in the cover letter - the VFS is supposed to ensure
> we don't get zero length symlinks passed to us, and the verifiers
> are supposed to catch it coming off disk. This is mostly just a
> bounds check to document that we don't expect zero length symlinks
> at any time, especially in the unlink path. I put the same assert
> check in the xfs_symlink() create path, too, to ensure we don't get
> passed a zero length from the VFS...
> 

Oh, I'm not referring to the ASSERT() here. That all seems fine...

> > Is the assumption that this is just an overzealous/unnecessary check, or
> > am I missing something? If the former, an extra sentence in the last
> > paragraph or so of the commit log to explain why we remove this would be
> > nice.
> 
> It's a "should never occur" situation but, well, paranoia seeing as
> we've got this wrong for so long...
> 

Rather, I'm asking about the pre-existing code that we remove. The hunk
just above from xfs_inactive_symlink():

-	/*
-	 * Zero length symlinks _can_ exist.
-	 */
-	if (!pathlen) {
-		xfs_iunlock(ip, XFS_ILOCK_EXCL);
-		return 0;
-	}

... obviously suggests that there could have been a situation where we'd
see a zero-length symlink on entry to xfs_inactive_symlink(). This
patch, however, focuses on fixing the transient zero-length symlink
caused by xfs_inactive_symlink_rmt() (which comes after the above
check).

The latter implies we'd never enter the symlink inactivate path with a
zero-length symlink, since this is the path that causes it (until the
inode is ultimately freed). So my question essentially is do we have any
idea why this code exists in the first place? Is it just bogus given
there's no other clear vector to a zero-length symlink before the inode
is inactivated?

Brian

> > > -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > > +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > >  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
> > >  			 __func__, (unsigned long long)ip->i_ino, pathlen);
> > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > @@ -541,12 +542,12 @@ xfs_inactive_symlink(
> > >  		return -EFSCORRUPTED;
> > >  	}
> > >  
> > > +	/*
> > > +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> > > +	 * do here in that case.
> > > +	 */
> > 
> > It looks like that actually happens in xfs_ifree() (not xfs_difree()).
> 
> Yup, will fix.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-19 11:54       ` Brian Foster
@ 2018-06-19 15:48         ` Darrick J. Wong
  2018-06-19 16:28           ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2018-06-19 15:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > On Mon, Jun 18, 2018 at 09:24:11AM -0400, Brian Foster wrote:
> > > On Mon, Jun 18, 2018 at 03:57:10PM +1000, Dave Chinner wrote:
> > > > From: Dave Chinner <dchinner@redhat.com>
> > > > 
> > > > A log recovery failure has been reproduced where a symlink inode has
> > > > a zero length in extent form. It was caused by a shutdown during a
> > > > combined fstress+fsmark workload.
> > > > 
> > > > The underlying problem is the issue in xfs_inactive_symlink(): the
> > > > inode is unlocked between the symlink inactivation/truncation and
> > > > the inode being freed. This opens a window for the inode to be
> > > > written to disk before it xfs_ifree() removes it from the unlinked
> > > > list, marks it free in the inobt and zeros the mode.
> > > > 
> > > > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > > > fork state, so there's no need to do it in xfs_inactive_symlink().
> > > > This means the shortform fork verifier will not see a zero length
> > > > data fork as it mirrors the inode size through to xfs_ifree()), and
> > > > hence if the inode gets written back and the fork verifiers are run
> > > > they will still see a fork that matches the on-disk inode size.
> > > > 
> > > > For extent form (remote) symlinks, it is a little more tricky. Here
> > > > we explicitly set the inode size to zero, so the above race can lead
> > > > to zero length symlinks on disk. Because the inode is unlinked at
> > > > this point (i.e. on the unlinked list) and unreferenced, it can
> > > > never be seen again by a user. Hence when we set the inode size to
> > > 
> > > Perhaps we don't care, but what about bulkstat? The inode free is
> > > imminent, so it probably doesn't matter that much.
> > 
> > Bulkstat users have to deal with all sorts of TOCTOU races because
> > we do the inode lookup after unlocking the AGI. Hence the inodes
> > returned are not guaranteed to be allocated or even accessible,
> > especially if they return nlink == 0...
> > 
> > So, yeah, I don't think it matters for bulkstat.
> > 
> 
> Ack.
> 
> > > > @@ -523,17 +531,10 @@ xfs_inactive_symlink(
> > > >  		return -EIO;
> > > >  
> > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > -
> > > > -	/*
> > > > -	 * Zero length symlinks _can_ exist.
> > > > -	 */
> > > >  	pathlen = (int)ip->i_d.di_size;
> > > > -	if (!pathlen) {
> > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > -		return 0;
> > > > -	}
> > > > +	ASSERT(pathlen);
> > > 
> > > Ok, but I still feel like I'm missing something here. The commit log
> > > explains how symlink inode inactivation creates this transient zero-size
> > > symlink state. That's fine, but why do we have code to check for such
> > > symlinks _in_ the symlink activation path?
> > 
> > Ah, I put that in the cover letter - the VFS is supposed to ensure
> > we don't get zero length symlinks passed to us, and the verifiers
> > are supposed to catch it coming off disk. This is mostly just a
> > bounds check to document that we don't expect zero length symlinks
> > at any time, especially in the unlink path. I put the same assert
> > check in the xfs_symlink() create path, too, to ensure we don't get
> > passed a zero length from the VFS...
> > 
> 
> Oh, I'm not referring to the ASSERT() here. That all seems fine...
> 
> > > Is the assumption that this is just an overzealous/unnecessary check, or
> > > am I missing something? If the former, an extra sentence in the last
> > > paragraph or so of the commit log to explain why we remove this would be
> > > nice.
> > 
> > It's a "should never occur" situation but, well, paranoia seeing as
> > we've got this wrong for so long...
> > 
> 
> Rather, I'm asking about the pre-existing code that we remove. The hunk
> just above from xfs_inactive_symlink():
> 
> -	/*
> -	 * Zero length symlinks _can_ exist.
> -	 */
> -	if (!pathlen) {
> -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -		return 0;
> -	}
> 
> ... obviously suggests that there could have been a situation where we'd
> see a zero-length symlink on entry to xfs_inactive_symlink(). This
> patch, however, focuses on fixing the transient zero-length symlink
> caused by xfs_inactive_symlink_rmt() (which comes after the above
> check).
> 
> The latter implies we'd never enter the symlink inactivate path with a
> zero-length symlink, since this is the path that causes it (until the
> inode is ultimately freed). So my question essentially is do we have any
> idea why this code exists in the first place? Is it just bogus given
> there's no other clear vector to a zero-length symlink before the inode
> is inactivated?

I think that vector is log recovery.  Say we have a filesystem running
on an old kernel, and the fs goes down immediately after the transaction
commit in xfs_inactive_symlink_rmt.  At that exact moment we have a
zero-size extents format symlink on the unlinked list, and if we boot up
with a new kernel and go through log recovery we'll try to free this
inode in xlog_recover_process_one_iunlink.

The new code that switches i_mode prevents us from dropping more of
these onto the disk, but we still have old filesystems to deal with.
FWIW every now and then generic/475 complains in xfs_repair -n about
such a symlink (dfork format 2 and size 0).

--D

> Brian
> 
> > > > -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > > > +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > > >  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
> > > >  			 __func__, (unsigned long long)ip->i_ino, pathlen);
> > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > @@ -541,12 +542,12 @@ xfs_inactive_symlink(
> > > >  		return -EFSCORRUPTED;
> > > >  	}
> > > >  
> > > > +	/*
> > > > +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> > > > +	 * do here in that case.
> > > > +	 */
> > > 
> > > It looks like that actually happens in xfs_ifree() (not xfs_difree()).
> > 
> > Yup, will fix.
> > 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-19 15:48         ` Darrick J. Wong
@ 2018-06-19 16:28           ` Brian Foster
  2018-06-19 23:22             ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-19 16:28 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > On Mon, Jun 18, 2018 at 09:24:11AM -0400, Brian Foster wrote:
> > > > On Mon, Jun 18, 2018 at 03:57:10PM +1000, Dave Chinner wrote:
> > > > > From: Dave Chinner <dchinner@redhat.com>
> > > > > 
> > > > > A log recovery failure has been reproduced where a symlink inode has
> > > > > a zero length in extent form. It was caused by a shutdown during a
> > > > > combined fstress+fsmark workload.
> > > > > 
> > > > > The underlying problem is the issue in xfs_inactive_symlink(): the
> > > > > inode is unlocked between the symlink inactivation/truncation and
> > > > > the inode being freed. This opens a window for the inode to be
> > > > > written to disk before it xfs_ifree() removes it from the unlinked
> > > > > list, marks it free in the inobt and zeros the mode.
> > > > > 
> > > > > For shortform inodes, the fix is simple. xfs_ifree() clears the data
> > > > > fork state, so there's no need to do it in xfs_inactive_symlink().
> > > > > This means the shortform fork verifier will not see a zero length
> > > > > data fork as it mirrors the inode size through to xfs_ifree()), and
> > > > > hence if the inode gets written back and the fork verifiers are run
> > > > > they will still see a fork that matches the on-disk inode size.
> > > > > 
> > > > > For extent form (remote) symlinks, it is a little more tricky. Here
> > > > > we explicitly set the inode size to zero, so the above race can lead
> > > > > to zero length symlinks on disk. Because the inode is unlinked at
> > > > > this point (i.e. on the unlinked list) and unreferenced, it can
> > > > > never be seen again by a user. Hence when we set the inode size to
> > > > 
> > > > Perhaps we don't care, but what about bulkstat? The inode free is
> > > > imminent, so it probably doesn't matter that much.
> > > 
> > > Bulkstat users have to deal with all sorts of TOCTOU races because
> > > we do the inode lookup after unlocking the AGI. Hence the inodes
> > > returned are not guaranteed to be allocated or even accessible,
> > > especially if they return nlink == 0...
> > > 
> > > So, yeah, I don't think it matters for bulkstat.
> > > 
> > 
> > Ack.
> > 
> > > > > @@ -523,17 +531,10 @@ xfs_inactive_symlink(
> > > > >  		return -EIO;
> > > > >  
> > > > >  	xfs_ilock(ip, XFS_ILOCK_EXCL);
> > > > > -
> > > > > -	/*
> > > > > -	 * Zero length symlinks _can_ exist.
> > > > > -	 */
> > > > >  	pathlen = (int)ip->i_d.di_size;
> > > > > -	if (!pathlen) {
> > > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > -		return 0;
> > > > > -	}
> > > > > +	ASSERT(pathlen);
> > > > 
> > > > Ok, but I still feel like I'm missing something here. The commit log
> > > > explains how symlink inode inactivation creates this transient zero-size
> > > > symlink state. That's fine, but why do we have code to check for such
> > > > symlinks _in_ the symlink activation path?
> > > 
> > > Ah, I put that in the cover letter - the VFS is supposed to ensure
> > > we don't get zero length symlinks passed to us, and the verifiers
> > > are supposed to catch it coming off disk. This is mostly just a
> > > bounds check to document that we don't expect zero length symlinks
> > > at any time, especially in the unlink path. I put the same assert
> > > check in the xfs_symlink() create path, too, to ensure we don't get
> > > passed a zero length from the VFS...
> > > 
> > 
> > Oh, I'm not referring to the ASSERT() here. That all seems fine...
> > 
> > > > Is the assumption that this is just an overzealous/unnecessary check, or
> > > > am I missing something? If the former, an extra sentence in the last
> > > > paragraph or so of the commit log to explain why we remove this would be
> > > > nice.
> > > 
> > > It's a "should never occur" situation but, well, paranoia seeing as
> > > we've got this wrong for so long...
> > > 
> > 
> > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > just above from xfs_inactive_symlink():
> > 
> > -	/*
> > -	 * Zero length symlinks _can_ exist.
> > -	 */
> > -	if (!pathlen) {
> > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > -		return 0;
> > -	}
> > 
> > ... obviously suggests that there could have been a situation where we'd
> > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > patch, however, focuses on fixing the transient zero-length symlink
> > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > check).
> > 
> > The latter implies we'd never enter the symlink inactivate path with a
> > zero-length symlink, since this is the path that causes it (until the
> > inode is ultimately freed). So my question essentially is do we have any
> > idea why this code exists in the first place? Is it just bogus given
> > there's no other clear vector to a zero-length symlink before the inode
> > is inactivated?
> 
> I think that vector is log recovery.  Say we have a filesystem running
> on an old kernel, and the fs goes down immediately after the transaction
> commit in xfs_inactive_symlink_rmt.  At that exact moment we have a
> zero-size extents format symlink on the unlinked list, and if we boot up
> with a new kernel and go through log recovery we'll try to free this
> inode in xlog_recover_process_one_iunlink.
> 

Ah, Ok. So log recovery will look up the inode and run it through
the inactive sequence again. We presumably remove this code because with
this patch that should no longer happen. Keeping it around doesn't help
us because presumably the verifier would fail before we get to this
point anyways. Makes sense, thanks Darrick.

> The new code that switches i_mode prevents us from dropping more of
> these onto the disk, but we still have old filesystems to deal with.
> FWIW every now and then generic/475 complains in xfs_repair -n about
> such a symlink (dfork format 2 and size 0).
> 

So it sounds like the caveat with this approach then is that we haven't
necessarily provided a recovery path for somebody stuck in this
particular failed recovery state, outside of 'xfs_repair -L'. :/ Unless
I'm misunderstanding the error, that does seem a bit unfortunate given
that otherwise recovery should complete.

Brian

> --D
> 
> > Brian
> > 
> > > > > -	if (pathlen < 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > > > > +	if (pathlen <= 0 || pathlen > XFS_SYMLINK_MAXLEN) {
> > > > >  		xfs_alert(mp, "%s: inode (0x%llx) bad symlink length (%d)",
> > > > >  			 __func__, (unsigned long long)ip->i_ino, pathlen);
> > > > >  		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > @@ -541,12 +542,12 @@ xfs_inactive_symlink(
> > > > >  		return -EFSCORRUPTED;
> > > > >  	}
> > > > >  
> > > > > +	/*
> > > > > +	 * Inline fork state gets removed by xfs_difree() so we have nothing to
> > > > > +	 * do here in that case.
> > > > > +	 */
> > > > 
> > > > It looks like that actually happens in xfs_ifree() (not xfs_difree()).
> > > 
> > > Yup, will fix.
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-19 16:28           ` Brian Foster
@ 2018-06-19 23:22             ` Dave Chinner
  2018-06-20 11:50               ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2018-06-19 23:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Tue, Jun 19, 2018 at 12:28:40PM -0400, Brian Foster wrote:
> On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> > On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > > just above from xfs_inactive_symlink():
> > > 
> > > -	/*
> > > -	 * Zero length symlinks _can_ exist.
> > > -	 */
> > > -	if (!pathlen) {
> > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > -		return 0;
> > > -	}
> > > 
> > > ... obviously suggests that there could have been a situation where we'd
> > > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > > patch, however, focuses on fixing the transient zero-length symlink
> > > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > > check).

It can't happen through normal VFS paths, and I don't think it can
happen through log recovery. That's why I replaced this with an ASSERT.

In the normal behaviour case, zero length symlinks were created
/after/ this point in time, and we've always been unable to read
such inodes because xfs_dinode_verify() has always rejected zero
length symlink inodes.

However, log recovery doesn't trip over the transient inode state
because it does not use xfs_dinode_verify() for inode recovery - it
reads the buffer with xfs_inode_buf_ops, and that just checks the
inode numbers and then recovers whatever is in the log over the top.
It never checks for zero length symlinks on recovery, and it never
goes through the dinode verifier (on read or write!) to catch this.

It's not until we then recover the remaining intent chain that we go
through xfs_iget/xfs_iread/xfs_inactive/xfs_ifree, and that
xfs_iget() call runs xfs_dinode_verify(). If we've already recovered
any part of the remote symlink removal intent chain (and we must
have to be replaying EFIs!) this should see a zero length symlink
inode. AIUI, we have no evidence that the verifier has ever fired at
this point in time, even when recovering known transient zero length
states.

i.e all the cases I've seen where repair complains about symlinks
with "dfork format 2 and size 0" it is because the log is dirty and
hasn't been replayed. Mounting the filesystem and running log
recovery makes the problem go away, and this is what lead to me
removing the zeor length handling - the verifier should already
be firing if it is recovering an intent on a zero length symlink
inode...

That said, I'll try some more focussed testing with intentional
corruption to see if it's just a case of my testing being (un)lucky
and not triggering this issue...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-19 23:22             ` Dave Chinner
@ 2018-06-20 11:50               ` Brian Foster
  2018-06-20 22:59                 ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-20 11:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jun 20, 2018 at 09:22:42AM +1000, Dave Chinner wrote:
> On Tue, Jun 19, 2018 at 12:28:40PM -0400, Brian Foster wrote:
> > On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> > > On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > > > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > > > just above from xfs_inactive_symlink():
> > > > 
> > > > -	/*
> > > > -	 * Zero length symlinks _can_ exist.
> > > > -	 */
> > > > -	if (!pathlen) {
> > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > -		return 0;
> > > > -	}
> > > > 
> > > > ... obviously suggests that there could have been a situation where we'd
> > > > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > > > patch, however, focuses on fixing the transient zero-length symlink
> > > > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > > > check).
> 
> It can't happen through normal VFS paths, and I don't think it can
> happen through log recovery. That's why I replaced this with an ASSERT.
> 
> In the normal behaviour case, zero length symlinks were created
> /after/ this point in time, and we've always been unable to read
> such inodes because xfs_dinode_verify() has always rejected zero
> length symlink inodes.
> 

That was my initial understanding as well..

> However, log recovery doesn't trip over the transient inode state
> because it does not use xfs_dinode_verify() for inode recovery - it
> reads the buffer with xfs_inode_buf_ops, and that just checks the
> inode numbers and then recovers whatever is in the log over the top.
> It never checks for zero length symlinks on recovery, and it never
> goes through the dinode verifier (on read or write!) to catch this.
> 
> It's not until we then recover the remaining intent chain that we go
> through xfs_iget/xfs_iread/xfs_inactive/xfs_ifree, and that
> xfs_iget() call runs xfs_dinode_verify(). If we've already recovered
> any part of the remote symlink removal intent chain (and we must
> have to be replaying EFIs!) this should see a zero length symlink
> inode. AIUI, we have no evidence that the verifier has ever fired at
> this point in time, even when recovering known transient zero length
> states.
> 

Hmm, not sure we're talking about the same thing. I ran a quick
experiment to try and clear up my confusion here, at least. I injected a
shutdown at the point inactive creates the zero sized symlink and
created/removed a remote symlink. On a remount, I hit the following
during log recovery:

[  685.931834] XFS (dm-3): Starting recovery (logdev: internal)
[  685.993014] XFS (dm-3): Metadata corruption detected at xfs_dinode_verify+0x331/0x490 [xfs], inode 0x85 dinode
[  685.996287] XFS (dm-3): Unmount and run xfs_repair
[  685.996911] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
...
[  686.006647] Call Trace:
[  686.006922]  dump_stack+0x85/0xcb
[  686.007338]  xfs_iread+0xeb/0x220 [xfs]
[  686.007820]  xfs_iget+0x4bd/0x1100 [xfs]
[  686.008344]  xlog_recover_process_one_iunlink+0x4d/0x3c0 [xfs]
...

That seems to show that the verifier can impede unlinked list processing
if the change is at least present in the log.

If I recreate that same dirty log state and mount with this patch
applied (note that the fs is created without this patch to simulate an
old kernel that has not changed i_mode in the same transaction that sets
di_size = 0) along with a hack to avoid the check in
xfs_dinode_verify(), I now hit the new assert and corruption error
that's been placed in xfs_inactive_symlink().

So to Darrick's point, that seems to show that this is a vector to the
pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
seems to me that if we want to handle recovery from this state, we'd
still need to work around the verifier check and retain the initial
di_size == 0 check in xfs_inactive_symlink().

> i.e all the cases I've seen where repair complains about symlinks
> with "dfork format 2 and size 0" it is because the log is dirty and
> hasn't been replayed. Mounting the filesystem and running log
> recovery makes the problem go away, and this is what lead to me
> removing the zeor length handling - the verifier should already
> be firing if it is recovering an intent on a zero length symlink
> inode...
> 

I haven't grokked all the details here, but the behavior you describe
does sound like this might be a different scenario from the above.

Brian

> That said, I'll try some more focussed testing with intentional
> corruption to see if it's just a case of my testing being (un)lucky
> and not triggering this issue...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-20 11:50               ` Brian Foster
@ 2018-06-20 22:59                 ` Dave Chinner
  2018-06-21 11:46                   ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2018-06-20 22:59 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> On Wed, Jun 20, 2018 at 09:22:42AM +1000, Dave Chinner wrote:
> > On Tue, Jun 19, 2018 at 12:28:40PM -0400, Brian Foster wrote:
> > > On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> > > > On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > > > > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > > > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > > > > just above from xfs_inactive_symlink():
> > > > > 
> > > > > -	/*
> > > > > -	 * Zero length symlinks _can_ exist.
> > > > > -	 */
> > > > > -	if (!pathlen) {
> > > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > -		return 0;
> > > > > -	}
> > > > > 
> > > > > ... obviously suggests that there could have been a situation where we'd
> > > > > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > > > > patch, however, focuses on fixing the transient zero-length symlink
> > > > > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > > > > check).
> > 
> > It can't happen through normal VFS paths, and I don't think it can
> > happen through log recovery. That's why I replaced this with an ASSERT.
> > 
> > In the normal behaviour case, zero length symlinks were created
> > /after/ this point in time, and we've always been unable to read
> > such inodes because xfs_dinode_verify() has always rejected zero
> > length symlink inodes.
> > 
> 
> That was my initial understanding as well..
> 
> > However, log recovery doesn't trip over the transient inode state
> > because it does not use xfs_dinode_verify() for inode recovery - it
> > reads the buffer with xfs_inode_buf_ops, and that just checks the
> > inode numbers and then recovers whatever is in the log over the top.
> > It never checks for zero length symlinks on recovery, and it never
> > goes through the dinode verifier (on read or write!) to catch this.
> > 
> > It's not until we then recover the remaining intent chain that we go
> > through xfs_iget/xfs_iread/xfs_inactive/xfs_ifree, and that
> > xfs_iget() call runs xfs_dinode_verify(). If we've already recovered
> > any part of the remote symlink removal intent chain (and we must
> > have to be replaying EFIs!) this should see a zero length symlink
> > inode. AIUI, we have no evidence that the verifier has ever fired at
> > this point in time, even when recovering known transient zero length
> > states.
> > 
> 
> Hmm, not sure we're talking about the same thing. I ran a quick
> experiment to try and clear up my confusion here, at least. I injected a
> shutdown at the point inactive creates the zero sized symlink and
> created/removed a remote symlink. On a remount, I hit the following
> during log recovery:
> 
> [  685.931834] XFS (dm-3): Starting recovery (logdev: internal)
> [  685.993014] XFS (dm-3): Metadata corruption detected at xfs_dinode_verify+0x331/0x490 [xfs], inode 0x85 dinode
> [  685.996287] XFS (dm-3): Unmount and run xfs_repair
> [  685.996911] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> ...
> [  686.006647] Call Trace:
> [  686.006922]  dump_stack+0x85/0xcb
> [  686.007338]  xfs_iread+0xeb/0x220 [xfs]
> [  686.007820]  xfs_iget+0x4bd/0x1100 [xfs]
> [  686.008344]  xlog_recover_process_one_iunlink+0x4d/0x3c0 [xfs]
> ...
> 
> That seems to show that the verifier can impede unlinked list processing
> if the change is at least present in the log.

Which happens after we process intents - ah, the bmap intents only
occur from swap extents, so the inode is not read via xfs_iget()
in this case until unlinked list processing.

So, yeah, we are talking about the same thing - it's the second
phase of recovery that reads inodes through xfs_iget() (for whatever
reason) that will trip over the xfs_dinode_verifier.

> If I recreate that same dirty log state and mount with this patch
> applied (note that the fs is created without this patch to simulate an
> old kernel that has not changed i_mode in the same transaction that sets
> di_size = 0) along with a hack to avoid the check in
> xfs_dinode_verify(), I now hit the new assert and corruption error
> that's been placed in xfs_inactive_symlink().
> 
> So to Darrick's point, that seems to show that this is a vector to the
> pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> seems to me that if we want to handle recovery from this state, we'd
> still need to work around the verifier check and retain the initial
> di_size == 0 check in xfs_inactive_symlink().

I think we should get rid of the transient state, not continue to
work around it. Because the transient state only exists in a log
recovery context, we can change the behaviour and not care about
older kernels still having the problem because all that is needed to
avoid the issue is a clean log when upgrading the kernel.

> > i.e all the cases I've seen where repair complains about symlinks
> > with "dfork format 2 and size 0" it is because the log is dirty and
> > hasn't been replayed. Mounting the filesystem and running log
> > recovery makes the problem go away, and this is what lead to me
> > removing the zeor length handling - the verifier should already
> > be firing if it is recovering an intent on a zero length symlink
> > inode...
> > 
> 
> I haven't grokked all the details here, but the behavior you describe
> does sound like this might be a different scenario from the above.

No, we are talking about the same thing. The only difference is that
I've been working from a bug report and corrupted fuzzer image, not
a real world corruption vector. Hence I'm trying to work backwards
from "here's a corrupt image that crashed" to "this is all the paths
that can lead to that on disk state".

While writing my last reply, I suspected you were correct about the
log recovery state, which is why I said:

> > That said, I'll try some more focussed testing with intentional
> > corruption to see if it's just a case of my testing being (un)lucky
> > and not triggering this issue...

As you've shown, I was being (un)lucky, so more work is needed on
this one.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-20 22:59                 ` Dave Chinner
@ 2018-06-21 11:46                   ` Brian Foster
  2018-06-21 22:31                     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-21 11:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > On Wed, Jun 20, 2018 at 09:22:42AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 19, 2018 at 12:28:40PM -0400, Brian Foster wrote:
> > > > On Tue, Jun 19, 2018 at 08:48:10AM -0700, Darrick J. Wong wrote:
> > > > > On Tue, Jun 19, 2018 at 07:54:34AM -0400, Brian Foster wrote:
> > > > > > On Tue, Jun 19, 2018 at 08:42:59AM +1000, Dave Chinner wrote:
> > > > > > Rather, I'm asking about the pre-existing code that we remove. The hunk
> > > > > > just above from xfs_inactive_symlink():
> > > > > > 
> > > > > > -	/*
> > > > > > -	 * Zero length symlinks _can_ exist.
> > > > > > -	 */
> > > > > > -	if (!pathlen) {
> > > > > > -		xfs_iunlock(ip, XFS_ILOCK_EXCL);
> > > > > > -		return 0;
> > > > > > -	}
> > > > > > 
> > > > > > ... obviously suggests that there could have been a situation where we'd
> > > > > > see a zero-length symlink on entry to xfs_inactive_symlink(). This
> > > > > > patch, however, focuses on fixing the transient zero-length symlink
> > > > > > caused by xfs_inactive_symlink_rmt() (which comes after the above
> > > > > > check).
> > > 
> > > It can't happen through normal VFS paths, and I don't think it can
> > > happen through log recovery. That's why I replaced this with an ASSERT.
> > > 
> > > In the normal behaviour case, zero length symlinks were created
> > > /after/ this point in time, and we've always been unable to read
> > > such inodes because xfs_dinode_verify() has always rejected zero
> > > length symlink inodes.
> > > 
> > 
> > That was my initial understanding as well..
> > 
> > > However, log recovery doesn't trip over the transient inode state
> > > because it does not use xfs_dinode_verify() for inode recovery - it
> > > reads the buffer with xfs_inode_buf_ops, and that just checks the
> > > inode numbers and then recovers whatever is in the log over the top.
> > > It never checks for zero length symlinks on recovery, and it never
> > > goes through the dinode verifier (on read or write!) to catch this.
> > > 
> > > It's not until we then recover the remaining intent chain that we go
> > > through xfs_iget/xfs_iread/xfs_inactive/xfs_ifree, and that
> > > xfs_iget() call runs xfs_dinode_verify(). If we've already recovered
> > > any part of the remote symlink removal intent chain (and we must
> > > have to be replaying EFIs!) this should see a zero length symlink
> > > inode. AIUI, we have no evidence that the verifier has ever fired at
> > > this point in time, even when recovering known transient zero length
> > > states.
> > > 
> > 
> > Hmm, not sure we're talking about the same thing. I ran a quick
> > experiment to try and clear up my confusion here, at least. I injected a
> > shutdown at the point inactive creates the zero sized symlink and
> > created/removed a remote symlink. On a remount, I hit the following
> > during log recovery:
> > 
> > [  685.931834] XFS (dm-3): Starting recovery (logdev: internal)
> > [  685.993014] XFS (dm-3): Metadata corruption detected at xfs_dinode_verify+0x331/0x490 [xfs], inode 0x85 dinode
> > [  685.996287] XFS (dm-3): Unmount and run xfs_repair
> > [  685.996911] XFS (dm-3): First 128 bytes of corrupted metadata buffer:
> > ...
> > [  686.006647] Call Trace:
> > [  686.006922]  dump_stack+0x85/0xcb
> > [  686.007338]  xfs_iread+0xeb/0x220 [xfs]
> > [  686.007820]  xfs_iget+0x4bd/0x1100 [xfs]
> > [  686.008344]  xlog_recover_process_one_iunlink+0x4d/0x3c0 [xfs]
> > ...
> > 
> > That seems to show that the verifier can impede unlinked list processing
> > if the change is at least present in the log.
> 
> Which happens after we process intents - ah, the bmap intents only
> occur from swap extents, so the inode is not read via xfs_iget()
> in this case until unlinked list processing.
> 
> So, yeah, we are talking about the same thing - it's the second
> phase of recovery that reads inodes through xfs_iget() (for whatever
> reason) that will trip over the xfs_dinode_verifier.
> 
> > If I recreate that same dirty log state and mount with this patch
> > applied (note that the fs is created without this patch to simulate an
> > old kernel that has not changed i_mode in the same transaction that sets
> > di_size = 0) along with a hack to avoid the check in
> > xfs_dinode_verify(), I now hit the new assert and corruption error
> > that's been placed in xfs_inactive_symlink().
> > 
> > So to Darrick's point, that seems to show that this is a vector to the
> > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > seems to me that if we want to handle recovery from this state, we'd
> > still need to work around the verifier check and retain the initial
> > di_size == 0 check in xfs_inactive_symlink().
> 
> I think we should get rid of the transient state, not continue to
> work around it. Because the transient state only exists in a log
> recovery context, we can change the behaviour and not care about
> older kernels still having the problem because all that is needed to
> avoid the issue is a clean log when upgrading the kernel.
> 

Right... that sounds reasonable to me, but we still need to consider how
we handle a filesystem already in this state. BTW, was this a user
report or a manufactured corruption due to fuzzing/shutdown testing or
something?

It's probably an extremely rare real-world situation to encounter so
perhaps this approach is good enough on its own and we just punt. I
think we at least need to document that in the commit log, however.
IIRC, the filesystem still actually mounted during the test I ran the
other day, despite the verifier failure. Taking a look at that codepath,
we have this:

       /*
         * We can't read in the inode this bucket points to, or this inode
         * is messed up.  Just ditch this bucket of inodes.  We will lose
         * some inodes and space, but at least we won't hang.
         *
         * Call xlog_recover_clear_agi_bucket() to perform a transaction to
         * clear the inode pointer in the bucket.
         */
        xlog_recover_clear_agi_bucket(mp, agno, bucket);

... which explains why that occurs: we just nuke the AGI unlinked list
bucket after failing to read an inode from it. I guess that means we
leak some inodes, but those were already freed so that's not really a
data loss vector. This otherwise doesn't seem to affect the recovery or
mount sequence. The leaked inodes can be recovered by a repair at a
later point. Is that overall behavior consistent with what you've seen
on the corrupted image?

Given that additional context, I don't feel too strongly about needing
to specially handle the "zero length symlink already exists in the dirty
log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
reported the error already nuked the state in the first place, so users
shouldn't really end up "stuck" somewhere between needing a kernel fix
or an 'xfs_repair -L' run (but if that's the approach we take, please
just note the tradeoff in the commit log). Just my .02.

Brian

> > > i.e all the cases I've seen where repair complains about symlinks
> > > with "dfork format 2 and size 0" it is because the log is dirty and
> > > hasn't been replayed. Mounting the filesystem and running log
> > > recovery makes the problem go away, and this is what lead to me
> > > removing the zeor length handling - the verifier should already
> > > be firing if it is recovering an intent on a zero length symlink
> > > inode...
> > > 
> > 
> > I haven't grokked all the details here, but the behavior you describe
> > does sound like this might be a different scenario from the above.
> 
> No, we are talking about the same thing. The only difference is that
> I've been working from a bug report and corrupted fuzzer image, not
> a real world corruption vector. Hence I'm trying to work backwards
> from "here's a corrupt image that crashed" to "this is all the paths
> that can lead to that on disk state".
> 
> While writing my last reply, I suspected you were correct about the
> log recovery state, which is why I said:
> 
> > > That said, I'll try some more focussed testing with intentional
> > > corruption to see if it's just a case of my testing being (un)lucky
> > > and not triggering this issue...
> 
> As you've shown, I was being (un)lucky, so more work is needed on
> this one.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-21 11:46                   ` Brian Foster
@ 2018-06-21 22:31                     ` Dave Chinner
  2018-06-21 22:53                       ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2018-06-21 22:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 21, 2018 at 07:46:18AM -0400, Brian Foster wrote:
> On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> > On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > > If I recreate that same dirty log state and mount with this patch
> > > applied (note that the fs is created without this patch to simulate an
> > > old kernel that has not changed i_mode in the same transaction that sets
> > > di_size = 0) along with a hack to avoid the check in
> > > xfs_dinode_verify(), I now hit the new assert and corruption error
> > > that's been placed in xfs_inactive_symlink().
> > > 
> > > So to Darrick's point, that seems to show that this is a vector to the
> > > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > > seems to me that if we want to handle recovery from this state, we'd
> > > still need to work around the verifier check and retain the initial
> > > di_size == 0 check in xfs_inactive_symlink().
> > 
> > I think we should get rid of the transient state, not continue to
> > work around it. Because the transient state only exists in a log
> > recovery context, we can change the behaviour and not care about
> > older kernels still having the problem because all that is needed to
> > avoid the issue is a clean log when upgrading the kernel.
> > 
> 
> Right... that sounds reasonable to me, but we still need to consider how
> we handle a filesystem already in this state. BTW, was this a user
> report or a manufactured corruption due to fuzzing/shutdown testing or
> something?

It was shutdown testing. The report was that xfs_repair -n failed
with a zero length symlink, not that log recovery failed. I assumed
that log recovery worked fine before xfs_repair -n was run because
it didn't warn about a dirty log. However, now that you point out
that we just toss unlink list recovery failures (which I'd forgotten
about!), I'm guessing that happened and then repair tripped over
the leaked symlink inode.

> Given that additional context, I don't feel too strongly about needing
> to specially handle the "zero length symlink already exists in the dirty
> log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
> reported the error already nuked the state in the first place, so users
> shouldn't really end up "stuck" somewhere between needing a kernel fix
> or an 'xfs_repair -L' run (but if that's the approach we take, please
> just note the tradeoff in the commit log). Just my .02.

Yup, that seems reasonable to me - leaking the inode until repair is
done has no user impact.  It will do the same thing for any inode it
finds on the unlinked list that is corrupted, so my thoughts are
that if we want to improve corruption handling we need to address
the general unlinked list corruption issue rather than just this
specific inode corruption case.

Alright - I'll clean up the patch, make notes of all this in the
commit message and repost.

Thanks, Brian!

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-21 22:31                     ` Dave Chinner
@ 2018-06-21 22:53                       ` Darrick J. Wong
  2018-06-22 10:44                         ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2018-06-21 22:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Jun 22, 2018 at 08:31:41AM +1000, Dave Chinner wrote:
> On Thu, Jun 21, 2018 at 07:46:18AM -0400, Brian Foster wrote:
> > On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> > > On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > > > If I recreate that same dirty log state and mount with this patch
> > > > applied (note that the fs is created without this patch to simulate an
> > > > old kernel that has not changed i_mode in the same transaction that sets
> > > > di_size = 0) along with a hack to avoid the check in
> > > > xfs_dinode_verify(), I now hit the new assert and corruption error
> > > > that's been placed in xfs_inactive_symlink().
> > > > 
> > > > So to Darrick's point, that seems to show that this is a vector to the
> > > > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > > > seems to me that if we want to handle recovery from this state, we'd
> > > > still need to work around the verifier check and retain the initial
> > > > di_size == 0 check in xfs_inactive_symlink().
> > > 
> > > I think we should get rid of the transient state, not continue to
> > > work around it. Because the transient state only exists in a log
> > > recovery context, we can change the behaviour and not care about
> > > older kernels still having the problem because all that is needed to
> > > avoid the issue is a clean log when upgrading the kernel.
> > > 
> > 
> > Right... that sounds reasonable to me, but we still need to consider how
> > we handle a filesystem already in this state. BTW, was this a user
> > report or a manufactured corruption due to fuzzing/shutdown testing or
> > something?
> 
> It was shutdown testing. The report was that xfs_repair -n failed
> with a zero length symlink, not that log recovery failed. I assumed
> that log recovery worked fine before xfs_repair -n was run because
> it didn't warn about a dirty log. However, now that you point out
> that we just toss unlink list recovery failures (which I'd forgotten
> about!), I'm guessing that happened and then repair tripped over
> the leaked symlink inode.
> 
> > Given that additional context, I don't feel too strongly about needing
> > to specially handle the "zero length symlink already exists in the dirty
> > log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
> > reported the error already nuked the state in the first place, so users
> > shouldn't really end up "stuck" somewhere between needing a kernel fix
> > or an 'xfs_repair -L' run (but if that's the approach we take, please
> > just note the tradeoff in the commit log). Just my .02.
> 
> Yup, that seems reasonable to me - leaking the inode until repair is
> done has no user impact.  It will do the same thing for any inode it
> finds on the unlinked list that is corrupted, so my thoughts are
> that if we want to improve corruption handling we need to address
> the general unlinked list corruption issue rather than just this
> specific inode corruption case.

FWIW I saw generic/475 trip over this last night and judging from the
logs I saw that behavior -- first we shut down right committing the
zero-length symlink, then recovery makes it as far as iunlink processing
where it blows up, tosses out that failure, and then the next time we
hit that inode we'd trip over it all over again.

--D

> Alright - I'll clean up the patch, make notes of all this in the
> commit message and repost.
> 
> Thanks, Brian!
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-21 22:53                       ` Darrick J. Wong
@ 2018-06-22 10:44                         ` Brian Foster
  2018-06-23 17:38                           ` Darrick J. Wong
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2018-06-22 10:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs

On Thu, Jun 21, 2018 at 03:53:46PM -0700, Darrick J. Wong wrote:
> On Fri, Jun 22, 2018 at 08:31:41AM +1000, Dave Chinner wrote:
> > On Thu, Jun 21, 2018 at 07:46:18AM -0400, Brian Foster wrote:
> > > On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> > > > On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > > > > If I recreate that same dirty log state and mount with this patch
> > > > > applied (note that the fs is created without this patch to simulate an
> > > > > old kernel that has not changed i_mode in the same transaction that sets
> > > > > di_size = 0) along with a hack to avoid the check in
> > > > > xfs_dinode_verify(), I now hit the new assert and corruption error
> > > > > that's been placed in xfs_inactive_symlink().
> > > > > 
> > > > > So to Darrick's point, that seems to show that this is a vector to the
> > > > > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > > > > seems to me that if we want to handle recovery from this state, we'd
> > > > > still need to work around the verifier check and retain the initial
> > > > > di_size == 0 check in xfs_inactive_symlink().
> > > > 
> > > > I think we should get rid of the transient state, not continue to
> > > > work around it. Because the transient state only exists in a log
> > > > recovery context, we can change the behaviour and not care about
> > > > older kernels still having the problem because all that is needed to
> > > > avoid the issue is a clean log when upgrading the kernel.
> > > > 
> > > 
> > > Right... that sounds reasonable to me, but we still need to consider how
> > > we handle a filesystem already in this state. BTW, was this a user
> > > report or a manufactured corruption due to fuzzing/shutdown testing or
> > > something?
> > 
> > It was shutdown testing. The report was that xfs_repair -n failed
> > with a zero length symlink, not that log recovery failed. I assumed
> > that log recovery worked fine before xfs_repair -n was run because
> > it didn't warn about a dirty log. However, now that you point out
> > that we just toss unlink list recovery failures (which I'd forgotten
> > about!), I'm guessing that happened and then repair tripped over
> > the leaked symlink inode.
> > 
> > > Given that additional context, I don't feel too strongly about needing
> > > to specially handle the "zero length symlink already exists in the dirty
> > > log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
> > > reported the error already nuked the state in the first place, so users
> > > shouldn't really end up "stuck" somewhere between needing a kernel fix
> > > or an 'xfs_repair -L' run (but if that's the approach we take, please
> > > just note the tradeoff in the commit log). Just my .02.
> > 
> > Yup, that seems reasonable to me - leaking the inode until repair is
> > done has no user impact.  It will do the same thing for any inode it
> > finds on the unlinked list that is corrupted, so my thoughts are
> > that if we want to improve corruption handling we need to address
> > the general unlinked list corruption issue rather than just this
> > specific inode corruption case.
> 
> FWIW I saw generic/475 trip over this last night and judging from the
> logs I saw that behavior -- first we shut down right committing the
> zero-length symlink, then recovery makes it as far as iunlink processing
> where it blows up, tosses out that failure, and then the next time we
> hit that inode we'd trip over it all over again.
> 

What do you mean by "the next time we hit that inode?" IIUC, iunlink
processing nukes the entire AGI unlinked list bucket in response to the
read error. Given the inode was unlinked, shouldn't it no longer be
accessible after that point? Are you hitting some other path that
attempts to read the inode?

Brian

> --D
> 
> > Alright - I'll clean up the patch, make notes of all this in the
> > commit message and repost.
> > 
> > Thanks, Brian!
> > 
> > Cheers,
> > 
> > Dave.
> > 
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/2] xfs: zero length symlinks are not valid
  2018-06-22 10:44                         ` Brian Foster
@ 2018-06-23 17:38                           ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2018-06-23 17:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, linux-xfs

On Fri, Jun 22, 2018 at 06:44:48AM -0400, Brian Foster wrote:
> On Thu, Jun 21, 2018 at 03:53:46PM -0700, Darrick J. Wong wrote:
> > On Fri, Jun 22, 2018 at 08:31:41AM +1000, Dave Chinner wrote:
> > > On Thu, Jun 21, 2018 at 07:46:18AM -0400, Brian Foster wrote:
> > > > On Thu, Jun 21, 2018 at 08:59:18AM +1000, Dave Chinner wrote:
> > > > > On Wed, Jun 20, 2018 at 07:50:48AM -0400, Brian Foster wrote:
> > > > > > If I recreate that same dirty log state and mount with this patch
> > > > > > applied (note that the fs is created without this patch to simulate an
> > > > > > old kernel that has not changed i_mode in the same transaction that sets
> > > > > > di_size = 0) along with a hack to avoid the check in
> > > > > > xfs_dinode_verify(), I now hit the new assert and corruption error
> > > > > > that's been placed in xfs_inactive_symlink().
> > > > > > 
> > > > > > So to Darrick's point, that seems to show that this is a vector to the
> > > > > > pre-existing len == 0 check in xfs_inactive_symlink(). Given that, it
> > > > > > seems to me that if we want to handle recovery from this state, we'd
> > > > > > still need to work around the verifier check and retain the initial
> > > > > > di_size == 0 check in xfs_inactive_symlink().
> > > > > 
> > > > > I think we should get rid of the transient state, not continue to
> > > > > work around it. Because the transient state only exists in a log
> > > > > recovery context, we can change the behaviour and not care about
> > > > > older kernels still having the problem because all that is needed to
> > > > > avoid the issue is a clean log when upgrading the kernel.
> > > > > 
> > > > 
> > > > Right... that sounds reasonable to me, but we still need to consider how
> > > > we handle a filesystem already in this state. BTW, was this a user
> > > > report or a manufactured corruption due to fuzzing/shutdown testing or
> > > > something?
> > > 
> > > It was shutdown testing. The report was that xfs_repair -n failed
> > > with a zero length symlink, not that log recovery failed. I assumed
> > > that log recovery worked fine before xfs_repair -n was run because
> > > it didn't warn about a dirty log. However, now that you point out
> > > that we just toss unlink list recovery failures (which I'd forgotten
> > > about!), I'm guessing that happened and then repair tripped over
> > > the leaked symlink inode.
> > > 
> > > > Given that additional context, I don't feel too strongly about needing
> > > > to specially handle the "zero length symlink already exists in the dirty
> > > > log due to xfs_inactive_symlink_rmt()" case. Chances are the mount that
> > > > reported the error already nuked the state in the first place, so users
> > > > shouldn't really end up "stuck" somewhere between needing a kernel fix
> > > > or an 'xfs_repair -L' run (but if that's the approach we take, please
> > > > just note the tradeoff in the commit log). Just my .02.
> > > 
> > > Yup, that seems reasonable to me - leaking the inode until repair is
> > > done has no user impact.  It will do the same thing for any inode it
> > > finds on the unlinked list that is corrupted, so my thoughts are
> > > that if we want to improve corruption handling we need to address
> > > the general unlinked list corruption issue rather than just this
> > > specific inode corruption case.
> > 
> > FWIW I saw generic/475 trip over this last night and judging from the
> > logs I saw that behavior -- first we shut down right committing the
> > zero-length symlink, then recovery makes it as far as iunlink processing
> > where it blows up, tosses out that failure, and then the next time we
> > hit that inode we'd trip over it all over again.
> > 
> 
> What do you mean by "the next time we hit that inode?" IIUC, iunlink
> processing nukes the entire AGI unlinked list bucket in response to the
> read error. Given the inode was unlinked, shouldn't it no longer be
> accessible after that point? Are you hitting some other path that
> attempts to read the inode?

Aha, xfs_scrub finds the busted inaccessible inode when it scans the
filesystem, though frustratingly the fs shuts down because xfs_inactive
gets called on the corrupted symlink when xfs_fs_destroy_inode ->
xfs_inactive -> xfs_inactive_symlink... will have to review scrub's
iget use.

--D

> Brian
> 
> > --D
> > 
> > > Alright - I'll clean up the patch, make notes of all this in the
> > > commit message and repost.
> > > 
> > > Thanks, Brian!
> > > 
> > > Cheers,
> > > 
> > > Dave.
> > > 
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-06-23 17:38 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-18  5:57 [PATCH 0/2] xfs: symlink and inode writeback issues Dave Chinner
2018-06-18  5:57 ` [PATCH 1/2] xfs: zero length symlinks are not valid Dave Chinner
2018-06-18 13:24   ` Brian Foster
2018-06-18 22:42     ` Dave Chinner
2018-06-19 11:54       ` Brian Foster
2018-06-19 15:48         ` Darrick J. Wong
2018-06-19 16:28           ` Brian Foster
2018-06-19 23:22             ` Dave Chinner
2018-06-20 11:50               ` Brian Foster
2018-06-20 22:59                 ` Dave Chinner
2018-06-21 11:46                   ` Brian Foster
2018-06-21 22:31                     ` Dave Chinner
2018-06-21 22:53                       ` Darrick J. Wong
2018-06-22 10:44                         ` Brian Foster
2018-06-23 17:38                           ` Darrick J. Wong
2018-06-18  5:57 ` [PATCH 2/2] xfs: xfs_iflush_abort() can be called twice on cluster writeback failure Dave Chinner
2018-06-18 13:24   ` Brian Foster
2018-06-19  5:05   ` Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.