Re: [PATCH] xfs: don't commit sunit/swidth updates to disk if that would cause repair failures

From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: xfs <linux-xfs@vger.kernel.org>, Alex Lyakas <alex@zadara.com>,
	Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH] xfs: don't commit sunit/swidth updates to disk if that would cause repair failures
Date: Thu, 5 Dec 2019 09:36:18 -0500	[thread overview]
Message-ID: <20191205143618.GA48368@bfoster> (raw)
In-Reply-To: <20191204170340.GR7335@magnolia>

On Wed, Dec 04, 2019 at 09:03:40AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Alex Lyakas reported[1] that mounting an xfs filesystem with new sunit
> and swidth values could cause xfs_repair to fail loudly.  The problem
> here is that repair calculates the where mkfs should have allocated the
> root inode, based on the superblock geometry.  The allocation decisions
> depend on sunit, which means that we really can't go updating sunit if
> it would lead to a subsequent repair failure on an otherwise correct
> filesystem.
> 
> Port the computation code from xfs_repair and teach mount to avoid the
> ondisk update if it would cause problems for repair.  We allow the mount
> to proceed (and new allocations will reflect this new geometry) because
> we've never screened this kind of thing before.
> 
> [1] https://lore.kernel.org/linux-xfs/20191125130744.GA44777@bfoster/T/#m00f9594b511e076e2fcdd489d78bc30216d72a7d
> 
> Reported-by: Alex Lyakas <alex@zadara.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: compute the root inode location directly
> ---
>  fs/xfs/libxfs/xfs_ialloc.c |   81 ++++++++++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_ialloc.h |    1 +
>  fs/xfs/xfs_mount.c         |   51 ++++++++++++++++++----------
>  3 files changed, 115 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 988cde7744e6..6df9bcc96251 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -2909,3 +2909,84 @@ xfs_ialloc_setup_geometry(
>  	else
>  		igeo->ialloc_align = 0;
>  }
> +
> +/*
> + * Compute the location of the root directory inode that is laid out by mkfs.
> + * The @sunit parameter will be copied from the superblock if it is negative.
> + */
> +xfs_ino_t
> +xfs_ialloc_calc_rootino(
> +	struct xfs_mount	*mp,
> +	int			sunit)
> +{
> +	struct xfs_ino_geometry	*igeo = M_IGEO(mp);
> +	xfs_agino_t		first_agino;
> +	xfs_agblock_t		first_bno;
> +
> +	if (sunit < 0)
> +		sunit = mp->m_sb.sb_unit;
> +
> +	/*
> +	 * Pre-calculate the geometry of ag 0. We know what it looks like
> +	 * because we know what mkfs does: 2 allocation btree roots (by block
> +	 * and by size), the inode allocation btree root, the free inode
> +	 * allocation btree root (if enabled) and some number of blocks to
> +	 * prefill the agfl.
> +	 *
> +	 * Because the current shape of the btrees may differ from the current
> +	 * shape, we open code the mkfs freelist block count here. mkfs creates
> +	 * single level trees, so the calculation is pretty straight forward for
> +	 * the trees that use the AGFL.
> +	 */
> +

I know this code is lifted from userspace, but.. "the current shape of
the btrees may differ from the current shape, .." Eh?

> +	/* free space by block btree root comes after the ag headers */
> +	first_bno = howmany(4 * mp->m_sb.sb_sectsize, mp->m_sb.sb_blocksize);
> +
> +	/* free space by length btree root */
> +	first_bno += 1;
> +
> +	/* inode btree root */
> +	first_bno += 1;
> +
> +	/* agfl */
> +	first_bno += (2 * min_t(xfs_agblock_t, 2, mp->m_ag_maxlevels)) + 1;
> +

This is a little subtle from the userspace code. The extra +1 here is
where we go from pointing at metadata blocks (i.e. bnobt root) to the
first free block (i.e., past metadata blocks), right? If so, I wonder if
this should be incorporated either at the beginning or end with a
comment for explanation (i.e. "Start by pointing at the first block
after the AG headers and increment by size of applicable metadata to
locate the first free block ...").

I'm guessing this is a historical artifact of the userspace code as
features were added. The fact that userspace uses different variables
somewhat helps self-document in that context, and we lose that here.

> +	if (xfs_sb_version_hasfinobt(&mp->m_sb))
> +		first_bno++;
> +
> +	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +		first_bno++;
> +		/* agfl blocks */
> +		first_bno += min_t(xfs_agblock_t, 2, mp->m_rmap_maxlevels);
> +	}
> +
> +	if (xfs_sb_version_hasreflink(&mp->m_sb))
> +		first_bno++;
> +
> +	/*
> +	 * If the log is allocated in the first allocation group we need to
> +	 * add the number of blocks used by the log to the above calculation.
> +	 *
> +	 * This can happens with filesystems that only have a single
> +	 * allocation group, or very odd geometries created by old mkfs
> +	 * versions on very small filesystems.
> +	 */
> +	if (mp->m_sb.sb_logstart &&
> +	    XFS_FSB_TO_AGNO(mp, mp->m_sb.sb_logstart) == 0)
> +		 first_bno += mp->m_sb.sb_logblocks;
> +
> +	/*
> +	 * ditto the location of the first inode chunks in the fs ('/')
> +	 */
> +	if (xfs_sb_version_hasdalign(&mp->m_sb) && igeo->ialloc_align > 0) {
> +		first_agino = XFS_AGB_TO_AGINO(mp, roundup(first_bno, sunit));
> +	} else if (xfs_sb_version_hasalign(&mp->m_sb) &&
> +		   mp->m_sb.sb_inoalignmt > 1)  {
> +		first_agino = XFS_AGB_TO_AGINO(mp,
> +				roundup(first_bno, mp->m_sb.sb_inoalignmt));
> +	} else  {
> +		first_agino = XFS_AGB_TO_AGINO(mp, first_bno);
> +	}
> +
> +	return XFS_AGINO_TO_INO(mp, 0, first_agino);
> +}
> diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> index 323592d563d5..72b3468b97b1 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.h
> +++ b/fs/xfs/libxfs/xfs_ialloc.h
> @@ -152,5 +152,6 @@ int xfs_inobt_insert_rec(struct xfs_btree_cur *cur, uint16_t holemask,
>  
>  int xfs_ialloc_cluster_alignment(struct xfs_mount *mp);
>  void xfs_ialloc_setup_geometry(struct xfs_mount *mp);
> +xfs_ino_t xfs_ialloc_calc_rootino(struct xfs_mount *mp, int sunit);
>  
>  #endif	/* __XFS_IALLOC_H__ */
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index fca65109cf24..a4eb3ae34a84 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -363,9 +363,10 @@ xfs_readsb(
>   * Update alignment values based on mount options and sb values
>   */
>  STATIC int
> -xfs_update_alignment(xfs_mount_t *mp)
> +xfs_update_alignment(
> +	struct xfs_mount	*mp)
>  {
> -	xfs_sb_t	*sbp = &(mp->m_sb);
> +	struct xfs_sb		*sbp = &mp->m_sb;
>  
>  	if (mp->m_dalign) {
>  		/*
> @@ -398,28 +399,42 @@ xfs_update_alignment(xfs_mount_t *mp)
>  			}
>  		}
>  
> -		/*
> -		 * Update superblock with new values
> -		 * and log changes
> -		 */
> -		if (xfs_sb_version_hasdalign(sbp)) {
> -			if (sbp->sb_unit != mp->m_dalign) {
> -				sbp->sb_unit = mp->m_dalign;
> -				mp->m_update_sb = true;
> -			}
> -			if (sbp->sb_width != mp->m_swidth) {
> -				sbp->sb_width = mp->m_swidth;
> -				mp->m_update_sb = true;
> -			}
> -		} else {
> +		/* Update superblock with new values and log changes. */
> +		if (!xfs_sb_version_hasdalign(sbp)) {
>  			xfs_warn(mp,
>  	"cannot change alignment: superblock does not support data alignment");
>  			return -EINVAL;
>  		}
> +
> +		if (sbp->sb_unit == mp->m_dalign &&
> +		    sbp->sb_width == mp->m_swidth)
> +			return 0;
> +
> +		/*
> +		 * If the sunit/swidth change would move the precomputed root
> +		 * inode value, we must reject the ondisk change because repair
> +		 * will stumble over that.  However, we allow the mount to
> +		 * proceed because we never rejected this combination before.
> +		 */
> +		if (sbp->sb_rootino !=
> +		    xfs_ialloc_calc_rootino(mp, mp->m_dalign)) {
> +			xfs_warn(mp,
> +	"cannot change stripe alignment: would require moving root inode");
> +

FWIW, I read this error message as the mount option was ignored. I don't
much care whether we ignore the mount option or simply the on-disk
update, but the error could be a bit more clear in the latter case.
Also, what is the expected behavior for xfs_info in the latter
situation?

Brian

> +			/*
> +			 * XXX: Next time we add a new incompat feature, this
> +			 * should start returning -EINVAL.
> +			 */
> +			return 0;
> +		}
> +
> +		sbp->sb_unit = mp->m_dalign;
> +		sbp->sb_width = mp->m_swidth;
> +		mp->m_update_sb = true;
>  	} else if ((mp->m_flags & XFS_MOUNT_NOALIGN) != XFS_MOUNT_NOALIGN &&
>  		    xfs_sb_version_hasdalign(&mp->m_sb)) {
> -			mp->m_dalign = sbp->sb_unit;
> -			mp->m_swidth = sbp->sb_width;
> +		mp->m_dalign = sbp->sb_unit;
> +		mp->m_swidth = sbp->sb_width;
>  	}
>  
>  	return 0;
>