Re: [RFC PATCH] xfs: support shrinking unused space in the last AG

From: Gao Xiang <hsiangkao@aol.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org, Gao Xiang <hsiangkao@redhat.com>
Subject: Re: [RFC PATCH] xfs: support shrinking unused space in the last AG
Date: Thu, 15 Oct 2020 09:11:24 +0800	[thread overview]
Message-ID: <20201015011116.GB7037@hsiangkao-HP-ZHAN-66-Pro-G1> (raw)
In-Reply-To: <20201014160633.GD9832@magnolia>

Hi Darrick,

On Wed, Oct 14, 2020 at 09:06:33AM -0700, Darrick J. Wong wrote:
> On Wed, Oct 14, 2020 at 08:58:09AM +0800, Gao Xiang wrote:
> > From: Gao Xiang <hsiangkao@redhat.com>
> > 

...

> > +xfs_ag_shrink_space(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans	*tp,
> > +	struct aghdr_init_data	*id,
> > +	xfs_extlen_t		len)
> > +{
> > +	struct xfs_buf		*agibp, *agfbp;
> > +	struct xfs_agi		*agi;
> > +	struct xfs_agf		*agf;
> > +	int			error;
> > +
> > +	ASSERT(id->agno == mp->m_sb.sb_agcount - 1);
> > +	error = xfs_ialloc_read_agi(mp, tp, id->agno, &agibp);
> > +	if (error)
> > +		return error;
> > +
> > +	agi = agibp->b_addr;
> > +
> > +	/* Cannot touch the log space */
> > +	if (is_log_ag(mp, id) &&
> > +	    XFS_FSB_TO_AGBNO(mp, mp->m_sb.sb_logstart) +
> > +	    mp->m_sb.sb_logblocks > be32_to_cpu(agi->agi_length) - len)
> > +		return -EINVAL;
> 
> The space used by the internal log shouldn't also show up in the free
> space btrees, so I think you could rely on the _alloc_vextent_shrink
> call to return ENOSPC, right?

Yeah, it makes sense since freespace btree doesn't have log space
at all. I will drop this and make a comment on this.

> 
> > +
> > +	error = xfs_alloc_read_agf(mp, tp, id->agno, 0, &agfbp);
> > +	if (error)
> > +		return error;
> > +
> > +	error = xfs_alloc_vextent_shrink(tp, agfbp,
> > +			be32_to_cpu(agi->agi_length) - len, len);
> > +	if (error)
> > +		return error;
> > +
> > +	/* Change the agi length */
> > +	be32_add_cpu(&agi->agi_length, -len);
> > +	xfs_ialloc_log_agi(tp, agibp, XFS_AGI_LENGTH);
> > +
> > +	/* Change agf length */
> > +	agf = agfbp->b_addr;
> > +	be32_add_cpu(&agf->agf_length, -len);
> > +	ASSERT(agf->agf_length == agi->agi_length);
> > +	xfs_alloc_log_agf(tp, agfbp, XFS_AGF_LENGTH);
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Extent the AG indicated by the @id by the length passed in
> 
> "Extend..." ?

Yeah, yet not related to this patch, I could fix it independently
(maybe together with other typos)...

> 
> >   */
> > diff --git a/fs/xfs/libxfs/xfs_ag.h b/fs/xfs/libxfs/xfs_ag.h
> > index 5166322807e7..f3b5bbfeadce 100644
> > --- a/fs/xfs/libxfs/xfs_ag.h
> > +++ b/fs/xfs/libxfs/xfs_ag.h
> > @@ -24,6 +24,8 @@ struct aghdr_init_data {
> >  };
> >  
> >  int xfs_ag_init_headers(struct xfs_mount *mp, struct aghdr_init_data *id);
> > +int xfs_ag_shrink_space(struct xfs_mount *mp, struct xfs_trans *tp,
> > +			struct aghdr_init_data *id, xfs_extlen_t len);
> >  int xfs_ag_extend_space(struct xfs_mount *mp, struct xfs_trans *tp,
> >  			struct aghdr_init_data *id, xfs_extlen_t len);
> >  int xfs_ag_get_geometry(struct xfs_mount *mp, xfs_agnumber_t agno,
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 852b536551b5..681357bb2701 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -1118,6 +1118,61 @@ xfs_alloc_ag_vextent_small(
> >  	return error;
> >  }
> >  
> > +/*
> > + * Allocate an extent for shrinking the last allocation group
> > + * to fix the freespace btree.
> > + * agfl fix is avoided in order to keep from dirtying the
> > + * transaction unnecessarily compared with xfs_alloc_vextent().
> > + */
> > +int
> > +xfs_alloc_vextent_shrink(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*agbp,
> > +	xfs_agblock_t		agbno,
> > +	xfs_extlen_t		len)
> > +{
> > +	struct xfs_mount	*mp = tp->t_mountp;
> > +	xfs_agnumber_t		agno = agbp->b_pag->pag_agno;
> > +	struct xfs_alloc_arg	args = {
> > +		.tp = tp,
> > +		.mp = mp,
> > +		.type = XFS_ALLOCTYPE_THIS_BNO,
> > +		.agbp = agbp,
> > +		.agno = agno,
> > +		.agbno = agbno,
> > +		.fsbno = XFS_AGB_TO_FSB(mp, agno, agbno),
> 
> /me would have thought you could compute agbno from the AGF buffer
> passed in and the length parameter.

Yeah, that would be fine since agfbp has been passed in here. will update.

> 
> > +		.minlen = len,
> > +		.maxlen = len,
> > +		.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE,
> > +		.resv = XFS_AG_RESV_NONE,
> 
> Hmm.  Using AG_RESV_NONE here means that the allocation can fail because
> there won't be enough free space left to satisfy the AG metadata
> preallocations.  I think you have to call xfs_ag_resv_free before
> calling this allocation function to remove the preallocations, and then
> call xfs_ag_resv_init afterwards (regardless of whether the shrink
> succeeds) to re-establish the preallocations to fit the new AG size.

Ok, currently I'm not quite familiar with code in libxfs/xfs_ag_resv.c.
will look into that and try to follow your suggestion.

> 
> > +		.prod = 1,
> > +		.alignment = 1,
> > +		.pag = agbp->b_pag
> > +	};
> > +	int			error;
> > +
> > +	error = xfs_alloc_ag_vextent_exact(&args);
> > +	if (error || args.agbno == NULLAGBLOCK)
> > +		return -EBUSY;
> > +
> > +	ASSERT(args.agbno == agbno);
> > +	ASSERT(args.len == len);
> > +	ASSERT(!args.wasfromfl || args.resv != XFS_AG_RESV_AGFL);
> 
> Can wasfromfl==true actually happen?  I think for this shrink case we
> need to prevent that from ever happening.

I think it won't happen as well, since my idea currently is to avoid
any reduntant step, e.g. to adjust agfl or move/defrag metadata. I think
laterly userspace could do that with another ioctl (?) and if it fails
(e.g. change concurrently....) to try some process again.

> 
> > +
> > +	if (!args.wasfromfl) {
> > +		error = xfs_alloc_update_counters(tp, agbp, -(long)len);
> > +		if (error)
> > +			return error;
> > +
> > +		ASSERT(!xfs_extent_busy_search(mp, args.agno, agbno, args.len));
> > +	}
> > +	xfs_ag_resv_alloc_extent(args.pag, args.resv, &args);
> 
> (I think you can skip this call if you free the AG's preallocations
> before allocating the last freespace.)

Ok, will look into that as well. Currently the logic is mostly taken from
xfs_alloc_vextent() but without agfl fix, so the entriance seems more
straight-forward than shrinking stuffs are deep into a large
xfs_alloc_vextent() details...

> 
> > +
> > +	XFS_STATS_INC(mp, xs_allocx);
> > +	XFS_STATS_ADD(mp, xs_allocb, args.len);
> > +	return 0;
> > +}
> > +
> >  /*
> >   * Allocate a variable extent in the allocation group agno.
> >   * Type and bno are used to determine where in the allocation group the
> > diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
> > index 6c22b12176b8..6080140bcb56 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.h
> > +++ b/fs/xfs/libxfs/xfs_alloc.h
> > @@ -160,6 +160,16 @@ int				/* error */
> >  xfs_alloc_vextent(
> >  	xfs_alloc_arg_t	*args);	/* allocation argument structure */
> >  
> > +/*
> > + * Allocate an extent for shrinking the last AG
> > + */
> > +int
> > +xfs_alloc_vextent_shrink(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_buf		*agbp,
> > +	xfs_agblock_t		agbno,
> > +	xfs_extlen_t		len);	/* length of extent */
> > +
> >  /*
> >   * Free an extent.
> >   */
> > diff --git a/fs/xfs/xfs_fsops.c b/fs/xfs/xfs_fsops.c
> > index ef1d5bb88b93..80927d323939 100644
> > --- a/fs/xfs/xfs_fsops.c
> > +++ b/fs/xfs/xfs_fsops.c
> > @@ -36,19 +36,21 @@ xfs_growfs_data_private(
> >  	xfs_rfsblock_t		new;
> >  	xfs_agnumber_t		oagcount;
> >  	xfs_trans_t		*tp;
> > +	bool			extend;
> >  	struct aghdr_init_data	id = {};
> >  
> >  	nb = in->newblocks;
> > -	if (nb < mp->m_sb.sb_dblocks)
> > -		return -EINVAL;
> >  	if ((error = xfs_sb_validate_fsb_count(&mp->m_sb, nb)))
> >  		return error;
> 
> Please convert this to the more normal format:
> 
> error = xfs_sb_validate...();
> if (error)
> 	return error;

Ok, will update it with this patch.

> 
> > -	error = xfs_buf_read_uncached(mp->m_ddev_targp,
> > +
> > +	if (nb > mp->m_sb.sb_dblocks) {
> > +		error = xfs_buf_read_uncached(mp->m_ddev_targp,
> >  				XFS_FSB_TO_BB(mp, nb) - XFS_FSS_TO_BB(mp, 1),
> >  				XFS_FSS_TO_BB(mp, 1), 0, &bp, NULL);
> > -	if (error)
> > -		return error;
> > -	xfs_buf_relse(bp);
> > +		if (error)
> > +			return error;
> > +		xfs_buf_relse(bp);
> > +	}
> >  
> >  	new = nb;	/* use new as a temporary here */
> >  	nb_mod = do_div(new, mp->m_sb.sb_agblocks);
> > @@ -56,10 +58,18 @@ xfs_growfs_data_private(
> >  	if (nb_mod && nb_mod < XFS_MIN_AG_BLOCKS) {
> >  		nagcount--;
> >  		nb = (xfs_rfsblock_t)nagcount * mp->m_sb.sb_agblocks;
> > -		if (nb < mp->m_sb.sb_dblocks)
> > +		if (!nagcount)
> >  			return -EINVAL;
> 
> Realistically, we don't ever want fewer than 2 AGs since that would
> leave us with no superblock redundancy.  Repair doesn't really support
> the 1 AG case.

yeah, I notice that userspace warning, will update it.

> 
> >  	}
> > -	new = nb - mp->m_sb.sb_dblocks;
> > +
> > +	if (nb > mp->m_sb.sb_dblocks) {
> > +		new = nb - mp->m_sb.sb_dblocks;
> > +		extend = true;
> > +	} else {
> > +		new = mp->m_sb.sb_dblocks - nb;
> > +		extend = false;
> > +	}
> 
> Er... maybe you should start by hoisting all the "make data device
> larger" code into a separate function so that you can add the "make data
> device smaller" code as a second function.  Then xfs_growfs_data_private
> can decide which of the two to call based on the new size.

Yeah, that is fine, I will try to do in the next version.

> 
> > +
> >  	oagcount = mp->m_sb.sb_agcount;
> >  
> >  	/* allocate the new per-ag structures */
> > @@ -67,10 +77,14 @@ xfs_growfs_data_private(
> >  		error = xfs_initialize_perag(mp, nagcount, &nagimax);
> >  		if (error)
> >  			return error;
> > +	} else if (nagcount != oagcount) {
> > +		/* TODO: shrinking a whole AG hasn't yet implemented */
> > +		return -EINVAL;
> >  	}
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_growdata,
> > -			XFS_GROWFS_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
> > +			(extend ? 0 : new) + XFS_GROWFS_SPACE_RES(mp), 0,
> 
> XFS_GROWFS_SPACE_RES is defined as twice m_ag_maxlevels, which I think
> means (in the grow case) that it's reserving space for one full split
> for each free space btree.
> 
> Shrink, by contrast, is removing the rightmost record from the bnobt and
> some arbitrary record from the cntbt.  Can removing 1 record from each
> btree actually cause a split?

Yeah, I don't think so from btree implementation itself since it
removes/adjusts the record, will think it deeper later.

> 
> > +			XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> >  		return error;
> >  
> > @@ -103,15 +117,22 @@ xfs_growfs_data_private(
> >  			goto out_trans_cancel;
> >  		}
> >  	}
> > -	error = xfs_buf_delwri_submit(&id.buffer_list);
> > -	if (error)
> > -		goto out_trans_cancel;
> > +
> > +	if (!list_empty(&id.buffer_list)) {
> > +		error = xfs_buf_delwri_submit(&id.buffer_list);
> > +		if (error)
> > +			goto out_trans_cancel;
> > +	}
> >  
> >  	xfs_trans_agblocks_delta(tp, id.nfree);
> >  
> >  	/* If there are new blocks in the old last AG, extend it. */
> >  	if (new) {
> > -		error = xfs_ag_extend_space(mp, tp, &id, new);
> > +		if (extend)
> > +			error = xfs_ag_extend_space(mp, tp, &id, new);
> > +		else
> > +			error = xfs_ag_shrink_space(mp, tp, &id, new);
> 
> This logic is getting harder for me to track because the patch combines
> grow and shrink logic in the same function....

Yeah, currently it shares almost the same logic, I will try to seperate them
in the next version.

It's rough a preliminary version for now, if this direction is roughly
acceptable for upstream, will look into it further...

Thanks,
Gao Xiang

> 
> --D
>