All of lore.kernel.org
 help / color / mirror / Atom feed
* avoid running out of space during finobt insertation
@ 2017-01-25  9:14 Christoph Hellwig
  2017-01-25  9:14 ` [PATCH 1/2] xfs: only update mount/resv fields on success in __xfs_ag_resv_init Christoph Hellwig
  2017-01-25  9:14 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
  0 siblings, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-25  9:14 UTC (permalink / raw)
  To: darrick.wong; +Cc: bfoster, linux-xfs

Another repost, now down to mostly cosmetic changes.  Hopefully
we can still make it for 4.10..

Changes since V3:
  - set m_inotbt_nores more aggresively
  - fix a comment typo
  - undo whitespace churn

Changes since V2:
  - improve error handling in __xfs_ag_resv_init
  - improve mount time warning

Changes since V1:
  - reduce the size of the reservation
  - warn and fall back to the old somewhat buggy code if we can't
    get the reservation at mount time
  - read the AGF before asserting based on values in it

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] xfs: only update mount/resv fields on success in __xfs_ag_resv_init
  2017-01-25  9:14 avoid running out of space during finobt insertation Christoph Hellwig
@ 2017-01-25  9:14 ` Christoph Hellwig
  2017-01-25  9:14 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-25  9:14 UTC (permalink / raw)
  To: darrick.wong; +Cc: bfoster, linux-xfs

Try to reserve the blocks first and only then update the fields in
or hanging off the mount structure.  This way we can call __xfs_ag_resv_init
again after a previous failure.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c | 23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index d346d42..94234bf 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -200,22 +200,27 @@ __xfs_ag_resv_init(
 	struct xfs_mount		*mp = pag->pag_mount;
 	struct xfs_ag_resv		*resv;
 	int				error;
+	xfs_extlen_t			reserved;
 
-	resv = xfs_perag_resv(pag, type);
 	if (used > ask)
 		ask = used;
-	resv->ar_asked = ask;
-	resv->ar_reserved = resv->ar_orig_reserved = ask - used;
-	mp->m_ag_max_usable -= ask;
+	reserved = ask - used;
 
-	trace_xfs_ag_resv_init(pag, type, ask);
-
-	error = xfs_mod_fdblocks(mp, -(int64_t)resv->ar_reserved, true);
-	if (error)
+	error = xfs_mod_fdblocks(mp, -(int64_t)reserved, true);
+	if (error) {
 		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
 				error, _RET_IP_);
+		return error;
+	}
 
-	return error;
+	mp->m_ag_max_usable -= ask;
+
+	resv = xfs_perag_resv(pag, type);
+	resv->ar_asked = ask;
+	resv->ar_reserved = resv->ar_orig_reserved = reserved;
+
+	trace_xfs_ag_resv_init(pag, type, ask);
+	return 0;
 }
 
 /* Create a per-AG block reservation. */
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-25  9:14 avoid running out of space during finobt insertation Christoph Hellwig
  2017-01-25  9:14 ` [PATCH 1/2] xfs: only update mount/resv fields on success in __xfs_ag_resv_init Christoph Hellwig
@ 2017-01-25  9:14 ` Christoph Hellwig
  2017-01-25 15:58   ` Darrick J. Wong
  2017-01-27  5:08   ` Eric Sandeen
  1 sibling, 2 replies; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-25  9:14 UTC (permalink / raw)
  To: darrick.wong; +Cc: bfoster, linux-xfs

Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full.  This causes us to cancel a dirty
transaction and thus a file system shutdown.

I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.

Note that this could increase mount times with large finobt trees.  In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_ag_resv.c      | 47 ++++++++++++++++++---
 fs/xfs/libxfs/xfs_ialloc_btree.c | 90 ++++++++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.h |  3 ++
 fs/xfs/xfs_inode.c               | 23 +++++-----
 fs/xfs/xfs_mount.h               |  1 +
 5 files changed, 144 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 94234bf..33db69b 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -39,6 +39,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_btree.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_ialloc_btree.h"
 
 /*
  * Per-AG Block Reservations
@@ -210,6 +211,9 @@ __xfs_ag_resv_init(
 	if (error) {
 		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
 				error, _RET_IP_);
+		xfs_warn(mp,
+"Per-AG reservation for AG %u failed.  Filesystem may run out of space.",
+				pag->pag_agno);
 		return error;
 	}
 
@@ -228,6 +232,8 @@ int
 xfs_ag_resv_init(
 	struct xfs_perag		*pag)
 {
+	struct xfs_mount		*mp = pag->pag_mount;
+	xfs_agnumber_t			agno = pag->pag_agno;
 	xfs_extlen_t			ask;
 	xfs_extlen_t			used;
 	int				error = 0;
@@ -236,23 +242,45 @@ xfs_ag_resv_init(
 	if (pag->pag_meta_resv.ar_asked == 0) {
 		ask = used = 0;
 
-		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
-				pag->pag_agno, &ask, &used);
+		error = xfs_refcountbt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
 
-		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
-				ask, used);
+		error = xfs_finobt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
+
+		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
+				ask, used);
+		if (error) {
+			/*
+			 * Because we didn't have per-AG reservations when the
+			 * finobt feature was added we might not be able to
+			 * reserve all needed blocks.  Warn and fall back to the
+			 * old and potentially buggy code in that case, but
+			 * ensure we do have the reservation for the refcountbt.
+			 */
+			ask = used = 0;
+
+			mp->m_inotbt_nores = true;
+
+			error = xfs_refcountbt_calc_reserves(mp, agno, &ask,
+					&used);
+			if (error)
+				goto out;
+
+			error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
+					ask, used);
+			if (error)
+				goto out;
+		}
 	}
 
 	/* Create the AGFL metadata reservation */
 	if (pag->pag_agfl_resv.ar_asked == 0) {
 		ask = used = 0;
 
-		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
-				&ask, &used);
+		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
 
@@ -261,9 +289,16 @@ xfs_ag_resv_init(
 			goto out;
 	}
 
+#ifdef DEBUG
+	/* need to read in the AGF for the ASSERT below to work */
+	error = xfs_alloc_pagf_init(pag->pag_mount, NULL, pag->pag_agno, 0);
+	if (error)
+		return error;
+
 	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
 	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
 	       pag->pagf_freeblks + pag->pagf_flcount);
+#endif
 out:
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 0fd086d..7c47188 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -82,11 +82,12 @@ xfs_finobt_set_root(
 }
 
 STATIC int
-xfs_inobt_alloc_block(
+__xfs_inobt_alloc_block(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_ptr	*start,
 	union xfs_btree_ptr	*new,
-	int			*stat)
+	int			*stat,
+	enum xfs_ag_resv_type	resv)
 {
 	xfs_alloc_arg_t		args;		/* block allocation args */
 	int			error;		/* error return value */
@@ -103,6 +104,7 @@ xfs_inobt_alloc_block(
 	args.maxlen = 1;
 	args.prod = 1;
 	args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	args.resv = resv;
 
 	error = xfs_alloc_vextent(&args);
 	if (error) {
@@ -123,6 +125,27 @@ xfs_inobt_alloc_block(
 }
 
 STATIC int
+xfs_inobt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	return __xfs_inobt_alloc_block(cur, start, new, stat, XFS_AG_RESV_NONE);
+}
+
+STATIC int
+xfs_finobt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	return __xfs_inobt_alloc_block(cur, start, new, stat,
+			XFS_AG_RESV_METADATA);
+}
+
+STATIC int
 xfs_inobt_free_block(
 	struct xfs_btree_cur	*cur,
 	struct xfs_buf		*bp)
@@ -328,7 +351,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 
 	.dup_cursor		= xfs_inobt_dup_cursor,
 	.set_root		= xfs_finobt_set_root,
-	.alloc_block		= xfs_inobt_alloc_block,
+	.alloc_block		= xfs_finobt_alloc_block,
 	.free_block		= xfs_inobt_free_block,
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
@@ -480,3 +503,64 @@ xfs_inobt_rec_check_count(
 	return 0;
 }
 #endif	/* DEBUG */
+
+static xfs_extlen_t
+xfs_inobt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_inobt_mxr[0] == 0)
+		return 0;
+
+	return xfs_btree_calc_size(mp, mp->m_inobt_mnr,
+		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
+				XFS_INODES_PER_CHUNK);
+}
+
+static int
+xfs_inobt_count_blocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_btnum_t		btnum,
+	xfs_extlen_t		*tree_blocks)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
+	if (error)
+		return error;
+
+	cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno, btnum);
+	error = xfs_btree_count_blocks(cur, tree_blocks);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_buf_relse(agbp);
+
+	return error;
+}
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_finobt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	xfs_extlen_t		tree_len = 0;
+	int			error;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	error = xfs_inobt_count_blocks(mp, agno, XFS_BTNUM_FINO, &tree_len);
+	if (error)
+		return error;
+
+	*ask += xfs_inobt_max_size(mp);
+	*used += tree_len;
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index bd88453..aa81e2e 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
 #define xfs_inobt_rec_check_count(mp, rec)	0
 #endif	/* DEBUG */
 
+int xfs_finobt_calc_reserves(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b955779..de32f0f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1792,22 +1792,23 @@ xfs_inactive_ifree(
 	int			error;
 
 	/*
-	 * The ifree transaction might need to allocate blocks for record
-	 * insertion to the finobt. We don't want to fail here at ENOSPC, so
-	 * allow ifree to dip into the reserved block pool if necessary.
-	 *
-	 * Freeing large sets of inodes generally means freeing inode chunks,
-	 * directory and file data blocks, so this should be relatively safe.
-	 * Only under severe circumstances should it be possible to free enough
-	 * inodes to exhaust the reserve block pool via finobt expansion while
-	 * at the same time not creating free space in the filesystem.
+	 * We try to use a per-AG reservation for any block needed by the finobt
+	 * tree, but as the finobt feature predates the per-AG reservation
+	 * support a degraded file system might not have enough space for the
+	 * reservation at mount time.  In that case try to dip into the reserved
+	 * pool and pray.
 	 *
 	 * Send a warning if the reservation does happen to fail, as the inode
 	 * now remains allocated and sits on the unlinked list until the fs is
 	 * repaired.
 	 */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
-			XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (unlikely(mp->m_inotbt_nores)) {
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
+				XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE,
+				&tp);
+	} else {
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree, 0, 0, 0, &tp);
+	}
 	if (error) {
 		if (error == -ENOSPC) {
 			xfs_warn_ratelimited(mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 84f7852..7f351f7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -140,6 +140,7 @@ typedef struct xfs_mount {
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
 	uint			m_dmevmask;	/* DMI events for this FS */
 	__uint64_t		m_flags;	/* global mount flags */
+	bool			m_inotbt_nores; /* no per-AG finobt resv. */
 	int			m_ialloc_inos;	/* inodes in inode allocation */
 	int			m_ialloc_blks;	/* blocks in inode allocation */
 	int			m_ialloc_min_blks;/* min blocks in sparse inode
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-25  9:14 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
@ 2017-01-25 15:58   ` Darrick J. Wong
  2017-01-27  5:08   ` Eric Sandeen
  1 sibling, 0 replies; 11+ messages in thread
From: Darrick J. Wong @ 2017-01-25 15:58 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bfoster, linux-xfs

On Wed, Jan 25, 2017 at 10:14:27AM +0100, Christoph Hellwig wrote:
> Currently we try to rely on the global reserved block pool for block
> allocations for the free inode btree, but I have customer reports
> (fairly complex workload, need to find an easier reproducer) where that
> is not enough as the AG where we free an inode that requires a new
> finobt block is entirely full.  This causes us to cancel a dirty
> transaction and thus a file system shutdown.
> 
> I think the right way to guard against this is to treat the finot the same
> way as the refcount btree and have a per-AG reservations for the possible
> worst case size of it, and the patch below implements that.
> 
> Note that this could increase mount times with large finobt trees.  In
> an ideal world we would have added a field for the number of finobt
> fields to the AGI, similar to what we did for the refcount blocks.
> We should do add it next time we rev the AGI or AGF format by adding
> new fields.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>

Looks reasonable enough now, I think;
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

Queued up for -rc6.

--D

> ---
>  fs/xfs/libxfs/xfs_ag_resv.c      | 47 ++++++++++++++++++---
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 90 ++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.h |  3 ++
>  fs/xfs/xfs_inode.c               | 23 +++++-----
>  fs/xfs/xfs_mount.h               |  1 +
>  5 files changed, 144 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 94234bf..33db69b 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -39,6 +39,7 @@
>  #include "xfs_rmap_btree.h"
>  #include "xfs_btree.h"
>  #include "xfs_refcount_btree.h"
> +#include "xfs_ialloc_btree.h"
>  
>  /*
>   * Per-AG Block Reservations
> @@ -210,6 +211,9 @@ __xfs_ag_resv_init(
>  	if (error) {
>  		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
>  				error, _RET_IP_);
> +		xfs_warn(mp,
> +"Per-AG reservation for AG %u failed.  Filesystem may run out of space.",
> +				pag->pag_agno);
>  		return error;
>  	}
>  
> @@ -228,6 +232,8 @@ int
>  xfs_ag_resv_init(
>  	struct xfs_perag		*pag)
>  {
> +	struct xfs_mount		*mp = pag->pag_mount;
> +	xfs_agnumber_t			agno = pag->pag_agno;
>  	xfs_extlen_t			ask;
>  	xfs_extlen_t			used;
>  	int				error = 0;
> @@ -236,23 +242,45 @@ xfs_ag_resv_init(
>  	if (pag->pag_meta_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> -				pag->pag_agno, &ask, &used);
> +		error = xfs_refcountbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> -		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> -				ask, used);
> +		error = xfs_finobt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
> +
> +		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +				ask, used);
> +		if (error) {
> +			/*
> +			 * Because we didn't have per-AG reservations when the
> +			 * finobt feature was added we might not be able to
> +			 * reserve all needed blocks.  Warn and fall back to the
> +			 * old and potentially buggy code in that case, but
> +			 * ensure we do have the reservation for the refcountbt.
> +			 */
> +			ask = used = 0;
> +
> +			mp->m_inotbt_nores = true;
> +
> +			error = xfs_refcountbt_calc_reserves(mp, agno, &ask,
> +					&used);
> +			if (error)
> +				goto out;
> +
> +			error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +					ask, used);
> +			if (error)
> +				goto out;
> +		}
>  	}
>  
>  	/* Create the AGFL metadata reservation */
>  	if (pag->pag_agfl_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> -				&ask, &used);
> +		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> @@ -261,9 +289,16 @@ xfs_ag_resv_init(
>  			goto out;
>  	}
>  
> +#ifdef DEBUG
> +	/* need to read in the AGF for the ASSERT below to work */
> +	error = xfs_alloc_pagf_init(pag->pag_mount, NULL, pag->pag_agno, 0);
> +	if (error)
> +		return error;
> +
>  	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
>  	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
>  	       pag->pagf_freeblks + pag->pagf_flcount);
> +#endif
>  out:
>  	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 0fd086d..7c47188 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -82,11 +82,12 @@ xfs_finobt_set_root(
>  }
>  
>  STATIC int
> -xfs_inobt_alloc_block(
> +__xfs_inobt_alloc_block(
>  	struct xfs_btree_cur	*cur,
>  	union xfs_btree_ptr	*start,
>  	union xfs_btree_ptr	*new,
> -	int			*stat)
> +	int			*stat,
> +	enum xfs_ag_resv_type	resv)
>  {
>  	xfs_alloc_arg_t		args;		/* block allocation args */
>  	int			error;		/* error return value */
> @@ -103,6 +104,7 @@ xfs_inobt_alloc_block(
>  	args.maxlen = 1;
>  	args.prod = 1;
>  	args.type = XFS_ALLOCTYPE_NEAR_BNO;
> +	args.resv = resv;
>  
>  	error = xfs_alloc_vextent(&args);
>  	if (error) {
> @@ -123,6 +125,27 @@ xfs_inobt_alloc_block(
>  }
>  
>  STATIC int
> +xfs_inobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat, XFS_AG_RESV_NONE);
> +}
> +
> +STATIC int
> +xfs_finobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat,
> +			XFS_AG_RESV_METADATA);
> +}
> +
> +STATIC int
>  xfs_inobt_free_block(
>  	struct xfs_btree_cur	*cur,
>  	struct xfs_buf		*bp)
> @@ -328,7 +351,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
>  
>  	.dup_cursor		= xfs_inobt_dup_cursor,
>  	.set_root		= xfs_finobt_set_root,
> -	.alloc_block		= xfs_inobt_alloc_block,
> +	.alloc_block		= xfs_finobt_alloc_block,
>  	.free_block		= xfs_inobt_free_block,
>  	.get_minrecs		= xfs_inobt_get_minrecs,
>  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> @@ -480,3 +503,64 @@ xfs_inobt_rec_check_count(
>  	return 0;
>  }
>  #endif	/* DEBUG */
> +
> +static xfs_extlen_t
> +xfs_inobt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_inobt_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_btree_calc_size(mp, mp->m_inobt_mnr,
> +		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
> +				XFS_INODES_PER_CHUNK);
> +}
> +
> +static int
> +xfs_inobt_count_blocks(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_btnum_t		btnum,
> +	xfs_extlen_t		*tree_blocks)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno, btnum);
> +	error = xfs_btree_count_blocks(cur, tree_blocks);
> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_buf_relse(agbp);
> +
> +	return error;
> +}
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_finobt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	xfs_extlen_t		tree_len = 0;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;
> +
> +	error = xfs_inobt_count_blocks(mp, agno, XFS_BTNUM_FINO, &tree_len);
> +	if (error)
> +		return error;
> +
> +	*ask += xfs_inobt_max_size(mp);
> +	*used += tree_len;
> +	return 0;
> +}
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
> index bd88453..aa81e2e 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
> @@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
>  #define xfs_inobt_rec_check_count(mp, rec)	0
>  #endif	/* DEBUG */
>  
> +int xfs_finobt_calc_reserves(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_IALLOC_BTREE_H__ */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b955779..de32f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1792,22 +1792,23 @@ xfs_inactive_ifree(
>  	int			error;
>  
>  	/*
> -	 * The ifree transaction might need to allocate blocks for record
> -	 * insertion to the finobt. We don't want to fail here at ENOSPC, so
> -	 * allow ifree to dip into the reserved block pool if necessary.
> -	 *
> -	 * Freeing large sets of inodes generally means freeing inode chunks,
> -	 * directory and file data blocks, so this should be relatively safe.
> -	 * Only under severe circumstances should it be possible to free enough
> -	 * inodes to exhaust the reserve block pool via finobt expansion while
> -	 * at the same time not creating free space in the filesystem.
> +	 * We try to use a per-AG reservation for any block needed by the finobt
> +	 * tree, but as the finobt feature predates the per-AG reservation
> +	 * support a degraded file system might not have enough space for the
> +	 * reservation at mount time.  In that case try to dip into the reserved
> +	 * pool and pray.
>  	 *
>  	 * Send a warning if the reservation does happen to fail, as the inode
>  	 * now remains allocated and sits on the unlinked list until the fs is
>  	 * repaired.
>  	 */
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> -			XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
> +	if (unlikely(mp->m_inotbt_nores)) {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> +				XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE,
> +				&tp);
> +	} else {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree, 0, 0, 0, &tp);
> +	}
>  	if (error) {
>  		if (error == -ENOSPC) {
>  			xfs_warn_ratelimited(mp,
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 84f7852..7f351f7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -140,6 +140,7 @@ typedef struct xfs_mount {
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
>  	uint			m_dmevmask;	/* DMI events for this FS */
>  	__uint64_t		m_flags;	/* global mount flags */
> +	bool			m_inotbt_nores; /* no per-AG finobt resv. */
>  	int			m_ialloc_inos;	/* inodes in inode allocation */
>  	int			m_ialloc_blks;	/* blocks in inode allocation */
>  	int			m_ialloc_min_blks;/* min blocks in sparse inode
> -- 
> 2.1.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-25  9:14 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
  2017-01-25 15:58   ` Darrick J. Wong
@ 2017-01-27  5:08   ` Eric Sandeen
  2017-01-27 14:59     ` Christoph Hellwig
  1 sibling, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2017-01-27  5:08 UTC (permalink / raw)
  To: Christoph Hellwig, darrick.wong; +Cc: bfoster, linux-xfs

On 1/25/17 3:14 AM, Christoph Hellwig wrote:
> Currently we try to rely on the global reserved block pool for block
> allocations for the free inode btree, but I have customer reports
> (fairly complex workload, need to find an easier reproducer) where that
> is not enough as the AG where we free an inode that requires a new
> finobt block is entirely full.  This causes us to cancel a dirty
> transaction and thus a file system shutdown.
> 
> I think the right way to guard against this is to treat the finot the same
> way as the refcount btree and have a per-AG reservations for the possible
> worst case size of it, and the patch below implements that.
> 
> Note that this could increase mount times with large finobt trees.  In
> an ideal world we would have added a field for the number of finobt
> fields to the AGI, similar to what we did for the refcount blocks.
> We should do add it next time we rev the AGI or AGF format by adding
> new fields.

This seems to cause regressions for me, at least on generic/108, xfs/004

haven't dug into it tonight, but just a heads up to see if anyone else
sees the same thing...

-Eric

> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c      | 47 ++++++++++++++++++---
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 90 ++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.h |  3 ++
>  fs/xfs/xfs_inode.c               | 23 +++++-----
>  fs/xfs/xfs_mount.h               |  1 +
>  5 files changed, 144 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 94234bf..33db69b 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -39,6 +39,7 @@
>  #include "xfs_rmap_btree.h"
>  #include "xfs_btree.h"
>  #include "xfs_refcount_btree.h"
> +#include "xfs_ialloc_btree.h"
>  
>  /*
>   * Per-AG Block Reservations
> @@ -210,6 +211,9 @@ __xfs_ag_resv_init(
>  	if (error) {
>  		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
>  				error, _RET_IP_);
> +		xfs_warn(mp,
> +"Per-AG reservation for AG %u failed.  Filesystem may run out of space.",
> +				pag->pag_agno);
>  		return error;
>  	}
>  
> @@ -228,6 +232,8 @@ int
>  xfs_ag_resv_init(
>  	struct xfs_perag		*pag)
>  {
> +	struct xfs_mount		*mp = pag->pag_mount;
> +	xfs_agnumber_t			agno = pag->pag_agno;
>  	xfs_extlen_t			ask;
>  	xfs_extlen_t			used;
>  	int				error = 0;
> @@ -236,23 +242,45 @@ xfs_ag_resv_init(
>  	if (pag->pag_meta_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> -				pag->pag_agno, &ask, &used);
> +		error = xfs_refcountbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> -		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> -				ask, used);
> +		error = xfs_finobt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
> +
> +		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +				ask, used);
> +		if (error) {
> +			/*
> +			 * Because we didn't have per-AG reservations when the
> +			 * finobt feature was added we might not be able to
> +			 * reserve all needed blocks.  Warn and fall back to the
> +			 * old and potentially buggy code in that case, but
> +			 * ensure we do have the reservation for the refcountbt.
> +			 */
> +			ask = used = 0;
> +
> +			mp->m_inotbt_nores = true;
> +
> +			error = xfs_refcountbt_calc_reserves(mp, agno, &ask,
> +					&used);
> +			if (error)
> +				goto out;
> +
> +			error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +					ask, used);
> +			if (error)
> +				goto out;
> +		}
>  	}
>  
>  	/* Create the AGFL metadata reservation */
>  	if (pag->pag_agfl_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> -				&ask, &used);
> +		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> @@ -261,9 +289,16 @@ xfs_ag_resv_init(
>  			goto out;
>  	}
>  
> +#ifdef DEBUG
> +	/* need to read in the AGF for the ASSERT below to work */
> +	error = xfs_alloc_pagf_init(pag->pag_mount, NULL, pag->pag_agno, 0);
> +	if (error)
> +		return error;
> +
>  	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
>  	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
>  	       pag->pagf_freeblks + pag->pagf_flcount);
> +#endif
>  out:
>  	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 0fd086d..7c47188 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -82,11 +82,12 @@ xfs_finobt_set_root(
>  }
>  
>  STATIC int
> -xfs_inobt_alloc_block(
> +__xfs_inobt_alloc_block(
>  	struct xfs_btree_cur	*cur,
>  	union xfs_btree_ptr	*start,
>  	union xfs_btree_ptr	*new,
> -	int			*stat)
> +	int			*stat,
> +	enum xfs_ag_resv_type	resv)
>  {
>  	xfs_alloc_arg_t		args;		/* block allocation args */
>  	int			error;		/* error return value */
> @@ -103,6 +104,7 @@ xfs_inobt_alloc_block(
>  	args.maxlen = 1;
>  	args.prod = 1;
>  	args.type = XFS_ALLOCTYPE_NEAR_BNO;
> +	args.resv = resv;
>  
>  	error = xfs_alloc_vextent(&args);
>  	if (error) {
> @@ -123,6 +125,27 @@ xfs_inobt_alloc_block(
>  }
>  
>  STATIC int
> +xfs_inobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat, XFS_AG_RESV_NONE);
> +}
> +
> +STATIC int
> +xfs_finobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat,
> +			XFS_AG_RESV_METADATA);
> +}
> +
> +STATIC int
>  xfs_inobt_free_block(
>  	struct xfs_btree_cur	*cur,
>  	struct xfs_buf		*bp)
> @@ -328,7 +351,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
>  
>  	.dup_cursor		= xfs_inobt_dup_cursor,
>  	.set_root		= xfs_finobt_set_root,
> -	.alloc_block		= xfs_inobt_alloc_block,
> +	.alloc_block		= xfs_finobt_alloc_block,
>  	.free_block		= xfs_inobt_free_block,
>  	.get_minrecs		= xfs_inobt_get_minrecs,
>  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> @@ -480,3 +503,64 @@ xfs_inobt_rec_check_count(
>  	return 0;
>  }
>  #endif	/* DEBUG */
> +
> +static xfs_extlen_t
> +xfs_inobt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_inobt_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_btree_calc_size(mp, mp->m_inobt_mnr,
> +		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
> +				XFS_INODES_PER_CHUNK);
> +}
> +
> +static int
> +xfs_inobt_count_blocks(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_btnum_t		btnum,
> +	xfs_extlen_t		*tree_blocks)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno, btnum);
> +	error = xfs_btree_count_blocks(cur, tree_blocks);
> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_buf_relse(agbp);
> +
> +	return error;
> +}
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_finobt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	xfs_extlen_t		tree_len = 0;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;
> +
> +	error = xfs_inobt_count_blocks(mp, agno, XFS_BTNUM_FINO, &tree_len);
> +	if (error)
> +		return error;
> +
> +	*ask += xfs_inobt_max_size(mp);
> +	*used += tree_len;
> +	return 0;
> +}
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
> index bd88453..aa81e2e 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
> @@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
>  #define xfs_inobt_rec_check_count(mp, rec)	0
>  #endif	/* DEBUG */
>  
> +int xfs_finobt_calc_reserves(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_IALLOC_BTREE_H__ */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b955779..de32f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1792,22 +1792,23 @@ xfs_inactive_ifree(
>  	int			error;
>  
>  	/*
> -	 * The ifree transaction might need to allocate blocks for record
> -	 * insertion to the finobt. We don't want to fail here at ENOSPC, so
> -	 * allow ifree to dip into the reserved block pool if necessary.
> -	 *
> -	 * Freeing large sets of inodes generally means freeing inode chunks,
> -	 * directory and file data blocks, so this should be relatively safe.
> -	 * Only under severe circumstances should it be possible to free enough
> -	 * inodes to exhaust the reserve block pool via finobt expansion while
> -	 * at the same time not creating free space in the filesystem.
> +	 * We try to use a per-AG reservation for any block needed by the finobt
> +	 * tree, but as the finobt feature predates the per-AG reservation
> +	 * support a degraded file system might not have enough space for the
> +	 * reservation at mount time.  In that case try to dip into the reserved
> +	 * pool and pray.
>  	 *
>  	 * Send a warning if the reservation does happen to fail, as the inode
>  	 * now remains allocated and sits on the unlinked list until the fs is
>  	 * repaired.
>  	 */
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> -			XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
> +	if (unlikely(mp->m_inotbt_nores)) {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> +				XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE,
> +				&tp);
> +	} else {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree, 0, 0, 0, &tp);
> +	}
>  	if (error) {
>  		if (error == -ENOSPC) {
>  			xfs_warn_ratelimited(mp,
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 84f7852..7f351f7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -140,6 +140,7 @@ typedef struct xfs_mount {
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
>  	uint			m_dmevmask;	/* DMI events for this FS */
>  	__uint64_t		m_flags;	/* global mount flags */
> +	bool			m_inotbt_nores; /* no per-AG finobt resv. */
>  	int			m_ialloc_inos;	/* inodes in inode allocation */
>  	int			m_ialloc_blks;	/* blocks in inode allocation */
>  	int			m_ialloc_min_blks;/* min blocks in sparse inode
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-27  5:08   ` Eric Sandeen
@ 2017-01-27 14:59     ` Christoph Hellwig
  2017-01-27 15:43       ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-27 14:59 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: Christoph Hellwig, darrick.wong, bfoster, linux-xfs

On Thu, Jan 26, 2017 at 11:08:00PM -0600, Eric Sandeen wrote:
> This seems to cause regressions for me, at least on generic/108, xfs/004

xfs/004 isn't really broken, it just needs a change every time we
tune some reservations.   I've been pondering to just drop that
sub-check.

I can't run generic/108 as it requires a modular scsi_debug, and
none of my test setups uses modular kernels, so I'm curious on how
that could get broken by this change.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-27 14:59     ` Christoph Hellwig
@ 2017-01-27 15:43       ` Eric Sandeen
  2017-01-27 17:26         ` Eric Sandeen
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Sandeen @ 2017-01-27 15:43 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: darrick.wong, bfoster, linux-xfs

On 1/27/17 8:59 AM, Christoph Hellwig wrote:
> On Thu, Jan 26, 2017 at 11:08:00PM -0600, Eric Sandeen wrote:
>> This seems to cause regressions for me, at least on generic/108, xfs/004
> 
> xfs/004 isn't really broken, it just needs a change every time we
> tune some reservations.   I've been pondering to just drop that
> sub-check.

Ok yeah, I realized that this morning, sorry for the noise.
 
> I can't run generic/108 as it requires a modular scsi_debug, and
> none of my test setups uses modular kernels, so I'm curious on how
> that could get broken by this change.

I'll keep looking to be sure it's legit.  It didn't really make sense
to me either, but ... yup, i'm seeing it.  the fsync error doesn't
come through as expected.  I'll keep looking into it.

oh, btw:

>     Note that this could increase mount times with large finobt trees.  In
>     an ideal world we would have added a field for the number of finobt
>     fields to the AGI, similar to what we did for the refcount blocks.
>     We should do add it next time we rev the AGI or AGF format by adding
>     new fields.

maybe should stick a comment in there as a reminder? :)

-Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-27 15:43       ` Eric Sandeen
@ 2017-01-27 17:26         ` Eric Sandeen
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Sandeen @ 2017-01-27 17:26 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: darrick.wong, bfoster, linux-xfs

On 1/27/17 9:43 AM, Eric Sandeen wrote:

> I'll keep looking to be sure it's legit.  It didn't really make sense
> to me either, but ... yup, i'm seeing it.  the fsync error doesn't
> come through as expected.  I'll keep looking into it.

Ok finally had time to actually look.  generic/108 is doing IO
across a stripe and failing one disk; it seems that the reservations
have moved the IO pattern in such a way that the fsync'd IO
does not hit the disk that failed, so the fsync succeeds.

If I bump up the write prior to the fsync frmo 6M to 8M, it fails
again as expected.

So I'm not quite sure how a 6M write on a 4M stripe device
doesn't hit the failing device, but a larger IO makes the test
fail again - so I'll chalk this up to a problematic assumption
in the test, not a problem with the patch.

Thanks,
-Eric

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-24 20:42   ` Darrick J. Wong
@ 2017-01-25  9:13     ` Christoph Hellwig
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-25  9:13 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Christoph Hellwig, bfoster, linux-xfs

On Tue, Jan 24, 2017 at 12:42:57PM -0800, Darrick J. Wong wrote:
> > +		if (error) {
> > +			/*
> > +			 * Because we didn't have per-AG reservations when the
> > +			 * finobt feature was added we might not ne able to
> > +			 * reservere all needed blocks.  Warn and fall back to
> 
> "...might not be able to reserve all..."

Fixed.

> So if __xfs_ag_resv_init() returns ENOSPC, we don't actually halt the
> mount attempt, which means that we can start operating with
> mp->m_inotbt_nores = false if there's not even enough space for the
> refcountbt-only reservation.  In theory this is impossible, but I think
> we'd be in a better defensive position by moving this to before the
> __xfs_ag_resv_init call.

Ok.

> >  		xfs_agnumber_t agno, struct xfs_buf **bpp);
> >  
> > -
> >  #endif	/* __XFS_IALLOC_H__ */
> 
> Unnecessary change?

Fixed.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-24 20:16 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
@ 2017-01-24 20:42   ` Darrick J. Wong
  2017-01-25  9:13     ` Christoph Hellwig
  0 siblings, 1 reply; 11+ messages in thread
From: Darrick J. Wong @ 2017-01-24 20:42 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: bfoster, linux-xfs

On Tue, Jan 24, 2017 at 09:16:20PM +0100, Christoph Hellwig wrote:
> Currently we try to rely on the global reserved block pool for block
> allocations for the free inode btree, but I have customer reports
> (fairly complex workload, need to find an easier reproducer) where that
> is not enough as the AG where we free an inode that requires a new
> finobt block is entirely full.  This causes us to cancel a dirty
> transaction and thus a file system shutdown.
> 
> I think the right way to guard against this is to treat the finot the same
> way as the refcount btree and have a per-AG reservations for the possible
> worst case size of it, and the patch below implements that.
> 
> Note that this could increase mount times with large finobt trees.  In
> an ideal world we would have added a field for the number of finobt
> fields to the AGI, similar to what we did for the refcount blocks.
> We should do add it next time we rev the AGI or AGF format by adding
> new fields.
> 
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c      | 48 ++++++++++++++++++---
>  fs/xfs/libxfs/xfs_ialloc.h       |  1 -
>  fs/xfs/libxfs/xfs_ialloc_btree.c | 90 ++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_ialloc_btree.h |  3 ++
>  fs/xfs/xfs_inode.c               | 23 +++++-----
>  fs/xfs/xfs_mount.h               |  1 +
>  6 files changed, 145 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 94234bf..be6c904 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -39,6 +39,7 @@
>  #include "xfs_rmap_btree.h"
>  #include "xfs_btree.h"
>  #include "xfs_refcount_btree.h"
> +#include "xfs_ialloc_btree.h"
>  
>  /*
>   * Per-AG Block Reservations
> @@ -210,6 +211,9 @@ __xfs_ag_resv_init(
>  	if (error) {
>  		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
>  				error, _RET_IP_);
> +		xfs_warn(mp,
> +"Per-AG reservation for AG %u failed.  Filesystem may run out of space.",
> +				pag->pag_agno);
>  		return error;
>  	}
>  
> @@ -228,6 +232,8 @@ int
>  xfs_ag_resv_init(
>  	struct xfs_perag		*pag)
>  {
> +	struct xfs_mount		*mp = pag->pag_mount;
> +	xfs_agnumber_t			agno = pag->pag_agno;
>  	xfs_extlen_t			ask;
>  	xfs_extlen_t			used;
>  	int				error = 0;
> @@ -236,23 +242,46 @@ xfs_ag_resv_init(
>  	if (pag->pag_meta_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
> -				pag->pag_agno, &ask, &used);
> +		error = xfs_refcountbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> -		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> -				ask, used);
> +		error = xfs_finobt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
> +
> +		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +				ask, used);
> +		if (error) {
> +			/*
> +			 * Because we didn't have per-AG reservations when the
> +			 * finobt feature was added we might not ne able to
> +			 * reservere all needed blocks.  Warn and fall back to

"...might not be able to reserve all..."

> +			 * the old and potentially buggy code in that case,
> +			 * but ensure we do have the reservation for the
> +			 * refcountbt.
> +			 */
> +			ask = used = 0;
> +
> +			error = xfs_refcountbt_calc_reserves(mp, agno, &ask,
> +					&used);
> +			if (error)
> +				goto out;
> +
> +			error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
> +					ask, used);
> +			if (error)
> +				goto out;
> +
> +			mp->m_inotbt_nores = true;

So if __xfs_ag_resv_init() returns ENOSPC, we don't actually halt the
mount attempt, which means that we can start operating with
mp->m_inotbt_nores = false if there's not even enough space for the
refcountbt-only reservation.  In theory this is impossible, but I think
we'd be in a better defensive position by moving this to before the
__xfs_ag_resv_init call.

> +		}
>  	}
>  
>  	/* Create the AGFL metadata reservation */
>  	if (pag->pag_agfl_resv.ar_asked == 0) {
>  		ask = used = 0;
>  
> -		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
> -				&ask, &used);
> +		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
>  		if (error)
>  			goto out;
>  
> @@ -261,9 +290,16 @@ xfs_ag_resv_init(
>  			goto out;
>  	}
>  
> +#ifdef DEBUG
> +	/* need to read in the AGF for the ASSERT below to work */
> +	error = xfs_alloc_pagf_init(pag->pag_mount, NULL, pag->pag_agno, 0);
> +	if (error)
> +		return error;
> +
>  	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
>  	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
>  	       pag->pagf_freeblks + pag->pagf_flcount);
> +#endif
>  out:
>  	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
> index 0bb8966..3f24725 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.h
> +++ b/fs/xfs/libxfs/xfs_ialloc.h
> @@ -168,5 +168,4 @@ int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
>  int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
>  		xfs_agnumber_t agno, struct xfs_buf **bpp);
>  
> -
>  #endif	/* __XFS_IALLOC_H__ */

Unnecessary change?

--D

> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
> index 0fd086d..7c47188 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.c
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
> @@ -82,11 +82,12 @@ xfs_finobt_set_root(
>  }
>  
>  STATIC int
> -xfs_inobt_alloc_block(
> +__xfs_inobt_alloc_block(
>  	struct xfs_btree_cur	*cur,
>  	union xfs_btree_ptr	*start,
>  	union xfs_btree_ptr	*new,
> -	int			*stat)
> +	int			*stat,
> +	enum xfs_ag_resv_type	resv)
>  {
>  	xfs_alloc_arg_t		args;		/* block allocation args */
>  	int			error;		/* error return value */
> @@ -103,6 +104,7 @@ xfs_inobt_alloc_block(
>  	args.maxlen = 1;
>  	args.prod = 1;
>  	args.type = XFS_ALLOCTYPE_NEAR_BNO;
> +	args.resv = resv;
>  
>  	error = xfs_alloc_vextent(&args);
>  	if (error) {
> @@ -123,6 +125,27 @@ xfs_inobt_alloc_block(
>  }
>  
>  STATIC int
> +xfs_inobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat, XFS_AG_RESV_NONE);
> +}
> +
> +STATIC int
> +xfs_finobt_alloc_block(
> +	struct xfs_btree_cur	*cur,
> +	union xfs_btree_ptr	*start,
> +	union xfs_btree_ptr	*new,
> +	int			*stat)
> +{
> +	return __xfs_inobt_alloc_block(cur, start, new, stat,
> +			XFS_AG_RESV_METADATA);
> +}
> +
> +STATIC int
>  xfs_inobt_free_block(
>  	struct xfs_btree_cur	*cur,
>  	struct xfs_buf		*bp)
> @@ -328,7 +351,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
>  
>  	.dup_cursor		= xfs_inobt_dup_cursor,
>  	.set_root		= xfs_finobt_set_root,
> -	.alloc_block		= xfs_inobt_alloc_block,
> +	.alloc_block		= xfs_finobt_alloc_block,
>  	.free_block		= xfs_inobt_free_block,
>  	.get_minrecs		= xfs_inobt_get_minrecs,
>  	.get_maxrecs		= xfs_inobt_get_maxrecs,
> @@ -480,3 +503,64 @@ xfs_inobt_rec_check_count(
>  	return 0;
>  }
>  #endif	/* DEBUG */
> +
> +static xfs_extlen_t
> +xfs_inobt_max_size(
> +	struct xfs_mount	*mp)
> +{
> +	/* Bail out if we're uninitialized, which can happen in mkfs. */
> +	if (mp->m_inobt_mxr[0] == 0)
> +		return 0;
> +
> +	return xfs_btree_calc_size(mp, mp->m_inobt_mnr,
> +		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
> +				XFS_INODES_PER_CHUNK);
> +}
> +
> +static int
> +xfs_inobt_count_blocks(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_btnum_t		btnum,
> +	xfs_extlen_t		*tree_blocks)
> +{
> +	struct xfs_buf		*agbp;
> +	struct xfs_btree_cur	*cur;
> +	int			error;
> +
> +	error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
> +	if (error)
> +		return error;
> +
> +	cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno, btnum);
> +	error = xfs_btree_count_blocks(cur, tree_blocks);
> +	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
> +	xfs_buf_relse(agbp);
> +
> +	return error;
> +}
> +
> +/*
> + * Figure out how many blocks to reserve and how many are used by this btree.
> + */
> +int
> +xfs_finobt_calc_reserves(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno,
> +	xfs_extlen_t		*ask,
> +	xfs_extlen_t		*used)
> +{
> +	xfs_extlen_t		tree_len = 0;
> +	int			error;
> +
> +	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
> +		return 0;
> +
> +	error = xfs_inobt_count_blocks(mp, agno, XFS_BTNUM_FINO, &tree_len);
> +	if (error)
> +		return error;
> +
> +	*ask += xfs_inobt_max_size(mp);
> +	*used += tree_len;
> +	return 0;
> +}
> diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
> index bd88453..aa81e2e 100644
> --- a/fs/xfs/libxfs/xfs_ialloc_btree.h
> +++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
> @@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
>  #define xfs_inobt_rec_check_count(mp, rec)	0
>  #endif	/* DEBUG */
>  
> +int xfs_finobt_calc_reserves(struct xfs_mount *mp, xfs_agnumber_t agno,
> +		xfs_extlen_t *ask, xfs_extlen_t *used);
> +
>  #endif	/* __XFS_IALLOC_BTREE_H__ */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b955779..de32f0f 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1792,22 +1792,23 @@ xfs_inactive_ifree(
>  	int			error;
>  
>  	/*
> -	 * The ifree transaction might need to allocate blocks for record
> -	 * insertion to the finobt. We don't want to fail here at ENOSPC, so
> -	 * allow ifree to dip into the reserved block pool if necessary.
> -	 *
> -	 * Freeing large sets of inodes generally means freeing inode chunks,
> -	 * directory and file data blocks, so this should be relatively safe.
> -	 * Only under severe circumstances should it be possible to free enough
> -	 * inodes to exhaust the reserve block pool via finobt expansion while
> -	 * at the same time not creating free space in the filesystem.
> +	 * We try to use a per-AG reservation for any block needed by the finobt
> +	 * tree, but as the finobt feature predates the per-AG reservation
> +	 * support a degraded file system might not have enough space for the
> +	 * reservation at mount time.  In that case try to dip into the reserved
> +	 * pool and pray.
>  	 *
>  	 * Send a warning if the reservation does happen to fail, as the inode
>  	 * now remains allocated and sits on the unlinked list until the fs is
>  	 * repaired.
>  	 */
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> -			XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
> +	if (unlikely(mp->m_inotbt_nores)) {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
> +				XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE,
> +				&tp);
> +	} else {
> +		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree, 0, 0, 0, &tp);
> +	}
>  	if (error) {
>  		if (error == -ENOSPC) {
>  			xfs_warn_ratelimited(mp,
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 84f7852..7f351f7 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -140,6 +140,7 @@ typedef struct xfs_mount {
>  	int			m_fixedfsid[2];	/* unchanged for life of FS */
>  	uint			m_dmevmask;	/* DMI events for this FS */
>  	__uint64_t		m_flags;	/* global mount flags */
> +	bool			m_inotbt_nores; /* no per-AG finobt resv. */
>  	int			m_ialloc_inos;	/* inodes in inode allocation */
>  	int			m_ialloc_blks;	/* blocks in inode allocation */
>  	int			m_ialloc_min_blks;/* min blocks in sparse inode
> -- 
> 2.1.4
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 2/2] xfs: use per-AG reservations for the finobt
  2017-01-24 20:16 avoid running out of space during finobt insertation Christoph Hellwig
@ 2017-01-24 20:16 ` Christoph Hellwig
  2017-01-24 20:42   ` Darrick J. Wong
  0 siblings, 1 reply; 11+ messages in thread
From: Christoph Hellwig @ 2017-01-24 20:16 UTC (permalink / raw)
  To: darrick.wong; +Cc: bfoster, linux-xfs

Currently we try to rely on the global reserved block pool for block
allocations for the free inode btree, but I have customer reports
(fairly complex workload, need to find an easier reproducer) where that
is not enough as the AG where we free an inode that requires a new
finobt block is entirely full.  This causes us to cancel a dirty
transaction and thus a file system shutdown.

I think the right way to guard against this is to treat the finot the same
way as the refcount btree and have a per-AG reservations for the possible
worst case size of it, and the patch below implements that.

Note that this could increase mount times with large finobt trees.  In
an ideal world we would have added a field for the number of finobt
fields to the AGI, similar to what we did for the refcount blocks.
We should do add it next time we rev the AGI or AGF format by adding
new fields.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_ag_resv.c      | 48 ++++++++++++++++++---
 fs/xfs/libxfs/xfs_ialloc.h       |  1 -
 fs/xfs/libxfs/xfs_ialloc_btree.c | 90 ++++++++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_ialloc_btree.h |  3 ++
 fs/xfs/xfs_inode.c               | 23 +++++-----
 fs/xfs/xfs_mount.h               |  1 +
 6 files changed, 145 insertions(+), 21 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 94234bf..be6c904 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -39,6 +39,7 @@
 #include "xfs_rmap_btree.h"
 #include "xfs_btree.h"
 #include "xfs_refcount_btree.h"
+#include "xfs_ialloc_btree.h"
 
 /*
  * Per-AG Block Reservations
@@ -210,6 +211,9 @@ __xfs_ag_resv_init(
 	if (error) {
 		trace_xfs_ag_resv_init_error(pag->pag_mount, pag->pag_agno,
 				error, _RET_IP_);
+		xfs_warn(mp,
+"Per-AG reservation for AG %u failed.  Filesystem may run out of space.",
+				pag->pag_agno);
 		return error;
 	}
 
@@ -228,6 +232,8 @@ int
 xfs_ag_resv_init(
 	struct xfs_perag		*pag)
 {
+	struct xfs_mount		*mp = pag->pag_mount;
+	xfs_agnumber_t			agno = pag->pag_agno;
 	xfs_extlen_t			ask;
 	xfs_extlen_t			used;
 	int				error = 0;
@@ -236,23 +242,46 @@ xfs_ag_resv_init(
 	if (pag->pag_meta_resv.ar_asked == 0) {
 		ask = used = 0;
 
-		error = xfs_refcountbt_calc_reserves(pag->pag_mount,
-				pag->pag_agno, &ask, &used);
+		error = xfs_refcountbt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
 
-		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
-				ask, used);
+		error = xfs_finobt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
+
+		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
+				ask, used);
+		if (error) {
+			/*
+			 * Because we didn't have per-AG reservations when the
+			 * finobt feature was added we might not ne able to
+			 * reservere all needed blocks.  Warn and fall back to
+			 * the old and potentially buggy code in that case,
+			 * but ensure we do have the reservation for the
+			 * refcountbt.
+			 */
+			ask = used = 0;
+
+			error = xfs_refcountbt_calc_reserves(mp, agno, &ask,
+					&used);
+			if (error)
+				goto out;
+
+			error = __xfs_ag_resv_init(pag, XFS_AG_RESV_METADATA,
+					ask, used);
+			if (error)
+				goto out;
+
+			mp->m_inotbt_nores = true;
+		}
 	}
 
 	/* Create the AGFL metadata reservation */
 	if (pag->pag_agfl_resv.ar_asked == 0) {
 		ask = used = 0;
 
-		error = xfs_rmapbt_calc_reserves(pag->pag_mount, pag->pag_agno,
-				&ask, &used);
+		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
 
@@ -261,9 +290,16 @@ xfs_ag_resv_init(
 			goto out;
 	}
 
+#ifdef DEBUG
+	/* need to read in the AGF for the ASSERT below to work */
+	error = xfs_alloc_pagf_init(pag->pag_mount, NULL, pag->pag_agno, 0);
+	if (error)
+		return error;
+
 	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
 	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
 	       pag->pagf_freeblks + pag->pagf_flcount);
+#endif
 out:
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h
index 0bb8966..3f24725 100644
--- a/fs/xfs/libxfs/xfs_ialloc.h
+++ b/fs/xfs/libxfs/xfs_ialloc.h
@@ -168,5 +168,4 @@ int xfs_ialloc_inode_init(struct xfs_mount *mp, struct xfs_trans *tp,
 int xfs_read_agi(struct xfs_mount *mp, struct xfs_trans *tp,
 		xfs_agnumber_t agno, struct xfs_buf **bpp);
 
-
 #endif	/* __XFS_IALLOC_H__ */
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c
index 0fd086d..7c47188 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.c
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.c
@@ -82,11 +82,12 @@ xfs_finobt_set_root(
 }
 
 STATIC int
-xfs_inobt_alloc_block(
+__xfs_inobt_alloc_block(
 	struct xfs_btree_cur	*cur,
 	union xfs_btree_ptr	*start,
 	union xfs_btree_ptr	*new,
-	int			*stat)
+	int			*stat,
+	enum xfs_ag_resv_type	resv)
 {
 	xfs_alloc_arg_t		args;		/* block allocation args */
 	int			error;		/* error return value */
@@ -103,6 +104,7 @@ xfs_inobt_alloc_block(
 	args.maxlen = 1;
 	args.prod = 1;
 	args.type = XFS_ALLOCTYPE_NEAR_BNO;
+	args.resv = resv;
 
 	error = xfs_alloc_vextent(&args);
 	if (error) {
@@ -123,6 +125,27 @@ xfs_inobt_alloc_block(
 }
 
 STATIC int
+xfs_inobt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	return __xfs_inobt_alloc_block(cur, start, new, stat, XFS_AG_RESV_NONE);
+}
+
+STATIC int
+xfs_finobt_alloc_block(
+	struct xfs_btree_cur	*cur,
+	union xfs_btree_ptr	*start,
+	union xfs_btree_ptr	*new,
+	int			*stat)
+{
+	return __xfs_inobt_alloc_block(cur, start, new, stat,
+			XFS_AG_RESV_METADATA);
+}
+
+STATIC int
 xfs_inobt_free_block(
 	struct xfs_btree_cur	*cur,
 	struct xfs_buf		*bp)
@@ -328,7 +351,7 @@ static const struct xfs_btree_ops xfs_finobt_ops = {
 
 	.dup_cursor		= xfs_inobt_dup_cursor,
 	.set_root		= xfs_finobt_set_root,
-	.alloc_block		= xfs_inobt_alloc_block,
+	.alloc_block		= xfs_finobt_alloc_block,
 	.free_block		= xfs_inobt_free_block,
 	.get_minrecs		= xfs_inobt_get_minrecs,
 	.get_maxrecs		= xfs_inobt_get_maxrecs,
@@ -480,3 +503,64 @@ xfs_inobt_rec_check_count(
 	return 0;
 }
 #endif	/* DEBUG */
+
+static xfs_extlen_t
+xfs_inobt_max_size(
+	struct xfs_mount	*mp)
+{
+	/* Bail out if we're uninitialized, which can happen in mkfs. */
+	if (mp->m_inobt_mxr[0] == 0)
+		return 0;
+
+	return xfs_btree_calc_size(mp, mp->m_inobt_mnr,
+		(uint64_t)mp->m_sb.sb_agblocks * mp->m_sb.sb_inopblock /
+				XFS_INODES_PER_CHUNK);
+}
+
+static int
+xfs_inobt_count_blocks(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_btnum_t		btnum,
+	xfs_extlen_t		*tree_blocks)
+{
+	struct xfs_buf		*agbp;
+	struct xfs_btree_cur	*cur;
+	int			error;
+
+	error = xfs_ialloc_read_agi(mp, NULL, agno, &agbp);
+	if (error)
+		return error;
+
+	cur = xfs_inobt_init_cursor(mp, NULL, agbp, agno, btnum);
+	error = xfs_btree_count_blocks(cur, tree_blocks);
+	xfs_btree_del_cursor(cur, error ? XFS_BTREE_ERROR : XFS_BTREE_NOERROR);
+	xfs_buf_relse(agbp);
+
+	return error;
+}
+
+/*
+ * Figure out how many blocks to reserve and how many are used by this btree.
+ */
+int
+xfs_finobt_calc_reserves(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno,
+	xfs_extlen_t		*ask,
+	xfs_extlen_t		*used)
+{
+	xfs_extlen_t		tree_len = 0;
+	int			error;
+
+	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
+		return 0;
+
+	error = xfs_inobt_count_blocks(mp, agno, XFS_BTNUM_FINO, &tree_len);
+	if (error)
+		return error;
+
+	*ask += xfs_inobt_max_size(mp);
+	*used += tree_len;
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.h b/fs/xfs/libxfs/xfs_ialloc_btree.h
index bd88453..aa81e2e 100644
--- a/fs/xfs/libxfs/xfs_ialloc_btree.h
+++ b/fs/xfs/libxfs/xfs_ialloc_btree.h
@@ -72,4 +72,7 @@ int xfs_inobt_rec_check_count(struct xfs_mount *,
 #define xfs_inobt_rec_check_count(mp, rec)	0
 #endif	/* DEBUG */
 
+int xfs_finobt_calc_reserves(struct xfs_mount *mp, xfs_agnumber_t agno,
+		xfs_extlen_t *ask, xfs_extlen_t *used);
+
 #endif	/* __XFS_IALLOC_BTREE_H__ */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b955779..de32f0f 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1792,22 +1792,23 @@ xfs_inactive_ifree(
 	int			error;
 
 	/*
-	 * The ifree transaction might need to allocate blocks for record
-	 * insertion to the finobt. We don't want to fail here at ENOSPC, so
-	 * allow ifree to dip into the reserved block pool if necessary.
-	 *
-	 * Freeing large sets of inodes generally means freeing inode chunks,
-	 * directory and file data blocks, so this should be relatively safe.
-	 * Only under severe circumstances should it be possible to free enough
-	 * inodes to exhaust the reserve block pool via finobt expansion while
-	 * at the same time not creating free space in the filesystem.
+	 * We try to use a per-AG reservation for any block needed by the finobt
+	 * tree, but as the finobt feature predates the per-AG reservation
+	 * support a degraded file system might not have enough space for the
+	 * reservation at mount time.  In that case try to dip into the reserved
+	 * pool and pray.
 	 *
 	 * Send a warning if the reservation does happen to fail, as the inode
 	 * now remains allocated and sits on the unlinked list until the fs is
 	 * repaired.
 	 */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
-			XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE, &tp);
+	if (unlikely(mp->m_inotbt_nores)) {
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree,
+				XFS_IFREE_SPACE_RES(mp), 0, XFS_TRANS_RESERVE,
+				&tp);
+	} else {
+		error = xfs_trans_alloc(mp, &M_RES(mp)->tr_ifree, 0, 0, 0, &tp);
+	}
 	if (error) {
 		if (error == -ENOSPC) {
 			xfs_warn_ratelimited(mp,
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 84f7852..7f351f7 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -140,6 +140,7 @@ typedef struct xfs_mount {
 	int			m_fixedfsid[2];	/* unchanged for life of FS */
 	uint			m_dmevmask;	/* DMI events for this FS */
 	__uint64_t		m_flags;	/* global mount flags */
+	bool			m_inotbt_nores; /* no per-AG finobt resv. */
 	int			m_ialloc_inos;	/* inodes in inode allocation */
 	int			m_ialloc_blks;	/* blocks in inode allocation */
 	int			m_ialloc_min_blks;/* min blocks in sparse inode
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2017-01-27 17:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-25  9:14 avoid running out of space during finobt insertation Christoph Hellwig
2017-01-25  9:14 ` [PATCH 1/2] xfs: only update mount/resv fields on success in __xfs_ag_resv_init Christoph Hellwig
2017-01-25  9:14 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
2017-01-25 15:58   ` Darrick J. Wong
2017-01-27  5:08   ` Eric Sandeen
2017-01-27 14:59     ` Christoph Hellwig
2017-01-27 15:43       ` Eric Sandeen
2017-01-27 17:26         ` Eric Sandeen
  -- strict thread matches above, loose matches on Subject: below --
2017-01-24 20:16 avoid running out of space during finobt insertation Christoph Hellwig
2017-01-24 20:16 ` [PATCH 2/2] xfs: use per-AG reservations for the finobt Christoph Hellwig
2017-01-24 20:42   ` Darrick J. Wong
2017-01-25  9:13     ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.