* [PATCH 0/4] xfs: rmapbt block and perag reservation fixups
@ 2018-02-05 17:45 Brian Foster
  2018-02-05 17:45 ` [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation Brian Foster
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Brian Foster @ 2018-02-05 17:45 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's another version of the rmapbt perag reservation fixup. This
series splits the original patch into two: one for the
rename/non-functional changes and another for the logical fix. This
series also includes a couple of new patches related to an issue
encountered during testing of the previous patch. I have an xfstests
test for the latter issue sketched out that I'll follow up with once
it's formalized.

Thoughts, reviews, flames appreciated.

Brian

v1:
- Split original perag res. fix into two patches.
- Additional patches for block res. issue.
rfc: https://marc.info/?l=linux-xfs&m=151741074520114&w=2

Brian Foster (4):
  xfs: shutdown if block allocation overruns tx reservation
  xfs: account format bouncing into rmapbt swapext tx reservation
  xfs: rename agfl perag res type to rmapbt
  xfs: account only rmapbt-used blocks against rmapbt perag res

 fs/xfs/libxfs/xfs_ag_resv.c    | 39 ++++++++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_alloc.c      | 14 +++-----------
 fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
 fs/xfs/xfs_bmap_util.c         | 29 ++++++++++++++++++++---------
 fs/xfs/xfs_mount.h             | 11 ++++++-----
 fs/xfs/xfs_reflink.c           |  2 +-
 fs/xfs/xfs_trans.c             | 12 ++++++++----
 8 files changed, 95 insertions(+), 47 deletions(-)

-- 
2.13.6



* [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation
  2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
@ 2018-02-05 17:45 ` Brian Foster
  2018-02-08  1:42   ` Darrick J. Wong
  2018-02-05 17:45 ` [PATCH 2/4] xfs: account format bouncing into rmapbt swapext " Brian Foster
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-05 17:45 UTC (permalink / raw)
  To: linux-xfs

The ->t_blk_res_used field tracks how many blocks have been used in
the current transaction. This should never exceed the block
reservation (->t_blk_res) for a particular transaction. We currently
assert this condition in the transaction block accounting code, but
otherwise take no additional action should this situation occur.

The overrun generally has no effect if space ends up being available
and the associated transaction commits. If the transaction is
duplicated, however, the current block usage is used to determine
the remaining block reservation to be transferred to the new
transaction. If usage exceeds reservation, this calculation
underflows and creates a transaction with an invalid and excessive
reservation. When the second transaction commits, the release of
unused blocks corrupts the in-core free space counters. With lazy
superblock accounting enabled, this inconsistency eventually
trickles to the on-disk superblock and corrupts the filesystem.

Replace the transaction block usage accounting assert with an
explicit overrun check. If the transaction overruns the reservation,
shut down the filesystem immediately to prevent corruption. Add a new
assert to xfs_trans_dup() to catch any callers that might induce
this invalid state in the future.
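
For illustration, a standalone userspace C sketch (not the kernel
code; the block counts are made up) of why the overrun matters once
xfs_trans_dup() does its unsigned subtraction:

	#include <stdio.h>

	int main(void)
	{
		unsigned int blk_res = 100;	/* blocks reserved (hypothetical) */
		unsigned int blk_res_used = 103;	/* usage overran the reservation by 3 */

		/* mirrors: ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used; */
		unsigned int new_res = blk_res - blk_res_used;

		/* prints 4294967293 with a 32-bit unsigned int -- an absurd reservation */
		printf("duplicate tx reservation: %u blocks\n", new_res);
		return 0;
	}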

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 86f92df32c42..9f9f745c3dff 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -119,8 +119,11 @@ xfs_trans_dup(
 	/* We gave our writer reference to the new transaction */
 	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
 	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
+
+	ASSERT(tp->t_blk_res >= tp->t_blk_res_used);
 	ntp->t_blk_res = tp->t_blk_res - tp->t_blk_res_used;
 	tp->t_blk_res = tp->t_blk_res_used;
+
 	ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used;
 	tp->t_rtx_res = tp->t_rtx_res_used;
 	ntp->t_pflags = tp->t_pflags;
@@ -344,13 +347,14 @@ xfs_trans_mod_sb(
 		break;
 	case XFS_TRANS_SB_FDBLOCKS:
 		/*
-		 * Track the number of blocks allocated in the
-		 * transaction.  Make sure it does not exceed the
-		 * number reserved.
+		 * Track the number of blocks allocated in the transaction.
+		 * Make sure it does not exceed the number reserved. If so,
+		 * shutdown as this can lead to accounting inconsistency.
 		 */
 		if (delta < 0) {
 			tp->t_blk_res_used += (uint)-delta;
-			ASSERT(tp->t_blk_res_used <= tp->t_blk_res);
+			if (tp->t_blk_res_used > tp->t_blk_res)
+				xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
 		}
 		tp->t_fdblocks_delta += delta;
 		if (xfs_sb_version_haslazysbcount(&mp->m_sb))
-- 
2.13.6



* [PATCH 2/4] xfs: account format bouncing into rmapbt swapext tx reservation
  2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
  2018-02-05 17:45 ` [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation Brian Foster
@ 2018-02-05 17:45 ` Brian Foster
  2018-02-08  1:56   ` Darrick J. Wong
  2018-02-05 17:46 ` [PATCH 3/4] xfs: rename agfl perag res type to rmapbt Brian Foster
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-05 17:45 UTC (permalink / raw)
  To: linux-xfs

The extent swap mechanism requires a unique implementation for
rmapbt enabled filesystems. Because the rmapbt tracks extent owner
information, extent swap must individually unmap and remap each
extent between the two inodes.

The rmapbt extent swap transaction block reservation currently
accounts for the worst case bmapbt block and rmapbt block
consumption based on the extent count of each inode. There is a
corner case that exists due to the extent swap implementation that
is not covered by this reservation, however.

If one of the associated inodes is just over the max extent count
for extent format inodes (i.e., the inode is in btree format only
because of a single extent), the unmap/remap cycle of the extent
swap can bounce the inode between extent and btree format multiple
times, almost as many times as there are extents in the inode (if
the opposing inode happens to have one less, for example). Each
back-and-forth cycle involves a block free and allocation, which
isn't a problem except that the initial transaction reservation must
account for the total number of block allocations performed by the
chain of deferred operations. If not, a block reservation overrun
occurs and the filesystem shuts down.

Update the rmapbt extent swap block reservation to check for this
situation and add some block reservation slop to ensure the entire
operation succeeds. We're unlikely to ever need the extra
reservation for both inodes, as fsr wouldn't defrag the file in that
case, but the additional reservation is constrained by the data fork
size, so be cautious and check for both.
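
As an aside, here is a standalone userspace C sketch (not the kernel
code; the extent counts are hypothetical) of the slop calculation
this patch adds:

	#include <stdio.h>

	/*
	 * If an inode holds exactly one extent more than its extent-format
	 * limit, each unmap/remap cycle of the swap can bounce it between
	 * btree and extent format, costing a bmapbt block allocation per
	 * cycle, so pad the reservation by roughly the extent-format limit.
	 */
	static unsigned int bounce_slop(unsigned int nextents, unsigned int maxext)
	{
		return (nextents == maxext + 1) ? maxext : 0;
	}

	int main(void)
	{
		unsigned int resblks = 0;

		resblks += bounce_slop(10, 9);	/* straddles the boundary: +9 */
		resblks += bounce_slop(25, 9);	/* safely in btree format: +0 */
		printf("extra reservation: %u blocks\n", resblks);	/* 9 */
		return 0;
	}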

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_bmap_util.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index c83f549dc17b..e0a442f504e5 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1899,17 +1899,28 @@ xfs_swap_extents(
 	 * performed with log redo items!
 	 */
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		int		w	= XFS_DATA_FORK;
+		uint32_t	ipnext	= XFS_IFORK_NEXTENTS(ip, w);
+		uint32_t	tipnext	= XFS_IFORK_NEXTENTS(tip, w);
+
+		/*
+		 * Conceptually this shouldn't affect the shape of either bmbt,
+		 * but since we atomically move extents one by one, we reserve
+		 * enough space to rebuild both trees.
+		 */
+		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
+		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
+
 		/*
-		 * Conceptually this shouldn't affect the shape of either
-		 * bmbt, but since we atomically move extents one by one,
-		 * we reserve enough space to rebuild both trees.
+		 * Handle the corner case where either inode might straddle the
+		 * btree format boundary. If so, the inode could bounce between
+		 * btree <-> extent format on unmap -> remap cycles, freeing and
+		 * allocating a bmapbt block each time.
 		 */
-		resblks = XFS_SWAP_RMAP_SPACE_RES(mp,
-				XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK),
-				XFS_DATA_FORK) +
-			  XFS_SWAP_RMAP_SPACE_RES(mp,
-				XFS_IFORK_NEXTENTS(tip, XFS_DATA_FORK),
-				XFS_DATA_FORK);
+		if (ipnext == (XFS_IFORK_MAXEXT(ip, w) + 1))
+			resblks += XFS_IFORK_MAXEXT(ip, w);
+		if (tipnext == (XFS_IFORK_MAXEXT(tip, w) + 1))
+			resblks += XFS_IFORK_MAXEXT(tip, w);
 	}
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
-- 
2.13.6



* [PATCH 3/4] xfs: rename agfl perag res type to rmapbt
  2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
  2018-02-05 17:45 ` [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation Brian Foster
  2018-02-05 17:45 ` [PATCH 2/4] xfs: account format bouncing into rmapbt swapext " Brian Foster
@ 2018-02-05 17:46 ` Brian Foster
  2018-02-08  1:57   ` Darrick J. Wong
  2018-02-05 17:46 ` [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res Brian Foster
  2018-02-06 13:10 ` [PATCH] tests/xfs: rmapbt swapext block reservation overrun test Brian Foster
  4 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-05 17:46 UTC (permalink / raw)
  To: linux-xfs

The AGFL perag reservation type accounts all allocations that feed
into (or are released from) the allocation group free list (agfl).
The purpose of the reservation is to support worst case conditions
for the reverse mapping btree (rmapbt). As such, the agfl
reservation usage accounting only considers rmapbt usage when the
in-core counters are initialized at mount time.

This implementation inconsistency leads to divergence of the in-core
and on-disk usage accounting over time. In preparation to resolve
this inconsistency and adjust the AGFL reservation into an rmapbt
specific reservation, rename the AGFL reservation type and
associated accounting fields to something more rmapbt-specific. Also
fix up a couple of tracepoints that incorrectly use the AGFL
reservation type to pass the agfl state of the associated extent
where the raw reservation type is expected.

Note that this patch does not change perag reservation behavior.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c | 35 ++++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_alloc.c   | 20 +++++++++-----------
 fs/xfs/xfs_mount.h          | 10 +++++-----
 fs/xfs/xfs_reflink.c        |  2 +-
 4 files changed, 33 insertions(+), 34 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 2291f4224e24..0ca2e680034a 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -95,13 +95,13 @@ xfs_ag_resv_critical(
 
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
-		avail = pag->pagf_freeblks - pag->pag_agfl_resv.ar_reserved;
+		avail = pag->pagf_freeblks - pag->pag_rmapbt_resv.ar_reserved;
 		orig = pag->pag_meta_resv.ar_asked;
 		break;
-	case XFS_AG_RESV_AGFL:
+	case XFS_AG_RESV_RMAPBT:
 		avail = pag->pagf_freeblks + pag->pagf_flcount -
 			pag->pag_meta_resv.ar_reserved;
-		orig = pag->pag_agfl_resv.ar_asked;
+		orig = pag->pag_rmapbt_resv.ar_asked;
 		break;
 	default:
 		ASSERT(0);
@@ -126,10 +126,10 @@ xfs_ag_resv_needed(
 {
 	xfs_extlen_t			len;
 
-	len = pag->pag_meta_resv.ar_reserved + pag->pag_agfl_resv.ar_reserved;
+	len = pag->pag_meta_resv.ar_reserved + pag->pag_rmapbt_resv.ar_reserved;
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
-	case XFS_AG_RESV_AGFL:
+	case XFS_AG_RESV_RMAPBT:
 		len -= xfs_perag_resv(pag, type)->ar_reserved;
 		break;
 	case XFS_AG_RESV_NONE:
@@ -160,10 +160,11 @@ __xfs_ag_resv_free(
 	if (pag->pag_agno == 0)
 		pag->pag_mount->m_ag_max_usable += resv->ar_asked;
 	/*
-	 * AGFL blocks are always considered "free", so whatever
-	 * was reserved at mount time must be given back at umount.
+	 * RMAPBT blocks come from the AGFL and AGFL blocks are always
+	 * considered "free", so whatever was reserved at mount time must be
+	 * given back at umount.
 	 */
-	if (type == XFS_AG_RESV_AGFL)
+	if (type == XFS_AG_RESV_RMAPBT)
 		oldresv = resv->ar_orig_reserved;
 	else
 		oldresv = resv->ar_reserved;
@@ -185,7 +186,7 @@ xfs_ag_resv_free(
 	int				error;
 	int				err2;
 
-	error = __xfs_ag_resv_free(pag, XFS_AG_RESV_AGFL);
+	error = __xfs_ag_resv_free(pag, XFS_AG_RESV_RMAPBT);
 	err2 = __xfs_ag_resv_free(pag, XFS_AG_RESV_METADATA);
 	if (err2 && !error)
 		error = err2;
@@ -284,15 +285,15 @@ xfs_ag_resv_init(
 		}
 	}
 
-	/* Create the AGFL metadata reservation */
-	if (pag->pag_agfl_resv.ar_asked == 0) {
+	/* Create the RMAPBT metadata reservation */
+	if (pag->pag_rmapbt_resv.ar_asked == 0) {
 		ask = used = 0;
 
 		error = xfs_rmapbt_calc_reserves(mp, agno, &ask, &used);
 		if (error)
 			goto out;
 
-		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_AGFL, ask, used);
+		error = __xfs_ag_resv_init(pag, XFS_AG_RESV_RMAPBT, ask, used);
 		if (error)
 			goto out;
 	}
@@ -304,7 +305,7 @@ xfs_ag_resv_init(
 		return error;
 
 	ASSERT(xfs_perag_resv(pag, XFS_AG_RESV_METADATA)->ar_reserved +
-	       xfs_perag_resv(pag, XFS_AG_RESV_AGFL)->ar_reserved <=
+	       xfs_perag_resv(pag, XFS_AG_RESV_RMAPBT)->ar_reserved <=
 	       pag->pagf_freeblks + pag->pagf_flcount);
 #endif
 out:
@@ -326,7 +327,7 @@ xfs_ag_resv_alloc_extent(
 
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
-	case XFS_AG_RESV_AGFL:
+	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
 		break;
 	default:
@@ -341,7 +342,7 @@ xfs_ag_resv_alloc_extent(
 
 	len = min_t(xfs_extlen_t, args->len, resv->ar_reserved);
 	resv->ar_reserved -= len;
-	if (type == XFS_AG_RESV_AGFL)
+	if (type == XFS_AG_RESV_RMAPBT)
 		return;
 	/* Allocations of reserved blocks only need on-disk sb updates... */
 	xfs_trans_mod_sb(args->tp, XFS_TRANS_SB_RES_FDBLOCKS, -(int64_t)len);
@@ -366,7 +367,7 @@ xfs_ag_resv_free_extent(
 
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
-	case XFS_AG_RESV_AGFL:
+	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
 		break;
 	default:
@@ -379,7 +380,7 @@ xfs_ag_resv_free_extent(
 
 	leftover = min_t(xfs_extlen_t, len, resv->ar_asked - resv->ar_reserved);
 	resv->ar_reserved += leftover;
-	if (type == XFS_AG_RESV_AGFL)
+	if (type == XFS_AG_RESV_RMAPBT)
 		return;
 	/* Freeing into the reserved pool only requires on-disk update... */
 	xfs_trans_mod_sb(tp, XFS_TRANS_SB_RES_FDBLOCKS, len);
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index c02781a4c091..e53d55453dc5 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
 
 	ASSERT(args->len >= args->minlen);
 	ASSERT(args->len <= args->maxlen);
-	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
+	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
 	ASSERT(args->agbno % args->alignment == 0);
 
 	/* if not file data, insert new block into the reverse map btree */
@@ -1583,7 +1583,7 @@ xfs_alloc_ag_vextent_small(
 	 * freelist.
 	 */
 	else if (args->minlen == 1 && args->alignment == 1 &&
-		 args->resv != XFS_AG_RESV_AGFL &&
+		 args->resv != XFS_AG_RESV_RMAPBT &&
 		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
 		  > args->minleft)) {
 		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
@@ -1617,7 +1617,7 @@ xfs_alloc_ag_vextent_small(
 			 * If we're feeding an AGFL block to something that
 			 * doesn't live in the free space, we need to clear
 			 * out the OWN_AG rmap and add the block back to
-			 * the AGFL per-AG reservation.
+			 * the RMAPBT per-AG reservation.
 			 */
 			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
 			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
@@ -1625,7 +1625,7 @@ xfs_alloc_ag_vextent_small(
 			if (error)
 				goto error0;
 			pag = xfs_perag_get(args->mp, args->agno);
-			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_AGFL,
+			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
 					args->tp, 1);
 			xfs_perag_put(pag);
 
@@ -1911,14 +1911,12 @@ xfs_free_ag_extent(
 	XFS_STATS_INC(mp, xs_freex);
 	XFS_STATS_ADD(mp, xs_freeb, len);
 
-	trace_xfs_free_extent(mp, agno, bno, len, type == XFS_AG_RESV_AGFL,
-			haveleft, haveright);
+	trace_xfs_free_extent(mp, agno, bno, len, type, haveleft, haveright);
 
 	return 0;
 
  error0:
-	trace_xfs_free_extent(mp, agno, bno, len, type == XFS_AG_RESV_AGFL,
-			-1, -1);
+	trace_xfs_free_extent(mp, agno, bno, len, type, -1, -1);
 	if (bno_cur)
 		xfs_btree_del_cursor(bno_cur, XFS_BTREE_ERROR);
 	if (cnt_cur)
@@ -2155,7 +2153,7 @@ xfs_alloc_fix_freelist(
 		if (error)
 			goto out_agbp_relse;
 		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
-					   &targs.oinfo, XFS_AG_RESV_AGFL);
+					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
 		if (error)
 			goto out_agbp_relse;
 		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
@@ -2181,7 +2179,7 @@ xfs_alloc_fix_freelist(
 	while (pag->pagf_flcount < need) {
 		targs.agbno = 0;
 		targs.maxlen = need - pag->pagf_flcount;
-		targs.resv = XFS_AG_RESV_AGFL;
+		targs.resv = XFS_AG_RESV_RMAPBT;
 
 		/* Allocate as many blocks as possible at once. */
 		error = xfs_alloc_ag_vextent(&targs);
@@ -2862,7 +2860,7 @@ xfs_free_extent(
 	int			error;
 
 	ASSERT(len != 0);
-	ASSERT(type != XFS_AG_RESV_AGFL);
+	ASSERT(type != XFS_AG_RESV_RMAPBT);
 
 	if (XFS_TEST_ERROR(false, mp,
 			XFS_ERRTAG_FREE_EXTENT))
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index e0792d036be2..cf84288b65ca 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -327,7 +327,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
 enum xfs_ag_resv_type {
 	XFS_AG_RESV_NONE = 0,
 	XFS_AG_RESV_METADATA,
-	XFS_AG_RESV_AGFL,
+	XFS_AG_RESV_RMAPBT,
 };
 
 struct xfs_ag_resv {
@@ -391,8 +391,8 @@ typedef struct xfs_perag {
 
 	/* Blocks reserved for all kinds of metadata. */
 	struct xfs_ag_resv	pag_meta_resv;
-	/* Blocks reserved for just AGFL-based metadata. */
-	struct xfs_ag_resv	pag_agfl_resv;
+	/* Blocks reserved for the reverse mapping btree. */
+	struct xfs_ag_resv	pag_rmapbt_resv;
 
 	/* reference count */
 	uint8_t			pagf_refcount_level;
@@ -406,8 +406,8 @@ xfs_perag_resv(
 	switch (type) {
 	case XFS_AG_RESV_METADATA:
 		return &pag->pag_meta_resv;
-	case XFS_AG_RESV_AGFL:
-		return &pag->pag_agfl_resv;
+	case XFS_AG_RESV_RMAPBT:
+		return &pag->pag_rmapbt_resv;
 	default:
 		return NULL;
 	}
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 270246943a06..832df6f49ba1 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1061,7 +1061,7 @@ xfs_reflink_ag_has_free_space(
 		return 0;
 
 	pag = xfs_perag_get(mp, agno);
-	if (xfs_ag_resv_critical(pag, XFS_AG_RESV_AGFL) ||
+	if (xfs_ag_resv_critical(pag, XFS_AG_RESV_RMAPBT) ||
 	    xfs_ag_resv_critical(pag, XFS_AG_RESV_METADATA))
 		error = -ENOSPC;
 	xfs_perag_put(pag);
-- 
2.13.6



* [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
                   ` (2 preceding siblings ...)
  2018-02-05 17:46 ` [PATCH 3/4] xfs: rename agfl perag res type to rmapbt Brian Foster
@ 2018-02-05 17:46 ` Brian Foster
  2018-02-07  0:03   ` Darrick J. Wong
  2018-02-06 13:10 ` [PATCH] tests/xfs: rmapbt swapext block reservation overrun test Brian Foster
  4 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-05 17:46 UTC (permalink / raw)
  To: linux-xfs

The rmapbt perag metadata reservation reserves blocks for the
reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
the agfl and perag accounting is updated as blocks are allocated
from the allocation btrees, the reservation actually accounts blocks
as they are allocated to (or freed from) the agfl rather than the
rmapbt itself.

While this works for blocks that are eventually used for the rmapbt,
not all agfl blocks are destined for the rmapbt. Blocks that are
allocated to the agfl (and thus "reserved" for the rmapbt) but then
used by another structure lead to a growing inconsistency over time
between the runtime tracking of rmapbt usage and actual rmapbt
usage. Since the runtime tracking thinks all agfl blocks are rmapbt
blocks, it essentially believes that less future reservation is
required to satisfy the rmapbt than is actually necessary.

The inconsistency is rectified across mount cycles because the perag
reservation is initialized based on the actual rmapbt usage at mount
time. The problem, however, is that the excessive drain of the
reservation at runtime opens a window to allocate blocks for other
purposes that might be required for the rmapbt on a subsequent
mount. This problem can be demonstrated by a simple test that runs
an allocation workload to consume agfl blocks over time and then
observe the difference in the agfl reservation requirement across an
unmount/mount cycle:

  mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
  ...
  ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
  umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
  mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194

As the above tracepoints show, the reservation requirement reduces
from 3194 blocks to 2956 blocks as the workload runs.  Without any
other changes in the filesystem, the same reservation requirement
jumps from 2956 to 3052 blocks over a umount/mount cycle.

To address this divergence, update the RMAPBT reservation to account
only the blocks actually used by the rmapbt rather than all blocks
filled into the agfl. This patch makes several high-level changes
toward that
end:

1.) Reintroduce an AGFL reservation type to serve as an accounting
    no-op for blocks allocated to (or freed from) the AGFL.
2.) Invoke RMAPBT usage accounting from the actual rmapbt block
    allocation path rather than the AGFL allocation path.

The first change is required because agfl blocks are considered free
blocks throughout their lifetime. The perag reservation subsystem is
invoked unconditionally by the allocation subsystem, so we need a
way to tell the perag subsystem (via the allocation subsystem) to
not make any accounting changes for blocks filled into the AGFL.

The second change causes the in-core RMAPBT reservation usage
accounting to remain consistent with the on-disk state at all times
and eliminates the risk of leaving the rmapbt reservation
underfilled.
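
To put rough numbers on the divergence, a standalone userspace C
sketch (not the kernel code; the per-mount figures loosely follow the
tracepoints above, the rmapbt block count is hypothetical):

	#include <stdio.h>

	/*
	 * The old scheme drained the rmapbt reservation for every block
	 * filled into the AGFL; a fresh mount recomputes the reservation
	 * from actual on-disk rmapbt usage, so the two views drift apart.
	 */
	int main(void)
	{
		unsigned int ask = 3194;		/* worst-case rmapbt blocks */
		unsigned int agfl_fills = 238;		/* blocks filled into the AGFL */
		unsigned int rmapbt_blocks = 142;	/* blocks the rmapbt really consumed */

		unsigned int runtime_resv = ask - agfl_fills;	/* 2956: runtime view */
		unsigned int remount_resv = ask - rmapbt_blocks;	/* 3052: next mount's view */

		printf("runtime resv %u vs remount resv %u\n",
		       runtime_resv, remount_resv);
		return 0;
	}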

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
 fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
 fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
 fs/xfs/xfs_mount.h             |  1 +
 5 files changed, 46 insertions(+), 12 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
index 0ca2e680034a..03885a968de8 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.c
+++ b/fs/xfs/libxfs/xfs_ag_resv.c
@@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
 	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
 
 	switch (type) {
+	case XFS_AG_RESV_AGFL:
+		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
@@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
 	trace_xfs_ag_resv_free_extent(pag, type, len);
 
 	switch (type) {
+	case XFS_AG_RESV_AGFL:
+		return;
 	case XFS_AG_RESV_METADATA:
 	case XFS_AG_RESV_RMAPBT:
 		resv = xfs_perag_resv(pag, type);
diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
index 8d6c687deef3..938f2f96c5e8 100644
--- a/fs/xfs/libxfs/xfs_ag_resv.h
+++ b/fs/xfs/libxfs/xfs_ag_resv.h
@@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
 void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
 		struct xfs_trans *tp, xfs_extlen_t len);
 
+/*
+ * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
+ * the AGFL, they are allocated one at a time and the reservation updates don't
+ * require a transaction.
+ */
+static inline void
+xfs_ag_resv_rmapbt_alloc(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_alloc_arg	args = {0};
+	struct xfs_perag	*pag;
+
+	args.len = 1;
+	pag = xfs_perag_get(mp, agno);
+	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
+	xfs_perag_put(pag);
+}
+
+static inline void
+xfs_ag_resv_rmapbt_free(
+	struct xfs_mount	*mp,
+	xfs_agnumber_t		agno)
+{
+	struct xfs_perag	*pag;
+
+	pag = xfs_perag_get(mp, agno);
+	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
+	xfs_perag_put(pag);
+}
+
 #endif	/* __XFS_AG_RESV_H__ */
diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index e53d55453dc5..8c401efb2d74 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
 
 	ASSERT(args->len >= args->minlen);
 	ASSERT(args->len <= args->maxlen);
-	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
+	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
 	ASSERT(args->agbno % args->alignment == 0);
 
 	/* if not file data, insert new block into the reverse map btree */
@@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
 	int		*stat)	/* status: 0-freelist, 1-normal/none */
 {
 	struct xfs_owner_info	oinfo;
-	struct xfs_perag	*pag;
 	int		error;
 	xfs_agblock_t	fbno;
 	xfs_extlen_t	flen;
@@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
 	 * freelist.
 	 */
 	else if (args->minlen == 1 && args->alignment == 1 &&
-		 args->resv != XFS_AG_RESV_RMAPBT &&
+		 args->resv != XFS_AG_RESV_AGFL &&
 		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
 		  > args->minleft)) {
 		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
@@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
 			/*
 			 * If we're feeding an AGFL block to something that
 			 * doesn't live in the free space, we need to clear
-			 * out the OWN_AG rmap and add the block back to
-			 * the RMAPBT per-AG reservation.
+			 * out the OWN_AG rmap.
 			 */
 			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
 			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
 					fbno, 1, &oinfo);
 			if (error)
 				goto error0;
-			pag = xfs_perag_get(args->mp, args->agno);
-			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
-					args->tp, 1);
-			xfs_perag_put(pag);
 
 			*stat = 0;
 			return 0;
@@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
 		if (error)
 			goto out_agbp_relse;
 		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
-					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
+					   &targs.oinfo, XFS_AG_RESV_AGFL);
 		if (error)
 			goto out_agbp_relse;
 		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
@@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
 	while (pag->pagf_flcount < need) {
 		targs.agbno = 0;
 		targs.maxlen = need - pag->pagf_flcount;
-		targs.resv = XFS_AG_RESV_RMAPBT;
+		targs.resv = XFS_AG_RESV_AGFL;
 
 		/* Allocate as many blocks as possible at once. */
 		error = xfs_alloc_ag_vextent(&targs);
@@ -2860,7 +2854,7 @@ xfs_free_extent(
 	int			error;
 
 	ASSERT(len != 0);
-	ASSERT(type != XFS_AG_RESV_RMAPBT);
+	ASSERT(type != XFS_AG_RESV_AGFL);
 
 	if (XFS_TEST_ERROR(false, mp,
 			XFS_ERRTAG_FREE_EXTENT))
diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
index e829c3e489ea..8560091554e0 100644
--- a/fs/xfs/libxfs/xfs_rmap_btree.c
+++ b/fs/xfs/libxfs/xfs_rmap_btree.c
@@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
 	be32_add_cpu(&agf->agf_rmap_blocks, 1);
 	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
 
+	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
+
 	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
 	*stat = 1;
 	return 0;
@@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
 			      XFS_EXTENT_BUSY_SKIP_DISCARD);
 	xfs_trans_agbtree_delta(cur->bc_tp, -1);
 
+	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
+
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index cf84288b65ca..7aae5af71aba 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
 /* per-AG block reservation data structures*/
 enum xfs_ag_resv_type {
 	XFS_AG_RESV_NONE = 0,
+	XFS_AG_RESV_AGFL,
 	XFS_AG_RESV_METADATA,
 	XFS_AG_RESV_RMAPBT,
 };
-- 
2.13.6



* [PATCH] tests/xfs: rmapbt swapext block reservation overrun test
  2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
                   ` (3 preceding siblings ...)
  2018-02-05 17:46 ` [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res Brian Foster
@ 2018-02-06 13:10 ` Brian Foster
  2018-02-06 17:30   ` Darrick J. Wong
  4 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-06 13:10 UTC (permalink / raw)
  To: linux-xfs; +Cc: fstests

The XFS rmapbt extent swap mechanism performs an extent by extent
swap to ensure the rmapbt is rectified with the appropriate extent
owner information after the operation. This implementation suffers
from a corner case that requires extra reservation if the swap
operation results in bouncing one of the associated inodes between
extent and btree formats. When this corner case occurs, it results
in a transaction block reservation overrun and possible corruption
of the free space accounting.

This regression test provides coverage for this corner case. It
creates two files with a large enough extent count to require btree
format, regardless of inode size, and performs a sequence of extent
swaps between them with a decreasing extent count until all extents
are removed from the file(s). This ensures that one of the swaps
covers the btree <-> extent fork format boundary case.

This test reproduces fs corruption on rmapbt enabled filesystems
running on kernels without the associated extent swap fix.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---

This test reproduces one of the problems targeted to be fixed by the
following patch series:

  https://marc.info/?l=linux-xfs&m=151785278525201&w=2

Also note that this test depends on currently unmerged xfs_io
functionality. The associated functionality is posted for review here:

  https://marc.info/?l=linux-xfs&m=151792224511355&w=2

... and so this test should not be merged until/unless that
functionality is reviewed. Thanks.

Brian

 tests/xfs/440     | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/xfs/440.out |  2 ++
 tests/xfs/group   |  1 +
 3 files changed, 100 insertions(+)
 create mode 100755 tests/xfs/440
 create mode 100644 tests/xfs/440.out

diff --git a/tests/xfs/440 b/tests/xfs/440
new file mode 100755
index 00000000..c7667e08
--- /dev/null
+++ b/tests/xfs/440
@@ -0,0 +1,97 @@
+#! /bin/bash
+# FS QA Test 440
+#
+# Regression test for the XFS rmapbt based extent swap algorithm. The extent
+# swap algorithm for rmapbt=1 filesystems unmaps/remaps individual extents to
+# rectify the rmapbt for each extent swapped between inodes. If one of the
+# inodes happens to straddle the extent <-> btree format boundary (which can
+# vary depending on inode size), the unmap/remap sequence can bounce the inodes
+# back and forth between formats many times during the swap. Since extent ->
+# btree format conversion requires a block allocation, this can consume more
+# blocks than expected, lead to block reservation overrun and free space
+# accounting inconsistency.
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2018 Red Hat, Inc.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+_require_xfs_io_command "falloc"
+_require_xfs_io_command "fpunch"
+_require_xfs_io_command "swapext"
+
+_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
+_scratch_mount || _fail "mount failed"
+
+file1=$SCRATCH_MNT/file1
+file2=$SCRATCH_MNT/file2
+
+# The goal is to run an extent swap where one of the associated files has the
+# minimum number of extents to remain in btree format. First, create a couple
+# files with large enough extent counts to ensure btree format on the largest
+# possible inode size filesystems.
+for i in $(seq 0 199); do
+	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file1
+	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file2
+done
+
+# Now run an extent swap at every possible extent count down to 0.  Depending
+# on filesystem geometry (i.e., inode size), one of these swaps will cover the
+# boundary case between extent and btree format.
+for i in $(seq 0 199); do
+	# punch one extent from the tmpfile and swap
+	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
+	$XFS_IO_PROG -c "swapext $file2" $file1
+
+	# punch the same extent from the old fork (now in file2) to resync the
+	# extent counts and repeat
+	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
+done
+
+# failure results in fs corruption and possible assert failure
+echo Silence is golden
+
+# success, all done
+status=0
+exit
diff --git a/tests/xfs/440.out b/tests/xfs/440.out
new file mode 100644
index 00000000..fb8dc21f
--- /dev/null
+++ b/tests/xfs/440.out
@@ -0,0 +1,2 @@
+QA output created by 440
+Silence is golden
diff --git a/tests/xfs/group b/tests/xfs/group
index cf81451d..ae0c5fc8 100644
--- a/tests/xfs/group
+++ b/tests/xfs/group
@@ -437,3 +437,4 @@
 437 auto quick other
 438 auto quick quota dangerous
 439 auto quick fuzzers log
+440 auto quick ioctl
-- 
2.13.6



* Re: [PATCH] tests/xfs: rmapbt swapext block reservation overrun test
  2018-02-06 13:10 ` [PATCH] tests/xfs: rmapbt swapext block reservation overrun test Brian Foster
@ 2018-02-06 17:30   ` Darrick J. Wong
  2018-02-06 18:50     ` Brian Foster
  2018-02-07  4:07     ` Eryu Guan
  0 siblings, 2 replies; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-06 17:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs, fstests

On Tue, Feb 06, 2018 at 08:10:32AM -0500, Brian Foster wrote:
> The XFS rmapbt extent swap mechanism performs an extent by extent
> swap to ensure the rmapbt is rectified with the appropriate extent
> owner information after the operation. This implementation suffers
> from a corner case that requires extra reservation if the swap
> operation results in bouncing one of the associated inodes between
> extent and btree formats. When this corner case occurs, it results
> in a transaction block reservation overrun and possible corruption
> of the free space accounting.
> 
> This regression test provides coverage for this corner case. It
> creates two files with a large enough extent count to require btree
> format, regardless of inode size, and performs a sequence of extent
> swaps between them with a decreasing extent count until all extents
> are removed from the file(s). This ensures that one of the swaps
> covers the btree <-> extent fork format boundary case.
> 
> This test reproduces fs corruption on rmapbt enabled filesystems
> running on kernels without the associated extent swap fix.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> 
> This test reproduces one of the problems targeted to be fixed by the
> following patch series:
> 
>   https://marc.info/?l=linux-xfs&m=151785278525201&w=2
> 
> Also note that this test depends on currently unmerged xfs_io
> functionality. The associated functionality is posted for review here:
> 
>   https://marc.info/?l=linux-xfs&m=151792224511355&w=2
> 
> ... and so this test should not be merged until/unless that
> functionality is reviewed. Thanks.
> 
> Brian
> 
>  tests/xfs/440     | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/xfs/440.out |  2 ++
>  tests/xfs/group   |  1 +
>  3 files changed, 100 insertions(+)
>  create mode 100755 tests/xfs/440
>  create mode 100644 tests/xfs/440.out
> 
> diff --git a/tests/xfs/440 b/tests/xfs/440
> new file mode 100755
> index 00000000..c7667e08
> --- /dev/null
> +++ b/tests/xfs/440
> @@ -0,0 +1,97 @@
> +#! /bin/bash
> +# FS QA Test 440
> +#
> +# Regression test for the XFS rmapbt based extent swap algorithm. The extent
> +# swap algorithm for rmapbt=1 filesystems unmaps/remaps individual extents to
> +# rectify the rmapbt for each extent swapped between inodes. If one of the
> +# inodes happens to straddle the extent <-> btree format boundary (which can
> +# vary depending on inode size), the unmap/remap sequence can bounce the inodes
> +# back and forth between formats many times during the swap. Since extent ->
> +# btree format conversion requires a block allocation, this can consume more
> +# blocks than expected, lead to block reservation overrun and free space
> +# accounting inconsistency.

Yikes. :)

<slightly ot here>

TBH, I've long wondered a couple of things about the swapext code --
since the rmap version of it can swap extents between any kind of file,
does it still make sense to return -EINVAL if the donor file has more
extents than the source file?  And do we have a use case for allowing
extent swaps of parts of files?

<ok, back to the test>

> +#
> +#-----------------------------------------------------------------------
> +# Copyright (c) 2018 Red Hat, Inc.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#-----------------------------------------------------------------------
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs generic
> +_supported_os Linux
> +_require_scratch
> +_require_xfs_io_command "falloc"
> +_require_xfs_io_command "fpunch"
> +_require_xfs_io_command "swapext"
> +
> +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> +_scratch_mount || _fail "mount failed"

Before we encode too many _scratch_mount || _fail, can we get a decision
from the maintainer about whether or not _scratch_mount should just
_fail if the mount doesn't work, instead of each test having to
open-code this on its own?

I see that 53 of the 1221 mentions of _scratch_mount already do _fail...

> +
> +file1=$SCRATCH_MNT/file1
> +file2=$SCRATCH_MNT/file2
> +
> +# The goal is run an extent swap where one of the associated files has the
> +# minimum number of extents to remain in btree format. First, create a couple
> +# files with large enough extent counts to ensure btree format on the largest
> +# possible inode size filesystems.
> +for i in $(seq 0 199); do
> +	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file1
> +	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file2

A 4k extent length isn't going to work on a fs with 64k blocks.  I'd
probably just do:

blksz=65536
$XFS_IO_PROG -fc "falloc 0 $((400 * blksz))" $file1
src/punch_alternating $file1

> +done
> +
> +# Now run an extent swap at every possible extent count down to 0.  Depending
> +# on filesystem geometry (i.e., inode size), one of these swaps will cover the
> +# boundary case between extent and btree format.
> +for i in $(seq 0 199); do
> +	# punch one extent from the tmpfile and swap
> +	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
> +	$XFS_IO_PROG -c "swapext $file2" $file1
> +
> +	# punch the same extent from the old fork (now in file2) to resync the
> +	# extent counts and repeat
> +	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
> +done
> +
> +# failure results in fs corruption and possible assert failure
> +echo Silence is golden
> +
> +# success, all done
> +status=0
> +exit
> diff --git a/tests/xfs/440.out b/tests/xfs/440.out
> new file mode 100644
> index 00000000..fb8dc21f
> --- /dev/null
> +++ b/tests/xfs/440.out
> @@ -0,0 +1,2 @@
> +QA output created by 440
> +Silence is golden
> diff --git a/tests/xfs/group b/tests/xfs/group
> index cf81451d..ae0c5fc8 100644
> --- a/tests/xfs/group
> +++ b/tests/xfs/group
> @@ -437,3 +437,4 @@
>  437 auto quick other
>  438 auto quick quota dangerous
>  439 auto quick fuzzers log
> +440 auto quick ioctl

Though this isn't a fsr test per se, it does test a regression in the
underlying ioctl, so maybe this should also be tagged group fsr?

--D

> -- 
> 2.13.6
> 


* Re: [PATCH] tests/xfs: rmapbt swapext block reservation overrun test
  2018-02-06 17:30   ` Darrick J. Wong
@ 2018-02-06 18:50     ` Brian Foster
  2018-02-07  4:07     ` Eryu Guan
  1 sibling, 0 replies; 19+ messages in thread
From: Brian Foster @ 2018-02-06 18:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, fstests

On Tue, Feb 06, 2018 at 09:30:23AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 06, 2018 at 08:10:32AM -0500, Brian Foster wrote:
> > The XFS rmapbt extent swap mechanism performs an extent by extent
> > swap to ensure the rmapbt is rectified with the appropriate extent
> > owner information after the operation. This implementation suffers
> > from a corner case that requires extra reservation if the swap
> > operation results in bouncing one of the associated inodes between
> > extent and btree formats. When this corner case occurs, it results
> > in a transaction block reservation overrun and possible corruption
> > of the free space accounting.
> > 
> > This regression test provides coverage for this corner case. It
> > creates two files with a large enough extent count to require btree
> > format, regardless of inode size, and performs a sequence of extent
> > swaps between them with a decreasing extent count until all extents
> > are removed from the file(s). This ensures that one of the swaps
> > covers the btree <-> extent fork format boundary case.
> > 
> > This test reproduces fs corruption on rmapbt enabled filesystems
> > running on kernels without the associated extent swap fix.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This test reproduces one of the problems targeted to be fixed by the
> > following patch series:
> > 
> >   https://marc.info/?l=linux-xfs&m=151785278525201&w=2
> > 
> > Also note that this test depends on currently unmerged xfs_io
> > functionality. The associated functionality is posted for review here:
> > 
> >   https://marc.info/?l=linux-xfs&m=151792224511355&w=2
> > 
> > ... and so this test should not be merged until/unless that
> > functionality is reviewed. Thanks.
> > 
> > Brian
> > 
> >  tests/xfs/440     | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/440.out |  2 ++
> >  tests/xfs/group   |  1 +
> >  3 files changed, 100 insertions(+)
> >  create mode 100755 tests/xfs/440
> >  create mode 100644 tests/xfs/440.out
> > 
> > diff --git a/tests/xfs/440 b/tests/xfs/440
> > new file mode 100755
> > index 00000000..c7667e08
> > --- /dev/null
> > +++ b/tests/xfs/440
> > @@ -0,0 +1,97 @@
> > +#! /bin/bash
> > +# FS QA Test 440
> > +#
> > +# Regression test for the XFS rmapbt based extent swap algorithm. The extent
> > +# swap algorithm for rmapbt=1 filesystems unmaps/remaps individual extents to
> > +# rectify the rmapbt for each extent swapped between inodes. If one of the
> > +# inodes happens to straddle the extent <-> btree format boundary (which can
> > +# vary depending on inode size), the unmap/remap sequence can bounce the inodes
> > +# back and forth between formats many times during the swap. Since extent ->
> > +# btree format conversion requires a block allocation, this can consume more
> > +# blocks than expected, lead to block reservation overrun and free space
> > +# accounting inconsistency.
> 
> Yikes. :)
> 
> <slightly ot here>
> 
> TBH, I've long wondered a couple of things about the swapext code --
> since the rmap version of it can swap extents between any kind of file,
> does it still make sense to return -EINVAL if the donor file has more
> extents than the source file?  And do we have a use case for allowing
> extent swaps of parts of files?
> 

None that I'm aware of, but I haven't really thought that hard about it.
:P

> <ok, back to the test>
> 
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2018 Red Hat, Inc.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1	# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_scratch
> > +_require_xfs_io_command "falloc"
> > +_require_xfs_io_command "fpunch"
> > +_require_xfs_io_command "swapext"
> > +
> > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > +_scratch_mount || _fail "mount failed"
> 
> Before we encode too many _scratch_mount || _fail, can we get a decision
> from the maintainer about whether or not _scratch_mount should just
> _fail if the mount doesn't work, instead of each test having to
> open-code this on its own?
> 

I'd probably leave it around until a patch hits, regardless. It's easy
to remove if a _scratch_mount() fix lands in the meantime.

> I see that 53 of the 1221 mentions of _scratch_mount already do _fail...
> 
> > +
> > +file1=$SCRATCH_MNT/file1
> > +file2=$SCRATCH_MNT/file2
> > +
> > +# The goal is run an extent swap where one of the associated files has the
> > +# minimum number of extents to remain in btree format. First, create a couple
> > +# files with large enough extent counts to ensure btree format on the largest
> > +# possible inode size filesystems.
> > +for i in $(seq 0 199); do
> > +	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file1
> > +	$XFS_IO_PROG -fc "falloc $((i * 8192)) 4k" $file2
> 
> A 4k extent length isn't going to work on a fs with 64k blocks.  I'd
> probably just do:
> 
> blksz=65536
> $XFS_IO_PROG -fc "falloc 0 $((400 * blksz))" $file1
> src/punch_alternating $file1
> 

Indeed, I didn't know about punch_alternating. I suppose we can also get
the actual filesystem block size from the filtered mkfs output as well.

> > +done
> > +
> > +# Now run an extent swap at every possible extent count down to 0.  Depending
> > +# on filesystem geometry (i.e., inode size), one of these swaps will cover the
> > +# boundary case between extent and btree format.
> > +for i in $(seq 0 199); do
> > +	# punch one extent from the tmpfile and swap
> > +	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
> > +	$XFS_IO_PROG -c "swapext $file2" $file1
> > +
> > +	# punch the same extent from the old fork (now in file2) to resync the
> > +	# extent counts and repeat
> > +	$XFS_IO_PROG -c "fpunch $((i * 8192)) 4k" $file2
> > +done
> > +
> > +# failure results in fs corruption and possible assert failure
> > +echo Silence is golden
> > +
> > +# success, all done
> > +status=0
> > +exit
> > diff --git a/tests/xfs/440.out b/tests/xfs/440.out
> > new file mode 100644
> > index 00000000..fb8dc21f
> > --- /dev/null
> > +++ b/tests/xfs/440.out
> > @@ -0,0 +1,2 @@
> > +QA output created by 440
> > +Silence is golden
> > diff --git a/tests/xfs/group b/tests/xfs/group
> > index cf81451d..ae0c5fc8 100644
> > --- a/tests/xfs/group
> > +++ b/tests/xfs/group
> > @@ -437,3 +437,4 @@
> >  437 auto quick other
> >  438 auto quick quota dangerous
> >  439 auto quick fuzzers log
> > +440 auto quick ioctl
> 
> Though this isn't a fsr test per se, it does test a regression in the
> underlying ioctl, so maybe this should also be tagged group fsr?
> 

No preference. I suppose it does make sense if one wanted to verify a
kernel was "fsr safe." I'll add it.

Brian

> --D
> 
> > -- 
> > 2.13.6
> > 


* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-05 17:46 ` [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res Brian Foster
@ 2018-02-07  0:03   ` Darrick J. Wong
  2018-02-07 14:49     ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-07  0:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> The rmapbt perag metadata reservation reserves blocks for the
> reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
> the agfl and perag accounting is updated as blocks are allocated
> from the allocation btrees, the reservation actually accounts blocks
> as they are allocated to (or freed from) the agfl rather than the
> rmapbt itself.
> 
> While this works for blocks that are eventually used for the rmapbt,
> not all agfl blocks are destined for the rmapbt. Blocks that are
> allocated to the agfl (and thus "reserved" for the rmapbt) but then
> used by another structure leads to a growing inconsistency over time
> between the runtime tracking of rmapbt usage vs. actual rmapbt
> usage. Since the runtime tracking thinks all agfl blocks are rmapbt
> blocks, it essentially believes that less future reservation is
> required to satisfy the rmapbt than what is actually necessary.
> 
> The inconsistency is rectified across mount cycles because the perag
> reservation is initialized based on the actual rmapbt usage at mount
> time. The problem, however, is that the excessive drain of the
> reservation at runtime opens a window to allocate blocks for other
> purposes that might be required for the rmapbt on a subsequent
> mount. This problem can be demonstrated by a simple test that runs
> an allocation workload to consume agfl blocks over time and then
> observe the difference in the agfl reservation requirement across an
> unmount/mount cycle:
> 
>   mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
>   ...
>   ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
>   umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
>   mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
> 
> As the above tracepoints show, the reservation requirement reduces
> from 3194 blocks to 2956 blocks as the workload runs.  Without any
> other changes in the filesystem, the same reservation requirement
> jumps from 2956 to 3052 blocks over a umount/mount cycle.
> 
> To address this divergence, update the RMAPBT reservation to account
> blocks used for the rmapbt only rather than all blocks filled into
> the agfl. This patch makes several high-level changes toward that
> end:
> 
> 1.) Reintroduce an AGFL reservation type to serve as an accounting
>     no-op for blocks allocated to (or freed from) the AGFL.
> 2.) Invoke RMAPBT usage accounting from the actual rmapbt block
>     allocation path rather than the AGFL allocation path.
> 
> The first change is required because agfl blocks are considered free
> blocks throughout their lifetime. The perag reservation subsystem is
> invoked unconditionally by the allocation subsystem, so we need a
> way to tell the perag subsystem (via the allocation subsystem) to
> not make any accounting changes for blocks filled into the AGFL.
> 
> The second change causes the in-core RMAPBT reservation usage
> accounting to remain consistent with the on-disk state at all times
> and eliminates the risk of leaving the rmapbt reservation
> underfilled.

Sorry I haven't gotten to this in a couple of days.  This looks like a
reasonable enough fix to the complex perag machinery, though Dave and I
were batting around an idea last week -- what if instead of all these
reservation types and tracking we simply maintained a per-ag free space
reservation of a couple megabytes, handed out blocks if we got
desperate, and replenished that res when we had a good opportunity?

I /think/ the answer to that question is that we need to try to reserve
enough blocks to satisfy the maximum size of the btree (if there is
one), not just a fixed-size delayed free space.  We already hand out
blocks if we get desperate, and we replenish the reservation when we
can.  Lastly, we need the reservation type tracking so that we avoid
things like the refcount btree stealing blocks from the agfl in
desperation.  I've also been wondering if each reservation should get
its own type so that we track refcountbt and finobt blocks separately.

Otherwise I think this series looks ok, but I'll go over it in more
detail tomorrow.

--D

> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
>  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
>  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
>  fs/xfs/xfs_mount.h             |  1 +
>  5 files changed, 46 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> index 0ca2e680034a..03885a968de8 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.c
> +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
>  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
>  
>  	switch (type) {
> +	case XFS_AG_RESV_AGFL:
> +		return;
>  	case XFS_AG_RESV_METADATA:
>  	case XFS_AG_RESV_RMAPBT:
>  		resv = xfs_perag_resv(pag, type);
> @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
>  	trace_xfs_ag_resv_free_extent(pag, type, len);
>  
>  	switch (type) {
> +	case XFS_AG_RESV_AGFL:
> +		return;
>  	case XFS_AG_RESV_METADATA:
>  	case XFS_AG_RESV_RMAPBT:
>  		resv = xfs_perag_resv(pag, type);
> diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> index 8d6c687deef3..938f2f96c5e8 100644
> --- a/fs/xfs/libxfs/xfs_ag_resv.h
> +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
>  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
>  		struct xfs_trans *tp, xfs_extlen_t len);
>  
> +/*
> + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> + * the AGFL, they are allocated one at a time and the reservation updates don't
> + * require a transaction.
> + */
> +static inline void
> +xfs_ag_resv_rmapbt_alloc(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
> +{
> +	struct xfs_alloc_arg	args = {0};
> +	struct xfs_perag	*pag;
> +
> +	args.len = 1;
> +	pag = xfs_perag_get(mp, agno);
> +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> +	xfs_perag_put(pag);
> +}
> +
> +static inline void
> +xfs_ag_resv_rmapbt_free(
> +	struct xfs_mount	*mp,
> +	xfs_agnumber_t		agno)
> +{
> +	struct xfs_perag	*pag;
> +
> +	pag = xfs_perag_get(mp, agno);
> +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> +	xfs_perag_put(pag);
> +}
> +
>  #endif	/* __XFS_AG_RESV_H__ */
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index e53d55453dc5..8c401efb2d74 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
>  
>  	ASSERT(args->len >= args->minlen);
>  	ASSERT(args->len <= args->maxlen);
> -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
>  	ASSERT(args->agbno % args->alignment == 0);
>  
>  	/* if not file data, insert new block into the reverse map btree */
> @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
>  	int		*stat)	/* status: 0-freelist, 1-normal/none */
>  {
>  	struct xfs_owner_info	oinfo;
> -	struct xfs_perag	*pag;
>  	int		error;
>  	xfs_agblock_t	fbno;
>  	xfs_extlen_t	flen;
> @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
>  	 * freelist.
>  	 */
>  	else if (args->minlen == 1 && args->alignment == 1 &&
> -		 args->resv != XFS_AG_RESV_RMAPBT &&
> +		 args->resv != XFS_AG_RESV_AGFL &&
>  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
>  		  > args->minleft)) {
>  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
>  			/*
>  			 * If we're feeding an AGFL block to something that
>  			 * doesn't live in the free space, we need to clear
> -			 * out the OWN_AG rmap and add the block back to
> -			 * the RMAPBT per-AG reservation.
> +			 * out the OWN_AG rmap.
>  			 */
>  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
>  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
>  					fbno, 1, &oinfo);
>  			if (error)
>  				goto error0;
> -			pag = xfs_perag_get(args->mp, args->agno);
> -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> -					args->tp, 1);
> -			xfs_perag_put(pag);
>  
>  			*stat = 0;
>  			return 0;
> @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
>  		if (error)
>  			goto out_agbp_relse;
>  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> +					   &targs.oinfo, XFS_AG_RESV_AGFL);
>  		if (error)
>  			goto out_agbp_relse;
>  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
>  	while (pag->pagf_flcount < need) {
>  		targs.agbno = 0;
>  		targs.maxlen = need - pag->pagf_flcount;
> -		targs.resv = XFS_AG_RESV_RMAPBT;
> +		targs.resv = XFS_AG_RESV_AGFL;
>  
>  		/* Allocate as many blocks as possible at once. */
>  		error = xfs_alloc_ag_vextent(&targs);
> @@ -2860,7 +2854,7 @@ xfs_free_extent(
>  	int			error;
>  
>  	ASSERT(len != 0);
> -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> +	ASSERT(type != XFS_AG_RESV_AGFL);
>  
>  	if (XFS_TEST_ERROR(false, mp,
>  			XFS_ERRTAG_FREE_EXTENT))
> diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> index e829c3e489ea..8560091554e0 100644
> --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
>  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
>  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
>  
> +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> +
>  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
>  	*stat = 1;
>  	return 0;
> @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
>  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
>  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
>  
> +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> +
>  	return 0;
>  }
>  
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index cf84288b65ca..7aae5af71aba 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
>  /* per-AG block reservation data structures*/
>  enum xfs_ag_resv_type {
>  	XFS_AG_RESV_NONE = 0,
> +	XFS_AG_RESV_AGFL,
>  	XFS_AG_RESV_METADATA,
>  	XFS_AG_RESV_RMAPBT,
>  };
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] tests/xfs: rmapbt swapext block reservation overrun test
  2018-02-06 17:30   ` Darrick J. Wong
  2018-02-06 18:50     ` Brian Foster
@ 2018-02-07  4:07     ` Eryu Guan
  1 sibling, 0 replies; 19+ messages in thread
From: Eryu Guan @ 2018-02-07  4:07 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs, fstests

On Tue, Feb 06, 2018 at 09:30:23AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 06, 2018 at 08:10:32AM -0500, Brian Foster wrote:
> > The XFS rmapbt extent swap mechanism performs an extent by extent
> > swap to ensure the rmapbt is rectified with the appropriate extent
> > owner information after the operation. This implementation suffers
> > from a corner case that requires extra reservation if the swap
> > operation results in bouncing one of the associated inodes between
> > extent and btree formats. When this corner case occurs, it results
> > in a transaction block reservation overrun and possible corruption
> > of the free space accounting.
> > 
> > This regression test provides coverage for this corner case. It
> > creates two files with a large enough extent count to require btree
> > format, regardless of inode size, and performs a sequence of extent
> > swaps between them with a decreasing extent count until all extents
> > are removed from the file(s). This ensures that one of the swaps
> > covers the btree <-> extent fork format boundary case.
> > 
> > This test reproduces fs corruption on rmapbt enabled filesystems
> > running on kernels without the associated extent swap fix.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> > 
> > This test reproduces one of the problems targeted to be fixed by the
> > following patch series:
> > 
> >   https://marc.info/?l=linux-xfs&m=151785278525201&w=2
> > 
> > Also note that this test depends on currently unmerged xfs_io
> > functionality. The associated functionality is posted for review here:
> > 
> >   https://marc.info/?l=linux-xfs&m=151792224511355&w=2
> > 
> > ... and so this test should not be merged until/unless that
> > functionality is reviewed. Thanks.
> > 
> > Brian
> > 
> >  tests/xfs/440     | 97 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  tests/xfs/440.out |  2 ++
> >  tests/xfs/group   |  1 +
> >  3 files changed, 100 insertions(+)
> >  create mode 100755 tests/xfs/440
> >  create mode 100644 tests/xfs/440.out
> > 
> > diff --git a/tests/xfs/440 b/tests/xfs/440
> > new file mode 100755
> > index 00000000..c7667e08
> > --- /dev/null
> > +++ b/tests/xfs/440
> > @@ -0,0 +1,97 @@
> > +#! /bin/bash
> > +# FS QA Test 440
> > +#
> > +# Regression test for the XFS rmapbt based extent swap algorithm. The extent
> > +# swap algorithm for rmapbt=1 filesystems unmaps/remaps individual extents to
> > +# rectify the rmapbt for each extent swapped between inodes. If one of the
> > +# inodes happens to straddle the extent <-> btree format boundary (which can
> > +# vary depending on inode size), the unmap/remap sequence can bounce the inodes
> > +# back and forth between formats many times during the swap. Since extent ->
> > +# btree format conversion requires a block allocation, this can consume more
> > +# blocks than expected, lead to block reservation overrun and free space
> > +# accounting inconsistency.
> 
> Yikes. :)
> 
> <slightly ot here>
> 
> TBH, I've long wondered a couple of things about the swapext code --
> since the rmap version of it can swap extents between any kind of file,
> does it still make sense to return -EINVAL if the donor file has more
> extents than the source file?  And do we have a use case for allowing
> extent swaps of parts of files?
> 
> <ok, back to the test>
> 
> > +#
> > +#-----------------------------------------------------------------------
> > +# Copyright (c) 2018 Red Hat, Inc.  All Rights Reserved.
> > +#
> > +# This program is free software; you can redistribute it and/or
> > +# modify it under the terms of the GNU General Public License as
> > +# published by the Free Software Foundation.
> > +#
> > +# This program is distributed in the hope that it would be useful,
> > +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> > +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > +# GNU General Public License for more details.
> > +#
> > +# You should have received a copy of the GNU General Public License
> > +# along with this program; if not, write the Free Software Foundation,
> > +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> > +#-----------------------------------------------------------------------
> > +#
> > +
> > +seq=`basename $0`
> > +seqres=$RESULT_DIR/$seq
> > +echo "QA output created by $seq"
> > +
> > +here=`pwd`
> > +tmp=/tmp/$$
> > +status=1	# failure is the default!
> > +trap "_cleanup; exit \$status" 0 1 2 3 15
> > +
> > +_cleanup()
> > +{
> > +	cd /
> > +	rm -f $tmp.*
> > +}
> > +
> > +# get standard environment, filters and checks
> > +. ./common/rc
> > +. ./common/filter
> > +
> > +# remove previous $seqres.full before test
> > +rm -f $seqres.full
> > +
> > +# real QA test starts here
> > +
> > +# Modify as appropriate.
> > +_supported_fs generic
> > +_supported_os Linux
> > +_require_scratch
> > +_require_xfs_io_command "falloc"
> > +_require_xfs_io_command "fpunch"
> > +_require_xfs_io_command "swapext"
> > +
> > +_scratch_mkfs >> $seqres.full 2>&1 || _fail "mkfs failed"
> > +_scratch_mount || _fail "mount failed"
> 
> Before we encode too many _scratch_mount || _fail, can we get a decision
> from the maintainer about whether or not _scratch_mount should just
> _fail if the mount doesn't work, instead of each test having to
> open-code this on its own?
> 
> I see that 53 of the 1221 mentions of _scratch_mount already do _fail...

Sorry for not responding to the _scratch_mount status checking issue
earlier. I'm fine with it overall, as more people are running into this
problem and getting annoyed by the implicit failures. I just wanted to
think more about it and work out the best way to do this.

Ted also complained a month ago about not checking the return status of
_scratch_mkfs and similar functions; I think we can fix them all
together.

Thanks,
Eryu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-07  0:03   ` Darrick J. Wong
@ 2018-02-07 14:49     ` Brian Foster
  2018-02-08  2:20       ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-07 14:49 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > The rmapbt perag metadata reservation reserves blocks for the
> > reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
> > the agfl and perag accounting is updated as blocks are allocated
> > from the allocation btrees, the reservation actually accounts blocks
> > as they are allocated to (or freed from) the agfl rather than the
> > rmapbt itself.
> > 
> > While this works for blocks that are eventually used for the rmapbt,
> > not all agfl blocks are destined for the rmapbt. Blocks that are
> > allocated to the agfl (and thus "reserved" for the rmapbt) but then
> > used by another structure leads to a growing inconsistency over time
> > between the runtime tracking of rmapbt usage vs. actual rmapbt
> > usage. Since the runtime tracking thinks all agfl blocks are rmapbt
> > blocks, it essentially believes that less future reservation is
> > required to satisfy the rmapbt than what is actually necessary.
> > 
> > The inconsistency is rectified across mount cycles because the perag
> > reservation is initialized based on the actual rmapbt usage at mount
> > time. The problem, however, is that the excessive drain of the
> > reservation at runtime opens a window to allocate blocks for other
> > purposes that might be required for the rmapbt on a subsequent
> > mount. This problem can be demonstrated by a simple test that runs
> > an allocation workload to consume agfl blocks over time and then
> > observe the difference in the agfl reservation requirement across an
> > unmount/mount cycle:
> > 
> >   mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
> >   ...
> >   ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
> >   umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
> >   mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
> > 
> > As the above tracepoints show, the reservation requirement reduces
> > from 3194 blocks to 2956 blocks as the workload runs.  Without any
> > other changes in the filesystem, the same reservation requirement
> > jumps from 2956 to 3052 blocks over a umount/mount cycle.
> > 
> > To address this divergence, update the RMAPBT reservation to account
> > blocks used for the rmapbt only rather than all blocks filled into
> > the agfl. This patch makes several high-level changes toward that
> > end:
> > 
> > 1.) Reintroduce an AGFL reservation type to serve as an accounting
> >     no-op for blocks allocated to (or freed from) the AGFL.
> > 2.) Invoke RMAPBT usage accounting from the actual rmapbt block
> >     allocation path rather than the AGFL allocation path.
> > 
> > The first change is required because agfl blocks are considered free
> > blocks throughout their lifetime. The perag reservation subsystem is
> > invoked unconditionally by the allocation subsystem, so we need a
> > way to tell the perag subsystem (via the allocation subsystem) to
> > not make any accounting changes for blocks filled into the AGFL.
> > 
> > The second change causes the in-core RMAPBT reservation usage
> > accounting to remain consistent with the on-disk state at all times
> > and eliminates the risk of leaving the rmapbt reservation
> > underfilled.
> 
> Sorry I haven't gotten to this in a couple of days.  This looks like a
> reasonable enough fix to the complex perag machinery, though Dave and I
> were batting around an idea last week -- what if instead of all these
> reservation types and tracking we simply maintained a per-ag free space
> reservation of a couple megabytes, handed out blocks if we got
> desperate, and replenished that res when we had a good opportunity?
> 

Firstly, I essentially view this patch as a bug fix vs. some kind of
larger rework. I'm hoping the split of the rename bits helps emphasize
that... It's really just a change to accurately account rmapbt block
consumption at runtime.

I don't have much context on the broader question you've called out...
I assume there was a larger requirement (associated with
reflink+rmapbt?) that necessitated this kind of worst case reservation
in the first place..? We've since applied it to things like the finobt
(which I'm still not totally convinced was the right thing to do based
on the vague justification for it), which kind of blurs the line between
where it's a requirement vs. nice-to-have/band-aid for me.

> I /think/ the answer to that question is that we need to try to reserve
> enough blocks to satisfy the maximum size of the btree (if there is
> one), not just a fixed-size delayed free space.  We already hand out
> blocks if we get desperate, and we replenish the reservation when we
> can.  Lastly, we need the reservation type tracking so that we avoid
> things like the refcount btree stealing blocks from the agfl in
> desperation.  I've also been wondering if each reservation should get
> its own type so that we track refcountbt and finobt blocks separately.
> 

In principle, I'm not really for or against the existing mechanism,
further granularizing it or changing it out for something else entirely.
I'm confused about what exactly we're trying to fix in the first place,
though. Is there a larger fundamental problem at play here? Is the
reservation (or complexity of the mechanism) simply overkill? Something
else..?

> Otherwise I think this series looks ok, but I'll go over it in more
> detail tomorrow.
> 

Thanks. No rush..

Brian

> --D
> 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
> >  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
> >  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
> >  fs/xfs/xfs_mount.h             |  1 +
> >  5 files changed, 46 insertions(+), 12 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > index 0ca2e680034a..03885a968de8 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
> >  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
> >  
> >  	switch (type) {
> > +	case XFS_AG_RESV_AGFL:
> > +		return;
> >  	case XFS_AG_RESV_METADATA:
> >  	case XFS_AG_RESV_RMAPBT:
> >  		resv = xfs_perag_resv(pag, type);
> > @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
> >  	trace_xfs_ag_resv_free_extent(pag, type, len);
> >  
> >  	switch (type) {
> > +	case XFS_AG_RESV_AGFL:
> > +		return;
> >  	case XFS_AG_RESV_METADATA:
> >  	case XFS_AG_RESV_RMAPBT:
> >  		resv = xfs_perag_resv(pag, type);
> > diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> > index 8d6c687deef3..938f2f96c5e8 100644
> > --- a/fs/xfs/libxfs/xfs_ag_resv.h
> > +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> > @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> >  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> >  		struct xfs_trans *tp, xfs_extlen_t len);
> >  
> > +/*
> > + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> > + * the AGFL, they are allocated one at a time and the reservation updates don't
> > + * require a transaction.
> > + */
> > +static inline void
> > +xfs_ag_resv_rmapbt_alloc(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno)
> > +{
> > +	struct xfs_alloc_arg	args = {0};
> > +	struct xfs_perag	*pag;
> > +
> > +	args.len = 1;
> > +	pag = xfs_perag_get(mp, agno);
> > +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > +	xfs_perag_put(pag);
> > +}
> > +
> > +static inline void
> > +xfs_ag_resv_rmapbt_free(
> > +	struct xfs_mount	*mp,
> > +	xfs_agnumber_t		agno)
> > +{
> > +	struct xfs_perag	*pag;
> > +
> > +	pag = xfs_perag_get(mp, agno);
> > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> > +	xfs_perag_put(pag);
> > +}
> > +
> >  #endif	/* __XFS_AG_RESV_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index e53d55453dc5..8c401efb2d74 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
> >  
> >  	ASSERT(args->len >= args->minlen);
> >  	ASSERT(args->len <= args->maxlen);
> > -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> > +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> >  	ASSERT(args->agbno % args->alignment == 0);
> >  
> >  	/* if not file data, insert new block into the reverse map btree */
> > @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
> >  	int		*stat)	/* status: 0-freelist, 1-normal/none */
> >  {
> >  	struct xfs_owner_info	oinfo;
> > -	struct xfs_perag	*pag;
> >  	int		error;
> >  	xfs_agblock_t	fbno;
> >  	xfs_extlen_t	flen;
> > @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
> >  	 * freelist.
> >  	 */
> >  	else if (args->minlen == 1 && args->alignment == 1 &&
> > -		 args->resv != XFS_AG_RESV_RMAPBT &&
> > +		 args->resv != XFS_AG_RESV_AGFL &&
> >  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
> >  		  > args->minleft)) {
> >  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> > @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
> >  			/*
> >  			 * If we're feeding an AGFL block to something that
> >  			 * doesn't live in the free space, we need to clear
> > -			 * out the OWN_AG rmap and add the block back to
> > -			 * the RMAPBT per-AG reservation.
> > +			 * out the OWN_AG rmap.
> >  			 */
> >  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> >  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
> >  					fbno, 1, &oinfo);
> >  			if (error)
> >  				goto error0;
> > -			pag = xfs_perag_get(args->mp, args->agno);
> > -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> > -					args->tp, 1);
> > -			xfs_perag_put(pag);
> >  
> >  			*stat = 0;
> >  			return 0;
> > @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
> >  		if (error)
> >  			goto out_agbp_relse;
> >  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> > +					   &targs.oinfo, XFS_AG_RESV_AGFL);
> >  		if (error)
> >  			goto out_agbp_relse;
> >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
> >  	while (pag->pagf_flcount < need) {
> >  		targs.agbno = 0;
> >  		targs.maxlen = need - pag->pagf_flcount;
> > -		targs.resv = XFS_AG_RESV_RMAPBT;
> > +		targs.resv = XFS_AG_RESV_AGFL;
> >  
> >  		/* Allocate as many blocks as possible at once. */
> >  		error = xfs_alloc_ag_vextent(&targs);
> > @@ -2860,7 +2854,7 @@ xfs_free_extent(
> >  	int			error;
> >  
> >  	ASSERT(len != 0);
> > -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> > +	ASSERT(type != XFS_AG_RESV_AGFL);
> >  
> >  	if (XFS_TEST_ERROR(false, mp,
> >  			XFS_ERRTAG_FREE_EXTENT))
> > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > index e829c3e489ea..8560091554e0 100644
> > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
> >  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
> >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
> >  
> > +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> > +
> >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> >  	*stat = 1;
> >  	return 0;
> > @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
> >  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> >  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> >  
> > +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> > +
> >  	return 0;
> >  }
> >  
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index cf84288b65ca..7aae5af71aba 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
> >  /* per-AG block reservation data structures*/
> >  enum xfs_ag_resv_type {
> >  	XFS_AG_RESV_NONE = 0,
> > +	XFS_AG_RESV_AGFL,
> >  	XFS_AG_RESV_METADATA,
> >  	XFS_AG_RESV_RMAPBT,
> >  };
> > -- 
> > 2.13.6
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation
  2018-02-05 17:45 ` [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation Brian Foster
@ 2018-02-08  1:42   ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-08  1:42 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 05, 2018 at 12:45:58PM -0500, Brian Foster wrote:
> The ->t_blk_res_used field tracks how many blocks have been used in
> the current transaction. This should never exceed the block
> reservation (->t_blk_res) for a particular transaction. We currently
> assert this condition in the transaction block accounting code, but
> otherwise take no additional action should this situation occur.
> 
> The overrun generally has no effect if space ends up being available
> and the associated transaction commits. If the transaction is
> duplicated, however, the current block usage is used to determine
> the remaining block reservation to be transferred to the new
> transaction. If usage exceeds reservation, this calculation
> underflows and creates a transaction with an invalid and excessive
> reservation. When the second transaction commits, the release of
> unused blocks corrupts the in-core free space counters. With lazy
> superblock accounting enabled, this inconsistency eventually
> trickles to the on-disk superblock and corrupts the filesystem.
> 
> Replace the transaction block usage accounting assert with an
> explicit overrun check. If the transaction overruns the reservation,
> shutdown the filesystem immediately to prevent corruption. Add a new
> assert to xfs_trans_dup() to catch any callers that might induce
> this invalid state in the future.
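
(Roughly, the check described above boils down to the sketch below; the
xfs_trans fields and the shutdown call are real, but the helper name and
its placement are illustrative rather than a quote of the patch:)

/*
 * Charge nblks against the transaction's block reservation; shut down
 * instead of just asserting if usage would exceed the reservation.
 */
static void
sketch_trans_charge_blocks(
	struct xfs_trans	*tp,
	unsigned int		nblks)
{
	if (tp->t_blk_res_used + nblks > tp->t_blk_res) {
		/* overrun: stop before the bad accounting can spread */
		xfs_force_shutdown(tp->t_mountp, SHUTDOWN_CORRUPT_INCORE);
		return;
	}
	tp->t_blk_res_used += nblks;
}
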
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] xfs: account format bouncing into rmapbt swapext tx reservation
  2018-02-05 17:45 ` [PATCH 2/4] xfs: account format bouncing into rmapbt swapext " Brian Foster
@ 2018-02-08  1:56   ` Darrick J. Wong
  2018-02-08 13:12     ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-08  1:56 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 05, 2018 at 12:45:59PM -0500, Brian Foster wrote:
> The extent swap mechanism requires a unique implementation for
> rmapbt enabled filesystems. Because the rmapbt tracks extent owner
> information, extent swap must individually unmap and remap each
> extent between the two inodes.
> 
> The rmapbt extent swap transaction block reservation currently
> accounts for the worst case bmapbt block and rmapbt block
> consumption based on the extent count of each inode. There is a
> corner case that exists due to the extent swap implementation that
> is not covered by this reservation, however.
> 
> If one of the associated inodes is just over the max extent count
> used for extent format inodes (i.e., the inode is in btree format by
> a single extent), the unmap/remap cycle of the extent swap can
> bounce the inode between extent and btree format multiple times,
> almost as many times as there are extents in the inode (if the
> opposing inode happens to have one less, for example). Each back and
> forth cycle involves a block free and allocation, which isn't a
> problem except for that the initial transaction reservation must
> account for the total number of block allocations performed by the
> chain of deferred operations. If not, a block reservation overrun
> occurs and the filesystem shuts down.
> 
> Update the rmapbt extent swap block reservation to check for this
> situation and add some block reservation slop to ensure the entire
> operation succeeds. We'd never likely require reservation for both
> inodes as fsr wouldn't defrag the file in that case, but the
> additional reservation is constrained by the data fork size so be
> cautious and check for both.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_bmap_util.c | 29 ++++++++++++++++++++---------
>  1 file changed, 20 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> index c83f549dc17b..e0a442f504e5 100644
> --- a/fs/xfs/xfs_bmap_util.c
> +++ b/fs/xfs/xfs_bmap_util.c
> @@ -1899,17 +1899,28 @@ xfs_swap_extents(
>  	 * performed with log redo items!
>  	 */
>  	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> +		int		w	= XFS_DATA_FORK;
> +		uint32_t	ipnext	= XFS_IFORK_NEXTENTS(ip, w);
> +		uint32_t	tipnext	= XFS_IFORK_NEXTENTS(tip, w);
> +
> +		/*
> +		 * Conceptually this shouldn't affect the shape of either bmbt,
> +		 * but since we atomically move extents one by one, we reserve
> +		 * enough space to rebuild both trees.
> +		 */
> +		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
> +		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
> +
>  		/*
> -		 * Conceptually this shouldn't affect the shape of either
> -		 * bmbt, but since we atomically move extents one by one,
> -		 * we reserve enough space to rebuild both trees.
> +		 * Handle the corner case where either inode might straddle the
> +		 * btree format boundary. If so, the inode could bounce between
> +		 * btree <-> extent format on unmap -> remap cycles, freeing and
> +		 * allocating a bmapbt block each time.
>  		 */
> -		resblks = XFS_SWAP_RMAP_SPACE_RES(mp,
> -				XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK),
> -				XFS_DATA_FORK) +
> -			  XFS_SWAP_RMAP_SPACE_RES(mp,
> -				XFS_IFORK_NEXTENTS(tip, XFS_DATA_FORK),
> -				XFS_DATA_FORK);
> +		if (ipnext == (XFS_IFORK_MAXEXT(ip, w) + 1))
> +			resblks += XFS_IFORK_MAXEXT(ip, w);
> +		if (tipnext == (XFS_IFORK_MAXEXT(tip, w) + 1))
> +			resblks += XFS_IFORK_MAXEXT(tip, w);

I think this looks good enough to fix the problem, but I've been
wondering (in a more general sense) if it really makes sense to be
repeatedly freeing and allocating bmbt blocks like this?

What I mean is, there are a few operations (like rmapbt updates) that
can cause a lot of similar thrashing behavior when we delete a record
from one place and reinsert it shortly thereafter.  If the btree block
has the exact minimum number of records then it'll try to disperse the
records into the adjoining blocks, which is completely unnecessary if we
know that we're about to reinsert it somewhere else in the block.

Granted in swapext-with-rmap we also have a lot of log update machinery
in the way so there might not be a good way to hold on to blocks.  It
might introduce so much extra complexity it's not worth it either, since
I think we'd have to claw back references to the buffer in the log,
remove the extent busy record, and change the buffer type...?

--D

>  	}
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
> -- 
> 2.13.6
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 3/4] xfs: rename agfl perag res type to rmapbt
  2018-02-05 17:46 ` [PATCH 3/4] xfs: rename agfl perag res type to rmapbt Brian Foster
@ 2018-02-08  1:57   ` Darrick J. Wong
  0 siblings, 0 replies; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-08  1:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Feb 05, 2018 at 12:46:00PM -0500, Brian Foster wrote:
> The AGFL perag reservation type accounts all allocations that feed
> into (or are released from) the allocation group free list (agfl).
> The purpose of the reservation is to support worst case conditions
> for the reverse mapping btree (rmapbt). As such, the agfl
> reservation usage accounting only considers rmapbt usage when the
> in-core counters are initialized at mount time.
> 
> This implementation inconsistency leads to divergence of the in-core
> and on-disk usage accounting over time. In preparation to resolve
> this inconsistency and adjust the AGFL reservation into an rmapbt
> specific reservation, rename the AGFL reservation type and
> associated accounting fields to something more rmapbt-specific. Also
> fix up a couple tracepoints that incorrectly use the AGFL
> reservation type to pass the agfl state of the associated extent
> where the raw reservation type is expected.
> 
> Note that this patch does not change perag reservation behavior.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Looks ok, though the tracepoint fix should probably be a standalone
patch.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-07 14:49     ` Brian Foster
@ 2018-02-08  2:20       ` Darrick J. Wong
  2018-02-08 13:19         ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2018-02-08  2:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > > The rmapbt perag metadata reservation reserves blocks for the
> > > reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
> > > the agfl and perag accounting is updated as blocks are allocated
> > > from the allocation btrees, the reservation actually accounts blocks
> > > as they are allocated to (or freed from) the agfl rather than the
> > > rmapbt itself.
> > > 
> > > While this works for blocks that are eventually used for the rmapbt,
> > > not all agfl blocks are destined for the rmapbt. Blocks that are
> > > allocated to the agfl (and thus "reserved" for the rmapbt) but then
> > > used by another structure leads to a growing inconsistency over time
> > > between the runtime tracking of rmapbt usage vs. actual rmapbt
> > > usage. Since the runtime tracking thinks all agfl blocks are rmapbt
> > > blocks, it essentially believes that less future reservation is
> > > required to satisfy the rmapbt than what is actually necessary.
> > > 
> > > The inconsistency is rectified across mount cycles because the perag
> > > reservation is initialized based on the actual rmapbt usage at mount
> > > time. The problem, however, is that the excessive drain of the
> > > reservation at runtime opens a window to allocate blocks for other
> > > purposes that might be required for the rmapbt on a subsequent
> > > mount. This problem can be demonstrated by a simple test that runs
> > > an allocation workload to consume agfl blocks over time and then
> > > observe the difference in the agfl reservation requirement across an
> > > unmount/mount cycle:
> > > 
> > >   mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
> > >   ...
> > >   ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
> > >   umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
> > >   mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
> > > 
> > > As the above tracepoints show, the reservation requirement reduces
> > > from 3194 blocks to 2956 blocks as the workload runs.  Without any
> > > other changes in the filesystem, the same reservation requirement
> > > jumps from 2956 to 3052 blocks over a umount/mount cycle.
> > > 
> > > To address this divergence, update the RMAPBT reservation to account
> > > blocks used for the rmapbt only rather than all blocks filled into
> > > the agfl. This patch makes several high-level changes toward that
> > > end:
> > > 
> > > 1.) Reintroduce an AGFL reservation type to serve as an accounting
> > >     no-op for blocks allocated to (or freed from) the AGFL.
> > > 2.) Invoke RMAPBT usage accounting from the actual rmapbt block
> > >     allocation path rather than the AGFL allocation path.
> > > 
> > > The first change is required because agfl blocks are considered free
> > > blocks throughout their lifetime. The perag reservation subsystem is
> > > invoked unconditionally by the allocation subsystem, so we need a
> > > way to tell the perag subsystem (via the allocation subsystem) to
> > > not make any accounting changes for blocks filled into the AGFL.
> > > 
> > > The second change causes the in-core RMAPBT reservation usage
> > > accounting to remain consistent with the on-disk state at all times
> > > and eliminates the risk of leaving the rmapbt reservation
> > > underfilled.
> > 
> > Sorry I haven't gotten to this in a couple of days.  This looks like a
> > reasonable enough fix to the complex perag machinery, though Dave and I
> > were batting around an idea last week -- what if instead of all these
> > reservation types and tracking we simply maintained a per-ag free space
> > reservation of a couple megabytes, handed out blocks if we got
> > desperate, and replenished that res when we had a good opportunity?
> > 
> 
> Firstly, I essentially view this patch as a bug fix vs. some kind of
> larger rework. I'm hoping the split of the rename bits helps emphasize
> that... It's really just a change to accurately account rmapbt block
> consumption at runtime.
> 
> I don't have much context on the broader question you've called out...
> I assume there was a larger requirement (associated with
> reflink+rmapbt?) that necessitated this kind of worst case reservation
> in the first place..?

Without reflink, we protect fs writes from hitting ENOSPC midway through
a transaction by reserving all the blocks we think we might need ahead
of time.  Making the reservation at the fs level works because any
blocks we might have to allocate to handle the write can be in any AG,
so if a given AG is out of space we just move on to the next one.

However, once copy-on-write enters the picture, that last assumption
is no longer valid because we have to decrease the reference count in
the specific AG that has the shared blocks.  We now have a requirement
that said AG also has enough space to handle any refcountbt splits that
might happen in the process of decreasing the refcount, and that's where
we get into trouble.  The FS may have plenty of space (so the
reservation goes through) but the AG may be out of space, in which case
the refcount update fails and the fs shuts down.

Therefore, we reserve some free space (by hiding it from the incore
metadata) and every time we need a block for the refcountbt we get one
from the reservation, hence we can never run out.
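
(If it helps, the mechanism really amounts to no more than this toy
model -- names made up here, the real code lives in xfs_ag_resv.c:)

struct toy_ag_resv {
	unsigned long	ar_reserved;	/* blocks hidden from free space */
	unsigned long	ar_asked;	/* target size of the reservation */
};

/* A metadata btree needs a block; the pool guarantees one exists. */
static void
toy_resv_alloc_block(
	struct toy_ag_resv	*resv)
{
	if (resv->ar_reserved > 0)
		resv->ar_reserved--;
}

/* A freed block refills the pool before returning to global free space. */
static int
toy_resv_free_block(
	struct toy_ag_resv	*resv)
{
	if (resv->ar_reserved < resv->ar_asked) {
		resv->ar_reserved++;
		return 1;	/* absorbed by the reservation */
	}
	return 0;	/* counts as ordinary free space again */
}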

> We've since applied it to things like the finobt (which I'm still not
> totally convinced was the right thing to do based on the vague
> justification for it), which kind of blurs the line between where it's
> a requirement vs. nice-to-have/band-aid for me.

I think the finobt reservation is required: Suppose you have a
filesystem with a lot of empty files, a lot of single-block files, and a
lot of big files such that there's no free space anywhere.  Suppose
further that there's an AG where every finobt block is exactly full,
there's an inode chunk with exactly 64 inodes in use, and every block in
that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
inodes in that totally-full chunk and try to free it.  Truncation
doesn't free up any blocks, but we have to expand the finobt to add the
record for the chunk.  We can't find any blocks in that AG so we shut
down.

> > I /think/ the answer to that question is that we need to try to reserve
> > enough blocks to satisfy the maximum size of the btree (if there is
> > one), not just a fixed-size delayed free space.  We already hand out
> > blocks if we get desperate, and we replenish the reservation when we
> > can.  Lastly, we need the reservation type tracking so that we avoid
> > things like the refcount btree stealing blocks from the agfl in
> > desperation.  I've also been wondering if each reservation should get
> > its own type so that we track refcountbt and finobt blocks separately.
> > 
> 
> In principle, I'm not really for or against the existing mechanism,
> further granularizing it or changing it out for something else entirely.
> I'm confused about what exactly we're trying to fix in the first place,
> though. Is there a larger fundamental problem at play here? Is the
> reservation (or complexity of the mechanism) simply overkill? Something
> else..?

Mostly my insecurity over "Was this the right thing for the job?  What
even /was/ the job?" :)

--D

> > Otherwise I think this series looks ok, but I'll go over it in more
> > detail tomorrow.
> > 
> 
> Thanks. No rush..
> 
> Brian
> 
> > --D
> > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
> > >  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
> > >  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
> > >  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
> > >  fs/xfs/xfs_mount.h             |  1 +
> > >  5 files changed, 46 insertions(+), 12 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > index 0ca2e680034a..03885a968de8 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
> > >  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
> > >  
> > >  	switch (type) {
> > > +	case XFS_AG_RESV_AGFL:
> > > +		return;
> > >  	case XFS_AG_RESV_METADATA:
> > >  	case XFS_AG_RESV_RMAPBT:
> > >  		resv = xfs_perag_resv(pag, type);
> > > @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
> > >  	trace_xfs_ag_resv_free_extent(pag, type, len);
> > >  
> > >  	switch (type) {
> > > +	case XFS_AG_RESV_AGFL:
> > > +		return;
> > >  	case XFS_AG_RESV_METADATA:
> > >  	case XFS_AG_RESV_RMAPBT:
> > >  		resv = xfs_perag_resv(pag, type);
> > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> > > index 8d6c687deef3..938f2f96c5e8 100644
> > > --- a/fs/xfs/libxfs/xfs_ag_resv.h
> > > +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> > > @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > >  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > >  		struct xfs_trans *tp, xfs_extlen_t len);
> > >  
> > > +/*
> > > + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> > > + * the AGFL, they are allocated one at a time and the reservation updates don't
> > > + * require a transaction.
> > > + */
> > > +static inline void
> > > +xfs_ag_resv_rmapbt_alloc(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno)
> > > +{
> > > +	struct xfs_alloc_arg	args = {0};
> > > +	struct xfs_perag	*pag;
> > > +
> > > +	args.len = 1;
> > > +	pag = xfs_perag_get(mp, agno);
> > > +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > > +	xfs_perag_put(pag);
> > > +}
> > > +
> > > +static inline void
> > > +xfs_ag_resv_rmapbt_free(
> > > +	struct xfs_mount	*mp,
> > > +	xfs_agnumber_t		agno)
> > > +{
> > > +	struct xfs_perag	*pag;
> > > +
> > > +	pag = xfs_perag_get(mp, agno);
> > > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> > > +	xfs_perag_put(pag);
> > > +}
> > > +
> > >  #endif	/* __XFS_AG_RESV_H__ */
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index e53d55453dc5..8c401efb2d74 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
> > >  
> > >  	ASSERT(args->len >= args->minlen);
> > >  	ASSERT(args->len <= args->maxlen);
> > > -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> > > +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> > >  	ASSERT(args->agbno % args->alignment == 0);
> > >  
> > >  	/* if not file data, insert new block into the reverse map btree */
> > > @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
> > >  	int		*stat)	/* status: 0-freelist, 1-normal/none */
> > >  {
> > >  	struct xfs_owner_info	oinfo;
> > > -	struct xfs_perag	*pag;
> > >  	int		error;
> > >  	xfs_agblock_t	fbno;
> > >  	xfs_extlen_t	flen;
> > > @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
> > >  	 * freelist.
> > >  	 */
> > >  	else if (args->minlen == 1 && args->alignment == 1 &&
> > > -		 args->resv != XFS_AG_RESV_RMAPBT &&
> > > +		 args->resv != XFS_AG_RESV_AGFL &&
> > >  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
> > >  		  > args->minleft)) {
> > >  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> > > @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
> > >  			/*
> > >  			 * If we're feeding an AGFL block to something that
> > >  			 * doesn't live in the free space, we need to clear
> > > -			 * out the OWN_AG rmap and add the block back to
> > > -			 * the RMAPBT per-AG reservation.
> > > +			 * out the OWN_AG rmap.
> > >  			 */
> > >  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > >  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
> > >  					fbno, 1, &oinfo);
> > >  			if (error)
> > >  				goto error0;
> > > -			pag = xfs_perag_get(args->mp, args->agno);
> > > -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> > > -					args->tp, 1);
> > > -			xfs_perag_put(pag);
> > >  
> > >  			*stat = 0;
> > >  			return 0;
> > > @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
> > >  		if (error)
> > >  			goto out_agbp_relse;
> > >  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > > -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> > > +					   &targs.oinfo, XFS_AG_RESV_AGFL);
> > >  		if (error)
> > >  			goto out_agbp_relse;
> > >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > > @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
> > >  	while (pag->pagf_flcount < need) {
> > >  		targs.agbno = 0;
> > >  		targs.maxlen = need - pag->pagf_flcount;
> > > -		targs.resv = XFS_AG_RESV_RMAPBT;
> > > +		targs.resv = XFS_AG_RESV_AGFL;
> > >  
> > >  		/* Allocate as many blocks as possible at once. */
> > >  		error = xfs_alloc_ag_vextent(&targs);
> > > @@ -2860,7 +2854,7 @@ xfs_free_extent(
> > >  	int			error;
> > >  
> > >  	ASSERT(len != 0);
> > > -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> > > +	ASSERT(type != XFS_AG_RESV_AGFL);
> > >  
> > >  	if (XFS_TEST_ERROR(false, mp,
> > >  			XFS_ERRTAG_FREE_EXTENT))
> > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > index e829c3e489ea..8560091554e0 100644
> > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
> > >  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
> > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
> > >  
> > > +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> > > +
> > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > >  	*stat = 1;
> > >  	return 0;
> > > @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
> > >  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> > >  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> > >  
> > > +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> > > +
> > >  	return 0;
> > >  }
> > >  
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index cf84288b65ca..7aae5af71aba 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
> > >  /* per-AG block reservation data structures*/
> > >  enum xfs_ag_resv_type {
> > >  	XFS_AG_RESV_NONE = 0,
> > > +	XFS_AG_RESV_AGFL,
> > >  	XFS_AG_RESV_METADATA,
> > >  	XFS_AG_RESV_RMAPBT,
> > >  };
> > > -- 
> > > 2.13.6
> > > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 2/4] xfs: account format bouncing into rmapbt swapext tx reservation
  2018-02-08  1:56   ` Darrick J. Wong
@ 2018-02-08 13:12     ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2018-02-08 13:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 07, 2018 at 05:56:34PM -0800, Darrick J. Wong wrote:
> On Mon, Feb 05, 2018 at 12:45:59PM -0500, Brian Foster wrote:
> > The extent swap mechanism requires a unique implementation for
> > rmapbt enabled filesystems. Because the rmapbt tracks extent owner
> > information, extent swap must individually unmap and remap each
> > extent between the two inodes.
> > 
> > The rmapbt extent swap transaction block reservation currently
> > accounts for the worst case bmapbt block and rmapbt block
> > consumption based on the extent count of each inode. There is a
> > corner case that exists due to the extent swap implementation that
> > is not covered by this reservation, however.
> > 
> > If one of the associated inodes is just over the max extent count
> > used for extent format inodes (i.e., the inode is in btree format by
> > a single extent), the unmap/remap cycle of the extent swap can
> > bounce the inode between extent and btree format multiple times,
> > almost as many times as there are extents in the inode (if the
> > opposing inode happens to have one less, for example). Each back and
> > forth cycle involves a block free and allocation, which isn't a
> > problem except for that the initial transaction reservation must
> > account for the total number of block allocations performed by the
> > chain of deferred operations. If not, a block reservation overrun
> > occurs and the filesystem shuts down.
> > 
> > Update the rmapbt extent swap block reservation to check for this
> > situation and add some block reservation slop to ensure the entire
> > operation succeeds. We'd never likely require reservation for both
> > inodes as fsr wouldn't defrag the file in that case, but the
> > additional reservation is constrained by the data fork size so be
> > cautious and check for both.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_bmap_util.c | 29 ++++++++++++++++++++---------
> >  1 file changed, 20 insertions(+), 9 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
> > index c83f549dc17b..e0a442f504e5 100644
> > --- a/fs/xfs/xfs_bmap_util.c
> > +++ b/fs/xfs/xfs_bmap_util.c
> > @@ -1899,17 +1899,28 @@ xfs_swap_extents(
> >  	 * performed with log redo items!
> >  	 */
> >  	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
> > +		int		w	= XFS_DATA_FORK;
> > +		uint32_t	ipnext	= XFS_IFORK_NEXTENTS(ip, w);
> > +		uint32_t	tipnext	= XFS_IFORK_NEXTENTS(tip, w);
> > +
> > +		/*
> > +		 * Conceptually this shouldn't affect the shape of either bmbt,
> > +		 * but since we atomically move extents one by one, we reserve
> > +		 * enough space to rebuild both trees.
> > +		 */
> > +		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
> > +		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
> > +
> >  		/*
> > -		 * Conceptually this shouldn't affect the shape of either
> > -		 * bmbt, but since we atomically move extents one by one,
> > -		 * we reserve enough space to rebuild both trees.
> > +		 * Handle the corner case where either inode might straddle the
> > +		 * btree format boundary. If so, the inode could bounce between
> > +		 * btree <-> extent format on unmap -> remap cycles, freeing and
> > +		 * allocating a bmapbt block each time.
> >  		 */
> > -		resblks = XFS_SWAP_RMAP_SPACE_RES(mp,
> > -				XFS_IFORK_NEXTENTS(ip, XFS_DATA_FORK),
> > -				XFS_DATA_FORK) +
> > -			  XFS_SWAP_RMAP_SPACE_RES(mp,
> > -				XFS_IFORK_NEXTENTS(tip, XFS_DATA_FORK),
> > -				XFS_DATA_FORK);
> > +		if (ipnext == (XFS_IFORK_MAXEXT(ip, w) + 1))
> > +			resblks += XFS_IFORK_MAXEXT(ip, w);
> > +		if (tipnext == (XFS_IFORK_MAXEXT(tip, w) + 1))
> > +			resblks += XFS_IFORK_MAXEXT(tip, w);
> 
> I think this looks good enough to fix the problem, but I've been
> wondering (in a more general sense) if it really makes sense to be
> repeatedly freeing and allocating bmbt blocks like this?
> 

The behavior is a bit cringe-worthy when you look at these
pathological corner cases, but I think this one really is a corner case,
in the sense that we should probably question the value of anything that
adds real complexity to the overall operation just for this particular
situation. To me, swapext seemed a limited enough operation that it
wasn't worth thinking too hard about something that avoided the higher
level behavior.

I'm not against doing something more clever, but it would be in addition
to this fix and should probably require more justification that this
case alone.

> What I mean is, there are a few operations (like rmapbt updates) that
> can cause a lot of similar thrashing behavior when we delete a record
> from one place and reinsert it shortly thereafter.  If the btree block
> has the exact minimum number of records then it'll try to disperse the
> records into the adjoining blocks, which is completely unnecessary if we
> know that we're about to reinsert it somewhere else in the block.
> 

Indeed. Something that comes to mind initially would be to find a way
for an operation to pin a particular structure format over a broader
operation, where pin is loosely defined to mean "don't shrink it." An
obvious challenge there is that we still need to handle cases where
shuffling around metadata records does actually result in legitimate
shrink due to creation of contiguity or whatnot. Perhaps such tasks
could be deferred post high-level op (i.e., such as cursor release) vs.
always done inline to a particular modification.

Alternatively, I believe we've already loosely taken advantage of the
concept of reusing freed blocks in the delalloc case (i.e.,
xfs_bmap_split_indlen()). Perhaps we could apply something like that to
transactions and physical blocks that have been (deferred) freed.

But I'm just throwing stuff against the wall here.. :P

> Granted in swapext-with-rmap we also have a lot of log update machinery
> in the way so there might not be a good way to hold on to blocks.  It
> might introduce so much extra complexity it's not worth it either, since
> I think we'd have to claw back references to the buffer in the log,
> remove the extent busy record, and change the buffer type...?
> 

Yeah, I think anything that relies on undoing a particular modification
that has been made in buffers but just hadn't hit disk is probably not
going to get us very far. Those buffers may have already been unlocked
or even relogged with other changes by the time we'd get at them again,
I think.

OTOH, I do think deferred operations could facilitate such a mechanism,
but I suppose we'd still have to make sure the appropriate rmapbt
updates are made, accounting/reservation is done correctly, think about
the case where extents should be busy and thus shouldn't immediately be
reused, etc. There's definitely enough complexity there that we'd
probably want to identify whether there are strong enough benefits.
I.e., what's the problem to this solution? :)

Brian

> --D
> 
> >  	}
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> >  	if (error)
> > -- 
> > 2.13.6
> > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-08  2:20       ` Darrick J. Wong
@ 2018-02-08 13:19         ` Brian Foster
  2018-02-08 22:49           ` Dave Chinner
  0 siblings, 1 reply; 19+ messages in thread
From: Brian Foster @ 2018-02-08 13:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 05, 2018 at 12:46:01PM -0500, Brian Foster wrote:
> > > > The rmapbt perag metadata reservation reserves blocks for the
> > > > reverse mapping btree (rmapbt). Since the rmapbt uses blocks from
> > > > the agfl and perag accounting is updated as blocks are allocated
> > > > from the allocation btrees, the reservation actually accounts blocks
> > > > as they are allocated to (or freed from) the agfl rather than the
> > > > rmapbt itself.
> > > > 
> > > > While this works for blocks that are eventually used for the rmapbt,
> > > > not all agfl blocks are destined for the rmapbt. Blocks that are
> > > > allocated to the agfl (and thus "reserved" for the rmapbt) but then
> > > > used by another structure lead to a growing inconsistency over time
> > > > between the runtime tracking of rmapbt usage vs. actual rmapbt
> > > > usage. Since the runtime tracking thinks all agfl blocks are rmapbt
> > > > blocks, it essentially believes that less future reservation is
> > > > required to satisfy the rmapbt than what is actually necessary.
> > > > 
> > > > The inconsistency is rectified across mount cycles because the perag
> > > > reservation is initialized based on the actual rmapbt usage at mount
> > > > time. The problem, however, is that the excessive drain of the
> > > > reservation at runtime opens a window to allocate blocks for other
> > > > purposes that might be required for the rmapbt on a subsequent
> > > > mount. This problem can be demonstrated by a simple test that runs
> > > > an allocation workload to consume agfl blocks over time and then
> > > > observe the difference in the agfl reservation requirement across an
> > > > unmount/mount cycle:
> > > > 
> > > >   mount ...: xfs_ag_resv_init: ... resv 3193 ask 3194 len 3194
> > > >   ...
> > > >   ...      : xfs_ag_resv_alloc_extent: ... resv 2957 ask 3194 len 1
> > > >   umount...: xfs_ag_resv_free: ... resv 2956 ask 3194 len 0
> > > >   mount ...: xfs_ag_resv_init: ... resv 3052 ask 3194 len 3194
> > > > 
> > > > As the above tracepoints show, the reservation requirement reduces
> > > > from 3194 blocks to 2956 blocks as the workload runs.  Without any
> > > > other changes in the filesystem, the same reservation requirement
> > > > jumps from 2956 to 3052 blocks over a umount/mount cycle.
> > > > 
> > > > To address this divergence, update the RMAPBT reservation to account
> > > > blocks used for the rmapbt only rather than all blocks filled into
> > > > the agfl. This patch makes several high-level changes toward that
> > > > end:
> > > > 
> > > > 1.) Reintroduce an AGFL reservation type to serve as an accounting
> > > >     no-op for blocks allocated to (or freed from) the AGFL.
> > > > 2.) Invoke RMAPBT usage accounting from the actual rmapbt block
> > > >     allocation path rather than the AGFL allocation path.
> > > > 
> > > > The first change is required because agfl blocks are considered free
> > > > blocks throughout their lifetime. The perag reservation subsystem is
> > > > invoked unconditionally by the allocation subsystem, so we need a
> > > > way to tell the perag subsystem (via the allocation subsystem) to
> > > > not make any accounting changes for blocks filled into the AGFL.
> > > > 
> > > > The second change causes the in-core RMAPBT reservation usage
> > > > accounting to remain consistent with the on-disk state at all times
> > > > and eliminates the risk of leaving the rmapbt reservation
> > > > underfilled.
> > > 
> > > Sorry I haven't gotten to this in a couple of days.  This looks like a
> > > reasonable enough fix to the complex perag machinery, though Dave and I
> > > were batting around an idea last week -- what if instead of all these
> > > reservation types and tracking we simply maintained a per-ag free space
> > > reservation of a couple megabytes, handed out blocks if we got
> > > desperate, and replenished that res when we had a good opportunity?
> > > 
> > 
> > Firstly, I essentially view this patch as a bug fix vs. some kind of
> > larger rework. I'm hoping the split of the rename bits helps emphasize
> > that... It's really just a change to accurately account rmapbt block
> > consumption at runtime.
> > 
> > I don't have much context on the broader question you've called out...
> > I assume there was a larger requirement (associated with
> > reflink+rmapbt?) that necessitated this kind of worst case reservation
> > in the first place..?
> 
> Without reflink, we protect fs writes from hitting ENOSPC midway through
> a transaction by reserving all the blocks we think we might need ahead
> of time.  Making the reservation at the fs level works because any
> blocks we might have to allocate to handle the write can be in any AG,
> so if a given AG is out of space we just move on to the next one.
> 
> However, once copy-on-write enters the picture, that last assumption
> is no longer valid because we have to decrease the reference count in
> the specific AG that has the shared blocks.  We now have a requirement
> that said AG also has enough space to handle any refcountbt splits that
> might happen in the process of decreasing the refcount, and that's where
> we get into trouble.  The FS may have plenty of space (so the
> reservation goes through) but the AG may be out of space, in which case
> the refcount update fails and the fs shuts down.
> 
> Therefore, we reserve some free space (by hiding it from the incore
> metadata) and every time we need a block for the refcountbt we get one
> from the reservation, hence we can never run out.
> 

Ok, I thought it had something to do with cases introduced by reflink,
but I can't keep that all in my head. So IIUC essentially the refcountbt
can grow aggressively based on splits of shared extents and whatnot and
the reservation primarily guarantees we'll have blocks in the AG for
those cases.
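
(For my own notes as much as anything, the model I have in my head is
something like the following. Illustrative pseudo-C only -- heavily
simplified from the real xfs_ag_resv accounting, with made up names:)

/*
 * Simplified model of a per-AG reservation: 'ask' is the worst case
 * number of blocks the btree might ever need, 'reserved' is how much
 * of that ask is currently hidden from the visible free space
 * counters. Allocating a block of the reserved type consumes from
 * the hidden pool; freeing one refills it back toward the ask.
 */
struct ag_resv_model {
	unsigned int	ask;		/* worst case btree size in blocks */
	unsigned int	reserved;	/* blocks currently held back */
};

static void
model_resv_alloc_block(struct ag_resv_model *resv)
{
	if (resv->reserved > 0)
		resv->reserved--;	/* came out of the hidden pool */
	/* else: falls through to normal free space accounting */
}

static void
model_resv_free_block(struct ag_resv_model *resv)
{
	if (resv->reserved < resv->ask)
		resv->reserved++;	/* top the hidden pool back up */
}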

> > We've since applied it to things like the finobt (which I'm still not
> > totally convinced was the right thing to do based on the vague
> > justification for it), which kind of blurs the line between where it's
> > a requirement vs. nice-to-have/band-aid for me.
> 
> I think the finobt reservation is required: Suppose you have a
> filesystem with a lot of empty files, a lot of single-block files, and a
> lot of big files such that there's no free space anywhere.  Suppose
> further that there's an AG where every finobt block is exactly full,
> there's an inode chunk with exactly 64 inodes in use, and every block in
> that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> inodes in that totally-full chunk and try to free it.  Truncation
> doesn't free up any blocks, but we have to expand the finobt to add the
> record for the chunk.  We can't find any blocks in that AG so we shut
> down.
> 

Yes, I suppose the problem makes sense (I wish the original commit had
such an explanation :/). We do have the transaction block reservation in
the !perag res case, but I suppose we're susceptible to the same global
reservation problem as above.

Have we considered a per-ag + per-transaction mechanism at any point
through all of this? I ask because something that has been in the back
of my mind (which I think was an idea from Dave originally) for a while
is to simply queue inactive inode processing when it can't run at a
particular point in time, but that depends on actually knowing whether
we can proceed to inactivate an inode or not.

> > > I /think/ the answer to that question is that we need to try to reserve
> > > enough blocks to satisfy the maximum size of the btree (if there is
> > > one), not just a fixed-size delayed free space.  We already hand out
> > > blocks if we get desperate, and we replenish the reservation when we
> > > can.  Lastly, we need the reservation type tracking so that we avoid
> > > things like the refcount btree stealing blocks from the agfl in
> > > desperation.  I've also been wondering if each reservation should get
> > > its own type so that we track refcountbt and finobt blocks separately.
> > > 
> > 
> > In principle, I'm not really for or against the existing mechanism,
> > further granularizing it or changing it out for something else entirely.
> > I'm confused about what exactly we're trying to fix in the first place,
> > though. Is there a larger fundamental problem at play here? Is the
> > reservation (or complexity of the mechanism) simply overkill? Something
> > else..?
> 
> Mostly my insecurity over "Was this the right thing for the job?  What
> even /was/ the job?" :)
> 

Heh, Ok.. fair enough. I guess I do share the sentiment with regard to
certain aspects of the reservation like how it is applied to the finobt,
where I think reserving every possible finobt block up front may be
overkill with respect to the underlying/fundamental problem. I'd have to
think more about the broader/!finobt aspects, though.

Brian

> --D
> 
> > > Otherwise I think this series looks ok, but I'll go over it in more
> > > detail tomorrow.
> > > 
> > 
> > Thanks. No rush..
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_ag_resv.c    |  4 ++++
> > > >  fs/xfs/libxfs/xfs_ag_resv.h    | 31 +++++++++++++++++++++++++++++++
> > > >  fs/xfs/libxfs/xfs_alloc.c      | 18 ++++++------------
> > > >  fs/xfs/libxfs/xfs_rmap_btree.c |  4 ++++
> > > >  fs/xfs/xfs_mount.h             |  1 +
> > > >  5 files changed, 46 insertions(+), 12 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.c b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > index 0ca2e680034a..03885a968de8 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag_resv.c
> > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.c
> > > > @@ -326,6 +326,8 @@ xfs_ag_resv_alloc_extent(
> > > >  	trace_xfs_ag_resv_alloc_extent(pag, type, args->len);
> > > >  
> > > >  	switch (type) {
> > > > +	case XFS_AG_RESV_AGFL:
> > > > +		return;
> > > >  	case XFS_AG_RESV_METADATA:
> > > >  	case XFS_AG_RESV_RMAPBT:
> > > >  		resv = xfs_perag_resv(pag, type);
> > > > @@ -366,6 +368,8 @@ xfs_ag_resv_free_extent(
> > > >  	trace_xfs_ag_resv_free_extent(pag, type, len);
> > > >  
> > > >  	switch (type) {
> > > > +	case XFS_AG_RESV_AGFL:
> > > > +		return;
> > > >  	case XFS_AG_RESV_METADATA:
> > > >  	case XFS_AG_RESV_RMAPBT:
> > > >  		resv = xfs_perag_resv(pag, type);
> > > > diff --git a/fs/xfs/libxfs/xfs_ag_resv.h b/fs/xfs/libxfs/xfs_ag_resv.h
> > > > index 8d6c687deef3..938f2f96c5e8 100644
> > > > --- a/fs/xfs/libxfs/xfs_ag_resv.h
> > > > +++ b/fs/xfs/libxfs/xfs_ag_resv.h
> > > > @@ -32,4 +32,35 @@ void xfs_ag_resv_alloc_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > > >  void xfs_ag_resv_free_extent(struct xfs_perag *pag, enum xfs_ag_resv_type type,
> > > >  		struct xfs_trans *tp, xfs_extlen_t len);
> > > >  
> > > > +/*
> > > > + * RMAPBT reservation accounting wrappers. Since rmapbt blocks are sourced from
> > > > + * the AGFL, they are allocated one at a time and the reservation updates don't
> > > > + * require a transaction.
> > > > + */
> > > > +static inline void
> > > > +xfs_ag_resv_rmapbt_alloc(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno)
> > > > +{
> > > > +	struct xfs_alloc_arg	args = {0};
> > > > +	struct xfs_perag	*pag;
> > > > +
> > > > +	args.len = 1;
> > > > +	pag = xfs_perag_get(mp, agno);
> > > > +	xfs_ag_resv_alloc_extent(pag, XFS_AG_RESV_RMAPBT, &args);
> > > > +	xfs_perag_put(pag);
> > > > +}
> > > > +
> > > > +static inline void
> > > > +xfs_ag_resv_rmapbt_free(
> > > > +	struct xfs_mount	*mp,
> > > > +	xfs_agnumber_t		agno)
> > > > +{
> > > > +	struct xfs_perag	*pag;
> > > > +
> > > > +	pag = xfs_perag_get(mp, agno);
> > > > +	xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT, NULL, 1);
> > > > +	xfs_perag_put(pag);
> > > > +}
> > > > +
> > > >  #endif	/* __XFS_AG_RESV_H__ */
> > > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > > index e53d55453dc5..8c401efb2d74 100644
> > > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > > @@ -711,7 +711,7 @@ xfs_alloc_ag_vextent(
> > > >  
> > > >  	ASSERT(args->len >= args->minlen);
> > > >  	ASSERT(args->len <= args->maxlen);
> > > > -	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_RMAPBT);
> > > > +	ASSERT(!args->wasfromfl || args->resv != XFS_AG_RESV_AGFL);
> > > >  	ASSERT(args->agbno % args->alignment == 0);
> > > >  
> > > >  	/* if not file data, insert new block into the reverse map btree */
> > > > @@ -1564,7 +1564,6 @@ xfs_alloc_ag_vextent_small(
> > > >  	int		*stat)	/* status: 0-freelist, 1-normal/none */
> > > >  {
> > > >  	struct xfs_owner_info	oinfo;
> > > > -	struct xfs_perag	*pag;
> > > >  	int		error;
> > > >  	xfs_agblock_t	fbno;
> > > >  	xfs_extlen_t	flen;
> > > > @@ -1583,7 +1582,7 @@ xfs_alloc_ag_vextent_small(
> > > >  	 * freelist.
> > > >  	 */
> > > >  	else if (args->minlen == 1 && args->alignment == 1 &&
> > > > -		 args->resv != XFS_AG_RESV_RMAPBT &&
> > > > +		 args->resv != XFS_AG_RESV_AGFL &&
> > > >  		 (be32_to_cpu(XFS_BUF_TO_AGF(args->agbp)->agf_flcount)
> > > >  		  > args->minleft)) {
> > > >  		error = xfs_alloc_get_freelist(args->tp, args->agbp, &fbno, 0);
> > > > @@ -1616,18 +1615,13 @@ xfs_alloc_ag_vextent_small(
> > > >  			/*
> > > >  			 * If we're feeding an AGFL block to something that
> > > >  			 * doesn't live in the free space, we need to clear
> > > > -			 * out the OWN_AG rmap and add the block back to
> > > > -			 * the RMAPBT per-AG reservation.
> > > > +			 * out the OWN_AG rmap.
> > > >  			 */
> > > >  			xfs_rmap_ag_owner(&oinfo, XFS_RMAP_OWN_AG);
> > > >  			error = xfs_rmap_free(args->tp, args->agbp, args->agno,
> > > >  					fbno, 1, &oinfo);
> > > >  			if (error)
> > > >  				goto error0;
> > > > -			pag = xfs_perag_get(args->mp, args->agno);
> > > > -			xfs_ag_resv_free_extent(pag, XFS_AG_RESV_RMAPBT,
> > > > -					args->tp, 1);
> > > > -			xfs_perag_put(pag);
> > > >  
> > > >  			*stat = 0;
> > > >  			return 0;
> > > > @@ -2153,7 +2147,7 @@ xfs_alloc_fix_freelist(
> > > >  		if (error)
> > > >  			goto out_agbp_relse;
> > > >  		error = xfs_free_ag_extent(tp, agbp, args->agno, bno, 1,
> > > > -					   &targs.oinfo, XFS_AG_RESV_RMAPBT);
> > > > +					   &targs.oinfo, XFS_AG_RESV_AGFL);
> > > >  		if (error)
> > > >  			goto out_agbp_relse;
> > > >  		bp = xfs_btree_get_bufs(mp, tp, args->agno, bno, 0);
> > > > @@ -2179,7 +2173,7 @@ xfs_alloc_fix_freelist(
> > > >  	while (pag->pagf_flcount < need) {
> > > >  		targs.agbno = 0;
> > > >  		targs.maxlen = need - pag->pagf_flcount;
> > > > -		targs.resv = XFS_AG_RESV_RMAPBT;
> > > > +		targs.resv = XFS_AG_RESV_AGFL;
> > > >  
> > > >  		/* Allocate as many blocks as possible at once. */
> > > >  		error = xfs_alloc_ag_vextent(&targs);
> > > > @@ -2860,7 +2854,7 @@ xfs_free_extent(
> > > >  	int			error;
> > > >  
> > > >  	ASSERT(len != 0);
> > > > -	ASSERT(type != XFS_AG_RESV_RMAPBT);
> > > > +	ASSERT(type != XFS_AG_RESV_AGFL);
> > > >  
> > > >  	if (XFS_TEST_ERROR(false, mp,
> > > >  			XFS_ERRTAG_FREE_EXTENT))
> > > > diff --git a/fs/xfs/libxfs/xfs_rmap_btree.c b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > index e829c3e489ea..8560091554e0 100644
> > > > --- a/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > +++ b/fs/xfs/libxfs/xfs_rmap_btree.c
> > > > @@ -130,6 +130,8 @@ xfs_rmapbt_alloc_block(
> > > >  	be32_add_cpu(&agf->agf_rmap_blocks, 1);
> > > >  	xfs_alloc_log_agf(cur->bc_tp, agbp, XFS_AGF_RMAP_BLOCKS);
> > > >  
> > > > +	xfs_ag_resv_rmapbt_alloc(cur->bc_mp, cur->bc_private.a.agno);
> > > > +
> > > >  	XFS_BTREE_TRACE_CURSOR(cur, XBT_EXIT);
> > > >  	*stat = 1;
> > > >  	return 0;
> > > > @@ -158,6 +160,8 @@ xfs_rmapbt_free_block(
> > > >  			      XFS_EXTENT_BUSY_SKIP_DISCARD);
> > > >  	xfs_trans_agbtree_delta(cur->bc_tp, -1);
> > > >  
> > > > +	xfs_ag_resv_rmapbt_free(cur->bc_mp, cur->bc_private.a.agno);
> > > > +
> > > >  	return 0;
> > > >  }
> > > >  
> > > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > > index cf84288b65ca..7aae5af71aba 100644
> > > > --- a/fs/xfs/xfs_mount.h
> > > > +++ b/fs/xfs/xfs_mount.h
> > > > @@ -326,6 +326,7 @@ xfs_daddr_to_agbno(struct xfs_mount *mp, xfs_daddr_t d)
> > > >  /* per-AG block reservation data structures*/
> > > >  enum xfs_ag_resv_type {
> > > >  	XFS_AG_RESV_NONE = 0,
> > > > +	XFS_AG_RESV_AGFL,
> > > >  	XFS_AG_RESV_METADATA,
> > > >  	XFS_AG_RESV_RMAPBT,
> > > >  };
> > > > -- 
> > > > 2.13.6
> > > > 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-08 13:19         ` Brian Foster
@ 2018-02-08 22:49           ` Dave Chinner
  2018-02-09 13:37             ` Brian Foster
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2018-02-08 22:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Thu, Feb 08, 2018 at 08:19:38AM -0500, Brian Foster wrote:
> On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> > On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > We've since applied it to things like the finobt (which I'm still not
> > > totally convinced was the right thing to do based on the vague
> > > justification for it), which kind of blurs the line between where it's
> > > a requirement vs. nice-to-have/band-aid for me.
> > 
> > I think the finobt reservation is required: Suppose you have a
> > filesystem with a lot of empty files, a lot of single-block files, and a
> > lot of big files such that there's no free space anywhere.  Suppose
> > further that there's an AG where every finobt block is exactly full,
> > there's an inode chunk with exactly 64 inodes in use, and every block in
> > that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> > inodes in that totally-full chunk and try to free it.  Truncation
> > doesn't free up any blocks, but we have to expand the finobt to add the
> > record for the chunk.  We can't find any blocks in that AG so we shut
> > down.
> > 
> 
> Yes, I suppose the problem makes sense (I wish the original commit had
> such an explanation :/). We do have the transaction block reservation in
> the !perag res case, but I suppose we're susceptible to the same global
> reservation problem as above.
> 
> Have we considered a per-ag + per-transaction mechanism at any point
> through all of this?

That's kind of what I was suggesting to Darrick on IRC a while back.
i.e. the per-ag reservation of at least 4-8MB of space similar to
the global reservation pool we have, and when it dips below that
threshold we reserve more free space.
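
i.e. something along these lines (handwavy sketch, all names and
numbers made up for illustration):

/* ~8MB of 4k blocks per AG, refill when we drop below the low water */
#define AG_POOL_TARGET_BLOCKS	2048
#define AG_POOL_LOW_WATER	 512

struct ag_reserve_pool {
	unsigned int	blocks;		/* blocks currently set aside */
};

/* hand out a block for a critical metadata allocation */
static int
ag_pool_get_block(struct ag_reserve_pool *pool)
{
	if (!pool->blocks)
		return -1;		/* -ENOSPC in real life */
	pool->blocks--;
	return 0;
}

/* opportunistically top the pool back up from AG free space */
static void
ag_pool_maybe_refill(struct ag_reserve_pool *pool, unsigned int *ag_free)
{
	if (pool->blocks >= AG_POOL_LOW_WATER)
		return;
	while (pool->blocks < AG_POOL_TARGET_BLOCKS && *ag_free) {
		(*ag_free)--;
		pool->blocks++;
	}
}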

But yeah, it doesn't completely solve the finobt growth at ENOSPC
problem, but then again the global reservation pool doesn't
completely solve the "run out of free space for IO completion
processing at ENOSPC" problem either. That mechanism is just a
simple solution that is good enough for 99.99% of XFS users, and if
you are outside this there's a way to increase the pool size to make
it more robust (e.g. for those 4096 CPU MPI jobs all doing
concurrent DIO writeback at ENOSPC).

So the question I'm asking here is this: do we need a "perfect
solution" or does a simple, small, dynamic reservation pool provide
"good enough protection" for the vast majority of our users?

> I ask because something that has been in the back
> of my mind (which I think was an idea from Dave originally) for a while
> is to simply queue inactive inode processing when it can't run at a
> particular point in time, but that depends on actually knowing whether
> we can proceed to inactivate an inode or not.

Essentially defer the xfs_ifree() processing step from
xfs_inactive(), right? i.e. leave the inode on the unlinked list
until we've got space to free it? This could be determined by a
simple AG space/resv space check before removing the inode from
the unlinked list...
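
Something as dumb as the following would probably do as the gate
(pseudo-code, not even compile tested; the pagf_freeblks and
xfs_ag_resv_needed() usage is from memory and the
worst_case_ifree_blocks() helper is made up):

static bool
can_inactivate_now(
	struct xfs_mount	*mp,
	xfs_agnumber_t		agno)
{
	struct xfs_perag	*pag;
	bool			ok;

	pag = xfs_perag_get(mp, agno);
	/*
	 * Enough free space left over the perag reservations to cover
	 * any inobt/finobt expansion needed to free this inode?
	 */
	ok = pag->pagf_freeblks >
		xfs_ag_resv_needed(pag, XFS_AG_RESV_NONE) +
		worst_case_ifree_blocks(mp);
	xfs_perag_put(pag);
	return ok;
}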

FWIW, if we keep a list of inactivated but not yet freed inodes for
background processing, we could allocate inodes from that list, too,
simply by removing them from the unlinked list...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res
  2018-02-08 22:49           ` Dave Chinner
@ 2018-02-09 13:37             ` Brian Foster
  0 siblings, 0 replies; 19+ messages in thread
From: Brian Foster @ 2018-02-09 13:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Fri, Feb 09, 2018 at 09:49:30AM +1100, Dave Chinner wrote:
> On Thu, Feb 08, 2018 at 08:19:38AM -0500, Brian Foster wrote:
> > On Wed, Feb 07, 2018 at 06:20:37PM -0800, Darrick J. Wong wrote:
> > > On Wed, Feb 07, 2018 at 09:49:35AM -0500, Brian Foster wrote:
> > > > On Tue, Feb 06, 2018 at 04:03:50PM -0800, Darrick J. Wong wrote:
> > > > We've since applied it to things like the finobt (which I'm still not
> > > > totally convinced was the right thing to do based on the vague
> > > > justification for it), which kind of blurs the line between where it's
> > > > a requirement vs. nice-to-have/band-aid for me.
> > > 
> > > I think the finobt reservation is required: Suppose you have a
> > > filesystem with a lot of empty files, a lot of single-block files, and a
> > > lot of big files such that there's no free space anywhere.  Suppose
> > > further that there's an AG where every finobt block is exactly full,
> > > there's an inode chunk with exactly 64 inodes in use, and every block in
> > > that AG is in use (meaning zero AGFL blocks).  Now find one of the empty
> > > inodes in that totally-full chunk and try to free it.  Truncation
> > > doesn't free up any blocks, but we have to expand the finobt to add the
> > > record for the chunk.  We can't find any blocks in that AG so we shut
> > > down.
> > > 
> > 
> > Yes, I suppose the problem makes sense (I wish the original commit had
> > such an explanation :/). We do have the transaction block reservation in
> > the !perag res case, but I suppose we're susceptible to the same global
> > reservation problem as above.
> > 
> > Have we considered a per-ag + per-transaction mechanism at any point
> > through all of this?
> 
> That's kind of what I was suggesting to Darrick on IRC a while back.
> i.e. the per-ag reservation of at least 4-8MB of space similar to
> the global reservation pool we have, and when it dips below that
> threshold we reserve more free space.
> 

I see. FWIW, that's not exactly what I was thinking, but I suspect it's
a very similar idea in terms of just trying to provide a "good enough"
solution (IOW, we may just be quibbling on implementation details here).

Is this new block pool essentially just a low-end pool to be used in
certain, fixed contexts that are at risk of catastrophic failure in
event of ENOSPC? Essentially a backup plan to cover the case where: a
transaction reserves N global blocks, the operation results in some
refcountbt operations, AG is out of blocks so allocations fail, dip into
special AG reserve pool so we don't explode.

That sounds plausible to me, almost like a high level AGFL for
particular operations.

> But yeah, it doesn't completely solve the finobt growth at ENOSPC
> problem, but then again the global reservation pool doesn't
> completely solve the "run out of free space for IO completion
> processing at ENOSPC" problem either. That mechanism is just a
> simple solution that is good enough for 99.99% of XFS users, and if
> you are outside this there's a way to increase the pool size to make
> it more robust (e.g. for those 4096 CPU MPI jobs all doing
> concurrent DIO writeback at ENOSPC).
> 

Yeah. I'm wondering if just refactoring the existing global reservation
pool into something that is AG granular would be a reasonable approach.
E.g., rather than reserve N blocks out of the filesystem that as far as
we know could all source from the same AG, implement a reservation that
reserves roughly N/agcount number of blocks from each AG.
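
Roughly (illustrative only, rounding up so the per-AG pieces still
cover the original global ask):

static unsigned int
per_ag_resv_ask(
	unsigned int	total_resv_blocks,
	unsigned int	agcount)
{
	/* distribute the global reservation evenly across the AGs */
	return (total_resv_blocks + agcount - 1) / agcount;
}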

> So the question I'm asking here is this: do we need a "perfect
> solution" or does a simple, small, dynamic reservation pool provide
> "good enough protection" for the vast majority of our users?
> 

I'm all for simple + good enough. :) That's primarily been my concern
from the perag + finobt angle... that a full worst-case tree reservation
is overkill when it might just be sufficient to return -ENOSPC in the
nebulous case that motivated it in the first place. IOW, do we really
need these worst case reservations or to worry about how big a reserve
pool needs to be if we can implement a reliable enough mechanism to
reserve blocks for a particular operation to either proceed or fail
gracefully?

Thinking out loud a bit... the idea that popped in my mind was slightly
different in that I was thinking about a per-transaction AG blocks
reservation. E.g., for the purpose of discussion, consider an
xfs_trans_reserve_agblocks() function that reserved N blocks out of a
particular AG for a given tx. This of course requires knowing what AG to
reserve from, so is limited to certain operations, but I think that
should be doable for inode operations. This also requires knowing what
allocations shall be accounted against such tx reservation, so we'd have
to somehow flag/tag inode chunks, inobt blocks, etc. as being AG
reserved allocations (perhaps similar to how delalloc blocks are already
reserved for bmaps).
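
Strictly for discussion purposes, the interface I'm imagining would be
something like the below. None of this exists and the names are made
up; the point is just that a path which knows its target AG could
reserve up front, or fail gracefully and defer:

/* hypothetical: reserve/unreserve blocks from a specific AG against a tx */
int xfs_trans_reserve_agblocks(struct xfs_trans *tp, xfs_agnumber_t agno,
			       unsigned int blocks);
void xfs_trans_unreserve_agblocks(struct xfs_trans *tp, xfs_agnumber_t agno,
				  unsigned int blocks);

/* e.g., an inode inactivation path might then do: */
static int
example_inactive_ifree(struct xfs_trans *tp, xfs_agnumber_t agno)
{
	int	error;

	/*
	 * Worst case inobt/finobt expansion for freeing this inode
	 * (WORST_CASE_IFREE_BLOCKS is made up for the example).
	 */
	error = xfs_trans_reserve_agblocks(tp, agno, WORST_CASE_IFREE_BLOCKS);
	if (error)
		return error;	/* -ENOSPC: defer the ifree rather than shut down */

	/* ... actual xfs_ifree() work would go here ... */
	return 0;
}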

I was originally thinking more about inode processing so it's not clear
to me if/how useful this would be for something like reflink. Scanning
through that code a bit, I guess we have to worry about copy-on-write at
writeback time so a transaction is not terribly useful here. What we'd
want is a tx-less reservation at write() time that guarantees we'd have
enough blocks for refcount updates based on the underlying shared
extents that may need to split, etc., right?

Hmm, I'm not sure what the right answer is there. I can certainly see
how a fixed size pool would make this easier, I just wonder if that
could blow up just the same as without it because we're back to just
guessing/hoping it's enough. I think an underlying AG block reservation
mechanism is still potentially useful here, the question is just where
to store/track blocks that would be reserved at cow write() time and
consumed (and/or released) at writeback time? (Then again, I suppose the
same underlying reservation mechanism could be shared to feed tx AG
reservations _and_ a special pool for reflink handling.)

I suppose we could try to design a structure that tracked reservations
associated with a cow fork or something, but that could get hairy
because IIUC the shared extents in the data fork for a particular write
could span multiple AGs (and thus require multiple reservations). It
would be nice if we could define a reservation relative to each extent
or some such that was fixed across this whole cow extent lifecycle and
thus easy to track, consume or release, but the simplest things that
come to mind (e.g., a refcount btree split reservation per shared block)
are too excessive to be realistic. I'd have to think about that some
more... thoughts on any of this?

> > I ask because something that has been in the back
> > of my mind (which I think was an idea from Dave originally) for a while
> > is to simply queue inactive inode processing when it can't run at a
> > particular point in time, but that depends on actually knowing whether
> > we can proceed to inactivate an inode or not.
> 
> Essentially defer the xfs_ifree() processing step from
> xfs_inactive(), right? i.e. leave the inode on the unlinked list
> until we've got space to free it? This could be determined by a
> simple AG space/resv space check before removing the inode from
> the unlinked list...
> 

Yeah, that's essentially what I'm after. A mechanism to tell us reliably
whether an operation will succeed or not is sufficient here so long as we
support the ability to defer freeing unlinked inodes until it is safe to
do so. I don't think the reservation is that difficult for inactive
operations because we 1.) have a transaction to track actual block
consumption and 2.) have a deterministic worst case reservation
requirement.

> FWIW, if we keep a list of inactivated but not yet freed inodes for
> background processing, we could allocate inodes from that list, too,
> simply by removing them from the unlinked list...
> 

Yep. A neat enhancement, indeed.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-02-09 13:37 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-05 17:45 [PATCH 0/4] xfs: rmapbt block and perag reservation fixups Brian Foster
2018-02-05 17:45 ` [PATCH 1/4] xfs: shutdown if block allocation overruns tx reservation Brian Foster
2018-02-08  1:42   ` Darrick J. Wong
2018-02-05 17:45 ` [PATCH 2/4] xfs: account format bouncing into rmapbt swapext " Brian Foster
2018-02-08  1:56   ` Darrick J. Wong
2018-02-08 13:12     ` Brian Foster
2018-02-05 17:46 ` [PATCH 3/4] xfs: rename agfl perag res type to rmapbt Brian Foster
2018-02-08  1:57   ` Darrick J. Wong
2018-02-05 17:46 ` [PATCH 4/4] xfs: account only rmapbt-used blocks against rmapbt perag res Brian Foster
2018-02-07  0:03   ` Darrick J. Wong
2018-02-07 14:49     ` Brian Foster
2018-02-08  2:20       ` Darrick J. Wong
2018-02-08 13:19         ` Brian Foster
2018-02-08 22:49           ` Dave Chinner
2018-02-09 13:37             ` Brian Foster
2018-02-06 13:10 ` [PATCH] tests/xfs: rmapbt swapext block reservation overrun test Brian Foster
2018-02-06 17:30   ` Darrick J. Wong
2018-02-06 18:50     ` Brian Foster
2018-02-07  4:07     ` Eryu Guan
