All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] xfs: inode transaction reservation fixups
@ 2017-11-27 20:24 Brian Foster
  2017-11-27 20:24 ` [PATCH 1/4] xfs: print transaction log reservation on overrun Brian Foster
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Brian Foster @ 2017-11-27 20:24 UTC (permalink / raw)
  To: linux-xfs

Hi all,

This series fixes up a few issues related to the xfs_inactive_ifree()
reservation overruns that have been reported on occasion. Patch 1 is a
small fixup to include a bit more data in overrun dumps, patch 2 fixes
an old inode free reservation bug, patch 3 fixes the actual overrun by
implementing an agfl fixup limit and patch 4 is some refactoring of
inode transactions that fell out of discussions on how to address the
problem [1]. 

Patches 1-3 survive an xfstests run and a couple iterations (so far) of
the workload that has consistently reproduced the overrun. I plan to run
the latter test a bit longer and repeat with patch 4 included. Note that
patch 4 is RFC and compile tested only atm.

Thoughts, reviews, flames appreciated.

Brian

[1] https://marc.info/?l=linux-xfs&m=151127676203410&w=2

Brian Foster (4):
  xfs: print transaction log reservation on overrun
  xfs: include inobt buffers in ifree tx log reservation
  xfs: amortize agfl block frees across multiple transactions
  xfs: include an allocfree res for inobt modifications

 fs/xfs/libxfs/xfs_alloc.c      | 15 ++++---
 fs/xfs/libxfs/xfs_trans_resv.c | 93 +++++++++++++++++++++---------------------
 fs/xfs/xfs_log.c               |  4 +-
 3 files changed, 60 insertions(+), 52 deletions(-)

-- 
2.13.6


^ permalink raw reply	[flat|nested] 23+ messages in thread

* [PATCH 1/4] xfs: print transaction log reservation on overrun
  2017-11-27 20:24 [PATCH 0/4] xfs: inode transaction reservation fixups Brian Foster
@ 2017-11-27 20:24 ` Brian Foster
  2017-11-27 22:14   ` Dave Chinner
  2017-11-27 20:24 ` [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation Brian Foster
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-27 20:24 UTC (permalink / raw)
  To: linux-xfs

The transaction dump code displays the content and reservation
consumption of a particular transaction in the event of an overrun.
It currently displays the reservation associated with the
transaction ticket, but not the original reservation attached to the
transaction.

The latter value reflects the original transaction reservation
calculation before additional reservation overhead is assigned, such
as for the CIL context header and potential split region headers.

Update xlog_print_trans() to also print the original transaction
reservation in the event of overrun. This provides a reference point
to identify how much reservation overhead was added to a particular
ticket by xfs_log_calc_unit_res().

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_log.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 38d4227895ae..3a37fa1ed25e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2117,7 +2117,9 @@ xlog_print_trans(
 
 	/* dump core transaction and ticket info */
 	xfs_warn(mp, "transaction summary:");
-	xfs_warn(mp, "  flags	= 0x%x", tp->t_flags);
+	xfs_warn(mp, "  log res   = %d", tp->t_log_res);
+	xfs_warn(mp, "  log count = %d", tp->t_log_count);
+	xfs_warn(mp, "  flags     = 0x%x", tp->t_flags);
 
 	xlog_print_tic_res(mp, tp->t_ticket);
 
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation
  2017-11-27 20:24 [PATCH 0/4] xfs: inode transaction reservation fixups Brian Foster
  2017-11-27 20:24 ` [PATCH 1/4] xfs: print transaction log reservation on overrun Brian Foster
@ 2017-11-27 20:24 ` Brian Foster
  2017-11-27 22:28   ` Dave Chinner
  2017-11-27 20:24 ` [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions Brian Foster
  2017-11-27 20:24 ` [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications Brian Foster
  3 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-27 20:24 UTC (permalink / raw)
  To: linux-xfs

The tr_ifree transaction handles inode unlinks and inode chunk
frees. The current transaction calculation does not accurately
reflect worst case changes to the inode btree, however. The inobt
portion of the current transaction reservation only covers
modification of a single inobt buffer (for the particular inode
record). This is a historical artifact from the days before XFS
supported full inode chunk removal.

When support for inode chunk removal was added in commit
254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
additional log reservation required for chunk removal was not added
correctly. The new reservation only considered the header overhead
of associated buffers rather than the full contents of the btrees
and AGF and AGFL buffers affected by the transaction. The
reservation for the free space btrees was subsequently fixed up in
commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
ITRUNCATE log reservation"), but the res. for full inobt joins has
never been added.

Update xfs_calc_ifree_reservation() to include a full inobt join in
the reservation calculation. Refactor the undocumented +2 blocks for
the AGF and AGFL buffers into the appropriate place so they are
accounted as sectors (not FSBs) and update the associated comments.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 6bd916bd35e2..4cd7cd1e60da 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -490,10 +490,10 @@ xfs_calc_symlink_reservation(
 /*
  * In freeing an inode we can modify:
  *    the inode being freed: inode size
- *    the super block free inode counter: sector size
+ *    the super block free inode counter, AGF and AGFL: sector size
  *    the agi hash list and counters: sector size
- *    the inode btree entry: block size
  *    the on disk inode before ours in the agi hash list: inode cluster size
+ *    the inode chunk is marked stale (headers only)
  *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
  *    the finobt (record insertion, removal or modification)
@@ -504,12 +504,11 @@ xfs_calc_ifree_reservation(
 {
 	return XFS_DQUOT_LOGRES(mp) +
 		xfs_calc_inode_res(mp, 1) +
-		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 		xfs_calc_iunlink_remove_reservation(mp) +
 		xfs_calc_buf_res(1, 0) +
-		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
-				 mp->m_in_maxlevels, 0) +
+		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
+		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_finobt_res(mp, 0, 1);
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-27 20:24 [PATCH 0/4] xfs: inode transaction reservation fixups Brian Foster
  2017-11-27 20:24 ` [PATCH 1/4] xfs: print transaction log reservation on overrun Brian Foster
  2017-11-27 20:24 ` [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation Brian Foster
@ 2017-11-27 20:24 ` Brian Foster
  2017-11-27 23:07   ` Dave Chinner
  2017-11-27 20:24 ` [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications Brian Foster
  3 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-27 20:24 UTC (permalink / raw)
  To: linux-xfs

We've had rare reports of transaction overruns in
xfs_inactive_ifree() for quite some time. Analysis of a reproducing
metadump has shown the problem is essentially caused by performing
too many agfl block frees in a single transaction.

For example, an inode chunk is freed and the associated agfl fixup
algorithm discovers it needs to free a single agfl block before the
chunk free occurs. This single block ends up causing a space btree
join and adds one or more blocks back onto the agfl. This causes
xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a
single block discrepency.

The transaction overrun occurs under several other unfortunate
circumstances:

 - Each agfl block free is left+right contiguous. This requires 2
   record deletions and 1 insertion for the cntbt and thus requires
   more log reservation than normal.
 - The associated transaction is the first in the CIL ctx and thus
   the ctx header reservation is consumed.
 - The transaction reservation is larger than a log buffer and thus
   extra split header reservation is consumed.

As a result of the agfl and free space state of the filesystem, the
agfl fixup code has dirtied more cntbt buffer space than allowed by
the portion of the reservation allotted for block allocation. This
is all before the real allocation even starts!

Note that the log related conditions above are correctly covered by
the existing transaction reservation. The above demonstrates that
the reservation allotted for the context/split headers may help
suppress overruns in the more common case where that reservation
goes unused for its intended purpose.

To address this problem, update xfs_alloc_fix_freelist() to amortize
agfl block frees over multiple transactions. Free one block per
transaction so long as the agfl is less than half free. The agfl
minimum allocation requirement is dynamic, but is based on the
geometry of the associated btrees (i.e., level count) and therefore
should be easily rectified over multiple allocation transactions.
Further, there is no real harm in leaving extraneous blocks on the
agfl so long as there are enough free slots available for btree
blocks freed as a result of the upcoming allocation.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 0da80019a917..d8d58e35da00 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist(
 	 * appropriately based on the recursion count and dirty state of the
 	 * buffer.
 	 *
-	 * XXX (dgc): When we have lots of free space, does this buy us
-	 * anything other than extra overhead when we need to put more blocks
-	 * back on the free list? Maybe we should only do this when space is
-	 * getting low or the AGFL is more than half full?
-	 *
 	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
 	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
 	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
@@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist(
 			goto out_agbp_relse;
 		}
 		xfs_trans_binval(tp, bp);
+
+		/*
+		 * Freeing all extra agfl blocks adds too much log reservation
+		 * overhead to a single transaction, particularly considering
+		 * that freeing a block can cause a btree join and put one right
+		 * back on the agfl. Try to free one block per tx so long as
+		 * we've left enough free slots for the upcoming modifications.
+		 */
+		if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1))
+			break;
 	}
 
 	targs.tp = tp;
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-27 20:24 [PATCH 0/4] xfs: inode transaction reservation fixups Brian Foster
                   ` (2 preceding siblings ...)
  2017-11-27 20:24 ` [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions Brian Foster
@ 2017-11-27 20:24 ` Brian Foster
  2017-11-27 23:27   ` Dave Chinner
  3 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-27 20:24 UTC (permalink / raw)
  To: linux-xfs

Analysis of recent reports of log reservation overruns and code
inspection has uncovered that the reservations associated with inode
operations may not cover the worst case scenarios. In particular,
many cases only include one allocfree res. for a particular
operation even though said operations may also entail AGFL fixups
and inode btree block allocations in addition to the actual inode
chunk allocation. This can easily turn into two or three block
allocations (or frees) per operation.

In theory, the only way to define the worst case reservation is to
include an allocfree res for each individual allocation in a
transaction. Since that is impractical (we can perform multiple agfl
fixups per tx and not every allocation is going to result in a full
tree operation), implement a reasonable compromise that addresses
the deficiency in practice without blowing out the size of the
transactions.

Refactor the inode transaction reservation code to include one
allocfree res. per inode btree modification to cover allocations
required by the tree itself. This essentially separates the
reservation required to allocate the physical inode chunk from
additional reservation required to perform inobt record
insertion/removal. Apply the same logic to the finobt reservation.
This results in killing off the finobt modify condition because we
no longer assume that the broader transaction reservation will cover
finobt block allocations.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 84 +++++++++++++++++++++---------------------
 1 file changed, 43 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 4cd7cd1e60da..79e4aea74065 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -132,44 +132,43 @@ xfs_calc_inode_res(
 }
 
 /*
- * The free inode btree is a conditional feature and the log reservation
- * requirements differ slightly from that of the traditional inode allocation
- * btree. The finobt tracks records for inode chunks with at least one free
- * inode. A record can be removed from the tree for an inode allocation
- * or free and thus the finobt reservation is unconditional across:
+ * Inode btree record insertion/removal modifies the inode btree and free space
+ * btrees (since the inobt does not use the agfl). This requires the following
+ * reservation:
  *
- * 	- inode allocation
- * 	- inode free
- * 	- inode chunk allocation
+ * the inode btree: max depth * blocksize
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
  *
- * The 'modify' param indicates to include the record modification scenario. The
- * 'alloc' param indicates to include the reservation for free space btree
- * modifications on behalf of finobt modifications. This is required only for
- * transactions that do not already account for free space btree modifications.
+ * The caller must account for SB and AG header modifications, etc.
+ */
+STATIC uint
+xfs_calc_inobt_res(
+	struct xfs_mount	*mp)
+{
+	return xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+				 XFS_FSB_TO_B(mp, 1));
+}
+
+/*
+ * The free inode btree is a conditional feature. The behavior differs slightly
+ * from that of the traditional inode btree in that the finobt tracks records
+ * for inode chunks with at least one free inode. A record can be removed from
+ * the tree during individual inode allocation. Therefore the finobt
+ * reservation is unconditional for both the inode chunk allocation and
+ * individual inode allocation (modify) cases.
  *
- * the free inode btree: max depth * block size
- * the allocation btrees: 2 trees * (max depth - 1) * block size
- * the free inode btree entry: block size
+ * Behavior aside, the reservation for finobt modification is equivalent to the
+ * traditional inobt: cover a full finobt shape change plus block allocation.
  */
 STATIC uint
 xfs_calc_finobt_res(
-	struct xfs_mount	*mp,
-	int			alloc,
-	int			modify)
+	struct xfs_mount	*mp)
 {
-	uint res;
-
 	if (!xfs_sb_version_hasfinobt(&mp->m_sb))
 		return 0;
 
-	res = xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1));
-	if (alloc)
-		res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-					XFS_FSB_TO_B(mp, 1));
-	if (modify)
-		res += (uint)XFS_FSB_TO_B(mp, 1);
-
-	return res;
+	return xfs_calc_inobt_res(mp);
 }
 
 /*
@@ -379,7 +378,7 @@ xfs_calc_create_resv_modify(
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
 		(uint)XFS_FSB_TO_B(mp, 1) +
 		xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 1, 1);
+		xfs_calc_finobt_res(mp);
 }
 
 /*
@@ -387,8 +386,8 @@ xfs_calc_create_resv_modify(
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
  *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
- *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode btree (record insertion)
  */
 STATIC uint
 xfs_calc_create_resv_alloc(
@@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc(
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
 		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inobt_res(mp);
 }
 
 STATIC uint
@@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
  * For icreate we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the inode btree: max depth * blocksize
  *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inobt (record insertion)
  *    the finobt (record insertion)
  */
 STATIC uint
@@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 0);
+		xfs_calc_inobt_res(mp) +
+		xfs_calc_finobt_res(mp);
 }
 
 STATIC uint
@@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
  *    the agi hash list and counters: sector size
  *    the on disk inode before ours in the agi hash list: inode cluster size
  *    the inode chunk is marked stale (headers only)
- *    the inode btree: max depth * blocksize
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode btree
  *    the finobt (record insertion, removal or modification)
+ *
+ * Note that the allocfree res. for the inode chunk itself is not included
+ * because the extent free occurs after a transaction roll. We could take the
+ * maximum of the pre/post roll operations, but the pre-roll reservation already
+ * includes at least one allocfree res. for the inobt and is thus guaranteed to
+ * be larger.
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
 		xfs_calc_iunlink_remove_reservation(mp) +
 		xfs_calc_buf_res(1, 0) +
 		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
-		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_finobt_res(mp, 0, 1);
+		xfs_calc_inobt_res(mp) +
+		xfs_calc_finobt_res(mp);
 }
 
 /*
-- 
2.13.6


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 1/4] xfs: print transaction log reservation on overrun
  2017-11-27 20:24 ` [PATCH 1/4] xfs: print transaction log reservation on overrun Brian Foster
@ 2017-11-27 22:14   ` Dave Chinner
  0 siblings, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2017-11-27 22:14 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Nov 27, 2017 at 03:24:31PM -0500, Brian Foster wrote:
> The transaction dump code displays the content and reservation
> consumption of a particular transaction in the event of an overrun.
> It currently displays the reservation associated with the
> transaction ticket, but not the original reservation attached to the
> transaction.
> 
> The latter value reflects the original transaction reservation
> calculation before additional reservation overhead is assigned, such
> as for the CIL context header and potential split region headers.
> 
> Update xlog_print_trans() to also print the original transaction
> reservation in the event of overrun. This provides a reference point
> to identify how much reservation overhead was added to a particular
> ticket by xfs_log_calc_unit_res().
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 38d4227895ae..3a37fa1ed25e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2117,7 +2117,9 @@ xlog_print_trans(
>  
>  	/* dump core transaction and ticket info */
>  	xfs_warn(mp, "transaction summary:");
> -	xfs_warn(mp, "  flags	= 0x%x", tp->t_flags);
> +	xfs_warn(mp, "  log res   = %d", tp->t_log_res);
> +	xfs_warn(mp, "  log count = %d", tp->t_log_count);
> +	xfs_warn(mp, "  flags     = 0x%x", tp->t_flags);
>  
>  	xlog_print_tic_res(mp, tp->t_ticket);


Looks fine.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation
  2017-11-27 20:24 ` [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation Brian Foster
@ 2017-11-27 22:28   ` Dave Chinner
  2017-11-28 13:30     ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-27 22:28 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Nov 27, 2017 at 03:24:32PM -0500, Brian Foster wrote:
> The tr_ifree transaction handles inode unlinks and inode chunk
> frees. The current transaction calculation does not accurately
> reflect worst case changes to the inode btree, however. The inobt
> portion of the current transaction reservation only covers
> modification of a single inobt buffer (for the particular inode
> record). This is a historical artifact from the days before XFS
> supported full inode chunk removal.
> 
> When support for inode chunk removal was added in commit
> 254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
> additional log reservation required for chunk removal was not added
> correctly. The new reservation only considered the header overhead
> of associated buffers rather than the full contents of the btrees
> and AGF and AGFL buffers affected by the transaction. The
> reservation for the free space btrees was subsequently fixed up in
> commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
> ITRUNCATE log reservation"), but the res. for full inobt joins has
> never been added.
> 
> Update xfs_calc_ifree_reservation() to include a full inobt join in
> the reservation calculation. Refactor the undocumented +2 blocks for
> the AGF and AGFL buffers into the appropriate place so they are
> accounted as sectors (not FSBs) and update the associated comments.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 6bd916bd35e2..4cd7cd1e60da 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -490,10 +490,10 @@ xfs_calc_symlink_reservation(
>  /*
>   * In freeing an inode we can modify:
>   *    the inode being freed: inode size
> - *    the super block free inode counter: sector size
> + *    the super block free inode counter, AGF and AGFL: sector size
>   *    the agi hash list and counters: sector size
> - *    the inode btree entry: block size
>   *    the on disk inode before ours in the agi hash list: inode cluster size
> + *    the inode chunk is marked stale (headers only)
>   *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
>   *    the finobt (record insertion, removal or modification)
> @@ -504,12 +504,11 @@ xfs_calc_ifree_reservation(
>  {
>  	return XFS_DQUOT_LOGRES(mp) +
>  		xfs_calc_inode_res(mp, 1) +

 *    the inode being freed: inode size

> -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +

 *    the super block free inode counter, AGF and AGFL: sector size

		[missing, hidden in calc_iunlink_remove]

 *    the agi hash list and counters: sector size

>  		xfs_calc_iunlink_remove_reservation(mp) +

 *    the on disk inode before ours in the agi hash list: inode cluster size

		[missing]

 *    on-disk inode to log the di_next_unlinked: inode cluster size

Yes, check the xfs_iunlink_remove() code - we can log two inode
cluster buffers there: the one prior to us in the unlinked list,
and ours to reset the di_next_unlinked pointer to
null. IOWs, xfs_calc_iunlink_remove_reservation() needs to take into
account /2/ inode cluster buffers, not one.

>  		xfs_calc_buf_res(1, 0) +

No idea what this is.

> -		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
> -				 mp->m_in_maxlevels, 0) +
> +		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +

 *    the inode chunk is marked stale (headers only)

> +		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +

  *    the inode btree: max depth * blocksize

>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1)) +
  *    the allocation btrees: 2 trees * (max depth - 1) * block size

>  		xfs_calc_finobt_res(mp, 0, 1);

  *    the finobt (record insertion, removal or modification)

The rest look fine.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-27 20:24 ` [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions Brian Foster
@ 2017-11-27 23:07   ` Dave Chinner
  2017-11-28 13:57     ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-27 23:07 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
> We've had rare reports of transaction overruns in
> xfs_inactive_ifree() for quite some time. Analysis of a reproducing
> metadump has shown the problem is essentially caused by performing
> too many agfl block frees in a single transaction.
> 
> For example, an inode chunk is freed and the associated agfl fixup
> algorithm discovers it needs to free a single agfl block before the
> chunk free occurs. This single block ends up causing a space btree
> join and adds one or more blocks back onto the agfl. This causes
> xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a
> single block discrepency.
> 
> The transaction overrun occurs under several other unfortunate
> circumstances:
> 
>  - Each agfl block free is left+right contiguous. This requires 2
>    record deletions and 1 insertion for the cntbt and thus requires
>    more log reservation than normal.
>  - The associated transaction is the first in the CIL ctx and thus
>    the ctx header reservation is consumed.
>  - The transaction reservation is larger than a log buffer and thus
>    extra split header reservation is consumed.
> 
> As a result of the agfl and free space state of the filesystem, the
> agfl fixup code has dirtied more cntbt buffer space than allowed by
> the portion of the reservation allotted for block allocation. This
> is all before the real allocation even starts!
> 
> Note that the log related conditions above are correctly covered by
> the existing transaction reservation. The above demonstrates that
> the reservation allotted for the context/split headers may help
> suppress overruns in the more common case where that reservation
> goes unused for its intended purpose.
> 
> To address this problem, update xfs_alloc_fix_freelist() to amortize
> agfl block frees over multiple transactions. Free one block per
> transaction so long as the agfl is less than half free. The agfl
> minimum allocation requirement is dynamic, but is based on the
> geometry of the associated btrees (i.e., level count) and therefore
> should be easily rectified over multiple allocation transactions.
> Further, there is no real harm in leaving extraneous blocks on the
> agfl so long as there are enough free slots available for btree
> blocks freed as a result of the upcoming allocation.
> 
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++-----
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> index 0da80019a917..d8d58e35da00 100644
> --- a/fs/xfs/libxfs/xfs_alloc.c
> +++ b/fs/xfs/libxfs/xfs_alloc.c
> @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist(
>  	 * appropriately based on the recursion count and dirty state of the
>  	 * buffer.
>  	 *
> -	 * XXX (dgc): When we have lots of free space, does this buy us
> -	 * anything other than extra overhead when we need to put more blocks
> -	 * back on the free list? Maybe we should only do this when space is
> -	 * getting low or the AGFL is more than half full?
> -	 *
>  	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
>  	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
>  	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
> @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist(
>  			goto out_agbp_relse;
>  		}
>  		xfs_trans_binval(tp, bp);
> +
> +		/*
> +		 * Freeing all extra agfl blocks adds too much log reservation
> +		 * overhead to a single transaction, particularly considering
> +		 * that freeing a block can cause a btree join and put one right
> +		 * back on the agfl. Try to free one block per tx so long as
> +		 * we've left enough free slots for the upcoming modifications.
> +		 */
> +		if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1))
> +			break;
>  	}

In /theory/, this /should/ work. However, as the comment you removed
implies, there are likely to be issues with this as we get near
ENOSPC. We know that if we don't trim the AGFL right down to the
minimum requested as we approach ENOSPC we can get premature ENOSPC
events being reported that lead to filesystem shutdowns.  (e.g. need
the very last free block for a BMBT block to complete a data extent
allocation).  Hence I'd suggest that this needs to be aware of the
low space allocation algorithm (i.e. dfops->dop_low is true) to trim
the agfl right back when we are really short of space

I'm also concerned that it doesn't take into account that freeing
a block from the AGFL could cause a freespace tree split to occur,
thereby emptying the AGFL whilst consuming the entire log
reservation for tree modifications. This leaves nothing in the log
reservation for re-growing the AGFL to the minimum required, which
we /must do/ before returning and could cause more splits/joins to
occur.

IMO, there's a lot more to be concerned about here than just trying
to work around the specific symptom observed in the given test case.
This code is, unfortunately, really tricky and intricate and history
tells us that the risk of unexpected regressions is extremely high,
especially around ENOSPC related issues. Of all the patches in this
series, this is the most dangerous and "iffy" of them and the
one we should be most concerned and conservative about....

I'm not sure what the solution to this problem is, but I'm extremely
hesitant to make changes like this without an awful lot more
analysis of it's impact.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-27 20:24 ` [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications Brian Foster
@ 2017-11-27 23:27   ` Dave Chinner
  2017-11-28 14:04     ` Brian Foster
  2017-11-28 15:49     ` Brian Foster
  0 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2017-11-27 23:27 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> Analysis of recent reports of log reservation overruns and code
> inspection has uncovered that the reservations associated with inode
> operations may not cover the worst case scenarios. In particular,
> many cases only include one allocfree res. for a particular
> operation even though said operations may also entail AGFL fixups
> and inode btree block allocations in addition to the actual inode
> chunk allocation. This can easily turn into two or three block
> allocations (or frees) per operation.
> 
> In theory, the only way to define the worst case reservation is to
> include an allocfree res for each individual allocation in a
> transaction. Since that is impractical (we can perform multiple agfl
> fixups per tx and not every allocation is going to result in a full
> tree operation), implement a reasonable compromise that addresses
> the deficiency in practice without blowing out the size of the
> transactions.
> 
> Refactor the inode transaction reservation code to include one
> allocfree res. per inode btree modification to cover allocations
> required by the tree itself. This essentially separates the
> reservation required to allocate the physical inode chunk from
> additional reservation required to perform inobt record
> insertion/removal.

I think you missed the most important reason the inobt/finobt
modifications need there own allocfree reservation - btree
modifications that cause btree blocks to be freed do not use defered
ops and so the freeing of blocks occurs directly within the primary
transaction. Hence the primary transaction reservation must take
this into account ....

> Apply the same logic to the finobt reservation.
> This results in killing off the finobt modify condition because we
> no longer assume that the broader transaction reservation will cover
> finobt block allocations.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>

Code looks fine. Comments below are for another follow-on patch.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

> @@ -387,8 +386,8 @@ xfs_calc_create_resv_modify(
>   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
>   *    the superblock for the nlink flag: sector size
>   *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
> - *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inode btree (record insertion)
>   */
>  STATIC uint
>  xfs_calc_create_resv_alloc(
> @@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc(
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1));
> +				 XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_inobt_res(mp);
>  }

Looking at this,  I'm wondering if there should also be a function
for calculating the inode chunk reservation. Something like:

static uint
xfs_calc_inode_chunk_res(
	struct xfs-mount	*mp,
	bool			chunk_contents_logged)
{
	uint	res;

	if (chunk_contents_logged) {
		res = xfs_calc_buf_res(mp->m_ialloc_blks,
					XFS_FSB_TO_B(mp, 1));
	} else {
		/* take into account logged headers for freeing */
		res = xfs_calc_buf_res(mp->m_ialloc_blks, 0);
	}

	res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
				XFS_FSB_TO_B(mp, 1));
	return res;
}

Then xfs_calc_create_resv_alloc() reads like:

	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
		mp->m_sb.sb_sectsize + 
		xfs_calc_inode_chunk_res(mp, true) +
		xfs_calc_inobt_res(mp);


>  
>  STATIC uint
> @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
>   * For icreate we can allocate some inodes giving:
>   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
>   *    the superblock for the nlink flag: sector size
> - *    the inode btree: max depth * blocksize
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inobt (record insertion)
>   *    the finobt (record insertion)
>   */
>  STATIC uint
> @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
>  {
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 0);
> +		xfs_calc_inobt_res(mp) +
> +		xfs_calc_finobt_res(mp);
>  }

	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
		mp->m_sb.sb_sectsize +
		xfs_calc_inode_chunk_res(mp, false) +
		xfs_calc_inobt_res(mp) +
		xfs_calc_finobt_res(mp);

>  
>  STATIC uint
> @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
>   *    the agi hash list and counters: sector size
>   *    the on disk inode before ours in the agi hash list: inode cluster size
>   *    the inode chunk is marked stale (headers only)
> - *    the inode btree: max depth * blocksize
> - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> + *    the inode btree
>   *    the finobt (record insertion, removal or modification)
> + *
> + * Note that the allocfree res. for the inode chunk itself is not included
> + * because the extent free occurs after a transaction roll. We could take the
> + * maximum of the pre/post roll operations, but the pre-roll reservation already
> + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> + * be larger.
>   */
>  STATIC uint
>  xfs_calc_ifree_reservation(
> @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_iunlink_remove_reservation(mp) +
>  		xfs_calc_buf_res(1, 0) +
>  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_finobt_res(mp, 0, 1);
> +		xfs_calc_inobt_res(mp) +
> +		xfs_calc_finobt_res(mp);
>  }

	.....
		xfs_calc_iunlink_remove_reservation(mp) +
		xfs_calc_buf_res(1, 0) +
		xfs_calc_inode_chunk_res(mp, false) +
		xfs_calc_inobt_res(mp) +
		xfs_calc_finobt_res(mp);

This seems to make more sense to me - it's clear what the individual
components of the reservation are, and we can ensure that the
individual components have the necessary reservation independently
of the overall reservations that need them.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation
  2017-11-27 22:28   ` Dave Chinner
@ 2017-11-28 13:30     ` Brian Foster
  2017-11-28 21:38       ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-28 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 09:28:57AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:32PM -0500, Brian Foster wrote:
> > The tr_ifree transaction handles inode unlinks and inode chunk
> > frees. The current transaction calculation does not accurately
> > reflect worst case changes to the inode btree, however. The inobt
> > portion of the current transaction reservation only covers
> > modification of a single inobt buffer (for the particular inode
> > record). This is a historical artifact from the days before XFS
> > supported full inode chunk removal.
> > 
> > When support for inode chunk removal was added in commit
> > 254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
> > additional log reservation required for chunk removal was not added
> > correctly. The new reservation only considered the header overhead
> > of associated buffers rather than the full contents of the btrees
> > and AGF and AGFL buffers affected by the transaction. The
> > reservation for the free space btrees was subsequently fixed up in
> > commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
> > ITRUNCATE log reservation"), but the res. for full inobt joins has
> > never been added.
> > 
> > Update xfs_calc_ifree_reservation() to include a full inobt join in
> > the reservation calculation. Refactor the undocumented +2 blocks for
> > the AGF and AGFL buffers into the appropriate place so they are
> > accounted as sectors (not FSBs) and update the associated comments.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c | 11 +++++------
> >  1 file changed, 5 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 6bd916bd35e2..4cd7cd1e60da 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -490,10 +490,10 @@ xfs_calc_symlink_reservation(
> >  /*
> >   * In freeing an inode we can modify:
> >   *    the inode being freed: inode size
> > - *    the super block free inode counter: sector size
> > + *    the super block free inode counter, AGF and AGFL: sector size
> >   *    the agi hash list and counters: sector size
> > - *    the inode btree entry: block size
> >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > + *    the inode chunk is marked stale (headers only)
> >   *    the inode btree: max depth * blocksize
> >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> >   *    the finobt (record insertion, removal or modification)
> > @@ -504,12 +504,11 @@ xfs_calc_ifree_reservation(
> >  {
> >  	return XFS_DQUOT_LOGRES(mp) +
> >  		xfs_calc_inode_res(mp, 1) +
> 
>  *    the inode being freed: inode size
> 
> > -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > -		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> > +		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> 
>  *    the super block free inode counter, AGF and AGFL: sector size
> 
> 		[missing, hidden in calc_iunlink_remove]
> 
>  *    the agi hash list and counters: sector size
> 
> >  		xfs_calc_iunlink_remove_reservation(mp) +
> 
>  *    the on disk inode before ours in the agi hash list: inode cluster size
> 
> 		[missing]
> 
>  *    on-disk inode to log the di_next_unlinked: inode cluster size
> 
> Yes, check the xfs_iunlink_remove() code - we can log two inode
> cluster buffers there: the one prior to us in the unlinked list,
> and ours to reset the di_next_unlinked pointer to
> null. IOWs, xfs_calc_iunlink_remove_reservation() needs to take into
> account /2/ inode cluster buffers, not one.
> 

I was wondering why this wouldn't be covered by the inode size included
above, but looking at the code, I guess we log the agi unlinked changes
directly in the cluster buffer. That means we have separate log items
with independent reservation consumption between the unlinked list
fixups and core inode.

I'll update xfs_calc_iunlink_remove_reservation() to include another
cluster and add a comment.

> >  		xfs_calc_buf_res(1, 0) +
> 
> No idea what this is.
> 

Same, I didn't want to remove it unless we could identify what it was
originally intended for.

> > -		xfs_calc_buf_res(2 + mp->m_ialloc_blks +
> > -				 mp->m_in_maxlevels, 0) +
> > +		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> 
>  *    the inode chunk is marked stale (headers only)
> 
> > +		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> 
>   *    the inode btree: max depth * blocksize
> 
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> >  				 XFS_FSB_TO_B(mp, 1)) +
>   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> >  		xfs_calc_finobt_res(mp, 0, 1);
> 
>   *    the finobt (record insertion, removal or modification)
> 
> The rest look fine.
> 

Thanks.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-27 23:07   ` Dave Chinner
@ 2017-11-28 13:57     ` Brian Foster
  2017-11-28 22:09       ` Dave Chinner
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-28 13:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
> > We've had rare reports of transaction overruns in
> > xfs_inactive_ifree() for quite some time. Analysis of a reproducing
> > metadump has shown the problem is essentially caused by performing
> > too many agfl block frees in a single transaction.
> > 
> > For example, an inode chunk is freed and the associated agfl fixup
> > algorithm discovers it needs to free a single agfl block before the
> > chunk free occurs. This single block ends up causing a space btree
> > join and adds one or more blocks back onto the agfl. This causes
> > xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a
> > single block discrepency.
> > 
> > The transaction overrun occurs under several other unfortunate
> > circumstances:
> > 
> >  - Each agfl block free is left+right contiguous. This requires 2
> >    record deletions and 1 insertion for the cntbt and thus requires
> >    more log reservation than normal.
> >  - The associated transaction is the first in the CIL ctx and thus
> >    the ctx header reservation is consumed.
> >  - The transaction reservation is larger than a log buffer and thus
> >    extra split header reservation is consumed.
> > 
> > As a result of the agfl and free space state of the filesystem, the
> > agfl fixup code has dirtied more cntbt buffer space than allowed by
> > the portion of the reservation allotted for block allocation. This
> > is all before the real allocation even starts!
> > 
> > Note that the log related conditions above are correctly covered by
> > the existing transaction reservation. The above demonstrates that
> > the reservation allotted for the context/split headers may help
> > suppress overruns in the more common case where that reservation
> > goes unused for its intended purpose.
> > 
> > To address this problem, update xfs_alloc_fix_freelist() to amortize
> > agfl block frees over multiple transactions. Free one block per
> > transaction so long as the agfl is less than half free. The agfl
> > minimum allocation requirement is dynamic, but is based on the
> > geometry of the associated btrees (i.e., level count) and therefore
> > should be easily rectified over multiple allocation transactions.
> > Further, there is no real harm in leaving extraneous blocks on the
> > agfl so long as there are enough free slots available for btree
> > blocks freed as a result of the upcoming allocation.
> > 
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++-----
> >  1 file changed, 10 insertions(+), 5 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > index 0da80019a917..d8d58e35da00 100644
> > --- a/fs/xfs/libxfs/xfs_alloc.c
> > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist(
> >  	 * appropriately based on the recursion count and dirty state of the
> >  	 * buffer.
> >  	 *
> > -	 * XXX (dgc): When we have lots of free space, does this buy us
> > -	 * anything other than extra overhead when we need to put more blocks
> > -	 * back on the free list? Maybe we should only do this when space is
> > -	 * getting low or the AGFL is more than half full?
> > -	 *
> >  	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
> >  	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
> >  	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
> > @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist(
> >  			goto out_agbp_relse;
> >  		}
> >  		xfs_trans_binval(tp, bp);
> > +
> > +		/*
> > +		 * Freeing all extra agfl blocks adds too much log reservation
> > +		 * overhead to a single transaction, particularly considering
> > +		 * that freeing a block can cause a btree join and put one right
> > +		 * back on the agfl. Try to free one block per tx so long as
> > +		 * we've left enough free slots for the upcoming modifications.
> > +		 */
> > +		if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1))
> > +			break;
> >  	}
> 
> In /theory/, this /should/ work. However, as the comment you removed
> implies, there are likely to be issues with this as we get near
> ENOSPC. We know that if we don't trim the AGFL right down to the
> minimum requested as we approach ENOSPC we can get premature ENOSPC
> events being reported that lead to filesystem shutdowns.  (e.g. need
> the very last free block for a BMBT block to complete a data extent
> allocation).  Hence I'd suggest that this needs to be aware of the
> low space allocation algorithm (i.e. dfops->dop_low is true) to trim
> the agfl right back when we are really short of space
> 

Hmm, shouldn't the worst case bmbt requirements be satisfied by the
block reservation of the transaction that maps the extent (or stored as
indirect reservation if delalloc)? I'm less concerned with premature
ENOSPC from contexts where it's not fatal.. IIRC, I think we've already
incorporated changes (Christoph's minfree stuff rings a bell) into the
allocator that explicitly prefer premature ENOSPC over potentially fatal
conditions, but I could be mistaken. We're also only talking about 256k
or so for half of the AGFL, less than that if we assume that some of
those blocks are actually required by the agfl and not slated to be
lazily freed. We could also think about explicitly fixing up agfl
surplus from other contexts (background, write -ENOSPC handling, etc.)
if it became that much of a problem.

I do think it's semi-reasonable from a conservative standpoint to
further restrict this behavior to !dop_low conditions, but I'm a little
concerned about creating an "open the floodgates" type situation when
the agfl is allowed to carry extra blocks and then all of a sudden the
free space state changes and one transaction is allowed to free a bunch
of blocks. Thoughts?

I suppose we could incorporate logic that frees until/unless a join
occurs (i.e., the last free did not drop flcount), the latter being an
indication that we've probably logged as much as we should for agfl
fixups in the transaction. But that's also something we could just do
unconditionally as opposed to only under dop_low conditions. That might
be less aggressive of a change from current behavior.

> I'm also concerned that it doesn't take into account that freeing
> a block from the AGFL could cause a freespace tree split to occur,
> thereby emptying the AGFL whilst consuming the entire log
> reservation for tree modifications. This leaves nothing in the log
> reservation for re-growing the AGFL to the minimum required, which
> we /must do/ before returning and could cause more splits/joins to
> occur.
> 

How is this different from the current code? This sounds to me like an
unconditional side effect of the fact that freeing an agfl block can
indirectly affect the agfl via the btree operations. IOW, freeing a
single extra block could require consuming one or more and trigger the
need for an allocation. I suspect the allocation could then very well
cause a join on the other tree and put more than one block back onto the
agfl.

> IMO, there's a lot more to be concerned about here than just trying
> to work around the specific symptom observed in the given test case.
> This code is, unfortunately, really tricky and intricate and history
> tells us that the risk of unexpected regressions is extremely high,
> especially around ENOSPC related issues. Of all the patches in this
> series, this is the most dangerous and "iffy" of them and the
> one we should be most concerned and conservative about....
> 

Agreed. The impact is something that also has to be evaluated over a
sequence of transactions along with the more obvious impact on a single
transaction.

Brian

> I'm not sure what the solution to this problem is, but I'm extremely
> hesitant to make changes like this without an awful lot more
> analysis of it's impact.
> 
> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-27 23:27   ` Dave Chinner
@ 2017-11-28 14:04     ` Brian Foster
  2017-11-28 22:26       ` Dave Chinner
  2017-11-28 15:49     ` Brian Foster
  1 sibling, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-28 14:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > Analysis of recent reports of log reservation overruns and code
> > inspection has uncovered that the reservations associated with inode
> > operations may not cover the worst case scenarios. In particular,
> > many cases only include one allocfree res. for a particular
> > operation even though said operations may also entail AGFL fixups
> > and inode btree block allocations in addition to the actual inode
> > chunk allocation. This can easily turn into two or three block
> > allocations (or frees) per operation.
> > 
> > In theory, the only way to define the worst case reservation is to
> > include an allocfree res for each individual allocation in a
> > transaction. Since that is impractical (we can perform multiple agfl
> > fixups per tx and not every allocation is going to result in a full
> > tree operation), implement a reasonable compromise that addresses
> > the deficiency in practice without blowing out the size of the
> > transactions.
> > 
> > Refactor the inode transaction reservation code to include one
> > allocfree res. per inode btree modification to cover allocations
> > required by the tree itself. This essentially separates the
> > reservation required to allocate the physical inode chunk from
> > additional reservation required to perform inobt record
> > insertion/removal.
> 
> I think you missed the most important reason the inobt/finobt
> modifications need there own allocfree reservation - btree
> modifications that cause btree blocks to be freed do not use defered
> ops and so the freeing of blocks occurs directly within the primary
> transaction. Hence the primary transaction reservation must take
> this into account ....
> 

Sort of implied by "to cover allocations by the tree itself," but I'll
update the commit log to be more explicit.

> > Apply the same logic to the finobt reservation.
> > This results in killing off the finobt modify condition because we
> > no longer assume that the broader transaction reservation will cover
> > finobt block allocations.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Code looks fine. Comments below are for another follow-on patch.
> 

Ok..

> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> > @@ -387,8 +386,8 @@ xfs_calc_create_resv_modify(
> >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> >   *    the superblock for the nlink flag: sector size
> >   *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
> > - *    the inode btree: max depth * blocksize
> >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inode btree (record insertion)
> >   */
> >  STATIC uint
> >  xfs_calc_create_resv_alloc(
> > @@ -397,9 +396,9 @@ xfs_calc_create_resv_alloc(
> >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> >  		mp->m_sb.sb_sectsize +
> >  		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > -				 XFS_FSB_TO_B(mp, 1));
> > +				 XFS_FSB_TO_B(mp, 1)) +
> > +		xfs_calc_inobt_res(mp);
> >  }
> 
> Looking at this,  I'm wondering if there should also be a function
> for calculating the inode chunk reservation. Something like:
> 
> static uint
> xfs_calc_inode_chunk_res(
> 	struct xfs-mount	*mp,
> 	bool			chunk_contents_logged)
> {
> 	uint	res;
> 
> 	if (chunk_contents_logged) {
> 		res = xfs_calc_buf_res(mp->m_ialloc_blks,
> 					XFS_FSB_TO_B(mp, 1));
> 	} else {
> 		/* take into account logged headers for freeing */
> 		res = xfs_calc_buf_res(mp->m_ialloc_blks, 0);
> 	}
> 
> 	res += xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> 				XFS_FSB_TO_B(mp, 1));
> 	return res;
> }
> 
> Then xfs_calc_create_resv_alloc() reads like:
> 
> 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> 		mp->m_sb.sb_sectsize + 
> 		xfs_calc_inode_chunk_res(mp, true) +
> 		xfs_calc_inobt_res(mp);
> 

Looks reasonable.

> 
> >  
> >  STATIC uint
> > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> >   * For icreate we can allocate some inodes giving:
> >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> >   *    the superblock for the nlink flag: sector size
> > - *    the inode btree: max depth * blocksize
> >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inobt (record insertion)
> >   *    the finobt (record insertion)
> >   */
> >  STATIC uint
> > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> >  {
> >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> >  		mp->m_sb.sb_sectsize +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> >  				 XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_finobt_res(mp, 0, 0);
> > +		xfs_calc_inobt_res(mp) +
> > +		xfs_calc_finobt_res(mp);
> >  }
> 
> 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> 		mp->m_sb.sb_sectsize +
> 		xfs_calc_inode_chunk_res(mp, false) +
> 		xfs_calc_inobt_res(mp) +
> 		xfs_calc_finobt_res(mp);
> 

The icreate reservation doesn't currently include m_ialloc_blks at all.
The helper, as defined above, adds a reservation for associated headers.
Is that really necessary?  My understanding is that icreate doesn't log
the inode chunk. I suppose we could get around that by tweaking the
parameter to take the size to reserve instead of a bool and pass a dummy
value (i.e., negative) to avoid logging the chunk at all. A little ugly,
but I could live with it.

> >  
> >  STATIC uint
> > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> >   *    the agi hash list and counters: sector size
> >   *    the on disk inode before ours in the agi hash list: inode cluster size
> >   *    the inode chunk is marked stale (headers only)
> > - *    the inode btree: max depth * blocksize
> > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > + *    the inode btree
> >   *    the finobt (record insertion, removal or modification)
> > + *
> > + * Note that the allocfree res. for the inode chunk itself is not included
> > + * because the extent free occurs after a transaction roll. We could take the
> > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > + * be larger.
> >   */
> >  STATIC uint
> >  xfs_calc_ifree_reservation(
> > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> >  		xfs_calc_iunlink_remove_reservation(mp) +
> >  		xfs_calc_buf_res(1, 0) +
> >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > -				 XFS_FSB_TO_B(mp, 1)) +
> > -		xfs_calc_finobt_res(mp, 0, 1);
> > +		xfs_calc_inobt_res(mp) +
> > +		xfs_calc_finobt_res(mp);
> >  }
> 
> 	.....
> 		xfs_calc_iunlink_remove_reservation(mp) +
> 		xfs_calc_buf_res(1, 0) +
> 		xfs_calc_inode_chunk_res(mp, false) +
> 		xfs_calc_inobt_res(mp) +
> 		xfs_calc_finobt_res(mp);
> 

This covers the inode chunk invalidation, but also adds the allocfree
res. for the chunk free where we technically don't need it (because the
free is deferred, re: the comment above).

I guess I'm fine with just adding that one, but I'd update the comment
above to point out that it's technically unecessary. Hm?

> This seems to make more sense to me - it's clear what the individual
> components of the reservation are, and we can ensure that the
> individual components have the necessary reservation independently
> of the overall reservations that need them.
> 

I agree in principle. I think the underlying helpers (and pushing down
some of the associated documentation) help clearly declare the intent of
the reservations, particularly when we include multiple factors of a
single reservation and/or have situations where we don't technically
have a definition of worst case, but want to define something logically
reasonable (like the whole allocfree per inode tree thing).

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-27 23:27   ` Dave Chinner
  2017-11-28 14:04     ` Brian Foster
@ 2017-11-28 15:49     ` Brian Foster
  2017-11-28 22:34       ` Dave Chinner
  1 sibling, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-28 15:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > Analysis of recent reports of log reservation overruns and code
> > inspection has uncovered that the reservations associated with inode
> > operations may not cover the worst case scenarios. In particular,
> > many cases only include one allocfree res. for a particular
> > operation even though said operations may also entail AGFL fixups
> > and inode btree block allocations in addition to the actual inode
> > chunk allocation. This can easily turn into two or three block
> > allocations (or frees) per operation.
> > 
> > In theory, the only way to define the worst case reservation is to
> > include an allocfree res for each individual allocation in a
> > transaction. Since that is impractical (we can perform multiple agfl
> > fixups per tx and not every allocation is going to result in a full
> > tree operation), implement a reasonable compromise that addresses
> > the deficiency in practice without blowing out the size of the
> > transactions.
> > 
> > Refactor the inode transaction reservation code to include one
> > allocfree res. per inode btree modification to cover allocations
> > required by the tree itself. This essentially separates the
> > reservation required to allocate the physical inode chunk from
> > additional reservation required to perform inobt record
> > insertion/removal.
> 
> I think you missed the most important reason the inobt/finobt
> modifications need there own allocfree reservation - btree
> modifications that cause btree blocks to be freed do not use defered
> ops and so the freeing of blocks occurs directly within the primary
> transaction. Hence the primary transaction reservation must take
> this into account ....
> 
> > Apply the same logic to the finobt reservation.
> > This results in killing off the finobt modify condition because we
> > no longer assume that the broader transaction reservation will cover
> > finobt block allocations.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> 
> Code looks fine. Comments below are for another follow-on patch.
> 

Actually, what do you think of the following variant (compile tested
only)? This facilites another cleanup patch I just wrote to kill off
about half of the (now effectively duplicate) xfs_calc_create_*()
helpers because the finobt and inode chunk res. helpers only include the
associated reservation based on the associated feature bits.

Brian

--- 8< ---

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index c27c57e65e15..045781f2cf00 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -172,6 +172,41 @@ xfs_calc_finobt_res(
 }
 
 /*
+ * Calculate the reservation required to allocate or free an inode chunk. This
+ * includes:
+ *
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
+ * the inode chunk: m_ialloc_blks * N
+ *
+ * The size N of the inode chunk reservation depends on whether it is for
+ * allocation or free and which type of create transaction is in use. An inode
+ * chunk free always invalidates the buffers and only requires reservation for
+ * headers (N == 0). An inode chunk allocation requires a chunk sized
+ * reservation on v4 and older superblocks to initialize the chunk. No chunk
+ * reservation is required for allocation on v5 supers, which use ordered
+ * buffers to initialize.
+ */
+STATIC uint
+xfs_calc_inode_chunk_res(
+	struct xfs_mount	*mp,
+	bool			alloc)
+{
+	uint			res, size = 0;
+
+	res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+			       XFS_FSB_TO_B(mp, 1));
+	if (alloc) {
+		/* icreate tx uses ordered buffers */
+		if (xfs_sb_version_hascrc(&mp->m_sb))
+			return res;
+		size = XFS_FSB_TO_B(mp, 1);
+	}
+
+	res += xfs_calc_buf_res(mp->m_ialloc_blks, size);
+	return res;
+}
+
+/*
  * Various log reservation values.
  *
  * These are based on the size of the file system block because that is what
@@ -386,8 +421,7 @@ xfs_calc_create_resv_modify(
  * For create we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the inode blocks allocated: mp->m_ialloc_blks * blocksize
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode chunk (allocation/init)
  *    the inode btree (record insertion)
  */
 STATIC uint
@@ -396,9 +430,7 @@ xfs_calc_create_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inode_chunk_res(mp, true) +
 		xfs_calc_inobt_res(mp);
 }
 
@@ -415,7 +447,7 @@ __xfs_calc_create_reservation(
  * For icreate we can allocate some inodes giving:
  *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
  *    the superblock for the nlink flag: sector size
- *    the allocation btrees: 2 trees * (max depth - 1) * block size
+ *    the inode chunk (allocation, no init)
  *    the inobt (record insertion)
  *    the finobt (record insertion)
  */
@@ -425,8 +457,7 @@ xfs_calc_icreate_resv_alloc(
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		mp->m_sb.sb_sectsize +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_inode_chunk_res(mp, true) +
 		xfs_calc_inobt_res(mp) +
 		xfs_calc_finobt_res(mp);
 }
@@ -492,15 +523,15 @@ xfs_calc_symlink_reservation(
  *    the inode being freed: inode size
  *    the super block free inode counter, AGF and AGFL: sector size
  *    the on disk inode (agi unlinked list removal)
- *    the inode chunk is marked stale (headers only)
+ *    the inode chunk (invalidated, headers only)
  *    the inode btree
  *    the finobt (record insertion, removal or modification)
  *
- * Note that the allocfree res. for the inode chunk itself is not included
- * because the extent free occurs after a transaction roll. We could take the
- * maximum of the pre/post roll operations, but the pre-roll reservation already
- * includes at least one allocfree res. for the inobt and is thus guaranteed to
- * be larger.
+ * Note that the inode chunk res. includes an allocfree res. for freeing of the
+ * inode chunk. This is technically extraneous because the inode chunk free is
+ * deferred (it occurs after a transaction roll). Include the extra reservation
+ * anyways since we've had reports of ifree transaction overruns due to too many
+ * agfl fixups during inode chunk frees.
  */
 STATIC uint
 xfs_calc_ifree_reservation(
@@ -511,7 +542,7 @@ xfs_calc_ifree_reservation(
 		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 		xfs_calc_iunlink_remove_reservation(mp) +
 		xfs_calc_buf_res(1, 0) +
-		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
+		xfs_calc_inode_chunk_res(mp, false) +
 		xfs_calc_inobt_res(mp) +
 		xfs_calc_finobt_res(mp);
 }

^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation
  2017-11-28 13:30     ` Brian Foster
@ 2017-11-28 21:38       ` Dave Chinner
  2017-11-29 14:31         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-28 21:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 08:30:02AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 09:28:57AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:32PM -0500, Brian Foster wrote:
> > > The tr_ifree transaction handles inode unlinks and inode chunk
> > > frees. The current transaction calculation does not accurately
> > > reflect worst case changes to the inode btree, however. The inobt
> > > portion of the current transaction reservation only covers
> > > modification of a single inobt buffer (for the particular inode
> > > record). This is a historical artifact from the days before XFS
> > > supported full inode chunk removal.
> > > 
> > > When support for inode chunk removal was added in commit
> > > 254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
> > > additional log reservation required for chunk removal was not added
> > > correctly. The new reservation only considered the header overhead
> > > of associated buffers rather than the full contents of the btrees
> > > and AGF and AGFL buffers affected by the transaction. The
> > > reservation for the free space btrees was subsequently fixed up in
> > > commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
> > > ITRUNCATE log reservation"), but the res. for full inobt joins has
> > > never been added.
> > > 
> > > Update xfs_calc_ifree_reservation() to include a full inobt join in
> > > the reservation calculation. Refactor the undocumented +2 blocks for
> > > the AGF and AGFL buffers into the appropriate place so they are
> > > accounted as sectors (not FSBs) and update the associated comments.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_resv.c | 11 +++++------
> > >  1 file changed, 5 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > index 6bd916bd35e2..4cd7cd1e60da 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > @@ -490,10 +490,10 @@ xfs_calc_symlink_reservation(
> > >  /*
> > >   * In freeing an inode we can modify:
> > >   *    the inode being freed: inode size
> > > - *    the super block free inode counter: sector size
> > > + *    the super block free inode counter, AGF and AGFL: sector size
> > >   *    the agi hash list and counters: sector size
> > > - *    the inode btree entry: block size
> > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > > + *    the inode chunk is marked stale (headers only)
> > >   *    the inode btree: max depth * blocksize
> > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > >   *    the finobt (record insertion, removal or modification)
> > > @@ -504,12 +504,11 @@ xfs_calc_ifree_reservation(
> > >  {
> > >  	return XFS_DQUOT_LOGRES(mp) +
> > >  		xfs_calc_inode_res(mp, 1) +
> > 
> >  *    the inode being freed: inode size
> > 
> > > -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > > -		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> > > +		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> > 
> >  *    the super block free inode counter, AGF and AGFL: sector size
> > 
> > 		[missing, hidden in calc_iunlink_remove]
> > 
> >  *    the agi hash list and counters: sector size
> > 
> > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > 
> >  *    the on disk inode before ours in the agi hash list: inode cluster size
> > 
> > 		[missing]
> > 
> >  *    on-disk inode to log the di_next_unlinked: inode cluster size
> > 
> > Yes, check the xfs_iunlink_remove() code - we can log two inode
> > cluster buffers there: the one prior to us in the unlinked list,
> > and ours to reset the di_next_unlinked pointer to
> > null. IOWs, xfs_calc_iunlink_remove_reservation() needs to take into
> > account /2/ inode cluster buffers, not one.
> > 
> 
> I was wondering why this wouldn't be covered by the inode size included
> above, but looking at the code, I guess we log the agi unlinked changes
> directly in the cluster buffer. That means we have separate log items
> with independent reservation consumption between the unlinked list
> fixups and core inode.
> 
> I'll update xfs_calc_iunlink_remove_reservation() to include another
> cluster and add a comment.

And, as I mentioned in the other thread,
xfs_calc_iunlink_add_reservation() needs this fixup, too.

> 
> > >  		xfs_calc_buf_res(1, 0) +
> > 
> > No idea what this is.
> > 
> 
> Same, I didn't want to remove it unless we could identify what it was
> originally intended for.

It's only going to be a 128 bytes for a header, so I think we should
just remove it rather than keep some unknown magic in the
reservation. If it turns out to be critical, then we need to
understand exactly where it came from...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-28 13:57     ` Brian Foster
@ 2017-11-28 22:09       ` Dave Chinner
  2017-11-29 18:24         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-28 22:09 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
> > > We've had rare reports of transaction overruns in
> > > xfs_inactive_ifree() for quite some time. Analysis of a reproducing
> > > metadump has shown the problem is essentially caused by performing
> > > too many agfl block frees in a single transaction.
> > > 
> > > For example, an inode chunk is freed and the associated agfl fixup
> > > algorithm discovers it needs to free a single agfl block before the
> > > chunk free occurs. This single block ends up causing a space btree
> > > join and adds one or more blocks back onto the agfl. This causes
> > > xfs_alloc_fix_freelist() to free up to 3 blocks just to rectify a
> > > single block discrepency.
> > > 
> > > The transaction overrun occurs under several other unfortunate
> > > circumstances:
> > > 
> > >  - Each agfl block free is left+right contiguous. This requires 2
> > >    record deletions and 1 insertion for the cntbt and thus requires
> > >    more log reservation than normal.
> > >  - The associated transaction is the first in the CIL ctx and thus
> > >    the ctx header reservation is consumed.
> > >  - The transaction reservation is larger than a log buffer and thus
> > >    extra split header reservation is consumed.
> > > 
> > > As a result of the agfl and free space state of the filesystem, the
> > > agfl fixup code has dirtied more cntbt buffer space than allowed by
> > > the portion of the reservation allotted for block allocation. This
> > > is all before the real allocation even starts!
> > > 
> > > Note that the log related conditions above are correctly covered by
> > > the existing transaction reservation. The above demonstrates that
> > > the reservation allotted for the context/split headers may help
> > > suppress overruns in the more common case where that reservation
> > > goes unused for its intended purpose.
> > > 
> > > To address this problem, update xfs_alloc_fix_freelist() to amortize
> > > agfl block frees over multiple transactions. Free one block per
> > > transaction so long as the agfl is less than half free. The agfl
> > > minimum allocation requirement is dynamic, but is based on the
> > > geometry of the associated btrees (i.e., level count) and therefore
> > > should be easily rectified over multiple allocation transactions.
> > > Further, there is no real harm in leaving extraneous blocks on the
> > > agfl so long as there are enough free slots available for btree
> > > blocks freed as a result of the upcoming allocation.
> > > 
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_alloc.c | 15 ++++++++++-----
> > >  1 file changed, 10 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
> > > index 0da80019a917..d8d58e35da00 100644
> > > --- a/fs/xfs/libxfs/xfs_alloc.c
> > > +++ b/fs/xfs/libxfs/xfs_alloc.c
> > > @@ -2117,11 +2117,6 @@ xfs_alloc_fix_freelist(
> > >  	 * appropriately based on the recursion count and dirty state of the
> > >  	 * buffer.
> > >  	 *
> > > -	 * XXX (dgc): When we have lots of free space, does this buy us
> > > -	 * anything other than extra overhead when we need to put more blocks
> > > -	 * back on the free list? Maybe we should only do this when space is
> > > -	 * getting low or the AGFL is more than half full?
> > > -	 *
> > >  	 * The NOSHRINK flag prevents the AGFL from being shrunk if it's too
> > >  	 * big; the NORMAP flag prevents AGFL expand/shrink operations from
> > >  	 * updating the rmapbt.  Both flags are used in xfs_repair while we're
> > > @@ -2151,6 +2146,16 @@ xfs_alloc_fix_freelist(
> > >  			goto out_agbp_relse;
> > >  		}
> > >  		xfs_trans_binval(tp, bp);
> > > +
> > > +		/*
> > > +		 * Freeing all extra agfl blocks adds too much log reservation
> > > +		 * overhead to a single transaction, particularly considering
> > > +		 * that freeing a block can cause a btree join and put one right
> > > +		 * back on the agfl. Try to free one block per tx so long as
> > > +		 * we've left enough free slots for the upcoming modifications.
> > > +		 */
> > > +		if (pag->pagf_flcount <= (XFS_AGFL_SIZE(mp) >> 1))
> > > +			break;
> > >  	}
> > 
> > In /theory/, this /should/ work. However, as the comment you removed
> > implies, there are likely to be issues with this as we get near
> > ENOSPC. We know that if we don't trim the AGFL right down to the
> > minimum requested as we approach ENOSPC we can get premature ENOSPC
> > events being reported that lead to filesystem shutdowns.  (e.g. need
> > the very last free block for a BMBT block to complete a data extent
> > allocation).  Hence I'd suggest that this needs to be aware of the
> > low space allocation algorithm (i.e. dfops->dop_low is true) to trim
> > the agfl right back when we are really short of space
> > 
> 
> Hmm, shouldn't the worst case bmbt requirements be satisfied by the
> block reservation of the transaction that maps the extent (or stored as
> indirect reservation if delalloc)?

That's checked against global free space when we do the initial
reservation. It's not checked against AG free space until we make
the allocation attempt.

The problem here is that if we don't leave enough space for the BMBT
block in the same AG as the data extent was allocated we can end up
with an ENOSPC on the BMBT block allocation because, say, AGF
locking order and all the higher AGFs are already ENOSPC. That's a
fatal ENOSPC error as the transaction is already dirty and will
shutdown the filesystem on transaction cancel.

> I'm less concerned with premature
> ENOSPC from contexts where it's not fatal..

The definition of "premature ENOSPC" is "unexpected ENOSPC that
causes a filesystem shutdown" :/

>
> IIRC, I think we've already
> incorporated changes (Christoph's minfree stuff rings a bell) into the
> allocator that explicitly prefer premature ENOSPC over potentially fatal
> conditions, but I could be mistaken.

Yes, but that doesn't take into account all the crazy AGFL
manipulations that could occur as a result of trimming/extending the
AGFL. We /assume/ that we don't need signification reservations to
extend/trim the AGFL to what is required for the transaction. If we
now how to free 50 blocks from the AGFL back to the freespace tree
before we start the allocation/free operation, that's going to
have some impact on the reservations we have.

And think about 4k sector filesystems - there's close on 1000
entries in the AGFL - this means we could have 500 blocks floating
on the AGFL that would otherwise be in the freespace tree. IOWs,
this change could have significant impact on freespace
fragmentation because blocks aren't getting returned to the
freespace tree where they merge with other small freespace extents
to reform larger extents.

It's these sorts of issues (i.e. filesystem aging) that might not
show up for months of production use that we need to think about. It
may be that a larger AGFL helps here because it keeps a working
set of freespace tree blocks on the AGFL rather than having to
go back to the freespace trees all the time, but this is something
we need to measure, analyse and characterise before changing a
critical piece of the allocation architecture....

> We're also only talking about 256k
> or so for half of the AGFL, less than that if we assume that some of
> those blocks are actually required by the agfl and not slated to be
> lazily freed. We could also think about explicitly fixing up agfl
> surplus from other contexts (background, write -ENOSPC handling, etc.)
> if it became that much of a problem.
> 
> I do think it's semi-reasonable from a conservative standpoint to
> further restrict this behavior to !dop_low conditions, but I'm a little
> concerned about creating an "open the floodgates" type situation when
> the agfl is allowed to carry extra blocks and then all of a sudden the
> free space state changes and one transaction is allowed to free a bunch
> of blocks. Thoughts?

That's exactly the sort of problem we need to measure, analyse
and characterise before making this sort of change :/

> I suppose we could incorporate logic that frees until/unless a join
> occurs (i.e., the last free did not drop flcount), the latter being an
> indication that we've probably logged as much as we should for agfl
> fixups in the transaction. But that's also something we could just do
> unconditionally as opposed to only under dop_low conditions. That might
> be less aggressive of a change from current behavior.

Heuristics will still need to be analysed and tested :/

The complexity of doing this is why I left a small comment rather
than actually making the change....

> > I'm also concerned that it doesn't take into account that freeing
> > a block from the AGFL could cause a freespace tree split to occur,
> > thereby emptying the AGFL whilst consuming the entire log
> > reservation for tree modifications. This leaves nothing in the log
> > reservation for re-growing the AGFL to the minimum required, which
> > we /must do/ before returning and could cause more splits/joins to
> > occur.
> > 
> 
> How is this different from the current code?

It's not. I'm pointing out that while you're focussed on the
problems with shortening the AGFL, the same problem exists with
/extending the AGFL/.

> This sounds to me like an
> unconditional side effect of the fact that freeing an agfl block can
> indirectly affect the agfl via the btree operations. IOW, freeing a
> single extra block could require consuming one or more and trigger the
> need for an allocation. I suspect the allocation could then very well
> cause a join on the other tree and put more than one block back onto the
> agfl.

Yes, it could do that too. Remove a single block from an existing
free extent, no change to the by-block btree. by-cnt now requires a
record delete (full height join) followed by record insert elsewhere
in the tree (full height split). So the attempt to add a block to
the AGFL can actually shorten it if the by-cnt tree splits on
insert. It can grow if the by-block or by-cnt tree joins on record
removal.

Either way, we've got the same problem of using the entire log
reservation for AGFL modification when growing the AGFL as we do
when trying to shrink the AGFL down.

That's my point here - just hacking a limit into the shrink case
doesn't address the problem at all - it just papers over one of the
visible symptoms....

> > IMO, there's a lot more to be concerned about here than just trying
> > to work around the specific symptom observed in the given test case.
> > This code is, unfortunately, really tricky and intricate and history
> > tells us that the risk of unexpected regressions is extremely high,
> > especially around ENOSPC related issues. Of all the patches in this
> > series, this is the most dangerous and "iffy" of them and the
> > one we should be most concerned and conservative about....
> > 
> 
> Agreed. The impact is something that also has to be evaluated over a
> sequence of transactions along with the more obvious impact on a single
> transaction.

The impact has to be measured over a far longer time frame than a
few transactions. The fact it has impact on freespace reformation
means it'll need accelerated aging tests done on it so we can be
reasonably certain that it isn't going to bite us in extended
production environments...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-28 14:04     ` Brian Foster
@ 2017-11-28 22:26       ` Dave Chinner
  2017-11-29 14:32         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-28 22:26 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 09:04:59AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > >  STATIC uint
> > > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> > >   * For icreate we can allocate some inodes giving:
> > >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> > >   *    the superblock for the nlink flag: sector size
> > > - *    the inode btree: max depth * blocksize
> > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > + *    the inobt (record insertion)
> > >   *    the finobt (record insertion)
> > >   */
> > >  STATIC uint
> > > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> > >  {
> > >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > >  		mp->m_sb.sb_sectsize +
> > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > >  				 XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_finobt_res(mp, 0, 0);
> > > +		xfs_calc_inobt_res(mp) +
> > > +		xfs_calc_finobt_res(mp);
> > >  }
> > 
> > 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > 		mp->m_sb.sb_sectsize +
> > 		xfs_calc_inode_chunk_res(mp, false) +
> > 		xfs_calc_inobt_res(mp) +
> > 		xfs_calc_finobt_res(mp);
> > 
> 
> The icreate reservation doesn't currently include m_ialloc_blks at all.
> The helper, as defined above, adds a reservation for associated headers.
> Is that really necessary?  My understanding is that icreate doesn't log
> the inode chunk.

Right, it uses ordered buffers to avoid needing to log them.

> I suppose we could get around that by tweaking the
> parameter to take the size to reserve instead of a bool and pass a dummy
> value (i.e., negative) to avoid logging the chunk at all. A little ugly,
> but I could live with it.

I don't think that having an extra few hundred bytes of overhead in
the reservation is going to be noticable by anyone. I'd just
ignore the problem (as I did when suggesting this).

> > >  STATIC uint
> > > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> > >   *    the agi hash list and counters: sector size
> > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > >   *    the inode chunk is marked stale (headers only)
> > > - *    the inode btree: max depth * blocksize
> > > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > + *    the inode btree
> > >   *    the finobt (record insertion, removal or modification)
> > > + *
> > > + * Note that the allocfree res. for the inode chunk itself is not included
> > > + * because the extent free occurs after a transaction roll. We could take the
> > > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > > + * be larger.
> > >   */
> > >  STATIC uint
> > >  xfs_calc_ifree_reservation(
> > > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > >  		xfs_calc_buf_res(1, 0) +
> > >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > -				 XFS_FSB_TO_B(mp, 1)) +
> > > -		xfs_calc_finobt_res(mp, 0, 1);
> > > +		xfs_calc_inobt_res(mp) +
> > > +		xfs_calc_finobt_res(mp);
> > >  }
> > 
> > 	.....
> > 		xfs_calc_iunlink_remove_reservation(mp) +
> > 		xfs_calc_buf_res(1, 0) +
> > 		xfs_calc_inode_chunk_res(mp, false) +
> > 		xfs_calc_inobt_res(mp) +
> > 		xfs_calc_finobt_res(mp);
> > 
> 
> This covers the inode chunk invalidation, but also adds the allocfree
> res. for the chunk free where we technically don't need it (because the
> free is deferred, re: the comment above).
> 
> I guess I'm fine with just adding that one, but I'd update the comment
> above to point out that it's technically unecessary. Hm?

*nod*

Though with sparse inodes, we might be freeing multiple extents,
right?  which means we probably need all the allocfree reservations
we can get....

> > This seems to make more sense to me - it's clear what the individual
> > components of the reservation are, and we can ensure that the
> > individual components have the necessary reservation independently
> > of the overall reservations that need them.
> > 
> 
> I agree in principle. I think the underlying helpers (and pushing down
> some of the associated documentation) help clearly declare the intent of
> the reservations, particularly when we include multiple factors of a
> single reservation and/or have situations where we don't technically
> have a definition of worst case, but want to define something logically
> reasonable (like the whole allocfree per inode tree thing).

Yup, that's pretty much what I was thinking.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-28 15:49     ` Brian Foster
@ 2017-11-28 22:34       ` Dave Chinner
  2017-11-29 14:32         ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2017-11-28 22:34 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Nov 28, 2017 at 10:49:22AM -0500, Brian Foster wrote:
> On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > > Analysis of recent reports of log reservation overruns and code
> > > inspection has uncovered that the reservations associated with inode
> > > operations may not cover the worst case scenarios. In particular,
> > > many cases only include one allocfree res. for a particular
> > > operation even though said operations may also entail AGFL fixups
> > > and inode btree block allocations in addition to the actual inode
> > > chunk allocation. This can easily turn into two or three block
> > > allocations (or frees) per operation.
> > > 
> > > In theory, the only way to define the worst case reservation is to
> > > include an allocfree res for each individual allocation in a
> > > transaction. Since that is impractical (we can perform multiple agfl
> > > fixups per tx and not every allocation is going to result in a full
> > > tree operation), implement a reasonable compromise that addresses
> > > the deficiency in practice without blowing out the size of the
> > > transactions.
> > > 
> > > Refactor the inode transaction reservation code to include one
> > > allocfree res. per inode btree modification to cover allocations
> > > required by the tree itself. This essentially separates the
> > > reservation required to allocate the physical inode chunk from
> > > additional reservation required to perform inobt record
> > > insertion/removal.
> > 
> > I think you missed the most important reason the inobt/finobt
> > modifications need there own allocfree reservation - btree
> > modifications that cause btree blocks to be freed do not use defered
> > ops and so the freeing of blocks occurs directly within the primary
> > transaction. Hence the primary transaction reservation must take
> > this into account ....
> > 
> > > Apply the same logic to the finobt reservation.
> > > This results in killing off the finobt modify condition because we
> > > no longer assume that the broader transaction reservation will cover
> > > finobt block allocations.
> > > 
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > 
> > Code looks fine. Comments below are for another follow-on patch.
> > 
> 
> Actually, what do you think of the following variant (compile tested
> only)? This facilites another cleanup patch I just wrote to kill off
> about half of the (now effectively duplicate) xfs_calc_create_*()
> helpers because the finobt and inode chunk res. helpers only include the
> associated reservation based on the associated feature bits.

Yup, makes a lot of sense to do that.

FWIW, because we've got so many alloc vs free type reservations,
would it make more sense to do something like:

#define _ALLOC	true
#define _FREE	false

And use those in the callers rather than true/false?

i.e. this code remains the same in that it takes an "bool alloc"
flag as a parameter:

> +STATIC uint
> +xfs_calc_inode_chunk_res(
> +	struct xfs_mount	*mp,
> +	bool			alloc)
> +{
> +	uint			res, size = 0;
> +
> +	res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> +			       XFS_FSB_TO_B(mp, 1));
> +	if (alloc) {
> +		/* icreate tx uses ordered buffers */
> +		if (xfs_sb_version_hascrc(&mp->m_sb))
> +			return res;
> +		size = XFS_FSB_TO_B(mp, 1);
> +	}
> +
> +	res += xfs_calc_buf_res(mp->m_ialloc_blks, size);
> +	return res;
> +}

but makes the callers  look like:

>  STATIC uint
> @@ -396,9 +430,7 @@ xfs_calc_create_resv_alloc(
>  {
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		mp->m_sb.sb_sectsize +
> -		xfs_calc_buf_res(mp->m_ialloc_blks, XFS_FSB_TO_B(mp, 1)) +
> -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> -				 XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_inode_chunk_res(mp, true) +

		xfs_calc_inode_chunk_res(mp, _ALLOC) +
		.....

and

> @@ -511,7 +542,7 @@ xfs_calc_ifree_reservation(
>  		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>  		xfs_calc_iunlink_remove_reservation(mp) +
>  		xfs_calc_buf_res(1, 0) +
> -		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> +		xfs_calc_inode_chunk_res(mp, false) +

		xfs_calc_inode_chunk_res(mp, _FREE) +
		.....

That would make it very clear what type of reservation we are
acutally expecting to take out in the calculation. i.e. the code
is now clearly self documenting :)

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation
  2017-11-28 21:38       ` Dave Chinner
@ 2017-11-29 14:31         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-11-29 14:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 08:38:44AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 08:30:02AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 09:28:57AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:32PM -0500, Brian Foster wrote:
> > > > The tr_ifree transaction handles inode unlinks and inode chunk
> > > > frees. The current transaction calculation does not accurately
> > > > reflect worst case changes to the inode btree, however. The inobt
> > > > portion of the current transaction reservation only covers
> > > > modification of a single inobt buffer (for the particular inode
> > > > record). This is a historical artifact from the days before XFS
> > > > supported full inode chunk removal.
> > > > 
> > > > When support for inode chunk removal was added in commit
> > > > 254f6311ed1b ("Implement deletion of inode clusters in XFS."), the
> > > > additional log reservation required for chunk removal was not added
> > > > correctly. The new reservation only considered the header overhead
> > > > of associated buffers rather than the full contents of the btrees
> > > > and AGF and AGFL buffers affected by the transaction. The
> > > > reservation for the free space btrees was subsequently fixed up in
> > > > commit 5fe6abb82f76 ("Add space for inode and allocation btrees to
> > > > ITRUNCATE log reservation"), but the res. for full inobt joins has
> > > > never been added.
> > > > 
> > > > Update xfs_calc_ifree_reservation() to include a full inobt join in
> > > > the reservation calculation. Refactor the undocumented +2 blocks for
> > > > the AGF and AGFL buffers into the appropriate place so they are
> > > > accounted as sectors (not FSBs) and update the associated comments.
> > > > 
> > > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_trans_resv.c | 11 +++++------
> > > >  1 file changed, 5 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > > index 6bd916bd35e2..4cd7cd1e60da 100644
> > > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > > @@ -490,10 +490,10 @@ xfs_calc_symlink_reservation(
> > > >  /*
> > > >   * In freeing an inode we can modify:
> > > >   *    the inode being freed: inode size
> > > > - *    the super block free inode counter: sector size
> > > > + *    the super block free inode counter, AGF and AGFL: sector size
> > > >   *    the agi hash list and counters: sector size
> > > > - *    the inode btree entry: block size
> > > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > > > + *    the inode chunk is marked stale (headers only)
> > > >   *    the inode btree: max depth * blocksize
> > > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > >   *    the finobt (record insertion, removal or modification)
> > > > @@ -504,12 +504,11 @@ xfs_calc_ifree_reservation(
> > > >  {
> > > >  	return XFS_DQUOT_LOGRES(mp) +
> > > >  		xfs_calc_inode_res(mp, 1) +
> > > 
> > >  *    the inode being freed: inode size
> > > 
> > > > -		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > > > -		xfs_calc_buf_res(1, XFS_FSB_TO_B(mp, 1)) +
> > > > +		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> > > 
> > >  *    the super block free inode counter, AGF and AGFL: sector size
> > > 
> > > 		[missing, hidden in calc_iunlink_remove]
> > > 
> > >  *    the agi hash list and counters: sector size
> > > 
> > > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > > 
> > >  *    the on disk inode before ours in the agi hash list: inode cluster size
> > > 
> > > 		[missing]
> > > 
> > >  *    on-disk inode to log the di_next_unlinked: inode cluster size
> > > 
> > > Yes, check the xfs_iunlink_remove() code - we can log two inode
> > > cluster buffers there: the one prior to us in the unlinked list,
> > > and ours to reset the di_next_unlinked pointer to
> > > null. IOWs, xfs_calc_iunlink_remove_reservation() needs to take into
> > > account /2/ inode cluster buffers, not one.
> > > 
> > 
> > I was wondering why this wouldn't be covered by the inode size included
> > above, but looking at the code, I guess we log the agi unlinked changes
> > directly in the cluster buffer. That means we have separate log items
> > with independent reservation consumption between the unlinked list
> > fixups and core inode.
> > 
> > I'll update xfs_calc_iunlink_remove_reservation() to include another
> > cluster and add a comment.
> 
> And, as I mentioned in the other thread,
> xfs_calc_iunlink_add_reservation() needs this fixup, too.
> 

Right, thanks..

> > 
> > > >  		xfs_calc_buf_res(1, 0) +
> > > 
> > > No idea what this is.
> > > 
> > 
> > Same, I didn't want to remove it unless we could identify what it was
> > originally intended for.
> 
> It's only going to be a 128 bytes for a header, so I think we should
> just remove it rather than keep some unknown magic in the
> reservation. If it turns out to be critical, then we need to
> understand exactly where it came from...
> 

Fair enough. We're still adding reservation in the end so I'll drop this
too.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-28 22:26       ` Dave Chinner
@ 2017-11-29 14:32         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-11-29 14:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 09:26:37AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 09:04:59AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
> > > >  STATIC uint
> > > > @@ -415,8 +414,8 @@ __xfs_calc_create_reservation(
> > > >   * For icreate we can allocate some inodes giving:
> > > >   *    the agi and agf of the ag getting the new inodes: 2 * sectorsize
> > > >   *    the superblock for the nlink flag: sector size
> > > > - *    the inode btree: max depth * blocksize
> > > >   *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > > + *    the inobt (record insertion)
> > > >   *    the finobt (record insertion)
> > > >   */
> > > >  STATIC uint
> > > > @@ -425,10 +424,10 @@ xfs_calc_icreate_resv_alloc(
> > > >  {
> > > >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > >  		mp->m_sb.sb_sectsize +
> > > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > >  				 XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_finobt_res(mp, 0, 0);
> > > > +		xfs_calc_inobt_res(mp) +
> > > > +		xfs_calc_finobt_res(mp);
> > > >  }
> > > 
> > > 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > 		mp->m_sb.sb_sectsize +
> > > 		xfs_calc_inode_chunk_res(mp, false) +
> > > 		xfs_calc_inobt_res(mp) +
> > > 		xfs_calc_finobt_res(mp);
> > > 
> > 
> > The icreate reservation doesn't currently include m_ialloc_blks at all.
> > The helper, as defined above, adds a reservation for associated headers.
> > Is that really necessary?  My understanding is that icreate doesn't log
> > the inode chunk.
> 
> Right, it uses ordered buffers to avoid needing to log them.
> 
> > I suppose we could get around that by tweaking the
> > parameter to take the size to reserve instead of a bool and pass a dummy
> > value (i.e., negative) to avoid logging the chunk at all. A little ugly,
> > but I could live with it.
> 
> I don't think that having an extra few hundred bytes of overhead in
> the reservation is going to be noticable by anyone. I'd just
> ignore the problem (as I did when suggesting this).
> 

I'm not as concerned with the overhead as much as I want to make sure
the intent of the changes is clearly documented. The refactoring
proposed in the other thread factors this out more cleanly than I
originally anticipated, so it's a moot point now.

> > > >  STATIC uint
> > > > @@ -494,9 +493,14 @@ xfs_calc_symlink_reservation(
> > > >   *    the agi hash list and counters: sector size
> > > >   *    the on disk inode before ours in the agi hash list: inode cluster size
> > > >   *    the inode chunk is marked stale (headers only)
> > > > - *    the inode btree: max depth * blocksize
> > > > - *    the allocation btrees: 2 trees * (max depth - 1) * block size
> > > > + *    the inode btree
> > > >   *    the finobt (record insertion, removal or modification)
> > > > + *
> > > > + * Note that the allocfree res. for the inode chunk itself is not included
> > > > + * because the extent free occurs after a transaction roll. We could take the
> > > > + * maximum of the pre/post roll operations, but the pre-roll reservation already
> > > > + * includes at least one allocfree res. for the inobt and is thus guaranteed to
> > > > + * be larger.
> > > >   */
> > > >  STATIC uint
> > > >  xfs_calc_ifree_reservation(
> > > > @@ -508,10 +512,8 @@ xfs_calc_ifree_reservation(
> > > >  		xfs_calc_iunlink_remove_reservation(mp) +
> > > >  		xfs_calc_buf_res(1, 0) +
> > > >  		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > > > -		xfs_calc_buf_res(mp->m_in_maxlevels, XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > > > -				 XFS_FSB_TO_B(mp, 1)) +
> > > > -		xfs_calc_finobt_res(mp, 0, 1);
> > > > +		xfs_calc_inobt_res(mp) +
> > > > +		xfs_calc_finobt_res(mp);
> > > >  }
> > > 
> > > 	.....
> > > 		xfs_calc_iunlink_remove_reservation(mp) +
> > > 		xfs_calc_buf_res(1, 0) +
> > > 		xfs_calc_inode_chunk_res(mp, false) +
> > > 		xfs_calc_inobt_res(mp) +
> > > 		xfs_calc_finobt_res(mp);
> > > 
> > 
> > This covers the inode chunk invalidation, but also adds the allocfree
> > res. for the chunk free where we technically don't need it (because the
> > free is deferred, re: the comment above).
> > 
> > I guess I'm fine with just adding that one, but I'd update the comment
> > above to point out that it's technically unecessary. Hm?
> 
> *nod*
> 
> Though with sparse inodes, we might be freeing multiple extents,
> right?  which means we probably need all the allocfree reservations
> we can get....
> 

Yep good point, I suppose that is possible.

Brian

> > > This seems to make more sense to me - it's clear what the individual
> > > components of the reservation are, and we can ensure that the
> > > individual components have the necessary reservation independently
> > > of the overall reservations that need them.
> > > 
> > 
> > I agree in principle. I think the underlying helpers (and pushing down
> > some of the associated documentation) help clearly declare the intent of
> > the reservations, particularly when we include multiple factors of a
> > single reservation and/or have situations where we don't technically
> > have a definition of worst case, but want to define something logically
> > reasonable (like the whole allocfree per inode tree thing).
> 
> Yup, that's pretty much what I was thinking.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications
  2017-11-28 22:34       ` Dave Chinner
@ 2017-11-29 14:32         ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-11-29 14:32 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 09:34:45AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 10:49:22AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 10:27:19AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:34PM -0500, Brian Foster wrote:
...
> > > Code looks fine. Comments below are for another follow-on patch.
> > > 
> > 
> > Actually, what do you think of the following variant (compile tested
> > only)? This facilites another cleanup patch I just wrote to kill off
> > about half of the (now effectively duplicate) xfs_calc_create_*()
> > helpers because the finobt and inode chunk res. helpers only include the
> > associated reservation based on the associated feature bits.
> 
> Yup, makes a lot of sense to do that.
> 
> FWIW, because we've got so many alloc vs free type reservations,
> would it make more sense to do something like:
> 
> #define _ALLOC	true
> #define _FREE	false
> 
> And use those in the callers rather than true/false?
> 
...
> > @@ -511,7 +542,7 @@ xfs_calc_ifree_reservation(
> >  		xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> >  		xfs_calc_iunlink_remove_reservation(mp) +
> >  		xfs_calc_buf_res(1, 0) +
> > -		xfs_calc_buf_res(mp->m_ialloc_blks, 0) +
> > +		xfs_calc_inode_chunk_res(mp, false) +
> 
> 		xfs_calc_inode_chunk_res(mp, _FREE) +
> 		.....
> 
> That would make it very clear what type of reservation we are
> acutally expecting to take out in the calculation. i.e. the code
> is now clearly self documenting :)
> 

Sure, that looks like a nice enhancement. I'll factor that in..

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-28 22:09       ` Dave Chinner
@ 2017-11-29 18:24         ` Brian Foster
  2017-11-29 20:36           ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-29 18:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote:
> On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote:
> > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote:
> > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
...
> > > 
> > > In /theory/, this /should/ work. However, as the comment you removed
> > > implies, there are likely to be issues with this as we get near
> > > ENOSPC. We know that if we don't trim the AGFL right down to the
> > > minimum requested as we approach ENOSPC we can get premature ENOSPC
> > > events being reported that lead to filesystem shutdowns.  (e.g. need
> > > the very last free block for a BMBT block to complete a data extent
> > > allocation).  Hence I'd suggest that this needs to be aware of the
> > > low space allocation algorithm (i.e. dfops->dop_low is true) to trim
> > > the agfl right back when we are really short of space
> > > 
> > 
> > Hmm, shouldn't the worst case bmbt requirements be satisfied by the
> > block reservation of the transaction that maps the extent (or stored as
> > indirect reservation if delalloc)?
> 
> That's checked against global free space when we do the initial
> reservation. It's not checked against AG free space until we make
> the allocation attempt.
> 
> The problem here is that if we don't leave enough space for the BMBT
> block in the same AG as the data extent was allocated we can end up
> with an ENOSPC on the BMBT block allocation because, say, AGF
> locking order and all the higher AGFs are already ENOSPC. That's a
> fatal ENOSPC error as the transaction is already dirty and will
> shutdown the filesystem on transaction cancel.
> 

I thought Christoph already addressed that problem in commit 255c516278
("xfs: fix bogus minleft manipulations").

If not, that sounds like something that would need to be fixed
independent from whether the associated AGFL happens to hold surplus
blocks. AFAICT that problem could occur whether blocks sit on the AGFL
as a surplus, legitimately sit on the AGFL or are simply allocated for
some other purpose.

> > I'm less concerned with premature
> > ENOSPC from contexts where it's not fatal..
> 
> The definition of "premature ENOSPC" is "unexpected ENOSPC that
> causes a filesystem shutdown" :/
> 

That's a bit pedantic. We've had plenty of discussions on the list using
"premature ENOSPC" to describe errors that aren't necessarily fatal for
the fs. Eofblocks trimming, sparse inodes and the broader work the
minleft fix referenced above was part of are just a few obvious
examples of context.

> >
> > IIRC, I think we've already
> > incorporated changes (Christoph's minfree stuff rings a bell) into the
> > allocator that explicitly prefer premature ENOSPC over potentially fatal
> > conditions, but I could be mistaken.
> 
> Yes, but that doesn't take into account all the crazy AGFL
> manipulations that could occur as a result of trimming/extending the
> AGFL. We /assume/ that we don't need signification reservations to
> extend/trim the AGFL to what is required for the transaction. If we
> now how to free 50 blocks from the AGFL back to the freespace tree
> before we start the allocation/free operation, that's going to
> have some impact on the reservations we have.
> 

How does this patch exacerbate that problem? The current code performs N
agfl frees per fixup* to drop below a dynamic watermark
(xfs_alloc_min_freelist()). The patch changes that behavior to perform 1
agfl free per fixup unless we're over a fixed watermark (1/2 agfl size).
ISTM that how far we can exceed that watermark in a given fixup (to be
rectified by the next) is on a similar order in either case (and if
anything, it seems like we factor out the case of recursively populating
the AGFL and shrinking the watermark at the same time).

* By fixup, I'm referring to one pass through xfs_alloc_fix_freelist().

> And think about 4k sector filesystems - there's close on 1000
> entries in the AGFL - this means we could have 500 blocks floating
> on the AGFL that would otherwise be in the freespace tree. IOWs,
> this change could have significant impact on freespace
> fragmentation because blocks aren't getting returned to the
> freespace tree where they merge with other small freespace extents
> to reform larger extents.
> 

Good point. I'm a little curious how we'd really end up with that many
blocks on the agfl. The max requirement for the bnobt and cntbt is 4
each for a 1TB AG. The rmapbt adds another optional dependency, but IIRC
that only adds another 4 or 5 more blocks. I suppose this might require
some kind of sequence where consecutive agfl fixups end up freeing a
block that results in one or more full tree joins, and thus we populate
the agfl much faster than we shrink it (since we'd remove 1 block at a
time) for some period of time.

With regard to the impact on free space fragmentation.. I suppose we
could mitigate that by setting the surplus limit to some multiple of the
max possible agfl requirement instead of 1/2 the physical size of the
agfl.

We could also consider doing things like fixing up AGFL surpluses in a
more safe contexts (i.e., in response to -ENOSPC on write where we also
trim eofblocks, free up indlen reservations, etc.) or allowing a surplus
block to be used in that BMBT block allocation failure scenario as a
last resort to fs shutdown (and if there isn't a surplus block, then
we'd be shutting down anyways). Note that I'm not currently convinced
either of these are necessary, I'm just thinking out loud about how to
deal with some of these potential hazards if they prove legitimate.

> It's these sorts of issues (i.e. filesystem aging) that might not
> show up for months of production use that we need to think about. It
> may be that a larger AGFL helps here because it keeps a working
> set of freespace tree blocks on the AGFL rather than having to
> go back to the freespace trees all the time, but this is something
> we need to measure, analyse and characterise before changing a
> critical piece of the allocation architecture....
> 

I agree. I'm going to drop this patch from v2 for now because I don't
want to hold up the other more straightforward transaction reservation
fixes while we work out how best to fix this problem.

> > We're also only talking about 256k
> > or so for half of the AGFL, less than that if we assume that some of
> > those blocks are actually required by the agfl and not slated to be
> > lazily freed. We could also think about explicitly fixing up agfl
> > surplus from other contexts (background, write -ENOSPC handling, etc.)
> > if it became that much of a problem.
> > 
> > I do think it's semi-reasonable from a conservative standpoint to
> > further restrict this behavior to !dop_low conditions, but I'm a little
> > concerned about creating an "open the floodgates" type situation when
> > the agfl is allowed to carry extra blocks and then all of a sudden the
> > free space state changes and one transaction is allowed to free a bunch
> > of blocks. Thoughts?
> 
> That's exactly the sort of problem we need to measure, analyse
> and characterise before making this sort of change :/
> 
> > I suppose we could incorporate logic that frees until/unless a join
> > occurs (i.e., the last free did not drop flcount), the latter being an
> > indication that we've probably logged as much as we should for agfl
> > fixups in the transaction. But that's also something we could just do
> > unconditionally as opposed to only under dop_low conditions. That might
> > be less aggressive of a change from current behavior.
> 
> Heuristics will still need to be analysed and tested :/
> 
> The complexity of doing this is why I left a small comment rather
> than actually making the change....
> 
> > > I'm also concerned that it doesn't take into account that freeing
> > > a block from the AGFL could cause a freespace tree split to occur,
> > > thereby emptying the AGFL whilst consuming the entire log
> > > reservation for tree modifications. This leaves nothing in the log
> > > reservation for re-growing the AGFL to the minimum required, which
> > > we /must do/ before returning and could cause more splits/joins to
> > > occur.
> > > 
> > 
> > How is this different from the current code?
> 
> It's not. I'm pointing out that while you're focussed on the
> problems with shortening the AGFL, the same problem exists with
> /extending the AGFL/.
> 
> > This sounds to me like an
> > unconditional side effect of the fact that freeing an agfl block can
> > indirectly affect the agfl via the btree operations. IOW, freeing a
> > single extra block could require consuming one or more and trigger the
> > need for an allocation. I suspect the allocation could then very well
> > cause a join on the other tree and put more than one block back onto the
> > agfl.
> 
> Yes, it could do that too. Remove a single block from an existing
> free extent, no change to the by-block btree. by-cnt now requires a
> record delete (full height join) followed by record insert elsewhere
> in the tree (full height split). So the attempt to add a block to
> the AGFL can actually shorten it if the by-cnt tree splits on
> insert. It can grow if the by-block or by-cnt tree joins on record
> removal.
> 
> Either way, we've got the same problem of using the entire log
> reservation for AGFL modification when growing the AGFL as we do
> when trying to shrink the AGFL down.
> 
> That's my point here - just hacking a limit into the shrink case
> doesn't address the problem at all - it just papers over one of the
> visible symptoms....
> 

Yes, it's clear that the current code allows for all sorts of
theoretical avenues to transaction overrun. Hence my previous idea to
roll the transaction once an AGFL fixup triggers a join or split. Even
that may not be sufficient in certain scenarios.

Moving on from that, this patch is a variant of your suggestion to allow
leaving surplus blocks on the agfl up to a certain limit. It is
intentionally a more isolated fix for the specific issue of performing
too many independent allocations (frees) per transaction in this
context.

One approach doesn't have to preclude the other. I'm not aware of any
pattern of overrun problems with this code over the however many years
it has been in place, other than this one, however. Given that, I think
it's perfectly reasonable to consider a shorter term solution so long as
we're confident it doesn't introduce other problems.

> > > IMO, there's a lot more to be concerned about here than just trying
> > > to work around the specific symptom observed in the given test case.
> > > This code is, unfortunately, really tricky and intricate and history
> > > tells us that the risk of unexpected regressions is extremely high,
> > > especially around ENOSPC related issues. Of all the patches in this
> > > series, this is the most dangerous and "iffy" of them and the
> > > one we should be most concerned and conservative about....
> > > 
> > 
> > Agreed. The impact is something that also has to be evaluated over a
> > sequence of transactions along with the more obvious impact on a single
> > transaction.
> 
> The impact has to be measured over a far longer time frame than a
> few transactions. The fact it has impact on freespace reformation
> means it'll need accelerated aging tests done on it so we can be
> reasonably certain that it isn't going to bite us in extended
> production environments...
> 

That's exactly what I mean by a sequence. E.g., the effect on the agfl
over time. Clearly, the longer the sequence, the more robust the
results. I'm not sure where the idea that a few transactions would
provide anything useful comes from. This probably needs to be evaluated
over many cycles of fully depopulating and repopulating the space btrees
in different ways.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-29 18:24         ` Brian Foster
@ 2017-11-29 20:36           ` Brian Foster
  2017-12-05 20:53             ` Brian Foster
  0 siblings, 1 reply; 23+ messages in thread
From: Brian Foster @ 2017-11-29 20:36 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 01:24:53PM -0500, Brian Foster wrote:
> On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote:
> > On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote:
> > > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote:
> > > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
...
> > > This sounds to me like an
> > > unconditional side effect of the fact that freeing an agfl block can
> > > indirectly affect the agfl via the btree operations. IOW, freeing a
> > > single extra block could require consuming one or more and trigger the
> > > need for an allocation. I suspect the allocation could then very well
> > > cause a join on the other tree and put more than one block back onto the
> > > agfl.
> > 
> > Yes, it could do that too. Remove a single block from an existing
> > free extent, no change to the by-block btree. by-cnt now requires a
> > record delete (full height join) followed by record insert elsewhere
> > in the tree (full height split). So the attempt to add a block to
> > the AGFL can actually shorten it if the by-cnt tree splits on
> > insert. It can grow if the by-block or by-cnt tree joins on record
> > removal.
> > 
> > Either way, we've got the same problem of using the entire log
> > reservation for AGFL modification when growing the AGFL as we do
> > when trying to shrink the AGFL down.
> > 
> > That's my point here - just hacking a limit into the shrink case
> > doesn't address the problem at all - it just papers over one of the
> > visible symptoms....
> > 
> 
> Yes, it's clear that the current code allows for all sorts of
> theoretical avenues to transaction overrun. Hence my previous idea to
> roll the transaction once an AGFL fixup triggers a join or split. Even
> that may not be sufficient in certain scenarios.
> 

Random thought from this afternoon...

What do you think about unconditionally removing surplus agfl blocks as
we do today, but defer them rather than free them immediately? We'd free
up the agfl slots as needed so the allocation can proceed, but we'd
eliminate the behavior where agfl frees recursively affect the agfl. The
broader idea is that in the event where 2+ agfl frees are required,
they'd now execute in a context where can enforce deterministic log
consumption per tx (by also implementing the 2 frees per EFD idea, for
example) until the agfl is rectified.

I'd have to think a little more about whether the idea is sane.. It
looks like we'd have to plumb dfops in through xfs_alloc_arg for
starters. We could do the defer conditionally based on whether the
caller passes dfops to facilitate incremental conversions. We also may
be able to consider optimizations like putting deferred agfl blocks
right back onto the agfl if there's a deficit by the time we get around
to freeing them (rather than doing an agfl alloc only to free up
deferred agfl blocks), but that's probably getting too far ahead for
now.

This also doesn't help with extending the agfl, but I think that path is
already more deterministic since we attempt to fill the deficit in as
few allocs as possible. Thoughts?

Brian

> Moving on from that, this patch is a variant of your suggestion to allow
> leaving surplus blocks on the agfl up to a certain limit. It is
> intentionally a more isolated fix for the specific issue of performing
> too many independent allocations (frees) per transaction in this
> context.
> 
> One approach doesn't have to preclude the other. I'm not aware of any
> pattern of overrun problems with this code over the however many years
> it has been in place, other than this one, however. Given that, I think
> it's perfectly reasonable to consider a shorter term solution so long as
> we're confident it doesn't introduce other problems.
> 
> > > > IMO, there's a lot more to be concerned about here than just trying
> > > > to work around the specific symptom observed in the given test case.
> > > > This code is, unfortunately, really tricky and intricate and history
> > > > tells us that the risk of unexpected regressions is extremely high,
> > > > especially around ENOSPC related issues. Of all the patches in this
> > > > series, this is the most dangerous and "iffy" of them and the
> > > > one we should be most concerned and conservative about....
> > > > 
> > > 
> > > Agreed. The impact is something that also has to be evaluated over a
> > > sequence of transactions along with the more obvious impact on a single
> > > transaction.
> > 
> > The impact has to be measured over a far longer time frame than a
> > few transactions. The fact it has impact on freespace reformation
> > means it'll need accelerated aging tests done on it so we can be
> > reasonably certain that it isn't going to bite us in extended
> > production environments...
> > 
> 
> That's exactly what I mean by a sequence. E.g., the effect on the agfl
> over time. Clearly, the longer the sequence, the more robust the
> results. I'm not sure where the idea that a few transactions would
> provide anything useful comes from. This probably needs to be evaluated
> over many cycles of fully depopulating and repopulating the space btrees
> in different ways.
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions
  2017-11-29 20:36           ` Brian Foster
@ 2017-12-05 20:53             ` Brian Foster
  0 siblings, 0 replies; 23+ messages in thread
From: Brian Foster @ 2017-12-05 20:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Nov 29, 2017 at 03:36:06PM -0500, Brian Foster wrote:
> On Wed, Nov 29, 2017 at 01:24:53PM -0500, Brian Foster wrote:
> > On Wed, Nov 29, 2017 at 09:09:19AM +1100, Dave Chinner wrote:
> > > On Tue, Nov 28, 2017 at 08:57:48AM -0500, Brian Foster wrote:
> > > > On Tue, Nov 28, 2017 at 10:07:34AM +1100, Dave Chinner wrote:
> > > > > On Mon, Nov 27, 2017 at 03:24:33PM -0500, Brian Foster wrote:
> ...
> > > > This sounds to me like an
> > > > unconditional side effect of the fact that freeing an agfl block can
> > > > indirectly affect the agfl via the btree operations. IOW, freeing a
> > > > single extra block could require consuming one or more and trigger the
> > > > need for an allocation. I suspect the allocation could then very well
> > > > cause a join on the other tree and put more than one block back onto the
> > > > agfl.
> > > 
> > > Yes, it could do that too. Remove a single block from an existing
> > > free extent, no change to the by-block btree. by-cnt now requires a
> > > record delete (full height join) followed by record insert elsewhere
> > > in the tree (full height split). So the attempt to add a block to
> > > the AGFL can actually shorten it if the by-cnt tree splits on
> > > insert. It can grow if the by-block or by-cnt tree joins on record
> > > removal.
> > > 
> > > Either way, we've got the same problem of using the entire log
> > > reservation for AGFL modification when growing the AGFL as we do
> > > when trying to shrink the AGFL down.
> > > 
> > > That's my point here - just hacking a limit into the shrink case
> > > doesn't address the problem at all - it just papers over one of the
> > > visible symptoms....
> > > 
> > 
> > Yes, it's clear that the current code allows for all sorts of
> > theoretical avenues to transaction overrun. Hence my previous idea to
> > roll the transaction once an AGFL fixup triggers a join or split. Even
> > that may not be sufficient in certain scenarios.
> > 
> 
> Random thought from this afternoon...
> 
> What do you think about unconditionally removing surplus agfl blocks as
> we do today, but defer them rather than free them immediately? We'd free
> up the agfl slots as needed so the allocation can proceed, but we'd
> eliminate the behavior where agfl frees recursively affect the agfl. The
> broader idea is that in the event where 2+ agfl frees are required,
> they'd now execute in a context where can enforce deterministic log
> consumption per tx (by also implementing the 2 frees per EFD idea, for
> example) until the agfl is rectified.
> 
> I'd have to think a little more about whether the idea is sane.. It
> looks like we'd have to plumb dfops in through xfs_alloc_arg for
> starters. We could do the defer conditionally based on whether the
> caller passes dfops to facilitate incremental conversions. We also may
> be able to consider optimizations like putting deferred agfl blocks
> right back onto the agfl if there's a deficit by the time we get around
> to freeing them (rather than doing an agfl alloc only to free up
> deferred agfl blocks), but that's probably getting too far ahead for
> now.
> 
> This also doesn't help with extending the agfl, but I think that path is
> already more deterministic since we attempt to fill the deficit in as
> few allocs as possible. Thoughts?
> 

I've hacked on the above a bit.. enough to at least determine that
deferring AGFL frees does avoid the reservation overrun (without any of
the recent transaction reservation fixes). To avoid the overrun, I had
to defer AGFL frees from the inobt (xfs_inobt_free_block()) path and
from the deferred ops xfs_trans_extent_free() path. To this point, I
don't see any fundamental reason why we couldn't defer AGFL frees from
any other path as well so long as we have a dfops structure available in
the associated context.

We'd have to deal with the following caveats due to how AGFL blocks
differ from typical blocks freed via xfs_bmap_add_free():

1.) AGFL blocks are accounted against different allocation counters.
2.) AGFL blocks are not marked busy on free (in fact, they are possibly
already on the busy list by the time they are freed).

I've currently hacked around these problems using the OWN_AG owner info
to handle AGFL blocks appropriately down in xfs_trans_free_extent().
What I think may provide a cleaner implementation is to define a new
deferred ops type specific to AGFL frees. This would be a subset of
XFS_DEFER_OPS_TYPE_FREE and share nearly all of the implementation
outside of ->finish_item(). The latter would be replaced with a callout
that handles AGFL blocks as noted. Thoughts?

With regard to AGFL allocation, I've not seen any overrun issues
associated with that path. A thought that comes to mind for dealing with
a problem in that area is to further genericize the above to do some
kind of post alloc. fixup when we know an agfl block was consumed. For
example, tag the perag as deficient for another context (thread/wq) to
rectify or defer a more generic AGFL fixup item from agfl block
consumption. We'd still have to implement some kind of serialization or
do the alloc directly in the worst case, but the idea is that hopefully
that would be rare and in most cases all AGFL fixups would now occur
with a full tx reservation available for the fixup (agfl fixup-ahead, if
you will). That would probably require some experimentation to
determine how effective it would be without resorting to nasty locking
tricks and whatnot.

Note that I don't think anything as involved as the latter is currently
necessary. As such, it's just unsubstantiated handwaving around how we
could potentially evolve from deferred agfl frees to a more generic
solution that covered all AGFL fixups.

Brian

> Brian
> 
> > Moving on from that, this patch is a variant of your suggestion to allow
> > leaving surplus blocks on the agfl up to a certain limit. It is
> > intentionally a more isolated fix for the specific issue of performing
> > too many independent allocations (frees) per transaction in this
> > context.
> > 
> > One approach doesn't have to preclude the other. I'm not aware of any
> > pattern of overrun problems with this code over the however many years
> > it has been in place, other than this one, however. Given that, I think
> > it's perfectly reasonable to consider a shorter term solution so long as
> > we're confident it doesn't introduce other problems.
> > 
> > > > > IMO, there's a lot more to be concerned about here than just trying
> > > > > to work around the specific symptom observed in the given test case.
> > > > > This code is, unfortunately, really tricky and intricate and history
> > > > > tells us that the risk of unexpected regressions is extremely high,
> > > > > especially around ENOSPC related issues. Of all the patches in this
> > > > > series, this is the most dangerous and "iffy" of them and the
> > > > > one we should be most concerned and conservative about....
> > > > > 
> > > > 
> > > > Agreed. The impact is something that also has to be evaluated over a
> > > > sequence of transactions along with the more obvious impact on a single
> > > > transaction.
> > > 
> > > The impact has to be measured over a far longer time frame than a
> > > few transactions. The fact it has impact on freespace reformation
> > > means it'll need accelerated aging tests done on it so we can be
> > > reasonably certain that it isn't going to bite us in extended
> > > production environments...
> > > 
> > 
> > That's exactly what I mean by a sequence. E.g., the effect on the agfl
> > over time. Clearly, the longer the sequence, the more robust the
> > results. I'm not sure where the idea that a few transactions would
> > provide anything useful comes from. This probably needs to be evaluated
> > over many cycles of fully depopulating and repopulating the space btrees
> > in different ways.
> > 
> > Brian
> > 
> > > Cheers,
> > > 
> > > Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-xfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-12-05 20:53 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-27 20:24 [PATCH 0/4] xfs: inode transaction reservation fixups Brian Foster
2017-11-27 20:24 ` [PATCH 1/4] xfs: print transaction log reservation on overrun Brian Foster
2017-11-27 22:14   ` Dave Chinner
2017-11-27 20:24 ` [PATCH 2/4] xfs: include inobt buffers in ifree tx log reservation Brian Foster
2017-11-27 22:28   ` Dave Chinner
2017-11-28 13:30     ` Brian Foster
2017-11-28 21:38       ` Dave Chinner
2017-11-29 14:31         ` Brian Foster
2017-11-27 20:24 ` [PATCH 3/4] xfs: amortize agfl block frees across multiple transactions Brian Foster
2017-11-27 23:07   ` Dave Chinner
2017-11-28 13:57     ` Brian Foster
2017-11-28 22:09       ` Dave Chinner
2017-11-29 18:24         ` Brian Foster
2017-11-29 20:36           ` Brian Foster
2017-12-05 20:53             ` Brian Foster
2017-11-27 20:24 ` [PATCH RFC 4/4] xfs: include an allocfree res for inobt modifications Brian Foster
2017-11-27 23:27   ` Dave Chinner
2017-11-28 14:04     ` Brian Foster
2017-11-28 22:26       ` Dave Chinner
2017-11-29 14:32         ` Brian Foster
2017-11-28 15:49     ` Brian Foster
2017-11-28 22:34       ` Dave Chinner
2017-11-29 14:32         ` Brian Foster

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.