All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] [RFC] xfs: intent recovery reservation changes
@ 2020-09-09  8:19 Dave Chinner
  2020-09-09  8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Dave Chinner @ 2020-09-09  8:19 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

A recent bug report of a log recovery hang led to an analysis of a
metadump image that indicates that an intent can pin the tail of the
log and leave less space in the log than is require to begin
recovery of that intent. In this case, it was an EFI that pinned the
log tail, and the transaction reservation required to replay the EFI
was larger than the remaining free space in the log.

Long story short: the EFI recovery log space reservation is larger
than the log space reservation required to complete the operation at
runtime.

This appears to be a general problem with intents - the reservation
to replay the intent is the reservation of the entire permanent
transaction reservation that logged the intent, not the amount of
log space that is required to _execute_ the intent and record the
intent done item.

This patch series does not attempt to solve this generic problem,
just address the specific EFI reservation for a single EFI that pins
the tail of the log. If deferops and rolling transactions are
required, then recovery of multiple EFIs that all pin the tail of
the log at the same LSN becomes very complex, very fast. I have no clear idea
how to deal with that problem yet, but for the single case where the
extent free (and subsequent defer ops for refcnt and rmap mods) all
complete within the initial EFI reservation, the solution is
relatively straight forward.

That is, EFI recovery is an itruncate reservation minus the unit
reservation for the initial transaction that logs the EFI. At
runtime, this log space will be reserved by both the reserve and
write grant heads, hence we are guaranteed to have at least that
amount of space free in the log when we start recoverying the EFI.

For a single EFI pinning the tail of the log, once we free the
extent and log the EFD, the log tail is unpinned and the rest of
defered ops can be run without issue.

If there are multiple log tail pinning intents, then we cannot allow
the defered ops in an intent chain to roll once they run out of
initial recovery grant space. If they have to reserve more log space
rather than just regrant from the initial reservation, then they are
stealing log space from other intents that may pin the tail of the
log and need that space to guarantee that they can be replayed to
unpin the tail of the log. i.e. it gets complex real fast.

Hence this patchset does not attempt to solve the generic problem,
just the simple one-off EFI case. THe first patch is a back-portable
"minimal fix" for the problem, the second patch creates a formal
reservation construct for freeing a single extent to the AG free
list and converts the EFI recovery reservation to use it, and the
third patch factors all the existing reservation calculations to use
the free extent reservation instead of open coding it everywhere.

This last patch fixes some inconsistencies, too, in that some
extent alloc/free reservations fail to take into account AGFL
modifications. I also noted some concerns about the number of
extent alloc/free operations that certain transactions reserve space
for - if they are too large, then we should be able to reduce the
size of some of the large transactions significantly.

But doing any of that requires much, much more investigation than I
can do in a couple of hours, so if anyone is looking for something
they can sink a bit of time into, this would be a good thing to look
at.....

Anyway, this has run through fstests with reflink+rmap enabled
without regressions, so the patchset is not an obvious disaster.
Comments, thoughts, ideas all welcome...

Cheers,

Dave.


Dave Chinner (3):
  xfs: EFI recovery needs it's own transaction reservation
  xfs: add a free space extent change reservation
  xfs: factor free space tree transaciton reservations

 fs/xfs/libxfs/xfs_trans_resv.c | 143 ++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_trans_resv.h |   2 +
 fs/xfs/xfs_extfree_item.c      |   2 +-
 3 files changed, 74 insertions(+), 73 deletions(-)

-- 
2.28.0


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-09  8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner
@ 2020-09-09  8:19 ` Dave Chinner
  2020-09-09 13:31   ` Brian Foster
  2020-09-09  8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner
  2020-09-09  8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-09-09  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Recovering an EFI currently uses a itruncate reservation, which is
designed for a rolling transaction that modifies the BMBT and
logs the EFI in one commit, then frees the space and logs the EFD in
the second commit.

Recovering the EFI only requires the second transaction in this
pair, and hence has a smaller log space requirement than a truncate
operation. Hence when the extent free is being processed at runtime,
the log reservation that is held by the filesystem is only enough to
complete the extent free, not the entire truncate operation.

Hence if the EFI pins the tail of the log and the log fills up while
the extent is being freed, the amount of reserved free space in the
log is not enough to start another entire truncate operation. Hence
if we crash at this point, log recovery will deadlock with the EFI
pinning the tail of the log and the log not having enough free space
to reserve an itruncate transaction.

As such, EFI recovery needs it's own log space reservation separate
to the itruncate reservation. We only need what is required free the
extent, and this matches the space we have reserved at runtime for
this operation and hence should prevent the recovery deadlock from
occurring.

This patch adds the new reservation in a way that minimises the
change such that it should be back-portable to older kernels easily.
Follow up patches will factor and rework the reservations to be more
correct and more tightly defined.

Note: this would appear to be a generic problem with intent
recovery; we use the entire operation reservation for recovery,
not the reservation that was held at runtime after the intent was
logged. I suspect all intents are going to require their own
reservation as a result.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |  2 ++
 fs/xfs/xfs_extfree_item.c      |  2 +-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d1a0848cb52e..da2ec052ac0a 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -916,6 +916,16 @@ xfs_trans_resv_calc(
 		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
+	/*
+	 * Log recovery reservations for intent replay
+	 *
+	 * EFI recovery is itruncate minus the initial transaction that logs
+	 * logs the EFI.
+	 */
+	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
+	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
+	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+
 	/*
 	 * The following transactions are logged in logical format with
 	 * a default log count.
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7241ab28cf84..13173b3eaac9 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -50,6 +50,8 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
 	struct xfs_trans_res	tr_sb;		/* modify superblock */
 	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
+	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
+
 };
 
 /* shorthand way of accessing reservation structure */
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index 6cb8cd11072a..1ea9ab4cd44e 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -618,7 +618,7 @@ xfs_efi_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp);
 	if (error)
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/3] xfs: add a free space extent change reservation
  2020-09-09  8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner
  2020-09-09  8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
@ 2020-09-09  8:19 ` Dave Chinner
  2020-09-09 13:35   ` Brian Foster
  2020-09-09  8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner
  2 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-09-09  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Lots of the transaction reservation code reserves space for an
extent allocation. It is inconsistently implemented, and many of
them get it wrong. Introduce a new function to calculate the log
space reservation for adding or removing an extent from the free
space btrees.

This function reserves space for logging the AGF, the AGFL and the
free space btrees, avoiding the need to account for them seperately
in every reservation that manipulates free space.

Convert the EFI recovery reservation to use this transaction
reservation as EFI recovery only needs to manipulate the free space
index.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index da2ec052ac0a..621ddb277dfa 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -79,6 +79,23 @@ xfs_allocfree_log_count(
 	return blocks;
 }
 
+/*
+ * Log reservation required to add or remove a single extent to the free space
+ * btrees.  This requires modifying:
+ *
+ * the agf header: 1 sector
+ * the agfl header: 1 sector
+ * the allocation btrees: 2 trees * (max depth - 1) * block size
+ */
+uint
+xfs_allocfree_extent_res(
+	struct xfs_mount *mp)
+{
+	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
+				XFS_FSB_TO_B(mp, 1));
+}
+
 /*
  * Logging inodes is really tricksy. They are logged in memory format,
  * which means that what we write into the log doesn't directly translate into
@@ -922,7 +939,7 @@ xfs_trans_resv_calc(
 	 * EFI recovery is itruncate minus the initial transaction that logs
 	 * logs the EFI.
 	 */
-	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
+	resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp);
 	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
 	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 3/3] xfs: factor free space tree transaciton reservations
  2020-09-09  8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner
  2020-09-09  8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
  2020-09-09  8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner
@ 2020-09-09  8:19 ` Dave Chinner
  2 siblings, 0 replies; 12+ messages in thread
From: Dave Chinner @ 2020-09-09  8:19 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

Convert all the open coded free space tree modification reservations
to use the new xfs_allocfree_extent_res() function.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_trans_resv.c | 116 +++++++++++++--------------------
 1 file changed, 44 insertions(+), 72 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 621ddb277dfa..cb3cddb84d75 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -143,18 +143,16 @@ xfs_calc_inode_res(
  * reservation:
  *
  * the inode btree: max depth * blocksize
- * the allocation btrees: 2 trees * (max depth - 1) * block size
+ * one extent for the AG free space trees
  *
- * The caller must account for SB and AG header modifications, etc.
  */
 STATIC uint
 xfs_calc_inobt_res(
 	struct xfs_mount	*mp)
 {
 	return xfs_calc_buf_res(M_IGEO(mp)->inobt_maxlevels,
-			XFS_FSB_TO_B(mp, 1)) +
-				xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-			XFS_FSB_TO_B(mp, 1));
+				XFS_FSB_TO_B(mp, 1)) +
+		xfs_allocfree_extent_res(mp);
 }
 
 /*
@@ -200,8 +198,7 @@ xfs_calc_inode_chunk_res(
 {
 	uint			res, size = 0;
 
-	res = xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-			       XFS_FSB_TO_B(mp, 1));
+	res = xfs_allocfree_extent_res(mp);
 	if (alloc) {
 		/* icreate tx uses ordered buffers */
 		if (xfs_sb_version_has_v3inode(&mp->m_sb))
@@ -256,22 +253,19 @@ xfs_rtalloc_log_count(
  * extents.  This gives (t1):
  *    the inode getting the new extents: inode size
  *    the inode's bmap btree: max depth * block size
- *    the agfs of the ags from which the extents are allocated: 2 * sector
  *    the superblock free block counter: sector size
- *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
+ *    two extents for the AG free space trees
  * Or, if we're writing to a realtime file (t2):
  *    the inode getting the new extents: inode size
  *    the inode's bmap btree: max depth * block size
- *    the agfs of the ags from which the extents are allocated: 2 * sector
+ *    one extent for the AG free space trees
  *    the superblock free block counter: sector size
  *    the realtime bitmap: ((MAXEXTLEN / rtextsize) / NBBY) bytes
  *    the realtime summary: 1 block
- *    the allocation btrees: 2 trees * (2 * max depth - 1) * block size
  * And the bmap_finish transaction can free bmap blocks in a join (t3):
- *    the agfs of the ags containing the blocks: 2 * sector size
  *    the agfls of the ags containing the blocks: 2 * sector size
  *    the super block free block counter: sector size
- *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
+ *    two extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_write_reservation(
@@ -282,8 +276,8 @@ xfs_calc_write_reservation(
 
 	t1 = xfs_calc_inode_res(mp, 1) +
 	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) +
-	     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
+	     xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+	     xfs_allocfree_extent_res(mp) * 2;
 
 	if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
 		t2 = xfs_calc_inode_res(mp, 1) +
@@ -291,13 +285,13 @@ xfs_calc_write_reservation(
 				     blksz) +
 		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz);
+		     xfs_allocfree_extent_res(mp);
 	} else {
 		t2 = 0;
 	}
 
-	t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
+	t3 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+	     xfs_allocfree_extent_res(mp) * 2;
 
 	return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
 }
@@ -307,19 +301,13 @@ xfs_calc_write_reservation(
  *    the inode being truncated: inode size
  *    the inode's bmap btree: (max depth + 1) * block size
  * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
- *    the agf for each of the ags: 4 * sector size
- *    the agfl for each of the ags: 4 * sector size
  *    the super block to reflect the freed blocks: sector size
- *    worst case split in allocation btrees per extent assuming 4 extents:
- *		4 exts * 2 trees * (2 * max depth - 1) * block size
+ *    four (XXX: one?) extents for the AG free space trees
  * Or, if it's a realtime file (t3):
- *    the agf for each of the ags: 2 * sector size
- *    the agfl for each of the ags: 2 * sector size
  *    the super block to reflect the freed blocks: sector size
  *    the realtime bitmap: 2 exts * ((MAXEXTLEN / rtextsize) / NBBY) bytes
  *    the realtime summary: 2 exts * 1 block
- *    worst case split in allocation btrees per extent assuming 2 extents:
- *		2 exts * 2 trees * (2 * max depth - 1) * block size
+ *    two extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_itruncate_reservation(
@@ -331,13 +319,13 @@ xfs_calc_itruncate_reservation(
 	t1 = xfs_calc_inode_res(mp, 1) +
 	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
 
-	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
+	t2 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+	     xfs_allocfree_extent_res(mp) * 4;
 
 	if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
-		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
+		t3 = xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
+		     xfs_allocfree_extent_res(mp) * 2;
 	} else {
 		t3 = 0;
 	}
@@ -352,10 +340,8 @@ xfs_calc_itruncate_reservation(
  *    the two directory bmap btrees: 2 * max depth * block size
  * And the bmap_finish transaction can free dir and bmap blocks (two sets
  *	of bmap blocks) giving:
- *    the agf for the ags in which the blocks live: 3 * sector size
- *    the agfl for the ags in which the blocks live: 3 * sector size
  *    the superblock for the free block count: sector size
- *    the allocation btrees: 3 exts * 2 trees * (2 * max depth - 1) * block size
+ *    three extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_rename_reservation(
@@ -365,9 +351,8 @@ xfs_calc_rename_reservation(
 		max((xfs_calc_inode_res(mp, 4) +
 		     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
-		    (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
-				      XFS_FSB_TO_B(mp, 1))));
+		    (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		     xfs_allocfree_extent_res(mp) * 3));
 }
 
 /*
@@ -381,20 +366,19 @@ xfs_calc_iunlink_remove_reservation(
 	struct xfs_mount        *mp)
 {
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-	       2 * M_IGEO(mp)->inode_cluster_size;
+	       xfs_calc_buf_res(2, M_IGEO(mp)->inode_cluster_size);
 }
 
 /*
  * For creating a link to an inode:
+ *    the inode is removed from the iunlink list (O_TMPFILE)
  *    the parent directory inode: inode size
  *    the linked inode: inode size
  *    the directory btree could split: (max depth + v2) * dir block size
  *    the directory bmap btree could join or split: (max depth + v2) * blocksize
  * And the bmap_finish transaction can free some bmap blocks giving:
- *    the agf for the ag in which the blocks live: sector size
- *    the agfl for the ag in which the blocks live: sector size
  *    the superblock for the free block count: sector size
- *    the allocation btrees: 2 trees * (2 * max depth - 1) * block size
+ *    one extent for the AG free space trees
  */
 STATIC uint
 xfs_calc_link_reservation(
@@ -405,9 +389,8 @@ xfs_calc_link_reservation(
 		max((xfs_calc_inode_res(mp, 2) +
 		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
-		    (xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				      XFS_FSB_TO_B(mp, 1))));
+		    (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		     xfs_allocfree_extent_res(mp)));
 }
 
 /*
@@ -419,20 +402,19 @@ STATIC uint
 xfs_calc_iunlink_add_reservation(xfs_mount_t *mp)
 {
 	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-			M_IGEO(mp)->inode_cluster_size;
+	       xfs_calc_buf_res(1, M_IGEO(mp)->inode_cluster_size);
 }
 
 /*
  * For removing a directory entry we can modify:
+ *    the inode is added to the agi unlinked list
  *    the parent directory inode: inode size
  *    the removed inode: inode size
  *    the directory btree could join: (max depth + v2) * dir block size
  *    the directory bmap btree could join or split: (max depth + v2) * blocksize
  * And the bmap_finish transaction can free the dir and bmap blocks giving:
- *    the agf for the ag in which the blocks live: 2 * sector size
- *    the agfl for the ag in which the blocks live: 2 * sector size
  *    the superblock for the free block count: sector size
- *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
+ *    two extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_remove_reservation(
@@ -443,9 +425,8 @@ xfs_calc_remove_reservation(
 		max((xfs_calc_inode_res(mp, 1) +
 		     xfs_calc_buf_res(XFS_DIROP_LOG_COUNT(mp),
 				      XFS_FSB_TO_B(mp, 1))),
-		    (xfs_calc_buf_res(4, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
-				      XFS_FSB_TO_B(mp, 1))));
+		    (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		     xfs_allocfree_extent_res(mp) * 2));
 }
 
 /*
@@ -582,15 +563,14 @@ xfs_calc_ichange_reservation(
  * Growing the data section of the filesystem.
  *	superblock
  *	agi and agf
- *	allocation btrees
+ *	growing the last AG requires freeing one extent
  */
 STATIC uint
 xfs_calc_growdata_reservation(
 	struct xfs_mount	*mp)
 {
-	return xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+	       xfs_allocfree_extent_res(mp);
 }
 
 /*
@@ -598,10 +578,9 @@ xfs_calc_growdata_reservation(
  * In the first set of transactions (ALLOC) we allocate space to the
  * bitmap or summary files.
  *	superblock: sector size
- *	agf of the ag from which the extent is allocated: sector size
  *	bmap btree for bitmap/summary inode: max depth * blocksize
  *	bitmap/summary inode: inode size
- *	allocation btrees for 1 block alloc: 2 * (2 * maxdepth - 1) * blocksize
+ *	one extent for the AG free space trees
  */
 STATIC uint
 xfs_calc_growrtalloc_reservation(
@@ -611,8 +590,7 @@ xfs_calc_growrtalloc_reservation(
 		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_inode_res(mp, 1) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+		xfs_allocfree_extent_res(mp);
 }
 
 /*
@@ -675,7 +653,7 @@ xfs_calc_writeid_reservation(
  *	agf block and superblock (for block allocation)
  *	the new block (directory sized)
  *	bmap blocks for the new directory block
- *	allocation btrees
+ *	one extent for the AG free space trees
  */
 STATIC uint
 xfs_calc_addafork_reservation(
@@ -687,8 +665,7 @@ xfs_calc_addafork_reservation(
 		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
 		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
 				 XFS_FSB_TO_B(mp, 1)) +
-		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
-				 XFS_FSB_TO_B(mp, 1));
+		xfs_allocfree_extent_res(mp);
 }
 
 /*
@@ -696,11 +673,8 @@ xfs_calc_addafork_reservation(
  *    the inode being truncated: inode size
  *    the inode's bmap btree: max depth * block size
  * And the bmap_finish transaction can free the blocks and bmap blocks:
- *    the agf for each of the ags: 4 * sector size
- *    the agfl for each of the ags: 4 * sector size
  *    the super block to reflect the freed blocks: sector size
- *    worst case split in allocation btrees per extent assuming 4 extents:
- *		4 exts * 2 trees * (2 * max depth - 1) * block size
+ *    four (XXX: really? should be one?) extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_attrinval_reservation(
@@ -709,9 +683,8 @@ xfs_calc_attrinval_reservation(
 	return max((xfs_calc_inode_res(mp, 1) +
 		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
 				     XFS_FSB_TO_B(mp, 1))),
-		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
-		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
-				     XFS_FSB_TO_B(mp, 1))));
+		   (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		    xfs_allocfree_extent_res(mp) * 4));
 }
 
 /*
@@ -763,7 +736,7 @@ xfs_calc_attrsetrt_reservation(
  *    the agf for the ag in which the blocks live: 2 * sector size
  *    the agfl for the ag in which the blocks live: 2 * sector size
  *    the superblock for the free block count: sector size
- *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
+ *    two extents for the AG free space trees
  */
 STATIC uint
 xfs_calc_attrrm_reservation(
@@ -776,9 +749,8 @@ xfs_calc_attrrm_reservation(
 		     (uint)XFS_FSB_TO_B(mp,
 					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
 		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
-		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
-		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
-				      XFS_FSB_TO_B(mp, 1))));
+		    (xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		     xfs_allocfree_extent_res(mp) * 2));
 }
 
 /*
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-09  8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
@ 2020-09-09 13:31   ` Brian Foster
  2020-09-09 21:44     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2020-09-09 13:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Recovering an EFI currently uses a itruncate reservation, which is
> designed for a rolling transaction that modifies the BMBT and
> logs the EFI in one commit, then frees the space and logs the EFD in
> the second commit.
> 
> Recovering the EFI only requires the second transaction in this
> pair, and hence has a smaller log space requirement than a truncate
> operation. Hence when the extent free is being processed at runtime,
> the log reservation that is held by the filesystem is only enough to
> complete the extent free, not the entire truncate operation.
> 
> Hence if the EFI pins the tail of the log and the log fills up while
> the extent is being freed, the amount of reserved free space in the
> log is not enough to start another entire truncate operation. Hence
> if we crash at this point, log recovery will deadlock with the EFI
> pinning the tail of the log and the log not having enough free space
> to reserve an itruncate transaction.
> 
> As such, EFI recovery needs it's own log space reservation separate
> to the itruncate reservation. We only need what is required free the
> extent, and this matches the space we have reserved at runtime for
> this operation and hence should prevent the recovery deadlock from
> occurring.
> 
> This patch adds the new reservation in a way that minimises the
> change such that it should be back-portable to older kernels easily.
> Follow up patches will factor and rework the reservations to be more
> correct and more tightly defined.
> 
> Note: this would appear to be a generic problem with intent
> recovery; we use the entire operation reservation for recovery,
> not the reservation that was held at runtime after the intent was
> logged. I suspect all intents are going to require their own
> reservation as a result.
> 

It might be worth explicitly pointing out that support for EFI/EFD
intents goes farther back than the various intents associated with newer
features, hence the value of a targeted fix. Otherwise the problem
description makes sense and approach seems reasonable to me. Thanks for
the writeup. Some questions...

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 10 ++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h |  2 ++
>  fs/xfs/xfs_extfree_item.c      |  2 +-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index d1a0848cb52e..da2ec052ac0a 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
>  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
>  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
> +	/*
> +	 * Log recovery reservations for intent replay
> +	 *
> +	 * EFI recovery is itruncate minus the initial transaction that logs
> +	 * logs the EFI.
> +	 */
> +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;

tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
whether reflink is enabled. On one hand this seems conservative enough,
but do we know exactly what those extra unit counts are accounted for in
the reflink case? It looks like extents are only freed when the last
reference is dropped (otherwise we log a refcount intent), which makes
me wonder whether we really need 7 log count units if recovery
encounters an EFI.

Also, while looking through the code I noticed that truncate does the
following:

		...
                error = xfs_defer_finish(&tp);
                if (error)
                        goto out;

                error = xfs_trans_roll_inode(&tp, ip);
                if (error)
                        goto out;
		...

... which looks like it rolls the transaction an extra time per-extent.
I don't think that contributes to this problem vs just being a runtime
inefficiency, so maybe I'll fling a patch up for that separately.

> +	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +
>  	/*
>  	 * The following transactions are logged in logical format with
>  	 * a default log count.
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7241ab28cf84..13173b3eaac9 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -50,6 +50,8 @@ struct xfs_trans_resv {
>  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
>  	struct xfs_trans_res	tr_sb;		/* modify superblock */
>  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> +

Extra whitespace line.

Brian

>  };
>  
>  /* shorthand way of accessing reservation structure */
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index 6cb8cd11072a..1ea9ab4cd44e 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -618,7 +618,7 @@ xfs_efi_item_recover(
>  		}
>  	}
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_efi, 0, 0, 0, &tp);
>  	if (error)
>  		return error;
>  	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] xfs: add a free space extent change reservation
  2020-09-09  8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner
@ 2020-09-09 13:35   ` Brian Foster
  2020-09-09 22:51     ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2020-09-09 13:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Lots of the transaction reservation code reserves space for an
> extent allocation. It is inconsistently implemented, and many of
> them get it wrong. Introduce a new function to calculate the log
> space reservation for adding or removing an extent from the free
> space btrees.
> 
> This function reserves space for logging the AGF, the AGFL and the
> free space btrees, avoiding the need to account for them seperately
> in every reservation that manipulates free space.
> 
> Convert the EFI recovery reservation to use this transaction
> reservation as EFI recovery only needs to manipulate the free space
> index.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index da2ec052ac0a..621ddb277dfa 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
>  	return blocks;
>  }
>  
> +/*
> + * Log reservation required to add or remove a single extent to the free space
> + * btrees.  This requires modifying:
> + *
> + * the agf header: 1 sector
> + * the agfl header: 1 sector
> + * the allocation btrees: 2 trees * (max depth - 1) * block size

Nit, but the xfs_allocfree_log_count() helper this uses clearly
indicates reservation for up to four trees. It might be worth referring
to that here just to minimize spreading details all over the place that
are likely to become stale or inconsistent over time.

> + */
> +uint
> +xfs_allocfree_extent_res(
> +	struct xfs_mount *mp)
> +{
> +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),

Why calculate for a single extent when an EFI can refer to multiple
extents? I thought the max was 2, but the extent free portion of the
itruncate calculation actually uses an op count of 4. The reason for
that is not immediately clear to me. It actually accounts 4 agf/agfl
blocks as well, so perhaps there's a wrong assumption somewhere. FWIW,
the truncate code allows 2 unmaps per transaction and the
xfs_extent_free_defer_type struct limits the dfp to 16. I suspect the
latter is not relevant for the current code.

Either way, multiple extents are factored into the current freeing
reservation and the extent freeing code at runtime (dfops) and during
recovery both appear to iterate on an extent count (potentially > 1) per
transaction. The itruncate comment, for reference (I also just noticed
that the subsequent patch modifies this comment, so you're presumably
aware of this mess):

/*
 * ...
 * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
 *    the agf for each of the ags: 4 * sector size
 *    the agfl for each of the ags: 4 * sector size
 *    the super block to reflect the freed blocks: sector size
 *    worst case split in allocation btrees per extent assuming 4 extents:
 *              4 exts * 2 trees * (2 * max depth - 1) * block size
 * ...
 */

Brian

> +				XFS_FSB_TO_B(mp, 1));
> +}
> +
>  /*
>   * Logging inodes is really tricksy. They are logged in memory format,
>   * which means that what we write into the log doesn't directly translate into
> @@ -922,7 +939,7 @@ xfs_trans_resv_calc(
>  	 * EFI recovery is itruncate minus the initial transaction that logs
>  	 * logs the EFI.
>  	 */
> -	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> +	resp->tr_efi.tr_logres = xfs_allocfree_extent_res(mp);
>  	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
>  	resp->tr_efi.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
> -- 
> 2.28.0
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-09 13:31   ` Brian Foster
@ 2020-09-09 21:44     ` Dave Chinner
  2020-09-10 13:18       ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-09-09 21:44 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Recovering an EFI currently uses a itruncate reservation, which is
> > designed for a rolling transaction that modifies the BMBT and
> > logs the EFI in one commit, then frees the space and logs the EFD in
> > the second commit.
> > 
> > Recovering the EFI only requires the second transaction in this
> > pair, and hence has a smaller log space requirement than a truncate
> > operation. Hence when the extent free is being processed at runtime,
> > the log reservation that is held by the filesystem is only enough to
> > complete the extent free, not the entire truncate operation.
> > 
> > Hence if the EFI pins the tail of the log and the log fills up while
> > the extent is being freed, the amount of reserved free space in the
> > log is not enough to start another entire truncate operation. Hence
> > if we crash at this point, log recovery will deadlock with the EFI
> > pinning the tail of the log and the log not having enough free space
> > to reserve an itruncate transaction.
> > 
> > As such, EFI recovery needs it's own log space reservation separate
> > to the itruncate reservation. We only need what is required free the
> > extent, and this matches the space we have reserved at runtime for
> > this operation and hence should prevent the recovery deadlock from
> > occurring.
> > 
> > This patch adds the new reservation in a way that minimises the
> > change such that it should be back-portable to older kernels easily.
> > Follow up patches will factor and rework the reservations to be more
> > correct and more tightly defined.
> > 
> > Note: this would appear to be a generic problem with intent
> > recovery; we use the entire operation reservation for recovery,
> > not the reservation that was held at runtime after the intent was
> > logged. I suspect all intents are going to require their own
> > reservation as a result.
> > 
> 
> It might be worth explicitly pointing out that support for EFI/EFD
> intents goes farther back than the various intents associated with newer
> features, hence the value of a targeted fix.

Ok.

> > @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
> >  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> >  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> > +	/*
> > +	 * Log recovery reservations for intent replay
> > +	 *
> > +	 * EFI recovery is itruncate minus the initial transaction that logs
> > +	 * logs the EFI.
> > +	 */
> > +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> > +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
> 
> tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
> whether reflink is enabled. On one hand this seems conservative enough,
> but do we know exactly what those extra unit counts are accounted for in
> the reflink case?

Right, in the reflink case we may have to roll the transaction many more
times to do the refcount btree and reverse map btree modifications.
Those are done under separate intents and so we have to roll and
commit the defered ops more times on a reflink/rmap based
filesystem. Hence the logcount is higher so that the initial
reservation can roll more times before regrant during a roll has to
go and physically reserve more write space in the log to continue
rolling the transaction.

> It looks like extents are only freed when the last
> reference is dropped (otherwise we log a refcount intent), which makes
> me wonder whether we really need 7 log count units if recovery
> encounters an EFI.

I don't know if the numbers are correct, and it really is out of
scope for this patch to audit/fix that. I really think we need to
map this whole thing out in a diagram at this point because I now
suspect that the allocfree log count calculation is not correct,
either...

> Also, while looking through the code I noticed that truncate does the
> following:
> 
> 		...
>                 error = xfs_defer_finish(&tp);
>                 if (error)
>                         goto out;
> 
>                 error = xfs_trans_roll_inode(&tp, ip);
>                 if (error)
>                         goto out;
> 		...
> 
> ... which looks like it rolls the transaction an extra time per-extent.
> I don't think that contributes to this problem vs just being a runtime
> inefficiency, so maybe I'll fling a patch up for that separately.

Yeah, I'm not sure if this is correct/needed or not. 

> >  	 * The following transactions are logged in logical format with
> >  	 * a default log count.
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > index 7241ab28cf84..13173b3eaac9 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > @@ -50,6 +50,8 @@ struct xfs_trans_resv {
> >  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
> >  	struct xfs_trans_res	tr_sb;		/* modify superblock */
> >  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> > +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> > +
> 
> Extra whitespace line.

Will fix.

Thanks!

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] xfs: add a free space extent change reservation
  2020-09-09 13:35   ` Brian Foster
@ 2020-09-09 22:51     ` Dave Chinner
  2020-09-10 13:19       ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-09-09 22:51 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote:
> On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Lots of the transaction reservation code reserves space for an
> > extent allocation. It is inconsistently implemented, and many of
> > them get it wrong. Introduce a new function to calculate the log
> > space reservation for adding or removing an extent from the free
> > space btrees.
> > 
> > This function reserves space for logging the AGF, the AGFL and the
> > free space btrees, avoiding the need to account for them seperately
> > in every reservation that manipulates free space.
> > 
> > Convert the EFI recovery reservation to use this transaction
> > reservation as EFI recovery only needs to manipulate the free space
> > index.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
> >  1 file changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index da2ec052ac0a..621ddb277dfa 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
> >  	return blocks;
> >  }
> >  
> > +/*
> > + * Log reservation required to add or remove a single extent to the free space
> > + * btrees.  This requires modifying:
> > + *
> > + * the agf header: 1 sector
> > + * the agfl header: 1 sector
> > + * the allocation btrees: 2 trees * (max depth - 1) * block size
> 
> Nit, but the xfs_allocfree_log_count() helper this uses clearly
> indicates reservation for up to four trees. It might be worth referring
> to that here just to minimize spreading details all over the place that
> are likely to become stale or inconsistent over time.

Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem
right for freeing, and I need to check how allocation works again
because stuff gets deferred.

i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink
tree and the rmap trees all in the same transaction. We do a cycle
that looks like this:

log new intent, commit, execute the intent, log the intent done,
log new intent, commit, execute the intent, log the intent done,
log new intent, commit, ....

And so I'm not sure that we are modifying the reflink, rmap and free
space trees all in the same transaction and commit.

> 
> > + */
> > +uint
> > +xfs_allocfree_extent_res(
> > +	struct xfs_mount *mp)
> > +{
> > +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> 
> Why calculate for a single extent when an EFI can refer to multiple
> extents?

This isn't anything to do with an EFI at this point. It's just the
reservation needed to remove a single extent from a single AG in
isolation. I wanted to isolate this reservation completely from the
rest of the transaction reservations and how it ends up being used.

This deadlock surprised me, and so reflection and insight has lead
me to think that we actually need our reservations to reflect the
specific operations that will be performed by the transaction rather
than an aggregation of things that get modified.

Then each reservation can contain the reservations for each
operation it performance (e.g. free an extent) 

That's kinda what this "intent needs it's own reservation"

> I thought the max was 2, but the extent free portion of the
> itruncate calculation actually uses an op count of 4. The reason for
> that is not immediately clear to me.

THe reason is that itruncate used to allow 4 extents to be freed in
a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined
to 4, it is now 2.

However, if you want to relate this to EFIs, I think this
reservation is completely bogus. If all the extents freed a single
BMBT extent free are packed into a single EFI (i.e. all the BMBT
blocks freed and the data extent) then we'll overrun the reservation
extent count of 4. The max extents being freed in a single
BMBT extent free operations is a full bmbt merge + the data extent.
i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that
we're allowed to pack two data extent frees into a single EFI, and
that means it is, worst case, a full BMBT merge + a BMBT merge up to
level below root + 2 data extents, or:

= (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1)
= XFS_BM_MAXLEVELS * 2 extents

And then we try to free all those extents in a single itruncate unit
reservation. We probably don't hit this often because the maximum
BMBT level is bound by filesystem size, and it's rare to have a >=
4-level BMBT tree needed to make this problematic.  We do not have a
reservation large enough to do that as a single transaction, and we
don't want to have a reservation that large, either.

This is not a limitation of the EFI/EFD construct - that can
effectively be unbound. The limitation is that we don't roll and
relog the EFD between each extent in the EFI that we free. The
itruncate reservation has a bound limit for freeing extents, the
EFI reservation does not define that.

So, really, I think our extent free reservations and some of the
intent execution behaviour needs a serious audit and rework, and we
need to decouple the extent allocation and freeing reservations as
extent allocation is now a fundamentally different operation to
freeing an extent. i.e. Allocation does BMBT, refcount, rmap and
freespace tree mods all in one transaction, while freeing does
multiple individual transactional modifications linked by chained
intents.

> Either way, multiple extents are factored into the current freeing
> reservation and the extent freeing code at runtime (dfops) and during
> recovery both appear to iterate on an extent count (potentially > 1) per
> transaction. The itruncate comment, for reference (I also just noticed
> that the subsequent patch modifies this comment, so you're presumably
> aware of this mess):
> 
> /*
>  * ...
>  * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
>  *    the agf for each of the ags: 4 * sector size
>  *    the agfl for each of the ags: 4 * sector size
>  *    the super block to reflect the freed blocks: sector size
>  *    worst case split in allocation btrees per extent assuming 4 extents:
>  *              4 exts * 2 trees * (2 * max depth - 1) * block size
>  * ...
>  */

Oh, yeah, I'm well aware of it - ISTR commenting on it past
discussions about the transaction reservation sizes being much
larger than necessary....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-09 21:44     ` Dave Chinner
@ 2020-09-10 13:18       ` Brian Foster
  2020-09-10 21:29         ` Dave Chinner
  0 siblings, 1 reply; 12+ messages in thread
From: Brian Foster @ 2020-09-10 13:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > On Wed, Sep 09, 2020 at 06:19:10PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Recovering an EFI currently uses a itruncate reservation, which is
> > > designed for a rolling transaction that modifies the BMBT and
> > > logs the EFI in one commit, then frees the space and logs the EFD in
> > > the second commit.
> > > 
> > > Recovering the EFI only requires the second transaction in this
> > > pair, and hence has a smaller log space requirement than a truncate
> > > operation. Hence when the extent free is being processed at runtime,
> > > the log reservation that is held by the filesystem is only enough to
> > > complete the extent free, not the entire truncate operation.
> > > 
> > > Hence if the EFI pins the tail of the log and the log fills up while
> > > the extent is being freed, the amount of reserved free space in the
> > > log is not enough to start another entire truncate operation. Hence
> > > if we crash at this point, log recovery will deadlock with the EFI
> > > pinning the tail of the log and the log not having enough free space
> > > to reserve an itruncate transaction.
> > > 
> > > As such, EFI recovery needs it's own log space reservation separate
> > > to the itruncate reservation. We only need what is required free the
> > > extent, and this matches the space we have reserved at runtime for
> > > this operation and hence should prevent the recovery deadlock from
> > > occurring.
> > > 
> > > This patch adds the new reservation in a way that minimises the
> > > change such that it should be back-portable to older kernels easily.
> > > Follow up patches will factor and rework the reservations to be more
> > > correct and more tightly defined.
> > > 
> > > Note: this would appear to be a generic problem with intent
> > > recovery; we use the entire operation reservation for recovery,
> > > not the reservation that was held at runtime after the intent was
> > > logged. I suspect all intents are going to require their own
> > > reservation as a result.
> > > 
> > 
> > It might be worth explicitly pointing out that support for EFI/EFD
> > intents goes farther back than the various intents associated with newer
> > features, hence the value of a targeted fix.
> 
> Ok.
> 
> > > @@ -916,6 +916,16 @@ xfs_trans_resv_calc(
> > >  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> > >  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> > >  
> > > +	/*
> > > +	 * Log recovery reservations for intent replay
> > > +	 *
> > > +	 * EFI recovery is itruncate minus the initial transaction that logs
> > > +	 * logs the EFI.
> > > +	 */
> > > +	resp->tr_efi.tr_logres = resp->tr_itruncate.tr_logres;
> > > +	resp->tr_efi.tr_logcount = resp->tr_itruncate.tr_logcount - 1;
> > 
> > tr_itruncate.tr_logcount looks like it's either 2 or 8 depending on
> > whether reflink is enabled. On one hand this seems conservative enough,
> > but do we know exactly what those extra unit counts are accounted for in
> > the reflink case?
> 
> Right, in the reflink case we may have to roll the transaction many more
> times to do the refcount btree and reverse map btree modifications.
> Those are done under separate intents and so we have to roll and
> commit the defered ops more times on a reflink/rmap based
> filesystem. Hence the logcount is higher so that the initial
> reservation can roll more times before regrant during a roll has to
> go and physically reserve more write space in the log to continue
> rolling the transaction.
> 
> > It looks like extents are only freed when the last
> > reference is dropped (otherwise we log a refcount intent), which makes
> > me wonder whether we really need 7 log count units if recovery
> > encounters an EFI.
> 
> I don't know if the numbers are correct, and it really is out of
> scope for this patch to audit/fix that. I really think we need to
> map this whole thing out in a diagram at this point because I now
> suspect that the allocfree log count calculation is not correct,
> either...
> 

I agree up to the point where it relates to this specific EFI recovery
issue. reflink is enabled by default, which means the default EFI
recovery reservation is going to have 7 logcount units. Is that actually
enough of a reduction to prevent this same recovery problem on newer
fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
example, at least for the short term fix.

Brian

> > Also, while looking through the code I noticed that truncate does the
> > following:
> > 
> > 		...
> >                 error = xfs_defer_finish(&tp);
> >                 if (error)
> >                         goto out;
> > 
> >                 error = xfs_trans_roll_inode(&tp, ip);
> >                 if (error)
> >                         goto out;
> > 		...
> > 
> > ... which looks like it rolls the transaction an extra time per-extent.
> > I don't think that contributes to this problem vs just being a runtime
> > inefficiency, so maybe I'll fling a patch up for that separately.
> 
> Yeah, I'm not sure if this is correct/needed or not. 
> 
> > >  	 * The following transactions are logged in logical format with
> > >  	 * a default log count.
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > > index 7241ab28cf84..13173b3eaac9 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > > @@ -50,6 +50,8 @@ struct xfs_trans_resv {
> > >  	struct xfs_trans_res	tr_qm_equotaoff;/* end of turn quota off */
> > >  	struct xfs_trans_res	tr_sb;		/* modify superblock */
> > >  	struct xfs_trans_res	tr_fsyncts;	/* update timestamps on fsync */
> > > +	struct xfs_trans_res	tr_efi;		/* EFI log item recovery */
> > > +
> > 
> > Extra whitespace line.
> 
> Will fix.
> 
> Thanks!
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/3] xfs: add a free space extent change reservation
  2020-09-09 22:51     ` Dave Chinner
@ 2020-09-10 13:19       ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2020-09-10 13:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Sep 10, 2020 at 08:51:27AM +1000, Dave Chinner wrote:
> On Wed, Sep 09, 2020 at 09:35:25AM -0400, Brian Foster wrote:
> > On Wed, Sep 09, 2020 at 06:19:11PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Lots of the transaction reservation code reserves space for an
> > > extent allocation. It is inconsistently implemented, and many of
> > > them get it wrong. Introduce a new function to calculate the log
> > > space reservation for adding or removing an extent from the free
> > > space btrees.
> > > 
> > > This function reserves space for logging the AGF, the AGFL and the
> > > free space btrees, avoiding the need to account for them seperately
> > > in every reservation that manipulates free space.
> > > 
> > > Convert the EFI recovery reservation to use this transaction
> > > reservation as EFI recovery only needs to manipulate the free space
> > > index.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_trans_resv.c | 19 ++++++++++++++++++-
> > >  1 file changed, 18 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > > index da2ec052ac0a..621ddb277dfa 100644
> > > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > > @@ -79,6 +79,23 @@ xfs_allocfree_log_count(
> > >  	return blocks;
> > >  }
> > >  
> > > +/*
> > > + * Log reservation required to add or remove a single extent to the free space
> > > + * btrees.  This requires modifying:
> > > + *
> > > + * the agf header: 1 sector
> > > + * the agfl header: 1 sector
> > > + * the allocation btrees: 2 trees * (max depth - 1) * block size
> > 
> > Nit, but the xfs_allocfree_log_count() helper this uses clearly
> > indicates reservation for up to four trees. It might be worth referring
> > to that here just to minimize spreading details all over the place that
> > are likely to become stale or inconsistent over time.
> 
> Yup, but now I think of it, xfs_allocfree_log_count() doesn't seem
> right for freeing, and I need to check how allocation works again
> because stuff gets deferred.
> 
> i.e. on freeing, AFAICT we don't modify the freespace trees, the reflink
> tree and the rmap trees all in the same transaction. We do a cycle
> that looks like this:
> 
> log new intent, commit, execute the intent, log the intent done,
> log new intent, commit, execute the intent, log the intent done,
> log new intent, commit, ....
> 
> And so I'm not sure that we are modifying the reflink, rmap and free
> space trees all in the same transaction and commit.
> 

Indeed.

> > 
> > > + */
> > > +uint
> > > +xfs_allocfree_extent_res(
> > > +	struct xfs_mount *mp)
> > > +{
> > > +	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > > +	       xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > 
> > Why calculate for a single extent when an EFI can refer to multiple
> > extents?
> 
> This isn't anything to do with an EFI at this point. It's just the
> reservation needed to remove a single extent from a single AG in
> isolation. I wanted to isolate this reservation completely from the
> rest of the transaction reservations and how it ends up being used.
> 

Sure, but this is used as the reservation size for tr_efi. The current
itruncate reservation is clearly more than is required for EFI recovery,
but my point is that the current itruncate code means recovery must
handle EFIs with at least a couple extents (perhaps more based on your
analysis below). Recovery currently processes efi_nextents all in a
single transaction. The current recovery reservation seems to
accommodate current runtime behavior, but does the new reservation based
on this helper satisfy EFIs with >1 extents? I haven't tested it, but it
seems potentially deficient based on inspection...

> This deadlock surprised me, and so reflection and insight has lead
> me to think that we actually need our reservations to reflect the
> specific operations that will be performed by the transaction rather
> than an aggregation of things that get modified.
> 
> Then each reservation can contain the reservations for each
> operation it performance (e.g. free an extent) 
> 
> That's kinda what this "intent needs it's own reservation"
> 

Ack.

> > I thought the max was 2, but the extent free portion of the
> > itruncate calculation actually uses an op count of 4. The reason for
> > that is not immediately clear to me.
> 
> THe reason is that itruncate used to allow 4 extents to be freed in
> a single transaction. i.e. XFS_ITRUNC_MAX_EXTENTS used to be defined
> to 4, it is now 2.
> 

Ok, well that's a simple enough explanation.. :P

> However, if you want to relate this to EFIs, I think this
> reservation is completely bogus. If all the extents freed a single
> BMBT extent free are packed into a single EFI (i.e. all the BMBT
> blocks freed and the data extent) then we'll overrun the reservation
> extent count of 4. The max extents being freed in a single
> BMBT extent free operations is a full bmbt merge + the data extent.
> i.e. XFS_BM_MAXLEVELS + 1 extents in a single EFI. Except that
> we're allowed to pack two data extent frees into a single EFI, and
> that means it is, worst case, a full BMBT merge + a BMBT merge up to
> level below root + 2 data extents, or:
> 

All I want to do wrt to this patch is make sure the recovery reservation
is large enough to process the EFIs it might see in the log based on the
current code.

> = (XFS_BM_MAXLEVELS + 1) + ((XFS_BM_MAXLEVELS - 1 - 1) + 1)
> = XFS_BM_MAXLEVELS * 2 extents
> 
> And then we try to free all those extents in a single itruncate unit
> reservation. We probably don't hit this often because the maximum
> BMBT level is bound by filesystem size, and it's rare to have a >=
> 4-level BMBT tree needed to make this problematic.  We do not have a
> reservation large enough to do that as a single transaction, and we
> don't want to have a reservation that large, either.
> 

Agreed.

> This is not a limitation of the EFI/EFD construct - that can
> effectively be unbound. The limitation is that we don't roll and
> relog the EFD between each extent in the EFI that we free. The
> itruncate reservation has a bound limit for freeing extents, the
> EFI reservation does not define that.
> 

Sure.

> So, really, I think our extent free reservations and some of the
> intent execution behaviour needs a serious audit and rework, and we
> need to decouple the extent allocation and freeing reservations as
> extent allocation is now a fundamentally different operation to
> freeing an extent. i.e. Allocation does BMBT, refcount, rmap and
> freespace tree mods all in one transaction, while freeing does
> multiple individual transactional modifications linked by chained
> intents.
> 

I don't think an allocation transaction is so overloaded as compared to
freeing. I thought reflink was more of a separate remap operation vs
being tied to block allocation, not to say that a remap might depend on
block allocs or not, but both remaps and refcount updates look like
individual intents (along with rmapbt updates associated with block
mappings/allocations). Regardless, I agree that the current transaction
(and associated reservation) situation is a mess, particularly now with
more use of intents, complex chains thereof, and certain transactions
(tr_itruncate/tr_write) being overloaded to the point where they're huge
or simply too difficult to reason about..

Brian

> > Either way, multiple extents are factored into the current freeing
> > reservation and the extent freeing code at runtime (dfops) and during
> > recovery both appear to iterate on an extent count (potentially > 1) per
> > transaction. The itruncate comment, for reference (I also just noticed
> > that the subsequent patch modifies this comment, so you're presumably
> > aware of this mess):
> > 
> > /*
> >  * ...
> >  * And the bmap_finish transaction can free the blocks and bmap blocks (t2):
> >  *    the agf for each of the ags: 4 * sector size
> >  *    the agfl for each of the ags: 4 * sector size
> >  *    the super block to reflect the freed blocks: sector size
> >  *    worst case split in allocation btrees per extent assuming 4 extents:
> >  *              4 exts * 2 trees * (2 * max depth - 1) * block size
> >  * ...
> >  */
> 
> Oh, yeah, I'm well aware of it - ISTR commenting on it past
> discussions about the transaction reservation sizes being much
> larger than necessary....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-10 13:18       ` Brian Foster
@ 2020-09-10 21:29         ` Dave Chinner
  2020-09-11 12:37           ` Brian Foster
  0 siblings, 1 reply; 12+ messages in thread
From: Dave Chinner @ 2020-09-10 21:29 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote:
> On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > > It looks like extents are only freed when the last
> > > reference is dropped (otherwise we log a refcount intent), which makes
> > > me wonder whether we really need 7 log count units if recovery
> > > encounters an EFI.
> > 
> > I don't know if the numbers are correct, and it really is out of
> > scope for this patch to audit/fix that. I really think we need to
> > map this whole thing out in a diagram at this point because I now
> > suspect that the allocfree log count calculation is not correct,
> > either...
> 
> I agree up to the point where it relates to this specific EFI recovery
> issue. reflink is enabled by default, which means the default EFI
> recovery reservation is going to have 7 logcount units. Is that actually
> enough of a reduction to prevent this same recovery problem on newer
> fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
> example, at least for the short term fix.

I spent yesterday mapping out the whole "free extent" chain to
understand exactly what was necessary, and in discussing this with
Darrick on #xfs I came to the conclusion that we cannot -ever- have
a logcount of more than 1 for an intent recovery of any sort.

That's because the transaction may have rolled enough times to
exhaust the initial grant of unit * logcount, so the only
reservation that the runtime kernel has when it crashs is a single
transaction unit reservation (which gets re-reserved on every roll).

Hence recovery cannot assume that intent that was being processed has more
than a single unit reservation of log space available to be used,
and hence all intent recovery reservations must start with a log
count of 1.

THere are other restrictions we need to deal with, too. multiple
intents may pin the tail of the log, so we can't just process a
single intent chain at a time as that will result in using all the
log space for a single intent chain and reservation deadlocking on
one of the intents we haven't yet started.

Hence the first thing we have to do in recovering intents is take an
active reservation for -all- intents in the log to match the
reservation state at runtime. Only then can we guarantee that
intents that pin the tail of the log will have the reservation space
needed to be able to unpin the log tail.

Further, because we now may have consumed all the available log
reservation space to guarantee forwards progress, the first commit
of the first intent may block trying to regrant space if another
intent also pins the tail of the log. This means we cannot replay
intents in a serial manner as we currently do. Each intent chain and it's
transaction context needs to run in it's own process context so it
can block waiting for other intents to make progress, just like
happens at runtime when the system crashed. IOWs, intent replay
needs to be done with a task context per intent chain (probably via
a workqueue).

AFAICT, this is the only way we can make intent recovery deadlock
free - we have to recreate the complete log reservation state in
memory before we start recovery of a single intent, we can only
assume an intent had a single unit reservation of log space assigned
to it, and intent chain execution needs to run concurrently so
commits can block waiting for log space to become available as other
intents commit and unpin the tail of the log...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation
  2020-09-10 21:29         ` Dave Chinner
@ 2020-09-11 12:37           ` Brian Foster
  0 siblings, 0 replies; 12+ messages in thread
From: Brian Foster @ 2020-09-11 12:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Fri, Sep 11, 2020 at 07:29:27AM +1000, Dave Chinner wrote:
> On Thu, Sep 10, 2020 at 09:18:10AM -0400, Brian Foster wrote:
> > On Thu, Sep 10, 2020 at 07:44:55AM +1000, Dave Chinner wrote:
> > > On Wed, Sep 09, 2020 at 09:31:11AM -0400, Brian Foster wrote:
> > > > It looks like extents are only freed when the last
> > > > reference is dropped (otherwise we log a refcount intent), which makes
> > > > me wonder whether we really need 7 log count units if recovery
> > > > encounters an EFI.
> > > 
> > > I don't know if the numbers are correct, and it really is out of
> > > scope for this patch to audit/fix that. I really think we need to
> > > map this whole thing out in a diagram at this point because I now
> > > suspect that the allocfree log count calculation is not correct,
> > > either...
> > 
> > I agree up to the point where it relates to this specific EFI recovery
> > issue. reflink is enabled by default, which means the default EFI
> > recovery reservation is going to have 7 logcount units. Is that actually
> > enough of a reduction to prevent this same recovery problem on newer
> > fs'? I'm wondering if the tr_efi logcount should just be set to 1, for
> > example, at least for the short term fix.
> 
> I spent yesterday mapping out the whole "free extent" chain to
> understand exactly what was necessary, and in discussing this with
> Darrick on #xfs I came to the conclusion that we cannot -ever- have
> a logcount of more than 1 for an intent recovery of any sort.
> 
> That's because the transaction may have rolled enough times to
> exhaust the initial grant of unit * logcount, so the only
> reservation that the runtime kernel has when it crashs is a single
> transaction unit reservation (which gets re-reserved on every roll).
> 
> Hence recovery cannot assume that intent that was being processed has more
> than a single unit reservation of log space available to be used,
> and hence all intent recovery reservations must start with a log
> count of 1.
> 

Makes sense.

> THere are other restrictions we need to deal with, too. multiple
> intents may pin the tail of the log, so we can't just process a
> single intent chain at a time as that will result in using all the
> log space for a single intent chain and reservation deadlocking on
> one of the intents we haven't yet started.
> 
> Hence the first thing we have to do in recovering intents is take an
> active reservation for -all- intents in the log to match the
> reservation state at runtime. Only then can we guarantee that
> intents that pin the tail of the log will have the reservation space
> needed to be able to unpin the log tail.
> 

Which brings up the ordering question when multiple intents pin the
tail...

> Further, because we now may have consumed all the available log
> reservation space to guarantee forwards progress, the first commit
> of the first intent may block trying to regrant space if another
> intent also pins the tail of the log. This means we cannot replay
> intents in a serial manner as we currently do. Each intent chain and it's
> transaction context needs to run in it's own process context so it
> can block waiting for other intents to make progress, just like
> happens at runtime when the system crashed. IOWs, intent replay
> needs to be done with a task context per intent chain (probably via
> a workqueue).
> 

Indeed.

> AFAICT, this is the only way we can make intent recovery deadlock
> free - we have to recreate the complete log reservation state in
> memory before we start recovery of a single intent, we can only
> assume an intent had a single unit reservation of log space assigned
> to it, and intent chain execution needs to run concurrently so
> commits can block waiting for log space to become available as other
> intents commit and unpin the tail of the log...
> 

Ok, so the primary recovery task would either need to acquire and
distribute the initial reservation or otherwise implement some kind of
execution barrier for the intent recovery wq tasks to ensure that every
currently log resident intent acquires a transaction before any other is
allowed to roll. That doesn't seem _too_ limiting a restriction given
that it technically should allow many intents (that don't result in
intent chains) to execute to completion before waiting on others at all,
though it does sound like a complete rework of the current
post-recovery, single threaded intent recovery approach.

I'm kind of wondering if we had some mechanism to freeze/unfreeze all
transaction rolls, if we could then rework recovery to issue intent
recovery wq tasks during pass2 as log records are written back to the
fs. For example, with transaction rolls frozen, pass2 of recovery
completes a particular log record and on I/O completion of all
associated buffers, kicks off workqueue tasks for each unfinished intent
in that particular record. The workqueue tasks will either complete and
exit or unconditionally block on the next transaction roll (regrant).
The main recovery task carries on and repeats for all subsequent records
in the log, unfreezes regrants and waits for all outstanding
tasks/transactions to (hopefully) complete. Hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-09-11 17:42 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-09  8:19 [PATCH 0/3] [RFC] xfs: intent recovery reservation changes Dave Chinner
2020-09-09  8:19 ` [PATCH 1/3] xfs: EFI recovery needs it's own transaction reservation Dave Chinner
2020-09-09 13:31   ` Brian Foster
2020-09-09 21:44     ` Dave Chinner
2020-09-10 13:18       ` Brian Foster
2020-09-10 21:29         ` Dave Chinner
2020-09-11 12:37           ` Brian Foster
2020-09-09  8:19 ` [PATCH 2/3] xfs: add a free space extent change reservation Dave Chinner
2020-09-09 13:35   ` Brian Foster
2020-09-09 22:51     ` Dave Chinner
2020-09-10 13:19       ` Brian Foster
2020-09-09  8:19 ` [PATCH 3/3] xfs: factor free space tree transaciton reservations Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.