* [PATCHSET 0/6] xfs: fix reflink inefficiencies
@ 2022-04-14 22:54 Darrick J. Wong
  2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

Hi all,

As Dave Chinner has complained about on IRC, there are a couple of
things about reflink that are very inefficient.  First of all, we
limited the size of all bunmapi operations to avoid flooding the log
with defer ops in the worst case, but recent changes to the defer ops
code have solved that problem, so get rid of the bunmapi length clamp.

Second, the log reservations for reflink operations are far far larger
than they need to be.  Shrink them to exactly what we need to handle
each deferred RUI and CUI log item, and no more.  Also reduce logcount
because we don't need 8 rolls per operation.  Introduce a transaction
reservation compatibility layer to avoid changing the minimum log size
calculations.
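
The savings come from both knobs: the worst-case log space a permanent
reservation can pin is roughly tr_logres * tr_logcount, the same product
that xfs_log_get_max_trans_res uses when computing minimum log size.  A
simplified sketch of the quantity being shrunk, with "res" standing in
for any struct xfs_trans_res entry:

	/* worst-case log space pinned by one permanent reservation type */
	unsigned int space = res->tr_logcount > 1 ?
			     res->tr_logres * res->tr_logcount :
			     res->tr_logres;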

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=reflink-speedups-5.19
---
 fs/xfs/libxfs/xfs_bmap.c       |   22 -----
 fs/xfs/libxfs/xfs_log_rlimit.c |   17 +++-
 fs/xfs/libxfs/xfs_refcount.c   |   14 ++-
 fs/xfs/libxfs/xfs_refcount.h   |    8 --
 fs/xfs/libxfs/xfs_trans_resv.c |  193 +++++++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_trans_resv.h |    8 +-
 fs/xfs/xfs_reflink.c           |   95 ++++++++++++--------
 fs/xfs/xfs_trace.h             |   15 ++-
 8 files changed, 264 insertions(+), 108 deletions(-)


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 22:01   ` Dave Chinner
  2022-04-26 13:46   ` Christoph Hellwig
  2022-04-14 22:54 ` [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink Darrick J. Wong
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
data forks of shared files to avoid two failure scenarios: one where the
extent being unmapped is so sparsely shared that we exceed the
transaction reservation with the sheer number of refcount btree updates
and EFI intent items; and the other where we attach so many deferred
updates to the transaction that we pin the log tail and later the log
head meets the tail, causing the log to livelock.

We avoid triggering the first problem by tracking the number of ops in
the refcount btree cursor and forcing a requeue of the refcount intent
item any time we think that we might be close to overflowing.  This has
been baked into XFS since before the original e1a4 patch.
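
For illustration, the throttle boils down to comparing the estimated log
footprint of the queued updates against the transaction reservation.
This is a simplified sketch, not the kernel function: the real check is
xfs_refcount_still_have_space (which patch 5 also touches), and it adds
an overhead term for btree shape changes that is omitted here; the
function name below is made up for the example.

	static bool example_refcount_still_have_space(unsigned int nr_ops,
			unsigned int t_log_res)
	{
		/* assume ~32 bytes of log space per queued refcount update */
		return nr_ops * XFS_REFCOUNT_ITEM_OVERHEAD < t_log_res;
	}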

A recent patchset fixed the second problem by changing the deferred ops
code to finish all the work items created by each round of trying to
complete a refcount intent item, which eliminates the long chains of
deferred items (27dad); and by causing long-running transactions to relog
their intent log items when space in the log gets low (74f4d).

Because this clamp affects /any/ unmapping request regardless of the
sharing factors of the component blocks, it degrades the performance of
all large unmapping requests -- whereas with an unshared file we can
unmap millions of blocks in one go, shared files are limited to
unmapping a few thousand blocks at a time, which causes the upper level
code to spin in a bunmapi loop even if it wasn't needed.

This also eliminates one more place where log recovery behavior can
differ from online behavior, because bunmapi operations no longer need
to requeue.

Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_bmap.c     |   22 +---------------------
 fs/xfs/libxfs/xfs_refcount.c |    5 ++---
 fs/xfs/libxfs/xfs_refcount.h |    8 ++------
 3 files changed, 5 insertions(+), 30 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 74198dd82b03..432597e425c1 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5299,7 +5299,6 @@ __xfs_bunmapi(
 	int			whichfork;	/* data or attribute fork */
 	xfs_fsblock_t		sum;
 	xfs_filblks_t		len = *rlen;	/* length to unmap in file */
-	xfs_fileoff_t		max_len;
 	xfs_fileoff_t		end;
 	struct xfs_iext_cursor	icur;
 	bool			done = false;
@@ -5318,16 +5317,6 @@ __xfs_bunmapi(
 	ASSERT(len > 0);
 	ASSERT(nexts >= 0);
 
-	/*
-	 * Guesstimate how many blocks we can unmap without running the risk of
-	 * blowing out the transaction with a mix of EFIs and reflink
-	 * adjustments.
-	 */
-	if (tp && xfs_is_reflink_inode(ip) && whichfork == XFS_DATA_FORK)
-		max_len = min(len, xfs_refcount_max_unmap(tp->t_log_res));
-	else
-		max_len = len;
-
 	error = xfs_iread_extents(tp, ip, whichfork);
 	if (error)
 		return error;
@@ -5366,7 +5355,7 @@ __xfs_bunmapi(
 
 	extno = 0;
 	while (end != (xfs_fileoff_t)-1 && end >= start &&
-	       (nexts == 0 || extno < nexts) && max_len > 0) {
+	       (nexts == 0 || extno < nexts)) {
 		/*
 		 * Is the found extent after a hole in which end lives?
 		 * Just back up to the previous extent, if so.
@@ -5400,14 +5389,6 @@ __xfs_bunmapi(
 		if (del.br_startoff + del.br_blockcount > end + 1)
 			del.br_blockcount = end + 1 - del.br_startoff;
 
-		/* How much can we safely unmap? */
-		if (max_len < del.br_blockcount) {
-			del.br_startoff += del.br_blockcount - max_len;
-			if (!wasdel)
-				del.br_startblock += del.br_blockcount - max_len;
-			del.br_blockcount = max_len;
-		}
-
 		if (!isrt)
 			goto delete;
 
@@ -5543,7 +5524,6 @@ __xfs_bunmapi(
 		if (error)
 			goto error0;
 
-		max_len -= del.br_blockcount;
 		end = del.br_startoff - 1;
 nodelete:
 		/*
diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index 327ba25e9e17..a07ebaecba73 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -960,6 +960,7 @@ xfs_refcount_adjust_extents(
 			 * Either cover the hole (increment) or
 			 * delete the range (decrement).
 			 */
+			cur->bc_ag.refc.nr_ops++;
 			if (tmp.rc_refcount) {
 				error = xfs_refcount_insert(cur, &tmp,
 						&found_tmp);
@@ -970,7 +971,6 @@ xfs_refcount_adjust_extents(
 					error = -EFSCORRUPTED;
 					goto out_error;
 				}
-				cur->bc_ag.refc.nr_ops++;
 			} else {
 				fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
 						cur->bc_ag.pag->pag_agno,
@@ -1001,11 +1001,11 @@ xfs_refcount_adjust_extents(
 		ext.rc_refcount += adj;
 		trace_xfs_refcount_modify_extent(cur->bc_mp,
 				cur->bc_ag.pag->pag_agno, &ext);
+		cur->bc_ag.refc.nr_ops++;
 		if (ext.rc_refcount > 1) {
 			error = xfs_refcount_update(cur, &ext);
 			if (error)
 				goto out_error;
-			cur->bc_ag.refc.nr_ops++;
 		} else if (ext.rc_refcount == 1) {
 			error = xfs_refcount_delete(cur, &found_rec);
 			if (error)
@@ -1014,7 +1014,6 @@ xfs_refcount_adjust_extents(
 				error = -EFSCORRUPTED;
 				goto out_error;
 			}
-			cur->bc_ag.refc.nr_ops++;
 			goto advloop;
 		} else {
 			fsbno = XFS_AGB_TO_FSB(cur->bc_mp,
diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
index 9eb01edbd89d..6b265f6075b8 100644
--- a/fs/xfs/libxfs/xfs_refcount.h
+++ b/fs/xfs/libxfs/xfs_refcount.h
@@ -66,15 +66,11 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
  * reservation and crash the fs.  Each record adds 12 bytes to the
  * log (plus any key updates) so we'll conservatively assume 32 bytes
  * per record.  We must also leave space for btree splits on both ends
- * of the range and space for the CUD and a new CUI.
+ * of the range and space for the CUD and a new CUI.  Each EFI that we
+ * attach to the transaction also consumes ~32 bytes.
  */
 #define XFS_REFCOUNT_ITEM_OVERHEAD	32
 
-static inline xfs_fileoff_t xfs_refcount_max_unmap(int log_res)
-{
-	return (log_res * 3 / 4) / XFS_REFCOUNT_ITEM_OVERHEAD;
-}
-
 extern int xfs_refcount_has_record(struct xfs_btree_cur *cur,
 		xfs_agblock_t bno, xfs_extlen_t len, bool *exists);
 union xfs_btree_rec;


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
  2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 22:03   ` Dave Chinner
  2022-04-26 13:47   ` Christoph Hellwig
  2022-04-14 22:54 ` [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size Darrick J. Wong
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

This raw call isn't necessary since we can always remove a full delalloc
extent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_reflink.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 54e68e5693fd..09bedbfef01a 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1133,7 +1133,7 @@ xfs_reflink_remap_extent(
 		xfs_refcount_decrease_extent(tp, &smap);
 		qdelta -= smap.br_blockcount;
 	} else if (smap.br_startblock == DELAYSTARTBLOCK) {
-		xfs_filblks_t	len = smap.br_blockcount;
+		int		done;
 
 		/*
 		 * If the extent we're unmapping is a delalloc reservation,
@@ -1141,10 +1141,11 @@ xfs_reflink_remap_extent(
 		 * incore state.  Dropping the delalloc reservation takes care
 		 * of the quota reservation for us.
 		 */
-		error = __xfs_bunmapi(NULL, ip, smap.br_startoff, &len, 0, 1);
+		error = xfs_bunmapi(NULL, ip, smap.br_startoff,
+				smap.br_blockcount, 0, 1, &done);
 		if (error)
 			goto out_cancel;
-		ASSERT(len == 0);
+		ASSERT(done);
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
  2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
  2022-04-14 22:54 ` [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 22:36   ` Dave Chinner
  2022-04-14 22:54 ` [PATCH 4/6] xfs: reduce the absurdly large log reservations Darrick J. Wong
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Every time someone changes the transaction reservation sizes, they
introduce potential compatibility problems if the changes affect the
minimum log size that we validate at mount time.  If the minimum log
size gets larger (which should be avoided because doing so presents a
serious risk of log livelock), filesystems created with old mkfs will
not mount on a newer kernel; if the minimum size shrinks, filesystems
created with newer mkfs will not mount on older kernels.

Therefore, enable the creation of a shadow log reservation structure
where we can "undo" the effects of tweaks when computing minimum log
sizes.  These shadow reservations should never be used in practice, but
they insulate us from perturbations in minimum log size.
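
A rough sketch of the intended split (illustrative only; "logsize_resv"
is a made-up local for the example, while both calc functions and
M_RES() are the real symbols touched by the patch):

	struct xfs_trans_resv	logsize_resv;

	/* runtime reservations, used for every transaction allocation */
	xfs_trans_resv_calc(mp, M_RES(mp));

	/* shadow reservations, consulted only for minimum log size checks */
	xfs_trans_resv_calc_logsize(mp, &logsize_resv);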

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_log_rlimit.c |   17 +++++++++++++----
 fs/xfs/libxfs/xfs_trans_resv.c |   12 ++++++++++++
 fs/xfs/libxfs/xfs_trans_resv.h |    2 ++
 fs/xfs/xfs_trace.h             |   12 ++++++++++--
 4 files changed, 37 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
index 67798ff5e14e..2bafc69cac15 100644
--- a/fs/xfs/libxfs/xfs_log_rlimit.c
+++ b/fs/xfs/libxfs/xfs_log_rlimit.c
@@ -14,6 +14,7 @@
 #include "xfs_trans_space.h"
 #include "xfs_da_btree.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_trace.h"
 
 /*
  * Calculate the maximum length in bytes that would be required for a local
@@ -47,18 +48,25 @@ xfs_log_get_max_trans_res(
 	struct xfs_trans_res	*max_resp)
 {
 	struct xfs_trans_res	*resp;
+	struct xfs_trans_res	*start_resp;
 	struct xfs_trans_res	*end_resp;
+	struct xfs_trans_resv	*resv;
 	int			log_space = 0;
 	int			attr_space;
 
 	attr_space = xfs_log_calc_max_attrsetm_res(mp);
 
-	resp = (struct xfs_trans_res *)M_RES(mp);
-	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
-	for (; resp < end_resp; resp++) {
+	resv = kmem_zalloc(sizeof(struct xfs_trans_resv), 0);
+	xfs_trans_resv_calc_logsize(mp, resv);
+
+	start_resp = (struct xfs_trans_res *)resv;
+	end_resp = (struct xfs_trans_res *)(resv + 1);
+	for (resp = start_resp; resp < end_resp; resp++) {
 		int		tmp = resp->tr_logcount > 1 ?
 				      resp->tr_logres * resp->tr_logcount :
 				      resp->tr_logres;
+
+		trace_xfs_trans_resv_calc_logsize(mp, resp - start_resp, resp);
 		if (log_space < tmp) {
 			log_space = tmp;
 			*max_resp = *resp;		/* struct copy */
@@ -66,9 +74,10 @@ xfs_log_get_max_trans_res(
 	}
 
 	if (attr_space > log_space) {
-		*max_resp = M_RES(mp)->tr_attrsetm;	/* struct copy */
+		*max_resp = resv->tr_attrsetm;	/* struct copy */
 		max_resp->tr_logres = attr_space;
 	}
+	kmem_free(resv);
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 6f83d9b306ee..12d4e451e70e 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -933,3 +933,15 @@ xfs_trans_resv_calc(
 	/* Put everything back the way it was.  This goes at the end. */
 	mp->m_rmap_maxlevels = rmap_maxlevels;
 }
+
+/*
+ * Compute an alternate set of log reservation sizes for use exclusively with
+ * minimum log size calculations.
+ */
+void
+xfs_trans_resv_calc_logsize(
+	struct xfs_mount	*mp,
+	struct xfs_trans_resv	*resp)
+{
+	xfs_trans_resv_calc(mp, resp);
+}
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index fc4e9b369a3a..9fa4863f72a4 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -89,6 +89,8 @@ struct xfs_trans_resv {
 #define	XFS_ATTRSET_LOG_COUNT		3
 #define	XFS_ATTRRM_LOG_COUNT		3
 
+void xfs_trans_resv_calc_logsize(struct xfs_mount *mp,
+		struct xfs_trans_resv *resp);
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);
 uint xfs_allocfree_log_count(struct xfs_mount *mp, uint num_ops);
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index ecde0be3030a..d24784ab8cea 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3516,7 +3516,7 @@ DEFINE_GETFSMAP_EVENT(xfs_getfsmap_low_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_high_key);
 DEFINE_GETFSMAP_EVENT(xfs_getfsmap_mapping);
 
-TRACE_EVENT(xfs_trans_resv_calc,
+DECLARE_EVENT_CLASS(xfs_trans_resv_class,
 	TP_PROTO(struct xfs_mount *mp, unsigned int type,
 		 struct xfs_trans_res *res),
 	TP_ARGS(mp, type, res),
@@ -3540,7 +3540,15 @@ TRACE_EVENT(xfs_trans_resv_calc,
 		  __entry->logres,
 		  __entry->logcount,
 		  __entry->logflags)
-);
+)
+
+#define DEFINE_TRANS_RESV_EVENT(name) \
+DEFINE_EVENT(xfs_trans_resv_class, name, \
+	TP_PROTO(struct xfs_mount *mp, unsigned int type, \
+		 struct xfs_trans_res *res), \
+	TP_ARGS(mp, type, res))
+DEFINE_TRANS_RESV_EVENT(xfs_trans_resv_calc);
+DEFINE_TRANS_RESV_EVENT(xfs_trans_resv_calc_logsize);
 
 DECLARE_EVENT_CLASS(xfs_trans_class,
 	TP_PROTO(struct xfs_trans *tp, unsigned long caller_ip),


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 4/6] xfs: reduce the absurdly large log reservations
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
                   ` (2 preceding siblings ...)
  2022-04-14 22:54 ` [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 22:51   ` Dave Chinner
  2022-04-14 22:54 ` [PATCH 5/6] xfs: reduce transaction reservations with reflink Darrick J. Wong
  2022-04-14 22:54 ` [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents Darrick J. Wong
  5 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Back in the early days of reflink and rmap development I set the
transaction reservation sizes to be overly generous for rmap+reflink
filesystems, and a little under-generous for rmap-only filesystems.

Since we don't need *eight* transaction rolls to handle three new log
intent items, decrease the logcounts to what we actually need, and amend
the shadow reservation computation function to reflect what we used to
do so that the minimum log size doesn't change.
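
As a concrete reading of the hunks below (assuming a filesystem with
both reflink and rmapbt enabled; "logcount" is shorthand for
resp->tr_write.tr_logcount), the tr_write roll count now works out to:

	logcount  = XFS_WRITE_LOG_COUNT;	/* base: 2 rolls */
	logcount += 1;	/* BUI items (reflink || rmapbt) */
	logcount += 1;	/* CUI (refcount) items (reflink) */
	logcount += 1;	/* RUI (rmap) items (rmapbt) */
	/* total: 5 rolls, down from XFS_WRITE_LOG_COUNT_REFLINK (8) */

tr_itruncate and tr_qm_dqalloc get the same treatment.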

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_trans_resv.c |   88 +++++++++++++++++++++++++++-------------
 fs/xfs/libxfs/xfs_trans_resv.h |    6 ++-
 2 files changed, 64 insertions(+), 30 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 12d4e451e70e..8d2f04dddb65 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -814,36 +814,16 @@ xfs_trans_resv_calc(
 	struct xfs_mount	*mp,
 	struct xfs_trans_resv	*resp)
 {
-	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
-
-	/*
-	 * In the early days of rmap+reflink, we always set the rmap maxlevels
-	 * to 9 even if the AG was small enough that it would never grow to
-	 * that height.  Transaction reservation sizes influence the minimum
-	 * log size calculation, which influences the size of the log that mkfs
-	 * creates.  Use the old value here to ensure that newly formatted
-	 * small filesystems will mount on older kernels.
-	 */
-	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
-		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
-
 	/*
 	 * The following transactions are logged in physical format and
 	 * require a permanent reservation on space.
 	 */
 	resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
-	if (xfs_has_reflink(mp))
-		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
-	else
-		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
+	resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
-	if (xfs_has_reflink(mp))
-		resp->tr_itruncate.tr_logcount =
-				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
-	else
-		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
+	resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
 	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
@@ -900,10 +880,7 @@ xfs_trans_resv_calc(
 	resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
-	if (xfs_has_reflink(mp))
-		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
-	else
-		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
+	resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
 	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	/*
@@ -930,8 +907,26 @@ xfs_trans_resv_calc(
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
 
-	/* Put everything back the way it was.  This goes at the end. */
-	mp->m_rmap_maxlevels = rmap_maxlevels;
+	/* Add one logcount for BUI items that appear with rmap or reflink. */
+	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
+		resp->tr_itruncate.tr_logcount++;
+		resp->tr_write.tr_logcount++;
+		resp->tr_qm_dqalloc.tr_logcount++;
+	}
+
+	/* Add one logcount for refcount intent items. */
+	if (xfs_has_reflink(mp)) {
+		resp->tr_itruncate.tr_logcount++;
+		resp->tr_write.tr_logcount++;
+		resp->tr_qm_dqalloc.tr_logcount++;
+	}
+
+	/* Add one logcount for rmap intent items. */
+	if (xfs_has_rmapbt(mp)) {
+		resp->tr_itruncate.tr_logcount++;
+		resp->tr_write.tr_logcount++;
+		resp->tr_qm_dqalloc.tr_logcount++;
+	}
 }
 
 /*
@@ -943,5 +938,42 @@ xfs_trans_resv_calc_logsize(
 	struct xfs_mount	*mp,
 	struct xfs_trans_resv	*resp)
 {
+	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
+
+	ASSERT(resp != M_RES(mp));
+
+	/*
+	 * In the early days of rmap+reflink, we always set the rmap maxlevels
+	 * to 9 even if the AG was small enough that it would never grow to
+	 * that height.  Transaction reservation sizes influence the minimum
+	 * log size calculation, which influences the size of the log that mkfs
+	 * creates.  Use the old value here to ensure that newly formatted
+	 * small filesystems will mount on older kernels.
+	 */
+	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
+		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
+
 	xfs_trans_resv_calc(mp, resp);
+
+	if (xfs_has_reflink(mp)) {
+		/*
+		 * In the early days of reflink we set the logcounts absurdly
+		 * high.
+		 */
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+		resp->tr_itruncate.tr_logcount =
+				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+	} else if (xfs_has_rmapbt(mp)) {
+		/*
+		 * In the early days of non-reflink rmap we set the logcount
+		 * too low.
+		 */
+		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
+		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
+		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
+	}
+
+	/* Put everything back the way it was.  This goes at the end. */
+	mp->m_rmap_maxlevels = rmap_maxlevels;
 }
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 9fa4863f72a4..461859f4a745 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -73,7 +73,6 @@ struct xfs_trans_resv {
 #define	XFS_DEFAULT_LOG_COUNT		1
 #define	XFS_DEFAULT_PERM_LOG_COUNT	2
 #define	XFS_ITRUNCATE_LOG_COUNT		2
-#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
 #define XFS_INACTIVE_LOG_COUNT		2
 #define	XFS_CREATE_LOG_COUNT		2
 #define	XFS_CREATE_TMPFILE_LOG_COUNT	2
@@ -83,12 +82,15 @@ struct xfs_trans_resv {
 #define	XFS_LINK_LOG_COUNT		2
 #define	XFS_RENAME_LOG_COUNT		2
 #define	XFS_WRITE_LOG_COUNT		2
-#define	XFS_WRITE_LOG_COUNT_REFLINK	8
 #define	XFS_ADDAFORK_LOG_COUNT		2
 #define	XFS_ATTRINVAL_LOG_COUNT		1
 #define	XFS_ATTRSET_LOG_COUNT		3
 #define	XFS_ATTRRM_LOG_COUNT		3
 
+/* Absurdly high log counts from the early days of reflink.  Do not use. */
+#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
+#define	XFS_WRITE_LOG_COUNT_REFLINK	8
+
 void xfs_trans_resv_calc_logsize(struct xfs_mount *mp,
 		struct xfs_trans_resv *resp);
 void xfs_trans_resv_calc(struct xfs_mount *mp, struct xfs_trans_resv *resp);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 5/6] xfs: reduce transaction reservations with reflink
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
                   ` (3 preceding siblings ...)
  2022-04-14 22:54 ` [PATCH 4/6] xfs: reduce the absurdly large log reservations Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 23:42   ` Dave Chinner
  2022-04-14 22:54 ` [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents Darrick J. Wong
  5 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Before the introduction of deferred refcount operations, reflink
would try to cram refcount btree updates into the same transaction as an
allocation or a free event.  Mainline XFS has never actually done that,
but we never refactored the transaction reservations to reflect that we
now do all refcount updates in separate transactions.  Fix this to
reduce the transaction reservation size even further, so that between
this patch and the previous one, we reduce the tr_write and tr_itruncate
sizes by 66%.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_refcount.c   |    9 +++-
 fs/xfs/libxfs/xfs_trans_resv.c |   97 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 98 insertions(+), 8 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
index a07ebaecba73..e53544d52ee2 100644
--- a/fs/xfs/libxfs/xfs_refcount.c
+++ b/fs/xfs/libxfs/xfs_refcount.c
@@ -886,8 +886,13 @@ xfs_refcount_still_have_space(
 {
 	unsigned long			overhead;
 
-	overhead = cur->bc_ag.refc.shape_changes *
-			xfs_allocfree_log_count(cur->bc_mp, 1);
+	/*
+	 * Worst case estimate: full splits of the free space and rmap btrees
+	 * to handle each of the shape changes to the refcount btree.
+	 */
+	overhead = xfs_allocfree_log_count(cur->bc_mp,
+				cur->bc_ag.refc.shape_changes);
+	overhead += cur->bc_mp->m_refc_maxlevels;
 	overhead *= cur->bc_mp->m_sb.sb_blocksize;
 
 	/*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index 8d2f04dddb65..e5c3fcc2ab15 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -56,8 +56,7 @@ xfs_calc_buf_res(
  * Per-extent log reservation for the btree changes involved in freeing or
  * allocating an extent.  In classic XFS there were two trees that will be
  * modified (bnobt + cntbt).  With rmap enabled, there are three trees
- * (rmapbt).  With reflink, there are four trees (refcountbt).  The number of
- * blocks reserved is based on the formula:
+ * (rmapbt).  The number of blocks reserved is based on the formula:
  *
  * num trees * ((2 blocks/level * max depth) - 1)
  *
@@ -73,12 +72,23 @@ xfs_allocfree_log_count(
 	blocks = num_ops * 2 * (2 * mp->m_alloc_maxlevels - 1);
 	if (xfs_has_rmapbt(mp))
 		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
-	if (xfs_has_reflink(mp))
-		blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
 
 	return blocks;
 }
 
+/*
+ * Per-extent log reservation for refcount btree changes.  These are never done
+ * in the same transaction as an allocation or a free, so we compute them
+ * separately.
+ */
+static unsigned int
+xfs_refcount_log_count(
+	struct xfs_mount	*mp,
+	unsigned int		num_ops)
+{
+	return num_ops * (2 * mp->m_refc_maxlevels - 1);
+}
+
 /*
  * Logging inodes is really tricksy. They are logged in memory format,
  * which means that what we write into the log doesn't directly translate into
@@ -233,6 +243,28 @@ xfs_rtalloc_log_count(
  * register overflow from temporaries in the calculations.
  */
 
+/*
+ * Compute the log reservation required to handle the refcount update
+ * transaction.  Refcount updates are always done via deferred log items.
+ *
+ * This is calculated as:
+ * Data device refcount updates (t1):
+ *    the agfs of the ags containing the blocks: nr_ops * sector size
+ *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
+ */
+static unsigned int
+xfs_refcount_log_reservation(
+	struct xfs_mount	*mp,
+	unsigned int		nr_ops)
+{
+	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
+
+	if (!xfs_has_reflink(mp))
+		return 0;
+
+	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
+	       xfs_calc_buf_res(xfs_refcount_log_count(mp, nr_ops), blksz);
+}
 
 /*
  * In a write transaction we can allocate a maximum of 2
@@ -255,12 +287,13 @@ xfs_rtalloc_log_count(
  *    the agfls of the ags containing the blocks: 2 * sector size
  *    the super block free block counter: sector size
  *    the allocation btrees: 2 exts * 2 trees * (2 * max depth - 1) * block size
+ * And any refcount updates that happen in a separate transaction (t4).
  */
 STATIC uint
 xfs_calc_write_reservation(
 	struct xfs_mount	*mp)
 {
-	unsigned int		t1, t2, t3;
+	unsigned int		t1, t2, t3, t4;
 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
 
 	t1 = xfs_calc_inode_res(mp, 1) +
@@ -282,7 +315,9 @@ xfs_calc_write_reservation(
 	t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
 	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
 
-	return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
+	t4 = xfs_refcount_log_reservation(mp, 1);
+
+	return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
 }
 
 /*
@@ -303,10 +338,42 @@ xfs_calc_write_reservation(
  *    the realtime summary: 2 exts * 1 block
  *    worst case split in allocation btrees per extent assuming 2 extents:
  *		2 exts * 2 trees * (2 * max depth - 1) * block size
+ * And any refcount updates that happen in a separate transaction (t4).
  */
 STATIC uint
 xfs_calc_itruncate_reservation(
 	struct xfs_mount	*mp)
+{
+	unsigned int		t1, t2, t3, t4;
+	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
+
+	t1 = xfs_calc_inode_res(mp, 1) +
+	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
+
+	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
+	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
+
+	if (xfs_has_realtime(mp)) {
+		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
+		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
+		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
+	} else {
+		t3 = 0;
+	}
+
+	t4 = xfs_refcount_log_reservation(mp, 2);
+
+	return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
+}
+
+/*
+ * For log size calculation, this is the same as above except that we used to
+ * include refcount updates in the allocfree computation even though we've
+ * always run them as a separate transaction.
+ */
+STATIC uint
+xfs_calc_itruncate_reservation_logsize(
+	struct xfs_mount	*mp)
 {
 	unsigned int		t1, t2, t3;
 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
@@ -317,6 +384,9 @@ xfs_calc_itruncate_reservation(
 	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
 	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
 
+	if (xfs_has_reflink(mp))
+	     t2 += xfs_calc_buf_res(xfs_refcount_log_count(mp, 4), blksz);
+
 	if (xfs_has_realtime(mp)) {
 		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
@@ -956,6 +1026,9 @@ xfs_trans_resv_calc_logsize(
 	xfs_trans_resv_calc(mp, resp);
 
 	if (xfs_has_reflink(mp)) {
+		unsigned int	t4;
+		unsigned int	blksz = XFS_FSB_TO_B(mp, 1);
+
 		/*
 		 * In the early days of reflink we set the logcounts absurdly
 		 * high.
@@ -964,6 +1037,18 @@ xfs_trans_resv_calc_logsize(
 		resp->tr_itruncate.tr_logcount =
 				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
 		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
+
+		/*
+		 * We also used to account two refcount updates per extent into
+		 * the alloc/free step of write and truncate calls, even though
+		 * those are run in separate transactions.
+		 */
+		t4 = xfs_calc_buf_res(xfs_refcount_log_count(mp, 2), blksz);
+		resp->tr_write.tr_logres += t4;
+		resp->tr_qm_dqalloc.tr_logres += t4;
+
+		resp->tr_itruncate.tr_logres =
+				xfs_calc_itruncate_reservation_logsize(mp);
 	} else if (xfs_has_rmapbt(mp)) {
 		/*
 		 * In the early days of non-reflink rmap we set the logcount


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents
  2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
                   ` (4 preceding siblings ...)
  2022-04-14 22:54 ` [PATCH 5/6] xfs: reduce transaction reservations with reflink Darrick J. Wong
@ 2022-04-14 22:54 ` Darrick J. Wong
  2022-04-22 23:50   ` Dave Chinner
  5 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-14 22:54 UTC (permalink / raw)
  To: djwong, david; +Cc: linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Currently, the code that performs CoW remapping after a write has this
odd behavior where it walks /backwards/ through the data fork to remap
extents in reverse order.  Earlier, we rewrote the reflink remap
function to use deferred bmap log items instead of trying to cram as
much into the first transaction that we could.  Now do the same for the
CoW remap code.  There doesn't seem to be any performance impact; we're
just making better use of code that we added for the benefit of reflink.
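
Stripped of error handling, the core of the per-extent remap step in the
hunks below now queues deferred intent items rather than calling
__xfs_bunmapi directly (all calls shown are taken from the diff):

	/* unmap the old data fork blocks and drop their refcount */
	xfs_bmap_unmap_extent(tp, ip, &data);
	xfs_refcount_decrease_extent(tp, &data);
	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
			-data.br_blockcount);

	/* free the CoW orphan record */
	xfs_refcount_free_cow_extent(tp, del.br_startblock, del.br_blockcount);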

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_reflink.c |   88 ++++++++++++++++++++++++++++++++------------------
 fs/xfs/xfs_trace.h   |    3 +-
 2 files changed, 58 insertions(+), 33 deletions(-)


diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 09bedbfef01a..27f5b17d03aa 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -586,21 +586,21 @@ xfs_reflink_cancel_cow_range(
 STATIC int
 xfs_reflink_end_cow_extent(
 	struct xfs_inode	*ip,
-	xfs_fileoff_t		offset_fsb,
-	xfs_fileoff_t		*end_fsb)
+	xfs_fileoff_t		*offset_fsb,
+	xfs_fileoff_t		end_fsb)
 {
-	struct xfs_bmbt_irec	got, del;
 	struct xfs_iext_cursor	icur;
+	struct xfs_bmbt_irec	got, del, data;
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, XFS_COW_FORK);
-	xfs_filblks_t		rlen;
 	unsigned int		resblks;
+	int			nmaps;
 	int			error;
 
 	/* No COW extents?  That's easy! */
 	if (ifp->if_bytes == 0) {
-		*end_fsb = offset_fsb;
+		*offset_fsb = end_fsb;
 		return 0;
 	}
 
@@ -628,42 +628,66 @@ xfs_reflink_end_cow_extent(
 	 * left by the time I/O completes for the loser of the race.  In that
 	 * case we are done.
 	 */
-	if (!xfs_iext_lookup_extent_before(ip, ifp, end_fsb, &icur, &got) ||
-	    got.br_startoff + got.br_blockcount <= offset_fsb) {
-		*end_fsb = offset_fsb;
+	if (!xfs_iext_lookup_extent(ip, ifp, *offset_fsb, &icur, &got) ||
+	    got.br_startoff >= end_fsb) {
+		*offset_fsb = end_fsb;
 		goto out_cancel;
 	}
 
 	/*
-	 * Structure copy @got into @del, then trim @del to the range that we
-	 * were asked to remap.  We preserve @got for the eventual CoW fork
+	 * Only remap real extents that contain data.  With AIO, speculative
+	 * preallocations can leak into the range we are called upon, and we
+	 * need to skip them.  Preserve @got for the eventual CoW fork
 	 * deletion; from now on @del represents the mapping that we're
 	 * actually remapping.
 	 */
+	while (!xfs_bmap_is_written_extent(&got)) {
+		if (!xfs_iext_next_extent(ifp, &icur, &got) ||
+		    got.br_startoff >= end_fsb) {
+			*offset_fsb = end_fsb;
+			goto out_cancel;
+		}
+	}
 	del = got;
-	xfs_trim_extent(&del, offset_fsb, *end_fsb - offset_fsb);
 
-	ASSERT(del.br_blockcount > 0);
-
-	/*
-	 * Only remap real extents that contain data.  With AIO, speculative
-	 * preallocations can leak into the range we are called upon, and we
-	 * need to skip them.
-	 */
-	if (!xfs_bmap_is_written_extent(&got)) {
-		*end_fsb = del.br_startoff;
-		goto out_cancel;
-	}
-
-	/* Unmap the old blocks in the data fork. */
-	rlen = del.br_blockcount;
-	error = __xfs_bunmapi(tp, ip, del.br_startoff, &rlen, 0, 1);
+	/* Grab the corresponding mapping in the data fork. */
+	nmaps = 1;
+	error = xfs_bmapi_read(ip, del.br_startoff, del.br_blockcount, &data,
+			&nmaps, 0);
 	if (error)
 		goto out_cancel;
 
-	/* Trim the extent to whatever got unmapped. */
-	xfs_trim_extent(&del, del.br_startoff + rlen, del.br_blockcount - rlen);
-	trace_xfs_reflink_cow_remap(ip, &del);
+	/* We can only remap the smaller of the two extent sizes. */
+	data.br_blockcount = min(data.br_blockcount, del.br_blockcount);
+	del.br_blockcount = data.br_blockcount;
+
+	trace_xfs_reflink_cow_remap_from(ip, &del);
+	trace_xfs_reflink_cow_remap_to(ip, &data);
+
+	if (xfs_bmap_is_real_extent(&data)) {
+		/*
+		 * If the extent we're remapping is backed by storage (written
+		 * or not), unmap the extent and drop its refcount.
+		 */
+		xfs_bmap_unmap_extent(tp, ip, &data);
+		xfs_refcount_decrease_extent(tp, &data);
+		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,
+				-data.br_blockcount);
+	} else if (data.br_startblock == DELAYSTARTBLOCK) {
+		int		done;
+
+		/*
+		 * If the extent we're remapping is a delalloc reservation,
+		 * we can use the regular bunmapi function to release the
+		 * incore state.  Dropping the delalloc reservation takes care
+		 * of the quota reservation for us.
+		 */
+		error = xfs_bunmapi(NULL, ip, data.br_startoff,
+				data.br_blockcount, 0, 1, &done);
+		if (error)
+			goto out_cancel;
+		ASSERT(done);
+	}
 
 	/* Free the CoW orphan record. */
 	xfs_refcount_free_cow_extent(tp, del.br_startblock, del.br_blockcount);
@@ -684,7 +708,7 @@ xfs_reflink_end_cow_extent(
 		return error;
 
 	/* Update the caller about how much progress we made. */
-	*end_fsb = del.br_startoff;
+	*offset_fsb = del.br_startoff + del.br_blockcount;
 	return 0;
 
 out_cancel:
@@ -712,7 +736,7 @@ xfs_reflink_end_cow(
 	end_fsb = XFS_B_TO_FSB(ip->i_mount, offset + count);
 
 	/*
-	 * Walk backwards until we're out of the I/O range.  The loop function
+	 * Walk forwards until we've remapped the I/O range.  The loop function
 	 * repeatedly cycles the ILOCK to allocate one transaction per remapped
 	 * extent.
 	 *
@@ -744,7 +768,7 @@ xfs_reflink_end_cow(
 	 * blocks will be remapped.
 	 */
 	while (end_fsb > offset_fsb && !error)
-		error = xfs_reflink_end_cow_extent(ip, offset_fsb, &end_fsb);
+		error = xfs_reflink_end_cow_extent(ip, &offset_fsb, end_fsb);
 
 	if (error)
 		trace_xfs_reflink_end_cow_error(ip, error, _RET_IP_);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index d24784ab8cea..32cfc62f2777 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3421,7 +3421,8 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_convert_cow);
 
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_cancel_cow_range);
 DEFINE_SIMPLE_IO_EVENT(xfs_reflink_end_cow);
-DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_from);
+DEFINE_INODE_IREC_EVENT(xfs_reflink_cow_remap_to);
 
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_cancel_cow_range_error);
 DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
@ 2022-04-22 22:01   ` Dave Chinner
  2022-04-22 22:18     ` Darrick J. Wong
  2022-04-26 13:46   ` Christoph Hellwig
  1 sibling, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 22:01 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
> data forks of shared files to avoid two failure scenarios: one where the
> extent being unmapped is so sparsely shared that we exceed the
> transaction reservation with the sheer number of refcount btree updates
> and EFI intent items; and the other where we attach so many deferred
> updates to the transaction that we pin the log tail and later the log
> head meets the tail, causing the log to livelock.
> 
> We avoid triggering the first problem by tracking the number of ops in
> the refcount btree cursor and forcing a requeue of the refcount intent
> item any time we think that we might be close to overflowing.  This has
> been baked into XFS since before the original e1a4 patch.
> 
> A recent patchset fixed the second problem by changing the deferred ops
> code to finish all the work items created by each round of trying to
> complete a refcount intent item, which eliminates the long chains of
> deferred items (27dad); and by causing long-running transactions to relog
> their intent log items when space in the log gets low (74f4d).
> 
> Because this clamp affects /any/ unmapping request regardless of the
> sharing factors of the component blocks, it degrades the performance of
> all large unmapping requests -- whereas with an unshared file we can
> unmap millions of blocks in one go, shared files are limited to
> unmapping a few thousand blocks at a time, which causes the upper level
> code to spin in a bunmapi loop even if it wasn't needed.
> 
> This also eliminates one more place where log recovery behavior can
> differ from online behavior, because bunmapi operations no longer need
> to requeue.
> 
> Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
> Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
> Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_bmap.c     |   22 +---------------------
>  fs/xfs/libxfs/xfs_refcount.c |    5 ++---
>  fs/xfs/libxfs/xfs_refcount.h |    8 ++------
>  3 files changed, 5 insertions(+), 30 deletions(-)

This looks reasonable, but I'm wondering how the original problem
was discovered and whether this has been tested against that
original problem situation to ensure we aren't introducing a
regression here....

> diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> index 9eb01edbd89d..6b265f6075b8 100644
> --- a/fs/xfs/libxfs/xfs_refcount.h
> +++ b/fs/xfs/libxfs/xfs_refcount.h
> @@ -66,15 +66,11 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
>   * reservation and crash the fs.  Each record adds 12 bytes to the
>   * log (plus any key updates) so we'll conservatively assume 32 bytes
>   * per record.  We must also leave space for btree splits on both ends
> - * of the range and space for the CUD and a new CUI.
> + * of the range and space for the CUD and a new CUI.  Each EFI that we
> + * attach to the transaction also consumes ~32 bytes.
>   */
>  #define XFS_REFCOUNT_ITEM_OVERHEAD	32

FWIW, I think this is a low-ball number - each EFI also consumes an
ophdr (12 bytes) for the region identifier in the log, so it's
actually 44 bytes, not 32 bytes that will be consumed.  It is not
necessary to address this in this patchset, though.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink
  2022-04-14 22:54 ` [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink Darrick J. Wong
@ 2022-04-22 22:03   ` Dave Chinner
  2022-04-26 13:47   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 22:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:37PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This raw call isn't necessary since we can always remove a full delalloc
> extent.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_reflink.c |    7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)

Looks fine - the assert will catch it if it turns out this is
incorrect in some case.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-22 22:01   ` Dave Chinner
@ 2022-04-22 22:18     ` Darrick J. Wong
  2022-04-22 23:51       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-22 22:18 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 23, 2022 at 08:01:20AM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
> > data forks of shared files to avoid two failure scenarios: one where the
> > extent being unmapped is so sparsely shared that we exceed the
> > transaction reservation with the sheer number of refcount btree updates
> > and EFI intent items; and the other where we attach so many deferred
> > updates to the transaction that we pin the log tail and later the log
> > head meets the tail, causing the log to livelock.
> > 
> > We avoid triggering the first problem by tracking the number of ops in
> > the refcount btree cursor and forcing a requeue of the refcount intent
> > item any time we think that we might be close to overflowing.  This has
> > been baked into XFS since before the original e1a4 patch.
> > 
> > A recent patchset fixed the second problem by changing the deferred ops
> > code to finish all the work items created by each round of trying to
> > complete a refcount intent item, which eliminates the long chains of
> > deferred items (27dad); and by causing long-running transactions to relog
> > their intent log items when space in the log gets low (74f4d).
> > 
> > Because this clamp affects /any/ unmapping request regardless of the
> > sharing factors of the component blocks, it degrades the performance of
> > all large unmapping requests -- whereas with an unshared file we can
> > unmap millions of blocks in one go, shared files are limited to
> > unmapping a few thousand blocks at a time, which causes the upper level
> > code to spin in a bunmapi loop even if it wasn't needed.
> > 
> > This also eliminates one more place where log recovery behavior can
> > differ from online behavior, because bunmapi operations no longer need
> > to requeue.
> > 
> > Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
> > Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops ar finished")
> > Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c     |   22 +---------------------
> >  fs/xfs/libxfs/xfs_refcount.c |    5 ++---
> >  fs/xfs/libxfs/xfs_refcount.h |    8 ++------
> >  3 files changed, 5 insertions(+), 30 deletions(-)
> 
> This looks reasonable, but I'm wondering how the original problem
> was discovered and whether this has been tested against that
> original problem situation to ensure we aren't introducing a
> regression here....

generic/447, and yes, I have forced it to run a deletion of 1 million
extents without incident. :)

I should probably amend that test to note that it's an exerciser for
e1a4e37cc7b6.

> > diff --git a/fs/xfs/libxfs/xfs_refcount.h b/fs/xfs/libxfs/xfs_refcount.h
> > index 9eb01edbd89d..6b265f6075b8 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.h
> > +++ b/fs/xfs/libxfs/xfs_refcount.h
> > @@ -66,15 +66,11 @@ extern int xfs_refcount_recover_cow_leftovers(struct xfs_mount *mp,
> >   * reservation and crash the fs.  Each record adds 12 bytes to the
> >   * log (plus any key updates) so we'll conservatively assume 32 bytes
> >   * per record.  We must also leave space for btree splits on both ends
> > - * of the range and space for the CUD and a new CUI.
> > + * of the range and space for the CUD and a new CUI.  Each EFI that we
> > + * attach to the transaction also consumes ~32 bytes.
> >   */
> >  #define XFS_REFCOUNT_ITEM_OVERHEAD	32
> 
> FWIW, I think this is a low-ball number - each EFI also consumes an
> ophdr (12 bytes) for the region identifier in the log, so it's
> actually 44 bytes, not 32 bytes that will be consumed.  It is not
> necessary to address this in this patchset, though.

<Nod>

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size
  2022-04-14 22:54 ` [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size Darrick J. Wong
@ 2022-04-22 22:36   ` Dave Chinner
  2022-04-25 23:39     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 22:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:42PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Every time someone changes the transaction reservation sizes, they
> introduce potential compatibility problems if the changes affect the
> minimum log size that we validate at mount time.  If the minimum log
> size gets larger (which should be avoided because doing so presents a
> serious risk of log livelock), filesystems created with old mkfs will
> not mount on a newer kernel; if the minimum size shrinks, filesystems
> created with newer mkfs will not mount on older kernels.
> 
> Therefore, enable the creation of a shadow log reservation structure
> where we can "undo" the effects of tweaks when computing minimum log
> sizes.  These shadow reservations should never be used in practice, but
> they insulate us from perturbations in minimum log size.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_log_rlimit.c |   17 +++++++++++++----
>  fs/xfs/libxfs/xfs_trans_resv.c |   12 ++++++++++++
>  fs/xfs/libxfs/xfs_trans_resv.h |    2 ++
>  fs/xfs/xfs_trace.h             |   12 ++++++++++--
>  4 files changed, 37 insertions(+), 6 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
> index 67798ff5e14e..2bafc69cac15 100644
> --- a/fs/xfs/libxfs/xfs_log_rlimit.c
> +++ b/fs/xfs/libxfs/xfs_log_rlimit.c
> @@ -14,6 +14,7 @@
>  #include "xfs_trans_space.h"
>  #include "xfs_da_btree.h"
>  #include "xfs_bmap_btree.h"
> +#include "xfs_trace.h"
>  
>  /*
>   * Calculate the maximum length in bytes that would be required for a local
> @@ -47,18 +48,25 @@ xfs_log_get_max_trans_res(
>  	struct xfs_trans_res	*max_resp)
>  {
>  	struct xfs_trans_res	*resp;
> +	struct xfs_trans_res	*start_resp;
>  	struct xfs_trans_res	*end_resp;
> +	struct xfs_trans_resv	*resv;
>  	int			log_space = 0;
>  	int			attr_space;
>  
>  	attr_space = xfs_log_calc_max_attrsetm_res(mp);
>  
> -	resp = (struct xfs_trans_res *)M_RES(mp);
> -	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
> -	for (; resp < end_resp; resp++) {
> +	resv = kmem_zalloc(sizeof(struct xfs_trans_resv), 0);
> +	xfs_trans_resv_calc_logsize(mp, resv);
> +
> +	start_resp = (struct xfs_trans_res *)resv;
> +	end_resp = (struct xfs_trans_res *)(resv + 1);
> +	for (resp = start_resp; resp < end_resp; resp++) {
>  		int		tmp = resp->tr_logcount > 1 ?
>  				      resp->tr_logres * resp->tr_logcount :
>  				      resp->tr_logres;
> +
> +		trace_xfs_trans_resv_calc_logsize(mp, resp - start_resp, resp);
>  		if (log_space < tmp) {
>  			log_space = tmp;
>  			*max_resp = *resp;		/* struct copy */

This took me a while to get my head around. The minimum logsize
calculation stuff is all a bit of a mess.

Essentially, we call xfs_log_get_max_trans_res() from two places.
One is to calculate the minimum log size, the other is the
transaction reservation tracing code done when M_RES(mp) is set up
via xfs_trans_trace_reservations().  We don't need the call from
xfs_trans_trace_reservations() - it's trivial to scan the list of
tracepoints emitted by this function at mount time to find the
maximum reservation.

Hence I think we should start by removing that call to this
function, and making this a static function called only from
xfs_log_calc_minimum_size().

At this point, we can use an on-stack struct xfs_trans_resv for the
calculated values - no need for memory allocation here as we will
never be short of stack space in this path.

The tracing in the loop also wants an integer index into the struct
xfs_trans_resv structure, so it should be changed to match what
xfs_trans_trace_reservations() does:

	struct xfs_trans_resv	resv;
	struct xfs_trans_res	*resp;
	struct xfs_trans_res	*end_resp;

	....

	xfs_trans_resv_calc(mp, &resv);

	resp = (struct xfs_trans_res *)&resv;
	end_resp = (struct xfs_trans_res *)(&resv + 1);
	for (i = 0; resp < end_resp; i++, resp++) {
		.....
		trace_xfs_trans_resv_calc_logsize(mp, i, resp);
		....
	}

> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 6f83d9b306ee..12d4e451e70e 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -933,3 +933,15 @@ xfs_trans_resv_calc(
>  	/* Put everything back the way it was.  This goes at the end. */
>  	mp->m_rmap_maxlevels = rmap_maxlevels;
>  }
> +
> +/*
> + * Compute an alternate set of log reservation sizes for use exclusively with
> + * minimum log size calculations.
> + */
> +void
> +xfs_trans_resv_calc_logsize(
> +	struct xfs_mount	*mp,
> +	struct xfs_trans_resv	*resp)
> +{
> +	xfs_trans_resv_calc(mp, resp);
> +}

This function and its name were what confused me for a while - I
don't think it belongs in this patch, and I don't think it belongs
in this file when it's filled out in the next patch. It's basically
handling things specific to minimum log size calculations, so with
the above mods I think it should also end up being static to
libxfs/xfs_log_rlimit.c.

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] xfs: reduce the absurdly large log reservations
  2022-04-14 22:54 ` [PATCH 4/6] xfs: reduce the absurdly large log reservations Darrick J. Wong
@ 2022-04-22 22:51   ` Dave Chinner
  2022-04-25 23:47     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 22:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:48PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Back in the early days of reflink and rmap development I set the
> transaction reservation sizes to be overly generous for rmap+reflink
> filesystems, and a little under-generous for rmap-only filesystems.
> 
> Since we don't need *eight* transaction rolls to handle three new log
> intent items, decrease the logcounts to what we actually need, and amend
> the shadow reservation computation function to reflect what we used to
> do so that the minimum log size doesn't change.

Yup, this will make a huge difference to the number of transactions
we can have in flight on reflink/rmap enabled filesystems.

Mostly looks good, some comments about code and comments below.

> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_trans_resv.c |   88 +++++++++++++++++++++++++++-------------
>  fs/xfs/libxfs/xfs_trans_resv.h |    6 ++-
>  2 files changed, 64 insertions(+), 30 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 12d4e451e70e..8d2f04dddb65 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -814,36 +814,16 @@ xfs_trans_resv_calc(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans_resv	*resp)
>  {
> -	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> -
> -	/*
> -	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> -	 * to 9 even if the AG was small enough that it would never grow to
> -	 * that height.  Transaction reservation sizes influence the minimum
> -	 * log size calculation, which influences the size of the log that mkfs
> -	 * creates.  Use the old value here to ensure that newly formatted
> -	 * small filesystems will mount on older kernels.
> -	 */
> -	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> -		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> -
>  	/*
>  	 * The following transactions are logged in physical format and
>  	 * require a permanent reservation on space.
>  	 */
>  	resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
> -	if (xfs_has_reflink(mp))
> -		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> -	else
> -		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> +	resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
>  	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
> -	if (xfs_has_reflink(mp))
> -		resp->tr_itruncate.tr_logcount =
> -				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
> -	else
> -		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> +	resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
>  	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
> @@ -900,10 +880,7 @@ xfs_trans_resv_calc(
>  	resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
> -	if (xfs_has_reflink(mp))
> -		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> -	else
> -		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> +	resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
>  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	/*
> @@ -930,8 +907,26 @@ xfs_trans_resv_calc(
>  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
>  
> -	/* Put everything back the way it was.  This goes at the end. */
> -	mp->m_rmap_maxlevels = rmap_maxlevels;
> +	/* Add one logcount for BUI items that appear with rmap or reflink. */
> +	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
> +		resp->tr_itruncate.tr_logcount++;
> +		resp->tr_write.tr_logcount++;
> +		resp->tr_qm_dqalloc.tr_logcount++;
> +	}
> +
> +	/* Add one logcount for refcount intent items. */
> +	if (xfs_has_reflink(mp)) {
> +		resp->tr_itruncate.tr_logcount++;
> +		resp->tr_write.tr_logcount++;
> +		resp->tr_qm_dqalloc.tr_logcount++;
> +	}
> +
> +	/* Add one logcount for rmap intent items. */
> +	if (xfs_has_rmapbt(mp)) {
> +		resp->tr_itruncate.tr_logcount++;
> +		resp->tr_write.tr_logcount++;
> +		resp->tr_qm_dqalloc.tr_logcount++;
> +	}

This would be much more concisely written as

	count = 0;
	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
		count = 2;
		if (xfs_has_reflink(mp) && xfs_has_rmapbt(mp))
			count++;
	}

	resp->tr_itruncate.tr_logcount += count;
	resp->tr_write.tr_logcount += count;
	resp->tr_qm_dqalloc.tr_logcount += count;

>  }
>  
>  /*
> @@ -943,5 +938,42 @@ xfs_trans_resv_calc_logsize(
>  	struct xfs_mount	*mp,
>  	struct xfs_trans_resv	*resp)
>  {
> +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> +
> +	ASSERT(resp != M_RES(mp));
> +
> +	/*
> +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> +	 * to 9 even if the AG was small enough that it would never grow to
> +	 * that height.  Transaction reservation sizes influence the minimum
> +	 * log size calculation, which influences the size of the log that mkfs
> +	 * creates.  Use the old value here to ensure that newly formatted
> +	 * small filesystems will mount on older kernels.
> +	 */
> +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> +
>  	xfs_trans_resv_calc(mp, resp);
> +
> +	if (xfs_has_reflink(mp)) {
> +		/*
> +		 * In the early days of reflink we set the logcounts absurdly
> +		 * high.

"In the early days of reflink, typical operation log counts were
greatly overestimated"

> +		 */
> +		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> +		resp->tr_itruncate.tr_logcount =
> +				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
> +		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> +	} else if (xfs_has_rmapbt(mp)) {
> +		/*
> +		 * In the early days of non-reflink rmap we set the logcount
> +		 * too low.

"In the early days of non-reflink rmap the impact of rmap btree
updates on log counts was not taken into account at all."

> +		 */
> +		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> +		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> +		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> +	}
> +
> +	/* Put everything back the way it was.  This goes at the end. */
> +	mp->m_rmap_maxlevels = rmap_maxlevels;
>  }

Yeah, so I think this should all go in xfs_log_rlimit.c as it is
specific to the minimum log size calculation.

> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 9fa4863f72a4..461859f4a745 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -73,7 +73,6 @@ struct xfs_trans_resv {
>  #define	XFS_DEFAULT_LOG_COUNT		1
>  #define	XFS_DEFAULT_PERM_LOG_COUNT	2
>  #define	XFS_ITRUNCATE_LOG_COUNT		2
> -#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
>  #define XFS_INACTIVE_LOG_COUNT		2
>  #define	XFS_CREATE_LOG_COUNT		2
>  #define	XFS_CREATE_TMPFILE_LOG_COUNT	2
> @@ -83,12 +82,15 @@ struct xfs_trans_resv {
>  #define	XFS_LINK_LOG_COUNT		2
>  #define	XFS_RENAME_LOG_COUNT		2
>  #define	XFS_WRITE_LOG_COUNT		2
> -#define	XFS_WRITE_LOG_COUNT_REFLINK	8
>  #define	XFS_ADDAFORK_LOG_COUNT		2
>  #define	XFS_ATTRINVAL_LOG_COUNT		1
>  #define	XFS_ATTRSET_LOG_COUNT		3
>  #define	XFS_ATTRRM_LOG_COUNT		3
>  
> +/* Absurdly high log counts from the early days of reflink.  Do not use. */
> +#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
> +#define	XFS_WRITE_LOG_COUNT_REFLINK	8

/*
 * Original log counts were overestimated in the early days of
 * reflink. These are retained here purely for minimum log size
 * calculations and are not to be used for runtime reservations.
 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] xfs: reduce transaction reservations with reflink
  2022-04-14 22:54 ` [PATCH 5/6] xfs: reduce transaction reservations with reflink Darrick J. Wong
@ 2022-04-22 23:42   ` Dave Chinner
  2022-04-25 23:49     ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 23:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:54PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Before the introduction of deferred refcount operations, reflink
> would try to cram refcount btree updates into the same transaction as an
> allocation or a free event.  Mainline XFS has never actually done that,
> but we never refactored the transaction reservations to reflect that we
> now do all refcount updates in separate transactions.  Fix this to
> reduce the transaction reservation size even farther, so that between
> this patch and the previous one, we reduce the tr_write and tr_itruncate
> sizes by 66%.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/libxfs/xfs_refcount.c   |    9 +++-
>  fs/xfs/libxfs/xfs_trans_resv.c |   97 ++++++++++++++++++++++++++++++++++++++--
>  2 files changed, 98 insertions(+), 8 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index a07ebaecba73..e53544d52ee2 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -886,8 +886,13 @@ xfs_refcount_still_have_space(
>  {
>  	unsigned long			overhead;
>  
> -	overhead = cur->bc_ag.refc.shape_changes *
> -			xfs_allocfree_log_count(cur->bc_mp, 1);
> +	/*
> +	 * Worst case estimate: full splits of the free space and rmap btrees
> +	 * to handle each of the shape changes to the refcount btree.
> +	 */
> +	overhead = xfs_allocfree_log_count(cur->bc_mp,
> +				cur->bc_ag.refc.shape_changes);
> +	overhead += cur->bc_mp->m_refc_maxlevels;
>  	overhead *= cur->bc_mp->m_sb.sb_blocksize;
>  
>  	/*
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index 8d2f04dddb65..e5c3fcc2ab15 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -56,8 +56,7 @@ xfs_calc_buf_res(
>   * Per-extent log reservation for the btree changes involved in freeing or
>   * allocating an extent.  In classic XFS there were two trees that will be
>   * modified (bnobt + cntbt).  With rmap enabled, there are three trees
> - * (rmapbt).  With reflink, there are four trees (refcountbt).  The number of
> - * blocks reserved is based on the formula:
> + * (rmapbt).  The number of blocks reserved is based on the formula:
>   *
>   * num trees * ((2 blocks/level * max depth) - 1)
>   *
> @@ -73,12 +72,23 @@ xfs_allocfree_log_count(
>  	blocks = num_ops * 2 * (2 * mp->m_alloc_maxlevels - 1);
>  	if (xfs_has_rmapbt(mp))
>  		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
> -	if (xfs_has_reflink(mp))
> -		blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
>  
>  	return blocks;
>  }
>  
> +/*
> + * Per-extent log reservation for refcount btree changes.  These are never done
> + * in the same transaction as an allocation or a free, so we compute them
> + * separately.
> + */
> +static unsigned int
> +xfs_refcount_log_count(
> +	struct xfs_mount	*mp,
> +	unsigned int		num_ops)
> +{
> +	return num_ops * (2 * mp->m_refc_maxlevels - 1);
> +}

This is a block count, right? My brain just went "spoing!" because I
was just looking at transaction reservation log counts, and here we
have a "log count" reservation that isn't a reservation log count
but a block count used to calculate the transaction unit
reservation...

Yeah, I know, that's my fault for naming xfs_allocfree_log_count()
the way I did way back when, but I think this really should tell the
reader it is returning a block count for the refcount btree mods.

Maybe xfs_refcountbt_block_count(), perhaps?
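
For illustration, a minimal sketch of that rename - the body stays
exactly as in the hunk above, only the name changes so it's obvious
that a block count comes back:

static unsigned int
xfs_refcountbt_block_count(
	struct xfs_mount	*mp,
	unsigned int		num_ops)
{
	/* one tree: (2 blocks/level * max depth) - 1, per the formula above */
	return num_ops * (2 * mp->m_refc_maxlevels - 1);
}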

> + * Compute the log reservation required to handle the refcount update
> + * transaction.  Refcount updates are always done via deferred log items.
> + *
> + * This is calculated as:
> + * Data device refcount updates (t1):
> + *    the agfs of the ags containing the blocks: nr_ops * sector size
> + *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
> + */
> +static unsigned int
> +xfs_refcount_log_reservation(
> +	struct xfs_mount	*mp,
> +	unsigned int		nr_ops)
> +{
> +	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> +
> +	if (!xfs_has_reflink(mp))
> +		return 0;
> +
> +	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
> +	       xfs_calc_buf_res(xfs_refcount_log_count(mp, nr_ops), blksz);
> +}

To follow convention, this calculates a reservation in bytes, so
xfs_calc_refcountbt_reservation() would match all the other
functions that do similar things...
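
Taken together with the block count rename above, a sketch of that
helper might be (body unchanged from the patch, only the names swapped):

static unsigned int
xfs_calc_refcountbt_reservation(
	struct xfs_mount	*mp,
	unsigned int		nr_ops)
{
	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);

	if (!xfs_has_reflink(mp))
		return 0;

	/* AGF headers plus the refcount btree blocks, in bytes */
	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
	       xfs_calc_buf_res(xfs_refcountbt_block_count(mp, nr_ops), blksz);
}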

.....

> @@ -303,10 +338,42 @@ xfs_calc_write_reservation(
>   *    the realtime summary: 2 exts * 1 block
>   *    worst case split in allocation btrees per extent assuming 2 extents:
>   *		2 exts * 2 trees * (2 * max depth - 1) * block size
> + * And any refcount updates that happen in a separate transaction (t4).
>   */
>  STATIC uint
>  xfs_calc_itruncate_reservation(
>  	struct xfs_mount	*mp)
> +{
> +	unsigned int		t1, t2, t3, t4;
> +	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> +
> +	t1 = xfs_calc_inode_res(mp, 1) +
> +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> +
> +	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> +	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> +
> +	if (xfs_has_realtime(mp)) {
> +		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> +		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> +	} else {
> +		t3 = 0;
> +	}
> +
> +	t4 = xfs_refcount_log_reservation(mp, 2);
> +
> +	return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
> +}
> +
> +/*
> + * For log size calculation, this is the same as above except that we used to
> + * include refcount updates in the allocfree computation even though we've
> + * always run them as a separate transaction.
> + */
> +STATIC uint
> +xfs_calc_itruncate_reservation_logsize(
> +	struct xfs_mount	*mp)
>  {
>  	unsigned int		t1, t2, t3;
>  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> @@ -317,6 +384,9 @@ xfs_calc_itruncate_reservation(
>  	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
>  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
>  
> +	if (xfs_has_reflink(mp))
> +	     t2 += xfs_calc_buf_res(xfs_refcount_log_count(mp, 4), blksz);
> +

Ok, so the legacy code gets 4 extra refcount ops accounted here
because we removed it from xfs_allocfree_log_count(), and we keep
this function around because the new code might select t4 as
the maximum reservation it needs and so we can't just add these
blocks to the newly calculated reservation. Correct?

If so, I'm not a great fan of duplicating this code when this is all
that differs between the two cases. Can we pass in a "bool
for_minlogsize" variable and do:

uint
xfs_calc_itruncate_reservation(
	struct xfs_mount	*mp,
	bool			for_minlogsize)
{
	unsigned int		t1, t2, t3, t4;
	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);

	t1 = xfs_calc_inode_res(mp, 1) +
	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);

	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);

	if (xfs_has_realtime(mp)) {
		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
	} else {
		t3 = 0;
	}

	if (!for_minlogsize) {
		t4 = xfs_refcount_log_reservation(mp, 2);
		return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
	}

	/*
	 * add reason for minlogsize calculation differences here
	 */
	if (xfs_has_reflink(mp))
		t2 += xfs_calc_buf_res(xfs_refcount_log_count(mp, 4), blksz);

	return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
}

And now we can call xfs_calc_itruncate_reservation(mp, true)
directly from xfs_log_rlimits.c to get the correct reservation for
the logsize code with minimal extra code.

>  	if (xfs_has_realtime(mp)) {
>  		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
>  		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> @@ -956,6 +1026,9 @@ xfs_trans_resv_calc_logsize(
>  	xfs_trans_resv_calc(mp, resp);
>  
>  	if (xfs_has_reflink(mp)) {
> +		unsigned int	t4;
> +		unsigned int	blksz = XFS_FSB_TO_B(mp, 1);
> +
>  		/*
>  		 * In the early days of reflink we set the logcounts absurdly
>  		 * high.
> @@ -964,6 +1037,18 @@ xfs_trans_resv_calc_logsize(
>  		resp->tr_itruncate.tr_logcount =
>  				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
>  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> +
> +		/*
> +		 * We also used to account two refcount updates per extent into
> +		 * the alloc/free step of write and truncate calls, even though
> +		 * those are run in separate transactions.
> +		 */
> +		t4 = xfs_calc_buf_res(xfs_refcount_log_count(mp, 2), blksz);
> +		resp->tr_write.tr_logres += t4;
> +		resp->tr_qm_dqalloc.tr_logres += t4;

I think these have the same problem as itruncate - the write
reservation is conditional on the maximum reservation of several
different operations, and so dropping the refcountbt from that
comparison means it could select a different max reservation. Then
if we unconditionally add the refcount blocks to that, then we end
up with a different size to what the original code calculated.

Hence I think these also need to be treated like I outline for
itruncate above.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents
  2022-04-14 22:54 ` [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents Darrick J. Wong
@ 2022-04-22 23:50   ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 23:50 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Apr 14, 2022 at 03:54:59PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Currently, the code that performs CoW remapping after a write has this
> odd behavior where it walks /backwards/ through the data fork to remap
> extents in reverse order.  Earlier, we rewrote the reflink remap
> function to use deferred bmap log items instead of trying to cram as
> much into the first transaction that we could.  Now do the same for the
> CoW remap code.  There doesn't seem to be any performance impact; we're
> just making better use of code that we added for the benefit of reflink.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_reflink.c |   88 ++++++++++++++++++++++++++++++++------------------
>  fs/xfs/xfs_trace.h   |    3 +-
>  2 files changed, 58 insertions(+), 33 deletions(-)

Looks ok.

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-22 22:18     ` Darrick J. Wong
@ 2022-04-22 23:51       ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2022-04-22 23:51 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Fri, Apr 22, 2022 at 03:18:20PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 23, 2022 at 08:01:20AM +1000, Dave Chinner wrote:
> > On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > In commit e1a4e37cc7b6, we clamped the length of bunmapi calls on the
> > > data forks of shared files to avoid two failure scenarios: one where the
> > > extent being unmapped is so sparsely shared that we exceed the
> > > transaction reservation with the sheer number of refcount btree updates
> > > and EFI intent items; and the other where we attach so many deferred
> > > updates to the transaction that we pin the log tail and later the log
> > > head meets the tail, causing the log to livelock.
> > > 
> > > We avoid triggering the first problem by tracking the number of ops in
> > > the refcount btree cursor and forcing a requeue of the refcount intent
> > > item any time we think that we might be close to overflowing.  This has
> > > been baked into XFS since before the original e1a4 patch.
> > > 
> > > A recent patchset fixed the second problem by changing the deferred ops
> > > code to finish all the work items created by each round of trying to
> > > complete a refcount intent item, which eliminates the long chains of
> > > deferred items (27dad); and causing long-running transactions to relog
> > > their intent log items when space in the log gets low (74f4d).
> > > 
> > > Because this clamp affects /any/ unmapping request regardless of the
> > > sharing factors of the component blocks, it degrades the performance of
> > > all large unmapping requests -- whereas with an unshared file we can
> > > unmap millions of blocks in one go, shared files are limited to
> > > unmapping a few thousand blocks at a time, which causes the upper level
> > > code to spin in a bunmapi loop even if it wasn't needed.
> > > 
> > > This also eliminates one more place where log recovery behavior can
> > > differ from online behavior, because bunmapi operations no longer need
> > > to requeue.
> > > 
> > > Partial-revert-of: e1a4e37cc7b6 ("xfs: try to avoid blowing out the transaction reservation when bunmaping a shared extent")
> > > Depends: 27dada070d59 ("xfs: change the order in which child and parent defer ops are finished")
> > > Depends: 74f4d6a1e065 ("xfs: only relog deferred intent items if free space in the log gets low")
> > > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c     |   22 +---------------------
> > >  fs/xfs/libxfs/xfs_refcount.c |    5 ++---
> > >  fs/xfs/libxfs/xfs_refcount.h |    8 ++------
> > >  3 files changed, 5 insertions(+), 30 deletions(-)
> > 
> > This looks reasonable, but I'm wondering how the original problem
> > was discovered and whether this has been tested against that
> > original problem situation to ensure we aren't introducing a
> > regression here....
> 
> generic/447, and yes, I have forced it to run a deletion of 1 million
> extents without incident. :)

Ok, that's all I wanted to know :)

Reviewed-by: Dave Chinner <dchinner@redhat.com>

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size
  2022-04-22 22:36   ` Dave Chinner
@ 2022-04-25 23:39     ` Darrick J. Wong
  2022-04-26  4:24       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-25 23:39 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 23, 2022 at 08:36:35AM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 03:54:42PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Every time someone changes the transaction reservation sizes, they
> > introduce potential compatibility problems if the changes affect the
> > minimum log size that we validate at mount time.  If the minimum log
> > size gets larger (which should be avoided because doing so presents a
> > serious risk of log livelock), filesystems created with old mkfs will
> > not mount on a newer kernel; if the minimum size shrinks, filesystems
> > created with newer mkfs will not mount on older kernels.
> > 
> > Therefore, enable the creation of a shadow log reservation structure
> > where we can "undo" the effects of tweaks when computing minimum log
> > sizes.  These shadow reservations should never be used in practice, but
> > they insulate us from perturbations in minimum log size.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_log_rlimit.c |   17 +++++++++++++----
> >  fs/xfs/libxfs/xfs_trans_resv.c |   12 ++++++++++++
> >  fs/xfs/libxfs/xfs_trans_resv.h |    2 ++
> >  fs/xfs/xfs_trace.h             |   12 ++++++++++--
> >  4 files changed, 37 insertions(+), 6 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
> > index 67798ff5e14e..2bafc69cac15 100644
> > --- a/fs/xfs/libxfs/xfs_log_rlimit.c
> > +++ b/fs/xfs/libxfs/xfs_log_rlimit.c
> > @@ -14,6 +14,7 @@
> >  #include "xfs_trans_space.h"
> >  #include "xfs_da_btree.h"
> >  #include "xfs_bmap_btree.h"
> > +#include "xfs_trace.h"
> >  
> >  /*
> >   * Calculate the maximum length in bytes that would be required for a local
> > @@ -47,18 +48,25 @@ xfs_log_get_max_trans_res(
> >  	struct xfs_trans_res	*max_resp)
> >  {
> >  	struct xfs_trans_res	*resp;
> > +	struct xfs_trans_res	*start_resp;
> >  	struct xfs_trans_res	*end_resp;
> > +	struct xfs_trans_resv	*resv;
> >  	int			log_space = 0;
> >  	int			attr_space;
> >  
> >  	attr_space = xfs_log_calc_max_attrsetm_res(mp);
> >  
> > -	resp = (struct xfs_trans_res *)M_RES(mp);
> > -	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
> > -	for (; resp < end_resp; resp++) {
> > +	resv = kmem_zalloc(sizeof(struct xfs_trans_resv), 0);
> > +	xfs_trans_resv_calc_logsize(mp, resv);
> > +
> > +	start_resp = (struct xfs_trans_res *)resv;
> > +	end_resp = (struct xfs_trans_res *)(resv + 1);
> > +	for (resp = start_resp; resp < end_resp; resp++) {
> >  		int		tmp = resp->tr_logcount > 1 ?
> >  				      resp->tr_logres * resp->tr_logcount :
> >  				      resp->tr_logres;
> > +
> > +		trace_xfs_trans_resv_calc_logsize(mp, resp - start_resp, resp);
> >  		if (log_space < tmp) {
> >  			log_space = tmp;
> >  			*max_resp = *resp;		/* struct copy */
> 
> This took me a while to get my head around. The minimum logsize
> calculation stuff is all a bit of a mess.
> 
> Essentially, we call xfs_log_get_max_trans_res() from two places.
> One is to calculate the minimum log size, the other is the
> transaction reservation tracing code done when M_RES(mp) is set up
> via xfs_trans_trace_reservations().  We don't need the call from
> xfs_trans_trace_reservations() - it's trivial to scan the list of
> tracepoints emitted by this function at mount time to find the
> maximum reservation.

Here's the thing -- xfs_db also calls xfs_log_get_max_trans_res to
figure out the transaction reservation that's used to compute the
minimum log size.  Whenever I get a report about mount failing due to
minlogsize checks, I can ask the reporter to send me the ftrace output
from the mount attempt and compare it against what userspace thinks:

# xfs_db /dev/sde -c logres
type 0 logres 168184 logcount 5 flags 0x4
type 1 logres 293760 logcount 5 flags 0x4
type 2 logres 307936 logcount 2 flags 0x4
type 3 logres 187760 logcount 2 flags 0x4
type 4 logres 170616 logcount 2 flags 0x4
type 5 logres 244720 logcount 3 flags 0x4
type 6 logres 243568 logcount 2 flags 0x4
type 7 logres 260352 logcount 2 flags 0x4
type 8 logres 243568 logcount 3 flags 0x4
type 9 logres 278648 logcount 2 flags 0x4
type 10 logres 2168 logcount 0 flags 0x0
type 11 logres 73728 logcount 2 flags 0x4
type 12 logres 99960 logcount 2 flags 0x4
type 13 logres 760 logcount 0 flags 0x0
type 14 logres 292992 logcount 1 flags 0x4
type 15 logres 23288 logcount 3 flags 0x4
type 16 logres 13312 logcount 0 flags 0x0
type 17 logres 147584 logcount 3 flags 0x4
type 18 logres 640 logcount 0 flags 0x0
type 19 logres 94968 logcount 2 flags 0x4
type 20 logres 4224 logcount 0 flags 0x0
type 21 logres 6512 logcount 0 flags 0x0
type 22 logres 232 logcount 1 flags 0x0
type 23 logres 172407 logcount 5 flags 0x4
type 24 logres 640 logcount 1 flags 0x0
type 25 logres 760 logcount 0 flags 0x0
type 26 logres 243568 logcount 2 flags 0x4
type 27 logres 170616 logcount 2 flags 0x4
type -1 logres 547200 logcount 8 flags 0x4

And this "-1" entry matches the last output of the kernel.  I'd rather
not lose this tracing facility (which means keeping this function
non-static) though I will move the tracepoint out of
xfs_trans_trace_reservations.

> Hence I think we should start by removing that call to this
> function, and making this a static function called only from
> xfs_log_calc_minimum_size().
> 
> At this point, we can use an on-stack struct xfs_trans_resv for the
> calculated values - no need for memory allocation here as we will
> never be short of stack space in this path.

~312 bytes?  That's ~8% of the kernel stack.  I'll see if I run into any
complaints, though I bet I won't on x64.

> The tracing in the loop also wants an integer index into the struct
> xfs_trans_resv structure, so it should be changed to match what
> xfs_trans_trace_reservations() does:
> 
> 	struct xfs_trans_resv	resv;
> 	struct xfs_trans_res	*resp;
> 	struct xfs_trans_res	*end_resp;
> 
> 	....
> 
> 	xfs_trans_resv_calc(mp, &resv)
> 
> 	resp = (struct xfs_trans_res *)&resv;
> 	end_resp = (struct xfs_trans_res *)(&resv + 1);
> 	for (i = 0; resp < end_resp; resp++) {
> 		.....
> 		trace_xfs_trans_resv_calc_logsize(mp, i, resp);
> 		....
> 	}

Done.

> 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 6f83d9b306ee..12d4e451e70e 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -933,3 +933,15 @@ xfs_trans_resv_calc(
> >  	/* Put everything back the way it was.  This goes at the end. */
> >  	mp->m_rmap_maxlevels = rmap_maxlevels;
> >  }
> > +
> > +/*
> > + * Compute an alternate set of log reservation sizes for use exclusively with
> > + * minimum log size calculations.
> > + */
> > +void
> > +xfs_trans_resv_calc_logsize(
> > +	struct xfs_mount	*mp,
> > +	struct xfs_trans_resv	*resp)
> > +{
> > +	xfs_trans_resv_calc(mp, resp);
> > +}
> 
> This function and its name were what confused me for a while - I
> don't think it belongs in this patch, and I don't think it belongs
> in this file when it's filled out in the next patch. It's basically
> handling things specific to minimum log size calculations, so with
> the above mods I think it should also end up being static to
> libxfs/xfs_log_rlimit.c.

Moved.  I guess I should rename it to xfs_log_recalc_trans_resv or
something.

--D

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] xfs: reduce the absurdly large log reservations
  2022-04-22 22:51   ` Dave Chinner
@ 2022-04-25 23:47     ` Darrick J. Wong
  2022-04-26  4:25       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-25 23:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 23, 2022 at 08:51:52AM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 03:54:48PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Back in the early days of reflink and rmap development I set the
> > transaction reservation sizes to be overly generous for rmap+reflink
> > filesystems, and a little under-generous for rmap-only filesystems.
> > 
> > Since we don't need *eight* transaction rolls to handle three new log
> > intent items, decrease the logcounts to what we actually need, and amend
> > the shadow reservation computation function to reflect what we used to
> > do so that the minimum log size doesn't change.
> 
> Yup, this will make a huge difference to the number of transactions
> we can have in flight on reflink/rmap enabled filesystems.
> 
> Mostly looks good, some comments about code and comments below.

Yay!

> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_trans_resv.c |   88 +++++++++++++++++++++++++++-------------
> >  fs/xfs/libxfs/xfs_trans_resv.h |    6 ++-
> >  2 files changed, 64 insertions(+), 30 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 12d4e451e70e..8d2f04dddb65 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -814,36 +814,16 @@ xfs_trans_resv_calc(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_trans_resv	*resp)
> >  {
> > -	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> > -
> > -	/*
> > -	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> > -	 * to 9 even if the AG was small enough that it would never grow to
> > -	 * that height.  Transaction reservation sizes influence the minimum
> > -	 * log size calculation, which influences the size of the log that mkfs
> > -	 * creates.  Use the old value here to ensure that newly formatted
> > -	 * small filesystems will mount on older kernels.
> > -	 */
> > -	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> > -		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> > -
> >  	/*
> >  	 * The following transactions are logged in physical format and
> >  	 * require a permanent reservation on space.
> >  	 */
> >  	resp->tr_write.tr_logres = xfs_calc_write_reservation(mp);
> > -	if (xfs_has_reflink(mp))
> > -		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> > -	else
> > -		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> > +	resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> >  	resp->tr_write.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> >  	resp->tr_itruncate.tr_logres = xfs_calc_itruncate_reservation(mp);
> > -	if (xfs_has_reflink(mp))
> > -		resp->tr_itruncate.tr_logcount =
> > -				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
> > -	else
> > -		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> > +	resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> >  	resp->tr_itruncate.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> >  	resp->tr_rename.tr_logres = xfs_calc_rename_reservation(mp);
> > @@ -900,10 +880,7 @@ xfs_trans_resv_calc(
> >  	resp->tr_growrtalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> >  	resp->tr_qm_dqalloc.tr_logres = xfs_calc_qm_dqalloc_reservation(mp);
> > -	if (xfs_has_reflink(mp))
> > -		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> > -	else
> > -		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> > +	resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> >  	resp->tr_qm_dqalloc.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> >  
> >  	/*
> > @@ -930,8 +907,26 @@ xfs_trans_resv_calc(
> >  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
> >  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> >  
> > -	/* Put everything back the way it was.  This goes at the end. */
> > -	mp->m_rmap_maxlevels = rmap_maxlevels;
> > +	/* Add one logcount for BUI items that appear with rmap or reflink. */
> > +	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
> > +		resp->tr_itruncate.tr_logcount++;
> > +		resp->tr_write.tr_logcount++;
> > +		resp->tr_qm_dqalloc.tr_logcount++;
> > +	}
> > +
> > +	/* Add one logcount for refcount intent items. */
> > +	if (xfs_has_reflink(mp)) {
> > +		resp->tr_itruncate.tr_logcount++;
> > +		resp->tr_write.tr_logcount++;
> > +		resp->tr_qm_dqalloc.tr_logcount++;
> > +	}
> > +
> > +	/* Add one logcount for rmap intent items. */
> > +	if (xfs_has_rmapbt(mp)) {
> > +		resp->tr_itruncate.tr_logcount++;
> > +		resp->tr_write.tr_logcount++;
> > +		resp->tr_qm_dqalloc.tr_logcount++;
> > +	}
> 
> This would be much more concisely written as
> 
> 	count = 0;
> 	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
> 		count = 2;
> 		if (xfs_has_reflink(mp) && xfs_has_rmapbt(mp))
> 			count++;
> 	}
> 
> 	resp->tr_itruncate.tr_logcount += count;
> 	resp->tr_write.tr_logcount += count;
> 	resp->tr_qm_dqalloc.tr_logcount += count;

I think I'd rather do:

	/*
	 * Add one logcount for BUI items that appear with rmap or reflink,
	 * one logcount for refcount intent items, and one logcount for rmap
	 * intent items.
	 */
	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp))
		logcount_adj++;
	if (xfs_has_reflink(mp))
		logcount_adj++;
	if (xfs_has_rmapbt(mp))
		logcount_adj++;

	resp->tr_itruncate.tr_logcount += logcount_adj;
	resp->tr_write.tr_logcount += logcount_adj;
	resp->tr_qm_dqalloc.tr_logcount += logcount_adj;

If you don't mind?

> 
> >  }
> >  
> >  /*
> > @@ -943,5 +938,42 @@ xfs_trans_resv_calc_logsize(
> >  	struct xfs_mount	*mp,
> >  	struct xfs_trans_resv	*resp)
> >  {
> > +	unsigned int		rmap_maxlevels = mp->m_rmap_maxlevels;
> > +
> > +	ASSERT(resp != M_RES(mp));
> > +
> > +	/*
> > +	 * In the early days of rmap+reflink, we always set the rmap maxlevels
> > +	 * to 9 even if the AG was small enough that it would never grow to
> > +	 * that height.  Transaction reservation sizes influence the minimum
> > +	 * log size calculation, which influences the size of the log that mkfs
> > +	 * creates.  Use the old value here to ensure that newly formatted
> > +	 * small filesystems will mount on older kernels.
> > +	 */
> > +	if (xfs_has_rmapbt(mp) && xfs_has_reflink(mp))
> > +		mp->m_rmap_maxlevels = XFS_OLD_REFLINK_RMAP_MAXLEVELS;
> > +
> >  	xfs_trans_resv_calc(mp, resp);
> > +
> > +	if (xfs_has_reflink(mp)) {
> > +		/*
> > +		 * In the early days of reflink we set the logcounts absurdly
> > +		 * high.
> 
> "In the early days of reflink, typical operation log counts were
> greatly overestimated"

Fixed.

> > +		 */
> > +		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> > +		resp->tr_itruncate.tr_logcount =
> > +				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
> > +		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> > +	} else if (xfs_has_rmapbt(mp)) {
> > +		/*
> > +		 * In the early days of non-reflink rmap we set the logcount
> > +		 * too low.
> 
> "In the early days of non-reflink rmap the impact of rmap btree
> updates on log counts was not taken into account at all."

Fixed.

> > +		 */
> > +		resp->tr_write.tr_logcount = XFS_WRITE_LOG_COUNT;
> > +		resp->tr_itruncate.tr_logcount = XFS_ITRUNCATE_LOG_COUNT;
> > +		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT;
> > +	}
> > +
> > +	/* Put everything back the way it was.  This goes at the end. */
> > +	mp->m_rmap_maxlevels = rmap_maxlevels;
> >  }
> 
> Yeah, so I think this should all go in xfs_log_rlimit.c as it is
> specific to the minimum log size calculation.

Moved.

> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > index 9fa4863f72a4..461859f4a745 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > @@ -73,7 +73,6 @@ struct xfs_trans_resv {
> >  #define	XFS_DEFAULT_LOG_COUNT		1
> >  #define	XFS_DEFAULT_PERM_LOG_COUNT	2
> >  #define	XFS_ITRUNCATE_LOG_COUNT		2
> > -#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
> >  #define XFS_INACTIVE_LOG_COUNT		2
> >  #define	XFS_CREATE_LOG_COUNT		2
> >  #define	XFS_CREATE_TMPFILE_LOG_COUNT	2
> > @@ -83,12 +82,15 @@ struct xfs_trans_resv {
> >  #define	XFS_LINK_LOG_COUNT		2
> >  #define	XFS_RENAME_LOG_COUNT		2
> >  #define	XFS_WRITE_LOG_COUNT		2
> > -#define	XFS_WRITE_LOG_COUNT_REFLINK	8
> >  #define	XFS_ADDAFORK_LOG_COUNT		2
> >  #define	XFS_ATTRINVAL_LOG_COUNT		1
> >  #define	XFS_ATTRSET_LOG_COUNT		3
> >  #define	XFS_ATTRRM_LOG_COUNT		3
> >  
> > +/* Absurdly high log counts from the early days of reflink.  Do not use. */
> > +#define	XFS_ITRUNCATE_LOG_COUNT_REFLINK	8
> > +#define	XFS_WRITE_LOG_COUNT_REFLINK	8
> 
> /*
>  * Original log counts were overestimated in the early days of
>  * reflink. These are retained here purely for minimum log size
>  * calculations and are not to be used for runtime reservations.
>  */

Fixed, thanks. :)

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 5/6] xfs: reduce transaction reservations with reflink
  2022-04-22 23:42   ` Dave Chinner
@ 2022-04-25 23:49     ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-25 23:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Sat, Apr 23, 2022 at 09:42:38AM +1000, Dave Chinner wrote:
> On Thu, Apr 14, 2022 at 03:54:54PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Before the introduction of deferred refcount operations, reflink
> > would try to cram refcount btree updates into the same transaction as an
> > allocation or a free event.  Mainline XFS has never actually done that,
> > but we never refactored the transaction reservations to reflect that we
> > now do all refcount updates in separate transactions.  Fix this to
> > reduce the transaction reservation size even farther, so that between
> > this patch and the previous one, we reduce the tr_write and tr_itruncate
> > sizes by 66%.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/libxfs/xfs_refcount.c   |    9 +++-
> >  fs/xfs/libxfs/xfs_trans_resv.c |   97 ++++++++++++++++++++++++++++++++++++++--
> >  2 files changed, 98 insertions(+), 8 deletions(-)
> > 
> > 
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index a07ebaecba73..e53544d52ee2 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -886,8 +886,13 @@ xfs_refcount_still_have_space(
> >  {
> >  	unsigned long			overhead;
> >  
> > -	overhead = cur->bc_ag.refc.shape_changes *
> > -			xfs_allocfree_log_count(cur->bc_mp, 1);
> > +	/*
> > +	 * Worst case estimate: full splits of the free space and rmap btrees
> > +	 * to handle each of the shape changes to the refcount btree.
> > +	 */
> > +	overhead = xfs_allocfree_log_count(cur->bc_mp,
> > +				cur->bc_ag.refc.shape_changes);
> > +	overhead += cur->bc_mp->m_refc_maxlevels;
> >  	overhead *= cur->bc_mp->m_sb.sb_blocksize;
> >  
> >  	/*
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index 8d2f04dddb65..e5c3fcc2ab15 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -56,8 +56,7 @@ xfs_calc_buf_res(
> >   * Per-extent log reservation for the btree changes involved in freeing or
> >   * allocating an extent.  In classic XFS there were two trees that will be
> >   * modified (bnobt + cntbt).  With rmap enabled, there are three trees
> > - * (rmapbt).  With reflink, there are four trees (refcountbt).  The number of
> > - * blocks reserved is based on the formula:
> > + * (rmapbt).  The number of blocks reserved is based on the formula:
> >   *
> >   * num trees * ((2 blocks/level * max depth) - 1)
> >   *
> > @@ -73,12 +72,23 @@ xfs_allocfree_log_count(
> >  	blocks = num_ops * 2 * (2 * mp->m_alloc_maxlevels - 1);
> >  	if (xfs_has_rmapbt(mp))
> >  		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);
> > -	if (xfs_has_reflink(mp))
> > -		blocks += num_ops * (2 * mp->m_refc_maxlevels - 1);
> >  
> >  	return blocks;
> >  }
> >  
> > +/*
> > + * Per-extent log reservation for refcount btree changes.  These are never done
> > + * in the same transaction as an allocation or a free, so we compute them
> > + * separately.
> > + */
> > +static unsigned int
> > +xfs_refcount_log_count(
> > +	struct xfs_mount	*mp,
> > +	unsigned int		num_ops)
> > +{
> > +	return num_ops * (2 * mp->m_refc_maxlevels - 1);
> > +}
> 
> This is a block count, right? My brain just went "spoing!" because I
> was just looking at transaction reservation log counts, and here we
> have a "log count" reservation that isn't a reservation log count
> but a block count used to calculate the transaction unit
> reservation...
> 
> Yeah, I know, that's my fault for naming xfs_allocfree_log_count()
> the way I did way back when, but I think this really should tell the
> reader it is returning a block count for the refcount btree mods.
> 
> Maybe xfs_refcountbt_block_count(), perhaps?

Ok.  I'll add in another patch to rename xfs_allocfree_log_count at the
end.
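
Roughly what I have in mind for that rename - the name is only a guess
at this point, the body is unchanged from the current helper:

static unsigned int
xfs_allocfree_block_count(
	struct xfs_mount	*mp,
	unsigned int		num_ops)
{
	unsigned int		blocks;

	/* two free space btrees (bnobt + cntbt), plus the rmapbt if enabled */
	blocks = num_ops * 2 * (2 * mp->m_alloc_maxlevels - 1);
	if (xfs_has_rmapbt(mp))
		blocks += num_ops * (2 * mp->m_rmap_maxlevels - 1);

	return blocks;
}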

> > + * Compute the log reservation required to handle the refcount update
> > + * transaction.  Refcount updates are always done via deferred log items.
> > + *
> > + * This is calculated as:
> > + * Data device refcount updates (t1):
> > + *    the agfs of the ags containing the blocks: nr_ops * sector size
> > + *    the refcount btrees: nr_ops * 1 trees * (2 * max depth - 1) * block size
> > + */
> > +static unsigned int
> > +xfs_refcount_log_reservation(
> > +	struct xfs_mount	*mp,
> > +	unsigned int		nr_ops)
> > +{
> > +	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> > +
> > +	if (!xfs_has_reflink(mp))
> > +		return 0;
> > +
> > +	return xfs_calc_buf_res(nr_ops, mp->m_sb.sb_sectsize) +
> > +	       xfs_calc_buf_res(xfs_refcount_log_count(mp, nr_ops), blksz);
> > +}
> 
> To follow convention, this calculates a reservation in bytes, so
> xfs_calc_refcountbt_reservation() would match all the other
> functions that do similar things...

Fixed.

> .....
> 
> > @@ -303,10 +338,42 @@ xfs_calc_write_reservation(
> >   *    the realtime summary: 2 exts * 1 block
> >   *    worst case split in allocation btrees per extent assuming 2 extents:
> >   *		2 exts * 2 trees * (2 * max depth - 1) * block size
> > + * And any refcount updates that happen in a separate transaction (t4).
> >   */
> >  STATIC uint
> >  xfs_calc_itruncate_reservation(
> >  	struct xfs_mount	*mp)
> > +{
> > +	unsigned int		t1, t2, t3, t4;
> > +	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> > +
> > +	t1 = xfs_calc_inode_res(mp, 1) +
> > +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> > +
> > +	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> > +	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> > +
> > +	if (xfs_has_realtime(mp)) {
> > +		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> > +		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> > +		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> > +	} else {
> > +		t3 = 0;
> > +	}
> > +
> > +	t4 = xfs_refcount_log_reservation(mp, 2);
> > +
> > +	return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
> > +}
> > +
> > +/*
> > + * For log size calculation, this is the same as above except that we used to
> > + * include refcount updates in the allocfree computation even though we've
> > + * always run them as a separate transaction.
> > + */
> > +STATIC uint
> > +xfs_calc_itruncate_reservation_logsize(
> > +	struct xfs_mount	*mp)
> >  {
> >  	unsigned int		t1, t2, t3;
> >  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> > @@ -317,6 +384,9 @@ xfs_calc_itruncate_reservation(
> >  	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> >  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> >  
> > +	if (xfs_has_reflink(mp))
> > +	     t2 += xfs_calc_buf_res(xfs_refcount_log_count(mp, 4), blksz);
> > +
> 
> Ok, so the legacy code gets 4 extra refcount ops accounted here
> because we removed it from xfs_allocfree_log_count(), and we keep
> this function around because the new code might select t4 as
> the maximum reservation it needs and so we can't just add these
> blocks to the newly calculated reservation. Correct?
> 
> If so, I'm not a great fan of duplicating this code when this is all
> that differs between the two case. Can we pass in a "bool
> for_minlogsize" variable and do:
> 
> uint
> xfs_calc_itruncate_reservation(
> 	struct xfs_mount	*mp,
> 	bool			for_minlogsize)
> {
> 	unsigned int		t1, t2, t3, t4;
> 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> 
> 	t1 = xfs_calc_inode_res(mp, 1) +
> 	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> 
> 	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> 	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> 
> 	if (xfs_has_realtime(mp)) {
> 		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> 		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> 		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> 	} else {
> 		t3 = 0;
> 	}
> 
> 	if (!for_minlogsize) {
> 		t4 = xfs_refcount_log_reservation(mp, 2);
> 		return XFS_DQUOT_LOGRES(mp) + max(t4, max3(t1, t2, t3));
> 	}
> 
> 	/*
> 	 * add reason for minlogsize calculation differences here
> 	 */
> 	if (xfs_has_reflink(mp))
> 		t2 += xfs_calc_buf_res(xfs_refcount_log_count(mp, 4), blksz);
> 
> 	return XFS_DQUOT_LOGRES(mp) + max3(t1, t2, t3);
> }
> 
> And now we can call xfs_calc_itruncate_reservation(mp, true)
> directly from xfs_log_rlimits.c to get the correct reservation for
> the logsize code with minimal extra code.

Yep.  This is an improvement, in terms of being able to compare before
and after.  I'll make sure to document exactly why the for_minlogsize
case does what it does.
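
A rough sketch of the wording I have in mind for that comment, sitting
next to the for_minlogsize branch:

	/*
	 * The early reflink code accounted two refcount btree updates per
	 * extent in the alloc/free block count even though refcount updates
	 * have always run in their own transactions.  Keep that overestimate
	 * for minimum log size calculations only, so the minimum log size
	 * that mkfs computes doesn't change and newly formatted small
	 * filesystems keep mounting on older kernels.
	 */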

> >  	if (xfs_has_realtime(mp)) {
> >  		t3 = xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> >  		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 2), blksz) +
> > @@ -956,6 +1026,9 @@ xfs_trans_resv_calc_logsize(
> >  	xfs_trans_resv_calc(mp, resp);
> >  
> >  	if (xfs_has_reflink(mp)) {
> > +		unsigned int	t4;
> > +		unsigned int	blksz = XFS_FSB_TO_B(mp, 1);
> > +
> >  		/*
> >  		 * In the early days of reflink we set the logcounts absurdly
> >  		 * high.
> > @@ -964,6 +1037,18 @@ xfs_trans_resv_calc_logsize(
> >  		resp->tr_itruncate.tr_logcount =
> >  				XFS_ITRUNCATE_LOG_COUNT_REFLINK;
> >  		resp->tr_qm_dqalloc.tr_logcount = XFS_WRITE_LOG_COUNT_REFLINK;
> > +
> > +		/*
> > +		 * We also used to account two refcount updates per extent into
> > +		 * the alloc/free step of write and truncate calls, even though
> > +		 * those are run in separate transactions.
> > +		 */
> > +		t4 = xfs_calc_buf_res(xfs_refcount_log_count(mp, 2), blksz);
> > +		resp->tr_write.tr_logres += t4;
> > +		resp->tr_qm_dqalloc.tr_logres += t4;
> 
> I think these have the same problem as itruncate - the write
> reservation is conditional on the maximum reservation of several
> different operations, and so dropping the refcountbt from that
> comparison means it could select a different max reservation. Then
> if we unconditionally add the refcount blocks to that, then we end
> up with a different size to what the original code calculated.
> 
> Hence I think these also need to be treated like I outline for
> itruncate above.

Yup.  Thanks for the comments!

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size
  2022-04-25 23:39     ` Darrick J. Wong
@ 2022-04-26  4:24       ` Dave Chinner
  2022-04-26  5:10         ` Darrick J. Wong
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-04-26  4:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 25, 2022 at 04:39:05PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 23, 2022 at 08:36:35AM +1000, Dave Chinner wrote:
> > On Thu, Apr 14, 2022 at 03:54:42PM -0700, Darrick J. Wong wrote:
....
> > > @@ -47,18 +48,25 @@ xfs_log_get_max_trans_res(
> > >  	struct xfs_trans_res	*max_resp)
> > >  {
> > >  	struct xfs_trans_res	*resp;
> > > +	struct xfs_trans_res	*start_resp;
> > >  	struct xfs_trans_res	*end_resp;
> > > +	struct xfs_trans_resv	*resv;
> > >  	int			log_space = 0;
> > >  	int			attr_space;
> > >  
> > >  	attr_space = xfs_log_calc_max_attrsetm_res(mp);
> > >  
> > > -	resp = (struct xfs_trans_res *)M_RES(mp);
> > > -	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
> > > -	for (; resp < end_resp; resp++) {
> > > +	resv = kmem_zalloc(sizeof(struct xfs_trans_resv), 0);
> > > +	xfs_trans_resv_calc_logsize(mp, resv);
> > > +
> > > +	start_resp = (struct xfs_trans_res *)resv;
> > > +	end_resp = (struct xfs_trans_res *)(resv + 1);
> > > +	for (resp = start_resp; resp < end_resp; resp++) {
> > >  		int		tmp = resp->tr_logcount > 1 ?
> > >  				      resp->tr_logres * resp->tr_logcount :
> > >  				      resp->tr_logres;
> > > +
> > > +		trace_xfs_trans_resv_calc_logsize(mp, resp - start_resp, resp);
> > >  		if (log_space < tmp) {
> > >  			log_space = tmp;
> > >  			*max_resp = *resp;		/* struct copy */
> > 
> > This took me a while to get my head around. The minimum logsize
> > calculation stuff is all a bit of a mess.
> > 
> > Essentially, we call xfs_log_get_max_trans_res() from two places.
> > One is to calculate the minimum log size, the other is the
> > transaction reservation tracing code done when M_RES(mp) is set up
> > via xfs_trans_trace_reservations().  We don't need the call from
> > xfs_trans_trace_reservations() - it's trivial to scan the list of
> > tracepoints emitted by this function at mount time to find the
> > maximum reservation.
> 
> Here's the thing -- xfs_db also calls xfs_log_get_max_trans_res to
> figure out the transaction reservation that's used to compute the
> minimum log size.  Whenever I get a report about mount failing due to
> minlogsize checks, I can ask the reporter to send me the ftrace output
> from the mount attempt and compare it against what userspace thinks:
> 
> # xfs_db /dev/sde -c logres
> type 0 logres 168184 logcount 5 flags 0x4
> type 1 logres 293760 logcount 5 flags 0x4
> type 2 logres 307936 logcount 2 flags 0x4
> type 3 logres 187760 logcount 2 flags 0x4
> type 4 logres 170616 logcount 2 flags 0x4
> type 5 logres 244720 logcount 3 flags 0x4
> type 6 logres 243568 logcount 2 flags 0x4
> type 7 logres 260352 logcount 2 flags 0x4
> type 8 logres 243568 logcount 3 flags 0x4
> type 9 logres 278648 logcount 2 flags 0x4
> type 10 logres 2168 logcount 0 flags 0x0
> type 11 logres 73728 logcount 2 flags 0x4
> type 12 logres 99960 logcount 2 flags 0x4
> type 13 logres 760 logcount 0 flags 0x0
> type 14 logres 292992 logcount 1 flags 0x4
> type 15 logres 23288 logcount 3 flags 0x4
> type 16 logres 13312 logcount 0 flags 0x0
> type 17 logres 147584 logcount 3 flags 0x4
> type 18 logres 640 logcount 0 flags 0x0
> type 19 logres 94968 logcount 2 flags 0x4
> type 20 logres 4224 logcount 0 flags 0x0
> type 21 logres 6512 logcount 0 flags 0x0
> type 22 logres 232 logcount 1 flags 0x0
> type 23 logres 172407 logcount 5 flags 0x4
> type 24 logres 640 logcount 1 flags 0x0
> type 25 logres 760 logcount 0 flags 0x0
> type 26 logres 243568 logcount 2 flags 0x4
> type 27 logres 170616 logcount 2 flags 0x4
> type -1 logres 547200 logcount 8 flags 0x4
> 
> And this "-1" entry matches the last output of the kernel.

I look at that and thing "xfs_db output is broken" because that last
line cannot be derived from the individual transaction reservations
that are listed. It makes no sense in isolation/without
documentation. :/

> I'd rather
> not lose this tracing facility (which means keeping this function
> non-static) though I will move the tracepoint out of
> xfs_trans_trace_reservations.

You mean "remove only the '-1' tracepoint" from
xfs_trans_trace_reservations()?

> > Hence I think we should start by removing that call to this
> > function, and making this a static function called only from
> > xfs_log_calc_minimum_size().
> > 
> > At this point, we can use an on-stack struct xfs_trans_resv for the
> > calculated values - no need for memory allocation here as we will
> > never be short of stack space in this path.
> 
> ~312 bytes?  That's ~8% of the kernel stack.  I'll see if I run into any
> complaints, though I bet I won't on x64.

What architecture still uses 4kB stacks?  Filesystems have blown
through 4kB stacks without even trying on 32bit systems for years
now.

Regardless, the mount path call chain here is nowhere near deep
enough to be at risk of blowing stacks, and this is at the leaf so
it's largely irrelevant if we put this on the stack...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 4/6] xfs: reduce the absurdly large log reservations
  2022-04-25 23:47     ` Darrick J. Wong
@ 2022-04-26  4:25       ` Dave Chinner
  0 siblings, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2022-04-26  4:25 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Mon, Apr 25, 2022 at 04:47:49PM -0700, Darrick J. Wong wrote:
> On Sat, Apr 23, 2022 at 08:51:52AM +1000, Dave Chinner wrote:
> > On Thu, Apr 14, 2022 at 03:54:48PM -0700, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > Back in the early days of reflink and rmap development I set the
> > > transaction reservation sizes to be overly generous for rmap+reflink
> > > filesystems, and a little under-generous for rmap-only filesystems.
> > > 
> > > Since we don't need *eight* transaction rolls to handle three new log
> > > intent items, decrease the logcounts to what we actually need, and amend
> > > the shadow reservation computation function to reflect what we used to
> > > do so that the minimum log size doesn't change.
> > 
> > Yup, this will make a huge difference to the number of transactions
> > we can have in flight on reflink/rmap enabled filesystems.
> > 
> > Mostly looks good, some comments about code and comments below.
....
> > > -	/* Put everything back the way it was.  This goes at the end. */
> > > -	mp->m_rmap_maxlevels = rmap_maxlevels;
> > > +	/* Add one logcount for BUI items that appear with rmap or reflink. */
> > > +	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
> > > +		resp->tr_itruncate.tr_logcount++;
> > > +		resp->tr_write.tr_logcount++;
> > > +		resp->tr_qm_dqalloc.tr_logcount++;
> > > +	}
> > > +
> > > +	/* Add one logcount for refcount intent items. */
> > > +	if (xfs_has_reflink(mp)) {
> > > +		resp->tr_itruncate.tr_logcount++;
> > > +		resp->tr_write.tr_logcount++;
> > > +		resp->tr_qm_dqalloc.tr_logcount++;
> > > +	}
> > > +
> > > +	/* Add one logcount for rmap intent items. */
> > > +	if (xfs_has_rmapbt(mp)) {
> > > +		resp->tr_itruncate.tr_logcount++;
> > > +		resp->tr_write.tr_logcount++;
> > > +		resp->tr_qm_dqalloc.tr_logcount++;
> > > +	}
> > 
> > This would be much more concisely written as
> > 
> > 	count = 0;
> > 	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp)) {
> > 		count = 2;
> > 		if (xfs_has_reflink(mp) && xfs_has_rmapbt(mp))
> > 			count++;
> > 	}
> > 
> > 	resp->tr_itruncate.tr_logcount += count;
> > 	resp->tr_write.tr_logcount += count;
> > 	resp->tr_qm_dqalloc.tr_logcount += count;
> 
> I think I'd rather do:
> 
> 	/*
> 	 * Add one logcount for BUI items that appear with rmap or reflink,
> 	 * one logcount for refcount intent items, and one logcount for rmap
> 	 * intent items.
> 	 */
> 	if (xfs_has_reflink(mp) || xfs_has_rmapbt(mp))
> 		logcount_adj++;
> 	if (xfs_has_reflink(mp))
> 		logcount_adj++;
> 	if (xfs_has_rmapbt(mp))
> 		logcount_adj++;
> 
> 	resp->tr_itruncate.tr_logcount += logcount_adj;
> 	resp->tr_write.tr_logcount += logcount_adj;
> 	resp->tr_qm_dqalloc.tr_logcount += logcount_adj;
> 
> If you don't mind?

Sure, that's just as good.

--Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size
  2022-04-26  4:24       ` Dave Chinner
@ 2022-04-26  5:10         ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-26  5:10 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Apr 26, 2022 at 02:24:44PM +1000, Dave Chinner wrote:
> On Mon, Apr 25, 2022 at 04:39:05PM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 23, 2022 at 08:36:35AM +1000, Dave Chinner wrote:
> > > On Thu, Apr 14, 2022 at 03:54:42PM -0700, Darrick J. Wong wrote:
> ....
> > > > @@ -47,18 +48,25 @@ xfs_log_get_max_trans_res(
> > > >  	struct xfs_trans_res	*max_resp)
> > > >  {
> > > >  	struct xfs_trans_res	*resp;
> > > > +	struct xfs_trans_res	*start_resp;
> > > >  	struct xfs_trans_res	*end_resp;
> > > > +	struct xfs_trans_resv	*resv;
> > > >  	int			log_space = 0;
> > > >  	int			attr_space;
> > > >  
> > > >  	attr_space = xfs_log_calc_max_attrsetm_res(mp);
> > > >  
> > > > -	resp = (struct xfs_trans_res *)M_RES(mp);
> > > > -	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
> > > > -	for (; resp < end_resp; resp++) {
> > > > +	resv = kmem_zalloc(sizeof(struct xfs_trans_resv), 0);
> > > > +	xfs_trans_resv_calc_logsize(mp, resv);
> > > > +
> > > > +	start_resp = (struct xfs_trans_res *)resv;
> > > > +	end_resp = (struct xfs_trans_res *)(resv + 1);
> > > > +	for (resp = start_resp; resp < end_resp; resp++) {
> > > >  		int		tmp = resp->tr_logcount > 1 ?
> > > >  				      resp->tr_logres * resp->tr_logcount :
> > > >  				      resp->tr_logres;
> > > > +
> > > > +		trace_xfs_trans_resv_calc_logsize(mp, resp - start_resp, resp);
> > > >  		if (log_space < tmp) {
> > > >  			log_space = tmp;
> > > >  			*max_resp = *resp;		/* struct copy */
> > > 
> > > This took me a while to get my head around. The minimum logsize
> > > calculation stuff is all a bit of a mess.
> > > 
> > > Essentially, we call xfs_log_get_max_trans_res() from two places.
> > > One is to calculate the minimum log size; the other is the
> > > transaction reservation tracing code done when M_RES(mp) is set up
> > > via xfs_trans_trace_reservations().  We don't need the call from
> > > xfs_trans_trace_reservations() - it's trivial to scan the list of
> > > tracepoints emitted by this function at mount time to find the
> > > maximum reservation.
> > 
> > Here's the thing -- xfs_db also calls xfs_log_get_max_trans_res to
> > figure out the transaction reservation that's used to compute the
> > minimum log size.  Whenever I get a report about mount failing due to
> > minlogsize checks, I can ask the reporter to send me the ftrace output
> > from the mount attempt and compare it against what userspace thinks:
> > 
> > # xfs_db /dev/sde -c logres
> > type 0 logres 168184 logcount 5 flags 0x4
> > type 1 logres 293760 logcount 5 flags 0x4
> > type 2 logres 307936 logcount 2 flags 0x4
> > type 3 logres 187760 logcount 2 flags 0x4
> > type 4 logres 170616 logcount 2 flags 0x4
> > type 5 logres 244720 logcount 3 flags 0x4
> > type 6 logres 243568 logcount 2 flags 0x4
> > type 7 logres 260352 logcount 2 flags 0x4
> > type 8 logres 243568 logcount 3 flags 0x4
> > type 9 logres 278648 logcount 2 flags 0x4
> > type 10 logres 2168 logcount 0 flags 0x0
> > type 11 logres 73728 logcount 2 flags 0x4
> > type 12 logres 99960 logcount 2 flags 0x4
> > type 13 logres 760 logcount 0 flags 0x0
> > type 14 logres 292992 logcount 1 flags 0x4
> > type 15 logres 23288 logcount 3 flags 0x4
> > type 16 logres 13312 logcount 0 flags 0x0
> > type 17 logres 147584 logcount 3 flags 0x4
> > type 18 logres 640 logcount 0 flags 0x0
> > type 19 logres 94968 logcount 2 flags 0x4
> > type 20 logres 4224 logcount 0 flags 0x0
> > type 21 logres 6512 logcount 0 flags 0x0
> > type 22 logres 232 logcount 1 flags 0x0
> > type 23 logres 172407 logcount 5 flags 0x4
> > type 24 logres 640 logcount 1 flags 0x0
> > type 25 logres 760 logcount 0 flags 0x0
> > type 26 logres 243568 logcount 2 flags 0x4
> > type 27 logres 170616 logcount 2 flags 0x4
> > type -1 logres 547200 logcount 8 flags 0x4
> > 
> > And this "-1" entry matches the last line of the kernel's tracepoint output.
> 
> I look at that and think "xfs_db output is broken" because that last
> line cannot be derived from the individual transaction reservations
> that are listed. It makes no sense in isolation/without
> documentation. :/
> 
> > I'd rather
> > not lose this tracing facility (which means keeping this function
> > non-static) though I will move the tracepoint out of
> > xfs_trans_trace_reservations.
> 
> You mean "remove only the '-1' tracepoint" from
> xfs_trans_trace_reservations()?

Rework into a better tracepoint?

	xfs_minlogblocks_trans_resv: logres 547200 logcount 8 flags 0x4

Or something like that.  I don't think we actually care about flags
there.
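
Concretely, a sketch of what that reworked tracepoint could look like in
fs/xfs/xfs_trace.h -- the event name and fields here are illustrative,
and the usual XFS trace class machinery is elided:

TRACE_EVENT(xfs_minlogblocks_trans_resv,
	TP_PROTO(struct xfs_mount *mp, struct xfs_trans_res *res),
	TP_ARGS(mp, res),
	TP_STRUCT__entry(
		__field(dev_t, dev)
		__field(int, logres)
		__field(int, logcount)
	),
	TP_fast_assign(
		__entry->dev = mp->m_super->s_dev;
		__entry->logres = res->tr_logres;
		__entry->logcount = res->tr_logcount;
	),
	TP_printk("dev %d:%d logres %d logcount %d",
		  MAJOR(__entry->dev), MINOR(__entry->dev),
		  __entry->logres, __entry->logcount)
);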

> > > Hence I think we should start by removing that call to this
> > > function, and making this a static function called only from
> > > xfs_log_calc_minimum_size().
> > > 
> > > At this point, we can use an on-stack struct xfs_trans_resv for the
> > > calculated values - no need for memory allocation here as we will
> > > never be short of stack space in this path.
> > 
> > ~312 bytes?  That's ~8% of a 4kB kernel stack.  I'll see if I run into any
> > complaints, though I bet I won't on x64.
> 
> What architecture still uses 4kB stacks?  Filesystems have blown
> through 4kB stacks without even trying on 32bit systems for years
> now.

I doubt they /all/ have larger stacks, though it might simply be the
case that we don't care about supporting the long tail of small machines
(h8300 anyone?) that Linux supports.

> Regardless, the mount path call chain here is nowhere near deep
> enough to be at risk of blowing stacks, and this is at the leaf so
> it's largely irrelevant if we put this on the stack...

<nod> I'm not expecting to see any problems.
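
For reference, the on-stack variant would look something like the sketch
below.  The scan loop is lifted from the hunk quoted above, the
attr-reservation comparison at the end of the real function is elided,
and whether the function stays non-static is the separate question
discussed earlier, so treat this as illustrative only:

void
xfs_log_get_max_trans_res(
	struct xfs_mount	*mp,
	struct xfs_trans_res	*max_resp)
{
	struct xfs_trans_resv	resv = { };	/* on stack, ~312 bytes */
	struct xfs_trans_res	*resp;
	struct xfs_trans_res	*end_resp;
	int			log_space = 0;

	xfs_trans_resv_calc_logsize(mp, &resv);

	resp = (struct xfs_trans_res *)&resv;
	end_resp = (struct xfs_trans_res *)(&resv + 1);
	for (; resp < end_resp; resp++) {
		int	tmp = resp->tr_logcount > 1 ?
			      resp->tr_logres * resp->tr_logcount :
			      resp->tr_logres;

		if (log_space < tmp) {
			log_space = tmp;
			*max_resp = *resp;	/* struct copy */
		}
	}
}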

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
  2022-04-22 22:01   ` Dave Chinner
@ 2022-04-26 13:46   ` Christoph Hellwig
  2022-04-26 14:52     ` Darrick J. Wong
  1 sibling, 1 reply; 24+ messages in thread
From: Christoph Hellwig @ 2022-04-26 13:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> index 327ba25e9e17..a07ebaecba73 100644
> --- a/fs/xfs/libxfs/xfs_refcount.c
> +++ b/fs/xfs/libxfs/xfs_refcount.c
> @@ -960,6 +960,7 @@ xfs_refcount_adjust_extents(
>  			 * Either cover the hole (increment) or
>  			 * delete the range (decrement).
>  			 */
> +			cur->bc_ag.refc.nr_ops++;
>  			if (tmp.rc_refcount) {
>  				error = xfs_refcount_insert(cur, &tmp,
>  						&found_tmp);
> @@ -970,7 +971,6 @@ xfs_refcount_adjust_extents(
>  					error = -EFSCORRUPTED;
>  					goto out_error;
>  				}
> -				cur->bc_ag.refc.nr_ops++;

How do these changes fit into the rest of the patch?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink
  2022-04-14 22:54 ` [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink Darrick J. Wong
  2022-04-22 22:03   ` Dave Chinner
@ 2022-04-26 13:47   ` Christoph Hellwig
  1 sibling, 0 replies; 24+ messages in thread
From: Christoph Hellwig @ 2022-04-26 13:47 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: david, linux-xfs

On Thu, Apr 14, 2022 at 03:54:37PM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> This raw call isn't necessary since we can always remove a full delalloc
> extent.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls
  2022-04-26 13:46   ` Christoph Hellwig
@ 2022-04-26 14:52     ` Darrick J. Wong
  0 siblings, 0 replies; 24+ messages in thread
From: Darrick J. Wong @ 2022-04-26 14:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: david, linux-xfs

On Tue, Apr 26, 2022 at 06:46:42AM -0700, Christoph Hellwig wrote:
> On Thu, Apr 14, 2022 at 03:54:31PM -0700, Darrick J. Wong wrote:
> > diff --git a/fs/xfs/libxfs/xfs_refcount.c b/fs/xfs/libxfs/xfs_refcount.c
> > index 327ba25e9e17..a07ebaecba73 100644
> > --- a/fs/xfs/libxfs/xfs_refcount.c
> > +++ b/fs/xfs/libxfs/xfs_refcount.c
> > @@ -960,6 +960,7 @@ xfs_refcount_adjust_extents(
> >  			 * Either cover the hole (increment) or
> >  			 * delete the range (decrement).
> >  			 */
> > +			cur->bc_ag.refc.nr_ops++;
> >  			if (tmp.rc_refcount) {
> >  				error = xfs_refcount_insert(cur, &tmp,
> >  						&found_tmp);
> > @@ -970,7 +971,6 @@ xfs_refcount_adjust_extents(
> >  					error = -EFSCORRUPTED;
> >  					goto out_error;
> >  				}
> > -				cur->bc_ag.refc.nr_ops++;
> 
> How do these changes fit into the rest of the patch?

Long ago, before we used deferred ops to update the refcount btree, we
would restrict the number of blocks that could be bunmapped in order to
avoid overflowing the transaction reservation while doing refcount
updates for an unmapped extent.

Later on, I added deferred refcount updates so that the refcountbt
updates went in their own transaction, which meant that if we started
running out of space because the unmapped extent had many different
refcount records, we could at least return EAGAIN to get a clean
transaction to continue.  I needed a way to guess when we were getting
close to the reservation limit, so I added this nr_ops counter to track
the number of record updates.  Granted, 32 bytes per refcountbt record
update is quite an overestimate, since the records are contiguous in a
btree leaf block.

In the past, the deferred ops code had a defect where it didn't maintain
correct ordering of deferred ops when one dfops' ->finish_item function
queued even more dfops to the transaction, which is what happens when a
refcount update queues EFIs for blocks that are no longer referenced.
This resulted in enormous chains of defer ops, so I left the bunmap
piece in to throttle the chain lengths.

Now we've fixed the deferred ops code to maintain correct dfops order
between existing ops and newly queued ones, so the chaining problem went
away, and we can get rid of the bunmap throttle.  However, as part of
doing that, it's necessary to track the number of EFIs logged to the
transaction as well, which is what the above code change fixes.
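
As a rough illustration of how nr_ops gets used, the space check boils
down to something like the sketch below.  The real
xfs_refcount_still_have_space() also charges for btree shape changes and
honors error injection, so this is a simplified version; the constant
just encodes the 32-bytes-per-update overestimate mentioned above.

/* Rough, deliberately generous log space cost per record update. */
#define XFS_REFCOUNT_ITEM_OVERRUN	32

/*
 * Can we keep adjusting refcount records in this transaction?  If not,
 * the caller stashes the unfinished range and returns -EAGAIN so the
 * refcount intent is requeued in a fresh transaction.
 */
static bool
xfs_refcount_still_have_space(
	struct xfs_btree_cur	*cur)
{
	if (cur->bc_ag.refc.nr_ops == 0)
		return true;
	return cur->bc_tp->t_log_res >
	       cur->bc_ag.refc.nr_ops * XFS_REFCOUNT_ITEM_OVERRUN;
}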

Granted, Dave has also commented that EFIs use a bit more than 32 bytes,
so I think it would be warranted to capture the above comment and the
xfs_refcount.c changes in a separate patch now.

--D

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-04-26 14:52 UTC | newest]

Thread overview: 24+ messages
2022-04-14 22:54 [PATCHSET 0/6] xfs: fix reflink inefficiencies Darrick J. Wong
2022-04-14 22:54 ` [PATCH 1/6] xfs: stop artificially limiting the length of bunmap calls Darrick J. Wong
2022-04-22 22:01   ` Dave Chinner
2022-04-22 22:18     ` Darrick J. Wong
2022-04-22 23:51       ` Dave Chinner
2022-04-26 13:46   ` Christoph Hellwig
2022-04-26 14:52     ` Darrick J. Wong
2022-04-14 22:54 ` [PATCH 2/6] xfs: remove a __xfs_bunmapi call from reflink Darrick J. Wong
2022-04-22 22:03   ` Dave Chinner
2022-04-26 13:47   ` Christoph Hellwig
2022-04-14 22:54 ` [PATCH 3/6] xfs: create shadow transaction reservations for computing minimum log size Darrick J. Wong
2022-04-22 22:36   ` Dave Chinner
2022-04-25 23:39     ` Darrick J. Wong
2022-04-26  4:24       ` Dave Chinner
2022-04-26  5:10         ` Darrick J. Wong
2022-04-14 22:54 ` [PATCH 4/6] xfs: reduce the absurdly large log reservations Darrick J. Wong
2022-04-22 22:51   ` Dave Chinner
2022-04-25 23:47     ` Darrick J. Wong
2022-04-26  4:25       ` Dave Chinner
2022-04-14 22:54 ` [PATCH 5/6] xfs: reduce transaction reservations with reflink Darrick J. Wong
2022-04-22 23:42   ` Dave Chinner
2022-04-25 23:49     ` Darrick J. Wong
2022-04-14 22:54 ` [PATCH 6/6] xfs: rewrite xfs_reflink_end_cow to use intents Darrick J. Wong
2022-04-22 23:50   ` Dave Chinner
