* [PATCH 0/7] xfs: Extend per-inode extent counters.
@ 2020-06-06  8:27 Chandan Babu R
  2020-06-06  8:27 ` [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation Chandan Babu R
                   ` (7 more replies)
  0 siblings, 8 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

The commit xfs: fix inode fork extent count overflow
(3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
per-inode data fork extents should be possible to create. However the
corresponding on-disk field has a signed 32-bit type. Hence this
patchset extends the on-disk field to 64-bit length out of which only
the first 47 bits are valid.

Also, XFS has a per-inode xattr extent counter which is 16 bits
wide. A workload which
1. Creates 1 million 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to insert 400,000 new 255-byte sized xattrs
causes the following message to be printed on the console,

XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173

This indicates that we overflowed the 16-bit wide xattr extent
counter.

I have been informed that there are instances where a single file
has > 100 million hardlinks. With parent pointers being stored in xattrs,
we will overflow the 16-bit wide xattr extent counter when a large
number of hardlinks are created. Hence this patchset extends the
on-disk field to 32-bit length.

This patchset also includes the previously posted "Fix log reservation
calculation for xattr insert operation" patch as a bug fix. It
replaces the xattr set "mount" and "runtime" reservations with just
one static reservation. Hence we don't need the functionality to
calculate maximum sized 'xattr set' reservation separately anymore.

The patches can also be obtained from
https://github.com/chandanr/linux.git at branch xfs-extend-extent-counters.

Chandan Babu R (7):
  xfs: Fix log reservation calculation for xattr insert operation
  xfs: Check for per-inode extent count overflow
  xfs: Compute maximum height of directory BMBT separately
  xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS()
  xfs: Use 2^27 as the maximum number of directory extents
  xfs: Extend data extent counter to 47 bits
  xfs: Extend attr extent counter to 32 bits

 fs/xfs/libxfs/xfs_attr.c        |  11 +--
 fs/xfs/libxfs/xfs_bmap.c        | 118 +++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_bmap.h        |   3 +-
 fs/xfs/libxfs/xfs_bmap_btree.h  |   4 +-
 fs/xfs/libxfs/xfs_format.h      |  49 ++++++++++---
 fs/xfs/libxfs/xfs_inode_buf.c   |  65 ++++++++++++++---
 fs/xfs/libxfs/xfs_inode_buf.h   |   2 +
 fs/xfs/libxfs/xfs_inode_fork.c  | 125 ++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_inode_fork.h  |   2 +
 fs/xfs/libxfs/xfs_log_format.h  |   8 +-
 fs/xfs/libxfs/xfs_log_rlimit.c  |  29 --------
 fs/xfs/libxfs/xfs_trans_resv.c  |  75 +++++++++----------
 fs/xfs/libxfs/xfs_trans_resv.h  |   9 +--
 fs/xfs/libxfs/xfs_trans_space.h |  48 ++++++------
 fs/xfs/libxfs/xfs_types.h       |  11 ++-
 fs/xfs/scrub/inode.c            |  14 ++--
 fs/xfs/xfs_bmap_item.c          |   3 +-
 fs/xfs/xfs_inode.c              |  10 ++-
 fs/xfs/xfs_inode_item.c         |  10 ++-
 fs/xfs/xfs_inode_item_recover.c |  22 +++++-
 fs/xfs/xfs_mount.c              |   5 +-
 fs/xfs/xfs_mount.h              |   1 +
 fs/xfs/xfs_reflink.c            |   4 +-
 23 files changed, 451 insertions(+), 177 deletions(-)

-- 
2.20.1



* [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-19 14:33   ` Christoph Hellwig
  2020-06-06  8:27 ` [PATCH 2/7] xfs: Check for per-inode extent count overflow Chandan Babu R
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch, Dave Chinner

Log space reservation for xattr insert operation is divided into two
parts,
1. Mount time
   - Inode
   - Superblock for accounting space allocations
   - AGF for accounting space used by count, block number, rmap and refcnt
     btrees.

2. The remaining log space can only be calculated at run time because,
   - A local xattr can be large enough to cause a double split of the da
     btree.
   - The value of the xattr can be large enough to be stored in remote
     blocks. The contents of the remote blocks are not logged.

   The log space reservation could be,
   - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
     case xattr is large enough to cause another split of the da btree path.
   - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
     entries.
   - Space for logging blocks of count, block number, rmap and refcnt btrees.

At present, mount time log reservation includes block count required for a
single split of the dabtree. The dabtree block count is also taken into
account by xfs_attr_calc_size().

Also, AGF log space reservation isn't accounted for.

Due to the reasons mentioned above, log reservation calculation for xattr
insert operation gives an incorrect value.

Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.

The above-mentioned inconsistencies were discovered when mounting a
modified XFS filesystem, which uses a 32-bit value as the xattr extent
counter, caused the following warning messages to be printed on the
console,

XFS (loop0): Mounting V4 Filesystem
XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
XFS (loop0): Log size out of supported range.
XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
XFS (loop0): Ending clean mount

To fix the inconsistencies described above, this commit replaces 'mount'
and 'runtime' components with just one static reservation. The new
reservation calculates the log space for the worst case possible i.e. it
considers,
1. Double split of the da btree.
   This happens for large local xattrs.
2. Bmbt blocks required for mapping the contents of a maximum
   sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.

Suggested-by: Dave Chinner <dchinner@redhat.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_attr.c        |  6 +---
 fs/xfs/libxfs/xfs_log_rlimit.c  | 29 ------------------
 fs/xfs/libxfs/xfs_trans_resv.c  | 54 +++++++++++++++------------------
 fs/xfs/libxfs/xfs_trans_resv.h  |  5 +--
 fs/xfs/libxfs/xfs_trans_space.h |  7 ++++-
 5 files changed, 32 insertions(+), 69 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index 3b1bd6e112f8..a4b23edf887e 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -337,11 +337,7 @@ xfs_attr_set(
 				return error;
 		}
 
-		tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
-				 M_RES(mp)->tr_attrsetrt.tr_logres *
-					args->total;
-		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
-		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
+		tres = M_RES(mp)->tr_attrset;
 		total = args->total;
 	} else {
 		XFS_STATS_INC(mp, xs_attr_remove);
diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
index 7f55eb3f3653..7aa9e6684ecd 100644
--- a/fs/xfs/libxfs/xfs_log_rlimit.c
+++ b/fs/xfs/libxfs/xfs_log_rlimit.c
@@ -15,27 +15,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_bmap_btree.h"
 
-/*
- * Calculate the maximum length in bytes that would be required for a local
- * attribute value as large attributes out of line are not logged.
- */
-STATIC int
-xfs_log_calc_max_attrsetm_res(
-	struct xfs_mount	*mp)
-{
-	int			size;
-	int			nblks;
-
-	size = xfs_attr_leaf_entsize_local_max(mp->m_attr_geo->blksize) -
-	       MAXNAMELEN - 1;
-	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
-	nblks += XFS_B_TO_FSB(mp, size);
-	nblks += XFS_NEXTENTADD_SPACE_RES(mp, size, XFS_ATTR_FORK);
-
-	return  M_RES(mp)->tr_attrsetm.tr_logres +
-		M_RES(mp)->tr_attrsetrt.tr_logres * nblks;
-}
-
 /*
  * Iterate over the log space reservation table to figure out and return
  * the maximum one in terms of the pre-calculated values which were done
@@ -49,9 +28,6 @@ xfs_log_get_max_trans_res(
 	struct xfs_trans_res	*resp;
 	struct xfs_trans_res	*end_resp;
 	int			log_space = 0;
-	int			attr_space;
-
-	attr_space = xfs_log_calc_max_attrsetm_res(mp);
 
 	resp = (struct xfs_trans_res *)M_RES(mp);
 	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
@@ -64,11 +40,6 @@ xfs_log_get_max_trans_res(
 			*max_resp = *resp;		/* struct copy */
 		}
 	}
-
-	if (attr_space > log_space) {
-		*max_resp = M_RES(mp)->tr_attrsetm;	/* struct copy */
-		max_resp->tr_logres = attr_space;
-	}
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d1a0848cb52e..b44b521c605c 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -19,6 +19,7 @@
 #include "xfs_trans.h"
 #include "xfs_qm.h"
 #include "xfs_trans_space.h"
+#include "xfs_attr_remote.h"
 
 #define _ALLOC	true
 #define _FREE	false
@@ -698,42 +699,36 @@ xfs_calc_attrinval_reservation(
 }
 
 /*
- * Setting an attribute at mount time.
+ * Setting an attribute.
  *	the inode getting the attribute
  *	the superblock for allocations
- *	the agfs extents are allocated from
+ *	the agf extents are allocated from
  *	the attribute btree * max depth
- *	the inode allocation btree
- * Since attribute transaction space is dependent on the size of the attribute,
- * the calculation is done partially at mount time and partially at runtime(see
- * below).
+ *	the bmbt entries for da btree blocks
+ *	the bmbt entries for remote blocks (if any)
+ *	the allocation btrees.
  */
 STATIC uint
-xfs_calc_attrsetm_reservation(
+xfs_calc_attrset_reservation(
 	struct xfs_mount	*mp)
 {
+	int			max_rmt_blks;
+	int			da_blks;
+	int			bmbt_blks;
+
+	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
+	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
+
+	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
+	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
+
 	return XFS_DQUOT_LOGRES(mp) +
 		xfs_calc_inode_res(mp, 1) +
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH, XFS_FSB_TO_B(mp, 1));
-}
-
-/*
- * Setting an attribute at runtime, transaction space unit per block.
- * 	the superblock for allocations: sector size
- *	the inode bmap btree could join or split: max depth * block size
- * Since the runtime attribute transaction space is dependent on the total
- * blocks needed for the 1st bmap, here we calculate out the space unit for
- * one block so that the caller could figure out the total space according
- * to the attibute extent length in blocks by:
- *	ext * M_RES(mp)->tr_attrsetrt.tr_logres
- */
-STATIC uint
-xfs_calc_attrsetrt_reservation(
-	struct xfs_mount	*mp)
-{
-	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
+		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		xfs_calc_buf_res(da_blks, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(bmbt_blks, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, da_blks),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -897,9 +892,9 @@ xfs_trans_resv_calc(
 	resp->tr_attrinval.tr_logcount = XFS_ATTRINVAL_LOG_COUNT;
 	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
-	resp->tr_attrsetm.tr_logres = xfs_calc_attrsetm_reservation(mp);
-	resp->tr_attrsetm.tr_logcount = XFS_ATTRSET_LOG_COUNT;
-	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_attrset.tr_logres = xfs_calc_attrset_reservation(mp);
+	resp->tr_attrset.tr_logcount = XFS_ATTRSET_LOG_COUNT;
+	resp->tr_attrset.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_attrrm.tr_logres = xfs_calc_attrrm_reservation(mp);
 	resp->tr_attrrm.tr_logcount = XFS_ATTRRM_LOG_COUNT;
@@ -942,7 +937,6 @@ xfs_trans_resv_calc(
 	resp->tr_ichange.tr_logres = xfs_calc_ichange_reservation(mp);
 	resp->tr_fsyncts.tr_logres = xfs_calc_swrite_reservation(mp);
 	resp->tr_writeid.tr_logres = xfs_calc_writeid_reservation(mp);
-	resp->tr_attrsetrt.tr_logres = xfs_calc_attrsetrt_reservation(mp);
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7241ab28cf84..f50996ae18e6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -35,10 +35,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_writeid;	/* write setuid/setgid file */
 	struct xfs_trans_res	tr_attrinval;	/* attr fork buffer
 						 * invalidation */
-	struct xfs_trans_res	tr_attrsetm;	/* set/create an attribute at
-						 * mount time */
-	struct xfs_trans_res	tr_attrsetrt;	/* set/create an attribute at
-						 * runtime */
+	struct xfs_trans_res	tr_attrset;	/* set/create an attribute */
 	struct xfs_trans_res	tr_attrrm;	/* remove an attribute */
 	struct xfs_trans_res	tr_clearagi;	/* clear agi unlinked bucket */
 	struct xfs_trans_res	tr_growrtalloc;	/* grow realtime allocations */
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 88221c7a04cc..b559af70cf51 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -38,8 +38,13 @@
 
 #define	XFS_DAENTER_1B(mp,w)	\
 	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
+/*
+ * xattr set operation can cause the da btree to split once from the
+ * root to leaf and also allocate an extra leaf node. The '1' in the
+ * macro below accounts for the extra leaf node.
+ */
 #define	XFS_DAENTER_DBS(mp,w)	\
-	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
+	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
 #define	XFS_DAENTER_BLOCKS(mp,w)	\
 	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
 #define	XFS_DAENTER_BMAP1B(mp,w)	\
-- 
2.20.1



* [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
  2020-06-06  8:27 ` [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 16:24   ` Darrick J. Wong
  2020-06-06  8:27 ` [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately Chandan Babu R
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

The following error message was noticed when a workload added one
million xattrs, deleted 50% of them and then inserted 400,000 new
xattrs.

XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00

The error message was printed while unmounting the filesystem. The
value printed under "total extents" indicates that we overflowed the
per-inode signed 16-bit xattr extent counter.

Instead of letting this silent corruption occur, this patch checks for
extent counter (both data and xattr) overflow before we assign the
new value to the corresponding in-memory extent counter.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
 fs/xfs/libxfs/xfs_inode_fork.h |  1 +
 3 files changed, 104 insertions(+), 18 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index edc63dba007f..798fca5c52af 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
 	xfs_iext_first(ifp, &icur);
 	xfs_iext_insert(ip, &icur, &rec, 0);
 
-	ifp->if_nextents = 1;
+	error = xfs_next_set(ip, whichfork, 1);
+	if (error)
+		goto done;
+
 	ip->i_d.di_nblocks = 1;
 	xfs_trans_mod_dquot_byino(tp, ip,
 		XFS_TRANS_DQ_BCOUNT, 1L);
@@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
 		xfs_iext_remove(bma->ip, &bma->icur, state);
 		xfs_iext_prev(ifp, &bma->icur);
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
-		ifp->if_nextents--;
+
+		error = xfs_next_set(bma->ip, whichfork, -1);
+		if (error)
+			goto done;
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
 		PREV.br_startblock = new->br_startblock;
 		PREV.br_state = new->br_state;
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(bma->ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
 		 * The left neighbor is not contiguous.
 		 */
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(bma->ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
 		 * The right neighbor is not contiguous.
 		 */
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(bma->ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
 		xfs_iext_next(ifp, &bma->icur);
 		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
 		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(bma->ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (bma->cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &LEFT);
-		ifp->if_nextents -= 2;
+
+		error = xfs_next_set(ip, whichfork, -2);
+		if (error)
+			goto done;
+
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &LEFT);
-		ifp->if_nextents--;
+
+		error = xfs_next_set(ip, whichfork, -1);
+		if (error)
+			goto done;
+
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
 		else {
@@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &PREV);
-		ifp->if_nextents--;
+
+		error = xfs_next_set(ip, whichfork, -1);
+		if (error)
+			goto done;
 
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
 
 		xfs_iext_update_extent(ip, state, icur, &PREV);
 		xfs_iext_insert(ip, icur, new, state);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_update_extent(ip, state, icur, &PREV);
 		xfs_iext_next(ifp, icur);
 		xfs_iext_insert(ip, icur, new, state);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_next(ifp, icur);
 		xfs_iext_insert(ip, icur, &r[1], state);
 		xfs_iext_insert(ip, icur, &r[0], state);
-		ifp->if_nextents += 2;
+
+		error = xfs_next_set(ip, whichfork, 2);
+		if (error)
+			goto done;
 
 		if (cur == NULL)
 			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
@@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &left);
-		ifp->if_nextents--;
+
+		error = xfs_next_set(ip, whichfork, -1);
+		if (error)
+			goto done;
 
 		if (cur == NULL) {
 			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
@@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
 		 * Insert a new entry.
 		 */
 		xfs_iext_insert(ip, icur, new, state);
-		ifp->if_nextents++;
+
+		error = xfs_next_set(ip, whichfork, 1);
+		if (error)
+			goto done;
 
 		if (cur == NULL) {
 			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
@@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
 		 */
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
-		ifp->if_nextents--;
+
+		error = xfs_next_set(ip, whichfork, -1);
+		if (error)
+			goto done;
 
 		flags |= XFS_ILOG_CORE;
 		if (!cur) {
@@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
 		} else
 			flags |= xfs_ilog_fext(whichfork);
 
-		ifp->if_nextents++;
+		error = xfs_next_set(ip, whichfork, 1);
+		if (error)
+			goto done;
+
 		xfs_iext_next(ifp, icur);
 		xfs_iext_insert(ip, icur, &new, state);
 		break;
@@ -5660,7 +5710,10 @@ xfs_bmse_merge(
 	 * Update the on-disk extent count, the btree if necessary and log the
 	 * inode.
 	 */
-	ifp->if_nextents--;
+	error = xfs_next_set(ip, whichfork, -1);
+	if (error)
+		goto done;
+
 	*logflags |= XFS_ILOG_CORE;
 	if (!cur) {
 		*logflags |= XFS_ILOG_DEXT;
@@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
 	/* Add new extent */
 	xfs_iext_next(ifp, &icur);
 	xfs_iext_insert(ip, &icur, &new, 0);
-	ifp->if_nextents++;
+
+	error = xfs_next_set(ip, whichfork, 1);
+	if (error)
+		goto del_cursor;
 
 	if (cur) {
 		error = xfs_bmbt_lookup_eq(cur, &new, &i);
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 28b366275ae0..3bf5a2c391bd 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
 
 	return 0;
 }
+
+int
+xfs_next_set(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	int			delta)
+{
+	struct xfs_ifork	*ifp;
+	int64_t			nr_exts;
+	int64_t			max_exts;
+
+	ifp = XFS_IFORK_PTR(ip, whichfork);
+
+	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
+		max_exts = MAXEXTNUM;
+	else if (whichfork == XFS_ATTR_FORK)
+		max_exts = MAXAEXTNUM;
+	else
+		ASSERT(0);
+
+	nr_exts = ifp->if_nextents + delta;
+	if ((delta > 0 && nr_exts > max_exts)
+		|| (delta < 0 && nr_exts < 0))
+		return -EOVERFLOW;
+
+	ifp->if_nextents = nr_exts;
+
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index a4953e95c4f3..a84ae42ace79 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
 int xfs_ifork_verify_local_data(struct xfs_inode *ip);
 int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
 
+int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
 #endif	/* __XFS_INODE_FORK_H__ */
-- 
2.20.1



* [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
  2020-06-06  8:27 ` [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation Chandan Babu R
  2020-06-06  8:27 ` [PATCH 2/7] xfs: Check for per-inode extent count overflow Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 20:59   ` Darrick J. Wong
  2020-06-06  8:27 ` [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS() Chandan Babu R
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

xfs/306 causes the following call trace when using a data fork with a
maximum extent count of 2^47,

 XFS (loop0): Mounting V5 Filesystem
 XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
 XFS (loop0): AAIEEE! Log failed size checks. Abort!
 XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
 Modules linked in:
 CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
 RIP: 0010:assfail+0x25/0x28
 Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
 RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
 RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
 R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
 R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
 FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
 Call Trace:
  xfs_log_mount+0xf8/0x300
  xfs_mountfs+0x46e/0x950
  xfs_fc_fill_super+0x318/0x510
  ? xfs_mount_free+0x30/0x30
  get_tree_bdev+0x15c/0x250
  vfs_get_tree+0x25/0xb0
  do_mount+0x740/0x9b0
  ? memdup_user+0x41/0x80
  __x64_sys_mount+0x8e/0xd0
  do_syscall_64+0x48/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fd6c5f2ccda
 Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
 RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
 RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
 RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
 RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
 ---[ end trace 6436391b468bc652 ]---
 XFS (loop0): log mount failed

The corresponding filesystem was created using mkfs options
"-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".

i.e. we have a filesystem of size 20MiB, a data block size of 1KiB and
a directory block size of 64KiB. Filesystems of size < 1GiB can have an
on-disk log smaller than 10MiB (please refer to calculate_log_size() in
xfsprogs).

The largest reservation space was contributed by the rename
operation. The corresponding calculation is done inside
xfs_calc_rename_reservation(). In this case, the value returned by this
function is,

xfs_calc_inode_res(mp, 4)
+ xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))

xfs_calc_inode_res(mp, 4) returns a constant value of 3040 bytes
regardless of the maximum data fork extent count.

The largest contribution to the rename operation was by "2 *
XFS_DIROP_LOG_COUNT(mp)" and it is a function of maximum height of a
directory's BMBT tree.

XFS_DIROP_LOG_COUNT() is a sum of,

1. The maximum number of dabtree blocks that needs to be logged
   i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) *
   XFS_DAENTER_DBS(mp,w).  For directories, this evaluates
   to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64 * (5 + 2)) = 448.

2. The corresponding maximum number of BMBT blocks that needs to be
   logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
   XFS_DAENTER_BMAP1B(mp,w)

   XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7

   XFS_DAENTER_BMAP1B(mp,w)
   = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
   = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
   = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)

   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() =
   mp->m_alloc_mxr[0] - mp->m_alloc_mnr[0] = 121 - 60 = 61

   XFS_DAENTER_BMAP1B(mp,w) =
   ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
   = 2 * (8 - 1)
   = 14

   With 2^32 as the maximum extent count the maximum height of the bmap btree
   was 7. Now with 2^47 maximum extent count, the height has increased to 8.

   Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.

XFS_DIROP_LOG_COUNT() = 448 + 98 = 546.
2 * XFS_DIROP_LOG_COUNT() = 2 * 546 = 1092.

With 2^32 max extent count, XFS_DIROP_LOG_COUNT() evaluates to
533. Hence 2 * XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.

This small difference of 1092 - 1066 = 26 fs blocks is sufficient to
trip us over the minimum log size check.

A future commit in this series will use 2^27 as the maximum directory
extent count. This will result in a shorter directory BMBT tree.  Log
reservation calculations that are applicable only to
directories (e.g. XFS_DIROP_LOG_COUNT()) can then choose this instead of
non-dir data fork BMBT height.

This commit introduces a new member in 'struct xfs_mount' to hold the
maximum BMBT height of a directory. At present, the maximum height of a
directory BMBT is the same as the maximum height of a non-directory
BMBT. A future commit will change the parameters used as input for
computing the maximum height of a directory BMBT.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++++++++---
 fs/xfs/libxfs/xfs_bmap.h |  3 ++-
 fs/xfs/xfs_mount.c       |  5 +++--
 fs/xfs/xfs_mount.h       |  1 +
 4 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 798fca5c52af..01e2b543b139 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -50,7 +50,8 @@ kmem_zone_t		*xfs_bmap_free_item_zone;
 void
 xfs_bmap_compute_maxlevels(
 	xfs_mount_t	*mp,		/* file system mount structure */
-	int		whichfork)	/* data or attr fork */
+	int		whichfork,	/* data or attr fork */
+	int		dir_bmbt)	/* Dir or non-dir data fork */
 {
 	int		level;		/* btree level */
 	uint		maxblocks;	/* max blocks at this level */
@@ -60,6 +61,9 @@ xfs_bmap_compute_maxlevels(
 	int		minnoderecs;	/* min records in node block */
 	int		sz;		/* root block size */
 
+	if (whichfork == XFS_ATTR_FORK)
+		ASSERT(dir_bmbt == 0);
+
 	/*
 	 * The maximum number of extents in a file, hence the maximum number of
 	 * leaf entries, is controlled by the size of the on-disk extent count,
@@ -75,8 +79,11 @@ xfs_bmap_compute_maxlevels(
 	 * of a minimum size available.
 	 */
 	if (whichfork == XFS_DATA_FORK) {
-		maxleafents = MAXEXTNUM;
 		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
+		if (dir_bmbt)
+			maxleafents = MAXEXTNUM;
+		else
+			maxleafents = MAXEXTNUM;
 	} else {
 		maxleafents = MAXAEXTNUM;
 		sz = XFS_BMDR_SPACE_CALC(MINABTPTRS);
@@ -91,7 +98,11 @@ xfs_bmap_compute_maxlevels(
 		else
 			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
 	}
-	mp->m_bm_maxlevels[whichfork] = level;
+
+	if (whichfork == XFS_DATA_FORK && dir_bmbt)
+		mp->m_bm_dir_maxlevel = level;
+	else
+		mp->m_bm_maxlevels[whichfork] = level;
 }
 
 STATIC int				/* error */
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 6028a3c825ba..4250c9ab4b75 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -187,7 +187,8 @@ void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
 void	__xfs_bmap_add_free(struct xfs_trans *tp, xfs_fsblock_t bno,
 		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
 		bool skip_discard);
-void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
+void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork,
+		int dir_bmbt);
 int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
 int	xfs_bmap_last_before(struct xfs_trans *tp, struct xfs_inode *ip,
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index bb91f04266b9..d8ebfc67bb63 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -711,8 +711,9 @@ xfs_mountfs(
 		goto out;
 
 	xfs_alloc_compute_maxlevels(mp);
-	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
-	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
+	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 0);
+	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 1);
+	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK, 0);
 	xfs_ialloc_setup_geometry(mp);
 	xfs_rmapbt_compute_maxlevels(mp);
 	xfs_refcountbt_compute_maxlevels(mp);
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index aba5a1579279..9dbf036ddace 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -133,6 +133,7 @@ typedef struct xfs_mount {
 	uint			m_refc_mnr[2];	/* min refc btree records */
 	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
 	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
+	uint			m_bm_dir_maxlevel;
 	uint			m_rmap_maxlevels; /* max rmap btree levels */
 	uint			m_refc_maxlevels; /* max refcount btree level */
 	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
-- 
2.20.1



* [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS()
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (2 preceding siblings ...)
  2020-06-06  8:27 ` [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 17:50   ` Darrick J. Wong
  2020-06-06  8:27 ` [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents Chandan Babu R
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

XFS_BM_MAXLEVELS() returns the maximum possible height of the BMBT for
either the data or the attribute fork. For data forks, this commit adds
a new argument to XFS_BM_MAXLEVELS() that lets callers choose between
the maximum heights of dir and non-dir BMBTs.

As of this commit, both dir and non-dir BMBTs have the same maximum
height. A future commit in this series will use an extent count of 2^27
as the input when computing the maximum height of a directory BMBT,
which will in turn cause the maximum heights of dir and non-dir BMBTs
to differ.
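
For illustration only, the selection done by the new macro argument can
be sketched in a standalone form. struct mount_sketch, the SKETCH_*
names and sketch_maxlevels() are hypothetical stand-ins for struct
xfs_mount, the fork constants and XFS_BM_MAXLEVELS(); only the two
fields the macro reads are modelled.

```c
#include <assert.h>

/* Hypothetical stand-ins for XFS_DATA_FORK and XFS_ATTR_FORK. */
enum { SKETCH_DATA_FORK = 0, SKETCH_ATTR_FORK = 1 };

/* Hypothetical subset of struct xfs_mount. */
struct mount_sketch {
	unsigned int	m_bm_maxlevels[2];	/* per-fork maximum height */
	unsigned int	m_bm_dir_maxlevel;	/* directory data fork height */
};

/* Same shape as the macro added by this patch. */
#define SKETCH_BM_MAXLEVELS(mp, w, use_dir_bmbt) \
	((!(use_dir_bmbt)) ? \
		(mp)->m_bm_maxlevels[(w)] : (mp)->m_bm_dir_maxlevel)

static unsigned int
sketch_maxlevels(
	const struct mount_sketch	*mp,
	int				whichfork,
	int				use_dir_bmbt)
{
	/* The directory height is only meaningful for the data fork. */
	assert(whichfork == SKETCH_DATA_FORK || !use_dir_bmbt);
	return SKETCH_BM_MAXLEVELS(mp, whichfork, use_dir_bmbt);
}
```

Callers that reserve space for directory operations would pass 1 as the
last argument, while all other callers pass 0 and keep the existing
per-fork value.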

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_attr.c        |  5 ++--
 fs/xfs/libxfs/xfs_bmap.c        |  5 ++--
 fs/xfs/libxfs/xfs_bmap_btree.h  |  4 +++-
 fs/xfs/libxfs/xfs_trans_resv.c  | 25 +++++++++++---------
 fs/xfs/libxfs/xfs_trans_resv.h  |  4 ++--
 fs/xfs/libxfs/xfs_trans_space.h | 41 +++++++++++++++++----------------
 fs/xfs/xfs_bmap_item.c          |  3 ++-
 fs/xfs/xfs_reflink.c            |  4 ++--
 8 files changed, 50 insertions(+), 41 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index a4b23edf887e..357e29a5a167 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -150,7 +150,7 @@ xfs_attr_calc_size(
 	 * "local" or "remote" (note: local != inline).
 	 */
 	size = xfs_attr_leaf_newentsize(args, local);
-	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
+	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0);
 	if (*local) {
 		if (size > (args->geo->blksize / 2)) {
 			/* Double split possible */
@@ -163,7 +163,8 @@ xfs_attr_calc_size(
 		 */
 		uint	dblocks = xfs_attr3_rmt_blocks(mp, args->valuelen);
 		nblks += dblocks;
-		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks, XFS_ATTR_FORK);
+		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks,
+				XFS_ATTR_FORK, 0);
 	}
 
 	return nblks;
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 01e2b543b139..8b0029b3cecf 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -182,13 +182,14 @@ xfs_bmap_worst_indlen(
 	mp = ip->i_mount;
 	maxrecs = mp->m_bmap_dmxr[0];
 	for (level = 0, rval = 0;
-	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
+	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0);
 	     level++) {
 		len += maxrecs - 1;
 		do_div(len, maxrecs);
 		rval += len;
 		if (len == 1)
-			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
+			return rval +
+				XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) -
 				level - 1;
 		if (level == 0)
 			maxrecs = mp->m_bmap_dmxr[1];
diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
index 72bf74c79fb9..a047be5883d1 100644
--- a/fs/xfs/libxfs/xfs_bmap_btree.h
+++ b/fs/xfs/libxfs/xfs_bmap_btree.h
@@ -79,7 +79,9 @@ struct xfs_trans;
 /*
  * Maximum number of bmap btree levels.
  */
-#define XFS_BM_MAXLEVELS(mp,w)		((mp)->m_bm_maxlevels[(w)])
+#define XFS_BM_MAXLEVELS(mp,w,use_dir_bmbt) \
+	((!(use_dir_bmbt)) ? \
+		(mp)->m_bm_maxlevels[(w)] : (mp)->m_bm_dir_maxlevel)
 
 /*
  * Prototypes for xfs_bmap.c to call.
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index b44b521c605c..39cfca1b71b6 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -265,14 +265,14 @@ xfs_calc_write_reservation(
 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
 
 	t1 = xfs_calc_inode_res(mp, 1) +
-	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) +
+	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0), blksz) +
 	     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
 
 	if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
 		t2 = xfs_calc_inode_res(mp, 1) +
-		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
-				     blksz) +
+		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
+			blksz) +
 		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) +
 		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz);
@@ -313,7 +313,8 @@ xfs_calc_itruncate_reservation(
 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
 
 	t1 = xfs_calc_inode_res(mp, 1) +
-	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
+	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) + 1,
+			     blksz);
 
 	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
 	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
@@ -592,7 +593,7 @@ xfs_calc_growrtalloc_reservation(
 	struct xfs_mount	*mp)
 {
 	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
+		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_inode_res(mp, 1) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
@@ -669,7 +670,7 @@ xfs_calc_addafork_reservation(
 		xfs_calc_inode_res(mp, 1) +
 		xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
 		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
-		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
+		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0) + 1,
 				 XFS_FSB_TO_B(mp, 1)) +
 		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
 				 XFS_FSB_TO_B(mp, 1));
@@ -691,7 +692,7 @@ xfs_calc_attrinval_reservation(
 	struct xfs_mount	*mp)
 {
 	return max((xfs_calc_inode_res(mp, 1) +
-		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
+		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0),
 				     XFS_FSB_TO_B(mp, 1))),
 		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
 		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
@@ -717,10 +718,11 @@ xfs_calc_attrset_reservation(
 	int			bmbt_blks;
 
 	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
-	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
+	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK, 0);
 
 	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
-	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
+	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks,
+			XFS_ATTR_FORK, 0);
 
 	return XFS_DQUOT_LOGRES(mp) +
 		xfs_calc_inode_res(mp, 1) +
@@ -752,8 +754,9 @@ xfs_calc_attrrm_reservation(
 		     xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH,
 				      XFS_FSB_TO_B(mp, 1)) +
 		     (uint)XFS_FSB_TO_B(mp,
-					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
-		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
+				XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0)) +
+		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
+				     0)),
 		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
 		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
 				      XFS_FSB_TO_B(mp, 1))));
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index f50996ae18e6..d64989eeebd7 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -61,10 +61,10 @@ struct xfs_trans_resv {
  */
 #define	XFS_DIROP_LOG_RES(mp)	\
 	(XFS_FSB_TO_B(mp, XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK)) + \
-	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)))
+	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)))
 #define	XFS_DIROP_LOG_COUNT(mp)	\
 	(XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK) + \
-	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)
+	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)
 
 /*
  * Various log count values.
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index b559af70cf51..c51d809a16b1 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -25,15 +25,16 @@
 
 #define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)    \
 		(((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
-#define	XFS_EXTENTADD_SPACE_RES(mp,w)	(XFS_BM_MAXLEVELS(mp,w) - 1)
-#define XFS_NEXTENTADD_SPACE_RES(mp,b,w)\
+#define	XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt)	\
+	(XFS_BM_MAXLEVELS(mp,w,dbmbt) - 1)
+#define XFS_NEXTENTADD_SPACE_RES(mp,b,w,dbmbt)		   \
 	(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
 	  XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
-	  XFS_EXTENTADD_SPACE_RES(mp,w))
+		XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt))
 
 /* Blocks we might need to add "b" mappings & rmappings to a file. */
-#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
-	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w)) + \
+#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)	    \
+	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w), 0) +	\
 	 XFS_NRMAPADD_SPACE_RES((mp), (b)))
 
 #define	XFS_DAENTER_1B(mp,w)	\
@@ -47,19 +48,19 @@
 	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
 #define	XFS_DAENTER_BLOCKS(mp,w)	\
 	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
-#define	XFS_DAENTER_BMAP1B(mp,w)	\
-	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
-#define	XFS_DAENTER_BMAPS(mp,w)		\
-	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w))
-#define	XFS_DAENTER_SPACE_RES(mp,w)	\
-	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w))
-#define	XFS_DAREMOVE_SPACE_RES(mp,w)	XFS_DAENTER_BMAPS(mp,w)
+#define	XFS_DAENTER_BMAP1B(mp,w,dbmbt)	\
+	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w, dbmbt)
+#define	XFS_DAENTER_BMAPS(mp,w,dbmbt)	\
+	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w,dbmbt))
+#define	XFS_DAENTER_SPACE_RES(mp,w,dbmbt)	\
+	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w,dbmbt))
+#define	XFS_DAREMOVE_SPACE_RES(mp,w,dbmbt)	XFS_DAENTER_BMAPS(mp,w,dbmbt)
 #define	XFS_DIRENTER_MAX_SPLIT(mp,nl)	1
 #define	XFS_DIRENTER_SPACE_RES(mp,nl)	\
-	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK) * \
+	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK, 1) *	\
 	 XFS_DIRENTER_MAX_SPLIT(mp,nl))
 #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
-	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
+	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK, 1)
 #define	XFS_IALLOC_SPACE_RES(mp)	\
 	(M_IGEO(mp)->ialloc_blks + \
 	 (xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1 * \
@@ -69,26 +70,26 @@
  * Space reservation values for various transactions.
  */
 #define	XFS_ADDAFORK_SPACE_RES(mp)	\
-	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK))
+	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0))
 #define	XFS_ATTRRM_SPACE_RES(mp)	\
-	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK)
+	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK, 0)
 /* This macro is not used - see inline code in xfs_attr_set */
 #define	XFS_ATTRSET_SPACE_RES(mp, v)	\
-	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK) + XFS_B_TO_FSB(mp, v))
+	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0) + XFS_B_TO_FSB(mp, v))
 #define	XFS_CREATE_SPACE_RES(mp,nl)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
 #define	XFS_DIOSTRAT_SPACE_RES(mp, v)	\
-	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
+	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + (v))
 #define	XFS_GROWFS_SPACE_RES(mp)	\
 	(2 * (mp)->m_ag_maxlevels)
 #define	XFS_GROWFSRT_SPACE_RES(mp,b)	\
-	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK))
+	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0))
 #define	XFS_LINK_SPACE_RES(mp,nl)	\
 	XFS_DIRENTER_SPACE_RES(mp,nl)
 #define	XFS_MKDIR_SPACE_RES(mp,nl)	\
 	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
 #define	XFS_QM_DQALLOC_SPACE_RES(mp)	\
-	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + \
+	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + \
 	 XFS_DQUOT_CLUSTER_SIZE_FSB)
 #define	XFS_QM_QINOCREATE_SPACE_RES(mp)	\
 	XFS_IALLOC_SPACE_RES(mp)
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 6736c5ab188f..0a8a8377a150 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -482,7 +482,8 @@ xfs_bui_item_recover(
 	}
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
+			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0), 0,
+			0, &tp);
 	if (error)
 		return error;
 	/*
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 107bf2a2f344..fd35a0bf2c47 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -614,7 +614,7 @@ xfs_reflink_end_cow_extent(
 		return 0;
 	}
 
-	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
+	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
 			XFS_TRANS_RESERVE, &tp);
 	if (error)
@@ -1017,7 +1017,7 @@ xfs_reflink_remap_extent(
 	}
 
 	/* Start a rolling transaction to switch the mappings */
-	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
+	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK, 0);
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
 	if (error)
 		goto out;
-- 
2.20.1



* [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (3 preceding siblings ...)
  2020-06-06  8:27 ` [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS() Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 16:52   ` Darrick J. Wong
  2020-06-06  8:27 ` [PATCH 6/7] xfs: Extend data extent counter to 47 bits Chandan Babu R
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

The maximum number of extents that can be used by a directory can be
calculated as shown below. A filesystem block size of 512 bytes is
assumed, since the smallest allowed block size yields the BMBT of
maximum possible height.

Maximum number of extents in data space =
XFS_DIR2_SPACE_SIZE / 2^9 = 32GiB / 2^9 = 2^26.

Maximum number (theoretically) of extents in leaf space =
32GiB / 2^9 = 2^26.

Maximum number of entries in a free space index block
= (512 - sizeof(struct xfs_dir3_free_hdr)) / sizeof(xfs_dir2_data_off_t)
= (512 - 64) / 2 = 224

Maximum number of extents in free space index =
(Maximum number of extents in data space) / 224 =
2^26 / 224 = ~2^18

Maximum number of extents in a directory =
Maximum number of extents in data space +
Maximum number of extents in leaf space +
Maximum number of extents in free space index =
2^26 + 2^26 + 2^18 = ~2^27

This commit defines the macro MAXDIREXTNUM as 2^27 - 1 (0x7ffffff, i.e.
27 bits), which is in turn used when calculating the maximum height of
a directory BMBT.
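
The arithmetic above can be reproduced with a small standalone sketch.
All SKETCH_* constants and function names are hypothetical; the sizes
(32GiB per directory space, 64-byte free header, 2-byte entries) are the
ones quoted in this commit message. Note that the exact sum comes out
slightly above 2^27, which the message rounds to ~2^27.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_DIR_SPACE_SIZE	(UINT64_C(1) << 35)	/* 32GiB */
#define SKETCH_FREE_HDR_SIZE	UINT64_C(64)	/* free index block header */
#define SKETCH_FREE_ENTRY_SIZE	UINT64_C(2)	/* one free index entry */

/* Worst case: every block of a 32GiB space is its own extent. */
static uint64_t
dir_space_extents(uint64_t blksz)
{
	assert(blksz > 0);
	return SKETCH_DIR_SPACE_SIZE / blksz;
}

/* Entries that fit in one free space index block. */
static uint64_t
free_entries_per_block(uint64_t blksz)
{
	assert(blksz > SKETCH_FREE_HDR_SIZE);
	return (blksz - SKETCH_FREE_HDR_SIZE) / SKETCH_FREE_ENTRY_SIZE;
}

/* Free index blocks needed to cover every data block (rounded up). */
static uint64_t
dir_free_index_extents(uint64_t blksz)
{
	uint64_t	per_block = free_entries_per_block(blksz);

	return (dir_space_extents(blksz) + per_block - 1) / per_block;
}

/* Data space + leaf space + free space index. */
static uint64_t
dir_max_extents(uint64_t blksz)
{
	return 2 * dir_space_extents(blksz) + dir_free_index_extents(blksz);
}
```

With a 512-byte block size, dir_max_extents(512) evaluates to
2^26 + 2^26 + 299594.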

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c  | 2 +-
 fs/xfs/libxfs/xfs_types.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8b0029b3cecf..f75b70ae7b1f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -81,7 +81,7 @@ xfs_bmap_compute_maxlevels(
 	if (whichfork == XFS_DATA_FORK) {
 		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
 		if (dir_bmbt)
-			maxleafents = MAXEXTNUM;
+			maxleafents = MAXDIREXTNUM;
 		else
 			maxleafents = MAXEXTNUM;
 	} else {
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 397d94775440..0a3041ad5bec 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -60,6 +60,7 @@ typedef void *		xfs_failaddr_t;
  */
 #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
 #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
+#define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
 #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
 
 /*
-- 
2.20.1



* [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (4 preceding siblings ...)
  2020-06-06  8:27 ` [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 17:14   ` Darrick J. Wong
  2020-06-19 14:38   ` Christoph Hellwig
  2020-06-06  8:27 ` [PATCH 7/7] xfs: Extend attr extent counter to 32 bits Chandan Babu R
  2020-06-08 17:31 ` [PATCH 0/7] xfs: Extend per-inode extent counters Darrick J. Wong
  7 siblings, 2 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

This commit extends the per-inode data extent counter to 47 bits. The
length of 47 bits was chosen because:
Maximum file size = 2^63.
Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.

The following changes are made to accomplish this:
1. A new ro-compat superblock flag to prevent older kernels from
   mounting the filesystem in read-write mode. This flag is set for the
   first time when an inode would end up having more than 2^31 extents.
2. A new 32-bit field carved out of xfs_dinode->di_pad2[]. This field
   holds the most significant 15 bits of the data extent counter.
3. A new inode->di_flags2 flag to indicate that the newly added field
   contains valid data. This flag is set when either of the following
   two conditions is met:
   - When the inode is about to have more than 2^31 extents.
   - When flushing the incore inode (see xfs_iflush_int()), if
     the superblock ro-compat flag is already set.
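
As a hedged sketch of the layout described above, the 47-bit counter can
be viewed as a low 32-bit half plus a high half of which only 15 bits
are used. struct nextents_sketch and the helper names are hypothetical,
endianness conversion is deliberately left out, and the hi_valid
argument stands in for the new di_flags2 flag.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical upper bound for a 47-bit counter. */
#define SKETCH_MAX_DATA_EXTENTS	((UINT64_C(1) << 47) - 1)

/* Hypothetical stand-in for the di_nextents_lo / di_nextents_hi pair. */
struct nextents_sketch {
	uint32_t	lo;	/* least significant 32 bits */
	uint32_t	hi;	/* most significant 15 bits, rest zero */
};

/* Split an in-core extent count into the two on-disk halves. */
static struct nextents_sketch
nextents_split(uint64_t nextents)
{
	struct nextents_sketch	d;

	assert(nextents <= SKETCH_MAX_DATA_EXTENTS);
	d.lo = (uint32_t)(nextents & 0xffffffffU);
	d.hi = (uint32_t)(nextents >> 32);
	return d;
}

/* Rebuild the count; the high half is ignored unless flagged valid. */
static uint64_t
nextents_join(struct nextents_sketch d, int hi_valid)
{
	uint64_t	nextents = d.lo;

	if (hi_valid)
		nextents |= (uint64_t)d.hi << 32;
	return nextents;
}
```

An inode written without the flag therefore still reads back correctly
as long as its extent count fits in the low 32 bits.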

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c        | 40 ++++++++--------
 fs/xfs/libxfs/xfs_format.h      | 30 ++++++++----
 fs/xfs/libxfs/xfs_inode_buf.c   | 46 +++++++++++++++---
 fs/xfs/libxfs/xfs_inode_buf.h   |  2 +
 fs/xfs/libxfs/xfs_inode_fork.c  | 84 ++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_inode_fork.h  |  3 +-
 fs/xfs/libxfs/xfs_log_format.h  |  5 +-
 fs/xfs/libxfs/xfs_types.h       |  5 +-
 fs/xfs/scrub/inode.c            |  9 ++--
 fs/xfs/xfs_inode.c              |  6 ++-
 fs/xfs/xfs_inode_item.c         |  5 +-
 fs/xfs/xfs_inode_item_recover.c | 16 +++++--
 12 files changed, 184 insertions(+), 67 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index f75b70ae7b1f..73e552678adc 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -53,9 +53,9 @@ xfs_bmap_compute_maxlevels(
 	int		whichfork,	/* data or attr fork */
 	int		dir_bmbt)	/* Dir or non-dir data fork */
 {
+	uint64_t	maxleafents;	/* max leaf entries possible */
 	int		level;		/* btree level */
 	uint		maxblocks;	/* max blocks at this level */
-	uint		maxleafents;	/* max leaf entries possible */
 	int		maxrootrecs;	/* max records in root block */
 	int		minleafrecs;	/* min records in leaf block */
 	int		minnoderecs;	/* min records in node block */
@@ -477,7 +477,7 @@ xfs_bmap_check_leaf_extents(
 	if (bp_release)
 		xfs_trans_brelse(NULL, bp);
 error_norelse:
-	xfs_warn(mp, "%s: BAD after btree leaves for %d extents",
+	xfs_warn(mp, "%s: BAD after btree leaves for %llu extents",
 		__func__, i);
 	xfs_err(mp, "%s: CORRUPTED BTREE OR SOMETHING", __func__);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
@@ -918,7 +918,7 @@ xfs_bmap_local_to_extents(
 	xfs_iext_first(ifp, &icur);
 	xfs_iext_insert(ip, &icur, &rec, 0);
 
-	error = xfs_next_set(ip, whichfork, 1);
+	error = xfs_next_set(tp, ip, whichfork, 1);
 	if (error)
 		goto done;
 
@@ -1610,7 +1610,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_iext_prev(ifp, &bma->icur);
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
 
-		error = xfs_next_set(bma->ip, whichfork, -1);
+		error = xfs_next_set(bma->tp, bma->ip, whichfork, -1);
 		if (error)
 			goto done;
 
@@ -1717,7 +1717,7 @@ xfs_bmap_add_extent_delay_real(
 		PREV.br_state = new->br_state;
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
 
-		error = xfs_next_set(bma->ip, whichfork, 1);
+		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -1786,7 +1786,7 @@ xfs_bmap_add_extent_delay_real(
 		 */
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
 
-		error = xfs_next_set(bma->ip, whichfork, 1);
+		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -1876,7 +1876,7 @@ xfs_bmap_add_extent_delay_real(
 		 */
 		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
 
-		error = xfs_next_set(bma->ip, whichfork, 1);
+		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -1965,7 +1965,7 @@ xfs_bmap_add_extent_delay_real(
 		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
 		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
 
-		error = xfs_next_set(bma->ip, whichfork, 1);
+		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -2172,7 +2172,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &LEFT);
 
-		error = xfs_next_set(ip, whichfork, -2);
+		error = xfs_next_set(tp, ip, whichfork, -2);
 		if (error)
 			goto done;
 
@@ -2228,7 +2228,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &LEFT);
 
-		error = xfs_next_set(ip, whichfork, -1);
+		error = xfs_next_set(tp, ip, whichfork, -1);
 		if (error)
 			goto done;
 
@@ -2274,7 +2274,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &PREV);
 
-		error = xfs_next_set(ip, whichfork, -1);
+		error = xfs_next_set(tp, ip, whichfork, -1);
 		if (error)
 			goto done;
 
@@ -2385,7 +2385,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_update_extent(ip, state, icur, &PREV);
 		xfs_iext_insert(ip, icur, new, state);
 
-		error = xfs_next_set(ip, whichfork, 1);
+		error = xfs_next_set(tp, ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -2464,7 +2464,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_next(ifp, icur);
 		xfs_iext_insert(ip, icur, new, state);
 
-		error = xfs_next_set(ip, whichfork, 1);
+		error = xfs_next_set(tp, ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -2519,7 +2519,7 @@ xfs_bmap_add_extent_unwritten_real(
 		xfs_iext_insert(ip, icur, &r[1], state);
 		xfs_iext_insert(ip, icur, &r[0], state);
 
-		error = xfs_next_set(ip, whichfork, 2);
+		error = xfs_next_set(tp, ip, whichfork, 2);
 		if (error)
 			goto done;
 
@@ -2838,7 +2838,7 @@ xfs_bmap_add_extent_hole_real(
 		xfs_iext_prev(ifp, icur);
 		xfs_iext_update_extent(ip, state, icur, &left);
 
-		error = xfs_next_set(ip, whichfork, -1);
+		error = xfs_next_set(tp, ip, whichfork, -1);
 		if (error)
 			goto done;
 
@@ -2940,7 +2940,7 @@ xfs_bmap_add_extent_hole_real(
 		 */
 		xfs_iext_insert(ip, icur, new, state);
 
-		error = xfs_next_set(ip, whichfork, 1);
+		error = xfs_next_set(tp, ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -5140,7 +5140,7 @@ xfs_bmap_del_extent_real(
 		xfs_iext_remove(ip, icur, state);
 		xfs_iext_prev(ifp, icur);
 
-		error = xfs_next_set(ip, whichfork, -1);
+		error = xfs_next_set(tp, ip, whichfork, -1);
 		if (error)
 			goto done;
 
@@ -5252,7 +5252,7 @@ xfs_bmap_del_extent_real(
 		} else
 			flags |= xfs_ilog_fext(whichfork);
 
-		error = xfs_next_set(ip, whichfork, 1);
+		error = xfs_next_set(tp, ip, whichfork, 1);
 		if (error)
 			goto done;
 
@@ -5722,7 +5722,7 @@ xfs_bmse_merge(
 	 * Update the on-disk extent count, the btree if necessary and log the
 	 * inode.
 	 */
-	error = xfs_next_set(ip, whichfork, -1);
+	error = xfs_next_set(tp, ip, whichfork, -1);
 	if (error)
 		goto done;
 
@@ -6113,7 +6113,7 @@ xfs_bmap_split_extent(
 	xfs_iext_next(ifp, &icur);
 	xfs_iext_insert(ip, &icur, &new, 0);
 
-	error = xfs_next_set(ip, whichfork, 1);
+	error = xfs_next_set(tp, ip, whichfork, 1);
 	if (error)
 		goto del_cursor;
 
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b42a52bfa1e9..91bee33aa988 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -449,10 +449,12 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
+#define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
-		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
+		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
+		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -563,6 +565,18 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
 		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
 }
 
+static inline bool xfs_sb_version_has47bitext(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_ro_compat &
+			XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR);
+}
+
+static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
+{
+	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
+}
+
 /*
  * end of superblock version macros
  */
@@ -873,7 +887,7 @@ typedef struct xfs_dinode {
 	__be64		di_size;	/* number of bytes in file */
 	__be64		di_nblocks;	/* # of direct & btree blocks used */
 	__be32		di_extsize;	/* basic/minimum extent size for file */
-	__be32		di_nextents;	/* number of extents in data fork */
+	__be32		di_nextents_lo;	/* number of extents in data fork */
 	__be16		di_anextents;	/* number of extents in attribute fork*/
 	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	__s8		di_aformat;	/* format of attr fork's data */
@@ -891,7 +905,8 @@ typedef struct xfs_dinode {
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
 	__be32		di_cowextsize;	/* basic cow extent size for file */
-	__u8		di_pad2[12];	/* more padding for future expansion */
+	__be32		di_nextents_hi;
+	__u8		di_pad2[8];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
@@ -992,10 +1007,6 @@ enum xfs_dinode_fmt {
 	((w) == XFS_DATA_FORK ? \
 		(dip)->di_format : \
 		(dip)->di_aformat)
-#define XFS_DFORK_NEXTENTS(dip,w) \
-	((w) == XFS_DATA_FORK ? \
-		be32_to_cpu((dip)->di_nextents) : \
-		be16_to_cpu((dip)->di_anextents))
 
 /*
  * For block and character special files the 32bit dev_t is stored at the
@@ -1061,12 +1072,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
 #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
+#define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
+#define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
 
 #define XFS_DIFLAG2_ANY \
-	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
+	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
+	 XFS_DIFLAG2_47BIT_NEXTENTS)
 
 /*
  * Inode number format:
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 6f84ea85fdd8..8b89fe080f70 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -307,7 +307,8 @@ xfs_inode_to_disk(
 	to->di_size = cpu_to_be64(from->di_size);
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
-	to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
+	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
+					0xffffffffU);
 	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
@@ -322,6 +323,10 @@ xfs_inode_to_disk(
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
+		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+			to->di_nextents_hi
+				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
+					>> 32);
 		to->di_ino = cpu_to_be64(ip->i_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -360,7 +365,7 @@ xfs_log_dinode_to_disk(
 	to->di_size = cpu_to_be64(from->di_size);
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
-	to->di_nextents = cpu_to_be32(from->di_nextents);
+	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
 	to->di_anextents = cpu_to_be16(from->di_anextents);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
@@ -375,6 +380,9 @@ xfs_log_dinode_to_disk(
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
+		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+			to->di_nextents_hi =
+				cpu_to_be32(from->di_nextents_hi);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(from->di_lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
@@ -391,7 +399,9 @@ xfs_dinode_verify_fork(
 	struct xfs_mount	*mp,
 	int			whichfork)
 {
-	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		di_nextents;
+
+	di_nextents = xfs_dfork_nextents(&mp->m_sb, dip, whichfork);
 
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
 	case XFS_DINODE_FMT_LOCAL:
@@ -462,6 +472,8 @@ xfs_dinode_verify(
 	uint16_t		flags;
 	uint64_t		flags2;
 	uint64_t		di_size;
+	xfs_extnum_t		nextents;
+	int64_t			nblocks;
 
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return __this_address;
@@ -492,10 +504,12 @@ xfs_dinode_verify(
 	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
 		return __this_address;
 
+	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
+	nextents += xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
+	nblocks = be64_to_cpu(dip->di_nblocks);
+
 	/* Fork checks carried over from xfs_iformat_fork */
-	if (mode &&
-	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
-			be64_to_cpu(dip->di_nblocks))
+	if (mode && nextents > nblocks)
 		return __this_address;
 
 	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
@@ -716,3 +730,23 @@ xfs_inode_validate_cowextsize(
 
 	return NULL;
 }
+
+xfs_extnum_t
+xfs_dfork_nextents(
+	struct xfs_sb		*sbp,
+	struct xfs_dinode	*dip,
+	int			whichfork)
+{
+	xfs_extnum_t		nextents;
+
+	if (whichfork == XFS_DATA_FORK) {
+		nextents = be32_to_cpu(dip->di_nextents_lo);
+		if (xfs_sb_version_has_v3inode(sbp)
+			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
+			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
+				<< 32;
+		return nextents;
+	} else {
+		return be16_to_cpu(dip->di_anextents);
+	}
+}
diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
index 865ac493c72a..4583db53b933 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.h
+++ b/fs/xfs/libxfs/xfs_inode_buf.h
@@ -65,5 +65,7 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
 xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
 		uint32_t cowextsize, uint16_t mode, uint16_t flags,
 		uint64_t flags2);
+xfs_extnum_t xfs_dfork_nextents(struct xfs_sb *sbp, struct xfs_dinode *dip,
+		int whichfork);
 
 #endif	/* __XFS_INODE_BUF_H__ */
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 3bf5a2c391bd..ec682e2d5bcb 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -10,6 +10,7 @@
 #include "xfs_format.h"
 #include "xfs_log_format.h"
 #include "xfs_trans_resv.h"
+#include "xfs_sb.h"
 #include "xfs_mount.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
@@ -103,21 +104,22 @@ xfs_iformat_extents(
 	int			whichfork)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sb = &mp->m_sb;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	xfs_extnum_t		nex = xfs_dfork_nextents(sb, dip, whichfork);
 	int			state = xfs_bmap_fork_to_state(whichfork);
-	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
 	int			size = nex * sizeof(xfs_bmbt_rec_t);
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_rec	*dp;
 	struct xfs_bmbt_irec	new;
-	int			i;
+	xfs_extnum_t		i;
 
 	/*
 	 * If the number of extents is unreasonable, then something is wrong and
 	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
 	 */
 	if (unlikely(size < 0 || size > XFS_DFORK_SIZE(dip, mp, whichfork))) {
-		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %d).",
+		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %llu).",
 			(unsigned long long) ip->i_ino, nex);
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 				"xfs_iformat_extents(1)", dip, sizeof(*dip),
@@ -233,7 +235,11 @@ xfs_iformat_data_fork(
 	 * depend on it.
 	 */
 	ip->i_df.if_format = dip->di_format;
-	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents);
+	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents_lo);
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+		ip->i_df.if_nextents |=
+			((u64)(be32_to_cpu(dip->di_nextents_hi)) << 32);
+
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFIFO:
@@ -729,31 +735,73 @@ xfs_ifork_verify_local_attr(
 	return 0;
 }
 
+static int
+xfs_next_set_data(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_ifork	*ifp,
+	int			delta)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_extnum_t		nr_exts;
+
+	nr_exts = ifp->if_nextents + delta;
+
+	if ((delta > 0 && nr_exts > MAXEXTNUM)
+		|| (delta < 0 && nr_exts > ifp->if_nextents))
+		return -EOVERFLOW;
+
+	if (ifp->if_nextents <= MAXEXTNUM31BIT &&
+		nr_exts > MAXEXTNUM31BIT &&
+		!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS) &&
+		xfs_sb_version_has_v3inode(&mp->m_sb)) {
+		if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
+			bool log_sb = false;
+
+			spin_lock(&mp->m_sb_lock);
+			if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
+				xfs_sb_version_add47bitext(&mp->m_sb);
+				log_sb = true;
+			}
+			spin_unlock(&mp->m_sb_lock);
+
+			if (log_sb)
+				xfs_log_sb(tp);
+		}
+
+		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
+	}
+
+	ifp->if_nextents = nr_exts;
+
+	return 0;
+}
+
 int
 xfs_next_set(
+	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
 	int			whichfork,
 	int			delta)
 {
 	struct xfs_ifork	*ifp;
 	int64_t			nr_exts;
-	int64_t			max_exts;
+	int			error = 0;
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 
-	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
-		max_exts = MAXEXTNUM;
-	else if (whichfork == XFS_ATTR_FORK)
-		max_exts = MAXAEXTNUM;
-	else
-		ASSERT(0);
-
-	nr_exts = ifp->if_nextents + delta;
-	if ((delta > 0 && nr_exts > max_exts)
-		|| (delta < 0 && nr_exts < 0))
-		return -EOVERFLOW;
+	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
+		error = xfs_next_set_data(tp, ip, ifp, delta);
+	} else if (whichfork == XFS_ATTR_FORK) {
+		nr_exts = ifp->if_nextents + delta;
+		if ((delta > 0 && nr_exts > MAXAEXTNUM)
+			|| (delta < 0 && nr_exts < 0))
+			return -EOVERFLOW;
 
-	ifp->if_nextents = nr_exts;
+		ifp->if_nextents = nr_exts;
+	} else {
+		ASSERT(0);
+	}
 
-	return 0;
+	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index a84ae42ace79..c74fa6371cc8 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -173,5 +173,6 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
 int xfs_ifork_verify_local_data(struct xfs_inode *ip);
 int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
 
-int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
+int xfs_next_set(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork,
+		int delta);
 #endif	/* __XFS_INODE_FORK_H__ */
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index e3400c9c71cd..879aadff7692 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -396,7 +396,7 @@ struct xfs_log_dinode {
 	xfs_fsize_t	di_size;	/* number of bytes in file */
 	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
 	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
-	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
+	uint32_t	di_nextents_lo;	/* number of extents in data fork */
 	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
 	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	int8_t		di_aformat;	/* format of attr fork's data */
@@ -414,7 +414,8 @@ struct xfs_log_dinode {
 	xfs_lsn_t	di_lsn;		/* flush sequence */
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
-	uint8_t		di_pad2[12];	/* more padding for future expansion */
+	uint32_t	di_nextents_hi;
+	uint8_t		di_pad2[8];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 0a3041ad5bec..c68ff2178976 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -12,7 +12,7 @@ typedef uint32_t	xfs_agblock_t;	/* blockno in alloc. group */
 typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
 typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
 typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
-typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
+typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
 typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
 typedef int64_t		xfs_fsize_t;	/* bytes in a file */
 typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
@@ -59,7 +59,8 @@ typedef void *		xfs_failaddr_t;
  * Max values for extlen, extnum, aextnum.
  */
 #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
-#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
+#define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
+#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
 #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
 #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
 
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 6d483ab29e63..be41fd242ff2 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -205,8 +205,8 @@ xchk_dinode(
 	struct xfs_mount	*mp = sc->mp;
 	size_t			fork_recs;
 	unsigned long long	isize;
+	xfs_extnum_t		nextents;
 	uint64_t		flags2;
-	uint32_t		nextents;
 	uint16_t		flags;
 	uint16_t		mode;
 
@@ -354,7 +354,7 @@ xchk_dinode(
 	xchk_inode_extsize(sc, dip, ino, mode, flags);
 
 	/* di_nextents */
-	nextents = be32_to_cpu(dip->di_nextents);
+	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
 	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
 	switch (dip->di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
@@ -464,6 +464,7 @@ xchk_inode_xref_bmap(
 	struct xfs_scrub	*sc,
 	struct xfs_dinode	*dip)
 {
+	xfs_mount_t		*mp = sc->mp;
 	xfs_extnum_t		nextents;
 	xfs_filblks_t		count;
 	xfs_filblks_t		acount;
@@ -477,14 +478,14 @@ xchk_inode_xref_bmap(
 			&nextents, &count);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents < be32_to_cpu(dip->di_nextents))
+	if (nextents < xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK))
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
 			&nextents, &acount);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents != be16_to_cpu(dip->di_anextents))
+	if (nextents != xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	/* Check nblocks against the inode. */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 64f5f9a440ae..4418a66cf6d6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3748,7 +3748,7 @@ xfs_iflush_int(
 				ip->i_d.di_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
 		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 			"%s: detected corrupt incore inode %Lu, "
-			"total extents = %d, nblocks = %Ld, ptr "PTR_FMT,
+			"total extents = %llu, nblocks = %Ld, ptr "PTR_FMT,
 			__func__, ip->i_ino,
 			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
 			ip->i_d.di_nblocks, ip);
@@ -3785,6 +3785,10 @@ xfs_iflush_int(
 	    xfs_ifork_verify_local_attr(ip))
 		goto flush_out;
 
+	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+		&& xfs_sb_version_has47bitext(&mp->m_sb))
+		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
+
 	/*
 	 * Copy the dirty parts of the inode into the on-disk inode.  We always
 	 * copy out the core of the inode, because if the inode is dirty at all
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index ba47bf65b772..6f27ac7c8631 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -326,7 +326,7 @@ xfs_inode_to_log_dinode(
 	to->di_size = from->di_size;
 	to->di_nblocks = from->di_nblocks;
 	to->di_extsize = from->di_extsize;
-	to->di_nextents = xfs_ifork_nextents(&ip->i_df);
+	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
 	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
@@ -344,6 +344,9 @@ xfs_inode_to_log_dinode(
 		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
 		to->di_flags2 = from->di_flags2;
 		to->di_cowextsize = from->di_cowextsize;
+		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+			to->di_nextents_hi =
+				xfs_ifork_nextents(&ip->i_df) >> 32;
 		to->di_ino = ip->i_ino;
 		to->di_lsn = lsn;
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 10ef5ddf5429..8d64b861fb66 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -134,6 +134,7 @@ xlog_recover_inode_commit_pass2(
 	struct xfs_log_dinode		*ldip;
 	uint				isize;
 	int				need_free = 0;
+	xfs_extnum_t			nextents;
 
 	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
 		in_f = item->ri_buf[0].i_addr;
@@ -255,16 +256,23 @@ xlog_recover_inode_commit_pass2(
 			goto out_release;
 		}
 	}
-	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
+
+	nextents = ldip->di_nextents_lo;
+	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
+		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
+		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
+
+	nextents += ldip->di_anextents;
+
+	if (unlikely(nextents > ldip->di_nblocks)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
 				     XFS_ERRLEVEL_LOW, mp, ldip,
 				     sizeof(*ldip));
 		xfs_alert(mp,
 	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
-	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
+	"dino bp "PTR_FMT", ino %Ld, total extents = %llu, nblocks = %Ld",
 			__func__, item, dip, bp, in_f->ilf_ino,
-			ldip->di_nextents + ldip->di_anextents,
-			ldip->di_nblocks);
+			nextents, ldip->di_nblocks);
 		error = -EFSCORRUPTED;
 		goto out_release;
 	}
-- 
2.20.1



* [PATCH 7/7] xfs: Extend attr extent counter to 32 bits
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (5 preceding siblings ...)
  2020-06-06  8:27 ` [PATCH 6/7] xfs: Extend data extent counter to 47 bits Chandan Babu R
@ 2020-06-06  8:27 ` Chandan Babu R
  2020-06-08 17:21   ` Darrick J. Wong
  2020-06-19 14:39   ` Christoph Hellwig
  2020-06-08 17:31 ` [PATCH 0/7] xfs: Extend per-inode extent counters Darrick J. Wong
  7 siblings, 2 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-06  8:27 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, david, darrick.wong, bfoster, hch

This commit extends the per-inode attr extent counter to 32 bits.

The following changes are made to accomplish this:
1. A new ro-compat superblock flag to prevent older kernels from
   mounting the filesystem in read-write mode. This flag is set for the
   first time when an inode would end up having more than 2^15 - 1
   attr extents.
2. Carve out a new 16-bit field from xfs_dinode->di_pad2[]. This field
   holds the most significant 16 bits of the attr extent counter.
3. A new inode->di_flags2 flag to indicate that the newly added field
   contains valid data. This flag is set when either of the following
   two conditions is met:
   - The inode is about to have more than 2^15 - 1 attr extents.
   - The incore inode is being flushed (see xfs_iflush_int()) and
     the superblock ro-compat flag is already set.
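
For illustration only (this is not part of the patch): a minimal userspace
sketch of how a 32-bit attr extent count could be split across a low and a
high 16-bit field, with the high half honoured only when the per-inode flag
is set. The struct, the DEMO_ names and the helper functions are
hypothetical stand-ins, and the sketch uses host-endian fields rather than
the __be16 on-disk types.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for XFS_DIFLAG2_32BIT_ANEXTENTS (bit 4). */
#define DEMO_DIFLAG2_32BIT_ANEXTENTS	(1ULL << 4)

/* Hypothetical, host-endian model of the relevant inode fields. */
struct demo_dinode {
	uint16_t	anextents_lo;	/* low 16 bits of the attr extent count */
	uint16_t	anextents_hi;	/* high 16 bits, valid only with the flag */
	uint64_t	flags2;
};

/* Split a 32-bit count into the two fields. */
static void
demo_store_anextents(
	struct demo_dinode	*dip,
	uint32_t		count)
{
	/* Without the flag, the count must fit in the low field alone. */
	assert((dip->flags2 & DEMO_DIFLAG2_32BIT_ANEXTENTS) ||
	       count <= 0xffffU);

	dip->anextents_lo = count & 0xffffU;
	if (dip->flags2 & DEMO_DIFLAG2_32BIT_ANEXTENTS)
		dip->anextents_hi = count >> 16;
}

/* Reassemble the count, ignoring the high field unless the flag is set. */
static uint32_t
demo_load_anextents(
	const struct demo_dinode	*dip)
{
	uint32_t		count = dip->anextents_lo;

	if (dip->flags2 & DEMO_DIFLAG2_32BIT_ANEXTENTS)
		count |= (uint32_t)dip->anextents_hi << 16;
	return count;
}
```

Gating the high half on the flag means an inode written by an older kernel,
whose padding bytes were never meant to carry a count, still reads back
through the low field only.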

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_format.h      | 25 ++++++++++---
 fs/xfs/libxfs/xfs_inode_buf.c   | 23 +++++++++---
 fs/xfs/libxfs/xfs_inode_fork.c  | 62 ++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_log_format.h  |  5 +--
 fs/xfs/libxfs/xfs_types.h       |  5 +--
 fs/xfs/scrub/inode.c            |  5 +--
 fs/xfs/xfs_inode.c              |  4 +++
 fs/xfs/xfs_inode_item.c         |  5 ++-
 fs/xfs/xfs_inode_item_recover.c |  8 ++++-
 9 files changed, 113 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 91bee33aa988..2e37d887fd35 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -450,11 +450,13 @@ xfs_sb_has_compat_feature(
 #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
 #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
 #define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
+#define XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR (1 << 4)	/* 32bit attr extents */
 #define XFS_SB_FEAT_RO_COMPAT_ALL \
 		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
 		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
 		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
-		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
+		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR | \
+		 XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR)
 #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
 static inline bool
 xfs_sb_has_ro_compat_feature(
@@ -577,6 +579,18 @@ static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
 	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
 }
 
+static inline bool xfs_sb_version_has32bitaext(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_ro_compat &
+			XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR);
+}
+
+static inline void xfs_sb_version_add32bitaext(struct xfs_sb *sbp)
+{
+	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR;
+}
+
 /*
  * end of superblock version macros
  */
@@ -888,7 +902,7 @@ typedef struct xfs_dinode {
 	__be64		di_nblocks;	/* # of direct & btree blocks used */
 	__be32		di_extsize;	/* basic/minimum extent size for file */
 	__be32		di_nextents_lo;	/* number of extents in data fork */
-	__be16		di_anextents;	/* number of extents in attribute fork*/
+	__be16		di_anextents_lo;/* lower part of xattr extent count */
 	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	__s8		di_aformat;	/* format of attr fork's data */
 	__be32		di_dmevmask;	/* DMIG event mask */
@@ -906,7 +920,8 @@ typedef struct xfs_dinode {
 	__be64		di_flags2;	/* more random flags */
 	__be32		di_cowextsize;	/* basic cow extent size for file */
 	__be32		di_nextents_hi;
-	__u8		di_pad2[8];	/* more padding for future expansion */
+	__be16		di_anextents_hi;/* higher part of xattr extent count */
+	__u8		di_pad2[6];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
@@ -1073,14 +1088,16 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
+#define XFS_DIFLAG2_32BIT_ANEXTENTS_BIT 4 /* Uses di_anextents_hi field  */
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
 #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
 #define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
+#define XFS_DIFLAG2_32BIT_ANEXTENTS (1 << XFS_DIFLAG2_32BIT_ANEXTENTS_BIT)
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_47BIT_NEXTENTS)
+	 XFS_DIFLAG2_47BIT_NEXTENTS | XFS_DIFLAG2_32BIT_ANEXTENTS)
 
 /*
  * Inode number format:
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 8b89fe080f70..285cbce0cd10 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -309,7 +309,8 @@ xfs_inode_to_disk(
 	to->di_extsize = cpu_to_be32(from->di_extsize);
 	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
 					0xffffffffU);
-	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
+	to->di_anextents_lo = cpu_to_be16(xfs_ifork_nextents(ip->i_afp) &
+					0xffffU);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -327,6 +328,10 @@ xfs_inode_to_disk(
 			to->di_nextents_hi
 				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
 					>> 32);
+		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+			to->di_anextents_hi
+				= cpu_to_be16(xfs_ifork_nextents(ip->i_afp)
+					>> 16);
 		to->di_ino = cpu_to_be64(ip->i_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -366,7 +371,7 @@ xfs_log_dinode_to_disk(
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
 	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
-	to->di_anextents = cpu_to_be16(from->di_anextents);
+	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -383,6 +388,9 @@ xfs_log_dinode_to_disk(
 		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
 			to->di_nextents_hi =
 				cpu_to_be32(from->di_nextents_hi);
+		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+			to->di_anextents_hi =
+				cpu_to_be16(from->di_anextents_hi);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(from->di_lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
@@ -566,7 +574,7 @@ xfs_dinode_verify(
 		default:
 			return __this_address;
 		}
-		if (dip->di_anextents)
+		if (xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
 			return __this_address;
 	}
 
@@ -745,8 +753,13 @@ xfs_dfork_nextents(
 			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
 			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
 				<< 32;
-		return nextents;
 	} else {
-		return be16_to_cpu(dip->di_anextents);
+		nextents = be16_to_cpu(dip->di_anextents_lo);
+		if (xfs_sb_version_has_v3inode(sbp)
+			&& (dip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS))
+			nextents |= (u32)(be16_to_cpu(dip->di_anextents_hi))
+				<< 16;
 	}
+
+	return nextents;
 }
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index ec682e2d5bcb..169e16947ece 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -301,7 +301,10 @@ xfs_iformat_attr_fork(
 	ip->i_afp->if_format = dip->di_aformat;
 	if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */
 		ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
-	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents);
+	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents_lo);
+	if (ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+		ip->i_afp->if_nextents |=
+			(u32)(be16_to_cpu(dip->di_anextents_hi)) << 16;
 
 	switch (ip->i_afp->if_format) {
 	case XFS_DINODE_FMT_LOCAL:
@@ -777,6 +780,48 @@ xfs_next_set_data(
 	return 0;
 }
 
+static int
+xfs_next_set_attr(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_ifork	*ifp,
+	int			delta)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	xfs_aextnum_t		nr_exts;
+
+	nr_exts = ifp->if_nextents + delta;
+
+	if ((delta > 0 && nr_exts < ifp->if_nextents) ||
+		(delta < 0 && nr_exts > ifp->if_nextents))
+		return -EOVERFLOW;
+
+	if (ifp->if_nextents <= MAXAEXTNUM15BIT &&
+		nr_exts > MAXAEXTNUM15BIT &&
+		!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS) &&
+		xfs_sb_version_has_v3inode(&mp->m_sb)) {
+		if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {
+			bool log_sb = false;
+
+			spin_lock(&mp->m_sb_lock);
+			if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {
+				xfs_sb_version_add32bitaext(&mp->m_sb);
+				log_sb = true;
+			}
+			spin_unlock(&mp->m_sb_lock);
+
+			if (log_sb)
+				xfs_log_sb(tp);
+		}
+
+		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
+	}
+
+	ifp->if_nextents = nr_exts;
+
+	return 0;
+}
+
 int
 xfs_next_set(
 	struct xfs_trans	*tp,
@@ -785,23 +830,16 @@ xfs_next_set(
 	int			delta)
 {
 	struct xfs_ifork	*ifp;
-	int64_t			nr_exts;
 	int			error = 0;
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 
-	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
+	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
 		error = xfs_next_set_data(tp, ip, ifp, delta);
-	} else if (whichfork == XFS_ATTR_FORK) {
-		nr_exts = ifp->if_nextents + delta;
-		if ((delta > 0 && nr_exts > MAXAEXTNUM)
-			|| (delta < 0 && nr_exts < 0))
-			return -EOVERFLOW;
-
-		ifp->if_nextents = nr_exts;
-	} else {
+	else if (whichfork == XFS_ATTR_FORK)
+		error = xfs_next_set_attr(tp, ip, ifp, delta);
+	else
 		ASSERT(0);
-	}
 
 	return error;
 }
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 879aadff7692..db419fc862bc 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -397,7 +397,7 @@ struct xfs_log_dinode {
 	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
 	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
 	uint32_t	di_nextents_lo;	/* number of extents in data fork */
-	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
+	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
 	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	int8_t		di_aformat;	/* format of attr fork's data */
 	uint32_t	di_dmevmask;	/* DMIG event mask */
@@ -415,7 +415,8 @@ struct xfs_log_dinode {
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
 	uint32_t	di_nextents_hi;
-	uint8_t		di_pad2[8];	/* more padding for future expansion */
+	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
+	uint8_t		di_pad2[6];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index c68ff2178976..974737a9e9c1 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
 typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
 typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
 typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
-typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
+typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
 typedef int64_t		xfs_fsize_t;	/* bytes in a file */
 typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
 
@@ -62,7 +62,8 @@ typedef void *		xfs_failaddr_t;
 #define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
 #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
 #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
-#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
+#define	MAXAEXTNUM15BIT	((xfs_aextnum_t)0x7fff)		/* 15 bits */
+#define	MAXAEXTNUM	((xfs_aextnum_t)0xffffffff)	/* 32 bits */
 
 /*
  * Minimum and maximum blocksize and sectorsize.
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index be41fd242ff2..01e60c78a3a3 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -371,10 +371,12 @@ xchk_dinode(
 		break;
 	}
 
+	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
+
 	/* di_forkoff */
 	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
 		xchk_ino_set_corrupt(sc, ino);
-	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
+	if (nextents != 0 && dip->di_forkoff == 0)
 		xchk_ino_set_corrupt(sc, ino);
 	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
 		xchk_ino_set_corrupt(sc, ino);
@@ -386,7 +388,6 @@ xchk_dinode(
 		xchk_ino_set_corrupt(sc, ino);
 
 	/* di_anextents */
-	nextents = be16_to_cpu(dip->di_anextents);
 	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
 	switch (dip->di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4418a66cf6d6..6ec34e069344 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3789,6 +3789,10 @@ xfs_iflush_int(
 		&& xfs_sb_version_has47bitext(&mp->m_sb))
 		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
 
+	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+		&& xfs_sb_version_has32bitaext(&mp->m_sb))
+		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
+
 	/*
 	 * Copy the dirty parts of the inode into the on-disk inode.  We always
 	 * copy out the core of the inode, because if the inode is dirty at all
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 6f27ac7c8631..40f0a19d1c07 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
 	to->di_nblocks = from->di_nblocks;
 	to->di_extsize = from->di_extsize;
 	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
-	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
+	to->di_anextents_lo = xfs_ifork_nextents(ip->i_afp) & 0xffffU;
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_dmevmask = from->di_dmevmask;
@@ -347,6 +347,9 @@ xfs_inode_to_log_dinode(
 		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
 			to->di_nextents_hi =
 				xfs_ifork_nextents(&ip->i_df) >> 32;
+		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+			to->di_anextents_hi =
+				xfs_ifork_nextents(ip->i_afp) >> 16;
 		to->di_ino = ip->i_ino;
 		to->di_lsn = lsn;
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 8d64b861fb66..c8b5fbba848b 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -135,6 +135,7 @@ xlog_recover_inode_commit_pass2(
 	uint				isize;
 	int				need_free = 0;
 	xfs_extnum_t			nextents;
+	xfs_aextnum_t			anextents;
 
 	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
 		in_f = item->ri_buf[0].i_addr;
@@ -262,7 +263,12 @@ xlog_recover_inode_commit_pass2(
 		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
 		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
 
-	nextents += ldip->di_anextents;
+	anextents = ldip->di_anextents_lo;
+	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
+		ldip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
+		anextents |= ((u32)(ldip->di_anextents_hi) << 16);
+
+	nextents += anextents;
 
 	if (unlikely(nextents > ldip->di_nblocks)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
-- 
2.20.1



* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-06  8:27 ` [PATCH 2/7] xfs: Check for per-inode extent count overflow Chandan Babu R
@ 2020-06-08 16:24   ` Darrick J. Wong
  2020-06-08 16:32     ` Darrick J. Wong
  2020-06-09 14:22     ` Chandan Babu R
  0 siblings, 2 replies; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 16:24 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> The following error message was noticed when a workload added one
> million xattrs, deleted 50% of them and then inserted 400,000 new
> xattrs.
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> 
> The error message was printed while unmounting the filesystem. The
> value printed under "total extents" indicates that we overflowed the
> per-inode signed 16-bit xattr extent counter.
> 
> Instead of letting this silent corruption occur, this patch checks for
> extent counter (both data and xattr) overflow before we assign the
> new value to the corresponding in-memory extent counter.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
>  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
>  3 files changed, 104 insertions(+), 18 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index edc63dba007f..798fca5c52af 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
>  	xfs_iext_first(ifp, &icur);
>  	xfs_iext_insert(ip, &icur, &rec, 0);
>  
> -	ifp->if_nextents = 1;
> +	error = xfs_next_set(ip, whichfork, 1);
> +	if (error)
> +		goto done;

Are you sure that if_nextents == 0 is a precondition here?  Technically
speaking, this turns an assignment into an increment operation.
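
A minimal userspace sketch of why the precondition matters, assuming a
simplified counter (struct fork_model and both helpers are hypothetical
names, not kernel code). Assignment and increment agree only when the
counter starts at zero:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fork's in-memory extent counter. */
struct fork_model {
	int64_t	nextents;
};

/* Original behaviour: the counter is overwritten unconditionally. */
static void nextents_assign(struct fork_model *f, int64_t value)
{
	assert(f != NULL);
	f->nextents = value;
}

/* Patched behaviour: the argument is applied as a delta. */
static void nextents_add(struct fork_model *f, int64_t delta)
{
	assert(f != NULL);
	f->nextents += delta;
}
```

Starting from zero, both calls leave the counter at 1.  Starting from 3,
the assignment yields 1 while the delta form yields 4.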

> +
>  	ip->i_d.di_nblocks = 1;
>  	xfs_trans_mod_dquot_byino(tp, ip,
>  		XFS_TRANS_DQ_BCOUNT, 1L);
> @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
>  		xfs_iext_remove(bma->ip, &bma->icur, state);
>  		xfs_iext_prev(ifp, &bma->icur);
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> -		ifp->if_nextents--;
> +
> +		error = xfs_next_set(bma->ip, whichfork, -1);
> +		if (error)
> +			goto done;
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
>  		PREV.br_startblock = new->br_startblock;
>  		PREV.br_state = new->br_state;
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(bma->ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
>  		 * The left neighbor is not contiguous.
>  		 */
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(bma->ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
>  		 * The right neighbor is not contiguous.
>  		 */
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(bma->ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
>  		xfs_iext_next(ifp, &bma->icur);
>  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
>  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(bma->ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (bma->cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> -		ifp->if_nextents -= 2;
> +
> +		error = xfs_next_set(ip, whichfork, -2);
> +		if (error)
> +			goto done;
> +
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
>  		else {
> @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> -		ifp->if_nextents--;
> +
> +		error = xfs_next_set(ip, whichfork, -1);
> +		if (error)
> +			goto done;
> +
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
>  		else {
> @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &PREV);
> -		ifp->if_nextents--;
> +
> +		error = xfs_next_set(ip, whichfork, -1);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
>  
>  		xfs_iext_update_extent(ip, state, icur, &PREV);
>  		xfs_iext_insert(ip, icur, new, state);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_update_extent(ip, state, icur, &PREV);
>  		xfs_iext_next(ifp, icur);
>  		xfs_iext_insert(ip, icur, new, state);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_next(ifp, icur);
>  		xfs_iext_insert(ip, icur, &r[1], state);
>  		xfs_iext_insert(ip, icur, &r[0], state);
> -		ifp->if_nextents += 2;
> +
> +		error = xfs_next_set(ip, whichfork, 2);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL)
>  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &left);
> -		ifp->if_nextents--;
> +
> +		error = xfs_next_set(ip, whichfork, -1);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL) {
>  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
>  		 * Insert a new entry.
>  		 */
>  		xfs_iext_insert(ip, icur, new, state);
> -		ifp->if_nextents++;
> +
> +		error = xfs_next_set(ip, whichfork, 1);
> +		if (error)
> +			goto done;
>  
>  		if (cur == NULL) {
>  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
>  		 */
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
> -		ifp->if_nextents--;
> +
> +		error = xfs_next_set(ip, whichfork, -1);
> +		if (error)
> +			goto done;
>  
>  		flags |= XFS_ILOG_CORE;
>  		if (!cur) {
> @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
>  		} else
>  			flags |= xfs_ilog_fext(whichfork);
>  
> -		ifp->if_nextents++;
> +		error = xfs_next_set(ip, whichfork, 1);
> +		if (error)
> +			goto done;
> +
>  		xfs_iext_next(ifp, icur);
>  		xfs_iext_insert(ip, icur, &new, state);
>  		break;
> @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
>  	 * Update the on-disk extent count, the btree if necessary and log the
>  	 * inode.
>  	 */
> -	ifp->if_nextents--;
> +	error = xfs_next_set(ip, whichfork, -1);
> +	if (error)
> +		goto done;
> +
>  	*logflags |= XFS_ILOG_CORE;
>  	if (!cur) {
>  		*logflags |= XFS_ILOG_DEXT;
> @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
>  	/* Add new extent */
>  	xfs_iext_next(ifp, &icur);
>  	xfs_iext_insert(ip, &icur, &new, 0);
> -	ifp->if_nextents++;
> +
> +	error = xfs_next_set(ip, whichfork, 1);
> +	if (error)
> +		goto del_cursor;
>  
>  	if (cur) {
>  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 28b366275ae0..3bf5a2c391bd 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
>  
>  	return 0;
>  }
> +
> +int
> +xfs_next_set(

"next"... please choose an abbreviation that doesn't collide with a
common English word.

> +	struct xfs_inode	*ip,
> +	int			whichfork,
> +	int			delta)

Delta?  I thought this was a setter function?

> +{
> +	struct xfs_ifork	*ifp;
> +	int64_t			nr_exts;
> +	int64_t			max_exts;
> +
> +	ifp = XFS_IFORK_PTR(ip, whichfork);
> +
> +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> +		max_exts = MAXEXTNUM;
> +	else if (whichfork == XFS_ATTR_FORK)
> +		max_exts = MAXAEXTNUM;
> +	else
> +		ASSERT(0);
> +
> +	nr_exts = ifp->if_nextents + delta;

Nope, it's a modify function all right.  Then it should be named:

xfs_nextents_mod(ip, whichfork, delta)

> +	if ((delta > 0 && nr_exts > max_exts)
> +		|| (delta < 0 && nr_exts < 0))

Line these up, please.  e.g.,

	if ((delta > 0 && nr_exts > max_exts) ||
            (delta < 0 && nr_exts < 0))
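
Putting the comments above together, a standalone model of the checked
helper might look like the sketch below.  It is not the kernel function:
struct fork_model, enum fork_kind and nextents_mod() are made-up names,
and the limits simply mirror the MAXEXTNUM and MAXAEXTNUM values from
xfs_types.h.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for XFS_DATA_FORK, XFS_ATTR_FORK and XFS_COW_FORK. */
enum fork_kind { FORK_DATA, FORK_ATTR, FORK_COW };

/* Limits mirroring MAXEXTNUM (signed int) and MAXAEXTNUM (signed short). */
#define MODEL_MAXEXTNUM		INT64_C(0x7fffffff)
#define MODEL_MAXAEXTNUM	INT64_C(0x7fff)

/* Hypothetical stand-in for the in-memory fork. */
struct fork_model {
	int64_t	nextents;
};

/*
 * Apply delta to the extent counter.  The counter is left untouched and
 * -EOVERFLOW is returned if the result would exceed the per-fork limit
 * or drop below zero.
 */
static int nextents_mod(struct fork_model *ifp, enum fork_kind whichfork,
			int delta)
{
	int64_t	max_exts;
	int64_t	nr_exts;

	assert(ifp != NULL);

	max_exts = (whichfork == FORK_ATTR) ? MODEL_MAXAEXTNUM : MODEL_MAXEXTNUM;
	nr_exts = ifp->nextents + delta;

	if ((delta > 0 && nr_exts > max_exts) ||
	    (delta < 0 && nr_exts < 0))
		return -EOVERFLOW;

	ifp->nextents = nr_exts;
	return 0;
}
```

An attr fork that already holds 0x7fff extents rejects a +1 delta and
keeps its old count, which is the case the xattr workload ran into.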

--D

> +		return -EOVERFLOW;
> +
> +	ifp->if_nextents = nr_exts;
> +
> +	return 0;
> +}
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index a4953e95c4f3..a84ae42ace79 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
>  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
>  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
>  
> +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
>  #endif	/* __XFS_INODE_FORK_H__ */
> -- 
> 2.20.1
> 

* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-08 16:24   ` Darrick J. Wong
@ 2020-06-08 16:32     ` Darrick J. Wong
  2020-06-09 14:22       ` Chandan Babu R
  2020-06-09 14:22     ` Chandan Babu R
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 16:32 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Mon, Jun 08, 2020 at 09:24:25AM -0700, Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > The following error message was noticed when a workload added one
> > million xattrs, deleted 50% of them and then inserted 400,000 new
> > xattrs.
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > 
> > The error message was printed while unmounting the filesystem. The
> > value printed under "total extents" indicates that we overflowed the
> > per-inode signed 16-bit xattr extent counter.
> > 
> > Instead of letting this silent corruption occur, this patch checks for
> > extent counter (both data and xattr) overflow before we assign the
> > new value to the corresponding in-memory extent counter.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> >  3 files changed, 104 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index edc63dba007f..798fca5c52af 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> >  	xfs_iext_first(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &rec, 0);
> >  
> > -	ifp->if_nextents = 1;
> > +	error = xfs_next_set(ip, whichfork, 1);
> > +	if (error)
> > +		goto done;
> 
> Are you sure that if_nextents == 0 is a precondition here?  Technically
> speaking, this turns an assignment into an increment operation.
> 
> > +
> >  	ip->i_d.di_nblocks = 1;
> >  	xfs_trans_mod_dquot_byino(tp, ip,
> >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> >  		xfs_iext_prev(ifp, &bma->icur);
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> >  		PREV.br_startblock = new->br_startblock;
> >  		PREV.br_state = new->br_state;
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> >  		 * The left neighbor is not contiguous.
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> >  		 * The right neighbor is not contiguous.
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_next(ifp, &bma->icur);
> >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > -		ifp->if_nextents -= 2;
> > +
> > +		error = xfs_next_set(ip, whichfork, -2);
> > +		if (error)
> > +			goto done;
> > +
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> >  		else {
> > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> > +
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> >  		else {
> > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, &r[1], state);
> >  		xfs_iext_insert(ip, icur, &r[0], state);
> > -		ifp->if_nextents += 2;
> > +
> > +		error = xfs_next_set(ip, whichfork, 2);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &left);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL) {
> >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> >  		 * Insert a new entry.
> >  		 */
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL) {
> >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> >  		 */
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		flags |= XFS_ILOG_CORE;
> >  		if (!cur) {
> > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> >  		} else
> >  			flags |= xfs_ilog_fext(whichfork);
> >  
> > -		ifp->if_nextents++;
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> > +
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, &new, state);
> >  		break;
> > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> >  	 * Update the on-disk extent count, the btree if necessary and log the
> >  	 * inode.
> >  	 */
> > -	ifp->if_nextents--;
> > +	error = xfs_next_set(ip, whichfork, -1);
> > +	if (error)
> > +		goto done;
> > +
> >  	*logflags |= XFS_ILOG_CORE;
> >  	if (!cur) {
> >  		*logflags |= XFS_ILOG_DEXT;
> > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> >  	/* Add new extent */
> >  	xfs_iext_next(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &new, 0);
> > -	ifp->if_nextents++;
> > +
> > +	error = xfs_next_set(ip, whichfork, 1);
> > +	if (error)
> > +		goto del_cursor;
> >  
> >  	if (cur) {
> >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 28b366275ae0..3bf5a2c391bd 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> >  
> >  	return 0;
> >  }
> > +
> > +int
> > +xfs_next_set(
> 
> "next"... please choose an abbreviation that doesn't collide with a
> common English word.
> 
> > +	struct xfs_inode	*ip,
> > +	int			whichfork,
> > +	int			delta)
> 
> Delta?  I thought this was a setter function?
> 
> > +{
> > +	struct xfs_ifork	*ifp;
> > +	int64_t			nr_exts;
> > +	int64_t			max_exts;
> > +
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +
> > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > +		max_exts = MAXEXTNUM;
> > +	else if (whichfork == XFS_ATTR_FORK)
> > +		max_exts = MAXAEXTNUM;
> > +	else
> > +		ASSERT(0);
> > +
> > +	nr_exts = ifp->if_nextents + delta;
> 
> Nope, it's a modify function all right.  Then it should be named:
> 
> xfs_nextents_mod(ip, whichfork, delta)
> 
> > +	if ((delta > 0 && nr_exts > max_exts)
> > +		|| (delta < 0 && nr_exts < 0))
> 
> Line these up, please.  e.g.,
> 
> 	if ((delta > 0 && nr_exts > max_exts) ||
>             (delta < 0 && nr_exts < 0))
> 
> --D
> 
> > +		return -EOVERFLOW;

Oh, also, shouldn't this be EFBIG ("File too big")?

--D

> > +
> > +	ifp->if_nextents = nr_exts;
> > +
> > +	return 0;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > index a4953e95c4f3..a84ae42ace79 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> >  
> > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> >  #endif	/* __XFS_INODE_FORK_H__ */
> > -- 
> > 2.20.1
> > 

* Re: [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents
  2020-06-06  8:27 ` [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents Chandan Babu R
@ 2020-06-08 16:52   ` Darrick J. Wong
  2020-06-09 14:23     ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 16:52 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:43PM +0530, Chandan Babu R wrote:
> The maximum number of extents that can be used by a directory can be
> calculated as shown below. (FS block size is assumed to be 512 bytes
> since the smallest allowed block size can create a BMBT of maximum
> possible height).
> 
> Maximum number of extents in data space =
> XFS_DIR2_SPACE_SIZE / 2^9 = 32GiB / 2^9 = 2^26.
> 
> Maximum number (theoretically) of extents in leaf space =
> 32GiB / 2^9 = 2^26.

Hm.  The leaf hash entries are 8 bytes long, whereas I think directory
entries occupy at least 16 bytes.  Is there a situation where the number
of dir leaf/dabtree blocks can actually hit the 32G section size limit?

> Maximum number of entries in a free space index block
> = (512 - (sizeof struct xfs_dir3_free_hdr)) / (sizeof struct
>                                                xfs_dir2_data_off_t)
> = (512 - 64) / 2 = 224
> 
> Maximum number of extents in free space index =
> (Maximum number of extents in data segment) / 224 =
> 2^26 / 224 = ~2^18
> 
> Maximum number of extents in a directory =
> Maximum number of extents in data space +
> Maximum number of extents in leaf space +
> Maximum number of extents in free space index =
> 2^26 + 2^26 + 2^18 = ~2^27

I calculated the exact expression here, and got:

2^26 + 2^26 + (2^26/224) = 134,517,321

This requires 28 bits of space, doesn't it?

Granted I bet the leaf section won't come within 300,000 nextents of the
2^26 you've assumed for it, so I suspect that in real world scenarios,
27 bits is enough.  But if you're anticipating a totally full leaf
section under extreme fragmentation, then MAXDIREXTNUM ought to be able
to handle that.

(Assuming I did any of that math correctly. ;))
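
The same arithmetic as a small C sketch, using the 512-byte block size
and 64-byte free index header assumed in the commit message
(bits_needed() and dir_max_extents() are hypothetical names used only
for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Number of bits needed to represent n (0 for n == 0). */
static unsigned int bits_needed(uint64_t n)
{
	unsigned int bits = 0;

	while (n) {
		bits++;
		n >>= 1;
	}
	return bits;
}

/*
 * Worst-case directory extent count with 512-byte blocks: a full data
 * segment, a full leaf segment, and the free index blocks needed to
 * cover the data segment.
 */
static uint64_t dir_max_extents(void)
{
	const uint64_t seg_extents = UINT64_C(1) << 26;	/* 32GiB / 512 */
	const uint64_t free_per_block = (512 - 64) / 2;	/* entries per block */

	assert(free_per_block == 224);
	return seg_extents + seg_extents + seg_extents / free_per_block;
}
```

The total comes to 134,517,321, which is larger than 0x7ffffff
(134,217,727) and therefore needs 28 bits.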

--D

> 
> This commit defines the macro MAXDIREXTNUM to have the value 2^27 and
> this in turn is used in calculating the maximum height of a directory
> BMBT.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c  | 2 +-
>  fs/xfs/libxfs/xfs_types.h | 1 +
>  2 files changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 8b0029b3cecf..f75b70ae7b1f 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -81,7 +81,7 @@ xfs_bmap_compute_maxlevels(
>  	if (whichfork == XFS_DATA_FORK) {
>  		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
>  		if (dir_bmbt)
> -			maxleafents = MAXEXTNUM;
> +			maxleafents = MAXDIREXTNUM;
>  		else
>  			maxleafents = MAXEXTNUM;
>  	} else {
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440..0a3041ad5bec 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -60,6 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> +#define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
>  #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
>  
>  /*
> -- 
> 2.20.1
> 

* Re: [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-06  8:27 ` [PATCH 6/7] xfs: Extend data extent counter to 47 bits Chandan Babu R
@ 2020-06-08 17:14   ` Darrick J. Wong
  2020-06-09 14:23     ` Chandan Babu R
  2020-06-19 14:38   ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 17:14 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:44PM +0530, Chandan Babu R wrote:
> This commit extends the per-inode data extent counter to 47 bits. A
> width of 47 bits was chosen because:
> Maximum file size = 2^63.
> Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.
> 
> The following changes are made to accomplish this,
> 1. A new ro-compat superblock flag to prevent older kernels from
>    mounting the filesystem in read-write mode. This flag is set for the
>    first time when an inode would end up having more than 2^31 extents.
> 3. Carve out a new 32-bit field from xfs_dinode->di_pad2[]. This field
>    holds the most significant 15 bits of the data extent counter.

On a 1k block V5 fs, the maximum extent count is 2^(63-10) = 2^53.

If you're going to allocate 32 bits of space from di_pad2 to expand the
data fork's nextents, let's use the entire bitspace.
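
The bound comes from dividing the maximum file size by the block size,
assuming the worst case of one block per extent.  A small sketch of that
arithmetic follows; max_extent_bits() and hi_field_bits() are
hypothetical helper names, and the 32-bit low field stands for the
existing di_nextents:

```c
#include <assert.h>
#include <stdint.h>

/*
 * log2 of the worst-case extent count for a maximum file size of 2^63
 * bytes and a block size of 2^blocklog bytes, one block per extent.
 */
static unsigned int max_extent_bits(unsigned int blocklog)
{
	assert(blocklog <= 63);
	return 63 - blocklog;
}

/* Bits that spill over into a separate high field beyond the low 32. */
static unsigned int hi_field_bits(unsigned int blocklog)
{
	unsigned int total = max_extent_bits(blocklog);

	return total > 32 ? total - 32 : 0;
}
```

For 64k blocks this gives 47 bits in total and 15 in the new field.  For
1k blocks it gives 53 and 21, which still fits in a 32-bit field.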

> 2. A new inode->di_flags2 flag to indicate that the newly added field
>    contains valid data. This flag is set when one of the following two
>    conditions is met,
>    - When the inode is about to have more than 2^31 extents.
>    - When flushing the incore inode (See xfs_iflush_int()), if
>      the superblock ro-compat flag is already set.
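
For reference, a userspace sketch of how an ro-compat bit keeps older
kernels from mounting read-write.  The bit positions are copied from the
xfs_format.h hunk in this patch; the MODEL_ names and can_mount_rw() are
illustrative, not the kernel's own helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in the quoted hunk; the MODEL_ names are made up. */
#define MODEL_RO_COMPAT_FINOBT		(1u << 0)
#define MODEL_RO_COMPAT_RMAPBT		(1u << 1)
#define MODEL_RO_COMPAT_REFLINK		(1u << 2)
#define MODEL_RO_COMPAT_47BIT_DEXT	(1u << 3)

/* Feature bits recognised by a kernel without this patch. */
#define MODEL_OLD_KERNEL_ALL \
	(MODEL_RO_COMPAT_FINOBT | \
	 MODEL_RO_COMPAT_RMAPBT | \
	 MODEL_RO_COMPAT_REFLINK)

/* Feature bits recognised by a patched kernel. */
#define MODEL_NEW_KERNEL_ALL \
	(MODEL_OLD_KERNEL_ALL | MODEL_RO_COMPAT_47BIT_DEXT)

static_assert((MODEL_OLD_KERNEL_ALL & MODEL_RO_COMPAT_47BIT_DEXT) == 0,
	      "the new bit must be unknown to old kernels");

/* A read-write mount is allowed only if no unknown ro-compat bit is set. */
static bool can_mount_rw(uint32_t sb_features_ro_compat, uint32_t known)
{
	return (sb_features_ro_compat & ~known) == 0;
}
```

Once the new bit is set in the superblock, the old mask no longer covers
it, so the unpatched kernel is limited to read-only mounts while a
patched kernel can still mount read-write.
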
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c        | 40 ++++++++--------
>  fs/xfs/libxfs/xfs_format.h      | 30 ++++++++----
>  fs/xfs/libxfs/xfs_inode_buf.c   | 46 +++++++++++++++---
>  fs/xfs/libxfs/xfs_inode_buf.h   |  2 +
>  fs/xfs/libxfs/xfs_inode_fork.c  | 84 ++++++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_fork.h  |  3 +-
>  fs/xfs/libxfs/xfs_log_format.h  |  5 +-
>  fs/xfs/libxfs/xfs_types.h       |  5 +-
>  fs/xfs/scrub/inode.c            |  9 ++--
>  fs/xfs/xfs_inode.c              |  6 ++-
>  fs/xfs/xfs_inode_item.c         |  5 +-
>  fs/xfs/xfs_inode_item_recover.c | 16 +++++--
>  12 files changed, 184 insertions(+), 67 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index f75b70ae7b1f..73e552678adc 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -53,9 +53,9 @@ xfs_bmap_compute_maxlevels(
>  	int		whichfork,	/* data or attr fork */
>  	int		dir_bmbt)	/* Dir or non-dir data fork */
>  {
> +	uint64_t	maxleafents;	/* max leaf entries possible */
>  	int		level;		/* btree level */
>  	uint		maxblocks;	/* max blocks at this level */
> -	uint		maxleafents;	/* max leaf entries possible */
>  	int		maxrootrecs;	/* max records in root block */
>  	int		minleafrecs;	/* min records in leaf block */
>  	int		minnoderecs;	/* min records in node block */
> @@ -477,7 +477,7 @@ xfs_bmap_check_leaf_extents(
>  	if (bp_release)
>  		xfs_trans_brelse(NULL, bp);
>  error_norelse:
> -	xfs_warn(mp, "%s: BAD after btree leaves for %d extents",
> +	xfs_warn(mp, "%s: BAD after btree leaves for %llu extents",
>  		__func__, i);
>  	xfs_err(mp, "%s: CORRUPTED BTREE OR SOMETHING", __func__);
>  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> @@ -918,7 +918,7 @@ xfs_bmap_local_to_extents(
>  	xfs_iext_first(ifp, &icur);
>  	xfs_iext_insert(ip, &icur, &rec, 0);
>  
> -	error = xfs_next_set(ip, whichfork, 1);
> +	error = xfs_next_set(tp, ip, whichfork, 1);
>  	if (error)
>  		goto done;
>  
> @@ -1610,7 +1610,7 @@ xfs_bmap_add_extent_delay_real(
>  		xfs_iext_prev(ifp, &bma->icur);
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
>  
> -		error = xfs_next_set(bma->ip, whichfork, -1);
> +		error = xfs_next_set(bma->tp, bma->ip, whichfork, -1);
>  		if (error)
>  			goto done;
>  
> @@ -1717,7 +1717,7 @@ xfs_bmap_add_extent_delay_real(
>  		PREV.br_state = new->br_state;
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
>  
> -		error = xfs_next_set(bma->ip, whichfork, 1);
> +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -1786,7 +1786,7 @@ xfs_bmap_add_extent_delay_real(
>  		 */
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
>  
> -		error = xfs_next_set(bma->ip, whichfork, 1);
> +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -1876,7 +1876,7 @@ xfs_bmap_add_extent_delay_real(
>  		 */
>  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
>  
> -		error = xfs_next_set(bma->ip, whichfork, 1);
> +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -1965,7 +1965,7 @@ xfs_bmap_add_extent_delay_real(
>  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
>  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
>  
> -		error = xfs_next_set(bma->ip, whichfork, 1);
> +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -2172,7 +2172,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &LEFT);
>  
> -		error = xfs_next_set(ip, whichfork, -2);
> +		error = xfs_next_set(tp, ip, whichfork, -2);
>  		if (error)
>  			goto done;
>  
> @@ -2228,7 +2228,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &LEFT);
>  
> -		error = xfs_next_set(ip, whichfork, -1);
> +		error = xfs_next_set(tp, ip, whichfork, -1);
>  		if (error)
>  			goto done;
>  
> @@ -2274,7 +2274,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &PREV);
>  
> -		error = xfs_next_set(ip, whichfork, -1);
> +		error = xfs_next_set(tp, ip, whichfork, -1);
>  		if (error)
>  			goto done;
>  
> @@ -2385,7 +2385,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_update_extent(ip, state, icur, &PREV);
>  		xfs_iext_insert(ip, icur, new, state);
>  
> -		error = xfs_next_set(ip, whichfork, 1);
> +		error = xfs_next_set(tp, ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -2464,7 +2464,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_next(ifp, icur);
>  		xfs_iext_insert(ip, icur, new, state);
>  
> -		error = xfs_next_set(ip, whichfork, 1);
> +		error = xfs_next_set(tp, ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -2519,7 +2519,7 @@ xfs_bmap_add_extent_unwritten_real(
>  		xfs_iext_insert(ip, icur, &r[1], state);
>  		xfs_iext_insert(ip, icur, &r[0], state);
>  
> -		error = xfs_next_set(ip, whichfork, 2);
> +		error = xfs_next_set(tp, ip, whichfork, 2);
>  		if (error)
>  			goto done;
>  
> @@ -2838,7 +2838,7 @@ xfs_bmap_add_extent_hole_real(
>  		xfs_iext_prev(ifp, icur);
>  		xfs_iext_update_extent(ip, state, icur, &left);
>  
> -		error = xfs_next_set(ip, whichfork, -1);
> +		error = xfs_next_set(tp, ip, whichfork, -1);
>  		if (error)
>  			goto done;
>  
> @@ -2940,7 +2940,7 @@ xfs_bmap_add_extent_hole_real(
>  		 */
>  		xfs_iext_insert(ip, icur, new, state);
>  
> -		error = xfs_next_set(ip, whichfork, 1);
> +		error = xfs_next_set(tp, ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -5140,7 +5140,7 @@ xfs_bmap_del_extent_real(
>  		xfs_iext_remove(ip, icur, state);
>  		xfs_iext_prev(ifp, icur);
>  
> -		error = xfs_next_set(ip, whichfork, -1);
> +		error = xfs_next_set(tp, ip, whichfork, -1);
>  		if (error)
>  			goto done;
>  
> @@ -5252,7 +5252,7 @@ xfs_bmap_del_extent_real(
>  		} else
>  			flags |= xfs_ilog_fext(whichfork);
>  
> -		error = xfs_next_set(ip, whichfork, 1);
> +		error = xfs_next_set(tp, ip, whichfork, 1);
>  		if (error)
>  			goto done;
>  
> @@ -5722,7 +5722,7 @@ xfs_bmse_merge(
>  	 * Update the on-disk extent count, the btree if necessary and log the
>  	 * inode.
>  	 */
> -	error = xfs_next_set(ip, whichfork, -1);
> +	error = xfs_next_set(tp, ip, whichfork, -1);
>  	if (error)
>  		goto done;
>  
> @@ -6113,7 +6113,7 @@ xfs_bmap_split_extent(
>  	xfs_iext_next(ifp, &icur);
>  	xfs_iext_insert(ip, &icur, &new, 0);
>  
> -	error = xfs_next_set(ip, whichfork, 1);
> +	error = xfs_next_set(tp, ip, whichfork, 1);
>  	if (error)
>  		goto del_cursor;
>  
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index b42a52bfa1e9..91bee33aa988 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -449,10 +449,12 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> +#define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */

I wonder if we could come up with a better name for this...

DFORK_EXTENTHI

Hmm...

BIG_DFORK

Hmmm...

ULTRAFRAG

There we go.  "XFS with UltraFrag, part of this complete g@m3r t00lk1t." ;)

...

(What do you think of the second suggestion?)

>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> -		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
> +		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> +		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> @@ -563,6 +565,18 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
>  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
>  }
>  
> +static inline bool xfs_sb_version_has47bitext(struct xfs_sb *sbp)
> +{
> +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> +		(sbp->sb_features_ro_compat &
> +			XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR);
> +}
> +
> +static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
> +{
> +	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
> +}
> +
>  /*
>   * end of superblock version macros
>   */
> @@ -873,7 +887,7 @@ typedef struct xfs_dinode {
>  	__be64		di_size;	/* number of bytes in file */
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
> -	__be32		di_nextents;	/* number of extents in data fork */
> +	__be32		di_nextents_lo;	/* number of extents in data fork */
>  	__be16		di_anextents;	/* number of extents in attribute fork*/
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
> @@ -891,7 +905,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be32		di_nextents_hi;
> +	__u8		di_pad2[8];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -992,10 +1007,6 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> @@ -1061,12 +1072,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
>  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> +#define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> +#define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
> -	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
> +	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> +	 XFS_DIFLAG2_47BIT_NEXTENTS)
>  
>  /*
>   * Inode number format:
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 6f84ea85fdd8..8b89fe080f70 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -307,7 +307,8 @@ xfs_inode_to_disk(
>  	to->di_size = cpu_to_be64(from->di_size);
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
> -	to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
> +	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
> +					0xffffffffU);
>  	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> @@ -322,6 +323,10 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +			to->di_nextents_hi
> +				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
> +					>> 32);

/me kinda hates the indentation here, would a convenience variable
reduce the amount of linewrapping here?

Oh, right, we're in a new epoch now; just go past 80 columns.
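
For illustration only, the convenience-variable idea could look roughly
like the standalone sketch below.  struct mock_dinode and
mock_split_nextents() are made-up names, and the fields are kept in host
byte order to keep the example small; the kernel code would wrap each
store in cpu_to_be32().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk counter halves (host-endian). */
struct mock_dinode {
	uint32_t	di_nextents_lo;
	uint32_t	di_nextents_hi;
};

/*
 * Take the 64-bit extent count once as a local value, then store the low
 * and high halves.  The high half is only written when the inode carries
 * the "uses di_nextents_hi" flag, mirroring the check in the patch.
 */
static void mock_split_nextents(struct mock_dinode *to, uint64_t nextents,
				bool has_hi_field)
{
	to->di_nextents_lo = (uint32_t)(nextents & 0xffffffffU);
	if (has_hi_field)
		to->di_nextents_hi = (uint32_t)(nextents >> 32);
}
```

With the count held in one local, neither assignment needs to be wrapped
across three lines.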

>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -360,7 +365,7 @@ xfs_log_dinode_to_disk(
>  	to->di_size = cpu_to_be64(from->di_size);
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
> -	to->di_nextents = cpu_to_be32(from->di_nextents);
> +	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
>  	to->di_anextents = cpu_to_be16(from->di_anextents);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
> @@ -375,6 +380,9 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +			to->di_nextents_hi =
> +				cpu_to_be32(from->di_nextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -391,7 +399,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	xfs_extnum_t		di_nextents;
> +
> +	di_nextents = xfs_dfork_nextents(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -462,6 +472,8 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	xfs_extnum_t		nextents;
> +	int64_t			nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -492,10 +504,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
> +	nextents += xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -716,3 +730,23 @@ xfs_inode_validate_cowextsize(
>  
>  	return NULL;
>  }
> +
> +xfs_extnum_t
> +xfs_dfork_nextents(
> +	struct xfs_sb		*sbp,
> +	struct xfs_dinode	*dip,
> +	int			whichfork)
> +{
> +	xfs_extnum_t		nextents;
> +
> +	if (whichfork == XFS_DATA_FORK) {
> +		nextents = be32_to_cpu(dip->di_nextents_lo);
> +		if (xfs_sb_version_has_v3inode(sbp)
> +			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))

Please don't align the second line of the if test with the if body.

Or maybe just create a "xfs_inode_has_big_dfork" helper to encapsulate
this, like we do for reflink/hascow/realtime inodes.

> +			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
> +				<< 32;
> +		return nextents;
> +	} else {
> +		return be16_to_cpu(dip->di_anextents);

I suspect you could reduce the indenting here by inverting the logic,
e.g.

	if (attr fork)
		return be16_to_cpu(anextents);

	nextents = be32_to_cpu(nextents_lo);
	if (xfs_inode_has_big_dfork())
		nextents |= (u64)be32_to_cpu(nextents_hi) << 32;
	return nextents;
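
Spelled out as a compilable userspace sketch, the early-return shape
might look like the following.  Everything prefixed mock_/MOCK_ is an
assumption made for the example (including the flag name), and the
fields are host-endian, so the be16_to_cpu()/be32_to_cpu() conversions
of the real code are left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_DATA_FORK		0
#define MOCK_ATTR_FORK		1
#define MOCK_DIFLAG2_BIG_DFORK	(1ULL << 3)	/* hypothetical flag name */

/* Hypothetical, host-endian stand-in for the relevant xfs_dinode fields. */
struct mock_dinode {
	uint64_t	di_flags2;
	uint32_t	di_nextents_lo;
	uint32_t	di_nextents_hi;
	uint16_t	di_anextents;
};

/* Helper encapsulating the "v3 inode with the big-dfork flag set" test. */
static inline bool mock_inode_has_big_dfork(bool v3inode,
					    const struct mock_dinode *dip)
{
	return v3inode && (dip->di_flags2 & MOCK_DIFLAG2_BIG_DFORK);
}

/* Handle the attr fork first so the data fork path needs no else branch. */
static uint64_t mock_dfork_nextents(bool v3inode,
				    const struct mock_dinode *dip,
				    int whichfork)
{
	uint64_t	nextents;

	if (whichfork == MOCK_ATTR_FORK)
		return dip->di_anextents;

	nextents = dip->di_nextents_lo;
	if (mock_inode_has_big_dfork(v3inode, dip))
		nextents |= (uint64_t)dip->di_nextents_hi << 32;
	return nextents;
}
```

Note that the high half has to be shifted into place before it is
combined with the low half; the sketch ignores di_nextents_hi entirely
when the flag is clear.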

> +	}
> +}
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
> index 865ac493c72a..4583db53b933 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.h
> +++ b/fs/xfs/libxfs/xfs_inode_buf.h
> @@ -65,5 +65,7 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
>  xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
>  		uint32_t cowextsize, uint16_t mode, uint16_t flags,
>  		uint64_t flags2);
> +xfs_extnum_t xfs_dfork_nextents(struct xfs_sb *sbp, struct xfs_dinode *dip,
> +		int whichfork);
>  
>  #endif	/* __XFS_INODE_BUF_H__ */
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 3bf5a2c391bd..ec682e2d5bcb 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -10,6 +10,7 @@
>  #include "xfs_format.h"
>  #include "xfs_log_format.h"
>  #include "xfs_trans_resv.h"
> +#include "xfs_sb.h"
>  #include "xfs_mount.h"
>  #include "xfs_inode.h"
>  #include "xfs_trans.h"
> @@ -103,21 +104,22 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> +	xfs_extnum_t		nex = xfs_dfork_nextents(sb, dip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
>  	struct xfs_bmbt_irec	new;
> -	int			i;
> +	xfs_extnum_t		i;
>  
>  	/*
>  	 * If the number of extents is unreasonable, then something is wrong and
>  	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
>  	 */
>  	if (unlikely(size < 0 || size > XFS_DFORK_SIZE(dip, mp, whichfork))) {
> -		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %d).",
> +		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %llu).",
>  			(unsigned long long) ip->i_ino, nex);
>  		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
>  				"xfs_iformat_extents(1)", dip, sizeof(*dip),
> @@ -233,7 +235,11 @@ xfs_iformat_data_fork(
>  	 * depend on it.
>  	 */
>  	ip->i_df.if_format = dip->di_format;
> -	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents);
> +	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents_lo);
> +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +		ip->i_df.if_nextents |=
> +			((u64)(be32_to_cpu(dip->di_nextents_hi)) << 32);
> +
>  
>  	switch (inode->i_mode & S_IFMT) {
>  	case S_IFIFO:
> @@ -729,31 +735,73 @@ xfs_ifork_verify_local_attr(
>  	return 0;
>  }
>  
> +static int
> +xfs_next_set_data(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	struct xfs_ifork	*ifp,
> +	int			delta)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_extnum_t		nr_exts;
> +
> +	nr_exts = ifp->if_nextents + delta;
> +
> +	if ((delta > 0 && nr_exts > MAXEXTNUM)
> +		|| (delta < 0 && nr_exts > ifp->if_nextents))
> +		return -EOVERFLOW;
> +
> +	if (ifp->if_nextents <= MAXEXTNUM31BIT &&
> +		nr_exts > MAXEXTNUM31BIT &&
> +		!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS) &&
> +		xfs_sb_version_has_v3inode(&mp->m_sb)) {
> +		if (!xfs_sb_version_has47bitext(&mp->m_sb)) {

Urk.  Again, don't indent the if test logic and the if body statements
to the same level.

> +			bool log_sb = false;
> +
> +			spin_lock(&mp->m_sb_lock);
> +			if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
> +				xfs_sb_version_add47bitext(&mp->m_sb);
> +				log_sb = true;
> +			}
> +			spin_unlock(&mp->m_sb_lock);
> +
> +			if (log_sb)
> +				xfs_log_sb(tp);
> +		}

Hm, dynamic filesystem upgrade.  This probably ought to log something to
dmesg about the upgrade.  It might also be better to make this a
separate helper so that it's not triply-indented.
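
A rough userspace sketch of such a helper is below.  struct mock_mount,
the stubbed lock functions, the feature bit name and the message text
are all assumptions for illustration; in the kernel the lock would be
mp->m_sb_lock, the counter bump would be xfs_log_sb(tp), and the message
would go through one of the xfs logging macros rather than fprintf().

```c
#include <stdbool.h>
#include <stdio.h>

#define MOCK_FEAT_BIG_DFORK	(1U << 3)	/* hypothetical feature bit */

/* Hypothetical stand-in for xfs_mount. */
struct mock_mount {
	unsigned int	features_ro_compat;
	int		sb_log_count;	/* times the superblock was logged */
};

/* No-op stubs standing in for spin_lock()/spin_unlock(). */
static void mock_lock(struct mock_mount *mp)   { (void)mp; }
static void mock_unlock(struct mock_mount *mp) { (void)mp; }

/*
 * Set the feature bit if it is not set yet.  The bit is tested once
 * without the lock and again with it held, as in the patch.  Returns
 * true only for the caller that actually performed the upgrade, so the
 * superblock is logged and the notice is printed exactly once.
 */
static bool mock_upgrade_big_dfork(struct mock_mount *mp)
{
	bool	upgraded = false;

	if (mp->features_ro_compat & MOCK_FEAT_BIG_DFORK)
		return false;

	mock_lock(mp);
	if (!(mp->features_ro_compat & MOCK_FEAT_BIG_DFORK)) {
		mp->features_ro_compat |= MOCK_FEAT_BIG_DFORK;
		upgraded = true;
	}
	mock_unlock(mp);

	if (upgraded) {
		mp->sb_log_count++;	/* stand-in for xfs_log_sb(tp) */
		fprintf(stderr,
			"mock: enabling large data fork extent counters\n");
	}
	return upgraded;
}
```

The caller then shrinks to a single "if (needs upgrade) helper(...)"
line, which removes two levels of indentation from the extent-count
update path.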

> +
> +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> +	}
> +
> +	ifp->if_nextents = nr_exts;
> +
> +	return 0;
> +}
> +
>  int
>  xfs_next_set(
> +	struct xfs_trans	*tp,
>  	struct xfs_inode	*ip,
>  	int			whichfork,
>  	int			delta)
>  {
>  	struct xfs_ifork	*ifp;
>  	int64_t			nr_exts;
> -	int64_t			max_exts;
> +	int			error = 0;
>  
>  	ifp = XFS_IFORK_PTR(ip, whichfork);
>  
> -	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> -		max_exts = MAXEXTNUM;
> -	else if (whichfork == XFS_ATTR_FORK)
> -		max_exts = MAXAEXTNUM;
> -	else
> -		ASSERT(0);
> -
> -	nr_exts = ifp->if_nextents + delta;
> -	if ((delta > 0 && nr_exts > max_exts)
> -		|| (delta < 0 && nr_exts < 0))
> -		return -EOVERFLOW;
> +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
> +		error = xfs_next_set_data(tp, ip, ifp, delta);
> +	} else if (whichfork == XFS_ATTR_FORK) {
> +		nr_exts = ifp->if_nextents + delta;
> +		if ((delta > 0 && nr_exts > MAXAEXTNUM)
> +			|| (delta < 0 && nr_exts < 0))
> +			return -EOVERFLOW;
>  
> -	ifp->if_nextents = nr_exts;
> +		ifp->if_nextents = nr_exts;
> +	} else {
> +		ASSERT(0);
> +	}
>  
> -	return 0;
> +	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index a84ae42ace79..c74fa6371cc8 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -173,5 +173,6 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
>  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
>  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
>  
> -int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> +int xfs_next_set(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork,
> +		int delta);
>  #endif	/* __XFS_INODE_FORK_H__ */
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cd..879aadff7692 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -396,7 +396,7 @@ struct xfs_log_dinode {
>  	xfs_fsize_t	di_size;	/* number of bytes in file */
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> -	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> +	uint32_t	di_nextents_lo;	/* number of extents in data fork */
>  	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint32_t	di_nextents_hi;
> +	uint8_t		di_pad2[8];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 0a3041ad5bec..c68ff2178976 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -12,7 +12,7 @@ typedef uint32_t	xfs_agblock_t;	/* blockno in alloc. group */
>  typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
>  typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> @@ -59,7 +59,8 @@ typedef void *		xfs_failaddr_t;
>   * Max values for extlen, extnum, aextnum.
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> -#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> +#define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
> +#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
>  #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
>  #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
>  
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index 6d483ab29e63..be41fd242ff2 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -205,8 +205,8 @@ xchk_dinode(
>  	struct xfs_mount	*mp = sc->mp;
>  	size_t			fork_recs;
>  	unsigned long long	isize;
> +	xfs_extnum_t		nextents;
>  	uint64_t		flags2;
> -	uint32_t		nextents;
>  	uint16_t		flags;
>  	uint16_t		mode;
>  
> @@ -354,7 +354,7 @@ xchk_dinode(
>  	xchk_inode_extsize(sc, dip, ino, mode, flags);
>  
>  	/* di_nextents */
> -	nextents = be32_to_cpu(dip->di_nextents);
> +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
>  	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_format) {
>  	case XFS_DINODE_FMT_EXTENTS:
> @@ -464,6 +464,7 @@ xchk_inode_xref_bmap(
>  	struct xfs_scrub	*sc,
>  	struct xfs_dinode	*dip)
>  {
> +	xfs_mount_t		*mp = sc->mp;

struct xfs_mount.  Structure typedefs are deprecated and we're trying
to get rid of them (slowly).

--D

>  	xfs_extnum_t		nextents;
>  	xfs_filblks_t		count;
>  	xfs_filblks_t		acount;
> @@ -477,14 +478,14 @@ xchk_inode_xref_bmap(
>  			&nextents, &count);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents < be32_to_cpu(dip->di_nextents))
> +	if (nextents < xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
>  			&nextents, &acount);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents != be16_to_cpu(dip->di_anextents))
> +	if (nextents != xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	/* Check nblocks against the inode. */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 64f5f9a440ae..4418a66cf6d6 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3748,7 +3748,7 @@ xfs_iflush_int(
>  				ip->i_d.di_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
>  		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
>  			"%s: detected corrupt incore inode %Lu, "
> -			"total extents = %d, nblocks = %Ld, ptr "PTR_FMT,
> +			"total extents = %llu, nblocks = %Ld, ptr "PTR_FMT,
>  			__func__, ip->i_ino,
>  			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
>  			ip->i_d.di_nblocks, ip);
> @@ -3785,6 +3785,10 @@ xfs_iflush_int(
>  	    xfs_ifork_verify_local_attr(ip))
>  		goto flush_out;
>  
> +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +		&& xfs_sb_version_has47bitext(&mp->m_sb))
> +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> +
>  	/*
>  	 * Copy the dirty parts of the inode into the on-disk inode.  We always
>  	 * copy out the core of the inode, because if the inode is dirty at all
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index ba47bf65b772..6f27ac7c8631 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -326,7 +326,7 @@ xfs_inode_to_log_dinode(
>  	to->di_size = from->di_size;
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
> -	to->di_nextents = xfs_ifork_nextents(&ip->i_df);
> +	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
>  	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> @@ -344,6 +344,9 @@ xfs_inode_to_log_dinode(
>  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
>  		to->di_flags2 = from->di_flags2;
>  		to->di_cowextsize = from->di_cowextsize;
> +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +			to->di_nextents_hi =
> +				xfs_ifork_nextents(&ip->i_df) >> 32;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 10ef5ddf5429..8d64b861fb66 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -134,6 +134,7 @@ xlog_recover_inode_commit_pass2(
>  	struct xfs_log_dinode		*ldip;
>  	uint				isize;
>  	int				need_free = 0;
> +	xfs_extnum_t			nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -255,16 +256,23 @@ xlog_recover_inode_commit_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_nextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
> +		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> +		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
> +
> +	nextents += ldip->di_anextents;
> +
> +	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
>  				     XFS_ERRLEVEL_LOW, mp, ldip,
>  				     sizeof(*ldip));
>  		xfs_alert(mp,
>  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
> -	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
> +	"dino bp "PTR_FMT", ino %Ld, total extents = %llu, nblocks = %Ld",
>  			__func__, item, dip, bp, in_f->ilf_ino,
> -			ldip->di_nextents + ldip->di_anextents,
> -			ldip->di_nblocks);
> +			nextents, ldip->di_nblocks);
>  		error = -EFSCORRUPTED;
>  		goto out_release;
>  	}
> -- 
> 2.20.1
> 


* Re: [PATCH 7/7] xfs: Extend attr extent counter to 32 bits
  2020-06-06  8:27 ` [PATCH 7/7] xfs: Extend attr extent counter to 32 bits Chandan Babu R
@ 2020-06-08 17:21   ` Darrick J. Wong
  2020-06-09 14:22     ` Chandan Babu R
  2020-06-19 14:39   ` Christoph Hellwig
  1 sibling, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 17:21 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:45PM +0530, Chandan Babu R wrote:
> This commit extends the per-inode attr extent counter to 32 bits.
> 
> The following changes are made to accomplish this,
> 1. A new ro-compat superblock flag to prevent older kernels from
>    mounting the filesystem in read-write mode. This flag is set for the
>    first time when an inode would end up having more than 2^15 extents.
> 2. Carve out a new 16-bit field from xfs_dinode->di_pad2[]. This field
>    holds the most significant 16 bits of the attr extent counter.

How difficult is it to end up with an attr fork mapping more than 2^32
blocks?  Supposing I have a file with nlinks==2^32-1, each mapped to a
255-byte name and some number of other xattrs?

> 3. A new inode->di_flags2 flag to indicate that the newly added field
>    contains valid data. This flag is set when one of the following two
>    conditions is met,
>    - When the inode is about to have more than 2^15 extents.
>    - When flushing the incore inode (See xfs_iflush_int()), if
>      the superblock ro-compat flag is already set.
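
As a back-of-the-envelope aid for the 2^32 question above, the sketch
below computes a lower bound on the blocks needed just to hold the
names.  The 4096-byte block size is an assumption, and per-entry
headers, parent pointer values and btree overhead are all ignored, so
the real figure would be higher.

```c
#include <stdint.h>

/* Assumed values for the estimate; the block size is not from the patch. */
#define EST_BLOCK_SIZE	4096ULL
#define EST_NAME_LEN	255ULL

/* Blocks needed to store nlinks names back to back, rounded up. */
static uint64_t est_min_name_blocks(uint64_t nlinks)
{
	uint64_t	bytes = nlinks * EST_NAME_LEN;

	return (bytes + EST_BLOCK_SIZE - 1) / EST_BLOCK_SIZE;
}
```

For nlinks = 2^32-1 this comes to 267,386,880 blocks, about 1/16 of
2^32, so under these assumptions the names alone do not get there; it
would take the other xattrs or substantial metadata overhead.  Since an
extent maps at least one block, the same figure is also a ceiling on the
extents those names could occupy.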
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h      | 25 ++++++++++---
>  fs/xfs/libxfs/xfs_inode_buf.c   | 23 +++++++++---
>  fs/xfs/libxfs/xfs_inode_fork.c  | 62 ++++++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_log_format.h  |  5 +--
>  fs/xfs/libxfs/xfs_types.h       |  5 +--
>  fs/xfs/scrub/inode.c            |  5 +--
>  fs/xfs/xfs_inode.c              |  4 +++
>  fs/xfs/xfs_inode_item.c         |  5 ++-
>  fs/xfs/xfs_inode_item_recover.c |  8 ++++-
>  9 files changed, 113 insertions(+), 29 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 91bee33aa988..2e37d887fd35 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -450,11 +450,13 @@ xfs_sb_has_compat_feature(
>  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
>  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
>  #define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
> +#define XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR (1 << 4)	/* 32bit attr extents */

Can we bundle both of these changes in a single feature flag?  I would
like to keep our feature testing matrix as small as we can.

/* 64-bit data fork extent counts and 32-bit attr fork extent counts */
#define XFS_SB_FEAT_RO_COMPAT_BIG_FORK	(1 << 4)
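
A minimal sketch of what the bundled flag would mean for the helpers:
one predicate and one setter cover both counters, instead of a has/add
pair per fork.  struct mock_sb and the mock_ names are assumptions, and
the version check is simplified (the real XFS_SB_VERSION_NUM() masks
sb_versionnum).

```c
#include <stdbool.h>
#include <stdint.h>

#define MOCK_SB_VERSION_5		5
/* Bit value taken from the suggestion above. */
#define MOCK_FEAT_RO_COMPAT_BIG_FORK	(1U << 4)

/* Hypothetical stand-in for the relevant xfs_sb fields. */
struct mock_sb {
	uint16_t	sb_versionnum;
	uint32_t	sb_features_ro_compat;
};

/* True if both the wide data fork and wide attr fork counters are enabled. */
static inline bool mock_sb_has_big_fork(const struct mock_sb *sbp)
{
	return sbp->sb_versionnum == MOCK_SB_VERSION_5 &&
	       (sbp->sb_features_ro_compat & MOCK_FEAT_RO_COMPAT_BIG_FORK);
}

/* Enable the bundled feature; there is no per-fork variant to keep in sync. */
static inline void mock_sb_add_big_fork(struct mock_sb *sbp)
{
	sbp->sb_features_ro_compat |= MOCK_FEAT_RO_COMPAT_BIG_FORK;
}
```

With a single bit there are two configurations to test (feature off,
feature on) rather than the four that two independent bits allow.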

>  #define XFS_SB_FEAT_RO_COMPAT_ALL \
>  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
>  		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> -		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
> +		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR | \
> +		 XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR)
>  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
>  static inline bool
>  xfs_sb_has_ro_compat_feature(
> @@ -577,6 +579,18 @@ static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
>  	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
>  }
>  
> +static inline bool xfs_sb_version_has32bitaext(struct xfs_sb *sbp)
> +{
> +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> +		(sbp->sb_features_ro_compat &
> +			XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR);
> +}
> +
> +static inline void xfs_sb_version_add32bitaext(struct xfs_sb *sbp)
> +{
> +	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR;
> +}
> +
>  /*
>   * end of superblock version macros
>   */
> @@ -888,7 +902,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents_lo;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -906,7 +920,8 @@ typedef struct xfs_dinode {
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
>  	__be32		di_nextents_hi;
> -	__u8		di_pad2[8];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> +	__u8		di_pad2[6];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -1073,14 +1088,16 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
>  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
>  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
>  #define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
> +#define XFS_DIFLAG2_32BIT_ANEXTENTS_BIT 4 /* Uses di_anextents_hi field  */

Same thing here.

>  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
>  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
>  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
>  #define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
> +#define XFS_DIFLAG2_32BIT_ANEXTENTS (1 << XFS_DIFLAG2_32BIT_ANEXTENTS_BIT)
>  
>  #define XFS_DIFLAG2_ANY \
>  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> -	 XFS_DIFLAG2_47BIT_NEXTENTS)
> +	 XFS_DIFLAG2_47BIT_NEXTENTS | XFS_DIFLAG2_32BIT_ANEXTENTS)
>  
>  /*
>   * Inode number format:
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 8b89fe080f70..285cbce0cd10 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -309,7 +309,8 @@ xfs_inode_to_disk(
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
>  					0xffffffffU);
> -	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
> +	to->di_anextents_lo = cpu_to_be16(xfs_ifork_nextents(ip->i_afp) &
> +					0xffffU);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = xfs_ifork_format(ip->i_afp);
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -327,6 +328,10 @@ xfs_inode_to_disk(
>  			to->di_nextents_hi
>  				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
>  					>> 32);
> +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +			to->di_anextents_hi
> +				= cpu_to_be16(xfs_ifork_nextents(ip->i_afp)
> +					>> 16);
>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -366,7 +371,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -383,6 +388,9 @@ xfs_log_dinode_to_disk(
>  		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
>  			to->di_nextents_hi =
>  				cpu_to_be32(from->di_nextents_hi);
> +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +			to->di_anextents_hi =
> +				cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -566,7 +574,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> @@ -745,8 +753,13 @@ xfs_dfork_nextents(
>  			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
>  			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
>  				<< 32;
> -		return nextents;
>  	} else {
> -		return be16_to_cpu(dip->di_anextents);
> +		nextents = be16_to_cpu(dip->di_anextents_lo);
> +		if (xfs_sb_version_has_v3inode(sbp)
> +			&& (dip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS))
> +			nextents |= (u32)(be16_to_cpu(dip->di_anextents_hi))

<same if test logic vs. if body statement indentation complaint>

> +				<< 16;
>  	}
> +
> +	return nextents;
>  }
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index ec682e2d5bcb..169e16947ece 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -301,7 +301,10 @@ xfs_iformat_attr_fork(
>  	ip->i_afp->if_format = dip->di_aformat;
>  	if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */
>  		ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
> -	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents);
> +	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents_lo);
> +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +		ip->i_afp->if_nextents |=
> +			(u32)(be16_to_cpu(dip->di_anextents_hi)) << 16;
>  
>  	switch (ip->i_afp->if_format) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -777,6 +780,48 @@ xfs_next_set_data(
>  	return 0;
>  }
>  
> +static int
> +xfs_next_set_attr(
> +	struct xfs_trans	*tp,
> +	struct xfs_inode	*ip,
> +	struct xfs_ifork	*ifp,
> +	int			delta)
> +{
> +	struct xfs_mount	*mp = ip->i_mount;
> +	xfs_aextnum_t		nr_exts;
> +
> +	nr_exts = ifp->if_nextents + delta;
> +
> +	if ((delta > 0 && nr_exts < ifp->if_nextents) ||
> +		(delta < 0 && nr_exts > ifp->if_nextents))
> +		return -EOVERFLOW;
> +
> +	if (ifp->if_nextents <= MAXAEXTNUM15BIT &&
> +		nr_exts > MAXAEXTNUM15BIT &&
> +		!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS) &&
> +		xfs_sb_version_has_v3inode(&mp->m_sb)) {
> +		if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {

Indentation complaint^2

> +			bool log_sb = false;
> +
> +			spin_lock(&mp->m_sb_lock);
> +			if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {
> +				xfs_sb_version_add32bitaext(&mp->m_sb);
> +				log_sb = true;
> +			}
> +			spin_unlock(&mp->m_sb_lock);
> +
> +			if (log_sb)
> +				xfs_log_sb(tp);
> +		}
> +
> +		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
> +	}
> +
> +	ifp->if_nextents = nr_exts;
> +
> +	return 0;
> +}
> +
>  int
>  xfs_next_set(
>  	struct xfs_trans	*tp,
> @@ -785,23 +830,16 @@ xfs_next_set(
>  	int			delta)
>  {
>  	struct xfs_ifork	*ifp;
> -	int64_t			nr_exts;
>  	int			error = 0;
>  
>  	ifp = XFS_IFORK_PTR(ip, whichfork);
>  
> -	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
> +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
>  		error = xfs_next_set_data(tp, ip, ifp, delta);
> -	} else if (whichfork == XFS_ATTR_FORK) {
> -		nr_exts = ifp->if_nextents + delta;
> -		if ((delta > 0 && nr_exts > MAXAEXTNUM)
> -			|| (delta < 0 && nr_exts < 0))
> -			return -EOVERFLOW;
> -
> -		ifp->if_nextents = nr_exts;
> -	} else {
> +	else if (whichfork == XFS_ATTR_FORK)
> +		error = xfs_next_set_attr(tp, ip, ifp, delta);
> +	else
>  		ASSERT(0);
> -	}
>  
>  	return error;
>  }
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 879aadff7692..db419fc862bc 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	uint32_t	di_nextents_lo;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -415,7 +415,8 @@ struct xfs_log_dinode {
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
>  	uint32_t	di_nextents_hi;
> -	uint8_t		di_pad2[8];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> +	uint8_t		di_pad2[6];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index c68ff2178976..974737a9e9c1 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -62,7 +62,8 @@ typedef void *		xfs_failaddr_t;
>  #define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
>  #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM15BIT	((xfs_aextnum_t)0x7fff)		/* 15 bits */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0xffffffff)	/* 32 bits */
>  
>  /*
>   * Minimum and maximum blocksize and sectorsize.
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index be41fd242ff2..01e60c78a3a3 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -371,10 +371,12 @@ xchk_dinode(
>  		break;
>  	}
>  
> +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
> +
>  	/* di_forkoff */
>  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
>  		xchk_ino_set_corrupt(sc, ino);
> -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +	if (nextents != 0 && dip->di_forkoff == 0)
>  		xchk_ino_set_corrupt(sc, ino);
>  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
>  		xchk_ino_set_corrupt(sc, ino);
> @@ -386,7 +388,6 @@ xchk_dinode(
>  		xchk_ino_set_corrupt(sc, ino);
>  
>  	/* di_anextents */
> -	nextents = be16_to_cpu(dip->di_anextents);
>  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4418a66cf6d6..6ec34e069344 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3789,6 +3789,10 @@ xfs_iflush_int(
>  		&& xfs_sb_version_has47bitext(&mp->m_sb))
>  		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
>  
> +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +		&& xfs_sb_version_has32bitaext(&mp->m_sb))
> +		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
> +
>  	/*
>  	 * Copy the dirty parts of the inode into the on-disk inode.  We always
>  	 * copy out the core of the inode, because if the inode is dirty at all
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 6f27ac7c8631..40f0a19d1c07 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
> -	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
> +	to->di_anextents_lo = xfs_ifork_nextents(ip->i_afp) & 0xffffU;
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = xfs_ifork_format(ip->i_afp);
>  	to->di_dmevmask = from->di_dmevmask;
> @@ -347,6 +347,9 @@ xfs_inode_to_log_dinode(
>  		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
>  			to->di_nextents_hi =
>  				xfs_ifork_nextents(&ip->i_df) >> 32;
> +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +			to->di_anextents_hi =
> +				xfs_ifork_nextents(ip->i_afp) >> 16;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> index 8d64b861fb66..c8b5fbba848b 100644
> --- a/fs/xfs/xfs_inode_item_recover.c
> +++ b/fs/xfs/xfs_inode_item_recover.c
> @@ -135,6 +135,7 @@ xlog_recover_inode_commit_pass2(
>  	uint				isize;
>  	int				need_free = 0;
>  	xfs_extnum_t			nextents;
> +	xfs_aextnum_t			anextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -262,7 +263,12 @@ xlog_recover_inode_commit_pass2(
>  		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
>  		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
>  
> -	nextents += ldip->di_anextents;
> +	anextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
> +		ldip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> +		anextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += anextents;
>  
>  	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> -- 
> 2.20.1
> 
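
The lo/hi handling in xfs_inode_to_log_dinode() and in the recovery hunk above can be summarised with a small standalone sketch. The struct and function names below are hypothetical, and zeroing the high half when the flag is clear is an assumption of the sketch, not something the patch states.

```c
#include <stdint.h>

/* Hypothetical miniature of the two log dinode fields touched by the patch. */
struct aext_fields {
	uint16_t lo;	/* stands in for di_anextents_lo */
	uint16_t hi;	/* stands in for di_anextents_hi */
};

/* Split an in-core 32-bit count into the two 16-bit on-disk halves. */
static struct aext_fields aext_split(uint32_t nextents, int has_32bit_flag)
{
	struct aext_fields f = {
		.lo = (uint16_t)(nextents & 0xffffU),
		/* Assumption: the high half stays zero unless the flag is set. */
		.hi = has_32bit_flag ? (uint16_t)(nextents >> 16) : 0,
	};
	return f;
}

/* Recombine the halves, mirroring the log recovery hunk. */
static uint32_t aext_join(struct aext_fields f, int has_32bit_flag)
{
	uint32_t nextents = f.lo;

	if (has_32bit_flag)
		nextents |= (uint32_t)f.hi << 16;
	return nextents;
}
```

With the flag set, a count such as 0x12345 splits into lo = 0x2345 and hi = 0x1 and joins back to the same value; without the flag only the low 16 bits take part.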

* Re: [PATCH 0/7] xfs: Extend per-inode extent counters.
  2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (6 preceding siblings ...)
  2020-06-06  8:27 ` [PATCH 7/7] xfs: Extend attr extent counter to 32 bits Chandan Babu R
@ 2020-06-08 17:31 ` Darrick J. Wong
  2020-06-09 14:22   ` Chandan Babu R
  7 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 17:31 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:38PM +0530, Chandan Babu R wrote:
> The commit xfs: fix inode fork extent count overflow
> (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
> per-inode data fork extents should be possible to create. However the
> corresponding on-disk field has a signed 32-bit type. Hence this
> patchset extends the on-disk field to 64-bit length out of which only
> the first 47 bits are valid.
> 
> Also, XFS has a per-inode xattr extent counter which is 16 bits
> wide. A workload which
> 1. Creates 1 million 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to insert 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bit wide xattr extent
> counter.
> 
> I have been informed that there are instances where a single file
> has > 100 million hardlinks. With parent pointers being stored in xattr,
> we will overflow the 16-bit wide xattr extent counter when a large
> number of hardlinks are created. Hence this patchset extends the
> on-disk field to 32-bit length.
> 
> This patchset also includes the previously posted "Fix log reservation
> calculation for xattr insert operation" patch as a bug fix. It
> replaces the xattr set "mount" and "runtime" reservations with just
> one static reservation. Hence we don't need the functionality to
> calculate maximum sized 'xattr set' reservation separately anymore.
> 
> The patches can also be obtained from
> https://github.com/chandanr/linux.git at branch xfs-extend-extent-counters.
> 
> Chandan Babu R (7):
>   xfs: Fix log reservation calculation for xattr insert operation

What happened to that whole patchset with struct xfs_attr_set_resv
and whatnot?  Did all that get condensed down to this single patch?

--D

>   xfs: Check for per-inode extent count overflow
>   xfs: Compute maximum height of directory BMBT separately
>   xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS()
>   xfs: Use 2^27 as the maximum number of directory extents
>   xfs: Extend data extent counter to 47 bits
>   xfs: Extend attr extent counter to 32 bits
> 
>  fs/xfs/libxfs/xfs_attr.c        |  11 +--
>  fs/xfs/libxfs/xfs_bmap.c        | 118 +++++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_bmap.h        |   3 +-
>  fs/xfs/libxfs/xfs_bmap_btree.h  |   4 +-
>  fs/xfs/libxfs/xfs_format.h      |  49 ++++++++++---
>  fs/xfs/libxfs/xfs_inode_buf.c   |  65 ++++++++++++++---
>  fs/xfs/libxfs/xfs_inode_buf.h   |   2 +
>  fs/xfs/libxfs/xfs_inode_fork.c  | 125 ++++++++++++++++++++++++++++++--
>  fs/xfs/libxfs/xfs_inode_fork.h  |   2 +
>  fs/xfs/libxfs/xfs_log_format.h  |   8 +-
>  fs/xfs/libxfs/xfs_log_rlimit.c  |  29 --------
>  fs/xfs/libxfs/xfs_trans_resv.c  |  75 +++++++++----------
>  fs/xfs/libxfs/xfs_trans_resv.h  |   9 +--
>  fs/xfs/libxfs/xfs_trans_space.h |  48 ++++++------
>  fs/xfs/libxfs/xfs_types.h       |  11 ++-
>  fs/xfs/scrub/inode.c            |  14 ++--
>  fs/xfs/xfs_bmap_item.c          |   3 +-
>  fs/xfs/xfs_inode.c              |  10 ++-
>  fs/xfs/xfs_inode_item.c         |  10 ++-
>  fs/xfs/xfs_inode_item_recover.c |  22 +++++-
>  fs/xfs/xfs_mount.c              |   5 +-
>  fs/xfs/xfs_mount.h              |   1 +
>  fs/xfs/xfs_reflink.c            |   4 +-
>  23 files changed, 451 insertions(+), 177 deletions(-)
> 
> -- 
> 2.20.1
> 

* Re: [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS()
  2020-06-06  8:27 ` [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS() Chandan Babu R
@ 2020-06-08 17:50   ` Darrick J. Wong
  2020-06-09 14:23     ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 17:50 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:42PM +0530, Chandan Babu R wrote:
> XFS_BM_MAXLEVELS() returns the maximum possible height of BMBT tree for
> either data or attribute fork. For data forks, this commit adds a new
> argument to XFS_BM_MAXLEVELS() to let the users choose between the
> maximum heights of dir and non-dir BMBTs.
> 
> As of this commit, both dir and non-dir BMBTs have the same maximum
> height. A future commit in this series will use 2^27 extent count as the
> input to compute the maximum height of a directory BMBT which will in
> turn cause the maximum heights of dir and non-dir BMBTs to differ.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_attr.c        |  5 ++--
>  fs/xfs/libxfs/xfs_bmap.c        |  5 ++--
>  fs/xfs/libxfs/xfs_bmap_btree.h  |  4 +++-
>  fs/xfs/libxfs/xfs_trans_resv.c  | 25 +++++++++++---------
>  fs/xfs/libxfs/xfs_trans_resv.h  |  4 ++--
>  fs/xfs/libxfs/xfs_trans_space.h | 41 +++++++++++++++++----------------
>  fs/xfs/xfs_bmap_item.c          |  3 ++-
>  fs/xfs/xfs_reflink.c            |  4 ++--
>  8 files changed, 50 insertions(+), 41 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index a4b23edf887e..357e29a5a167 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -150,7 +150,7 @@ xfs_attr_calc_size(
>  	 * "local" or "remote" (note: local != inline).
>  	 */
>  	size = xfs_attr_leaf_newentsize(args, local);
> -	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
> +	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0);

When would we have a DAENTER space reservation for the data fork on
something that isn't a directory?

Shouldn't you be able to compute the correct 'dbmbt' parameter value
from whichfork?

Can you modify these macros to take the xfs_inode so that we can gate
the logic on i_mode instead of passing magic values 0 and 1 around?
Though... thinking about this more, 1 means "use the slightly smaller
directory bmbt maxlevels", and 0 means "either this is a non directory
or we want worst case calculations", doesn't it...

Zooming out, why do we even care?  While it's true that we might gain
the ability to shave a few blocks off the block reservation when we know
we're dealing with a directory, this adds quite a bit of clutter to get
it.

>  	if (*local) {
>  		if (size > (args->geo->blksize / 2)) {
>  			/* Double split possible */
> @@ -163,7 +163,8 @@ xfs_attr_calc_size(
>  		 */
>  		uint	dblocks = xfs_attr3_rmt_blocks(mp, args->valuelen);
>  		nblks += dblocks;
> -		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks, XFS_ATTR_FORK);
> +		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks,
> +				XFS_ATTR_FORK, 0);
>  	}
>  
>  	return nblks;
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 01e2b543b139..8b0029b3cecf 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -182,13 +182,14 @@ xfs_bmap_worst_indlen(
>  	mp = ip->i_mount;
>  	maxrecs = mp->m_bmap_dmxr[0];
>  	for (level = 0, rval = 0;
> -	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
> +	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0);
>  	     level++) {
>  		len += maxrecs - 1;
>  		do_div(len, maxrecs);
>  		rval += len;
>  		if (len == 1)
> -			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
> +			return rval +
> +				XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) -
>  				level - 1;
>  		if (level == 0)
>  			maxrecs = mp->m_bmap_dmxr[1];
> diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> index 72bf74c79fb9..a047be5883d1 100644
> --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> @@ -79,7 +79,9 @@ struct xfs_trans;
>  /*
>   * Maximum number of bmap btree levels.
>   */
> -#define XFS_BM_MAXLEVELS(mp,w)		((mp)->m_bm_maxlevels[(w)])
> +#define XFS_BM_MAXLEVELS(mp,w,use_dir_bmbt) \
> +	((!(use_dir_bmbt)) ? \
> +		(mp)->m_bm_maxlevels[(w)] : (mp)->m_bm_dir_maxlevel)

Also, if you /are/ going to mess with these macros, can you please turn
them into static inline functions?  Typechecking would be nice.
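
Purely as an illustration of that request, and not code from the series, a static inline equivalent of the three-argument macro might look roughly like the sketch below. The trimmed-down struct, the fork constants, and the _sketch names are hypothetical stand-ins so that the example compiles on its own.

```c
#include <stdbool.h>

/* Hypothetical, trimmed-down mount structure: only the fields the macro reads. */
struct xfs_mount_sketch {
	unsigned int	m_bm_maxlevels[2];	/* indexed by fork: data, attr */
	unsigned int	m_bm_dir_maxlevel;	/* directory data fork */
};

/* Stand-ins for XFS_DATA_FORK and XFS_ATTR_FORK. */
enum { SKETCH_DATA_FORK = 0, SKETCH_ATTR_FORK = 1 };

/* Typed replacement for XFS_BM_MAXLEVELS(mp, w, use_dir_bmbt). */
static inline unsigned int
xfs_bm_maxlevels_sketch(
	const struct xfs_mount_sketch	*mp,
	int				whichfork,
	bool				use_dir_bmbt)
{
	if (use_dir_bmbt)
		return mp->m_bm_dir_maxlevel;
	return mp->m_bm_maxlevels[whichfork];
}
```

The compiler then checks the mount pointer type and the bool argument, which the macro form cannot do.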

--D

>  /*
>   * Prototypes for xfs_bmap.c to call.
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index b44b521c605c..39cfca1b71b6 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -265,14 +265,14 @@ xfs_calc_write_reservation(
>  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
>  
>  	t1 = xfs_calc_inode_res(mp, 1) +
> -	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) +
> +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0), blksz) +
>  	     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
>  
>  	if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
>  		t2 = xfs_calc_inode_res(mp, 1) +
> -		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> -				     blksz) +
> +		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
> +			blksz) +
>  		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
>  		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) +
>  		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz);
> @@ -313,7 +313,8 @@ xfs_calc_itruncate_reservation(
>  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
>  
>  	t1 = xfs_calc_inode_res(mp, 1) +
> -	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) + 1,
> +			     blksz);
>  
>  	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
>  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> @@ -592,7 +593,7 @@ xfs_calc_growrtalloc_reservation(
>  	struct xfs_mount	*mp)
>  {
>  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> +		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
>  				 XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_inode_res(mp, 1) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> @@ -669,7 +670,7 @@ xfs_calc_addafork_reservation(
>  		xfs_calc_inode_res(mp, 1) +
>  		xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
>  		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
> -		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
> +		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0) + 1,
>  				 XFS_FSB_TO_B(mp, 1)) +
>  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
>  				 XFS_FSB_TO_B(mp, 1));
> @@ -691,7 +692,7 @@ xfs_calc_attrinval_reservation(
>  	struct xfs_mount	*mp)
>  {
>  	return max((xfs_calc_inode_res(mp, 1) +
> -		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
> +		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0),
>  				     XFS_FSB_TO_B(mp, 1))),
>  		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
>  		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
> @@ -717,10 +718,11 @@ xfs_calc_attrset_reservation(
>  	int			bmbt_blks;
>  
>  	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
> -	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
> +	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK, 0);
>  
>  	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
> -	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
> +	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks,
> +			XFS_ATTR_FORK, 0);
>  
>  	return XFS_DQUOT_LOGRES(mp) +
>  		xfs_calc_inode_res(mp, 1) +
> @@ -752,8 +754,9 @@ xfs_calc_attrrm_reservation(
>  		     xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH,
>  				      XFS_FSB_TO_B(mp, 1)) +
>  		     (uint)XFS_FSB_TO_B(mp,
> -					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
> -		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
> +				XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0)) +
> +		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
> +				     0)),
>  		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
>  		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
>  				      XFS_FSB_TO_B(mp, 1))));
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index f50996ae18e6..d64989eeebd7 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -61,10 +61,10 @@ struct xfs_trans_resv {
>   */
>  #define	XFS_DIROP_LOG_RES(mp)	\
>  	(XFS_FSB_TO_B(mp, XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK)) + \
> -	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)))
> +	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)))
>  #define	XFS_DIROP_LOG_COUNT(mp)	\
>  	(XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK) + \
> -	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)
> +	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)
>  
>  /*
>   * Various log count values.
> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index b559af70cf51..c51d809a16b1 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -25,15 +25,16 @@
>  
>  #define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)    \
>  		(((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
> -#define	XFS_EXTENTADD_SPACE_RES(mp,w)	(XFS_BM_MAXLEVELS(mp,w) - 1)
> -#define XFS_NEXTENTADD_SPACE_RES(mp,b,w)\
> +#define	XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt)	\
> +	(XFS_BM_MAXLEVELS(mp,w,dbmbt) - 1)
> +#define XFS_NEXTENTADD_SPACE_RES(mp,b,w,dbmbt)		   \
>  	(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
>  	  XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
> -	  XFS_EXTENTADD_SPACE_RES(mp,w))
> +		XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt))
>  
>  /* Blocks we might need to add "b" mappings & rmappings to a file. */
> -#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
> -	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w)) + \
> +#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)	    \
> +	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w), 0) +	\
>  	 XFS_NRMAPADD_SPACE_RES((mp), (b)))
>  
>  #define	XFS_DAENTER_1B(mp,w)	\
> @@ -47,19 +48,19 @@
>  	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
>  #define	XFS_DAENTER_BLOCKS(mp,w)	\
>  	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
> -#define	XFS_DAENTER_BMAP1B(mp,w)	\
> -	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
> -#define	XFS_DAENTER_BMAPS(mp,w)		\
> -	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w))
> -#define	XFS_DAENTER_SPACE_RES(mp,w)	\
> -	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w))
> -#define	XFS_DAREMOVE_SPACE_RES(mp,w)	XFS_DAENTER_BMAPS(mp,w)
> +#define	XFS_DAENTER_BMAP1B(mp,w,dbmbt)	\
> +	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w, dbmbt)
> +#define	XFS_DAENTER_BMAPS(mp,w,dbmbt)	\
> +	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w,dbmbt))
> +#define	XFS_DAENTER_SPACE_RES(mp,w,dbmbt)	\
> +	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w,dbmbt))
> +#define	XFS_DAREMOVE_SPACE_RES(mp,w,dbmbt)	XFS_DAENTER_BMAPS(mp,w,dbmbt)
>  #define	XFS_DIRENTER_MAX_SPLIT(mp,nl)	1
>  #define	XFS_DIRENTER_SPACE_RES(mp,nl)	\
> -	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK) * \
> +	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK, 1) *	\
>  	 XFS_DIRENTER_MAX_SPLIT(mp,nl))
>  #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
> -	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
> +	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK, 1)
>  #define	XFS_IALLOC_SPACE_RES(mp)	\
>  	(M_IGEO(mp)->ialloc_blks + \
>  	 (xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1 * \
> @@ -69,26 +70,26 @@
>   * Space reservation values for various transactions.
>   */
>  #define	XFS_ADDAFORK_SPACE_RES(mp)	\
> -	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK))
> +	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0))
>  #define	XFS_ATTRRM_SPACE_RES(mp)	\
> -	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK)
> +	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK, 0)
>  /* This macro is not used - see inline code in xfs_attr_set */
>  #define	XFS_ATTRSET_SPACE_RES(mp, v)	\
> -	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK) + XFS_B_TO_FSB(mp, v))
> +	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0) + XFS_B_TO_FSB(mp, v))
>  #define	XFS_CREATE_SPACE_RES(mp,nl)	\
>  	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
>  #define	XFS_DIOSTRAT_SPACE_RES(mp, v)	\
> -	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
> +	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + (v))
>  #define	XFS_GROWFS_SPACE_RES(mp)	\
>  	(2 * (mp)->m_ag_maxlevels)
>  #define	XFS_GROWFSRT_SPACE_RES(mp,b)	\
> -	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK))
> +	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0))
>  #define	XFS_LINK_SPACE_RES(mp,nl)	\
>  	XFS_DIRENTER_SPACE_RES(mp,nl)
>  #define	XFS_MKDIR_SPACE_RES(mp,nl)	\
>  	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
>  #define	XFS_QM_DQALLOC_SPACE_RES(mp)	\
> -	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + \
> +	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + \
>  	 XFS_DQUOT_CLUSTER_SIZE_FSB)
>  #define	XFS_QM_QINOCREATE_SPACE_RES(mp)	\
>  	XFS_IALLOC_SPACE_RES(mp)
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 6736c5ab188f..0a8a8377a150 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -482,7 +482,8 @@ xfs_bui_item_recover(
>  	}
>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
> -			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0), 0,
> +			0, &tp);
>  	if (error)
>  		return error;
>  	/*
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index 107bf2a2f344..fd35a0bf2c47 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -614,7 +614,7 @@ xfs_reflink_end_cow_extent(
>  		return 0;
>  	}
>  
> -	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0);
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
>  			XFS_TRANS_RESERVE, &tp);
>  	if (error)
> @@ -1017,7 +1017,7 @@ xfs_reflink_remap_extent(
>  	}
>  
>  	/* Start a rolling transaction to switch the mappings */
> -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK, 0);
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
>  	if (error)
>  		goto out;
> -- 
> 2.20.1
> 

* Re: [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-06  8:27 ` [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately Chandan Babu R
@ 2020-06-08 20:59   ` Darrick J. Wong
  2020-06-09 14:23     ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-08 20:59 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:41PM +0530, Chandan Babu R wrote:
> xfs/306 causes the following call trace when using a data fork with a
> maximum extent count of 2^47,
> 
>  XFS (loop0): Mounting V5 Filesystem
>  XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
>  XFS (loop0): AAIEEE! Log failed size checks. Abort!
>  XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711

Uh... won't applying the corresponding MAXEXTNUM changes and whatnot to
xfsprogs result in mkfs formatting a log with 9075 blocks?  Is there
some other mistake in the minimum log size computations?

>  ------------[ cut here ]------------
>  WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
>  Modules linked in:
>  CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
>  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
>  RIP: 0010:assfail+0x25/0x28
>  Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
>  RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
>  RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
>  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
>  RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
>  R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
>  R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
>  FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
>  Call Trace:
>   xfs_log_mount+0xf8/0x300
>   xfs_mountfs+0x46e/0x950
>   xfs_fc_fill_super+0x318/0x510
>   ? xfs_mount_free+0x30/0x30
>   get_tree_bdev+0x15c/0x250
>   vfs_get_tree+0x25/0xb0
>   do_mount+0x740/0x9b0
>   ? memdup_user+0x41/0x80
>   __x64_sys_mount+0x8e/0xd0
>   do_syscall_64+0x48/0x110
>   entry_SYSCALL_64_after_hwframe+0x44/0xa9
>  RIP: 0033:0x7fd6c5f2ccda
>  Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
>  RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
>  RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
>  RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
>  RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
>  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
>  R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
>  ---[ end trace 6436391b468bc652 ]---
>  XFS (loop0): log mount failed
> 
> The corresponding filesystem was created using mkfs options
> "-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".
> 
> i.e. We have a filesystem of size 20MiB, data block size of 1KiB and
> directory block size of 64KiB. Filesystems of size < 1GiB can have less
> than 10MiB on-disk log (Please refer to calculate_log_size() in
> xfsprogs).

Hm.  You don't seem to be setting either of the big extent count feature
flags here.

Is this something that happens after a filesystem gets *upgraded* to
support extent counts > 2^32?  If it's this second case, then I think
the function that upgrades the filesystem has to reject the change if it
would cause the minimum log size checks to fail.

Granted, I don't understand the need (in the next patch) to special case
bmbt maxlevels for directory data forks.  That's probably muddying up
my ability to figure all this out.  Yes I did read this series
backwards. :)

--D

> The largest reservation space was contributed by the rename
> operation. The corresponding calculation is done inside
> xfs_calc_rename_reservation(). In this case, the value returned by this
> function is,
> 
> xfs_calc_inode_res(mp, 4)
> + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))
> 
> xfs_calc_inode_res(mp, 4) returns a constant value of 3040 bytes
> regardless of the maximum data fork extent count.
> 
> The largest contribution to the rename operation was by "2 *
> XFS_DIROP_LOG_COUNT(mp)" and it is a function of maximum height of a
> directory's BMBT tree.
> 
> XFS_DIROP_LOG_COUNT() is a sum of,
> 
> 1. The maximum number of dabtree blocks that needs to be logged
>    i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) *
>    XFS_DAENTER_DBS(mp,w).  For directories, this evaluates
>    to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64 * (5 + 2)) = 448.
> 
> 2. The corresponding maximum number of BMBT blocks that needs to be
>    logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
>    XFS_DAENTER_BMAP1B(mp,w)
> 
>    XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7
> 
>    XFS_DAENTER_BMAP1B(mp,w)
>    = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
>    = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
>    = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
>    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
> 
>    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() =
>    mp->m_alloc_mxr[0] - mp->m_alloc_mnr[0] = 121 - 60 = 61
> 
>    XFS_DAENTER_BMAP1B(mp,w) =
>    ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
>    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
>    = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
>    = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
>    = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
>    = 2 * (8 - 1)
>    = 14
> 
>    With 2^32 as the maximum extent count the maximum height of the bmap btree
>    was 7. Now with 2^47 maximum extent count, the height has increased to 8.
> 
>    Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.
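
The arithmetic in the two items above can be re-derived with a short standalone sketch. The constants are the ones quoted in the commit message for this particular filesystem geometry; the helper names are hypothetical and only mimic the macros.

```c
/* Constants quoted in the commit message (1k blocks, 64k directory blocks). */
enum {
	DA_NODE_MAXDEPTH	= 5,	/* XFS_DA_NODE_MAXDEPTH */
	DAENTER_1B		= 64,	/* fs blocks per directory block */
	ALLOC_MXR		= 121,	/* mp->m_alloc_mxr[0] */
	ALLOC_MNR		= 60,	/* mp->m_alloc_mnr[0] */
};

/* XFS_DAENTER_DBS() for the data fork. */
static unsigned int daenter_dbs(void)
{
	return DA_NODE_MAXDEPTH + 2;
}

/* XFS_DAENTER_BLOCKS(): dabtree blocks that may need logging. */
static unsigned int daenter_blocks(void)
{
	return DAENTER_1B * daenter_dbs();
}

/* XFS_DAENTER_BMAP1B() for a given maximum BMBT height. */
static unsigned int daenter_bmap1b(unsigned int bm_maxlevels)
{
	unsigned int per_block = ALLOC_MXR - ALLOC_MNR;	/* 61 */

	return ((DAENTER_1B + per_block - 1) / per_block) * (bm_maxlevels - 1);
}

/* XFS_DAENTER_BMAPS(): BMBT blocks that may need logging. */
static unsigned int daenter_bmaps(unsigned int bm_maxlevels)
{
	return daenter_dbs() * daenter_bmap1b(bm_maxlevels);
}
```

daenter_blocks() gives 448, daenter_bmaps(8) gives 98 and daenter_bmaps(7) gives 84, so each XFS_DAENTER_BMAPS() term grows by 14 blocks when the BMBT height goes from 7 to 8.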
> 
> XFS_DIROP_LOG_COUNT() = 448 + 98 + 1 = 547.
> 2 * XFS_DIROP_LOG_COUNT() = 2 * 547 = 1094.
> 
> With 2^32 max extent count, XFS_DIROP_LOG_COUNT() evaluates to
> 448 + 84 + 1 = 533. Hence 2 * XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.
> 
> This small difference of 1094 - 1066 = 28 fs blocks is sufficient to
> trip us over the minimum log size check.
> 
> A future commit in this series will use 2^27 as the maximum directory
> extent count. This will result in a shorter directory BMBT tree.  Log
> reservation calculations that are applicable only to
> directories (e.g. XFS_DIROP_LOG_COUNT()) can then choose this instead of
> non-dir data fork BMBT height.
> 
> This commit introduces a new member in 'struct xfs_mount' to hold the
> maximum BMBT height of a directory. At present, the maximum height of a
> directory BMBT is the same as the maximum height of a non-directory
> BMBT. A future commit will change the parameters used as input for
> computing the maximum height of a directory BMBT.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++++++++---
>  fs/xfs/libxfs/xfs_bmap.h |  3 ++-
>  fs/xfs/xfs_mount.c       |  5 +++--
>  fs/xfs/xfs_mount.h       |  1 +
>  4 files changed, 20 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 798fca5c52af..01e2b543b139 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -50,7 +50,8 @@ kmem_zone_t		*xfs_bmap_free_item_zone;
>  void
>  xfs_bmap_compute_maxlevels(
>  	xfs_mount_t	*mp,		/* file system mount structure */
> -	int		whichfork)	/* data or attr fork */
> +	int		whichfork,	/* data or attr fork */
> +	int		dir_bmbt)	/* Dir or non-dir data fork */
>  {
>  	int		level;		/* btree level */
>  	uint		maxblocks;	/* max blocks at this level */
> @@ -60,6 +61,9 @@ xfs_bmap_compute_maxlevels(
>  	int		minnoderecs;	/* min records in node block */
>  	int		sz;		/* root block size */
>  
> +	if (whichfork == XFS_ATTR_FORK)
> +		ASSERT(dir_bmbt == 0);
> +
>  	/*
>  	 * The maximum number of extents in a file, hence the maximum number of
>  	 * leaf entries, is controlled by the size of the on-disk extent count,
> @@ -75,8 +79,11 @@ xfs_bmap_compute_maxlevels(
>  	 * of a minimum size available.
>  	 */
>  	if (whichfork == XFS_DATA_FORK) {
> -		maxleafents = MAXEXTNUM;
>  		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
> +		if (dir_bmbt)
> +			maxleafents = MAXEXTNUM;
> +		else
> +			maxleafents = MAXEXTNUM;
>  	} else {
>  		maxleafents = MAXAEXTNUM;
>  		sz = XFS_BMDR_SPACE_CALC(MINABTPTRS);
> @@ -91,7 +98,11 @@ xfs_bmap_compute_maxlevels(
>  		else
>  			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
>  	}
> -	mp->m_bm_maxlevels[whichfork] = level;
> +
> +	if (whichfork == XFS_DATA_FORK && dir_bmbt)
> +		mp->m_bm_dir_maxlevel = level;
> +	else
> +		mp->m_bm_maxlevels[whichfork] = level;
>  }
>  
>  STATIC int				/* error */
> diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> index 6028a3c825ba..4250c9ab4b75 100644
> --- a/fs/xfs/libxfs/xfs_bmap.h
> +++ b/fs/xfs/libxfs/xfs_bmap.h
> @@ -187,7 +187,8 @@ void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
>  void	__xfs_bmap_add_free(struct xfs_trans *tp, xfs_fsblock_t bno,
>  		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
>  		bool skip_discard);
> -void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> +void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork,
> +		int dir_bmbt);
>  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
>  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
>  int	xfs_bmap_last_before(struct xfs_trans *tp, struct xfs_inode *ip,
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index bb91f04266b9..d8ebfc67bb63 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -711,8 +711,9 @@ xfs_mountfs(
>  		goto out;
>  
>  	xfs_alloc_compute_maxlevels(mp);
> -	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
> -	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
> +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 0);
> +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 1);
> +	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK, 0);
>  	xfs_ialloc_setup_geometry(mp);
>  	xfs_rmapbt_compute_maxlevels(mp);
>  	xfs_refcountbt_compute_maxlevels(mp);
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index aba5a1579279..9dbf036ddace 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -133,6 +133,7 @@ typedef struct xfs_mount {
>  	uint			m_refc_mnr[2];	/* min refc btree records */
>  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
>  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
> +	uint			m_bm_dir_maxlevel;
>  	uint			m_rmap_maxlevels; /* max rmap btree levels */
>  	uint			m_refc_maxlevels; /* max refcount btree level */
>  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> -- 
> 2.20.1
> 


* Re: [PATCH 0/7] xfs: Extend per-inode extent counters.
  2020-06-08 17:31 ` [PATCH 0/7] xfs: Extend per-inode extent counters Darrick J. Wong
@ 2020-06-09 14:22   ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 11:01:03 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:38PM +0530, Chandan Babu R wrote:
> > The commit xfs: fix inode fork extent count overflow
> > (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
> > per-inode data fork extents should be possible to create. However the
> > corresponding on-disk field has an signed 32-bit type. Hence this
> > patchset extends the on-disk field to 64-bit length out of which only
> > the first 47-bits are valid.
> > 
> > Also, XFS has a per-inode xattr extent counter which is 16 bits
> > wide. A workload which
> > 1. Creates 1 million 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to insert 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent
> > counter.
> > 
> > I have been informed that there are instances where a single file
> > has > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created. Hence this patchset extends the
> > on-disk field to 32-bit length.
> > 
> > This patchset also includes the previously posted "Fix log reservation
> > calculation for xattr insert operation" patch as a bug fix. It
> > replaces the xattr set "mount" and "runtime" reservations with just
> > one static reservation. Hence we don't need the functionality to
> > calculate maximum sized 'xattr set' reservation separately anymore.
> > 
> > The patches can also be obtained from
> > https://github.com/chandanr/linux.git at branch xfs-extend-extent-counters.
> > 
> > Chandan Babu R (7):
> >   xfs: Fix log reservation calculation for xattr insert operation
> 
> What happened to that whole patchset with struct xfs_attr_set_resv
> and whatnot?  Did all that get condensed down to this single patch?

Yes, with the new method, we have just one static log reservation rather than
having "mount" and "runtime" reservations for the xattr set operation. The
single static log reservation takes into account the worst possible case,
i.e.:
- Double split of the Dabtree for large local xattrs.
- Bmbt blocks required for mapping the contents of a maximum sized
  (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.

-- 
chandan





* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-08 16:24   ` Darrick J. Wong
  2020-06-08 16:32     ` Darrick J. Wong
@ 2020-06-09 14:22     ` Chandan Babu R
  2020-06-09 17:10       ` Darrick J. Wong
  1 sibling, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 9:54:25 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > The following error message was noticed when a workload added one
> > million xattrs, deleted 50% of them and then inserted 400,000 new
> > xattrs.
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > 
> > The error message was printed during unmounting the filesystem. The
> > value printed under "total extents" indicates that we overflowed the
> > per-inode signed 16-bit xattr extent counter.
> > 
> > Instead of letting this silent corruption occur, this patch checks for
> > extent counter (both data and xattr) overflow before we assign the
> > new value to the corresponding in-memory extent counter.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> >  3 files changed, 104 insertions(+), 18 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index edc63dba007f..798fca5c52af 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> >  	xfs_iext_first(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &rec, 0);
> >  
> > -	ifp->if_nextents = 1;
> > +	error = xfs_next_set(ip, whichfork, 1);
> > +	if (error)
> > +		goto done;
> 
> Are you sure that if_nextents == 0 is a precondition here?  Technically
> speaking, this turns an assignment into an increment operation.

Hmm. I didn't pay attention to that. I will check and update the code
appropriately. Thanks for pointing this out.

> 
> > +
> >  	ip->i_d.di_nblocks = 1;
> >  	xfs_trans_mod_dquot_byino(tp, ip,
> >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> >  		xfs_iext_prev(ifp, &bma->icur);
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> >  		PREV.br_startblock = new->br_startblock;
> >  		PREV.br_state = new->br_state;
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> >  		 * The left neighbor is not contiguous.
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> >  		 * The right neighbor is not contiguous.
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_next(ifp, &bma->icur);
> >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (bma->cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > -		ifp->if_nextents -= 2;
> > +
> > +		error = xfs_next_set(ip, whichfork, -2);
> > +		if (error)
> > +			goto done;
> > +
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> >  		else {
> > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> > +
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> >  		else {
> > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, &r[1], state);
> >  		xfs_iext_insert(ip, icur, &r[0], state);
> > -		ifp->if_nextents += 2;
> > +
> > +		error = xfs_next_set(ip, whichfork, 2);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL)
> >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &left);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL) {
> >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> >  		 * Insert a new entry.
> >  		 */
> >  		xfs_iext_insert(ip, icur, new, state);
> > -		ifp->if_nextents++;
> > +
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> >  
> >  		if (cur == NULL) {
> >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> >  		 */
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> > -		ifp->if_nextents--;
> > +
> > +		error = xfs_next_set(ip, whichfork, -1);
> > +		if (error)
> > +			goto done;
> >  
> >  		flags |= XFS_ILOG_CORE;
> >  		if (!cur) {
> > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> >  		} else
> >  			flags |= xfs_ilog_fext(whichfork);
> >  
> > -		ifp->if_nextents++;
> > +		error = xfs_next_set(ip, whichfork, 1);
> > +		if (error)
> > +			goto done;
> > +
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, &new, state);
> >  		break;
> > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> >  	 * Update the on-disk extent count, the btree if necessary and log the
> >  	 * inode.
> >  	 */
> > -	ifp->if_nextents--;
> > +	error = xfs_next_set(ip, whichfork, -1);
> > +	if (error)
> > +		goto done;
> > +
> >  	*logflags |= XFS_ILOG_CORE;
> >  	if (!cur) {
> >  		*logflags |= XFS_ILOG_DEXT;
> > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> >  	/* Add new extent */
> >  	xfs_iext_next(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &new, 0);
> > -	ifp->if_nextents++;
> > +
> > +	error = xfs_next_set(ip, whichfork, 1);
> > +	if (error)
> > +		goto del_cursor;
> >  
> >  	if (cur) {
> >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 28b366275ae0..3bf5a2c391bd 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> >  
> >  	return 0;
> >  }
> > +
> > +int
> > +xfs_next_set(
> 
> "next"... please choose an abbreviation that doesn't collide with a
> common English word.
> 
> > +	struct xfs_inode	*ip,
> > +	int			whichfork,
> > +	int			delta)
> 
> Delta?  I thought this was a setter function?
> 
> > +{
> > +	struct xfs_ifork	*ifp;
> > +	int64_t			nr_exts;
> > +	int64_t			max_exts;
> > +
> > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > +
> > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > +		max_exts = MAXEXTNUM;
> > +	else if (whichfork == XFS_ATTR_FORK)
> > +		max_exts = MAXAEXTNUM;
> > +	else
> > +		ASSERT(0);
> > +
> > +	nr_exts = ifp->if_nextents + delta;
> 
> Nope, it's a modify function all right.  Then it should be named:
> 
> xfs_nextents_mod(ip, whichfork, delta)

Ok. I will change this.

> 
> > +	if ((delta > 0 && nr_exts > max_exts)
> > +		|| (delta < 0 && nr_exts < 0))
> 
> Line these up, please.  e.g.,
> 
> 	if ((delta > 0 && nr_exts > max_exts) ||
>             (delta < 0 && nr_exts < 0))

Ok.

> 
> --D
> 
> > +		return -EOVERFLOW;
> > +
> > +	ifp->if_nextents = nr_exts;
> > +
> > +	return 0;
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > index a4953e95c4f3..a84ae42ace79 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> >  
> > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> >  #endif	/* __XFS_INODE_FORK_H__ */
> 


-- 
chandan





* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-08 16:32     ` Darrick J. Wong
@ 2020-06-09 14:22       ` Chandan Babu R
  2020-06-09 17:07         ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 10:02:16 PM IST Darrick J. Wong wrote:
> On Mon, Jun 08, 2020 at 09:24:25AM -0700, Darrick J. Wong wrote:
> > On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > > The following error message was noticed when a workload added one
> > > million xattrs, deleted 50% of them and then inserted 400,000 new
> > > xattrs.
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > 
> > > The error message was printed during unmounting the filesystem. The
> > > value printed under "total extents" indicates that we overflowed the
> > > per-inode signed 16-bit xattr extent counter.
> > > 
> > > Instead of letting this silent corruption occur, this patch checks for
> > > extent counter (both data and xattr) overflow before we assign the
> > > new value to the corresponding in-memory extent counter.
> > > 
> > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> > >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> > >  3 files changed, 104 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index edc63dba007f..798fca5c52af 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> > >  	xfs_iext_first(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &rec, 0);
> > >  
> > > -	ifp->if_nextents = 1;
> > > +	error = xfs_next_set(ip, whichfork, 1);
> > > +	if (error)
> > > +		goto done;
> > 
> > Are you sure that if_nextents == 0 is a precondition here?  Technically
> > speaking, this turns an assignment into an increment operation.
> > 
> > > +
> > >  	ip->i_d.di_nblocks = 1;
> > >  	xfs_trans_mod_dquot_byino(tp, ip,
> > >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> > >  		xfs_iext_prev(ifp, &bma->icur);
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		PREV.br_startblock = new->br_startblock;
> > >  		PREV.br_state = new->br_state;
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		 * The left neighbor is not contiguous.
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		 * The right neighbor is not contiguous.
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_next(ifp, &bma->icur);
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > -		ifp->if_nextents -= 2;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -2);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > >  		else {
> > > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > >  		else {
> > > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, &r[1], state);
> > >  		xfs_iext_insert(ip, icur, &r[0], state);
> > > -		ifp->if_nextents += 2;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 2);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &left);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL) {
> > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> > >  		 * Insert a new entry.
> > >  		 */
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL) {
> > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> > >  		 */
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		flags |= XFS_ILOG_CORE;
> > >  		if (!cur) {
> > > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> > >  		} else
> > >  			flags |= xfs_ilog_fext(whichfork);
> > >  
> > > -		ifp->if_nextents++;
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, &new, state);
> > >  		break;
> > > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> > >  	 * Update the on-disk extent count, the btree if necessary and log the
> > >  	 * inode.
> > >  	 */
> > > -	ifp->if_nextents--;
> > > +	error = xfs_next_set(ip, whichfork, -1);
> > > +	if (error)
> > > +		goto done;
> > > +
> > >  	*logflags |= XFS_ILOG_CORE;
> > >  	if (!cur) {
> > >  		*logflags |= XFS_ILOG_DEXT;
> > > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> > >  	/* Add new extent */
> > >  	xfs_iext_next(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &new, 0);
> > > -	ifp->if_nextents++;
> > > +
> > > +	error = xfs_next_set(ip, whichfork, 1);
> > > +	if (error)
> > > +		goto del_cursor;
> > >  
> > >  	if (cur) {
> > >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > > index 28b366275ae0..3bf5a2c391bd 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> > >  
> > >  	return 0;
> > >  }
> > > +
> > > +int
> > > +xfs_next_set(
> > 
> > "next"... please choose an abbreviation that doesn't collide with a
> > common English word.
> > 
> > > +	struct xfs_inode	*ip,
> > > +	int			whichfork,
> > > +	int			delta)
> > 
> > Delta?  I thought this was a setter function?
> > 
> > > +{
> > > +	struct xfs_ifork	*ifp;
> > > +	int64_t			nr_exts;
> > > +	int64_t			max_exts;
> > > +
> > > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > > +
> > > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > > +		max_exts = MAXEXTNUM;
> > > +	else if (whichfork == XFS_ATTR_FORK)
> > > +		max_exts = MAXAEXTNUM;
> > > +	else
> > > +		ASSERT(0);
> > > +
> > > +	nr_exts = ifp->if_nextents + delta;
> > 
> > Nope, it's a modify function all right.  Then it should be named:
> > 
> > xfs_nextents_mod(ip, whichfork, delta)
> > 
> > > +	if ((delta > 0 && nr_exts > max_exts)
> > > +		|| (delta < 0 && nr_exts < 0))
> > 
> > Line these up, please.  e.g.,
> > 
> > 	if ((delta > 0 && nr_exts > max_exts) ||
> >             (delta < 0 && nr_exts < 0))
> > 
> > --D
> > 
> > > +		return -EOVERFLOW;
> 
> Oh, also, shouldn't this be EFBIG ("File too big")?

True, EFBIG is more appropriate than EOVERFLOW in this case.

Darrick, I have one question. The purpose of this patch is to fix the zero day
bug where we silently overflow the extent counter and get to know about it only
when flushing the incore inode to disk. Patches that come later in the series
modify the extent count limits to 2^32 (for xattr fork) and 2^47 (for data
fork). If this patch does not need to be sent to the stable releases, I will
drop it from the series. Also, I can't have a "Fixes" tag because this is a
zero day bug.
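
For reference, with the review comments folded in (modify-style name, aligned
condition, EFBIG instead of EOVERFLOW), the check could look roughly like the
userspace model below. This is only a sketch: struct model_ifork, enum
model_fork and the MODEL_* limits are stand-ins invented for illustration (a
signed 32-bit limit for the data/CoW forks and a signed 16-bit limit for the
attr fork), not the kernel definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-ins for the current on-disk counter limits. */
#define MODEL_MAXEXTNUM		((int64_t)INT32_MAX)	/* data/CoW fork */
#define MODEL_MAXAEXTNUM	((int64_t)INT16_MAX)	/* attr fork */

enum model_fork { MODEL_DATA_FORK, MODEL_ATTR_FORK, MODEL_COW_FORK };

/* Hypothetical reduced in-core fork: only the extent counter is modelled. */
struct model_ifork {
	int64_t		if_nextents;
};

/*
 * Apply delta to the in-core extent count. The new value is computed in a
 * 64-bit temporary and range checked before it is stored, so the counter is
 * left untouched when the change would overflow or underflow it.
 */
static int
model_nextents_mod(
	struct model_ifork	*ifp,
	enum model_fork		whichfork,
	int			delta)
{
	int64_t			max_exts;
	int64_t			nr_exts;

	if (whichfork == MODEL_ATTR_FORK)
		max_exts = MODEL_MAXAEXTNUM;
	else
		max_exts = MODEL_MAXEXTNUM;

	nr_exts = ifp->if_nextents + delta;
	if ((delta > 0 && nr_exts > max_exts) ||
	    (delta < 0 && nr_exts < 0))
		return -EFBIG;

	ifp->if_nextents = nr_exts;
	return 0;
}
```

With this model, adding one extent to an attr fork that already holds 32767
extents returns -EFBIG and leaves the counter at 32767, instead of wrapping to
a negative value that is only noticed at inode flush time.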

> 
> --D
> 
> > > +
> > > +	ifp->if_nextents = nr_exts;
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > > index a4953e95c4f3..a84ae42ace79 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> > >  
> > > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > >  #endif	/* __XFS_INODE_FORK_H__ */
> 


-- 
chandan





* Re: [PATCH 7/7] xfs: Extend attr extent counter to 32 bits
  2020-06-08 17:21   ` Darrick J. Wong
@ 2020-06-09 14:22     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:22 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 10:51:21 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:45PM +0530, Chandan Babu R wrote:
> > This commit extends the per-inode attr extent counter to 32 bits.
> > 
> > The following changes are made to accomplish this,
> > 1. A new ro-compat superblock flag to prevent older kernels from
> >    mounting the filesystem in read-write mode. This flag is set for the
> >    first time when an inode would end up having more than 2^15 extents.
> > 3. Carve out a new 16-bit field from xfs_dinode->di_pad2[]. This field
> >    holds the most significant 16 bits of the attr extent counter.
> 
> How difficult is it to end up with an attr fork mapping more than 2^32
> blocks?  Supposing I have a file with nlinks==2^32-1, each mapped to a
> 255-byte name and some number of other xattrs?

- 2^32 nlinks each having 255 byte sized name.
  - Size of one xattr
    - name + value = 16 + 255 = 271
      16 comes from the size of the following structure,
      #+BEGIN_SRC fundamental
        struct xfs_parent_name_rec {
                __be64  p_ino;
                __be32  p_gen;
                __be32  p_diroffset;
        };
      #+END_SRC
  - sizeof(xfs_attr_leaf_hdr_t)
    32
  - sizeof(xfs_attr_leaf_entry_t)
    8
  - Number of entries in a 1k leaf block
    (1024 - sizeof(xfs_attr_leaf_hdr_t)) / (8 + 271)
    = (1024 - 32) / 279
    = 992 / 279
    = floor(3.55)
    = 3
  - Nr leaves = (2^32 / 3) * 3 (magicpct) = 4.3 billion
  - Nr entries per node = (1024 - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct xfs_da_node_entry)
    = (1024 - 64) / 8
    = 120 entries
  - Nr entries at level (n - 1) = 4.3 billion / 120 = 36 million
  - Nr entries at level (n - 2) = 36 million / 120 = 300k
  - Nr entries at level (n - 3) = 300k / 120 = 2.5k
  - Nr entries at level (n - 4) = 2.5k / 120 = 20
  - Nr entries at level (n - 5) = 20 / 120 = 1
  Hence, with a 1024-byte block size, the maximum height allowed for a
  dabtree (i.e. XFS_DA_NODE_MAXDEPTH) would act as the limit.

  With 4k block size,
  - Number of entries in a 4k leaf block
    (4096 - sizeof(xfs_attr_leaf_hdr_t)) / (8 + 271)
    = (4096 - 32) / 279
    = 4064 / 279
    = floor(14.56)
    = 14
  - Nr leaves = (2^32 / 14) * 3 (magicpct) = 920 million
  - Nr entries per node = (4096 - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct xfs_da_node_entry)
    = (4096 - 64) / 8
    = 504 entries
  - Nr entries at level (n - 1) = 920 million / 504 = 1.8 million
  - Nr entries at level (n - 2) = 1.8 million / 504 = 3.6k
  - Nr entries at level (n - 3) = 3.6k / 504 = 7
  - Nr entries at level (n - 4) = 7 / 504 = 1

  Total number of extents = 920 million + 1.8 million
  = 922 million
  < 2^32 (4.3 billion).

So we still have ample space in the 32-bit counter. 
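
The estimate above can be re-run with a small userspace sketch. It
hard-codes the structure sizes quoted above (32, 8, 64 and 8 bytes) rather
than taking them from the kernel headers, treats the factor of 3 as a plain
multiplier, and rounds up at each node level. The function names are made up
for illustration and do not exist in XFS.

```c
#include <stdint.h>

/* Sizes as quoted in the estimate above; not read from kernel headers. */
#define LEAF_HDR_SIZE	32	/* sizeof(xfs_attr_leaf_hdr_t) */
#define LEAF_ENTRY_SIZE	8	/* sizeof(xfs_attr_leaf_entry_t) */
#define XATTR_SIZE	271	/* 16-byte parent rec + 255-byte name */
#define NODE_HDR_SIZE	64	/* sizeof(struct xfs_da3_node_hdr) */
#define NODE_ENTRY_SIZE	8	/* sizeof(struct xfs_da_node_entry) */
#define SPARSE_FACTOR	3	/* the "magicpct" multiplier used above */

/* Number of xattr entries that fit in one leaf block. */
static uint64_t leaf_entries(uint64_t blocksize)
{
	return (blocksize - LEAF_HDR_SIZE) / (LEAF_ENTRY_SIZE + XATTR_SIZE);
}

/* Number of child pointers that fit in one node block. */
static uint64_t node_entries(uint64_t blocksize)
{
	return (blocksize - NODE_HDR_SIZE) / NODE_ENTRY_SIZE;
}

/* Estimated number of leaf blocks needed to hold nr_xattrs entries. */
static uint64_t nr_leaves(uint64_t nr_xattrs, uint64_t blocksize)
{
	return nr_xattrs / leaf_entries(blocksize) * SPARSE_FACTOR;
}

/* Number of node levels above the leaves, rounding up at each level. */
static int node_levels(uint64_t leaves, uint64_t blocksize)
{
	uint64_t fanout = node_entries(blocksize);
	int levels = 0;

	while (leaves > 1) {
		leaves = (leaves + fanout - 1) / fanout;
		levels++;
	}
	return levels;
}
```

With 2^32 xattrs, nr_leaves() gives roughly 4.3 billion leaves and 5 node
levels for a 1k block size, and roughly 920 million leaves and 4 node levels
for a 4k block size, in line with the figures above.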

> 
> > 2. A new inode->di_flags2 flag to indicate that the newly added field
> >    contains valid data. This flag is set when one of the following two
> >    conditions are met,
> >    - When the inode is about to have more than 2^15 extents.
> >    - When flushing the incore inode (See xfs_iflush_int()), if
> >      the superblock ro-compat flag is already set.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h      | 25 ++++++++++---
> >  fs/xfs/libxfs/xfs_inode_buf.c   | 23 +++++++++---
> >  fs/xfs/libxfs/xfs_inode_fork.c  | 62 ++++++++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_log_format.h  |  5 +--
> >  fs/xfs/libxfs/xfs_types.h       |  5 +--
> >  fs/xfs/scrub/inode.c            |  5 +--
> >  fs/xfs/xfs_inode.c              |  4 +++
> >  fs/xfs/xfs_inode_item.c         |  5 ++-
> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++-
> >  9 files changed, 113 insertions(+), 29 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 91bee33aa988..2e37d887fd35 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -450,11 +450,13 @@ xfs_sb_has_compat_feature(
> >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> >  #define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
> > +#define XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR (1 << 4)	/* 32bit attr extents */
> 
> Can we bundle both of these changes in a single feature flag?  I would
> like to keep our feature testing matrix as small as we can.
> 
> /* 64-bit data fork extent counts and 32-bit attr fork extent counts */
> #define XFS_SB_FEAT_RO_COMPAT_BIG_FORK	(1 << 4)

Sure, this should be easy to implement.

> 
> >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> >  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> >  		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> > -		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
> > +		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR | \
> > +		 XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR)
> >  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> >  static inline bool
> >  xfs_sb_has_ro_compat_feature(
> > @@ -577,6 +579,18 @@ static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
> >  	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
> >  }
> >  
> > +static inline bool xfs_sb_version_has32bitaext(struct xfs_sb *sbp)
> > +{
> > +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> > +		(sbp->sb_features_ro_compat &
> > +			XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR);
> > +}
> > +
> > +static inline void xfs_sb_version_add32bitaext(struct xfs_sb *sbp)
> > +{
> > +	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_32BIT_AEXT_CNTR;
> > +}
> > +
> >  /*
> >   * end of superblock version macros
> >   */
> > @@ -888,7 +902,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents_lo;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -906,7 +920,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> >  	__be32		di_nextents_hi;
> > -	__u8		di_pad2[8];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[6];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -1073,14 +1088,16 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
> >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> >  #define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
> > +#define XFS_DIFLAG2_32BIT_ANEXTENTS_BIT 4 /* Uses di_anextents_hi field  */
> 
> Same thing here.

Ok.

> 
> >  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
> >  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
> >  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> >  #define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
> > +#define XFS_DIFLAG2_32BIT_ANEXTENTS (1 << XFS_DIFLAG2_32BIT_ANEXTENTS_BIT)
> >  
> >  #define XFS_DIFLAG2_ANY \
> >  	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > -	 XFS_DIFLAG2_47BIT_NEXTENTS)
> > +	 XFS_DIFLAG2_47BIT_NEXTENTS | XFS_DIFLAG2_32BIT_ANEXTENTS)
> >  
> >  /*
> >   * Inode number format:
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 8b89fe080f70..285cbce0cd10 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -309,7 +309,8 @@ xfs_inode_to_disk(
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
> >  					0xffffffffU);
> > -	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
> > +	to->di_anextents_lo = cpu_to_be16(xfs_ifork_nextents(ip->i_afp) &
> > +					0xffffU);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -327,6 +328,10 @@ xfs_inode_to_disk(
> >  			to->di_nextents_hi
> >  				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
> >  					>> 32);
> > +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +			to->di_anextents_hi
> > +				= cpu_to_be16(xfs_ifork_nextents(ip->i_afp)
> > +					>> 16);
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -366,7 +371,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -383,6 +388,9 @@ xfs_log_dinode_to_disk(
> >  		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> >  			to->di_nextents_hi =
> >  				cpu_to_be32(from->di_nextents_hi);
> > +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +			to->di_anextents_hi =
> > +				cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -566,7 +574,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > @@ -745,8 +753,13 @@ xfs_dfork_nextents(
> >  			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
> >  			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
> >  				<< 32;
> > -		return nextents;
> >  	} else {
> > -		return be16_to_cpu(dip->di_anextents);
> > +		nextents = be16_to_cpu(dip->di_anextents_lo);
> > +		if (xfs_sb_version_has_v3inode(sbp)
> > +			&& (dip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS))
> > +			nextents |= (u32)(be16_to_cpu(dip->di_anextents_hi))
> 
> <same if test logic vs. if body statement indentation complaint>

Ok. I will fix this up.
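
For reference, the split and recombination that this hunk performs can be
modelled in userspace as below. This is only a sketch: it uses host-endian
uint16_t fields instead of __be16, the struct and function names are
hypothetical, and hi_valid stands in for the combined v3-inode and
di_flags2 test.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk fields (host endian here). */
struct aext_count {
	uint16_t lo;	/* models di_anextents_lo */
	uint16_t hi;	/* models di_anextents_hi */
};

/* Split a 32-bit attr extent count into its low and high 16 bits. */
static struct aext_count aext_split(uint32_t nextents)
{
	struct aext_count c = {
		.lo = (uint16_t)(nextents & 0xffffU),
		.hi = (uint16_t)(nextents >> 16),
	};
	return c;
}

/* Rebuild the count; the high half is only trusted when hi_valid is set. */
static uint32_t aext_join(struct aext_count c, int hi_valid)
{
	uint32_t nextents = c.lo;

	if (hi_valid)
		nextents |= (uint32_t)c.hi << 16;
	return nextents;
}
```

Joining with hi_valid == 0 yields only the low 16 bits, which is what a
reader sees when the inode flag is not set.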

> 
> > +				<< 16;
> >  	}
> > +
> > +	return nextents;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index ec682e2d5bcb..169e16947ece 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -301,7 +301,10 @@ xfs_iformat_attr_fork(
> >  	ip->i_afp->if_format = dip->di_aformat;
> >  	if (unlikely(ip->i_afp->if_format == 0)) /* pre IRIX 6.2 file system */
> >  		ip->i_afp->if_format = XFS_DINODE_FMT_EXTENTS;
> > -	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents);
> > +	ip->i_afp->if_nextents = be16_to_cpu(dip->di_anextents_lo);
> > +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +		ip->i_afp->if_nextents |=
> > +			(u32)(be16_to_cpu(dip->di_anextents_hi)) << 16;
> >  
> >  	switch (ip->i_afp->if_format) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -777,6 +780,48 @@ xfs_next_set_data(
> >  	return 0;
> >  }
> >  
> > +static int
> > +xfs_next_set_attr(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_inode	*ip,
> > +	struct xfs_ifork	*ifp,
> > +	int			delta)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_aextnum_t		nr_exts;
> > +
> > +	nr_exts = ifp->if_nextents + delta;
> > +
> > +	if ((delta > 0 && nr_exts < ifp->if_nextents) ||
> > +		(delta < 0 && nr_exts > ifp->if_nextents))
> > +		return -EOVERFLOW;
> > +
> > +	if (ifp->if_nextents <= MAXAEXTNUM15BIT &&
> > +		nr_exts > MAXAEXTNUM15BIT &&
> > +		!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS) &&
> > +		xfs_sb_version_has_v3inode(&mp->m_sb)) {
> > +		if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {
> 
> Indentation complaint^2

Ok. I will fix this up.

> 
> > +			bool log_sb = false;
> > +
> > +			spin_lock(&mp->m_sb_lock);
> > +			if (!xfs_sb_version_has32bitaext(&mp->m_sb)) {
> > +				xfs_sb_version_add32bitaext(&mp->m_sb);
> > +				log_sb = true;
> > +			}
> > +			spin_unlock(&mp->m_sb_lock);
> > +
> > +			if (log_sb)
> > +				xfs_log_sb(tp);
> > +		}
> > +
> > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
> > +	}
> > +
> > +	ifp->if_nextents = nr_exts;
> > +
> > +	return 0;
> > +}
> > +
> >  int
> >  xfs_next_set(
> >  	struct xfs_trans	*tp,
> > @@ -785,23 +830,16 @@ xfs_next_set(
> >  	int			delta)
> >  {
> >  	struct xfs_ifork	*ifp;
> > -	int64_t			nr_exts;
> >  	int			error = 0;
> >  
> >  	ifp = XFS_IFORK_PTR(ip, whichfork);
> >  
> > -	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
> > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> >  		error = xfs_next_set_data(tp, ip, ifp, delta);
> > -	} else if (whichfork == XFS_ATTR_FORK) {
> > -		nr_exts = ifp->if_nextents + delta;
> > -		if ((delta > 0 && nr_exts > MAXAEXTNUM)
> > -			|| (delta < 0 && nr_exts < 0))
> > -			return -EOVERFLOW;
> > -
> > -		ifp->if_nextents = nr_exts;
> > -	} else {
> > +	else if (whichfork == XFS_ATTR_FORK)
> > +		error = xfs_next_set_attr(tp, ip, ifp, delta);
> > +	else
> >  		ASSERT(0);
> > -	}
> >  
> >  	return error;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index 879aadff7692..db419fc862bc 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	uint32_t	di_nextents_lo;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -415,7 +415,8 @@ struct xfs_log_dinode {
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> >  	uint32_t	di_nextents_hi;
> > -	uint8_t		di_pad2[8];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> > +	uint8_t		di_pad2[6];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index c68ff2178976..974737a9e9c1 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -62,7 +62,8 @@ typedef void *		xfs_failaddr_t;
> >  #define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
> >  #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM15BIT	((xfs_aextnum_t)0x7fff)		/* 15 bits */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0xffffffff)	/* 32 bits */
> >  
> >  /*
> >   * Minimum and maximum blocksize and sectorsize.
> > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> > index be41fd242ff2..01e60c78a3a3 100644
> > --- a/fs/xfs/scrub/inode.c
> > +++ b/fs/xfs/scrub/inode.c
> > @@ -371,10 +371,12 @@ xchk_dinode(
> >  		break;
> >  	}
> >  
> > +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +
> >  	/* di_forkoff */
> >  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
> >  		xchk_ino_set_corrupt(sc, ino);
> > -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> > +	if (nextents != 0 && dip->di_forkoff == 0)
> >  		xchk_ino_set_corrupt(sc, ino);
> >  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
> >  		xchk_ino_set_corrupt(sc, ino);
> > @@ -386,7 +388,6 @@ xchk_dinode(
> >  		xchk_ino_set_corrupt(sc, ino);
> >  
> >  	/* di_anextents */
> > -	nextents = be16_to_cpu(dip->di_anextents);
> >  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> >  	switch (dip->di_aformat) {
> >  	case XFS_DINODE_FMT_EXTENTS:
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 4418a66cf6d6..6ec34e069344 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3789,6 +3789,10 @@ xfs_iflush_int(
> >  		&& xfs_sb_version_has47bitext(&mp->m_sb))
> >  		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> >  
> > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +		&& xfs_sb_version_has32bitaext(&mp->m_sb))
> > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_32BIT_ANEXTENTS;
> > +
> >  	/*
> >  	 * Copy the dirty parts of the inode into the on-disk inode.  We always
> >  	 * copy out the core of the inode, because if the inode is dirty at all
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 6f27ac7c8631..40f0a19d1c07 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> >  	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
> > -	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
> > +	to->di_anextents_lo = xfs_ifork_nextents(ip->i_afp) & 0xffffU;
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> >  	to->di_dmevmask = from->di_dmevmask;
> > @@ -347,6 +347,9 @@ xfs_inode_to_log_dinode(
> >  		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> >  			to->di_nextents_hi =
> >  				xfs_ifork_nextents(&ip->i_df) >> 32;
> > +		if (from->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +			to->di_anextents_hi =
> > +				xfs_ifork_nextents(ip->i_afp) >> 16;
> >  		to->di_ino = ip->i_ino;
> >  		to->di_lsn = lsn;
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > index 8d64b861fb66..c8b5fbba848b 100644
> > --- a/fs/xfs/xfs_inode_item_recover.c
> > +++ b/fs/xfs/xfs_inode_item_recover.c
> > @@ -135,6 +135,7 @@ xlog_recover_inode_commit_pass2(
> >  	uint				isize;
> >  	int				need_free = 0;
> >  	xfs_extnum_t			nextents;
> > +	xfs_aextnum_t			anextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -262,7 +263,12 @@ xlog_recover_inode_commit_pass2(
> >  		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> >  		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
> >  
> > -	nextents += ldip->di_anextents;
> > +	anextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
> > +		ldip->di_flags2 & XFS_DIFLAG2_32BIT_ANEXTENTS)
> > +		anextents |= ((u32)(ldip->di_anextents_hi) << 16);
> > +
> > +	nextents += anextents;
> >  
> >  	if (unlikely(nextents > ldip->di_nblocks)) {
> >  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> 

-- 
chandan

* Re: [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-08 17:14   ` Darrick J. Wong
@ 2020-06-09 14:23     ` Chandan Babu R
  2020-08-31 21:05       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 10:44:10 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:44PM +0530, Chandan Babu R wrote:
> > This commit extends the per-inode data extent counter to 47 bits. The
> > length of 47-bits was chosen because,
> > Maximum file size = 2^63.
> > Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.
> > 
> > The following changes are made to accomplish this,
> > 1. A new ro-compat superblock flag to prevent older kernels from
> >    mounting the filesystem in read-write mode. This flag is set for the
> >    first time when an inode would end up having more than 2^31 extents.
> > 3. Carve out a new 32-bit field from xfs_dinode->di_pad2[]. This field
> >    holds the most significant 15 bits of the data extent counter.
> 
> On a 1k block V5 fs, the maximum extent count is 2^(63-10) = 2^53.
> 
> If you're going to allocate 32 bits of space from di_pad2 to expand the
> data fork's nextents, let's use the entire bitspace.

But wouldn't 2^53 extents exceed the maximum number of extents possible on a
filesystem with a 64k block size?
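
Both figures come from the same formula, so a rough sketch may help compare
them. It assumes the worst case of one extent per filesystem block and the
2^63 maximum file size from the commit message; the helper names are
illustrative only.

```c
#include <stdint.h>

/* Maximum file size of 2^63 bytes, as stated in the commit message. */
#define MAX_FILE_SIZE_LOG2	63

/* log2 of the worst-case extent count: one extent per block. */
static unsigned int max_extents_log2(unsigned int blocksize_log2)
{
	return MAX_FILE_SIZE_LOG2 - blocksize_log2;
}

/*
 * Returns 1 if a counter with counter_bits bits can represent every
 * extent count below 2^max_extents_log2(blocksize_log2), else 0.
 */
static int counter_holds_max(unsigned int blocksize_log2,
			     unsigned int counter_bits)
{
	return counter_bits >= max_extents_log2(blocksize_log2);
}
```

max_extents_log2(16) is 47 for a 64k block size and max_extents_log2(10) is
53 for a 1k block size, so a 47-bit counter covers the former but not the
latter.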

> 
> > 2. A new inode->di_flags2 flag to indicate that the newly added field
> >    contains valid data. This flag is set when one of the following two
> >    conditions are met,
> >    - When the inode is about to have more than 2^31 extents.
> >    - When flushing the incore inode (See xfs_iflush_int()), if
> >      the superblock ro-compat flag is already set.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c        | 40 ++++++++--------
> >  fs/xfs/libxfs/xfs_format.h      | 30 ++++++++----
> >  fs/xfs/libxfs/xfs_inode_buf.c   | 46 +++++++++++++++---
> >  fs/xfs/libxfs/xfs_inode_buf.h   |  2 +
> >  fs/xfs/libxfs/xfs_inode_fork.c  | 84 ++++++++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_fork.h  |  3 +-
> >  fs/xfs/libxfs/xfs_log_format.h  |  5 +-
> >  fs/xfs/libxfs/xfs_types.h       |  5 +-
> >  fs/xfs/scrub/inode.c            |  9 ++--
> >  fs/xfs/xfs_inode.c              |  6 ++-
> >  fs/xfs/xfs_inode_item.c         |  5 +-
> >  fs/xfs/xfs_inode_item_recover.c | 16 +++++--
> >  12 files changed, 184 insertions(+), 67 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index f75b70ae7b1f..73e552678adc 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -53,9 +53,9 @@ xfs_bmap_compute_maxlevels(
> >  	int		whichfork,	/* data or attr fork */
> >  	int		dir_bmbt)	/* Dir or non-dir data fork */
> >  {
> > +	uint64_t	maxleafents;	/* max leaf entries possible */
> >  	int		level;		/* btree level */
> >  	uint		maxblocks;	/* max blocks at this level */
> > -	uint		maxleafents;	/* max leaf entries possible */
> >  	int		maxrootrecs;	/* max records in root block */
> >  	int		minleafrecs;	/* min records in leaf block */
> >  	int		minnoderecs;	/* min records in node block */
> > @@ -477,7 +477,7 @@ xfs_bmap_check_leaf_extents(
> >  	if (bp_release)
> >  		xfs_trans_brelse(NULL, bp);
> >  error_norelse:
> > -	xfs_warn(mp, "%s: BAD after btree leaves for %d extents",
> > +	xfs_warn(mp, "%s: BAD after btree leaves for %llu extents",
> >  		__func__, i);
> >  	xfs_err(mp, "%s: CORRUPTED BTREE OR SOMETHING", __func__);
> >  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > @@ -918,7 +918,7 @@ xfs_bmap_local_to_extents(
> >  	xfs_iext_first(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &rec, 0);
> >  
> > -	error = xfs_next_set(ip, whichfork, 1);
> > +	error = xfs_next_set(tp, ip, whichfork, 1);
> >  	if (error)
> >  		goto done;
> >  
> > @@ -1610,7 +1610,7 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_prev(ifp, &bma->icur);
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> >  
> > -		error = xfs_next_set(bma->ip, whichfork, -1);
> > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, -1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -1717,7 +1717,7 @@ xfs_bmap_add_extent_delay_real(
> >  		PREV.br_state = new->br_state;
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> >  
> > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -1786,7 +1786,7 @@ xfs_bmap_add_extent_delay_real(
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> >  
> > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -1876,7 +1876,7 @@ xfs_bmap_add_extent_delay_real(
> >  		 */
> >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> >  
> > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -1965,7 +1965,7 @@ xfs_bmap_add_extent_delay_real(
> >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> >  
> > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2172,7 +2172,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> >  
> > -		error = xfs_next_set(ip, whichfork, -2);
> > +		error = xfs_next_set(tp, ip, whichfork, -2);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2228,7 +2228,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> >  
> > -		error = xfs_next_set(ip, whichfork, -1);
> > +		error = xfs_next_set(tp, ip, whichfork, -1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2274,7 +2274,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  
> > -		error = xfs_next_set(ip, whichfork, -1);
> > +		error = xfs_next_set(tp, ip, whichfork, -1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2385,7 +2385,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> >  		xfs_iext_insert(ip, icur, new, state);
> >  
> > -		error = xfs_next_set(ip, whichfork, 1);
> > +		error = xfs_next_set(tp, ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2464,7 +2464,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_next(ifp, icur);
> >  		xfs_iext_insert(ip, icur, new, state);
> >  
> > -		error = xfs_next_set(ip, whichfork, 1);
> > +		error = xfs_next_set(tp, ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2519,7 +2519,7 @@ xfs_bmap_add_extent_unwritten_real(
> >  		xfs_iext_insert(ip, icur, &r[1], state);
> >  		xfs_iext_insert(ip, icur, &r[0], state);
> >  
> > -		error = xfs_next_set(ip, whichfork, 2);
> > +		error = xfs_next_set(tp, ip, whichfork, 2);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2838,7 +2838,7 @@ xfs_bmap_add_extent_hole_real(
> >  		xfs_iext_prev(ifp, icur);
> >  		xfs_iext_update_extent(ip, state, icur, &left);
> >  
> > -		error = xfs_next_set(ip, whichfork, -1);
> > +		error = xfs_next_set(tp, ip, whichfork, -1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -2940,7 +2940,7 @@ xfs_bmap_add_extent_hole_real(
> >  		 */
> >  		xfs_iext_insert(ip, icur, new, state);
> >  
> > -		error = xfs_next_set(ip, whichfork, 1);
> > +		error = xfs_next_set(tp, ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -5140,7 +5140,7 @@ xfs_bmap_del_extent_real(
> >  		xfs_iext_remove(ip, icur, state);
> >  		xfs_iext_prev(ifp, icur);
> >  
> > -		error = xfs_next_set(ip, whichfork, -1);
> > +		error = xfs_next_set(tp, ip, whichfork, -1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -5252,7 +5252,7 @@ xfs_bmap_del_extent_real(
> >  		} else
> >  			flags |= xfs_ilog_fext(whichfork);
> >  
> > -		error = xfs_next_set(ip, whichfork, 1);
> > +		error = xfs_next_set(tp, ip, whichfork, 1);
> >  		if (error)
> >  			goto done;
> >  
> > @@ -5722,7 +5722,7 @@ xfs_bmse_merge(
> >  	 * Update the on-disk extent count, the btree if necessary and log the
> >  	 * inode.
> >  	 */
> > -	error = xfs_next_set(ip, whichfork, -1);
> > +	error = xfs_next_set(tp, ip, whichfork, -1);
> >  	if (error)
> >  		goto done;
> >  
> > @@ -6113,7 +6113,7 @@ xfs_bmap_split_extent(
> >  	xfs_iext_next(ifp, &icur);
> >  	xfs_iext_insert(ip, &icur, &new, 0);
> >  
> > -	error = xfs_next_set(ip, whichfork, 1);
> > +	error = xfs_next_set(tp, ip, whichfork, 1);
> >  	if (error)
> >  		goto del_cursor;
> >  
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index b42a52bfa1e9..91bee33aa988 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -449,10 +449,12 @@ xfs_sb_has_compat_feature(
> >  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> >  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> > +#define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
> 
> I wonder if we could come up with a better name for this...
> 
> DFORK_EXTENTHI
> 
> Hmm...
> 
> BIG_DFORK
> 
> Hmmm...
> 
> ULTRAFRAG
> 
> There we go.  "XFS with UltraFrag, part of this complete g@m3r t00lk1t." ;)
> 
> ...
> 
> (What do you think of the second suggestion?)

I like the name DFORK_EXTENTHI since it signifies that we are now using the
"_HI" field of the extent counter and it can also be used to convey the same
for the attr extent counter as well. Thanks for the suggestions.

> 
> >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> >  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > -		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
> > +		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> > +		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
> >  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> >  static inline bool
> >  xfs_sb_has_ro_compat_feature(
> > @@ -563,6 +565,18 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
> >  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
> >  }
> >  
> > +static inline bool xfs_sb_version_has47bitext(struct xfs_sb *sbp)
> > +{
> > +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> > +		(sbp->sb_features_ro_compat &
> > +			XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR);
> > +}
> > +
> > +static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
> > +{
> > +	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
> > +}
> > +
> >  /*
> >   * end of superblock version macros
> >   */
> > @@ -873,7 +887,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_size;	/* number of bytes in file */
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > -	__be32		di_nextents;	/* number of extents in data fork */
> > +	__be32		di_nextents_lo;	/* number of extents in data fork */
> >  	__be16		di_anextents;	/* number of extents in attribute fork*/
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> > @@ -891,7 +905,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be32		di_nextents_hi;
> > +	__u8		di_pad2[8];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -992,10 +1007,6 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > @@ -1061,12 +1072,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> >  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
> >  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
> >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> > +#define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
> >  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
> >  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
> >  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> > +#define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
> >  
> >  #define XFS_DIFLAG2_ANY \
> > -	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
> > +	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > +	 XFS_DIFLAG2_47BIT_NEXTENTS)
> >  
> >  /*
> >   * Inode number format:
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 6f84ea85fdd8..8b89fe080f70 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -307,7 +307,8 @@ xfs_inode_to_disk(
> >  	to->di_size = cpu_to_be64(from->di_size);
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > -	to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
> > +	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
> > +					0xffffffffU);
> >  	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> > @@ -322,6 +323,10 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +			to->di_nextents_hi
> > +				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
> > +					>> 32);
> 
> /me kinda hates the indentation here, would a convenience variable
> reduce the amount of linewrapping here?

I will use a variable here as you have suggested.
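
For illustration, here is a minimal userspace sketch of how the split could look with a convenience variable. The struct and function names are hypothetical stand-ins rather than the kernel definitions, byte-order conversion is left out, and zeroing the high half when the flag is clear is an assumption of the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk counter halves (no endian conversion). */
struct demo_dinode {
	uint32_t	di_nextents_lo;
	uint32_t	di_nextents_hi;
};

/*
 * Store a 64-bit data fork extent count as two 32-bit halves. The count is
 * passed in once as a local value, so neither assignment needs to wrap.
 * In this sketch the high half is zeroed when the big counter is not in use.
 */
static void
demo_store_nextents(
	struct demo_dinode	*to,
	uint64_t		nextents,
	int			has_big_counter)
{
	assert(to != NULL);

	to->di_nextents_lo = (uint32_t)(nextents & 0xffffffffU);
	to->di_nextents_hi = has_big_counter ? (uint32_t)(nextents >> 32) : 0;
}
```

Reading the count into a local once also avoids evaluating the extent count twice.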

> 
> Oh, right, we're in a new epoch now; just go past 80 columns.
> 
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -360,7 +365,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_size = cpu_to_be64(from->di_size);
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > -	to->di_nextents = cpu_to_be32(from->di_nextents);
> > +	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
> >  	to->di_anextents = cpu_to_be16(from->di_anextents);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> > @@ -375,6 +380,9 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +			to->di_nextents_hi =
> > +				cpu_to_be32(from->di_nextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -391,7 +399,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	xfs_extnum_t		di_nextents;
> > +
> > +	di_nextents = xfs_dfork_nextents(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -462,6 +472,8 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	xfs_extnum_t		nextents;
> > +	int64_t			nblocks;
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -492,10 +504,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	nextents += xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -716,3 +730,23 @@ xfs_inode_validate_cowextsize(
> >  
> >  	return NULL;
> >  }
> > +
> > +xfs_extnum_t
> > +xfs_dfork_nextents(
> > +	struct xfs_sb		*sbp,
> > +	struct xfs_dinode	*dip,
> > +	int			whichfork)
> > +{
> > +	xfs_extnum_t		nextents;
> > +
> > +	if (whichfork == XFS_DATA_FORK) {
> > +		nextents = be32_to_cpu(dip->di_nextents_lo);
> > +		if (xfs_sb_version_has_v3inode(sbp)
> > +			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
> 
> Please don't align the second line of the if test with the if body.
> 
> Or maybe just create a "xfs_inode_has_big_dfork" helper to encapsulate
> this, like we do for reflink/hascow/realtime inodes.

Ok. I will follow the style used for reflink inodes.

> 
> > +			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
> > +				<< 32;
> > +		return nextents;
> > +	} else {
> > +		return be16_to_cpu(dip->di_anextents);
> 
> I suspect you could reduce the indenting here by inverting the logic,
> e.g.
> 
> 	if (attr fork)
> 		return be16_to_cpu(anextents);
> 
> 	nextents = be32_to_cpu(nextents_lo);
> 	if (xfs_inode_has_big_dfork())
> 		nextents += be32_to_cpu(nextents_hi);
> 	return nextents;
>

The "else" part (i.e. attr fork) gets expanded in the next
patch to contain code similar to the data fork. I will have to introduce the
"if/else" branch logic once again in that patch.
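
For reference, a minimal userspace sketch of the inverted (early return) form for this patch alone. All names here are hypothetical stand-ins, endian conversion is omitted, and the "has_big_dfork" member stands in for the proposed helper. Note that the high half has to be shifted left by 32 bits before it is combined with the low half, as the patch does:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { DEMO_DATA_FORK = 0, DEMO_ATTR_FORK = 1 };

/* Hypothetical stand-in for the on-disk inode fields involved. */
struct demo_rd_dinode {
	uint32_t	di_nextents_lo;
	uint32_t	di_nextents_hi;
	uint16_t	di_anextents;
	int		has_big_dfork;	/* stands in for the flag/helper check */
};

/* Return the extent count of the requested fork. */
static uint64_t
demo_dfork_nextents(
	const struct demo_rd_dinode	*dip,
	int				whichfork)
{
	uint64_t			nextents;

	assert(dip != NULL);

	/* Attr fork: still a plain 16-bit counter in this patch. */
	if (whichfork == DEMO_ATTR_FORK)
		return dip->di_anextents;

	/* Data fork: low 32 bits, plus the high 32 bits when in use. */
	nextents = dip->di_nextents_lo;
	if (dip->has_big_dfork)
		nextents |= (uint64_t)dip->di_nextents_hi << 32;
	return nextents;
}
```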

> > +	}
> > +}
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
> > index 865ac493c72a..4583db53b933 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.h
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.h
> > @@ -65,5 +65,7 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
> >  xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
> >  		uint32_t cowextsize, uint16_t mode, uint16_t flags,
> >  		uint64_t flags2);
> > +xfs_extnum_t xfs_dfork_nextents(struct xfs_sb *sbp, struct xfs_dinode *dip,
> > +		int whichfork);
> >  
> >  #endif	/* __XFS_INODE_BUF_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 3bf5a2c391bd..ec682e2d5bcb 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -10,6 +10,7 @@
> >  #include "xfs_format.h"
> >  #include "xfs_log_format.h"
> >  #include "xfs_trans_resv.h"
> > +#include "xfs_sb.h"
> >  #include "xfs_mount.h"
> >  #include "xfs_inode.h"
> >  #include "xfs_trans.h"
> > @@ -103,21 +104,22 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> > +	xfs_extnum_t		nex = xfs_dfork_nextents(sb, dip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> >  	struct xfs_bmbt_irec	new;
> > -	int			i;
> > +	xfs_extnum_t		i;
> >  
> >  	/*
> >  	 * If the number of extents is unreasonable, then something is wrong and
> >  	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
> >  	 */
> >  	if (unlikely(size < 0 || size > XFS_DFORK_SIZE(dip, mp, whichfork))) {
> > -		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %d).",
> > +		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %llu).",
> >  			(unsigned long long) ip->i_ino, nex);
> >  		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
> >  				"xfs_iformat_extents(1)", dip, sizeof(*dip),
> > @@ -233,7 +235,11 @@ xfs_iformat_data_fork(
> >  	 * depend on it.
> >  	 */
> >  	ip->i_df.if_format = dip->di_format;
> > -	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents);
> > +	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents_lo);
> > +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +		ip->i_df.if_nextents |=
> > +			((u64)(be32_to_cpu(dip->di_nextents_hi)) << 32);
> > +
> >  
> >  	switch (inode->i_mode & S_IFMT) {
> >  	case S_IFIFO:
> > @@ -729,31 +735,73 @@ xfs_ifork_verify_local_attr(
> >  	return 0;
> >  }
> >  
> > +static int
> > +xfs_next_set_data(
> > +	struct xfs_trans	*tp,
> > +	struct xfs_inode	*ip,
> > +	struct xfs_ifork	*ifp,
> > +	int			delta)
> > +{
> > +	struct xfs_mount	*mp = ip->i_mount;
> > +	xfs_extnum_t		nr_exts;
> > +
> > +	nr_exts = ifp->if_nextents + delta;
> > +
> > +	if ((delta > 0 && nr_exts > MAXEXTNUM)
> > +		|| (delta < 0 && nr_exts > ifp->if_nextents))
> > +		return -EOVERFLOW;
> > +
> > +	if (ifp->if_nextents <= MAXEXTNUM31BIT &&
> > +		nr_exts > MAXEXTNUM31BIT &&
> > +		!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS) &&
> > +		xfs_sb_version_has_v3inode(&mp->m_sb)) {
> > +		if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
> 
> Urk.  Again, don't indent the if test logic and the if body statements
> to the same level.

I am sorry. I will fix up the indentation issues.

> 
> > +			bool log_sb = false;
> > +
> > +			spin_lock(&mp->m_sb_lock);
> > +			if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
> > +				xfs_sb_version_add47bitext(&mp->m_sb);
> > +				log_sb = true;
> > +			}
> > +			spin_unlock(&mp->m_sb_lock);
> > +
> > +			if (log_sb)
> > +				xfs_log_sb(tp);
> > +		}
> 
> Hm, dynamic filesystem upgrade.  This probably ought to log something to
> dmesg about the upgrade.  It might also be better to make this a
> separate helper so that it's not triply-indented.

Ok. I will implement that.

> 
> > +
> > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> > +	}
> > +
> > +	ifp->if_nextents = nr_exts;
> > +
> > +	return 0;
> > +}
> > +
> >  int
> >  xfs_next_set(
> > +	struct xfs_trans	*tp,
> >  	struct xfs_inode	*ip,
> >  	int			whichfork,
> >  	int			delta)
> >  {
> >  	struct xfs_ifork	*ifp;
> >  	int64_t			nr_exts;
> > -	int64_t			max_exts;
> > +	int			error = 0;
> >  
> >  	ifp = XFS_IFORK_PTR(ip, whichfork);
> >  
> > -	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > -		max_exts = MAXEXTNUM;
> > -	else if (whichfork == XFS_ATTR_FORK)
> > -		max_exts = MAXAEXTNUM;
> > -	else
> > -		ASSERT(0);
> > -
> > -	nr_exts = ifp->if_nextents + delta;
> > -	if ((delta > 0 && nr_exts > max_exts)
> > -		|| (delta < 0 && nr_exts < 0))
> > -		return -EOVERFLOW;
> > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
> > +		error = xfs_next_set_data(tp, ip, ifp, delta);
> > +	} else if (whichfork == XFS_ATTR_FORK) {
> > +		nr_exts = ifp->if_nextents + delta;
> > +		if ((delta > 0 && nr_exts > MAXAEXTNUM)
> > +			|| (delta < 0 && nr_exts < 0))
> > +			return -EOVERFLOW;
> >  
> > -	ifp->if_nextents = nr_exts;
> > +		ifp->if_nextents = nr_exts;
> > +	} else {
> > +		ASSERT(0);
> > +	}
> >  
> > -	return 0;
> > +	return error;
> >  }
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > index a84ae42ace79..c74fa6371cc8 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > @@ -173,5 +173,6 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> >  
> > -int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > +int xfs_next_set(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork,
> > +		int delta);
> >  #endif	/* __XFS_INODE_FORK_H__ */
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cd..879aadff7692 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -396,7 +396,7 @@ struct xfs_log_dinode {
> >  	xfs_fsize_t	di_size;	/* number of bytes in file */
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> > -	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > +	uint32_t	di_nextents_lo;	/* number of extents in data fork */
> >  	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint32_t	di_nextents_hi;
> > +	uint8_t		di_pad2[8];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 0a3041ad5bec..c68ff2178976 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -12,7 +12,7 @@ typedef uint32_t	xfs_agblock_t;	/* blockno in alloc. group */
> >  typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> > -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> >  typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> > @@ -59,7 +59,8 @@ typedef void *		xfs_failaddr_t;
> >   * Max values for extlen, extnum, aextnum.
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> > -#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > +#define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
> > +#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
> >  #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
> >  #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> >  
> > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> > index 6d483ab29e63..be41fd242ff2 100644
> > --- a/fs/xfs/scrub/inode.c
> > +++ b/fs/xfs/scrub/inode.c
> > @@ -205,8 +205,8 @@ xchk_dinode(
> >  	struct xfs_mount	*mp = sc->mp;
> >  	size_t			fork_recs;
> >  	unsigned long long	isize;
> > +	xfs_extnum_t		nextents;
> >  	uint64_t		flags2;
> > -	uint32_t		nextents;
> >  	uint16_t		flags;
> >  	uint16_t		mode;
> >  
> > @@ -354,7 +354,7 @@ xchk_dinode(
> >  	xchk_inode_extsize(sc, dip, ino, mode, flags);
> >  
> >  	/* di_nextents */
> > -	nextents = be32_to_cpu(dip->di_nextents);
> > +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
> >  	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> >  	switch (dip->di_format) {
> >  	case XFS_DINODE_FMT_EXTENTS:
> > @@ -464,6 +464,7 @@ xchk_inode_xref_bmap(
> >  	struct xfs_scrub	*sc,
> >  	struct xfs_dinode	*dip)
> >  {
> > +	xfs_mount_t		*mp = sc->mp;
> 
> struct xfs_mount.  The structure typedef usages are deprecated and
> we're trying to get rid of them (slowly).

Yes, I missed out on this one. I will fix this up.

> 
> --D
> 
> >  	xfs_extnum_t		nextents;
> >  	xfs_filblks_t		count;
> >  	xfs_filblks_t		acount;
> > @@ -477,14 +478,14 @@ xchk_inode_xref_bmap(
> >  			&nextents, &count);
> >  	if (!xchk_should_check_xref(sc, &error, NULL))
> >  		return;
> > -	if (nextents < be32_to_cpu(dip->di_nextents))
> > +	if (nextents < xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK))
> >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> >  
> >  	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
> >  			&nextents, &acount);
> >  	if (!xchk_should_check_xref(sc, &error, NULL))
> >  		return;
> > -	if (nextents != be16_to_cpu(dip->di_anextents))
> > +	if (nextents != xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> >  
> >  	/* Check nblocks against the inode. */
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 64f5f9a440ae..4418a66cf6d6 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -3748,7 +3748,7 @@ xfs_iflush_int(
> >  				ip->i_d.di_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
> >  		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
> >  			"%s: detected corrupt incore inode %Lu, "
> > -			"total extents = %d, nblocks = %Ld, ptr "PTR_FMT,
> > +			"total extents = %llu, nblocks = %Ld, ptr "PTR_FMT,
> >  			__func__, ip->i_ino,
> >  			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
> >  			ip->i_d.di_nblocks, ip);
> > @@ -3785,6 +3785,10 @@ xfs_iflush_int(
> >  	    xfs_ifork_verify_local_attr(ip))
> >  		goto flush_out;
> >  
> > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +		&& xfs_sb_version_has47bitext(&mp->m_sb))
> > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> > +
> >  	/*
> >  	 * Copy the dirty parts of the inode into the on-disk inode.  We always
> >  	 * copy out the core of the inode, because if the inode is dirty at all
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index ba47bf65b772..6f27ac7c8631 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -326,7 +326,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_size = from->di_size;
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> > -	to->di_nextents = xfs_ifork_nextents(&ip->i_df);
> > +	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
> >  	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> > @@ -344,6 +344,9 @@ xfs_inode_to_log_dinode(
> >  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
> >  		to->di_flags2 = from->di_flags2;
> >  		to->di_cowextsize = from->di_cowextsize;
> > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +			to->di_nextents_hi =
> > +				xfs_ifork_nextents(&ip->i_df) >> 32;
> >  		to->di_ino = ip->i_ino;
> >  		to->di_lsn = lsn;
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > index 10ef5ddf5429..8d64b861fb66 100644
> > --- a/fs/xfs/xfs_inode_item_recover.c
> > +++ b/fs/xfs/xfs_inode_item_recover.c
> > @@ -134,6 +134,7 @@ xlog_recover_inode_commit_pass2(
> >  	struct xfs_log_dinode		*ldip;
> >  	uint				isize;
> >  	int				need_free = 0;
> > +	xfs_extnum_t			nextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -255,16 +256,23 @@ xlog_recover_inode_commit_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_nextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
> > +		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > +		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
> > +
> > +	nextents += ldip->di_anextents;
> > +
> > +	if (unlikely(nextents > ldip->di_nblocks)) {
> >  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> >  				     XFS_ERRLEVEL_LOW, mp, ldip,
> >  				     sizeof(*ldip));
> >  		xfs_alert(mp,
> >  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
> > -	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
> > +	"dino bp "PTR_FMT", ino %Ld, total extents = %llu, nblocks = %Ld",
> >  			__func__, item, dip, bp, in_f->ilf_ino,
> > -			ldip->di_nextents + ldip->di_anextents,
> > -			ldip->di_nblocks);
> > +			nextents, ldip->di_nblocks);
> >  		error = -EFSCORRUPTED;
> >  		goto out_release;
> >  	}
> 

-- 
chandan





* Re: [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents
  2020-06-08 16:52   ` Darrick J. Wong
@ 2020-06-09 14:23     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 10:22:17 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:43PM +0530, Chandan Babu R wrote:
> > The maximum number of extents that can be used by a directory can be
> > calculated as shown below. (FS block size is assumed to be 512 bytes
> > since the smallest allowed block size can create a BMBT of maximum
> > possible height).
> > 
> > Maximum number of extents in data space =
> > XFS_DIR2_SPACE_SIZE / 2^9 = 32GiB / 2^9 = 2^26.
> > 
> > Maximum number (theoretically) of extents in leaf space =
> > 32GiB / 2^9 = 2^26.
> 
> Hm.  The leaf hash entries are 8 bytes long, whereas I think directory
> entries occupy at least 16 bytes.  Is there a situation where the number
> of dir leaf/dabtree blocks can actually hit the 32G section size limit?

I don't think so. The 2^26 extents above was a theoretical limit. I wanted to
prove that even with the theoretical limit, the maximum number of extents used
by a directory is much less than 2^47 extents.

> 
> > Maximum number of entries in a free space index block
> > = (512 - (sizeof struct xfs_dir3_free_hdr)) / (sizeof struct
> >                                                xfs_dir2_data_off_t)
> > = (512 - 64) / 2 = 224
> > 
> > Maximum number of extents in free space index =
> > (Maximum number of extents in data segment) / 224 =
> > 2^26 / 224 = ~2^18
> > 
> > Maximum number of extents in a directory =
> > Maximum number of extents in data space +
> > Maximum number of extents in leaf space +
> > Maximum number of extents in free space index =
> > 2^26 + 2^26 + 2^18 = ~2^27
> 
> I calculated the exact expression here, and got:
> 
> 2^26 + 2^26 + (2^26/224) = 134,517,321
> 
> This requires 28 bits of space, doesn't it?

You are right.

Log_2(134,517,321) returns 27.003. Since I had assumed a theoretical maximum
for the "leaf space extent count", I had rounded it down to 27 bits. I will
change this to 28 bits.
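
The arithmetic can be double-checked with a small userspace sketch. The constants are the ones used in the commit message (32GiB per directory segment, 512 byte blocks, a 64 byte free index header and 2 byte entries); the function and macro names are made up for the example:

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_DIR_SPACE_BYTES	(32ULL << 30)	/* 32GiB per directory segment */
#define DEMO_BLOCK_SIZE		512ULL		/* smallest FS block size */
#define DEMO_FREE_HDR_SIZE	64ULL		/* free index block header */
#define DEMO_FREE_ENTRY_SIZE	2ULL		/* one free index entry */

/* Worst-case directory extent count: data + leaf + free index extents. */
static uint64_t
demo_max_dir_extents(void)
{
	uint64_t	data_exts = DEMO_DIR_SPACE_BYTES / DEMO_BLOCK_SIZE;
	uint64_t	leaf_exts = data_exts;	/* theoretical upper bound */
	uint64_t	per_free_blk;

	assert(DEMO_BLOCK_SIZE > DEMO_FREE_HDR_SIZE);
	per_free_blk = (DEMO_BLOCK_SIZE - DEMO_FREE_HDR_SIZE) /
			DEMO_FREE_ENTRY_SIZE;	/* 224 entries per block */

	/* Integer division, matching the exact expression in the review. */
	return data_exts + leaf_exts + data_exts / per_free_blk;
}

/* Number of bits needed to represent v. */
static unsigned int
demo_bits_needed(uint64_t v)
{
	unsigned int	bits = 0;

	while (v) {
		bits++;
		v >>= 1;
	}
	return bits;
}
```

demo_max_dir_extents() evaluates to 134,517,321, which is just above 2^27, so demo_bits_needed() reports 28 bits.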

> 
> Granted I bet the leaf section won't come within 300,000 nextents of the
> 2^26 you've assumed for it, so I suspect that in real world scenarios,
> 27 bits is enough.  But if you're anticipating a totally full leaf
> section under extreme fragmentation, then MAXDIREXTNUM ought to be able
> to handle that.
> 
> (Assuming I did any of that math correctly. ;))
> 
> --D
> 
> > 
> > This commit defines the macro MAXDIREXTNUM to have the value 2^27 and
> > this in turn is used in calculating the maximum height of a directory
> > BMBT.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c  | 2 +-
> >  fs/xfs/libxfs/xfs_types.h | 1 +
> >  2 files changed, 2 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 8b0029b3cecf..f75b70ae7b1f 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -81,7 +81,7 @@ xfs_bmap_compute_maxlevels(
> >  	if (whichfork == XFS_DATA_FORK) {
> >  		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
> >  		if (dir_bmbt)
> > -			maxleafents = MAXEXTNUM;
> > +			maxleafents = MAXDIREXTNUM;
> >  		else
> >  			maxleafents = MAXEXTNUM;
> >  	} else {
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440..0a3041ad5bec 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -60,6 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > +#define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
> >  #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> >  
> >  /*
> 


-- 
chandan





* Re: [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS()
  2020-06-08 17:50   ` Darrick J. Wong
@ 2020-06-09 14:23     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Monday 8 June 2020 11:20:08 PM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:42PM +0530, Chandan Babu R wrote:
> > XFS_BM_MAXLEVELS() returns the maximum possible height of BMBT tree for
> > either data or attribute fork. For data forks, this commit adds a new
> > argument to XFS_BM_MAXLEVELS() to let the users choose between the
> > maximum heights of dir and non-dir BMBTs.
> > 
> > As of this commit, both dir and non-dir BMBTs have the same maximum
> > height. A future commit in this series will use 2^27 extent count as the
> > input to compute the maximum height of a directory BMBT which will in
> > turn cause the maximum heights of dir and non-dir BMBTs to differ.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_attr.c        |  5 ++--
> >  fs/xfs/libxfs/xfs_bmap.c        |  5 ++--
> >  fs/xfs/libxfs/xfs_bmap_btree.h  |  4 +++-
> >  fs/xfs/libxfs/xfs_trans_resv.c  | 25 +++++++++++---------
> >  fs/xfs/libxfs/xfs_trans_resv.h  |  4 ++--
> >  fs/xfs/libxfs/xfs_trans_space.h | 41 +++++++++++++++++----------------
> >  fs/xfs/xfs_bmap_item.c          |  3 ++-
> >  fs/xfs/xfs_reflink.c            |  4 ++--
> >  8 files changed, 50 insertions(+), 41 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> > index a4b23edf887e..357e29a5a167 100644
> > --- a/fs/xfs/libxfs/xfs_attr.c
> > +++ b/fs/xfs/libxfs/xfs_attr.c
> > @@ -150,7 +150,7 @@ xfs_attr_calc_size(
> >  	 * "local" or "remote" (note: local != inline).
> >  	 */
> >  	size = xfs_attr_leaf_newentsize(args, local);
> > -	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
> > +	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0);
> 
> When would we have a DAENTER space reservation for the data fork on
> something that isn't a directory?
> 
> Shouldn't you be able to compute the correct 'dbmbt' parameter value
> from whichfork?
>

You are right. I could pass the "use dir bmbt" argument to XFS_DAENTER_BMAPS()
from within the definition of XFS_DAENTER_SPACE_RES() if the fork passed in is
a data fork. This argument could be passed to XFS_BM_MAXLEVELS() via
XFS_NEXTENTADD_SPACE_RES() => XFS_EXTENTADD_SPACE_RES() => XFS_BM_MAXLEVELS().
However, the changes made to invocations of these three macros elsewhere in the
code have to be retained so that a correct value is passed for the newly
introduced argument.

> Can you modify these macros to take the xfs_inode so that we can gate
> the logic on i_mode instead of passing magic values 0 and 1 around?

I did try to do that. But many of these macros are invoked from functions that
don't have access to xfs_inode. For example, functions in xfs_trans_resv.c
which pre-calculate log reservations don't have an xfs_inode handy.

> Though... thinking about this more, 1 means "use the slightly smaller
> directory bmbt maxlevels", and 0 means "either this is a non directory
> or we want worst case calculations", doesn't it...

Yes, that was the intention of introducing this argument.

> 
> Zooming out, why do we even care?  While it's true that we might gain
> the ability to shave a few blocks off the block reservation when we know
> we're dealing with a directory, this adds quite a bit of clutter to get
> it.

Using a separate maximum extent count for the directory data fork was required
to offset the increase in log reservations. To be precise, the rename
operation invokes XFS_DIROP_LOG_COUNT(), which indirectly uses
mp->m_bm_maxlevels[XFS_DATA_FORK] for its calculations. A modified kernel
with 2^47 as the value of MAXEXTNUM produced a taller data fork BMBT, so the
log reservation for the rename operation became larger.

The idea of special handling of "maximum extents for directory data fork" came
up later when trying to find a way to reduce the log reservation for the
rename operation.
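
To show why the extent count limit feeds into the tree height, here is a simplified userspace model of a worst-case btree height calculation. It is only a sketch and not the kernel code: the function name is made up, and the record counts passed in (minimum records per leaf and node block, maximum records in the inode root) are hypothetical parameters rather than real XFS geometry:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case height of a btree that must index maxleafents records when
 * every block is only minimally full and the root lives in the inode.
 */
static unsigned int
demo_bmbt_maxlevels(
	uint64_t	maxleafents,
	uint64_t	minleafrecs,
	uint64_t	minnoderecs,
	uint64_t	maxrootrecs)
{
	uint64_t	maxblocks;
	unsigned int	level;

	assert(minleafrecs > 0 && minnoderecs > 1);

	/* Leaf blocks needed when each holds the minimum record count. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;

	/* Add levels until the remaining pointers fit in the inode root. */
	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;
		else
			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
	}
	return level;
}
```

With any fixed set of record counts, a 47-bit limit yields a taller worst-case tree than a 27-bit limit, and that extra height is what inflates the reservation.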

> 
> >  	if (*local) {
> >  		if (size > (args->geo->blksize / 2)) {
> >  			/* Double split possible */
> > @@ -163,7 +163,8 @@ xfs_attr_calc_size(
> >  		 */
> >  		uint	dblocks = xfs_attr3_rmt_blocks(mp, args->valuelen);
> >  		nblks += dblocks;
> > -		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks, XFS_ATTR_FORK);
> > +		nblks += XFS_NEXTENTADD_SPACE_RES(mp, dblocks,
> > +				XFS_ATTR_FORK, 0);
> >  	}
> >  
> >  	return nblks;
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 01e2b543b139..8b0029b3cecf 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -182,13 +182,14 @@ xfs_bmap_worst_indlen(
> >  	mp = ip->i_mount;
> >  	maxrecs = mp->m_bmap_dmxr[0];
> >  	for (level = 0, rval = 0;
> > -	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK);
> > +	     level < XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0);
> >  	     level++) {
> >  		len += maxrecs - 1;
> >  		do_div(len, maxrecs);
> >  		rval += len;
> >  		if (len == 1)
> > -			return rval + XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) -
> > +			return rval +
> > +				XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) -
> >  				level - 1;
> >  		if (level == 0)
> >  			maxrecs = mp->m_bmap_dmxr[1];
> > diff --git a/fs/xfs/libxfs/xfs_bmap_btree.h b/fs/xfs/libxfs/xfs_bmap_btree.h
> > index 72bf74c79fb9..a047be5883d1 100644
> > --- a/fs/xfs/libxfs/xfs_bmap_btree.h
> > +++ b/fs/xfs/libxfs/xfs_bmap_btree.h
> > @@ -79,7 +79,9 @@ struct xfs_trans;
> >  /*
> >   * Maximum number of bmap btree levels.
> >   */
> > -#define XFS_BM_MAXLEVELS(mp,w)		((mp)->m_bm_maxlevels[(w)])
> > +#define XFS_BM_MAXLEVELS(mp,w,use_dir_bmbt) \
> > +	((!(use_dir_bmbt)) ? \
> > +		(mp)->m_bm_maxlevels[(w)] : (mp)->m_bm_dir_maxlevel)
> 
> Also, if you /are/ going to mess with these macros, can you please turn
> them into static inline functions?  Typechecking would be nice.

Sure, I will do that.
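
A rough userspace sketch of what the static inline form could look like. The struct is a hypothetical, trimmed-down stand-in for struct xfs_mount; only the member names m_bm_maxlevels and m_bm_dir_maxlevel are taken from the patch, and the member types are assumptions:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the members the helper reads. */
struct demo_mount {
	uint8_t		m_bm_maxlevels[2];	/* indexed by data/attr fork */
	uint8_t		m_bm_dir_maxlevel;	/* directory data fork BMBT */
};

/* Typed replacement for the XFS_BM_MAXLEVELS(mp, w, use_dir_bmbt) macro. */
static inline unsigned int
demo_bm_maxlevels(
	const struct demo_mount	*mp,
	int			whichfork,
	bool			use_dir_bmbt)
{
	assert(whichfork >= 0 && whichfork < 2);

	if (use_dir_bmbt)
		return mp->m_bm_dir_maxlevel;
	return mp->m_bm_maxlevels[whichfork];
}
```

Compared to the macro, the compiler now checks the argument types, and a bool parameter documents the intent better than a bare 0 or 1.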

> 
> --D
> 
> >  /*
> >   * Prototypes for xfs_bmap.c to call.
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> > index b44b521c605c..39cfca1b71b6 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.c
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> > @@ -265,14 +265,14 @@ xfs_calc_write_reservation(
> >  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> >  
> >  	t1 = xfs_calc_inode_res(mp, 1) +
> > -	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), blksz) +
> > +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0), blksz) +
> >  	     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> >  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2), blksz);
> >  
> >  	if (xfs_sb_version_hasrealtime(&mp->m_sb)) {
> >  		t2 = xfs_calc_inode_res(mp, 1) +
> > -		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> > -				     blksz) +
> > +		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
> > +			blksz) +
> >  		     xfs_calc_buf_res(3, mp->m_sb.sb_sectsize) +
> >  		     xfs_calc_buf_res(xfs_rtalloc_log_count(mp, 1), blksz) +
> >  		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1), blksz);
> > @@ -313,7 +313,8 @@ xfs_calc_itruncate_reservation(
> >  	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
> >  
> >  	t1 = xfs_calc_inode_res(mp, 1) +
> > -	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK) + 1, blksz);
> > +	     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0) + 1,
> > +			     blksz);
> >  
> >  	t2 = xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> >  	     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4), blksz);
> > @@ -592,7 +593,7 @@ xfs_calc_growrtalloc_reservation(
> >  	struct xfs_mount	*mp)
> >  {
> >  	return xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> > -		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK),
> > +		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
> >  				 XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_inode_res(mp, 1) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> > @@ -669,7 +670,7 @@ xfs_calc_addafork_reservation(
> >  		xfs_calc_inode_res(mp, 1) +
> >  		xfs_calc_buf_res(2, mp->m_sb.sb_sectsize) +
> >  		xfs_calc_buf_res(1, mp->m_dir_geo->blksize) +
> > -		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK) + 1,
> > +		xfs_calc_buf_res(XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0) + 1,
> >  				 XFS_FSB_TO_B(mp, 1)) +
> >  		xfs_calc_buf_res(xfs_allocfree_log_count(mp, 1),
> >  				 XFS_FSB_TO_B(mp, 1));
> > @@ -691,7 +692,7 @@ xfs_calc_attrinval_reservation(
> >  	struct xfs_mount	*mp)
> >  {
> >  	return max((xfs_calc_inode_res(mp, 1) +
> > -		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
> > +		    xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0),
> >  				     XFS_FSB_TO_B(mp, 1))),
> >  		   (xfs_calc_buf_res(9, mp->m_sb.sb_sectsize) +
> >  		    xfs_calc_buf_res(xfs_allocfree_log_count(mp, 4),
> > @@ -717,10 +718,11 @@ xfs_calc_attrset_reservation(
> >  	int			bmbt_blks;
> >  
> >  	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
> > -	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
> > +	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK, 0);
> >  
> >  	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
> > -	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
> > +	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks,
> > +			XFS_ATTR_FORK, 0);
> >  
> >  	return XFS_DQUOT_LOGRES(mp) +
> >  		xfs_calc_inode_res(mp, 1) +
> > @@ -752,8 +754,9 @@ xfs_calc_attrrm_reservation(
> >  		     xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH,
> >  				      XFS_FSB_TO_B(mp, 1)) +
> >  		     (uint)XFS_FSB_TO_B(mp,
> > -					XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK)) +
> > -		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK), 0)),
> > +				XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK, 0)) +
> > +		     xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_DATA_FORK, 0),
> > +				     0)),
> >  		    (xfs_calc_buf_res(5, mp->m_sb.sb_sectsize) +
> >  		     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 2),
> >  				      XFS_FSB_TO_B(mp, 1))));
> > diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> > index f50996ae18e6..d64989eeebd7 100644
> > --- a/fs/xfs/libxfs/xfs_trans_resv.h
> > +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> > @@ -61,10 +61,10 @@ struct xfs_trans_resv {
> >   */
> >  #define	XFS_DIROP_LOG_RES(mp)	\
> >  	(XFS_FSB_TO_B(mp, XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK)) + \
> > -	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)))
> > +	 (XFS_FSB_TO_B(mp, XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)))
> >  #define	XFS_DIROP_LOG_COUNT(mp)	\
> >  	(XFS_DAENTER_BLOCKS(mp, XFS_DATA_FORK) + \
> > -	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK) + 1)
> > +	 XFS_DAENTER_BMAPS(mp, XFS_DATA_FORK, 1) + 1)
> >  
> >  /*
> >   * Various log count values.
> > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> > index b559af70cf51..c51d809a16b1 100644
> > --- a/fs/xfs/libxfs/xfs_trans_space.h
> > +++ b/fs/xfs/libxfs/xfs_trans_space.h
> > @@ -25,15 +25,16 @@
> >  
> >  #define XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)    \
> >  		(((mp)->m_alloc_mxr[0]) - ((mp)->m_alloc_mnr[0]))
> > -#define	XFS_EXTENTADD_SPACE_RES(mp,w)	(XFS_BM_MAXLEVELS(mp,w) - 1)
> > -#define XFS_NEXTENTADD_SPACE_RES(mp,b,w)\
> > +#define	XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt)	\
> > +	(XFS_BM_MAXLEVELS(mp,w,dbmbt) - 1)
> > +#define XFS_NEXTENTADD_SPACE_RES(mp,b,w,dbmbt)		   \
> >  	(((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
> >  	  XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
> > -	  XFS_EXTENTADD_SPACE_RES(mp,w))
> > +		XFS_EXTENTADD_SPACE_RES(mp,w,dbmbt))
> >  
> >  /* Blocks we might need to add "b" mappings & rmappings to a file. */
> > -#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)\
> > -	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w)) + \
> > +#define XFS_SWAP_RMAP_SPACE_RES(mp,b,w)	    \
> > +	(XFS_NEXTENTADD_SPACE_RES((mp), (b), (w), 0) +	\
> >  	 XFS_NRMAPADD_SPACE_RES((mp), (b)))
> >  
> >  #define	XFS_DAENTER_1B(mp,w)	\
> > @@ -47,19 +48,19 @@
> >  	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
> >  #define	XFS_DAENTER_BLOCKS(mp,w)	\
> >  	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
> > -#define	XFS_DAENTER_BMAP1B(mp,w)	\
> > -	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
> > -#define	XFS_DAENTER_BMAPS(mp,w)		\
> > -	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w))
> > -#define	XFS_DAENTER_SPACE_RES(mp,w)	\
> > -	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w))
> > -#define	XFS_DAREMOVE_SPACE_RES(mp,w)	XFS_DAENTER_BMAPS(mp,w)
> > +#define	XFS_DAENTER_BMAP1B(mp,w,dbmbt)	\
> > +	XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w, dbmbt)
> > +#define	XFS_DAENTER_BMAPS(mp,w,dbmbt)	\
> > +	(XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w,dbmbt))
> > +#define	XFS_DAENTER_SPACE_RES(mp,w,dbmbt)	\
> > +	(XFS_DAENTER_BLOCKS(mp,w) + XFS_DAENTER_BMAPS(mp,w,dbmbt))
> > +#define	XFS_DAREMOVE_SPACE_RES(mp,w,dbmbt)	XFS_DAENTER_BMAPS(mp,w,dbmbt)
> >  #define	XFS_DIRENTER_MAX_SPLIT(mp,nl)	1
> >  #define	XFS_DIRENTER_SPACE_RES(mp,nl)	\
> > -	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK) * \
> > +	(XFS_DAENTER_SPACE_RES(mp, XFS_DATA_FORK, 1) *	\
> >  	 XFS_DIRENTER_MAX_SPLIT(mp,nl))
> >  #define	XFS_DIRREMOVE_SPACE_RES(mp)	\
> > -	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK)
> > +	XFS_DAREMOVE_SPACE_RES(mp, XFS_DATA_FORK, 1)
> >  #define	XFS_IALLOC_SPACE_RES(mp)	\
> >  	(M_IGEO(mp)->ialloc_blks + \
> >  	 (xfs_sb_version_hasfinobt(&mp->m_sb) ? 2 : 1 * \
> > @@ -69,26 +70,26 @@
> >   * Space reservation values for various transactions.
> >   */
> >  #define	XFS_ADDAFORK_SPACE_RES(mp)	\
> > -	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK))
> > +	((mp)->m_dir_geo->fsbcount + XFS_DAENTER_BMAP1B(mp, XFS_DATA_FORK, 0))
> >  #define	XFS_ATTRRM_SPACE_RES(mp)	\
> > -	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK)
> > +	XFS_DAREMOVE_SPACE_RES(mp, XFS_ATTR_FORK, 0)
> >  /* This macro is not used - see inline code in xfs_attr_set */
> >  #define	XFS_ATTRSET_SPACE_RES(mp, v)	\
> > -	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK) + XFS_B_TO_FSB(mp, v))
> > +	(XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK, 0) + XFS_B_TO_FSB(mp, v))
> >  #define	XFS_CREATE_SPACE_RES(mp,nl)	\
> >  	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
> >  #define	XFS_DIOSTRAT_SPACE_RES(mp, v)	\
> > -	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + (v))
> > +	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + (v))
> >  #define	XFS_GROWFS_SPACE_RES(mp)	\
> >  	(2 * (mp)->m_ag_maxlevels)
> >  #define	XFS_GROWFSRT_SPACE_RES(mp,b)	\
> > -	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK))
> > +	((b) + XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0))
> >  #define	XFS_LINK_SPACE_RES(mp,nl)	\
> >  	XFS_DIRENTER_SPACE_RES(mp,nl)
> >  #define	XFS_MKDIR_SPACE_RES(mp,nl)	\
> >  	(XFS_IALLOC_SPACE_RES(mp) + XFS_DIRENTER_SPACE_RES(mp,nl))
> >  #define	XFS_QM_DQALLOC_SPACE_RES(mp)	\
> > -	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK) + \
> > +	(XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0) + \
> >  	 XFS_DQUOT_CLUSTER_SIZE_FSB)
> >  #define	XFS_QM_QINOCREATE_SPACE_RES(mp)	\
> >  	XFS_IALLOC_SPACE_RES(mp)
> > diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> > index 6736c5ab188f..0a8a8377a150 100644
> > --- a/fs/xfs/xfs_bmap_item.c
> > +++ b/fs/xfs/xfs_bmap_item.c
> > @@ -482,7 +482,8 @@ xfs_bui_item_recover(
> >  	}
> >  
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
> > -			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
> > +			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0), 0,
> > +			0, &tp);
> >  	if (error)
> >  		return error;
> >  	/*
> > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > index 107bf2a2f344..fd35a0bf2c47 100644
> > --- a/fs/xfs/xfs_reflink.c
> > +++ b/fs/xfs/xfs_reflink.c
> > @@ -614,7 +614,7 @@ xfs_reflink_end_cow_extent(
> >  		return 0;
> >  	}
> >  
> > -	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK);
> > +	resblks = XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK, 0);
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0,
> >  			XFS_TRANS_RESERVE, &tp);
> >  	if (error)
> > @@ -1017,7 +1017,7 @@ xfs_reflink_remap_extent(
> >  	}
> >  
> >  	/* Start a rolling transaction to switch the mappings */
> > -	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK);
> > +	resblks = XFS_EXTENTADD_SPACE_RES(ip->i_mount, XFS_DATA_FORK, 0);
> >  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
> >  	if (error)
> >  		goto out;
> 


-- 
chandan




* Re: [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-08 20:59   ` Darrick J. Wong
@ 2020-06-09 14:23     ` Chandan Babu R
  2020-06-09 18:40       ` Darrick J. Wong
  0 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-09 14:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Tuesday 9 June 2020 2:29:22 AM IST Darrick J. Wong wrote:
> On Sat, Jun 06, 2020 at 01:57:41PM +0530, Chandan Babu R wrote:
> > xfs/306 causes the following call trace when using a data fork with a
> > maximum extent count of 2^47,
> > 
> >  XFS (loop0): Mounting V5 Filesystem
> >  XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
> >  XFS (loop0): AAIEEE! Log failed size checks. Abort!
> >  XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
> 
> Uh... won't applying the corresponding MAXEXTNUM changes and whatnot to
> xfsprogs result in mkfs formatting a log with 9075 blocks?  Is there
> some other mistake in the minimum log size computations?

The call trace given below shows up when using 2^47 as the maximum extent
count for both Dir and Non-dir inodes.

However, using 2^27 as the maximum extent count for directories reduces the
log reservation for the "rename" operation (which has the largest log
reservation with the FS geometry mentioned below).

"Rename" log reservation is a function of the maximum directory BMBT height
which in turn is a function of the maximum number of extents that can be
occupied by a directory.

Hence when moving the MAXEXTNUM changes to xfsprogs, the corresponding
"maximum directory extent count" changes must also be moved as a
dependency.

With this patchset applied (i.e. with 2^27 as the maximum extent count for
directory inodes and 2^47 as the maximum extent count for non-directory
inodes), xfs_log_calc_minimum_size() in the kernel returns 8691 blocks.

> 
> >  ------------[ cut here ]------------
> >  WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
> >  Modules linked in:
> >  CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
> >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> >  RIP: 0010:assfail+0x25/0x28
> >  Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
> >  RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
> >  RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
> >  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
> >  RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
> >  R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
> >  R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
> >  FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
> >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >  CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
> >  Call Trace:
> >   xfs_log_mount+0xf8/0x300
> >   xfs_mountfs+0x46e/0x950
> >   xfs_fc_fill_super+0x318/0x510
> >   ? xfs_mount_free+0x30/0x30
> >   get_tree_bdev+0x15c/0x250
> >   vfs_get_tree+0x25/0xb0
> >   do_mount+0x740/0x9b0
> >   ? memdup_user+0x41/0x80
> >   __x64_sys_mount+0x8e/0xd0
> >   do_syscall_64+0x48/0x110
> >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >  RIP: 0033:0x7fd6c5f2ccda
> >  Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
> >  RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> >  RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
> >  RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
> >  RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
> >  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> >  R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
> >  ---[ end trace 6436391b468bc652 ]---
> >  XFS (loop0): log mount failed
> > 
> > The corresponding filesystem was created using mkfs options
> > "-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".
> > 
> > i.e. We have a filesystem of size 20MiB, data block size of 1KiB and
> > directory block size of 64KiB. Filesystems of size < 1GiB can have less
> > than 10MiB on-disk log (Please refer to calculate_log_size() in
> > xfsprogs).
> 
> Hm.  You don't seem to be setting either of the big extent count feature
> flags here.
> 
> Is this something that happens after a filesystem gets *upgraded* to
> support extent counts > 2^32?  If it's this second case, then I think
> the function that upgrades the filesystem has to reject the change if it
> would cause the minimum log size checks to fail.

This happens whenever MAXEXTNUM is 2^47, irrespective of whether the
filesystem's superblock has the big extent count feature flag set, i.e. with
this patchset applied.

Using 2^47 as the value of MAXEXTNUM increases the height of the data fork
BMBT compared to its height with a MAXEXTNUM of 2^32 (with the fs geometry
that caused the above call trace, the height increased by 1). The call
xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK), invoked as part of the FS mount
operation, uses MAXEXTNUM as input to calculate the maximum height of the data
fork BMBT and stores the result in mp->m_bm_maxlevels[XFS_DATA_FORK]. This
value is then used when calculating log reservations for various fs
operations. Hence the log reservations of fs operations change regardless of
whether the "big extent count" feature flag is set.
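
To make the dependency concrete, here is a minimal standalone sketch of the
level computation, assuming the same loop structure as
xfs_bmap_compute_maxlevels(). The function name bmbt_maxlevels() and its
parameters are made up for illustration; in the kernel the record counts come
from the mount structure rather than from arguments.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: worst-case BMBT height for a given maximum extent
 * count. All four parameters are stand-ins for values the kernel derives
 * from the filesystem geometry.
 */
static unsigned int
bmbt_maxlevels(
	uint64_t	maxleafents,	/* maximum extent count, e.g. MAXEXTNUM */
	uint64_t	minleafrecs,	/* min records in a leaf block */
	uint64_t	minnoderecs,	/* min records in a node block */
	uint64_t	maxrootrecs)	/* max records in the in-inode root */
{
	uint64_t	maxblocks;
	unsigned int	level;

	/* minnoderecs must exceed 1, otherwise the loop below cannot shrink. */
	assert(minleafrecs > 0 && minnoderecs > 1);

	/* Worst case: every leaf block holds only the minimum record count. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;	/* remaining pointers fit in the root */
		else
			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
	}
	return level;
}
```

With the toy inputs bmbt_maxlevels(100, 4, 4, 3), the block count shrinks as
25, 7, 2, 1 and the function returns 4. Raising maxleafents can only keep the
result the same or increase it, which is why a larger MAXEXTNUM changes the
reservations even when no inode ever gets that many extents.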

> 
> Granted, I don't understand the need (in the next patch) to special case
> bmbt maxlevels for directory data forks.  That's probably muddying up
> my ability to figure all this out.  Yes I did read this series
> backwards. :)

Using a separate maximum extent count for the directory data fork was required
to reduce the increased log reservations described above. To be precise, the
rename operation uses XFS_DIROP_LOG_COUNT(), which indirectly uses
mp->m_bm_maxlevels[XFS_DATA_FORK] for its calculations. A modified kernel
with 2^47 as the value of MAXEXTNUM produced a taller data fork BMBT, and
hence the log reservation for the rename operation became larger.

The idea of special handling of "maximum extents for directory data fork" came
up later when trying to find a way to reduce the log reservation for the
rename operation.
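
As a rough illustration of how the BMBT height feeds into
XFS_DIROP_LOG_COUNT(), the sketch below hard-codes the values for the
geometry discussed in this thread (1KiB fs blocks, 64KiB directory blocks).
The function and constant names are made up for the example and are not
kernel identifiers. The sketch includes the trailing "+ 1" from the
XFS_DIROP_LOG_COUNT() definition.

```c
#include <assert.h>

/* Values for the example geometry only; names are illustrative. */
#define DIR_FSBCOUNT		64	/* fs blocks per directory block */
#define DA_NODE_MAXDEPTH	5	/* XFS_DA_NODE_MAXDEPTH */
#define MAX_CONTIG_EXTS		61	/* XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() */

static unsigned int
dirop_log_count(unsigned int bm_maxlevels)
{
	unsigned int	dbs = DA_NODE_MAXDEPTH + 2;	/* XFS_DAENTER_DBS */
	unsigned int	blocks = DIR_FSBCOUNT * dbs;	/* XFS_DAENTER_BLOCKS */
	unsigned int	bmap1b;				/* XFS_DAENTER_BMAP1B */

	assert(bm_maxlevels >= 1);

	/* ceil(64 / 61) mapping blocks, each costing (height - 1) BMBT blocks */
	bmap1b = ((DIR_FSBCOUNT + MAX_CONTIG_EXTS - 1) / MAX_CONTIG_EXTS) *
		 (bm_maxlevels - 1);

	/* XFS_DAENTER_BLOCKS + XFS_DAENTER_BMAPS + 1 */
	return blocks + dbs * bmap1b + 1;
}
```

Here dirop_log_count(7) gives 533 and dirop_log_count(8) gives 547, so one
extra BMBT level adds 14 blocks to the count, and the rename reservation
(which uses twice this count) grows by 28 blocks.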

> 
> --D
> 
> > The largest reservation space was contributed by the rename
> > operation. The corresponding calculation is done inside
> > xfs_calc_rename_reservation(). In this case, the value returned by this
> > function is,
> > 
> > xfs_calc_inode_res(mp, 4)
> > + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))
> > 
> > xfs_calc_inode_res(mp, 4) returns a constant value of 3040 bytes
> > regardless of the maximum data fork extent count.
> > 
> > The largest contribution to the rename operation was by "2 *
> > XFS_DIROP_LOG_COUNT(mp)" and it is a function of maximum height of a
> > directory's BMBT tree.
> > 
> > XFS_DIROP_LOG_COUNT() is a sum of,
> > 
> > 1. The maximum number of dabtree blocks that needs to be logged
> >    i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) *
> >    XFS_DAENTER_DBS(mp,w).  For directories, this evaluates
> >    to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64 * (5 + 2)) = 448.
> > 
> > 2. The corresponding maximum number of BMBT blocks that needs to be
> >    logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
> >    XFS_DAENTER_BMAP1B(mp,w)
> > 
> >    XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7
> > 
> >    XFS_DAENTER_BMAP1B(mp,w)
> >    = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
> >    = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
> >    = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
> >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
> > 
> >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() =
> >    mp->m_alloc_mxr[0] - mp->m_alloc_mnr[0] = 121 - 60 = 61
> > 
> >    XFS_DAENTER_BMAP1B(mp,w) =
> >    ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
> >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
> >    = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
> >    = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
> >    = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
> >    = 2 * (8 - 1)
> >    = 14
> > 
> >    With 2^32 as the maximum extent count the maximum height of the bmap btree
> >    was 7. Now with 2^47 maximum extent count, the height has increased to 8.
> > 
> >    Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.
> > 
> > XFS_DIROP_LOG_COUNT() = 448 + 98 + 1 = 547.
> > 2 * XFS_DIROP_LOG_COUNT() = 2 * 547 = 1094.
> > 
> > With 2^32 max extent count, XFS_DIROP_LOG_COUNT() evaluates to
> > 533. Hence 2 * XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.
> > 
> > This small difference of 1094 - 1066 = 28 fs blocks is sufficient to
> > trip us over the minimum log size check.
> > 
> > A future commit in this series will use 2^27 as the maximum directory
> > extent count. This will result in a shorter directory BMBT tree.  Log
> > reservation calculations that are applicable only to
> > directories (e.g. XFS_DIROP_LOG_COUNT()) can then choose this instead of
> > non-dir data fork BMBT height.
> > 
> > This commit introduces a new member in 'struct xfs_mount' to hold the
> > maximum BMBT height of a directory. At present, the maximum height of a
> > directory BMBT is the same as the maximum height of a non-directory
> > BMBT. A future commit will change the parameters used as input for
> > computing the maximum height of a directory BMBT.
> > 
> > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++++++++---
> >  fs/xfs/libxfs/xfs_bmap.h |  3 ++-
> >  fs/xfs/xfs_mount.c       |  5 +++--
> >  fs/xfs/xfs_mount.h       |  1 +
> >  4 files changed, 20 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > index 798fca5c52af..01e2b543b139 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.c
> > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > @@ -50,7 +50,8 @@ kmem_zone_t		*xfs_bmap_free_item_zone;
> >  void
> >  xfs_bmap_compute_maxlevels(
> >  	xfs_mount_t	*mp,		/* file system mount structure */
> > -	int		whichfork)	/* data or attr fork */
> > +	int		whichfork,	/* data or attr fork */
> > +	int		dir_bmbt)	/* Dir or non-dir data fork */
> >  {
> >  	int		level;		/* btree level */
> >  	uint		maxblocks;	/* max blocks at this level */
> > @@ -60,6 +61,9 @@ xfs_bmap_compute_maxlevels(
> >  	int		minnoderecs;	/* min records in node block */
> >  	int		sz;		/* root block size */
> >  
> > +	if (whichfork == XFS_ATTR_FORK)
> > +		ASSERT(dir_bmbt == 0);
> > +
> >  	/*
> >  	 * The maximum number of extents in a file, hence the maximum number of
> >  	 * leaf entries, is controlled by the size of the on-disk extent count,
> > @@ -75,8 +79,11 @@ xfs_bmap_compute_maxlevels(
> >  	 * of a minimum size available.
> >  	 */
> >  	if (whichfork == XFS_DATA_FORK) {
> > -		maxleafents = MAXEXTNUM;
> >  		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
> > +		if (dir_bmbt)
> > +			maxleafents = MAXEXTNUM;
> > +		else
> > +			maxleafents = MAXEXTNUM;
> >  	} else {
> >  		maxleafents = MAXAEXTNUM;
> >  		sz = XFS_BMDR_SPACE_CALC(MINABTPTRS);
> > @@ -91,7 +98,11 @@ xfs_bmap_compute_maxlevels(
> >  		else
> >  			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> >  	}
> > -	mp->m_bm_maxlevels[whichfork] = level;
> > +
> > +	if (whichfork == XFS_DATA_FORK && dir_bmbt)
> > +		mp->m_bm_dir_maxlevel = level;
> > +	else
> > +		mp->m_bm_maxlevels[whichfork] = level;
> >  }
> >  
> >  STATIC int				/* error */
> > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > index 6028a3c825ba..4250c9ab4b75 100644
> > --- a/fs/xfs/libxfs/xfs_bmap.h
> > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > @@ -187,7 +187,8 @@ void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
> >  void	__xfs_bmap_add_free(struct xfs_trans *tp, xfs_fsblock_t bno,
> >  		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
> >  		bool skip_discard);
> > -void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> > +void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork,
> > +		int dir_bmbt);
> >  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> >  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> >  int	xfs_bmap_last_before(struct xfs_trans *tp, struct xfs_inode *ip,
> > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > index bb91f04266b9..d8ebfc67bb63 100644
> > --- a/fs/xfs/xfs_mount.c
> > +++ b/fs/xfs/xfs_mount.c
> > @@ -711,8 +711,9 @@ xfs_mountfs(
> >  		goto out;
> >  
> >  	xfs_alloc_compute_maxlevels(mp);
> > -	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
> > -	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
> > +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 0);
> > +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 1);
> > +	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK, 0);
> >  	xfs_ialloc_setup_geometry(mp);
> >  	xfs_rmapbt_compute_maxlevels(mp);
> >  	xfs_refcountbt_compute_maxlevels(mp);
> > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > index aba5a1579279..9dbf036ddace 100644
> > --- a/fs/xfs/xfs_mount.h
> > +++ b/fs/xfs/xfs_mount.h
> > @@ -133,6 +133,7 @@ typedef struct xfs_mount {
> >  	uint			m_refc_mnr[2];	/* min refc btree records */
> >  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
> >  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
> > +	uint			m_bm_dir_maxlevel;
> >  	uint			m_rmap_maxlevels; /* max rmap btree levels */
> >  	uint			m_refc_maxlevels; /* max refcount btree level */
> >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> 

-- 
chandan




* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-09 14:22       ` Chandan Babu R
@ 2020-06-09 17:07         ` Darrick J. Wong
  2020-06-10  6:24           ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-09 17:07 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Tue, Jun 09, 2020 at 07:52:48PM +0530, Chandan Babu R wrote:
> On Monday 8 June 2020 10:02:16 PM IST Darrick J. Wong wrote:
> > On Mon, Jun 08, 2020 at 09:24:25AM -0700, Darrick J. Wong wrote:
> > > On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > > > The following error message was noticed when a workload added one
> > > > million xattrs, deleted 50% of them and then inserted 400,000 new
> > > > xattrs.
> > > > 
> > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > 
> > > > The error message was printed during unmounting the filesystem. The
> > > > value printed under "total extents" indicates that we overflowed the
> > > > per-inode signed 16-bit xattr extent counter.
> > > > 
> > > > Instead of letting this silent corruption occur, this patch checks for
> > > > extent counter (both data and xattr) overflow before we assign the
> > > > new value to the corresponding in-memory extent counter.
> > > > 
> > > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> > > >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> > > >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> > > >  3 files changed, 104 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > index edc63dba007f..798fca5c52af 100644
> > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> > > >  	xfs_iext_first(ifp, &icur);
> > > >  	xfs_iext_insert(ip, &icur, &rec, 0);
> > > >  
> > > > -	ifp->if_nextents = 1;
> > > > +	error = xfs_next_set(ip, whichfork, 1);
> > > > +	if (error)
> > > > +		goto done;
> > > 
> > > Are you sure that if_nextents == 0 is a precondition here?  Technically
> > > speaking, this turns an assignment into an increment operation.
> > > 
> > > > +
> > > >  	ip->i_d.di_nblocks = 1;
> > > >  	xfs_trans_mod_dquot_byino(tp, ip,
> > > >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > > > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> > > >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> > > >  		xfs_iext_prev(ifp, &bma->icur);
> > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > > > -		ifp->if_nextents--;
> > > > +
> > > > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (bma->cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> > > >  		PREV.br_startblock = new->br_startblock;
> > > >  		PREV.br_state = new->br_state;
> > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (bma->cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> > > >  		 * The left neighbor is not contiguous.
> > > >  		 */
> > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (bma->cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> > > >  		 * The right neighbor is not contiguous.
> > > >  		 */
> > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (bma->cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> > > >  		xfs_iext_next(ifp, &bma->icur);
> > > >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> > > >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (bma->cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  		xfs_iext_remove(ip, icur, state);
> > > >  		xfs_iext_prev(ifp, icur);
> > > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > > -		ifp->if_nextents -= 2;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, -2);
> > > > +		if (error)
> > > > +			goto done;
> > > > +
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > >  		else {
> > > > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  		xfs_iext_remove(ip, icur, state);
> > > >  		xfs_iext_prev(ifp, icur);
> > > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > > -		ifp->if_nextents--;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > +		if (error)
> > > > +			goto done;
> > > > +
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > >  		else {
> > > > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  		xfs_iext_remove(ip, icur, state);
> > > >  		xfs_iext_prev(ifp, icur);
> > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > > -		ifp->if_nextents--;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  
> > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > >  		xfs_iext_next(ifp, icur);
> > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > >  		xfs_iext_next(ifp, icur);
> > > >  		xfs_iext_insert(ip, icur, &r[1], state);
> > > >  		xfs_iext_insert(ip, icur, &r[0], state);
> > > > -		ifp->if_nextents += 2;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, 2);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL)
> > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> > > >  		xfs_iext_remove(ip, icur, state);
> > > >  		xfs_iext_prev(ifp, icur);
> > > >  		xfs_iext_update_extent(ip, state, icur, &left);
> > > > -		ifp->if_nextents--;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL) {
> > > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> > > >  		 * Insert a new entry.
> > > >  		 */
> > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > -		ifp->if_nextents++;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		if (cur == NULL) {
> > > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> > > >  		 */
> > > >  		xfs_iext_remove(ip, icur, state);
> > > >  		xfs_iext_prev(ifp, icur);
> > > > -		ifp->if_nextents--;
> > > > +
> > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > +		if (error)
> > > > +			goto done;
> > > >  
> > > >  		flags |= XFS_ILOG_CORE;
> > > >  		if (!cur) {
> > > > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> > > >  		} else
> > > >  			flags |= xfs_ilog_fext(whichfork);
> > > >  
> > > > -		ifp->if_nextents++;
> > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > +		if (error)
> > > > +			goto done;
> > > > +
> > > >  		xfs_iext_next(ifp, icur);
> > > >  		xfs_iext_insert(ip, icur, &new, state);
> > > >  		break;
> > > > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> > > >  	 * Update the on-disk extent count, the btree if necessary and log the
> > > >  	 * inode.
> > > >  	 */
> > > > -	ifp->if_nextents--;
> > > > +	error = xfs_next_set(ip, whichfork, -1);
> > > > +	if (error)
> > > > +		goto done;
> > > > +
> > > >  	*logflags |= XFS_ILOG_CORE;
> > > >  	if (!cur) {
> > > >  		*logflags |= XFS_ILOG_DEXT;
> > > > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> > > >  	/* Add new extent */
> > > >  	xfs_iext_next(ifp, &icur);
> > > >  	xfs_iext_insert(ip, &icur, &new, 0);
> > > > -	ifp->if_nextents++;
> > > > +
> > > > +	error = xfs_next_set(ip, whichfork, 1);
> > > > +	if (error)
> > > > +		goto del_cursor;
> > > >  
> > > >  	if (cur) {
> > > >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > > > index 28b366275ae0..3bf5a2c391bd 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > > > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > > > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> > > >  
> > > >  	return 0;
> > > >  }
> > > > +
> > > > +int
> > > > +xfs_next_set(
> > > 
> > > "next"... please choose an abbreviation that doesn't collide with a
> > > common English word.
> > > 
> > > > +	struct xfs_inode	*ip,
> > > > +	int			whichfork,
> > > > +	int			delta)
> > > 
> > > Delta?  I thought this was a setter function?
> > > 
> > > > +{
> > > > +	struct xfs_ifork	*ifp;
> > > > +	int64_t			nr_exts;
> > > > +	int64_t			max_exts;
> > > > +
> > > > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > > > +
> > > > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > > > +		max_exts = MAXEXTNUM;
> > > > +	else if (whichfork == XFS_ATTR_FORK)
> > > > +		max_exts = MAXAEXTNUM;
> > > > +	else
> > > > +		ASSERT(0);
> > > > +
> > > > +	nr_exts = ifp->if_nextents + delta;
> > > 
> > > Nope, it's a modify function all right.  Then it should be named:
> > > 
> > > xfs_nextents_mod(ip, whichfork, delta)
> > > 
> > > > +	if ((delta > 0 && nr_exts > max_exts)
> > > > +		|| (delta < 0 && nr_exts < 0))
> > > 
> > > Line these up, please.  e.g.,
> > > 
> > > 	if ((delta > 0 && nr_exts > max_exts) ||
> > >             (delta < 0 && nr_exts < 0))

HA even the maintainer gets it wrong. :(

> > > 
> > > --D
> > > 
> > > > +		return -EOVERFLOW;
> > 
> > Oh, also, shouldn't this be EFBIG ("File too big")?
> 
> True, EFBIG is more appropriate than EOVERFLOW in this case.
> 
> Darrick, I have one question. The purpose of this patch is to fix the zero day
> bug where we overflow extent counter silently and get to know about it only
> when flushing the incore inode to disk. Patches that come later in the series
> modify the extent count limits to 2^32 (for xattr fork) and 2^47 (for data
> fork). If this patch is not required to be sent to stable release, I will drop
> it from the series.

I would leave it in the series, unless you mean to send this as a
separate cleanup ahead of everything else?

Now that I think about it, this probably should become its own cleanup
series.  I just realized that if we error out EFBIG in the middle of a
bmap function, we're probably going to end up cancelling a dirty
transaction, which will cause an fs shutdown.  Since xfs cannot undo the
effects of a dirty transaction, we have to be able to error out earlier
in the transaction sequence so that we can back out to userspace without
affecting the filesystem.

IOWs, this means that any code path that could increase an inode's
extent count will have to check the inode (after we take the ILOCK)
to make sure that it can accommodate however many more extents we're
adding.

static int
xfs_trans_inode_reserve_extent_count(ip, whichfork, nrtoadd)
{
	if (MAX{,A}EXTNUM - XFS_IFORK_PTR(ip, whichfork)->if_nextents < nrtoadd)
		return -EFBIG;
	return 0;
}

	error = xfs_trans_alloc(..., &tp);
	if (error)
		goto out;

	xfs_ilock(ip, XFS_ILOCK_EXCL);
	xfs_trans_ijoin(tp, ip, 0);

	error = xfs_trans_inode_reserve_extent_count(ip, whichfork, nrtoadd);
	if (error)
		goto out;

	error = xfs_trans_reserve_quota_nblks(tp, ip, ...);
	if (error)
		goto out;

...or something like that.  And now suddenly this grows into its own
cleanup series. :/
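
To make the ordering concrete, here is a minimal userspace sketch (not
kernel code) of checking the extent count before anything is modified.
The struct, the limit and the demo_* names are invented for illustration;
only the idea of returning -EFBIG before any state is touched comes from
the discussion above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode fork; not the kernel's struct xfs_ifork. */
struct demo_fork {
	int64_t	nextents;	/* current extent count */
	int64_t	max_extents;	/* limit for this fork (assumed value) */
};

/*
 * Check that @nr_to_add more extents fit, without changing anything.
 * Written as a subtraction so the comparison itself cannot overflow.
 */
int
demo_reserve_extent_count(const struct demo_fork *fork, int64_t nr_to_add)
{
	assert(nr_to_add >= 0);
	assert(fork->nextents >= 0 && fork->nextents <= fork->max_extents);

	if (fork->max_extents - fork->nextents < nr_to_add)
		return -EFBIG;
	return 0;
}

/*
 * Caller pattern: validate first, modify second.  On failure nothing has
 * been touched, so there is nothing to undo.
 */
int
demo_add_extents(struct demo_fork *fork, int64_t nr_to_add)
{
	int	error;

	error = demo_reserve_extent_count(fork, nr_to_add);
	if (error)
		return error;

	fork->nextents += nr_to_add;
	return 0;
}
```

A fork sitting one below its limit accepts one more extent and then
rejects the next request with -EFBIG, leaving the count unchanged.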

> Also, I can't have a "fixes" tag because this is a zero
> day bug.

Everything is a zero day now... but establishing a base for this one is
probably not going to be easy since I bet the overflow has existed since
the beginning.

--D

> 
> > 
> > --D
> > 
> > > > +
> > > > +	ifp->if_nextents = nr_exts;
> > > > +
> > > > +	return 0;
> > > > +}
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > > > index a4953e95c4f3..a84ae42ace79 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > > > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > > > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> > > >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> > > >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> > > >  
> > > > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > > >  #endif	/* __XFS_INODE_FORK_H__ */
> > 
> 
> 
> -- 
> chandan
> 
> 
> 


* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-09 14:22     ` Chandan Babu R
@ 2020-06-09 17:10       ` Darrick J. Wong
  2020-06-19 14:36         ` Christoph Hellwig
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-09 17:10 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Tue, Jun 09, 2020 at 07:52:37PM +0530, Chandan Babu R wrote:
> On Monday 8 June 2020 9:54:25 PM IST Darrick J. Wong wrote:
> > On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > > The following error message was noticed when a workload added one
> > > million xattrs, deleted 50% of them and then inserted 400,000 new
> > > xattrs.
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > 
> > > > The error message was printed while unmounting the filesystem. The
> > > value printed under "total extents" indicates that we overflowed the
> > > per-inode signed 16-bit xattr extent counter.
> > > 
> > > Instead of letting this silent corruption occur, this patch checks for
> > > extent counter (both data and xattr) overflow before we assign the
> > > new value to the corresponding in-memory extent counter.
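
As a quick sanity check of the wraparound: a count of 45620 kept in a
signed 16-bit field reads back as -19916.  The sketch below does that
conversion by hand in userspace C.  Note the assumptions: 45620 is merely
the smallest non-negative count that wraps to the logged value, the log
line does not tell us the real count (and "total extents" may sum more
than the xattr counter), and the helper name is made up for this example.

```c
#include <stdint.h>

/* Interpret the low 16 bits of @count as a two's-complement signed value. */
int32_t
demo_wrap_to_s16(uint32_t count)
{
	uint32_t	low = count & 0xffffu;

	/* Values with the top bit of the 16-bit field set are negative. */
	return low >= 0x8000u ? (int32_t)low - 0x10000 : (int32_t)low;
}
```

demo_wrap_to_s16(45620) yields -19916, while anything up to 32767 comes
back unchanged.
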
> > > 
> > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> > >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> > >  3 files changed, 104 insertions(+), 18 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index edc63dba007f..798fca5c52af 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> > >  	xfs_iext_first(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &rec, 0);
> > >  
> > > -	ifp->if_nextents = 1;
> > > +	error = xfs_next_set(ip, whichfork, 1);
> > > +	if (error)
> > > +		goto done;
> > 
> > Are you sure that if_nextents == 0 is a precondition here?  Technically
> > speaking, this turns an assignment into an increment operation.
> 
> Hmm. I didn't pay attention to that. I will check and update the code
> appropriately. Thanks for pointing this out.
> 
> > 
> > > +
> > >  	ip->i_d.di_nblocks = 1;
> > >  	xfs_trans_mod_dquot_byino(tp, ip,
> > >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> > >  		xfs_iext_prev(ifp, &bma->icur);
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		PREV.br_startblock = new->br_startblock;
> > >  		PREV.br_state = new->br_state;
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		 * The left neighbor is not contiguous.
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		 * The right neighbor is not contiguous.
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_next(ifp, &bma->icur);
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (bma->cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > -		ifp->if_nextents -= 2;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -2);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > >  		else {
> > > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > >  		else {
> > > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, &r[1], state);
> > >  		xfs_iext_insert(ip, icur, &r[0], state);
> > > -		ifp->if_nextents += 2;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 2);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL)
> > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &left);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL) {
> > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> > >  		 * Insert a new entry.
> > >  		 */
> > >  		xfs_iext_insert(ip, icur, new, state);
> > > -		ifp->if_nextents++;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		if (cur == NULL) {
> > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> > >  		 */
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > > -		ifp->if_nextents--;
> > > +
> > > +		error = xfs_next_set(ip, whichfork, -1);
> > > +		if (error)
> > > +			goto done;
> > >  
> > >  		flags |= XFS_ILOG_CORE;
> > >  		if (!cur) {
> > > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> > >  		} else
> > >  			flags |= xfs_ilog_fext(whichfork);
> > >  
> > > -		ifp->if_nextents++;
> > > +		error = xfs_next_set(ip, whichfork, 1);
> > > +		if (error)
> > > +			goto done;
> > > +
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, &new, state);
> > >  		break;
> > > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> > >  	 * Update the on-disk extent count, the btree if necessary and log the
> > >  	 * inode.
> > >  	 */
> > > -	ifp->if_nextents--;
> > > +	error = xfs_next_set(ip, whichfork, -1);
> > > +	if (error)
> > > +		goto done;
> > > +
> > >  	*logflags |= XFS_ILOG_CORE;
> > >  	if (!cur) {
> > >  		*logflags |= XFS_ILOG_DEXT;
> > > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> > >  	/* Add new extent */
> > >  	xfs_iext_next(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &new, 0);
> > > -	ifp->if_nextents++;
> > > +
> > > +	error = xfs_next_set(ip, whichfork, 1);
> > > +	if (error)
> > > +		goto del_cursor;
> > >  
> > >  	if (cur) {
> > >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > > index 28b366275ae0..3bf5a2c391bd 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> > >  
> > >  	return 0;
> > >  }
> > > +
> > > +int
> > > +xfs_next_set(
> > 
> > "next"... please choose an abbreviation that doesn't collide with a
> > common English word.
> > 
> > > +	struct xfs_inode	*ip,
> > > +	int			whichfork,
> > > +	int			delta)
> > 
> > Delta?  I thought this was a setter function?
> > 
> > > +{
> > > +	struct xfs_ifork	*ifp;
> > > +	int64_t			nr_exts;
> > > +	int64_t			max_exts;
> > > +
> > > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > > +
> > > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > > +		max_exts = MAXEXTNUM;
> > > +	else if (whichfork == XFS_ATTR_FORK)
> > > +		max_exts = MAXAEXTNUM;
> > > +	else
> > > +		ASSERT(0);
> > > +
> > > +	nr_exts = ifp->if_nextents + delta;
> > 
> > Nope, it's a modify function all right.  Then it should be named:
> > 
> > xfs_nextents_mod(ip, whichfork, delta)
> 
> Ok. I will change this.

<nod> Though as I (just) pointed out in the other part of this thread,
the range check on the extent count ought to come earlier in the
transaction sequence so that we can return EFBIG to userspace without
having to cancel a (potentially dirty) transaction.

--D

> > 
> > > +	if ((delta > 0 && nr_exts > max_exts)
> > > +		|| (delta < 0 && nr_exts < 0))
> > 
> > Line these up, please.  e.g.,
> > 
> > 	if ((delta > 0 && nr_exts > max_exts) ||
> >             (delta < 0 && nr_exts < 0))
> 
> Ok.
> 
> > 
> > --D
> > 
> > > +		return -EOVERFLOW;
> > > +
> > > +	ifp->if_nextents = nr_exts;
> > > +
> > > +	return 0;
> > > +}
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > > index a4953e95c4f3..a84ae42ace79 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> > >  
> > > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > >  #endif	/* __XFS_INODE_FORK_H__ */
> > 
> 
> 
> -- 
> chandan
> 
> 
> 


* Re: [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-09 14:23     ` Chandan Babu R
@ 2020-06-09 18:40       ` Darrick J. Wong
  2020-06-10  6:23         ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-09 18:40 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Tue, Jun 09, 2020 at 07:53:55PM +0530, Chandan Babu R wrote:
> On Tuesday 9 June 2020 2:29:22 AM IST Darrick J. Wong wrote:
> > On Sat, Jun 06, 2020 at 01:57:41PM +0530, Chandan Babu R wrote:
> > > xfs/306 causes the following call trace when using a data fork with a
> > > maximum extent count of 2^47,
> > > 
> > >  XFS (loop0): Mounting V5 Filesystem
> > >  XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
> > >  XFS (loop0): AAIEEE! Log failed size checks. Abort!
> > >  XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
> > 
> > Uh... won't applying the corresponding MAXEXTNUM changes and whatnot to
> > xfsprogs result in mkfs formatting a log with 9075 blocks?  Is there
> > some other mistake in the minimum log size computations?
> 
> The call trace given below shows up when using 2^47 as the maximum extent
> count for both Dir and Non-dir inodes.
> 
> However, using 2^27 as the maximum
> extent count for directories would reduce the log reservation value for
> "rename" operation (which has the maximum sized log reservation when using the
> below mentioned FS geometry).
> 
> "Rename" log reservation is a function of the maximum directory BMBT height
> which in turn is a function of the maximum number of extents that can be
> occupied by a directory.
> 
> Hence when moving the MAXEXTNUM changes to xfsprogs, the corresponding
> "maximum directory extent count" changes must also be moved as a
> dependency.
> 
> With this patchset applied (i.e. With 2^27 as the maximum extent count for
> directory inodes and 2^47 as the maximum extent count for non-directory
> inodes), xfs_log_calc_minimum_size() in kernel returns 8691 blocks.

Hmm, 8691, you say?  Ok, that's a helpful clue...

MAXEXTNUM	min log blocks
2^47		9,075
2^32		8,906
2^27		8,691

...and now I think I finally understand the goal here.  The existing
xfs_bmap_compute_maxlevels computes the max bmbt height from MAXEXTNUM
(2^32).  The file rename reservation computation uses this max bmbt
height, which works out to a min log size of 8,906 blocks.  Once you
change MAXEXTNUM to 2^47, this computation turns into 9,075 blocks.

This means that if you use mkfs.xfs 5.6.0 to create a small, vanilla V5
filesystem, it won't mount on your development kernel due to the minimum
log size checks, even if you didn't enable the larger extent counters.

Therefore, you're introducing m_bm_dir_maxlevel to store the max bmbt
height for a directory, using that to compute the rename reservation,
and lo and behold the min log size never goes above the old limit.
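
To show why a larger extent limit translates into a taller tree, here is
a standalone sketch of a max-height computation modeled on the loop in
xfs_bmap_compute_maxlevels.  It is an illustration, not the kernel
function: the demo_ name is made up, the record counts are parameters
rather than values read from a mount, and it uses 64-bit arithmetic
throughout.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case btree height for @maxleafents leaf records, assuming every
 * block holds only the minimum number of records and the root (which
 * lives in the inode) holds at most @maxrootrecs records.
 * Callers must keep maxleafents + minleafrecs below UINT64_MAX.
 */
unsigned int
demo_bmbt_maxlevels(uint64_t maxleafents, uint64_t minleafrecs,
		    uint64_t minnoderecs, uint64_t maxrootrecs)
{
	uint64_t	maxblocks;
	unsigned int	level;

	/* minnoderecs must exceed 1 or the loop below never shrinks. */
	assert(minleafrecs > 0 && minnoderecs > 1 && maxrootrecs > 0);

	/* Number of leaf blocks needed in the worst case. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;
	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;	/* fits in the inode root */
		else
			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
	}
	return level;
}
```

With made-up record counts of 10 per block and a 3-record root, 1000 leaf
entries need 3 levels.  Raising maxleafents can only keep the height the
same or increase it, which is what feeds into the larger rename
reservation.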

This is problematic... (scroll down, please)

> > 
> > >  ------------[ cut here ]------------
> > >  WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
> > >  Modules linked in:
> > >  CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
> > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > >  RIP: 0010:assfail+0x25/0x28
> > >  Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
> > >  RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
> > >  RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
> > >  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
> > >  RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
> > >  R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
> > >  R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
> > >  FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
> > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > >  CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
> > >  Call Trace:
> > >   xfs_log_mount+0xf8/0x300
> > >   xfs_mountfs+0x46e/0x950
> > >   xfs_fc_fill_super+0x318/0x510
> > >   ? xfs_mount_free+0x30/0x30
> > >   get_tree_bdev+0x15c/0x250
> > >   vfs_get_tree+0x25/0xb0
> > >   do_mount+0x740/0x9b0
> > >   ? memdup_user+0x41/0x80
> > >   __x64_sys_mount+0x8e/0xd0
> > >   do_syscall_64+0x48/0x110
> > >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > >  RIP: 0033:0x7fd6c5f2ccda
> > >  Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
> > >  RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> > >  RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
> > >  RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
> > >  RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
> > >  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > >  R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
> > >  ---[ end trace 6436391b468bc652 ]---
> > >  XFS (loop0): log mount failed
> > > 
> > > The corresponding filesystem was created using mkfs options
> > > "-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".
> > > 
> > > i.e. We have a filesystem of size 20MiB, data block size of 1KiB and
> > > directory block size of 64KiB. Filesystems of size < 1GiB can have less
> > > than 10MiB on-disk log (Please refer to calculate_log_size() in
> > > xfsprogs).
> > 
> > Hm.  You don't seem to be setting either of the big extent count feature
> > flags here.
> > 
> > Is this something that happens after a filesystem gets *upgraded* to
> > support extent counts > 2^32?  If it's this second case, then I think
> > the function that upgrades the filesystem has to reject the change if it
> > would cause the minimum log size checks to fail.
> 
> This happens when having 2^47 as the value of MAXEXTNUM irrespective of
> whether the filesystem's superblock has the big extent count feature flag set
> i.e. this patchset
> 
> Using 2^47 as the value of MAXEXTNUM causes the height of the data fork BMBT
> tree to increase when compared to the height of the tree when using 2^32
> MAXEXTNUM (In the case of the fs geometry that caused the above call trace,
> the height increased by 1). The call xfs_bmap_compute_maxlevels(mp,
> XFS_DATA_FORK) (invoked as part of FS mount operation) uses MAXEXTNUM as input
> to calculate the maximum height of the data fork BMBT and the result is stored
> in mp->m_bm_maxlevels[XFS_DATA_FORK]. This value is then used when calculating
> log reservations for various fs operations. Hence the log reservations of fs
> operations now change regardless of whether the "big extent count" feature
> flag is set or not.

"...or not."

Urrrk, no.  The log reservation calculations for existing filesystems
must not change, because (at best) this will cause subtle log behavior
changes due to the fluctuating reservation sizes; and (at worst) it can
cause the same log minimum size mounting problems you observed above.

If you disturb the log reservations for existing filesystems such
that the minimum log size goes up, this means that small filesystems
created with an old mkfs will now fail to mount with the new kernel.
This is never acceptable.

If you disturb the log reservations such that the minimum log size goes
down, this means that when those changes get pulled in by the xfsprogs
maintainer, a new mkfs will produce small filesystems that won't mount
on older kernels.  The only way this is acceptable is if the changes
only affect filesystems with a feature flag set that would cause all
of those older kernels to warn about the feature being EXPERIMENTAL.

Either way, users end up broken.
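
Both failure directions can be seen with a toy version of the mount-time
check.  This is a sketch under assumptions: the function name and the
-EINVAL return value are made up for the example, and it assumes mkfs
sized the log right at the minimum it computed.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical mount-time check: refuse the mount when the on-disk log is
 * smaller than the minimum this kernel computes from its own reservations.
 */
int
demo_check_log_size(uint32_t ondisk_log_blocks, uint32_t kernel_min_log_blocks)
{
	assert(kernel_min_log_blocks > 0);

	if (ondisk_log_blocks < kernel_min_log_blocks)
		return -EINVAL;
	return 0;
}
```

Using the numbers from the table above: demo_check_log_size(8906, 9075)
fails (old mkfs, new kernel whose minimum went up), and
demo_check_log_size(8691, 8906) fails as well (new mkfs sized for the
smaller minimum, old kernel), while demo_check_log_size(8906, 8906)
passes.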

> > 
> > Granted, I don't understand the need (in the next patch) to special case
> > bmbt maxlevels for directory data forks.  That's probably muddying up
> > my ability to figure all this out.  Yes I did read this series
> > backwards. :)
> 
> Using a separate maximum extent count for directory data fork was required to
> reduce the increased log reservations described above. To be precise, rename
> > operation invokes XFS_DIROP_LOG_COUNT() which indirectly uses
> > mp->m_bm_maxlevels[XFS_DATA_FORK] for its calculations. A modified kernel
> > with 2^47 as the value of MAXEXTNUM ended up with a taller data fork BMBT
> > tree. Hence the log reservation space for the rename operation became larger.
> 
> The idea of special handling of "maximum extents for directory data fork" came
> up later when trying to find a way to reduce the log reservation for the
> rename operation.

I think a better way to handle the directory operation reservations is:

1. Introduce XFS_MAXDIREXTNUM == 2^32-1, and use that to compute
   m_bm_dir_maxlevel for directories.

2. Use m_bm_dir_maxlevel to compute the rename reservations, like you do
   here.

3. As a cleanup, split XFS_NEXTENTADD_SPACE_RES into three separate
   helpers: one for attr forks (a), one for regular file data forks (b),
   and one for !S_ISREG() data forks (c).  The DAENTER macros can switch
   between (a) and (c).  Anything that knows it's being run against a
   regular file can use (b).  Symlinks and rtbitmaps can use (c).

   We then add a separate helper taking an xfs_inode and whichfork to
   compute the correct value for the callers that have non-variable
   arguments.
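
A rough userspace sketch of what step 3 could look like.  All names here
(struct demo_mount, the demo_* helpers, the boolean arguments) are
hypothetical and chosen for this example; the arithmetic follows the
XFS_NEXTENTADD_SPACE_RES expansion quoted further down in this mail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the per-mount geometry. */
struct demo_mount {
	unsigned int	contig_per_block;	/* XFS_MAX_CONTIG_EXTENTS_PER_BLOCK */
	unsigned int	bm_maxlevels_data;	/* regular file data forks */
	unsigned int	bm_maxlevels_attr;	/* attr forks */
	unsigned int	bm_dir_maxlevel;	/* !S_ISREG() data forks */
};

/* Common arithmetic: blocks to reserve for adding @nr_extents extents. */
static unsigned int
demo_nextentadd(const struct demo_mount *mp, unsigned int nr_extents,
		unsigned int maxlevels)
{
	assert(mp->contig_per_block > 0 && maxlevels > 0);

	return ((nr_extents + mp->contig_per_block - 1) /
		mp->contig_per_block) * (maxlevels - 1);
}

/* (a) attr forks */
unsigned int
demo_attr_nextentadd_space_res(const struct demo_mount *mp, unsigned int nr)
{
	return demo_nextentadd(mp, nr, mp->bm_maxlevels_attr);
}

/* (b) regular file data forks */
unsigned int
demo_file_nextentadd_space_res(const struct demo_mount *mp, unsigned int nr)
{
	return demo_nextentadd(mp, nr, mp->bm_maxlevels_data);
}

/* (c) !S_ISREG() data forks: directories, symlinks, rt bitmaps */
unsigned int
demo_nonfile_nextentadd_space_res(const struct demo_mount *mp, unsigned int nr)
{
	return demo_nextentadd(mp, nr, mp->bm_dir_maxlevel);
}

/* Dispatcher for callers that only know the inode kind and the fork. */
unsigned int
demo_inode_nextentadd_space_res(const struct demo_mount *mp,
				bool is_regular_file, bool attr_fork,
				unsigned int nr)
{
	if (attr_fork)
		return demo_attr_nextentadd_space_res(mp, nr);
	if (is_regular_file)
		return demo_file_nextentadd_space_res(mp, nr);
	return demo_nonfile_nextentadd_space_res(mp, nr);
}
```

The point of the split is that only (b) ever sees the taller tree, so
directory and attr reservations are computed from heights that do not
change when the regular-file limit grows.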

This means that the log reservations will stay the same, regardless of
whether the bigfork feature is enabled.  I think this will be safe for
the attr extent count expansion, since we aren't letting the attr fork
expand beyond 2^32 extents, which means the max bmbt height there will
never be larger than anything we've ever seen before.

In my head I've convinced myself that this will keep the code simpler in
the long run, but maybe the rest of you have other ideas or flames? :D

--D

> > 
> > --D
> > 
> > > The largest reservation space was contributed by the rename
> > > operation. The corresponding calculation is done inside
> > > xfs_calc_rename_reservation(). In this case, the value returned by this
> > > function is,
> > > 
> > > xfs_calc_inode_res(mp, 4)
> > > + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))
> > > 
> > > xfs_calc_inode_res(mp, 4) returns a constant value of 3040 bytes
> > > regardless of the maximum data fork extent count.
> > > 
> > > The largest contribution to the rename operation was by "2 *
> > > XFS_DIROP_LOG_COUNT(mp)" and it is a function of maximum height of a
> > > directory's BMBT tree.
> > > 
> > > XFS_DIROP_LOG_COUNT() is a sum of,
> > > 
> > > 1. The maximum number of dabtree blocks that needs to be logged
> > >    i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) *
> > >    XFS_DAENTER_DBS(mp,w).  For directories, this evaluates
> > >    to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64 * (5 + 2)) = 448.
> > > 
> > > 2. The corresponding maximum number of BMBT blocks that needs to be
> > >    logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
> > >    XFS_DAENTER_BMAP1B(mp,w)
> > > 
> > >    XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7
> > > 
> > >    XFS_DAENTER_BMAP1B(mp,w)
> > >    = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
> > >    = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
> > >    = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
> > >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
> > > 
> > >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() =
> > >    mp->m_alloc_mxr[0] - mp->m_alloc_mnr[0] = 121 - 60 = 61
> > > 
> > >    XFS_DAENTER_BMAP1B(mp,w) =
> > >    ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
> > >    XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
> > >    = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
> > >    = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
> > >    = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
> > >    = 2 * (8 - 1)
> > >    = 14
> > > 
> > >    With 2^32 as the maximum extent count the maximum height of the bmap btree
> > >    was 7. Now with 2^47 maximum extent count, the height has increased to 8.
> > > 
> > >    Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.
> > > 
> > > XFS_DIROP_LOG_COUNT() = 448 + 98 = 546.
> > > 2 * XFS_DIROP_LOG_COUNT() = 2 * 546 = 1092.
> > > 
> > > With 2^32 max extent count, XFS_DIROP_LOG_COUNT() evaluates to
> > > 533. Hence 2 * XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.
> > > 
> > > This small difference of 1092 - 1066 = 26 fs blocks is sufficient to
> > > trip us over the minimum log size check.
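
The arithmetic above can be replayed with a few lines of userspace C.
This is only a sketch of the breakdown as written in the commit message:
the function name and parameters are invented, and the constants (64
filesystem blocks per directory block, 61 contiguous extents per block,
node depth 5) are the ones quoted for this particular geometry.

```c
#include <assert.h>

#define DEMO_DA_NODE_MAXDEPTH	5	/* XFS_DA_NODE_MAXDEPTH as quoted above */

/*
 * Blocks logged by one directory operation, following the quoted expansion
 * of XFS_DIROP_LOG_COUNT() = XFS_DAENTER_BLOCKS() + XFS_DAENTER_BMAPS().
 */
unsigned int
demo_dirop_log_count(unsigned int dir_blk_fsbs, unsigned int contig_per_block,
		     unsigned int bm_maxlevels)
{
	unsigned int	dbs = DEMO_DA_NODE_MAXDEPTH + 2; /* XFS_DAENTER_DBS */
	unsigned int	blocks;		/* XFS_DAENTER_BLOCKS */
	unsigned int	bmap1b;		/* XFS_DAENTER_BMAP1B */

	assert(contig_per_block > 0 && bm_maxlevels > 0);

	blocks = dir_blk_fsbs * dbs;
	bmap1b = ((dir_blk_fsbs + contig_per_block - 1) / contig_per_block) *
		 (bm_maxlevels - 1);

	return blocks + dbs * bmap1b;
}
```

demo_dirop_log_count(64, 61, 8) gives 546, matching the figure above.
Under this formula each extra BMBT level adds 7 * 2 = 14 blocks, so a
height of 7 comes out as 532 here rather than the 533 quoted above; the
sketch does not try to account for that one-block difference.
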
> > > 
> > > A future commit in this series will use 2^27 as the maximum directory
> > > extent count. This will result in a shorter directory BMBT tree.  Log
> > > reservation calculations that are applicable only to
> > > directories (e.g. XFS_DIROP_LOG_COUNT()) can then choose this instead of
> > > non-dir data fork BMBT height.
> > > 
> > > This commit introduces a new member in 'struct xfs_mount' to hold the
> > > maximum BMBT height of a directory. At present, the maximum height of a
> > > > directory BMBT is the same as the maximum height of a non-directory
> > > BMBT. A future commit will change the parameters used as input for
> > > computing the maximum height of a directory BMBT.
> > > 
> > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c | 17 ++++++++++++++---
> > >  fs/xfs/libxfs/xfs_bmap.h |  3 ++-
> > >  fs/xfs/xfs_mount.c       |  5 +++--
> > >  fs/xfs/xfs_mount.h       |  1 +
> > >  4 files changed, 20 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 798fca5c52af..01e2b543b139 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -50,7 +50,8 @@ kmem_zone_t		*xfs_bmap_free_item_zone;
> > >  void
> > >  xfs_bmap_compute_maxlevels(
> > >  	xfs_mount_t	*mp,		/* file system mount structure */
> > > -	int		whichfork)	/* data or attr fork */
> > > +	int		whichfork,	/* data or attr fork */
> > > +	int		dir_bmbt)	/* Dir or non-dir data fork */
> > >  {
> > >  	int		level;		/* btree level */
> > >  	uint		maxblocks;	/* max blocks at this level */
> > > @@ -60,6 +61,9 @@ xfs_bmap_compute_maxlevels(
> > >  	int		minnoderecs;	/* min records in node block */
> > >  	int		sz;		/* root block size */
> > >  
> > > +	if (whichfork == XFS_ATTR_FORK)
> > > +		ASSERT(dir_bmbt == 0);
> > > +
> > >  	/*
> > >  	 * The maximum number of extents in a file, hence the maximum number of
> > >  	 * leaf entries, is controlled by the size of the on-disk extent count,
> > > @@ -75,8 +79,11 @@ xfs_bmap_compute_maxlevels(
> > >  	 * of a minimum size available.
> > >  	 */
> > >  	if (whichfork == XFS_DATA_FORK) {
> > > -		maxleafents = MAXEXTNUM;
> > >  		sz = XFS_BMDR_SPACE_CALC(MINDBTPTRS);
> > > +		if (dir_bmbt)
> > > +			maxleafents = MAXEXTNUM;
> > > +		else
> > > +			maxleafents = MAXEXTNUM;
> > >  	} else {
> > >  		maxleafents = MAXAEXTNUM;
> > >  		sz = XFS_BMDR_SPACE_CALC(MINABTPTRS);
> > > @@ -91,7 +98,11 @@ xfs_bmap_compute_maxlevels(
> > >  		else
> > >  			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
> > >  	}
> > > -	mp->m_bm_maxlevels[whichfork] = level;
> > > +
> > > +	if (whichfork == XFS_DATA_FORK && dir_bmbt)
> > > +		mp->m_bm_dir_maxlevel = level;
> > > +	else
> > > +		mp->m_bm_maxlevels[whichfork] = level;
> > >  }
> > >  
> > >  STATIC int				/* error */
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
> > > index 6028a3c825ba..4250c9ab4b75 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.h
> > > +++ b/fs/xfs/libxfs/xfs_bmap.h
> > > @@ -187,7 +187,8 @@ void	xfs_bmap_local_to_extents_empty(struct xfs_trans *tp,
> > >  void	__xfs_bmap_add_free(struct xfs_trans *tp, xfs_fsblock_t bno,
> > >  		xfs_filblks_t len, const struct xfs_owner_info *oinfo,
> > >  		bool skip_discard);
> > > -void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork);
> > > +void	xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork,
> > > +		int dir_bmbt);
> > >  int	xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip,
> > >  		xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork);
> > >  int	xfs_bmap_last_before(struct xfs_trans *tp, struct xfs_inode *ip,
> > > diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> > > index bb91f04266b9..d8ebfc67bb63 100644
> > > --- a/fs/xfs/xfs_mount.c
> > > +++ b/fs/xfs/xfs_mount.c
> > > @@ -711,8 +711,9 @@ xfs_mountfs(
> > >  		goto out;
> > >  
> > >  	xfs_alloc_compute_maxlevels(mp);
> > > -	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK);
> > > -	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK);
> > > +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 0);
> > > +	xfs_bmap_compute_maxlevels(mp, XFS_DATA_FORK, 1);
> > > +	xfs_bmap_compute_maxlevels(mp, XFS_ATTR_FORK, 0);
> > >  	xfs_ialloc_setup_geometry(mp);
> > >  	xfs_rmapbt_compute_maxlevels(mp);
> > >  	xfs_refcountbt_compute_maxlevels(mp);
> > > diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> > > index aba5a1579279..9dbf036ddace 100644
> > > --- a/fs/xfs/xfs_mount.h
> > > +++ b/fs/xfs/xfs_mount.h
> > > @@ -133,6 +133,7 @@ typedef struct xfs_mount {
> > >  	uint			m_refc_mnr[2];	/* min refc btree records */
> > >  	uint			m_ag_maxlevels;	/* XFS_AG_MAXLEVELS */
> > >  	uint			m_bm_maxlevels[2]; /* XFS_BM_MAXLEVELS */
> > > +	uint			m_bm_dir_maxlevel;
> > >  	uint			m_rmap_maxlevels; /* max rmap btree levels */
> > >  	uint			m_refc_maxlevels; /* max refcount btree level */
> > >  	xfs_extlen_t		m_ag_prealloc_blocks; /* reserved ag blocks */
> > 
> 
> -- 
> chandan
> 
> 
> 


* Re: [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-09 18:40       ` Darrick J. Wong
@ 2020-06-10  6:23         ` Chandan Babu R
  2020-06-11  6:38           ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Chandan Babu R @ 2020-06-10  6:23 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Wednesday 10 June 2020 12:10:02 AM IST Darrick J. Wong wrote:
> On Tue, Jun 09, 2020 at 07:53:55PM +0530, Chandan Babu R wrote:
> > On Tuesday 9 June 2020 2:29:22 AM IST Darrick J. Wong wrote:
> > > On Sat, Jun 06, 2020 at 01:57:41PM +0530, Chandan Babu R wrote:
> > > > xfs/306 causes the following call trace when using a data fork with a
> > > > maximum extent count of 2^47,
> > > > 
> > > >  XFS (loop0): Mounting V5 Filesystem
> > > >  XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
> > > >  XFS (loop0): AAIEEE! Log failed size checks. Abort!
> > > >  XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
> > > 
> > > Uh... won't applying the corresponding MAXEXTNUM changes and whatnot to
> > > xfsprogs result in mkfs formatting a log with 9075 blocks?  Is there
> > > some other mistake in the minimum log size computations?
> > 
> > The call trace given below shows up when using 2^47 as the maximum extent
> > count for both Dir and Non-dir inodes.
> > 
> > However, using 2^27 as the maximum
> > extent count for directories would reduce the log reservation value for
> > "rename" operation (which has the maximum sized log reservation when using the
> > below mentioned FS geometry).
> > 
> > "Rename" log reservation is a function of the maximum directory BMBT height
> > which in turn is a function of the maximum number of extents that can be
> > occupied by a directory.
> > 
> > Hence when moving the MAXEXTNUM changes to xfsprogs, the corresponding
> > "maximum directory extent count" changes must also be moved as a
> > dependency.
> > 
> > With this patchset applied (i.e. With 2^27 as the maximum extent count for
> > directory inodes and 2^47 as the maximum extent count for non-directory
> > inodes), xfs_log_calc_minimum_size() in kernel returns 8691 blocks.
> 
> Hmm, 8691, you say?  Ok, that's a helpful clue...
> 
> MAXEXTNUM	min log blocks
> 2^47		9,075
> 2^32		8,906
> 2^27		8,691
> 
> ...and now I think I finally understand the goal here.  The existing
> xfs_bmap_compute_maxlevels computes the max bmbt height from MAXEXTNUM
> (2^32).  The file rename reservation computation uses this max bmbt
> height, which works out to a min log size of 8,906 blocks.  Once you
> change MAXEXTNUM to 2^47, this computation turns into 9,075 blocks.
> 
> This means that if you use mkfs.xfs 5.6.0 to create a small, vanilla V5
> filesystem, it won't mount on your development kernel due to the minimum
> log size checks, even if you didn't enable the larger extent counters.
> 
> Therefore, you're introducing m_bm_dir_maxlevel to store the max bmbt
> height for a directory, using that to compute the rename reservation,
> and lo and behold the min log size never goes above the old limit.
> 
> This is problematic... (scroll down, please)
> 
> > > 
> > > >  ------------[ cut here ]------------
> > > >  WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
> > > >  Modules linked in:
> > > >  CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
> > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > >  RIP: 0010:assfail+0x25/0x28
> > > >  Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
> > > >  RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
> > > >  RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
> > > >  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
> > > >  RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
> > > >  R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
> > > >  R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
> > > >  FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
> > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >  CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
> > > >  Call Trace:
> > > >   xfs_log_mount+0xf8/0x300
> > > >   xfs_mountfs+0x46e/0x950
> > > >   xfs_fc_fill_super+0x318/0x510
> > > >   ? xfs_mount_free+0x30/0x30
> > > >   get_tree_bdev+0x15c/0x250
> > > >   vfs_get_tree+0x25/0xb0
> > > >   do_mount+0x740/0x9b0
> > > >   ? memdup_user+0x41/0x80
> > > >   __x64_sys_mount+0x8e/0xd0
> > > >   do_syscall_64+0x48/0x110
> > > >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > >  RIP: 0033:0x7fd6c5f2ccda
> > > >  Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
> > > >  RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> > > >  RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
> > > >  RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
> > > >  RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
> > > >  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > > >  R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
> > > >  ---[ end trace 6436391b468bc652 ]---
> > > >  XFS (loop0): log mount failed
> > > > 
> > > > The corresponding filesystem was created using mkfs options
> > > > "-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".
> > > > 
> > > > i.e. We have a filesystem of size 20MiB, data block size of 1KiB and
> > > > directory block size of 64KiB. Filesystems of size < 1GiB can have less
> > > > than 10MiB on-disk log (Please refer to calculate_log_size() in
> > > > xfsprogs).
> > > 
> > > Hm.  You don't seem to be setting either of the big extent count feature
> > > flags here.
> > > 
> > > Is this something that happens after a filesystem gets *upgraded* to
> > > support extent counts > 2^32?  If it's this second case, then I think
> > > the function that upgrades the filesystem has to reject the change if it
> > > would cause the minimum log size checks to fail.
> > 
> > This happens when having 2^47 as the value of MAXEXTNUM irrespective of
> > whether the filesystem's superblock has the big extent count feature flag set
> > i.e. this patchset
> > 
> > Using 2^47 as the value of MAXEXTNUM causes the height of the data fork BMBT
> > tree to increase when compared to the height of the tree when using 2^32
> > MAXEXTNUM (In the case of the fs geometry that caused the above call trace,
> > the height increased by 1). The call xfs_bmap_compute_maxlevels(mp,
> > XFS_DATA_FORK) (invoked as part of FS mount operation) uses MAXEXTNUM as input
> > to calculate the maximum height of the data fork BMBT and the result is stored
> > in mp->m_bm_maxlevels[XFS_DATA_FORK]. This value is then used when calculating
> > log reservations for various fs operations. Hence the log reservations of fs
> > operations now change regardless of whether the "big extent count" feature
> > flag is set or not.
> 
> "...or not."
> 
> Urrrk, no.  The log reservation calculations for existing filesystems
> must not change, because (at best) this will cause subtle log behavior
> changes due to the fluctuating reservation sizes; and (at worst) it can
> cause the same log minimum size mounting problems you observed above.
> 
> If you disturb the log reservations for existing filesystems such
> that the minimum log size goes up, this means that small filesystems
> created with an old mkfs will now fail to mount with the new kernel.
> This is never acceptable.
> 
> If you disturb the log reservations such that the minimum log size goes
> down, this means that when those changes get pulled in by the xfsprogs
> maintainer, a new mkfs will produce small filesystems that won't mount
> on older kernels.  The only way this is acceptable is if the changes
> only affect filesystems with a feature flag set that would cause all
> of those older kernels to warn about the feature being EXPERIMENTAL.

So the reduction of the rename log reservation size (by using 2^32 as the
maximum directory extent count) must be accompanied by setting a feature
flag other than the bigfork feature flag, right? I ask because the bigfork
feature flag is currently set at runtime, when we are about to overflow the
signed 16-bit attr or signed 32-bit data extent counters. Log reservation
values are pre-calculated during filesystem mount and cannot be changed at
runtime.

This also means that the patch "xfs: Fix log reservation calculation for xattr
insert operation" needs to be handled specially, since it
- Replaces two reservations (mount and runtime) with just one static
  reservation.
- Reduces the value of the "xattr set operation" reservation.
Hence older kernels may not be able to mount filesystems created with a
mkfs.xfs containing this patch.
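
To make the compatibility concern concrete, here is a minimal sketch of the
mount-time rule, using only the three minimum log sizes quoted in this thread
for the "-b size=1k -d size=20m -n size=64k" geometry. The helper name and
the enum constants are made up for illustration; the sketch assumes mkfs
formats the log at exactly the minimum it computes.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Minimum log sizes (in filesystem blocks) reported in this thread, keyed
 * by the maximum extent count that feeds the rename reservation. The
 * constant names are hypothetical.
 */
enum {
	SKETCH_MIN_LOG_2_47 = 9075,
	SKETCH_MIN_LOG_2_32 = 8906,
	SKETCH_MIN_LOG_2_27 = 8691,
};

/*
 * Hypothetical stand-in for the mount-time check: the log mounts only if
 * the on-disk log is at least as large as the minimum computed by the
 * running kernel.
 */
static bool sketch_log_size_ok(uint32_t ondisk_log_blocks,
			       uint32_t kernel_min_log_blocks)
{
	return ondisk_log_blocks >= kernel_min_log_blocks;
}
```

With these numbers, sketch_log_size_ok(SKETCH_MIN_LOG_2_32,
SKETCH_MIN_LOG_2_47) is false (old mkfs, kernel with a larger minimum), and
sketch_log_size_ok(SKETCH_MIN_LOG_2_27, SKETCH_MIN_LOG_2_32) is also false
(new mkfs with a smaller minimum, old kernel). Either direction of change
breaks somebody.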

> 
> Either way, users end up broken.
> 
> > > 
> > > Granted, I don't understand the need (in the next patch) to special case
> > > bmbt maxlevels for directory data forks.  That's probably muddying up
> > > my ability to figure all this out.  Yes I did read this series
> > > backwards. :)
> > 
> > Using a separate maximum extent count for directory data fork was required to
> > reduce the increased log reservations described above. To be precise, rename
> > operation invokes XFS_DIR_OP_LOG_COUNT() which indirectly uses
> > mp->m_bm_maxlevels[XFS_DATA_FORK] for its calculations. When using a modified
> > kernel which had 2^47 as the value for MAXEXTNUM resulted in a taller data
> > fork BMBT tree. Hence log reservation space for rename operation became larger.
> > 
> > The idea of special handling of "maximum extents for directory data fork" came
> > up later when trying to find a way to reduce the log reservation for the
> > rename operation.
> 
> I think a better way to handle the directory operation reservations is:
> 
> 1. Introduce XFS_MAXDIREXTNUM == 2^32-1, and use that to compute
>    m_bm_dir_maxlevel for directories.
> 
> 2. Use m_bm_dir_maxlevel to compute the rename reservations, like you do
>    here.
> 
> 3. As a cleanup, split XFS_NEXTENTADD_SPACE_RES into three separate
>    helpers: one for attr forks (a), one for regular file data forks (b),
>    and one for !S_ISREG() data forks(c).  The DAENTER macros can switch
>    between (a) and (c).  Anything that knows it's being run against a
>    regular file can use (b).  Symlinks and rtbitmaps can use (c).
> 
>    We then add a separate helper taking an xfs_inode and whichfork to
> >    compute the correct value for the callers that have non-variable
>    arguments.

Ok. I will take a shot at implementing the helpers. Thanks for the
suggestions.
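
As a rough picture of what the split could look like, here is a standalone
sketch. Every name in it is hypothetical, the structures are simplified
stand-ins for the mount fields, and the arithmetic (levels minus one, times
the number of groups of contiguous extents) is an assumption modeled on the
existing space reservation macros rather than a copy of them.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the kernel's mount fields. */
enum sketch_fork { SKETCH_DATA_FORK = 0, SKETCH_ATTR_FORK = 1 };

struct sketch_mount {
	unsigned int bm_maxlevels[2];	/* max BMBT height per fork */
	unsigned int bm_dir_maxlevel;	/* max BMBT height, non-regular data forks */
	unsigned int contig_extents_per_block;
};

/*
 * Assumed worst case for adding one extent: one new block at every level
 * except the root, which lives in the inode.
 */
static unsigned int sketch_extentadd_res(unsigned int maxlevels)
{
	return maxlevels - 1;
}

/* (a) attribute forks */
static unsigned int sketch_attr_res(const struct sketch_mount *mp)
{
	return sketch_extentadd_res(mp->bm_maxlevels[SKETCH_ATTR_FORK]);
}

/* (b) data forks of regular files */
static unsigned int sketch_reg_res(const struct sketch_mount *mp)
{
	return sketch_extentadd_res(mp->bm_maxlevels[SKETCH_DATA_FORK]);
}

/* (c) data forks of everything else (directories, symlinks, ...) */
static unsigned int sketch_nonreg_res(const struct sketch_mount *mp)
{
	return sketch_extentadd_res(mp->bm_dir_maxlevel);
}

/* Pick (a), (b) or (c) from the fork and the inode type. */
static unsigned int sketch_inode_extentadd_res(const struct sketch_mount *mp,
					       bool is_regular,
					       enum sketch_fork fork)
{
	if (fork == SKETCH_ATTR_FORK)
		return sketch_attr_res(mp);
	return is_regular ? sketch_reg_res(mp) : sketch_nonreg_res(mp);
}

/* Reservation for mapping nblocks blocks into the given fork. */
static unsigned int sketch_inode_nextentadd_res(const struct sketch_mount *mp,
						unsigned int nblocks,
						bool is_regular,
						enum sketch_fork fork)
{
	unsigned int per_block = mp->contig_extents_per_block;
	unsigned int groups = (nblocks + per_block - 1) / per_block;

	return groups * sketch_inode_extentadd_res(mp, is_regular, fork);
}
```

Callers that always operate on one kind of fork would call (a), (b) or (c)
directly; only callers that take an arbitrary inode and fork need the
selecting helper.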

> 
> This means that the log reservations will stay the same, regardless of
> whether the bigfork feature is enabled.

On a filesystem which has already used
1. 2^47 as the max data extent count,
2. 2^28 as the max directory extent count, and
3. 2^32 as the max xattr extent count
for computing the log reservations at mount time, an upgrade to the bigfork
feature would not affect the pre-calculated log reservation values.
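
For reference, the dependency between the maximum extent count and the BMBT
height can be sketched in a few lines. This follows the shape of the loop in
xfs_bmap_compute_maxlevels(), but the function name and parameters are
illustrative, and the record counts used below are toy values, not the ones
for any real block size.

```c
#include <stdint.h>

/*
 * Worst-case btree height for a fork that may hold max_extents extents,
 * assuming every block holds only the minimum number of records.
 *
 * Requires min_leaf_recs >= 1, min_node_recs >= 2 and max_root_recs >= 1.
 * All names are hypothetical; in the kernel these values come from the
 * mount structure and depend on the block size and inode fork size.
 */
static unsigned int sketch_bmbt_max_height(uint64_t max_extents,
					   uint64_t min_leaf_recs,
					   uint64_t min_node_recs,
					   uint64_t max_root_recs)
{
	unsigned int level;
	uint64_t blocks;

	/* Worst-case number of leaf blocks. */
	blocks = (max_extents + min_leaf_recs - 1) / min_leaf_recs;

	for (level = 1; blocks > 1; level++) {
		if (blocks <= max_root_recs)
			blocks = 1;	/* pointers fit in the inode root */
		else
			blocks = (blocks + min_node_recs - 1) / min_node_recs;
	}
	return level;
}
```

With toy values of 16 minimum records per leaf and node block and 4 records
in the root, the height is 7 for 2^27 extents, 8 for 2^32 and 12 for 2^47.
The real geometry gives different numbers, but the point is the same: the
height, and therefore every reservation derived from it, is fixed by the
limit that is fed in at mount time.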

> I think this will be safe for
> the attr extent count expansion, since we aren't letting the attr fork
> expand beyond 2^32 extents, which means the max bmbt height there will
> never be larger than anything we've ever seen before.
> 
> In my head I've convinced myself that this will keep the code simpler in
> the long run, but maybe the rest of you have other ideas or flames? :D
> 

-- 
chandan





* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-09 17:07         ` Darrick J. Wong
@ 2020-06-10  6:24           ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-10  6:24 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Tuesday 9 June 2020 10:37:34 PM IST Darrick J. Wong wrote:
> On Tue, Jun 09, 2020 at 07:52:48PM +0530, Chandan Babu R wrote:
> > On Monday 8 June 2020 10:02:16 PM IST Darrick J. Wong wrote:
> > > On Mon, Jun 08, 2020 at 09:24:25AM -0700, Darrick J. Wong wrote:
> > > > On Sat, Jun 06, 2020 at 01:57:40PM +0530, Chandan Babu R wrote:
> > > > > The following error message was noticed when a workload added one
> > > > > million xattrs, deleted 50% of them and then inserted 400,000 new
> > > > > xattrs.
> > > > > 
> > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > 
> > > > > The error message was printed during unmounting the filesystem. The
> > > > > value printed under "total extents" indicates that we overflowed the
> > > > > per-inode signed 16-bit xattr extent counter.
> > > > > 
> > > > > Instead of letting this silent corruption occur, this patch checks for
> > > > > extent counter (both data and xattr) overflow before we assign the
> > > > > new value to the corresponding in-memory extent counter.
> > > > > 
> > > > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_bmap.c       | 92 +++++++++++++++++++++++++++-------
> > > > >  fs/xfs/libxfs/xfs_inode_fork.c | 29 +++++++++++
> > > > >  fs/xfs/libxfs/xfs_inode_fork.h |  1 +
> > > > >  3 files changed, 104 insertions(+), 18 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > > > index edc63dba007f..798fca5c52af 100644
> > > > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > > > @@ -906,7 +906,10 @@ xfs_bmap_local_to_extents(
> > > > >  	xfs_iext_first(ifp, &icur);
> > > > >  	xfs_iext_insert(ip, &icur, &rec, 0);
> > > > >  
> > > > > -	ifp->if_nextents = 1;
> > > > > +	error = xfs_next_set(ip, whichfork, 1);
> > > > > +	if (error)
> > > > > +		goto done;
> > > > 
> > > > Are you sure that if_nextents == 0 is a precondition here?  Technically
> > > > speaking, this turns an assignment into an increment operation.
> > > > 
> > > > > +
> > > > >  	ip->i_d.di_nblocks = 1;
> > > > >  	xfs_trans_mod_dquot_byino(tp, ip,
> > > > >  		XFS_TRANS_DQ_BCOUNT, 1L);
> > > > > @@ -1594,7 +1597,10 @@ xfs_bmap_add_extent_delay_real(
> > > > >  		xfs_iext_remove(bma->ip, &bma->icur, state);
> > > > >  		xfs_iext_prev(ifp, &bma->icur);
> > > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > > > > -		ifp->if_nextents--;
> > > > > +
> > > > > +		error = xfs_next_set(bma->ip, whichfork, -1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (bma->cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -1698,7 +1704,10 @@ xfs_bmap_add_extent_delay_real(
> > > > >  		PREV.br_startblock = new->br_startblock;
> > > > >  		PREV.br_state = new->br_state;
> > > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (bma->cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -1764,7 +1773,10 @@ xfs_bmap_add_extent_delay_real(
> > > > >  		 * The left neighbor is not contiguous.
> > > > >  		 */
> > > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (bma->cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -1851,7 +1863,10 @@ xfs_bmap_add_extent_delay_real(
> > > > >  		 * The right neighbor is not contiguous.
> > > > >  		 */
> > > > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (bma->cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -1937,7 +1952,10 @@ xfs_bmap_add_extent_delay_real(
> > > > >  		xfs_iext_next(ifp, &bma->icur);
> > > > >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> > > > >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(bma->ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (bma->cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -2141,7 +2159,11 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  		xfs_iext_remove(ip, icur, state);
> > > > >  		xfs_iext_prev(ifp, icur);
> > > > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > > > -		ifp->if_nextents -= 2;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, -2);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > > +
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > >  		else {
> > > > > @@ -2193,7 +2215,11 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  		xfs_iext_remove(ip, icur, state);
> > > > >  		xfs_iext_prev(ifp, icur);
> > > > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > > > > -		ifp->if_nextents--;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > > +
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > >  		else {
> > > > > @@ -2235,7 +2261,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  		xfs_iext_remove(ip, icur, state);
> > > > >  		xfs_iext_prev(ifp, icur);
> > > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > > > -		ifp->if_nextents--;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -2343,7 +2372,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  
> > > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -2419,7 +2451,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > > > >  		xfs_iext_next(ifp, icur);
> > > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -2471,7 +2506,10 @@ xfs_bmap_add_extent_unwritten_real(
> > > > >  		xfs_iext_next(ifp, icur);
> > > > >  		xfs_iext_insert(ip, icur, &r[1], state);
> > > > >  		xfs_iext_insert(ip, icur, &r[0], state);
> > > > > -		ifp->if_nextents += 2;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, 2);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL)
> > > > >  			rval = XFS_ILOG_CORE | XFS_ILOG_DEXT;
> > > > > @@ -2787,7 +2825,10 @@ xfs_bmap_add_extent_hole_real(
> > > > >  		xfs_iext_remove(ip, icur, state);
> > > > >  		xfs_iext_prev(ifp, icur);
> > > > >  		xfs_iext_update_extent(ip, state, icur, &left);
> > > > > -		ifp->if_nextents--;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL) {
> > > > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > > > @@ -2886,7 +2927,10 @@ xfs_bmap_add_extent_hole_real(
> > > > >  		 * Insert a new entry.
> > > > >  		 */
> > > > >  		xfs_iext_insert(ip, icur, new, state);
> > > > > -		ifp->if_nextents++;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		if (cur == NULL) {
> > > > >  			rval = XFS_ILOG_CORE | xfs_ilog_fext(whichfork);
> > > > > @@ -5083,7 +5127,10 @@ xfs_bmap_del_extent_real(
> > > > >  		 */
> > > > >  		xfs_iext_remove(ip, icur, state);
> > > > >  		xfs_iext_prev(ifp, icur);
> > > > > -		ifp->if_nextents--;
> > > > > +
> > > > > +		error = xfs_next_set(ip, whichfork, -1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > >  
> > > > >  		flags |= XFS_ILOG_CORE;
> > > > >  		if (!cur) {
> > > > > @@ -5193,7 +5240,10 @@ xfs_bmap_del_extent_real(
> > > > >  		} else
> > > > >  			flags |= xfs_ilog_fext(whichfork);
> > > > >  
> > > > > -		ifp->if_nextents++;
> > > > > +		error = xfs_next_set(ip, whichfork, 1);
> > > > > +		if (error)
> > > > > +			goto done;
> > > > > +
> > > > >  		xfs_iext_next(ifp, icur);
> > > > >  		xfs_iext_insert(ip, icur, &new, state);
> > > > >  		break;
> > > > > @@ -5660,7 +5710,10 @@ xfs_bmse_merge(
> > > > >  	 * Update the on-disk extent count, the btree if necessary and log the
> > > > >  	 * inode.
> > > > >  	 */
> > > > > -	ifp->if_nextents--;
> > > > > +	error = xfs_next_set(ip, whichfork, -1);
> > > > > +	if (error)
> > > > > +		goto done;
> > > > > +
> > > > >  	*logflags |= XFS_ILOG_CORE;
> > > > >  	if (!cur) {
> > > > >  		*logflags |= XFS_ILOG_DEXT;
> > > > > @@ -6047,7 +6100,10 @@ xfs_bmap_split_extent(
> > > > >  	/* Add new extent */
> > > > >  	xfs_iext_next(ifp, &icur);
> > > > >  	xfs_iext_insert(ip, &icur, &new, 0);
> > > > > -	ifp->if_nextents++;
> > > > > +
> > > > > +	error = xfs_next_set(ip, whichfork, 1);
> > > > > +	if (error)
> > > > > +		goto del_cursor;
> > > > >  
> > > > >  	if (cur) {
> > > > >  		error = xfs_bmbt_lookup_eq(cur, &new, &i);
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > > > > index 28b366275ae0..3bf5a2c391bd 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > > > > @@ -728,3 +728,32 @@ xfs_ifork_verify_local_attr(
> > > > >  
> > > > >  	return 0;
> > > > >  }
> > > > > +
> > > > > +int
> > > > > +xfs_next_set(
> > > > 
> > > > "next"... please choose an abbreviation that doesn't collide with a
> > > > common English word.
> > > > 
> > > > > +	struct xfs_inode	*ip,
> > > > > +	int			whichfork,
> > > > > +	int			delta)
> > > > 
> > > > Delta?  I thought this was a setter function?
> > > > 
> > > > > +{
> > > > > +	struct xfs_ifork	*ifp;
> > > > > +	int64_t			nr_exts;
> > > > > +	int64_t			max_exts;
> > > > > +
> > > > > +	ifp = XFS_IFORK_PTR(ip, whichfork);
> > > > > +
> > > > > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > > > > +		max_exts = MAXEXTNUM;
> > > > > +	else if (whichfork == XFS_ATTR_FORK)
> > > > > +		max_exts = MAXAEXTNUM;
> > > > > +	else
> > > > > +		ASSERT(0);
> > > > > +
> > > > > +	nr_exts = ifp->if_nextents + delta;
> > > > 
> > > > Nope, it's a modify function all right.  Then it should be named:
> > > > 
> > > > xfs_nextents_mod(ip, whichfork, delta)
> > > > 
> > > > > +	if ((delta > 0 && nr_exts > max_exts)
> > > > > +		|| (delta < 0 && nr_exts < 0))
> > > > 
> > > > Line these up, please.  e.g.,
> > > > 
> > > > 	if ((delta > 0 && nr_exts > max_exts) ||
> > > >             (delta < 0 && nr_exts < 0))
> 
> HA even the maintainer gets it wrong. :(
> 
> > > > 
> > > > --D
> > > > 
> > > > > +		return -EOVERFLOW;
> > > 
> > > Oh, also, shouldn't this be EFBIG ("File too big")?
> > 
> > True, EFBIG is more appropriate than EOVERFLOW in this case.
> > 
> > Darrick, I have one question. The purpose of this patch is to fix the zero day
> > bug where we overflow extent counter silently and get to know about it only
> > when flushing the incore inode to disk. Patches that come later in the series
> > modify the extent count limits to 2^32 (for xattr fork) and 2^47 (for data
> > fork). If this patch is not required to be sent to stable release, I will drop
> > it from the series.
> 
> I would leave it in the series, unless you mean to send this as a
> separate cleanup ahead of everything else?
> 
> Now that I think about it, this probably should become its own cleanup
> series.  I just realized that if we error out EFBIG in the middle of a
> bmap function, we're probably going to end up cancelling a dirty
> transaction, which will cause an fs shutdown.  Since xfs cannot undo the
> effects of a dirty transaction, we have to be able to error out earlier
> in the transaction sequence so that we can back out to userspace without
> affecting the filesystem.
> 
> IOWs, this means that any code path that could increase an inode's
> extent count will have to check the inode (after we take the ILOCK)
> to make sure that it can accommodate however many more extents we're
> adding.
> 
> static int
> xfs_trans_inode_reserve_extent_count(ip, whichfork, nrtoadd)
> {
> 	if (MAX{,A}EXTNUM - XFS_IFORK_PTR(ip, whichfork)->if_nextents < nrtoadd)
> 		return -EFBIG;
> 	return 0;
> }
> 
> 	error = xfs_trans_alloc(..., &tp);
> 	if (error)
> 		goto out;
> 
> 	xfs_ilock(ip, XFS_ILOCK_EXCL);
> 	xfs_trans_ijoin(ip, 0);
> 
> 	error = xfs_trans_inode_reserve_extent_count(ip, whichfork, nrtoadd);
> 	if (error)
> 		goto out;
> 
> 	error = xfs_trans_reserve_quota_nblks(tp, ip, ...);
> 	if (error)
> 		goto out;
> 
> ...or something like that.  And now suddenly this grows into its own
> cleanup series. :/


Ok. I will work on the cleanup series while we are reaching consensus on the
log reservation changes.

Thanks once again for the suggestions.
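
A standalone sketch of the check-before-dirtying idea, so the intent is
written down. The structure and function names are hypothetical stand-ins;
only the limits (signed 32-bit for the data fork, signed 16-bit for the attr
fork) and the EFBIG return value come from the discussion above.

```c
#include <errno.h>
#include <stdint.h>

/* Limits of the current counters: signed 32-bit data, signed 16-bit attr. */
#define SKETCH_MAXEXTNUM	((int64_t)INT32_MAX)
#define SKETCH_MAXAEXTNUM	((int64_t)INT16_MAX)

/* Hypothetical stand-in for the in-core fork; only the counter is modeled. */
struct sketch_ifork {
	int64_t nextents;
};

/*
 * Meant to run after the inode is locked and before the transaction is
 * dirtied, so that a failure can be returned to userspace without a
 * shutdown. Returns -EFBIG if the fork has no room for nr_to_add extents.
 */
static int sketch_reserve_extent_count(const struct sketch_ifork *ifp,
				       int64_t max_extents,
				       int64_t nr_to_add)
{
	if (max_extents - ifp->nextents < nr_to_add)
		return -EFBIG;
	return 0;
}

/*
 * Apply a delta to the counter. With the reservation above done first,
 * the range check here should never fail; it is kept as a backstop.
 */
static int sketch_nextents_mod(struct sketch_ifork *ifp,
			       int64_t max_extents,
			       int64_t delta)
{
	int64_t nr = ifp->nextents + delta;

	if (nr < 0 || nr > max_extents)
		return -EFBIG;
	ifp->nextents = nr;
	return 0;
}
```

The reservation only checks headroom and changes nothing, which is what
allows it to be called before any metadata has been modified.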

> 
> > Also, I can't have a "fixes" tag because this is a zero
> > day bug.
> 
> Everything is a zero day now... but establishing a base for this one is
> probably not going to be easy since I bet the overflow has existed since
> the beginning.
> 
> --D
> 
> > 
> > > 
> > > --D
> > > 
> > > > > +
> > > > > +	ifp->if_nextents = nr_exts;
> > > > > +
> > > > > +	return 0;
> > > > > +}
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > > > > index a4953e95c4f3..a84ae42ace79 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > > > > @@ -173,4 +173,5 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> > > > >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> > > > >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> > > > >  
> > > > > +int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > > > >  #endif	/* __XFS_INODE_FORK_H__ */
> > > 
> > 
> > 
> 

-- 
chandan





* Re: [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately
  2020-06-10  6:23         ` Chandan Babu R
@ 2020-06-11  6:38           ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-11  6:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, david, bfoster, hch

On Wednesday 10 June 2020 11:53:49 AM IST Chandan Babu R wrote:
> On Wednesday 10 June 2020 12:10:02 AM IST Darrick J. Wong wrote:
> > On Tue, Jun 09, 2020 at 07:53:55PM +0530, Chandan Babu R wrote:
> > > On Tuesday 9 June 2020 2:29:22 AM IST Darrick J. Wong wrote:
> > > > On Sat, Jun 06, 2020 at 01:57:41PM +0530, Chandan Babu R wrote:
> > > > > xfs/306 causes the following call trace when using a data fork with a
> > > > > maximum extent count of 2^47,
> > > > > 
> > > > >  XFS (loop0): Mounting V5 Filesystem
> > > > >  XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
> > > > >  XFS (loop0): AAIEEE! Log failed size checks. Abort!
> > > > >  XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
> > > > 
> > > > Uh... won't applying the corresponding MAXEXTNUM changes and whatnot to
> > > > xfsprogs result in mkfs formatting a log with 9075 blocks?  Is there
> > > > some other mistake in the minimum log size computations?
> > > 
> > > The call trace given below shows up when using 2^47 as the maximum extent
> > > count for both Dir and Non-dir inodes.
> > > 
> > > However, using 2^27 as the maximum
> > > extent count for directories would reduce the log reservation value for
> > > "rename" operation (which has the maximum sized log reservation when using the
> > > below mentioned FS geometry).
> > > 
> > > "Rename" log reservation is a function of the maximum directory BMBT height
> > > which in turn is a function of the maximum number of extents that can be
> > > occupied by a directory.
> > > 
> > > Hence when moving the MAXEXTNUM changes to xfsprogs, the corresponding
> > > "maximum directory extent count" changes must also be moved as a
> > > dependency.
> > > 
> > > With this patchset applied (i.e. With 2^27 as the maximum extent count for
> > > directory inodes and 2^47 as the maximum extent count for non-directory
> > > inodes), xfs_log_calc_minimum_size() in kernel returns 8691 blocks.
> > 
> > Hmm, 8691, you say?  Ok, that's a helpful clue...
> > 
> > MAXEXTNUM	min log blocks
> > 2^47		9,075
> > 2^32		8,906
> > 2^27		8,691
> > 
> > ...and now I think I finally understand the goal here.  The existing
> > xfs_bmap_compute_maxlevels computes the max bmbt height from MAXEXTNUM
> > (2^32).  The file rename reservation computation uses this max bmbt
> > height, which works out to a min log size of 8,906 blocks.  Once you
> > change MAXEXTNUM to 2^47, this computation turns into 9,075 blocks.
> > 
> > This means that if you use mkfs.xfs 5.6.0 to create a small, vanilla V5
> > filesystem, it won't mount on your development kernel due to the minimum
> > log size checks, even if you didn't enable the larger extent counters.
> > 
> > Therefore, you're introducing m_bm_dir_maxlevel to store the max bmbt
> > height for a directory, using that to compute the rename reservation,
> > and lo and behold the min log size never goes above the old limit.
> > 
> > This is problematic... (scroll down, please)
> > 
> > > > 
> > > > >  ------------[ cut here ]------------
> > > > >  WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
> > > > >  Modules linked in:
> > > > >  CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
> > > > >  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
> > > > >  RIP: 0010:assfail+0x25/0x28
> > > > >  Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
> > > > >  RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
> > > > >  RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
> > > > >  RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
> > > > >  RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
> > > > >  R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
> > > > >  R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
> > > > >  FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
> > > > >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > >  CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
> > > > >  Call Trace:
> > > > >   xfs_log_mount+0xf8/0x300
> > > > >   xfs_mountfs+0x46e/0x950
> > > > >   xfs_fc_fill_super+0x318/0x510
> > > > >   ? xfs_mount_free+0x30/0x30
> > > > >   get_tree_bdev+0x15c/0x250
> > > > >   vfs_get_tree+0x25/0xb0
> > > > >   do_mount+0x740/0x9b0
> > > > >   ? memdup_user+0x41/0x80
> > > > >   __x64_sys_mount+0x8e/0xd0
> > > > >   do_syscall_64+0x48/0x110
> > > > >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > >  RIP: 0033:0x7fd6c5f2ccda
> > > > >  Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
> > > > >  RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
> > > > >  RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
> > > > >  RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
> > > > >  RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
> > > > >  R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
> > > > >  R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
> > > > >  ---[ end trace 6436391b468bc652 ]---
> > > > >  XFS (loop0): log mount failed
> > > > > 
> > > > > The corresponding filesystem was created using mkfs options
> > > > > "-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".
> > > > > 
> > > > > i.e. We have a filesystem of size 20MiB, data block size of 1KiB and
> > > > > directory block size of 64KiB. Filesystems of size < 1GiB can have less
> > > > > than 10MiB on-disk log (Please refer to calculate_log_size() in
> > > > > xfsprogs).
> > > > 
> > > > Hm.  You don't seem to be setting either of the big extent count feature
> > > > flags here.
> > > > 
> > > > Is this something that happens after a filesystem gets *upgraded* to
> > > > support extent counts > 2^32?  If it's this second case, then I think
> > > > the function that upgrades the filesystem has to reject the change if it
> > > > would cause the minimum log size checks to fail.
> > > 
> > > This happens whenever MAXEXTNUM is 2^47 (i.e. with this patchset applied),
> > > irrespective of whether the filesystem's superblock has the big extent count
> > > feature flag set.
> > > 
> > > Using 2^47 as the value of MAXEXTNUM causes the height of the data fork BMBT
> > > tree to increase when compared to the height of the tree when using 2^32
> > > MAXEXTNUM (In the case of the fs geometry that caused the above call trace,
> > > the height increased by 1). The call xfs_bmap_compute_maxlevels(mp,
> > > XFS_DATA_FORK) (invoked as part of FS mount operation) uses MAXEXTNUM as input
> > > to calculate the maximum height of the data fork BMBT and the result is stored
> > > in mp->m_bm_maxlevels[XFS_DATA_FORK]. This value is then used when calculating
> > > log reservations for various fs operations. Hence the log reservations of fs
> > > operations now change regardless of whether the "big extent count" feature
> > > flag is set or not.
> > 
> > "...or not."
> > 
> > Urrrk, no.  The log reservation calculations for existing filesystems
> > must not change, because (at best) this will cause subtle log behavior
> > changes due to the fluctuating reservation sizes; and (at worst) it can
> > cause the same log minimum size mounting problems you observed above.
> > 
> > If you disturb the log reservations for existing filesystems such
> > that the minimum log size goes up, this means that small filesystems
> > created with an old mkfs will now fail to mount with the new kernel.
> > This is never acceptable.
> > 
> > If you disturb the log reservations such that the minimum log size goes
> > down, this means that when those changes get pulled in by the xfsprogs
> > maintainer, a new mkfs will produce small filesystems that won't mount
> > on older kernels.  The only way this is acceptable is if the changes
> > only affect filesystems with a feature flag set that would cause all
> > of those older kernels to warn about the feature being EXPERIMENTAL.
> 
> So the reduction of the rename log reservation size (by using 2^32 as the
> maximum directory extent count) must be accompanied with setting of a feature
> flag other than bigfork feature flag right? I say that because the bigfork
> feature flag is currently set at runtime when we are about to overflow signed
> 16-bit attrs or signed 32-bit data extent counters. Log reservation values are
> pre-calculated during filesystem mount and cannot be changed during runtime.
> 
> This also means that the patch "xfs: Fix log reservation calculation for xattr
> insert operation" also needs to be handled specially since it,
> - Replaces two reservations (mount and runtime) with just one static
>   reservation.
> - Reduces the value of "xattr set operation" reservation.
> Hence older kernels may not be able to mount filesystems created with mkfs.xfs
> containing this patch.
> 
> > 
> > Either way, users end up broken.
> > 
> > > > 
> > > > Granted, I don't understand the need (in the next patch) to special case
> > > > bmbt maxlevels for directory data forks.  That's probably muddying up
> > > > my ability to figure all this out.  Yes I did read this series
> > > > backwards. :)
> > > 
> > > Using a separate maximum extent count for directory data fork was required to
> > > reduce the increased log reservations described above. To be precise, rename
> > > operation invokes XFS_DIR_OP_LOG_COUNT() which indirectly uses
> > > mp->m_bm_maxlevels[XFS_DATA_FORK] for its calculations. Using a modified
> > > kernel which had 2^47 as the value for MAXEXTNUM resulted in a taller data
> > > fork BMBT tree. Hence log reservation space for rename operation became larger.
> > > 
> > > The idea of special handling of "maximum extents for directory data fork" came
> > > up later when trying to find a way to reduce the log reservation for the
> > > rename operation.
> > 
> > I think a better way to handle the directory operation reservations is:
> > 
> > 1. Introduce XFS_MAXDIREXTNUM == 2^32-1, and use that to compute
> >    m_bm_dir_maxlevel for directories.
> > 
> > 2. Use m_bm_dir_maxlevel to compute the rename reservations, like you do
> >    here.
> > 
> > 3. As a cleanup, split XFS_NEXTENTADD_SPACE_RES into three separate
> >    helpers: one for attr forks (a), one for regular file data forks (b),
> >    and one for !S_ISREG() data forks(c).  The DAENTER macros can switch
> >    between (a) and (c).  Anything that knows it's being run against a
> >    regular file can use (b).  Symlinks and rtbitmaps can use (c).
> > 
> >    We then add a separate helper taking an xfs_inode and whichfork to
> >    compute the correct value for the callers that have non-variable
> >    arguments.
> 
> Ok. I will take a shot at implementing the helpers. Thanks for the
> suggestions.
> 
> > 
> > This means that the log reservations will stay the same, regardless of
> > whether the bigfork feature is enabled.

Sorry, I had misunderstood the above statement. If we use XFS_MAXDIREXTNUM
(i.e. 2^32) as the maximum extent count for computing the rename reservation,
the resultant value should be the same as the one computed on existing
filesystems, which have MAXEXTNUM set to 2^32. Hence, as you have noted, the
reservation for the rename operation should not change.

However, other FS operations that use mp->m_bm_maxlevels[xattr fork|non-dir
data fork] (e.g. xfs_calc_write_reservation()) as input for calculating log
reservations, will see an increase in the resultant values since both max
xattr and max data extent count have now been increased. Hence IMHO, the only
way to prevent older kernels from mounting filesystems whose log reservations
have been calculated based on 2^32 (MAXAEXTNUM) and 2^47 (MAXEXTNUM) is to
have an incompat flag set during mkfs time. If we decide to go with
this approach, then we could drop XFS_MAXDIREXTNUM and just continue to use
MAXEXTNUM. Please let me know your opinion on this.
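
To make the dependency on the maximum extent count concrete, here is a rough
userspace sketch of how a worst-case BMBT height falls out of a maximum extent
count. It loosely follows the loop in xfs_bmap_compute_maxlevels(), but the
function name and the record counts passed to it are made up for illustration;
in the kernel those values come from the filesystem geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not kernel code: worst-case btree height needed to
 * index maxleafents extents when every block holds only its minimum number
 * of records and the inode root holds at most maxrootrecs records.
 */
static int sketch_bmbt_maxlevels(uint64_t maxleafents, uint64_t minleafrecs,
				 uint64_t minnoderecs, uint64_t maxrootrecs)
{
	uint64_t maxblocks;
	int level;

	/* minnoderecs must exceed 1, otherwise the loop below cannot shrink. */
	assert(minleafrecs > 0 && minnoderecs > 1 && maxrootrecs > 0);

	/* Leaf blocks needed if each leaf is only minimally full. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;

	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;	/* remaining pointers fit in the inode root */
		else
			maxblocks = (maxblocks + minnoderecs - 1) / minnoderecs;
	}
	return level;
}
```

Calling it once with 2^32 and once with 2^47 while keeping the record counts
fixed shows that the height can only stay the same or grow, which is why
every reservation derived from it is affected.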

> 
> On a filesystem which would have already used,
> 1. 2^47 max data extent count
> 2. 2^28 max directory extent count
> 3. 2^32 max xattr count
> for computing the log reservations during mount time, an upgrade to bigfork
> feature would not affect the pre-calculated log reservation values.
> 
> > I think this will be safe for
> > the attr extent count expansion, since we aren't letting the attr fork
> > expand beyond 2^32 extents, which means the max bmbt height there will
> > never be larger than anything we've ever seen before.
> > 
> > In my head I've convinced myself that this will keep the code simpler in
> > the long run, but maybe the rest of you have other ideas or flames? :D
> > 
> 
> 

-- 
chandan





* Re: [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation
  2020-06-06  8:27 ` [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation Chandan Babu R
@ 2020-06-19 14:33   ` Christoph Hellwig
  2020-06-20 12:53     ` Chandan Babu R
  0 siblings, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-19 14:33 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, darrick.wong, bfoster, hch, Dave Chinner

On Sat, Jun 06, 2020 at 01:57:39PM +0530, Chandan Babu R wrote:
> -		tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
> -				 M_RES(mp)->tr_attrsetrt.tr_logres *
> -					args->total;
> -		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> -		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
> +		tres = M_RES(mp)->tr_attrset;
>  		total = args->total;

tres can become a pointer now, and we can avoid the struct copy.


* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-09 17:10       ` Darrick J. Wong
@ 2020-06-19 14:36         ` Christoph Hellwig
  2020-06-19 21:31           ` Darrick J. Wong
  2020-06-20 12:53           ` Chandan Babu R
  0 siblings, 2 replies; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-19 14:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Babu R, linux-xfs, david, bfoster, hch

I'm lost in 4 layers of full quotes.  Can someone summarize the
discussion without hundreds of lines of irrelevant quotes?


* Re: [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-06  8:27 ` [PATCH 6/7] xfs: Extend data extent counter to 47 bits Chandan Babu R
  2020-06-08 17:14   ` Darrick J. Wong
@ 2020-06-19 14:38   ` Christoph Hellwig
  2020-06-20 12:52     ` Chandan Babu R
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-19 14:38 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, darrick.wong, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:44PM +0530, Chandan Babu R wrote:
> This commit extends the per-inode data extent counter to 47 bits. The
> length of 47-bits was chosen because,
> Maximum file size = 2^63.
> Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.

What is the use case for a large number of extents?  I'm not sure why
we'd want to bother, but if there is a good reason it really should
be documented here.



* Re: [PATCH 7/7] xfs: Extend attr extent counter to 32 bits
  2020-06-06  8:27 ` [PATCH 7/7] xfs: Extend attr extent counter to 32 bits Chandan Babu R
  2020-06-08 17:21   ` Darrick J. Wong
@ 2020-06-19 14:39   ` Christoph Hellwig
  2020-06-20 12:53     ` Chandan Babu R
  1 sibling, 1 reply; 40+ messages in thread
From: Christoph Hellwig @ 2020-06-19 14:39 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, darrick.wong, bfoster, hch

On Sat, Jun 06, 2020 at 01:57:45PM +0530, Chandan Babu R wrote:
> This commit extends the per-inode attr extent counter to 32 bits.

And the reason for why this is needed or at least nice to have needs
to go here.


* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-19 14:36         ` Christoph Hellwig
@ 2020-06-19 21:31           ` Darrick J. Wong
  2020-06-20 12:53           ` Chandan Babu R
  1 sibling, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2020-06-19 21:31 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Chandan Babu R, linux-xfs, david, bfoster

On Fri, Jun 19, 2020 at 07:36:08AM -0700, Christoph Hellwig wrote:
> I'm lost in 4 layers of full quotes.  Can someone summarize the
> discussion without hundreds of lines of irrelevant quotes?

Naming issues with helper functions, and pointing out that any code path
that thinks it could add $nr extents to a file needs to check for
overflows in (ifp->if_nextents + $nr) after we take the ILOCK but before
we dirty the transaction.

--D


* Re: [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-19 14:38   ` Christoph Hellwig
@ 2020-06-20 12:52     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-20 12:52 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david, darrick.wong, bfoster

On Friday 19 June 2020 8:08:46 PM IST Christoph Hellwig wrote:
> On Sat, Jun 06, 2020 at 01:57:44PM +0530, Chandan Babu R wrote:
> > This commit extends the per-inode data extent counter to 47 bits. The
> > length of 47-bits was chosen because,
> > Maximum file size = 2^63.
> > Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.
> 
> What is the use case for a large number of extents?  I'm not sure why
> we'd want to bother, but if there is a good reason it really should
> be documented here.
> 
> 

Late last year, Dave had pointed me to the commit "xfs: fix inode fork extent
count overflow" (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) where the following
scenario is described,

Fallocate 40TiB of disk space and then alternately punch out fs
blocks. Assuming 4k block size, this would give,

40TiB / 4k / 2 = ~5 billion extents.

This won't fit into an unsigned 32-bit field, which can hold a maximum value of
~4 billion.
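
The arithmetic can be checked with a small userspace sketch. The helper names
are hypothetical, and it assumes the worst case where every surviving block
ends up as its own extent.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case extent count after fallocating file_bytes and punching out
 * every other block: one extent survives per pair of blocks.
 */
static uint64_t alternating_punch_extents(uint64_t file_bytes,
					  uint64_t block_bytes)
{
	assert(block_bytes != 0);
	return file_bytes / block_bytes / 2;
}

/* Non-zero if the count still fits in an unsigned 32-bit on-disk field. */
static int fits_in_u32(uint64_t extents)
{
	return extents <= UINT32_MAX;
}
```

For a 40TiB file (40ULL << 40 bytes) and 4096-byte blocks the first helper
returns 5368709120, which is above the 4294967295 limit of a 32-bit field.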

Dave mentioned that we will go over the 32-bit extent counter limit
soon. Hence this patch extends the on-disk data fork extent counter to a
64-bit field.

In my next version of this patchset, I will add the technical part of the
above description to the patch. Sorry for missing that out.

-- 
chandan





* Re: [PATCH 7/7] xfs: Extend attr extent counter to 32 bits
  2020-06-19 14:39   ` Christoph Hellwig
@ 2020-06-20 12:53     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-20 12:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david, darrick.wong, bfoster

On Friday 19 June 2020 8:09:17 PM IST Christoph Hellwig wrote:
> On Sat, Jun 06, 2020 at 01:57:45PM +0530, Chandan Babu R wrote:
> > This commit extends the per-inode attr extent counter to 32 bits.
> 
> And the reason for why this is needed or at least nice to have needs
> to go here.
> 

Parent pointers are stored in xattrs of the corresponding inode. Dave had
informed me that there have been instances where we have more than 100 million
hardlinks associated with an inode. This will most likely cause the 16-bit
wide on-disk xattr extent counter to overflow as described below,

1. Insert 5 million xattrs (each having a value size of 255 bytes) and then
   delete 50% of them in an alternating manner.
   ./benchmark-xattrs -l 255 -n 5000000 -s 50 -f $mntpnt/testfile-0

   benchmark-xattrs.c and related sources can be obtained from
   https://github.com/chandanr/xfs-xattr-benchmark/blob/master/src/
   
2. This causes 98511 extents to be created in the attr fork of the inode.
   xfsaild/loop0  2035 [003]  9643.390490: probe:xfs_iflush_int: (ffffffffac6225c0) if_nextents=98511 inode=131

3. The incore inode fork extent counter is a signed 32-bit quantity. However
   the on-disk attr extent counter is a signed 16-bit quantity and hence cannot
   hold 98511 extents.

4. The following incorrect value is stored in the attr extent counter
   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561
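
The value printed by xfs_db is what one gets by keeping only the low 16 bits
of 98511 and reading them back as a signed number. A minimal userspace sketch
of that wraparound (the function name is made up, this is not the kernel's
flush code):

```c
#include <stdint.h>

/*
 * Squeeze a 32-bit incore extent count into a 16-bit slot and read the
 * slot back as a signed value.
 */
static int16_t sketch_truncate_to_s16(int32_t incore_nextents)
{
	uint16_t ondisk = (uint16_t)incore_nextents;	/* keep low 16 bits */

	/* Two's complement reinterpretation, done by hand to stay portable. */
	if (ondisk >= 0x8000)
		return (int16_t)((int32_t)ondisk - 0x10000);
	return (int16_t)ondisk;
}
```

With an input of 98511 the low 16 bits are 32975, which reads back as
32975 - 65536 = -32561.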

I will add a generic description of the above sequence of events in the commit
message of this patch when posting the next version.

-- 
chandan





* Re: [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation
  2020-06-19 14:33   ` Christoph Hellwig
@ 2020-06-20 12:53     ` Chandan Babu R
  0 siblings, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-20 12:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-xfs, david, darrick.wong, bfoster, Dave Chinner

On Friday 19 June 2020 8:03:54 PM IST Christoph Hellwig wrote:
> On Sat, Jun 06, 2020 at 01:57:39PM +0530, Chandan Babu R wrote:
> > -		tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
> > -				 M_RES(mp)->tr_attrsetrt.tr_logres *
> > -					args->total;
> > -		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> > -		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
> > +		tres = M_RES(mp)->tr_attrset;
> >  		total = args->total;
> 
> tres can become a pointer now, and we can avoid the struct copy.
> 

Yes, you are right. I will fix this up. Thanks for the review.

-- 
chandan





* Re: [PATCH 2/7] xfs: Check for per-inode extent count overflow
  2020-06-19 14:36         ` Christoph Hellwig
  2020-06-19 21:31           ` Darrick J. Wong
@ 2020-06-20 12:53           ` Chandan Babu R
  1 sibling, 0 replies; 40+ messages in thread
From: Chandan Babu R @ 2020-06-20 12:53 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Darrick J. Wong, linux-xfs, david, bfoster

On Friday 19 June 2020 8:06:08 PM IST Christoph Hellwig wrote:
> I'm lost in 4 layers of full quotes.  Can someone summarize the
> discussion without hundreds of lines of irrelevant quotes?
> 

XFS does not check for possible overflow of per-inode extent counter fields
when adding extents to either data or attr fork.

For e.g.

1. Insert 5 million xattrs (each having a value size of 255 bytes) and then
   delete 50% of them in an alternating manner.
   ./benchmark-xattrs -l 255 -n 5000000 -s 50 -f $mntpnt/testfile-0

   benchmark-xattrs.c and related sources can be obtained from
   https://github.com/chandanr/xfs-xattr-benchmark/blob/master/src/
   
2. This causes 98511 extents to be created in the attr fork of the inode.
   xfsaild/loop0  2035 [003]  9643.390490: probe:xfs_iflush_int: (ffffffffac6225c0) if_nextents=98511 inode=131

3. The incore inode fork extent counter is a signed 32-bit quantity. However
   the on-disk attr extent counter is a signed 16-bit quantity and hence cannot
   hold 98511 extents.

4. The following incorrect value is stored in the attr extent counter
   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561

As an aside, please note that the sequence of events causing counter overflow
described in this patch's description is no longer applicable, since
the patches have been rebased on xfs-ifork-cleanup.2, which causes a signed
32-bit (xfs_extnum_t) quantity to be used to store the extent counter in
memory. However, the "on-disk counter overflow" bug still exists, as
described above.

To prevent the overflow bug from occurring silently, this patch now checks for
the overflow condition before incrementing the in-core value and returns an
error if a possible overflow is detected.

Darrick pointed out that returning an error code can cause the transaction to
be aborted because it is most likely to be dirty. Hence his suggestion (to
which I agree) was to check for possible overflow just after starting a
transaction and obtaining the inode's i_lock.
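
A minimal sketch of such an early check, written as plain userspace C. The
function name, the limit macros and the choice of -EFBIG as the error code are
assumptions for illustration and not necessarily what the final patch will
use.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical limits mirroring the 16-bit attr and 32-bit data counters. */
#define SKETCH_MAXAEXTNUM	((uint64_t)INT16_MAX)
#define SKETCH_MAXEXTNUM	((uint64_t)INT32_MAX)

/*
 * Return 0 if nr_to_add more extents still fit under max_nextents and
 * -EFBIG otherwise. Meant to run after the inode is locked and before
 * anything in the transaction is dirtied, so that a failure can still be
 * handled by cancelling a clean transaction.
 */
static int sketch_check_nextents(uint64_t cur_nextents, uint64_t nr_to_add,
				 uint64_t max_nextents)
{
	if (cur_nextents > max_nextents ||
	    nr_to_add > max_nextents - cur_nextents)
		return -EFBIG;
	return 0;
}
```

The subtraction form avoids computing cur_nextents + nr_to_add, so the check
itself cannot wrap around.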

Also, since we are extending the xattr and data extent counters to 32 and 47
bits respectively in the later patches, the value of log reservations will
change since they are a function of the maximum height of BMBT trees. The
maximum heights of BMBT trees are themselves calculated based on the maximum
number of xattr and data extents. Due to this, "min log size" can end up being
larger than what was calculated during mkfs.xfs time. This can cause mount to
fail as shown by the following call trace which was generated when executing
xfs/306 test,

 XFS (loop0): Mounting V5 Filesystem
 XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
 XFS (loop0): AAIEEE! Log failed size checks. Abort!
 XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
 ------------[ cut here ]------------
 WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
 Modules linked in:
 CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
 RIP: 0010:assfail+0x25/0x28
 Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
 RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
 RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
 RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
 RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
 R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
 R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
 FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
 Call Trace:
  xfs_log_mount+0xf8/0x300
  xfs_mountfs+0x46e/0x950
  xfs_fc_fill_super+0x318/0x510
  ? xfs_mount_free+0x30/0x30
  get_tree_bdev+0x15c/0x250
  vfs_get_tree+0x25/0xb0
  do_mount+0x740/0x9b0
  ? memdup_user+0x41/0x80
  __x64_sys_mount+0x8e/0xd0
  do_syscall_64+0x48/0x110
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fd6c5f2ccda
 Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
 RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
 RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
 RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
 RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
 ---[ end trace 6436391b468bc652 ]---
 XFS (loop0): log mount failed

The corresponding filesystem was created using mkfs options
"-m rmapbt=1,reflink=1 -b size=1k -d size=20m -n size=64k".

To prevent such incidents, we are contemplating the following
approach,
1. Use existing constants for max extent counts (i.e. signed 16-bit for xattrs
   and signed 32-bit for data extents).
2. Compute max bmbt heights, log reservations and hence min log size during
   mount time.
3. Later, during the mount lifetime of the filesystem, when we are about to
   overflow the extent counter, we use larger values (2^32 for xattr and 2^47
   for data) for max extent count and then recompute log reservations and min
   log size. At this point, if the on-disk log size is smaller than the min log
   size, we return with an error.
4. Otherwise, we set an RO-feature flag and use the revised log reservations
   for future transactions.
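
The decision in steps 3 and 4 boils down to one comparison. A rough sketch,
where the function name and the -EINVAL error code are made up for
illustration:

```c
#include <errno.h>
#include <stdint.h>

/*
 * Hypothetical check for step 3: the larger extent limits may only be
 * switched on if the log created by mkfs is still big enough for the
 * recomputed reservations.
 */
static int sketch_can_enable_bigfork(uint32_t ondisk_log_blocks,
				     uint32_t new_min_log_blocks)
{
	if (ondisk_log_blocks < new_min_log_blocks)
		return -EINVAL;	/* error code is an assumption */
	return 0;
}
```

With the numbers from the trace above (an on-disk log of 8906 blocks and a
recomputed minimum of 9075 blocks) this would refuse the upgrade rather than
fail the next mount.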

Please let me know your opinion on the above approach.
   
--
chandan





* Re: [PATCH 6/7] xfs: Extend data extent counter to 47 bits
  2020-06-09 14:23     ` Chandan Babu R
@ 2020-08-31 21:05       ` Darrick J. Wong
  0 siblings, 0 replies; 40+ messages in thread
From: Darrick J. Wong @ 2020-08-31 21:05 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, david, bfoster, hch

On Tue, Jun 09, 2020 at 07:53:05PM +0530, Chandan Babu R wrote:
> On Monday 8 June 2020 10:44:10 PM IST Darrick J. Wong wrote:
> > On Sat, Jun 06, 2020 at 01:57:44PM +0530, Chandan Babu R wrote:
> > > This commit extends the per-inode data extent counter to 47 bits. The
> > > length of 47-bits was chosen because,
> > > Maximum file size = 2^63.
> > > Maximum extent count when using 64k block size = 2^63 / 2^16 = 2^47.
> > > 
> > > The following changes are made to accomplish this,
> > > 1. A new ro-compat superblock flag to prevent older kernels from
> > >    mounting the filesystem in read-write mode. This flag is set for the
> > >    first time when an inode would end up having more than 2^31 extents.
> > > 3. Carve out a new 32-bit field from xfs_dinode->di_pad2[]. This field
> > >    holds the most significant 15 bits of the data extent counter.
> > 
> > On a 1k block V5 fs, the maximum extent count is 2^(63-10) = 2^53.
> > 
> > If you're going to allocate 32 bits of space from di_pad2 to expand the
> > data fork's nextents, let's use the entire bitspace.
> 
> But 2^53 extents will be beyond the limit of number of extents possible for a
> 64k blocksized filesystem?

That is true, but what about 4k block filesystems?

What about 1k block filesystems?
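
To spell out the arithmetic, assuming a 2^63 byte maximum file size and the
worst case of one extent per block, a throwaway helper (the name is made up)
would be:

```c
#include <assert.h>
#include <stdint.h>

/*
 * log2 of the worst-case extent count for a given block size:
 * 2^63 bytes / 2^blocklog bytes per block = 2^(63 - blocklog) extents.
 */
static unsigned int sketch_max_extent_log2(unsigned int blocklog)
{
	assert(blocklog < 63);
	return 63 - blocklog;
}
```

That gives 47 for 64k blocks (blocklog 16), 51 for 4k blocks (blocklog 12)
and 53 for 1k blocks (blocklog 10), so a 47-bit limit only covers the 64k
case.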

(Yeah, sorry I forgot to reply)

--D

> > 
> > > 2. A new inode->di_flags2 flag to indicate that the newly added field
> > >    contains valid data. This flag is set when one of the following two
> > >    conditions are met,
> > >    - When the inode is about to have more than 2^31 extents.
> > >    - When flushing the incore inode (See xfs_iflush_int()), if
> > >      the superblock ro-compat flag is already set.
> > > 
> > > Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c        | 40 ++++++++--------
> > >  fs/xfs/libxfs/xfs_format.h      | 30 ++++++++----
> > >  fs/xfs/libxfs/xfs_inode_buf.c   | 46 +++++++++++++++---
> > >  fs/xfs/libxfs/xfs_inode_buf.h   |  2 +
> > >  fs/xfs/libxfs/xfs_inode_fork.c  | 84 ++++++++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_fork.h  |  3 +-
> > >  fs/xfs/libxfs/xfs_log_format.h  |  5 +-
> > >  fs/xfs/libxfs/xfs_types.h       |  5 +-
> > >  fs/xfs/scrub/inode.c            |  9 ++--
> > >  fs/xfs/xfs_inode.c              |  6 ++-
> > >  fs/xfs/xfs_inode_item.c         |  5 +-
> > >  fs/xfs/xfs_inode_item_recover.c | 16 +++++--
> > >  12 files changed, 184 insertions(+), 67 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index f75b70ae7b1f..73e552678adc 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -53,9 +53,9 @@ xfs_bmap_compute_maxlevels(
> > >  	int		whichfork,	/* data or attr fork */
> > >  	int		dir_bmbt)	/* Dir or non-dir data fork */
> > >  {
> > > +	uint64_t	maxleafents;	/* max leaf entries possible */
> > >  	int		level;		/* btree level */
> > >  	uint		maxblocks;	/* max blocks at this level */
> > > -	uint		maxleafents;	/* max leaf entries possible */
> > >  	int		maxrootrecs;	/* max records in root block */
> > >  	int		minleafrecs;	/* min records in leaf block */
> > >  	int		minnoderecs;	/* min records in node block */
> > > @@ -477,7 +477,7 @@ xfs_bmap_check_leaf_extents(
> > >  	if (bp_release)
> > >  		xfs_trans_brelse(NULL, bp);
> > >  error_norelse:
> > > -	xfs_warn(mp, "%s: BAD after btree leaves for %d extents",
> > > +	xfs_warn(mp, "%s: BAD after btree leaves for %llu extents",
> > >  		__func__, i);
> > >  	xfs_err(mp, "%s: CORRUPTED BTREE OR SOMETHING", __func__);
> > >  	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > @@ -918,7 +918,7 @@ xfs_bmap_local_to_extents(
> > >  	xfs_iext_first(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &rec, 0);
> > >  
> > > -	error = xfs_next_set(ip, whichfork, 1);
> > > +	error = xfs_next_set(tp, ip, whichfork, 1);
> > >  	if (error)
> > >  		goto done;
> > >  
> > > @@ -1610,7 +1610,7 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_prev(ifp, &bma->icur);
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &LEFT);
> > >  
> > > -		error = xfs_next_set(bma->ip, whichfork, -1);
> > > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, -1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -1717,7 +1717,7 @@ xfs_bmap_add_extent_delay_real(
> > >  		PREV.br_state = new->br_state;
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, &PREV);
> > >  
> > > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -1786,7 +1786,7 @@ xfs_bmap_add_extent_delay_real(
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > >  
> > > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -1876,7 +1876,7 @@ xfs_bmap_add_extent_delay_real(
> > >  		 */
> > >  		xfs_iext_update_extent(bma->ip, state, &bma->icur, new);
> > >  
> > > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -1965,7 +1965,7 @@ xfs_bmap_add_extent_delay_real(
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &RIGHT, state);
> > >  		xfs_iext_insert(bma->ip, &bma->icur, &LEFT, state);
> > >  
> > > -		error = xfs_next_set(bma->ip, whichfork, 1);
> > > +		error = xfs_next_set(bma->tp, bma->ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2172,7 +2172,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, -2);
> > > +		error = xfs_next_set(tp, ip, whichfork, -2);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2228,7 +2228,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &LEFT);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, -1);
> > > +		error = xfs_next_set(tp, ip, whichfork, -1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2274,7 +2274,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, -1);
> > > +		error = xfs_next_set(tp, ip, whichfork, -1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2385,7 +2385,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_update_extent(ip, state, icur, &PREV);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, 1);
> > > +		error = xfs_next_set(tp, ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2464,7 +2464,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_next(ifp, icur);
> > >  		xfs_iext_insert(ip, icur, new, state);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, 1);
> > > +		error = xfs_next_set(tp, ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2519,7 +2519,7 @@ xfs_bmap_add_extent_unwritten_real(
> > >  		xfs_iext_insert(ip, icur, &r[1], state);
> > >  		xfs_iext_insert(ip, icur, &r[0], state);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, 2);
> > > +		error = xfs_next_set(tp, ip, whichfork, 2);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2838,7 +2838,7 @@ xfs_bmap_add_extent_hole_real(
> > >  		xfs_iext_prev(ifp, icur);
> > >  		xfs_iext_update_extent(ip, state, icur, &left);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, -1);
> > > +		error = xfs_next_set(tp, ip, whichfork, -1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -2940,7 +2940,7 @@ xfs_bmap_add_extent_hole_real(
> > >  		 */
> > >  		xfs_iext_insert(ip, icur, new, state);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, 1);
> > > +		error = xfs_next_set(tp, ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -5140,7 +5140,7 @@ xfs_bmap_del_extent_real(
> > >  		xfs_iext_remove(ip, icur, state);
> > >  		xfs_iext_prev(ifp, icur);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, -1);
> > > +		error = xfs_next_set(tp, ip, whichfork, -1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -5252,7 +5252,7 @@ xfs_bmap_del_extent_real(
> > >  		} else
> > >  			flags |= xfs_ilog_fext(whichfork);
> > >  
> > > -		error = xfs_next_set(ip, whichfork, 1);
> > > +		error = xfs_next_set(tp, ip, whichfork, 1);
> > >  		if (error)
> > >  			goto done;
> > >  
> > > @@ -5722,7 +5722,7 @@ xfs_bmse_merge(
> > >  	 * Update the on-disk extent count, the btree if necessary and log the
> > >  	 * inode.
> > >  	 */
> > > -	error = xfs_next_set(ip, whichfork, -1);
> > > +	error = xfs_next_set(tp, ip, whichfork, -1);
> > >  	if (error)
> > >  		goto done;
> > >  
> > > @@ -6113,7 +6113,7 @@ xfs_bmap_split_extent(
> > >  	xfs_iext_next(ifp, &icur);
> > >  	xfs_iext_insert(ip, &icur, &new, 0);
> > >  
> > > -	error = xfs_next_set(ip, whichfork, 1);
> > > +	error = xfs_next_set(tp, ip, whichfork, 1);
> > >  	if (error)
> > >  		goto del_cursor;
> > >  
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index b42a52bfa1e9..91bee33aa988 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -449,10 +449,12 @@ xfs_sb_has_compat_feature(
> > >  #define XFS_SB_FEAT_RO_COMPAT_FINOBT   (1 << 0)		/* free inode btree */
> > >  #define XFS_SB_FEAT_RO_COMPAT_RMAPBT   (1 << 1)		/* reverse map btree */
> > >  #define XFS_SB_FEAT_RO_COMPAT_REFLINK  (1 << 2)		/* reflinked files */
> > > +#define XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR (1 << 3)	/* 47bit data extents */
> > 
> > I wonder if we could come up with a better name for this...
> > 
> > DFORK_EXTENTHI
> > 
> > Hmm...
> > 
> > BIG_DFORK
> > 
> > Hmmm...
> > 
> > ULTRAFRAG
> > 
> > There we go.  "XFS with UltraFrag, part of this complete g@m3r t00lk1t." ;)
> > 
> > ...
> > 
> > (What do you think of the second suggestion?)
> 
> I like the name DFORK_EXTENTHI since it signifies that we are now using the
> "_HI" field of the extent counter and it can also be used to convey the same
> for the attr extent counter as well. Thanks for the suggestions.
> 
> > 
> > >  #define XFS_SB_FEAT_RO_COMPAT_ALL \
> > >  		(XFS_SB_FEAT_RO_COMPAT_FINOBT | \
> > >  		 XFS_SB_FEAT_RO_COMPAT_RMAPBT | \
> > > -		 XFS_SB_FEAT_RO_COMPAT_REFLINK)
> > > +		 XFS_SB_FEAT_RO_COMPAT_REFLINK | \
> > > +		 XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR)
> > >  #define XFS_SB_FEAT_RO_COMPAT_UNKNOWN	~XFS_SB_FEAT_RO_COMPAT_ALL
> > >  static inline bool
> > >  xfs_sb_has_ro_compat_feature(
> > > @@ -563,6 +565,18 @@ static inline bool xfs_sb_version_hasreflink(struct xfs_sb *sbp)
> > >  		(sbp->sb_features_ro_compat & XFS_SB_FEAT_RO_COMPAT_REFLINK);
> > >  }
> > >  
> > > +static inline bool xfs_sb_version_has47bitext(struct xfs_sb *sbp)
> > > +{
> > > +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> > > +		(sbp->sb_features_ro_compat &
> > > +			XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR);
> > > +}
> > > +
> > > +static inline void xfs_sb_version_add47bitext(struct xfs_sb *sbp)
> > > +{
> > > +	sbp->sb_features_ro_compat |= XFS_SB_FEAT_RO_COMPAT_47BIT_DEXT_CNTR;
> > > +}
> > > +
> > >  /*
> > >   * end of superblock version macros
> > >   */
> > > @@ -873,7 +887,7 @@ typedef struct xfs_dinode {
> > >  	__be64		di_size;	/* number of bytes in file */
> > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > -	__be32		di_nextents;	/* number of extents in data fork */
> > > +	__be32		di_nextents_lo;	/* number of extents in data fork */
> > >  	__be16		di_anextents;	/* number of extents in attribute fork*/
> > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > @@ -891,7 +905,8 @@ typedef struct xfs_dinode {
> > >  	__be64		di_lsn;		/* flush sequence */
> > >  	__be64		di_flags2;	/* more random flags */
> > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > +	__be32		di_nextents_hi;
> > > +	__u8		di_pad2[8];	/* more padding for future expansion */
> > >  
> > >  	/* fields only written to during inode creation */
> > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > @@ -992,10 +1007,6 @@ enum xfs_dinode_fmt {
> > >  	((w) == XFS_DATA_FORK ? \
> > >  		(dip)->di_format : \
> > >  		(dip)->di_aformat)
> > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > -	((w) == XFS_DATA_FORK ? \
> > > -		be32_to_cpu((dip)->di_nextents) : \
> > > -		be16_to_cpu((dip)->di_anextents))
> > >  
> > >  /*
> > >   * For block and character special files the 32bit dev_t is stored at the
> > > @@ -1061,12 +1072,15 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
> > >  #define XFS_DIFLAG2_DAX_BIT	0	/* use DAX for this inode */
> > >  #define XFS_DIFLAG2_REFLINK_BIT	1	/* file's blocks may be shared */
> > >  #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
> > > +#define XFS_DIFLAG2_47BIT_NEXTENTS_BIT 3 /* Uses di_nextents_hi field */
> > >  #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
> > >  #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
> > >  #define XFS_DIFLAG2_COWEXTSIZE  (1 << XFS_DIFLAG2_COWEXTSIZE_BIT)
> > > +#define XFS_DIFLAG2_47BIT_NEXTENTS (1 << XFS_DIFLAG2_47BIT_NEXTENTS_BIT)
> > >  
> > >  #define XFS_DIFLAG2_ANY \
> > > -	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE)
> > > +	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
> > > +	 XFS_DIFLAG2_47BIT_NEXTENTS)
> > >  
> > >  /*
> > >   * Inode number format:
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 6f84ea85fdd8..8b89fe080f70 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -307,7 +307,8 @@ xfs_inode_to_disk(
> > >  	to->di_size = cpu_to_be64(from->di_size);
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > -	to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
> > > +	to->di_nextents_lo = cpu_to_be32(xfs_ifork_nextents(&ip->i_df) &
> > > +					0xffffffffU);
> > >  	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> > > @@ -322,6 +323,10 @@ xfs_inode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +			to->di_nextents_hi
> > > +				= cpu_to_be32(xfs_ifork_nextents(&ip->i_df)
> > > +					>> 32);
> > 
> > /me kinda hates the indentation here, would a convenience variable
> > reduce the amount of linewrapping here?
> 
> I will use a variable here as you have suggested.
> 
> > 
> > Oh, right, we're in a new epoch now; just go past 80 columns.
> > 
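The convenience-variable suggestion can be illustrated outside the kernel. The following is a minimal userspace sketch, not the patch's code: `struct disk_counter` and `split_nextents()` are hypothetical names, byte-order conversion (`cpu_to_be32()`) is left out to keep the example self-contained, and zeroing the high half for inodes without the flag is an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two on-disk halves of the counter. */
struct disk_counter {
	uint32_t lo;	/* low 32 bits, always written */
	uint32_t hi;	/* upper bits, meaningful only for "big" inodes */
};

/*
 * Split a 64-bit extent count into its two halves.  Taking the count
 * as a single parameter plays the role of the convenience variable:
 * each assignment fits on one line without awkward wrapping.
 */
void split_nextents(struct disk_counter *to, uint64_t nextents, bool big)
{
	to->lo = (uint32_t)(nextents & 0xffffffffU);
	/* Assumption: the high half is cleared when the flag is not set. */
	to->hi = big ? (uint32_t)(nextents >> 32) : 0;
}
```

In the kernel the same shape would read the count once via `xfs_ifork_nextents(&ip->i_df)` and then store both halves through `cpu_to_be32()`.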
> > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > >  		to->di_lsn = cpu_to_be64(lsn);
> > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > @@ -360,7 +365,7 @@ xfs_log_dinode_to_disk(
> > >  	to->di_size = cpu_to_be64(from->di_size);
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > -	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > +	to->di_nextents_lo = cpu_to_be32(from->di_nextents_lo);
> > >  	to->di_anextents = cpu_to_be16(from->di_anextents);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > > @@ -375,6 +380,9 @@ xfs_log_dinode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +			to->di_nextents_hi =
> > > +				cpu_to_be32(from->di_nextents_hi);
> > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > @@ -391,7 +399,9 @@ xfs_dinode_verify_fork(
> > >  	struct xfs_mount	*mp,
> > >  	int			whichfork)
> > >  {
> > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > +	xfs_extnum_t		di_nextents;
> > > +
> > > +	di_nextents = xfs_dfork_nextents(&mp->m_sb, dip, whichfork);
> > >  
> > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > >  	case XFS_DINODE_FMT_LOCAL:
> > > @@ -462,6 +472,8 @@ xfs_dinode_verify(
> > >  	uint16_t		flags;
> > >  	uint64_t		flags2;
> > >  	uint64_t		di_size;
> > > +	xfs_extnum_t		nextents;
> > > +	int64_t			nblocks;
> > >  
> > >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> > >  		return __this_address;
> > > @@ -492,10 +504,12 @@ xfs_dinode_verify(
> > >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> > >  		return __this_address;
> > >  
> > > +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
> > > +	nextents += xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK);
> > > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > > +
> > >  	/* Fork checks carried over from xfs_iformat_fork */
> > > -	if (mode &&
> > > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > > -			be64_to_cpu(dip->di_nblocks))
> > > +	if (mode && nextents > nblocks)
> > >  		return __this_address;
> > >  
> > >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > > @@ -716,3 +730,23 @@ xfs_inode_validate_cowextsize(
> > >  
> > >  	return NULL;
> > >  }
> > > +
> > > +xfs_extnum_t
> > > +xfs_dfork_nextents(
> > > +	struct xfs_sb		*sbp,
> > > +	struct xfs_dinode	*dip,
> > > +	int			whichfork)
> > > +{
> > > +	xfs_extnum_t		nextents;
> > > +
> > > +	if (whichfork == XFS_DATA_FORK) {
> > > +		nextents = be32_to_cpu(dip->di_nextents_lo);
> > > +		if (xfs_sb_version_has_v3inode(sbp)
> > > +			&& (dip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS))
> > 
> > Please don't align the second line of the if test with the if body.
> > 
> > Or maybe just create a "xfs_inode_has_big_dfork" helper to encapsulate
> > this, like we do for reflink/hascow/realtime inodes.
> 
> Ok. I will follow the style used for reflink inodes.
> 
> > 
> > > +			nextents |= (u64)(be32_to_cpu(dip->di_nextents_hi))
> > > +				<< 32;
> > > +		return nextents;
> > > +	} else {
> > > +		return be16_to_cpu(dip->di_anextents);
> > 
> > I suspect you could reduce the indenting here by inverting the logic,
> > e.g.
> > 
> > 	if (attr fork)
> > 		return be16_to_cpu(anextents);
> > 
> > 	nextents = be32_to_cpu(nextents_lo);
> > 	if (xfs_inode_has_big_dfork())
> > 		nextents |= (u64)be32_to_cpu(nextents_hi) << 32;
> > 	return nextents;
> >
> 
> The "else" part (i.e. attr fork) gets expanded in the next
> patch to contain code similar to the data fork. I will have to introduce the
> "if/else" branch logic once again in that patch.
> 
> > > +	}
> > > +}
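The early-return shape discussed for xfs_dfork_nextents() can be sketched in userspace as follows. This is an illustration under assumptions, not the final kernel helper: `struct dinode_counters`, `enum fork` and `fork_nextents()` are hypothetical names, fields are kept in host byte order, and the `big_dfork` boolean stands in for the v3-inode plus `XFS_DIFLAG2_47BIT_NEXTENTS` test. As in the patch hunk, the high half has to be shifted left by 32 bits before it is merged into the count.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of XFS_DATA_FORK / XFS_ATTR_FORK. */
enum fork { DATA_FORK, ATTR_FORK };

/* Hypothetical on-disk counters, already converted to host byte order. */
struct dinode_counters {
	uint32_t nextents_lo;	/* low 32 bits of the data fork count */
	uint32_t nextents_hi;	/* upper bits, valid only if big_dfork */
	uint16_t anextents;	/* attr fork count (16 bits in this patch) */
	bool     big_dfork;	/* stands in for the inode flag check */
};

uint64_t fork_nextents(const struct dinode_counters *d, enum fork which)
{
	uint64_t nextents;

	/* Handle the simple case first so the rest is not nested. */
	if (which == ATTR_FORK)
		return d->anextents;

	nextents = d->nextents_lo;
	if (d->big_dfork)
		nextents |= (uint64_t)d->nextents_hi << 32;
	return nextents;
}
```

Whether the early return survives depends on the next patch in the series, which, per the reply above, expands the attr fork branch to resemble the data fork one.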
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.h b/fs/xfs/libxfs/xfs_inode_buf.h
> > > index 865ac493c72a..4583db53b933 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.h
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.h
> > > @@ -65,5 +65,7 @@ xfs_failaddr_t xfs_inode_validate_extsize(struct xfs_mount *mp,
> > >  xfs_failaddr_t xfs_inode_validate_cowextsize(struct xfs_mount *mp,
> > >  		uint32_t cowextsize, uint16_t mode, uint16_t flags,
> > >  		uint64_t flags2);
> > > +xfs_extnum_t xfs_dfork_nextents(struct xfs_sb *sbp, struct xfs_dinode *dip,
> > > +		int whichfork);
> > >  
> > >  #endif	/* __XFS_INODE_BUF_H__ */
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > > index 3bf5a2c391bd..ec682e2d5bcb 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > > @@ -10,6 +10,7 @@
> > >  #include "xfs_format.h"
> > >  #include "xfs_log_format.h"
> > >  #include "xfs_trans_resv.h"
> > > +#include "xfs_sb.h"
> > >  #include "xfs_mount.h"
> > >  #include "xfs_inode.h"
> > >  #include "xfs_trans.h"
> > > @@ -103,21 +104,22 @@ xfs_iformat_extents(
> > >  	int			whichfork)
> > >  {
> > >  	struct xfs_mount	*mp = ip->i_mount;
> > > +	struct xfs_sb		*sb = &mp->m_sb;
> > >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> > > +	xfs_extnum_t		nex = xfs_dfork_nextents(sb, dip, whichfork);
> > >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> > >  	struct xfs_iext_cursor	icur;
> > >  	struct xfs_bmbt_rec	*dp;
> > >  	struct xfs_bmbt_irec	new;
> > > -	int			i;
> > > +	xfs_extnum_t		i;
> > >  
> > >  	/*
> > >  	 * If the number of extents is unreasonable, then something is wrong and
> > >  	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
> > >  	 */
> > >  	if (unlikely(size < 0 || size > XFS_DFORK_SIZE(dip, mp, whichfork))) {
> > > -		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %d).",
> > > +		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %llu).",
> > >  			(unsigned long long) ip->i_ino, nex);
> > >  		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
> > >  				"xfs_iformat_extents(1)", dip, sizeof(*dip),
> > > @@ -233,7 +235,11 @@ xfs_iformat_data_fork(
> > >  	 * depend on it.
> > >  	 */
> > >  	ip->i_df.if_format = dip->di_format;
> > > -	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents);
> > > +	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents_lo);
> > > +	if (ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +		ip->i_df.if_nextents |=
> > > +			((u64)(be32_to_cpu(dip->di_nextents_hi)) << 32);
> > > +
> > >  
> > >  	switch (inode->i_mode & S_IFMT) {
> > >  	case S_IFIFO:
> > > @@ -729,31 +735,73 @@ xfs_ifork_verify_local_attr(
> > >  	return 0;
> > >  }
> > >  
> > > +static int
> > > +xfs_next_set_data(
> > > +	struct xfs_trans	*tp,
> > > +	struct xfs_inode	*ip,
> > > +	struct xfs_ifork	*ifp,
> > > +	int			delta)
> > > +{
> > > +	struct xfs_mount	*mp = ip->i_mount;
> > > +	xfs_extnum_t		nr_exts;
> > > +
> > > +	nr_exts = ifp->if_nextents + delta;
> > > +
> > > +	if ((delta > 0 && nr_exts > MAXEXTNUM)
> > > +		|| (delta < 0 && nr_exts > ifp->if_nextents))
> > > +		return -EOVERFLOW;
> > > +
> > > +	if (ifp->if_nextents <= MAXEXTNUM31BIT &&
> > > +		nr_exts > MAXEXTNUM31BIT &&
> > > +		!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS) &&
> > > +		xfs_sb_version_has_v3inode(&mp->m_sb)) {
> > > +		if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
> > 
> > Urk.  Again, don't indent the if test logic and the if body statements
> > to the same level.
> 
> I am sorry. I will fix up the indentation issues.
> 
> > 
> > > +			bool log_sb = false;
> > > +
> > > +			spin_lock(&mp->m_sb_lock);
> > > +			if (!xfs_sb_version_has47bitext(&mp->m_sb)) {
> > > +				xfs_sb_version_add47bitext(&mp->m_sb);
> > > +				log_sb = true;
> > > +			}
> > > +			spin_unlock(&mp->m_sb_lock);
> > > +
> > > +			if (log_sb)
> > > +				xfs_log_sb(tp);
> > > +		}
> > 
> > Hm, dynamic filesystem upgrade.  This probably ought to log something to
> > dmesg about the upgrade.  It might also be better to make this a
> > separate helper so that it's not triply-indented.
> 
> Ok. I will implement that.
> 
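A separate "upgrade once and report it" helper could take roughly the following shape. This is a userspace sketch with a swapped-in technique: the patch re-checks the feature bit under `mp->m_sb_lock`, whereas this example uses a C11 atomic fetch-or so that it builds without any locking library. `struct sb_state`, `FEAT_BIG_DFORK` and `sb_upgrade_big_dfork()` are hypothetical names, and the `fprintf()` stands in for a dmesg message.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define FEAT_BIG_DFORK	(1u << 3)	/* hypothetical feature bit */

/* Hypothetical stand-in for the superblock's ro-compat feature word. */
struct sb_state {
	atomic_uint features_ro_compat;
};

/*
 * Set the feature bit if it is not set yet.  Returns true only for the
 * one caller that actually flipped the bit, so the upgrade is reported
 * (and, in the kernel, the superblock would be logged) exactly once.
 */
bool sb_upgrade_big_dfork(struct sb_state *sb)
{
	unsigned int old;

	old = atomic_fetch_or(&sb->features_ro_compat, FEAT_BIG_DFORK);
	if (old & FEAT_BIG_DFORK)
		return false;	/* somebody else upgraded earlier */

	fprintf(stderr, "upgrading to big data fork extent counters\n");
	return true;
}
```

A caller would act on the return value the way the patch acts on `log_sb`, logging the superblock only when the helper reports that it performed the upgrade.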
> > 
> > > +
> > > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> > > +	}
> > > +
> > > +	ifp->if_nextents = nr_exts;
> > > +
> > > +	return 0;
> > > +}
> > > +
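Because xfs_extnum_t becomes unsigned in this patch, the underflow test in xfs_next_set_data() can no longer be `nr_exts < 0`; it relies on wraparound instead. The sketch below shows that check in isolation. It is a userspace illustration under assumptions: `extent_count_adjust()` and `MAX_EXTNUM` are hypothetical names, with the limit copied from the 47-bit MAXEXTNUM value in the patch.

```c
#include <errno.h>
#include <stdint.h>

/* 47-bit limit, same value as MAXEXTNUM in the patch. */
#define MAX_EXTNUM	((uint64_t)0x7fffffffffffULL)

/*
 * Apply a signed delta to an unsigned extent count.
 * Returns 0 on success and leaves *count untouched on failure.
 */
int extent_count_adjust(uint64_t *count, int delta)
{
	/* Unsigned addition wraps modulo 2^64 for negative deltas. */
	uint64_t nr = *count + (uint64_t)(int64_t)delta;

	/* Growing past the limit. */
	if (delta > 0 && nr > MAX_EXTNUM)
		return -EOVERFLOW;

	/* Shrinking below zero wraps to a value larger than the old one. */
	if (delta < 0 && nr > *count)
		return -EOVERFLOW;

	*count = nr;
	return 0;
}
```

For example, starting from a count of 1 a delta of -2 wraps to a huge value and is rejected, while a delta of -1 succeeds and leaves the count at 0.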
> > >  int
> > >  xfs_next_set(
> > > +	struct xfs_trans	*tp,
> > >  	struct xfs_inode	*ip,
> > >  	int			whichfork,
> > >  	int			delta)
> > >  {
> > >  	struct xfs_ifork	*ifp;
> > >  	int64_t			nr_exts;
> > > -	int64_t			max_exts;
> > > +	int			error = 0;
> > >  
> > >  	ifp = XFS_IFORK_PTR(ip, whichfork);
> > >  
> > > -	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
> > > -		max_exts = MAXEXTNUM;
> > > -	else if (whichfork == XFS_ATTR_FORK)
> > > -		max_exts = MAXAEXTNUM;
> > > -	else
> > > -		ASSERT(0);
> > > -
> > > -	nr_exts = ifp->if_nextents + delta;
> > > -	if ((delta > 0 && nr_exts > max_exts)
> > > -		|| (delta < 0 && nr_exts < 0))
> > > -		return -EOVERFLOW;
> > > +	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK) {
> > > +		error = xfs_next_set_data(tp, ip, ifp, delta);
> > > +	} else if (whichfork == XFS_ATTR_FORK) {
> > > +		nr_exts = ifp->if_nextents + delta;
> > > +		if ((delta > 0 && nr_exts > MAXAEXTNUM)
> > > +			|| (delta < 0 && nr_exts < 0))
> > > +			return -EOVERFLOW;
> > >  
> > > -	ifp->if_nextents = nr_exts;
> > > +		ifp->if_nextents = nr_exts;
> > > +	} else {
> > > +		ASSERT(0);
> > > +	}
> > >  
> > > -	return 0;
> > > +	return error;
> > >  }
> > > diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> > > index a84ae42ace79..c74fa6371cc8 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_fork.h
> > > +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> > > @@ -173,5 +173,6 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_data(struct xfs_inode *ip);
> > >  int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
> > >  
> > > -int xfs_next_set(struct xfs_inode *ip, int whichfork, int delta);
> > > +int xfs_next_set(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork,
> > > +		int delta);
> > >  #endif	/* __XFS_INODE_FORK_H__ */
> > > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > > index e3400c9c71cd..879aadff7692 100644
> > > --- a/fs/xfs/libxfs/xfs_log_format.h
> > > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > > @@ -396,7 +396,7 @@ struct xfs_log_dinode {
> > >  	xfs_fsize_t	di_size;	/* number of bytes in file */
> > >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> > >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> > > -	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > > +	uint32_t	di_nextents_lo;	/* number of extents in data fork */
> > >  	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > >  	int8_t		di_aformat;	/* format of attr fork's data */
> > > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> > >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> > >  	uint64_t	di_flags2;	/* more random flags */
> > >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > > +	uint32_t	di_nextents_hi;
> > > +	uint8_t		di_pad2[8];	/* more padding for future expansion */
> > >  
> > >  	/* fields only written to during inode creation */
> > >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > > index 0a3041ad5bec..c68ff2178976 100644
> > > --- a/fs/xfs/libxfs/xfs_types.h
> > > +++ b/fs/xfs/libxfs/xfs_types.h
> > > @@ -12,7 +12,7 @@ typedef uint32_t	xfs_agblock_t;	/* blockno in alloc. group */
> > >  typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> > >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> > >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> > > -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > > +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> > >  typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> > >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> > > @@ -59,7 +59,8 @@ typedef void *		xfs_failaddr_t;
> > >   * Max values for extlen, extnum, aextnum.
> > >   */
> > >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> > > -#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > > +#define	MAXEXTNUM31BIT	((xfs_extnum_t)0x7fffffff)	/* 31 bits */
> > > +#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffffffff)	/* 47 bits */
> > >  #define	MAXDIREXTNUM	((xfs_extnum_t)0x7ffffff)	/* 27 bits */
> > >  #define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > >  
> > > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> > > index 6d483ab29e63..be41fd242ff2 100644
> > > --- a/fs/xfs/scrub/inode.c
> > > +++ b/fs/xfs/scrub/inode.c
> > > @@ -205,8 +205,8 @@ xchk_dinode(
> > >  	struct xfs_mount	*mp = sc->mp;
> > >  	size_t			fork_recs;
> > >  	unsigned long long	isize;
> > > +	xfs_extnum_t		nextents;
> > >  	uint64_t		flags2;
> > > -	uint32_t		nextents;
> > >  	uint16_t		flags;
> > >  	uint16_t		mode;
> > >  
> > > @@ -354,7 +354,7 @@ xchk_dinode(
> > >  	xchk_inode_extsize(sc, dip, ino, mode, flags);
> > >  
> > >  	/* di_nextents */
> > > -	nextents = be32_to_cpu(dip->di_nextents);
> > > +	nextents = xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK);
> > >  	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> > >  	switch (dip->di_format) {
> > >  	case XFS_DINODE_FMT_EXTENTS:
> > > @@ -464,6 +464,7 @@ xchk_inode_xref_bmap(
> > >  	struct xfs_scrub	*sc,
> > >  	struct xfs_dinode	*dip)
> > >  {
> > > +	xfs_mount_t		*mp = sc->mp;
> > 
> > struct xfs_mount.  Structure typedef usages are deprecated and
> > we're trying to get rid of them (slowly).
> 
> Yes, I missed out on this one. I will fix this up.
> 
> > 
> > --D
> > 
> > >  	xfs_extnum_t		nextents;
> > >  	xfs_filblks_t		count;
> > >  	xfs_filblks_t		acount;
> > > @@ -477,14 +478,14 @@ xchk_inode_xref_bmap(
> > >  			&nextents, &count);
> > >  	if (!xchk_should_check_xref(sc, &error, NULL))
> > >  		return;
> > > -	if (nextents < be32_to_cpu(dip->di_nextents))
> > > +	if (nextents < xfs_dfork_nextents(&mp->m_sb, dip, XFS_DATA_FORK))
> > >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> > >  
> > >  	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
> > >  			&nextents, &acount);
> > >  	if (!xchk_should_check_xref(sc, &error, NULL))
> > >  		return;
> > > -	if (nextents != be16_to_cpu(dip->di_anextents))
> > > +	if (nextents != xfs_dfork_nextents(&mp->m_sb, dip, XFS_ATTR_FORK))
> > >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> > >  
> > >  	/* Check nblocks against the inode. */
> > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > > index 64f5f9a440ae..4418a66cf6d6 100644
> > > --- a/fs/xfs/xfs_inode.c
> > > +++ b/fs/xfs/xfs_inode.c
> > > @@ -3748,7 +3748,7 @@ xfs_iflush_int(
> > >  				ip->i_d.di_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
> > >  		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
> > >  			"%s: detected corrupt incore inode %Lu, "
> > > -			"total extents = %d, nblocks = %Ld, ptr "PTR_FMT,
> > > +			"total extents = %llu, nblocks = %Ld, ptr "PTR_FMT,
> > >  			__func__, ip->i_ino,
> > >  			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
> > >  			ip->i_d.di_nblocks, ip);
> > > @@ -3785,6 +3785,10 @@ xfs_iflush_int(
> > >  	    xfs_ifork_verify_local_attr(ip))
> > >  		goto flush_out;
> > >  
> > > +	if (!(ip->i_d.di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +		&& xfs_sb_version_has47bitext(&mp->m_sb))
> > > +		ip->i_d.di_flags2 |= XFS_DIFLAG2_47BIT_NEXTENTS;
> > > +
> > >  	/*
> > >  	 * Copy the dirty parts of the inode into the on-disk inode.  We always
> > >  	 * copy out the core of the inode, because if the inode is dirty at all
> > > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > > index ba47bf65b772..6f27ac7c8631 100644
> > > --- a/fs/xfs/xfs_inode_item.c
> > > +++ b/fs/xfs/xfs_inode_item.c
> > > @@ -326,7 +326,7 @@ xfs_inode_to_log_dinode(
> > >  	to->di_size = from->di_size;
> > >  	to->di_nblocks = from->di_nblocks;
> > >  	to->di_extsize = from->di_extsize;
> > > -	to->di_nextents = xfs_ifork_nextents(&ip->i_df);
> > > +	to->di_nextents_lo = xfs_ifork_nextents(&ip->i_df) & 0xffffffffU;
> > >  	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = xfs_ifork_format(ip->i_afp);
> > > @@ -344,6 +344,9 @@ xfs_inode_to_log_dinode(
> > >  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
> > >  		to->di_flags2 = from->di_flags2;
> > >  		to->di_cowextsize = from->di_cowextsize;
> > > +		if (from->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +			to->di_nextents_hi =
> > > +				xfs_ifork_nextents(&ip->i_df) >> 32;
> > >  		to->di_ino = ip->i_ino;
> > >  		to->di_lsn = lsn;
> > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
> > > index 10ef5ddf5429..8d64b861fb66 100644
> > > --- a/fs/xfs/xfs_inode_item_recover.c
> > > +++ b/fs/xfs/xfs_inode_item_recover.c
> > > @@ -134,6 +134,7 @@ xlog_recover_inode_commit_pass2(
> > >  	struct xfs_log_dinode		*ldip;
> > >  	uint				isize;
> > >  	int				need_free = 0;
> > > +	xfs_extnum_t			nextents;
> > >  
> > >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> > >  		in_f = item->ri_buf[0].i_addr;
> > > @@ -255,16 +256,23 @@ xlog_recover_inode_commit_pass2(
> > >  			goto out_release;
> > >  		}
> > >  	}
> > > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > > +
> > > +	nextents = ldip->di_nextents_lo;
> > > +	if (xfs_sb_version_has_v3inode(&mp->m_sb) &&
> > > +		ldip->di_flags2 & XFS_DIFLAG2_47BIT_NEXTENTS)
> > > +		nextents |= ((u64)(ldip->di_nextents_hi) << 32);
> > > +
> > > +	nextents += ldip->di_anextents;
> > > +
> > > +	if (unlikely(nextents > ldip->di_nblocks)) {
> > >  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> > >  				     XFS_ERRLEVEL_LOW, mp, ldip,
> > >  				     sizeof(*ldip));
> > >  		xfs_alert(mp,
> > >  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
> > > -	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
> > > +	"dino bp "PTR_FMT", ino %Ld, total extents = %llu, nblocks = %Ld",
> > >  			__func__, item, dip, bp, in_f->ilf_ino,
> > > -			ldip->di_nextents + ldip->di_anextents,
> > > -			ldip->di_nblocks);
> > > +			nextents, ldip->di_nblocks);
> > >  		error = -EFSCORRUPTED;
> > >  		goto out_release;
> > >  	}
> > 
> 
> -- 
> chandan
> 
> 
> 

end of thread, other threads:[~2020-08-31 21:05 UTC | newest]

Thread overview: 40+ messages
2020-06-06  8:27 [PATCH 0/7] xfs: Extend per-inode extent counters Chandan Babu R
2020-06-06  8:27 ` [PATCH 1/7] xfs: Fix log reservation calculation for xattr insert operation Chandan Babu R
2020-06-19 14:33   ` Christoph Hellwig
2020-06-20 12:53     ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 2/7] xfs: Check for per-inode extent count overflow Chandan Babu R
2020-06-08 16:24   ` Darrick J. Wong
2020-06-08 16:32     ` Darrick J. Wong
2020-06-09 14:22       ` Chandan Babu R
2020-06-09 17:07         ` Darrick J. Wong
2020-06-10  6:24           ` Chandan Babu R
2020-06-09 14:22     ` Chandan Babu R
2020-06-09 17:10       ` Darrick J. Wong
2020-06-19 14:36         ` Christoph Hellwig
2020-06-19 21:31           ` Darrick J. Wong
2020-06-20 12:53           ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 3/7] xfs: Compute maximum height of directory BMBT separately Chandan Babu R
2020-06-08 20:59   ` Darrick J. Wong
2020-06-09 14:23     ` Chandan Babu R
2020-06-09 18:40       ` Darrick J. Wong
2020-06-10  6:23         ` Chandan Babu R
2020-06-11  6:38           ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 4/7] xfs: Add "Use Dir BMBT height" argument to XFS_BM_MAXLEVELS() Chandan Babu R
2020-06-08 17:50   ` Darrick J. Wong
2020-06-09 14:23     ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 5/7] xfs: Use 2^27 as the maximum number of directory extents Chandan Babu R
2020-06-08 16:52   ` Darrick J. Wong
2020-06-09 14:23     ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 6/7] xfs: Extend data extent counter to 47 bits Chandan Babu R
2020-06-08 17:14   ` Darrick J. Wong
2020-06-09 14:23     ` Chandan Babu R
2020-08-31 21:05       ` Darrick J. Wong
2020-06-19 14:38   ` Christoph Hellwig
2020-06-20 12:52     ` Chandan Babu R
2020-06-06  8:27 ` [PATCH 7/7] xfs: Extend attr extent counter to 32 bits Chandan Babu R
2020-06-08 17:21   ` Darrick J. Wong
2020-06-09 14:22     ` Chandan Babu R
2020-06-19 14:39   ` Christoph Hellwig
2020-06-20 12:53     ` Chandan Babu R
2020-06-08 17:31 ` [PATCH 0/7] xfs: Extend per-inode extent counters Darrick J. Wong
2020-06-09 14:22   ` Chandan Babu R