* [PATCH 0/2] Extend xattr extent counter to 32-bits
@ 2020-04-04  8:52 Chandan Rajendra
  2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
  0 siblings, 2 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-04  8:52 UTC
  To: linux-xfs; +Cc: Chandan Rajendra, david, chandan, darrick.wong, bfoster

XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
which
1. Creates 1,000,000 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to create 400,000 new 255-byte sized xattrs
causes the following message to be printed on the console,

XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173

This indicates that we overflowed the 16-bit wide xattr extent counter.

I have been informed that there are instances where a single file has > 100
million hardlinks. With parent pointers being stored in xattrs, we will
overflow the 16-bit wide xattr extent counter when a large number of
hardlinks are created.

This patchset also includes the previously posted "Fix log reservation
calculation for xattr insert operation" patch as a bug fix. It now
replaces the xattr set "mount" and "runtime" reservations with just
one static reservation. Hence we don't need the functionality to
calculate the maximum sized 'xattr set' reservation separately anymore.

The patches can also be obtained from
https://github.com/chandanr/linux.git at branch 32bit-anextents-v0.


Chandan Rajendra (2):
  xfs: Fix log reservation calculation for xattr insert operation
  xfs: Extend xattr extent counter to 32-bits

 fs/xfs/libxfs/xfs_attr.c        |  6 +---
 fs/xfs/libxfs/xfs_format.h      | 28 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_buf.c   | 27 ++++++++++++-----
 fs/xfs/libxfs/xfs_inode_fork.c  |  3 +-
 fs/xfs/libxfs/xfs_log_format.h  |  5 +--
 fs/xfs/libxfs/xfs_log_rlimit.c  | 29 ------------------
 fs/xfs/libxfs/xfs_trans_resv.c  | 54 +++++++++++++++------------------
 fs/xfs/libxfs/xfs_trans_resv.h  |  5 +--
 fs/xfs/libxfs/xfs_trans_space.h |  8 ++++-
 fs/xfs/libxfs/xfs_types.h       |  4 +--
 fs/xfs/scrub/inode.c            |  7 +++--
 fs/xfs/xfs_inode_item.c         |  3 +-
 fs/xfs/xfs_log_recover.c        | 13 ++++++--
 13 files changed, 96 insertions(+), 96 deletions(-)

-- 
2.19.1



* [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-04  8:52 [PATCH 0/2] Extend xattr extent counter to 32-bits Chandan Rajendra
@ 2020-04-04  8:52 ` Chandan Rajendra
  2020-04-06 15:25   ` Brian Foster
  2020-04-07  0:49   ` Dave Chinner
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
  1 sibling, 2 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-04  8:52 UTC
  To: linux-xfs; +Cc: Chandan Rajendra, david, chandan, darrick.wong, bfoster

Log space reservation for xattr insert operation is divided into two
parts,
1. Mount time
   - Inode
   - Superblock for accounting space allocations
   - AGF for accounting space used by count, block number, rmap and refcnt
     btrees.

2. The remaining log space can only be calculated at run time because,
   - A local xattr can be large enough to cause a double split of the da
     btree.
   - The value of the xattr can be large enough to be stored in remote
     blocks. The contents of the remote blocks are not logged.

   The log space reservation could be,
   - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
     case xattr is large enough to cause another split of the da btree path.
   - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
     entries.
   - Space for logging blocks of count, block number, rmap and refcnt btrees.

At present, mount time log reservation includes block count required for a
single split of the dabtree. The dabtree block count is also taken into
account by xfs_attr_calc_size().

Also, AGF log space reservation isn't accounted for.

Due to the reasons mentioned above, log reservation calculation for xattr
insert operation gives an incorrect value.

Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.

The above-mentioned inconsistencies were discovered when an attempt to mount
a modified XFS filesystem, one that uses a 32-bit value as the xattr extent
counter, caused the following warning messages to be printed on the console,

XFS (loop0): Mounting V4 Filesystem
XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
XFS (loop0): Log size out of supported range.
XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
XFS (loop0): Ending clean mount

To fix the inconsistencies described above, this commit replaces 'mount'
and 'runtime' components with just one static reservation. The new
reservation calculates the log space for the worst case possible i.e. it
considers,
1. Double split of the da btree.
   This happens for large local xattrs.
2. Bmbt blocks required for mapping the contents of a maximum
   sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.

Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_attr.c        |  6 +---
 fs/xfs/libxfs/xfs_log_rlimit.c  | 29 ------------------
 fs/xfs/libxfs/xfs_trans_resv.c  | 54 +++++++++++++++------------------
 fs/xfs/libxfs/xfs_trans_resv.h  |  5 +--
 fs/xfs/libxfs/xfs_trans_space.h |  8 ++++-
 5 files changed, 33 insertions(+), 69 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index e4fe3dca9883b..74dca80224f17 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -337,11 +337,7 @@ xfs_attr_set(
 				return error;
 		}
 
-		tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
-				 M_RES(mp)->tr_attrsetrt.tr_logres *
-					args->total;
-		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
-		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
+		tres = M_RES(mp)->tr_attrset;
 		total = args->total;
 	} else {
 		XFS_STATS_INC(mp, xs_attr_remove);
diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
index 7f55eb3f36536..7aa9e6684ecd6 100644
--- a/fs/xfs/libxfs/xfs_log_rlimit.c
+++ b/fs/xfs/libxfs/xfs_log_rlimit.c
@@ -15,27 +15,6 @@
 #include "xfs_da_btree.h"
 #include "xfs_bmap_btree.h"
 
-/*
- * Calculate the maximum length in bytes that would be required for a local
- * attribute value as large attributes out of line are not logged.
- */
-STATIC int
-xfs_log_calc_max_attrsetm_res(
-	struct xfs_mount	*mp)
-{
-	int			size;
-	int			nblks;
-
-	size = xfs_attr_leaf_entsize_local_max(mp->m_attr_geo->blksize) -
-	       MAXNAMELEN - 1;
-	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
-	nblks += XFS_B_TO_FSB(mp, size);
-	nblks += XFS_NEXTENTADD_SPACE_RES(mp, size, XFS_ATTR_FORK);
-
-	return  M_RES(mp)->tr_attrsetm.tr_logres +
-		M_RES(mp)->tr_attrsetrt.tr_logres * nblks;
-}
-
 /*
  * Iterate over the log space reservation table to figure out and return
  * the maximum one in terms of the pre-calculated values which were done
@@ -49,9 +28,6 @@ xfs_log_get_max_trans_res(
 	struct xfs_trans_res	*resp;
 	struct xfs_trans_res	*end_resp;
 	int			log_space = 0;
-	int			attr_space;
-
-	attr_space = xfs_log_calc_max_attrsetm_res(mp);
 
 	resp = (struct xfs_trans_res *)M_RES(mp);
 	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
@@ -64,11 +40,6 @@ xfs_log_get_max_trans_res(
 			*max_resp = *resp;		/* struct copy */
 		}
 	}
-
-	if (attr_space > log_space) {
-		*max_resp = M_RES(mp)->tr_attrsetm;	/* struct copy */
-		max_resp->tr_logres = attr_space;
-	}
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index d1a0848cb52ec..b44b521c605c7 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -19,6 +19,7 @@
 #include "xfs_trans.h"
 #include "xfs_qm.h"
 #include "xfs_trans_space.h"
+#include "xfs_attr_remote.h"
 
 #define _ALLOC	true
 #define _FREE	false
@@ -698,42 +699,36 @@ xfs_calc_attrinval_reservation(
 }
 
 /*
- * Setting an attribute at mount time.
+ * Setting an attribute.
  *	the inode getting the attribute
  *	the superblock for allocations
- *	the agfs extents are allocated from
+ *	the agf extents are allocated from
  *	the attribute btree * max depth
- *	the inode allocation btree
- * Since attribute transaction space is dependent on the size of the attribute,
- * the calculation is done partially at mount time and partially at runtime(see
- * below).
+ *	the bmbt entries for da btree blocks
+ *	the bmbt entries for remote blocks (if any)
+ *	the allocation btrees.
  */
 STATIC uint
-xfs_calc_attrsetm_reservation(
+xfs_calc_attrset_reservation(
 	struct xfs_mount	*mp)
 {
+	int			max_rmt_blks;
+	int			da_blks;
+	int			bmbt_blks;
+
+	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
+	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
+
+	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
+	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
+
 	return XFS_DQUOT_LOGRES(mp) +
 		xfs_calc_inode_res(mp, 1) +
 		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH, XFS_FSB_TO_B(mp, 1));
-}
-
-/*
- * Setting an attribute at runtime, transaction space unit per block.
- * 	the superblock for allocations: sector size
- *	the inode bmap btree could join or split: max depth * block size
- * Since the runtime attribute transaction space is dependent on the total
- * blocks needed for the 1st bmap, here we calculate out the space unit for
- * one block so that the caller could figure out the total space according
- * to the attibute extent length in blocks by:
- *	ext * M_RES(mp)->tr_attrsetrt.tr_logres
- */
-STATIC uint
-xfs_calc_attrsetrt_reservation(
-	struct xfs_mount	*mp)
-{
-	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
-		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
+		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
+		xfs_calc_buf_res(da_blks, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(bmbt_blks, XFS_FSB_TO_B(mp, 1)) +
+		xfs_calc_buf_res(xfs_allocfree_log_count(mp, da_blks),
 				 XFS_FSB_TO_B(mp, 1));
 }
 
@@ -897,9 +892,9 @@ xfs_trans_resv_calc(
 	resp->tr_attrinval.tr_logcount = XFS_ATTRINVAL_LOG_COUNT;
 	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
-	resp->tr_attrsetm.tr_logres = xfs_calc_attrsetm_reservation(mp);
-	resp->tr_attrsetm.tr_logcount = XFS_ATTRSET_LOG_COUNT;
-	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
+	resp->tr_attrset.tr_logres = xfs_calc_attrset_reservation(mp);
+	resp->tr_attrset.tr_logcount = XFS_ATTRSET_LOG_COUNT;
+	resp->tr_attrset.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
 
 	resp->tr_attrrm.tr_logres = xfs_calc_attrrm_reservation(mp);
 	resp->tr_attrrm.tr_logcount = XFS_ATTRRM_LOG_COUNT;
@@ -942,7 +937,6 @@ xfs_trans_resv_calc(
 	resp->tr_ichange.tr_logres = xfs_calc_ichange_reservation(mp);
 	resp->tr_fsyncts.tr_logres = xfs_calc_swrite_reservation(mp);
 	resp->tr_writeid.tr_logres = xfs_calc_writeid_reservation(mp);
-	resp->tr_attrsetrt.tr_logres = xfs_calc_attrsetrt_reservation(mp);
 	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
 	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
 	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
index 7241ab28cf84f..f50996ae18e6c 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.h
+++ b/fs/xfs/libxfs/xfs_trans_resv.h
@@ -35,10 +35,7 @@ struct xfs_trans_resv {
 	struct xfs_trans_res	tr_writeid;	/* write setuid/setgid file */
 	struct xfs_trans_res	tr_attrinval;	/* attr fork buffer
 						 * invalidation */
-	struct xfs_trans_res	tr_attrsetm;	/* set/create an attribute at
-						 * mount time */
-	struct xfs_trans_res	tr_attrsetrt;	/* set/create an attribute at
-						 * runtime */
+	struct xfs_trans_res	tr_attrset;	/* set/create an attribute */
 	struct xfs_trans_res	tr_attrrm;	/* remove an attribute */
 	struct xfs_trans_res	tr_clearagi;	/* clear agi unlinked bucket */
 	struct xfs_trans_res	tr_growrtalloc;	/* grow realtime allocations */
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 88221c7a04ccf..6a22ad11b3825 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -38,8 +38,14 @@
 
 #define	XFS_DAENTER_1B(mp,w)	\
 	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
+/*
+ * xattr set operation can cause the da btree to split twice in the
+ * worst case. The double split is actually an extra leaf node rather
+ * than a complete split of blocks in the path from root to a
+ * leaf. The '1' in the macro below accounts for the extra leaf node.
+ */
 #define	XFS_DAENTER_DBS(mp,w)	\
-	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
+	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
 #define	XFS_DAENTER_BLOCKS(mp,w)	\
 	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
 #define	XFS_DAENTER_BMAP1B(mp,w)	\
-- 
2.19.1



* [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-04  8:52 [PATCH 0/2] Extend xattr extent counter to 32-bits Chandan Rajendra
  2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
@ 2020-04-04  8:52 ` Chandan Rajendra
  2020-04-06 16:45   ` Brian Foster
                     ` (3 more replies)
  1 sibling, 4 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-04  8:52 UTC
  To: linux-xfs; +Cc: Chandan Rajendra, david, chandan, darrick.wong, bfoster

XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
which
1. Creates 1,000,000 255-byte sized xattrs,
2. Deletes 50% of these xattrs in an alternating manner,
3. Tries to create 400,000 new 255-byte sized xattrs
causes the following message to be printed on the console,

XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173

This indicates that we overflowed the 16-bit wide xattr extent counter.

I have been informed that there are instances where a single file has
 > 100 million hardlinks. With parent pointers being stored in xattrs,
we will overflow the 16-bit wide xattr extent counter when a large
number of hardlinks are created.

Hence this commit extends the xattr extent counter to 32 bits. It also
introduces an incompat flag to prevent older kernels from mounting newer
filesystems with a 32-bit wide xattr extent counter.

Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
 fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
 fs/xfs/libxfs/xfs_log_format.h |  5 +++--
 fs/xfs/libxfs/xfs_types.h      |  4 ++--
 fs/xfs/scrub/inode.c           |  7 ++++---
 fs/xfs/xfs_inode_item.c        |  3 ++-
 fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
 8 files changed, 63 insertions(+), 27 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 045556e78ee2c..0a4266b0d46e1 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
 #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
 #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
+#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
 #define XFS_SB_FEAT_INCOMPAT_ALL \
 		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
 		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
-		 XFS_SB_FEAT_INCOMPAT_META_UUID)
+		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
+		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
@@ -874,7 +876,7 @@ typedef struct xfs_dinode {
 	__be64		di_nblocks;	/* # of direct & btree blocks used */
 	__be32		di_extsize;	/* basic/minimum extent size for file */
 	__be32		di_nextents;	/* number of extents in data fork */
-	__be16		di_anextents;	/* number of extents in attribute fork*/
+	__be16		di_anextents_lo;/* lower part of xattr extent count */
 	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	__s8		di_aformat;	/* format of attr fork's data */
 	__be32		di_dmevmask;	/* DMIG event mask */
@@ -891,7 +893,8 @@ typedef struct xfs_dinode {
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
 	__be32		di_cowextsize;	/* basic cow extent size for file */
-	__u8		di_pad2[12];	/* more padding for future expansion */
+	__be16		di_anextents_hi;/* higher part of xattr extent count */
+	__u8		di_pad2[10];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
@@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
 	((w) == XFS_DATA_FORK ? \
 		(dip)->di_format : \
 		(dip)->di_aformat)
-#define XFS_DFORK_NEXTENTS(dip,w) \
-	((w) == XFS_DATA_FORK ? \
-		be32_to_cpu((dip)->di_nextents) : \
-		be16_to_cpu((dip)->di_anextents))
+
+static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
+					struct xfs_dinode *dip, int whichfork)
+{
+	int32_t anextents;
+
+	if (whichfork == XFS_DATA_FORK)
+		return be32_to_cpu((dip)->di_nextents);
+
+	anextents = be16_to_cpu((dip)->di_anextents_lo);
+	if (xfs_sb_version_has_v3inode(sbp))
+		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
+
+	return anextents;
+}
 
 /*
  * For block and character special files the 32bit dev_t is stored at the
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 39c5a6e24915c..ced8195bd8c22 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -232,7 +232,8 @@ xfs_inode_from_disk(
 	to->di_nblocks = be64_to_cpu(from->di_nblocks);
 	to->di_extsize = be32_to_cpu(from->di_extsize);
 	to->di_nextents = be32_to_cpu(from->di_nextents);
-	to->di_anextents = be16_to_cpu(from->di_anextents);
+	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
+				XFS_ATTR_FORK);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat	= from->di_aformat;
 	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
@@ -282,7 +283,7 @@ xfs_inode_to_disk(
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
 	to->di_nextents = cpu_to_be32(from->di_nextents);
-	to->di_anextents = cpu_to_be16(from->di_anextents);
+	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -296,6 +297,8 @@ xfs_inode_to_disk(
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
+		to->di_anextents_hi
+			= cpu_to_be16((u32)(from->di_anextents) >> 16);
 		to->di_ino = cpu_to_be64(ip->i_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
@@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
 	to->di_nextents = cpu_to_be32(from->di_nextents);
-	to->di_anextents = cpu_to_be16(from->di_anextents);
+	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
 		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
+		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(from->di_lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
@@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
 	struct xfs_mount	*mp,
 	int			whichfork)
 {
-	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
+	uint32_t		di_nextents;
+
+	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
 
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
 	case XFS_DINODE_FMT_LOCAL:
@@ -436,6 +442,9 @@ xfs_dinode_verify(
 	uint16_t		flags;
 	uint64_t		flags2;
 	uint64_t		di_size;
+	int32_t			nextents;
+	int32_t			anextents;
+	int64_t			nblocks;
 
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return __this_address;
@@ -466,10 +475,12 @@ xfs_dinode_verify(
 	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
 		return __this_address;
 
+	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
+	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
+	nblocks = be64_to_cpu(dip->di_nblocks);
+
 	/* Fork checks carried over from xfs_iformat_fork */
-	if (mode &&
-	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
-			be64_to_cpu(dip->di_nblocks))
+	if (mode && nextents + anextents > nblocks)
 		return __this_address;
 
 	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
@@ -526,7 +537,7 @@ xfs_dinode_verify(
 		default:
 			return __this_address;
 		}
-		if (dip->di_anextents)
+		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
 			return __this_address;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 518c6f0ec3a61..080fd0c156a1e 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -207,9 +207,10 @@ xfs_iformat_extents(
 	int			whichfork)
 {
 	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_sb		*sb = &mp->m_sb;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	int			state = xfs_bmap_fork_to_state(whichfork);
-	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
 	int			size = nex * sizeof(xfs_bmbt_rec_t);
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_rec	*dp;
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index e3400c9c71cdb..5db92aa508bc5 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -397,7 +397,7 @@ struct xfs_log_dinode {
 	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
 	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
 	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
-	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
+	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
 	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	int8_t		di_aformat;	/* format of attr fork's data */
 	uint32_t	di_dmevmask;	/* DMIG event mask */
@@ -414,7 +414,8 @@ struct xfs_log_dinode {
 	xfs_lsn_t	di_lsn;		/* flush sequence */
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
-	uint8_t		di_pad2[12];	/* more padding for future expansion */
+	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
+	uint8_t		di_pad2[10];	/* more padding for future expansion */
 
 	/* fields only written to during inode creation */
 	xfs_ictimestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index 397d94775440d..01669aa65745a 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
 typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
 typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
 typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
-typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
+typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
 typedef int64_t		xfs_fsize_t;	/* bytes in a file */
 typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
 
@@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
  */
 #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
 #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
-#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
+#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
 
 /*
  * Minimum and maximum blocksize and sectorsize.
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 6d483ab29e639..3b624e24ae868 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -371,10 +371,12 @@ xchk_dinode(
 		break;
 	}
 
+	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
+
 	/* di_forkoff */
 	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
 		xchk_ino_set_corrupt(sc, ino);
-	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
+	if (nextents != 0 && dip->di_forkoff == 0)
 		xchk_ino_set_corrupt(sc, ino);
 	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
 		xchk_ino_set_corrupt(sc, ino);
@@ -386,7 +388,6 @@ xchk_dinode(
 		xchk_ino_set_corrupt(sc, ino);
 
 	/* di_anextents */
-	nextents = be16_to_cpu(dip->di_anextents);
 	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
 	switch (dip->di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
@@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
 			&nextents, &acount);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents != be16_to_cpu(dip->di_anextents))
+	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	/* Check nblocks against the inode. */
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 4a3d13d4a0228..dff20f2b368ea 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
 	to->di_nblocks = from->di_nblocks;
 	to->di_extsize = from->di_extsize;
 	to->di_nextents = from->di_nextents;
-	to->di_anextents = from->di_anextents;
+	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = from->di_dmevmask;
@@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
 		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
 		to->di_flags2 = from->di_flags2;
 		to->di_cowextsize = from->di_cowextsize;
+		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
 		to->di_ino = ip->i_ino;
 		to->di_lsn = lsn;
 		memset(to->di_pad2, 0, sizeof(to->di_pad2));
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 11c3502b07b13..ba3fae95b2260 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
 	struct xfs_log_dinode	*ldip;
 	uint			isize;
 	int			need_free = 0;
+	uint32_t		nextents;
 
 	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
 		in_f = item->ri_buf[0].i_addr;
@@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
 			goto out_release;
 		}
 	}
-	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
+
+	nextents = ldip->di_anextents_lo;
+	if (xfs_sb_version_has_v3inode(&mp->m_sb))
+		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
+
+	nextents += ldip->di_nextents;
+
+	if (unlikely(nextents > ldip->di_nblocks)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
 				     XFS_ERRLEVEL_LOW, mp, ldip,
 				     sizeof(*ldip));
@@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
 	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
 	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
 			__func__, item, dip, bp, in_f->ilf_ino,
-			ldip->di_nextents + ldip->di_anextents,
-			ldip->di_nblocks);
+			nextents, ldip->di_nblocks);
 		error = -EFSCORRUPTED;
 		goto out_release;
 	}
-- 
2.19.1



* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
@ 2020-04-06 15:25   ` Brian Foster
  2020-04-06 22:57     ` Dave Chinner
  2020-04-07  0:49   ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Brian Foster @ 2020-04-06 15:25 UTC
  To: Chandan Rajendra; +Cc: linux-xfs, david, chandan, darrick.wong

On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> Log space reservation for xattr insert operation is divided into two
> parts,
> 1. Mount time
>    - Inode
>    - Superblock for accounting space allocations
>    - AGF for accounting space used by count, block number, rmap and refcnt
>      btrees.
> 
> 2. The remaining log space can only be calculated at run time because,
>    - A local xattr can be large enough to cause a double split of the da
>      btree.
>    - The value of the xattr can be large enough to be stored in remote
>      blocks. The contents of the remote blocks are not logged.
> 
>    The log space reservation could be,
>    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
>      case xattr is large enough to cause another split of the da btree path.
>    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
>      entries.
>    - Space for logging blocks of count, block number, rmap and refcnt btrees.
> 
> At present, mount time log reservation includes block count required for a
> single split of the dabtree. The dabtree block count is also taken into
> account by xfs_attr_calc_size().
> 
> Also, AGF log space reservation isn't accounted for.
> 
> Due to the reasons mentioned above, log reservation calculation for xattr
> insert operation gives an incorrect value.
> 
> Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
> an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.
> 
> The above-mentioned inconsistencies were discovered when an attempt to mount
> a modified XFS filesystem, one that uses a 32-bit value as the xattr extent
> counter, caused the following warning messages to be printed on the console,
> 
> XFS (loop0): Mounting V4 Filesystem
> XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> XFS (loop0): Log size out of supported range.
> XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> XFS (loop0): Ending clean mount
> 
> To fix the inconsistencies described above, this commit replaces 'mount'
> and 'runtime' components with just one static reservation. The new
> reservation calculates the log space for the worst case possible i.e. it
> considers,
> 1. Double split of the da btree.
>    This happens for large local xattrs.
> 2. Bmbt blocks required for mapping the contents of a maximum
>    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> 

Hmm.. so the last I recall looking at this, the change was more around
refactoring the mount vs. runtime portions of the xattr reservation to
be more accurate. This approach eliminates the runtime portion for a
100% mount time reservation calculation. Can you elaborate on why the
change in approach? Also, it looks like at least one tradeoff here is
reservation size (to the point where we increase min log size on small
filesystems?).

Brian

> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_attr.c        |  6 +---
>  fs/xfs/libxfs/xfs_log_rlimit.c  | 29 ------------------
>  fs/xfs/libxfs/xfs_trans_resv.c  | 54 +++++++++++++++------------------
>  fs/xfs/libxfs/xfs_trans_resv.h  |  5 +--
>  fs/xfs/libxfs/xfs_trans_space.h |  8 ++++-
>  5 files changed, 33 insertions(+), 69 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
> index e4fe3dca9883b..74dca80224f17 100644
> --- a/fs/xfs/libxfs/xfs_attr.c
> +++ b/fs/xfs/libxfs/xfs_attr.c
> @@ -337,11 +337,7 @@ xfs_attr_set(
>  				return error;
>  		}
>  
> -		tres.tr_logres = M_RES(mp)->tr_attrsetm.tr_logres +
> -				 M_RES(mp)->tr_attrsetrt.tr_logres *
> -					args->total;
> -		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> -		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
> +		tres = M_RES(mp)->tr_attrset;
>  		total = args->total;
>  	} else {
>  		XFS_STATS_INC(mp, xs_attr_remove);
> diff --git a/fs/xfs/libxfs/xfs_log_rlimit.c b/fs/xfs/libxfs/xfs_log_rlimit.c
> index 7f55eb3f36536..7aa9e6684ecd6 100644
> --- a/fs/xfs/libxfs/xfs_log_rlimit.c
> +++ b/fs/xfs/libxfs/xfs_log_rlimit.c
> @@ -15,27 +15,6 @@
>  #include "xfs_da_btree.h"
>  #include "xfs_bmap_btree.h"
>  
> -/*
> - * Calculate the maximum length in bytes that would be required for a local
> - * attribute value as large attributes out of line are not logged.
> - */
> -STATIC int
> -xfs_log_calc_max_attrsetm_res(
> -	struct xfs_mount	*mp)
> -{
> -	int			size;
> -	int			nblks;
> -
> -	size = xfs_attr_leaf_entsize_local_max(mp->m_attr_geo->blksize) -
> -	       MAXNAMELEN - 1;
> -	nblks = XFS_DAENTER_SPACE_RES(mp, XFS_ATTR_FORK);
> -	nblks += XFS_B_TO_FSB(mp, size);
> -	nblks += XFS_NEXTENTADD_SPACE_RES(mp, size, XFS_ATTR_FORK);
> -
> -	return  M_RES(mp)->tr_attrsetm.tr_logres +
> -		M_RES(mp)->tr_attrsetrt.tr_logres * nblks;
> -}
> -
>  /*
>   * Iterate over the log space reservation table to figure out and return
>   * the maximum one in terms of the pre-calculated values which were done
> @@ -49,9 +28,6 @@ xfs_log_get_max_trans_res(
>  	struct xfs_trans_res	*resp;
>  	struct xfs_trans_res	*end_resp;
>  	int			log_space = 0;
> -	int			attr_space;
> -
> -	attr_space = xfs_log_calc_max_attrsetm_res(mp);
>  
>  	resp = (struct xfs_trans_res *)M_RES(mp);
>  	end_resp = (struct xfs_trans_res *)(M_RES(mp) + 1);
> @@ -64,11 +40,6 @@ xfs_log_get_max_trans_res(
>  			*max_resp = *resp;		/* struct copy */
>  		}
>  	}
> -
> -	if (attr_space > log_space) {
> -		*max_resp = M_RES(mp)->tr_attrsetm;	/* struct copy */
> -		max_resp->tr_logres = attr_space;
> -	}
>  }
>  
>  /*
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
> index d1a0848cb52ec..b44b521c605c7 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.c
> +++ b/fs/xfs/libxfs/xfs_trans_resv.c
> @@ -19,6 +19,7 @@
>  #include "xfs_trans.h"
>  #include "xfs_qm.h"
>  #include "xfs_trans_space.h"
> +#include "xfs_attr_remote.h"
>  
>  #define _ALLOC	true
>  #define _FREE	false
> @@ -698,42 +699,36 @@ xfs_calc_attrinval_reservation(
>  }
>  
>  /*
> - * Setting an attribute at mount time.
> + * Setting an attribute.
>   *	the inode getting the attribute
>   *	the superblock for allocations
> - *	the agfs extents are allocated from
> + *	the agf extents are allocated from
>   *	the attribute btree * max depth
> - *	the inode allocation btree
> - * Since attribute transaction space is dependent on the size of the attribute,
> - * the calculation is done partially at mount time and partially at runtime(see
> - * below).
> + *	the bmbt entries for da btree blocks
> + *	the bmbt entries for remote blocks (if any)
> + *	the allocation btrees.
>   */
>  STATIC uint
> -xfs_calc_attrsetm_reservation(
> +xfs_calc_attrset_reservation(
>  	struct xfs_mount	*mp)
>  {
> +	int			max_rmt_blks;
> +	int			da_blks;
> +	int			bmbt_blks;
> +
> +	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
> +	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
> +
> +	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
> +	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
> +
>  	return XFS_DQUOT_LOGRES(mp) +
>  		xfs_calc_inode_res(mp, 1) +
>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(XFS_DA_NODE_MAXDEPTH, XFS_FSB_TO_B(mp, 1));
> -}
> -
> -/*
> - * Setting an attribute at runtime, transaction space unit per block.
> - * 	the superblock for allocations: sector size
> - *	the inode bmap btree could join or split: max depth * block size
> - * Since the runtime attribute transaction space is dependent on the total
> - * blocks needed for the 1st bmap, here we calculate out the space unit for
> - * one block so that the caller could figure out the total space according
> - * to the attibute extent length in blocks by:
> - *	ext * M_RES(mp)->tr_attrsetrt.tr_logres
> - */
> -STATIC uint
> -xfs_calc_attrsetrt_reservation(
> -	struct xfs_mount	*mp)
> -{
> -	return xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> -		xfs_calc_buf_res(XFS_BM_MAXLEVELS(mp, XFS_ATTR_FORK),
> +		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> +		xfs_calc_buf_res(da_blks, XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_buf_res(bmbt_blks, XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, da_blks),
>  				 XFS_FSB_TO_B(mp, 1));
>  }
>  
> @@ -897,9 +892,9 @@ xfs_trans_resv_calc(
>  	resp->tr_attrinval.tr_logcount = XFS_ATTRINVAL_LOG_COUNT;
>  	resp->tr_attrinval.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
> -	resp->tr_attrsetm.tr_logres = xfs_calc_attrsetm_reservation(mp);
> -	resp->tr_attrsetm.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> -	resp->tr_attrsetm.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
> +	resp->tr_attrset.tr_logres = xfs_calc_attrset_reservation(mp);
> +	resp->tr_attrset.tr_logcount = XFS_ATTRSET_LOG_COUNT;
> +	resp->tr_attrset.tr_logflags |= XFS_TRANS_PERM_LOG_RES;
>  
>  	resp->tr_attrrm.tr_logres = xfs_calc_attrrm_reservation(mp);
>  	resp->tr_attrrm.tr_logcount = XFS_ATTRRM_LOG_COUNT;
> @@ -942,7 +937,6 @@ xfs_trans_resv_calc(
>  	resp->tr_ichange.tr_logres = xfs_calc_ichange_reservation(mp);
>  	resp->tr_fsyncts.tr_logres = xfs_calc_swrite_reservation(mp);
>  	resp->tr_writeid.tr_logres = xfs_calc_writeid_reservation(mp);
> -	resp->tr_attrsetrt.tr_logres = xfs_calc_attrsetrt_reservation(mp);
>  	resp->tr_clearagi.tr_logres = xfs_calc_clear_agi_bucket_reservation(mp);
>  	resp->tr_growrtzero.tr_logres = xfs_calc_growrtzero_reservation(mp);
>  	resp->tr_growrtfree.tr_logres = xfs_calc_growrtfree_reservation(mp);
> diff --git a/fs/xfs/libxfs/xfs_trans_resv.h b/fs/xfs/libxfs/xfs_trans_resv.h
> index 7241ab28cf84f..f50996ae18e6c 100644
> --- a/fs/xfs/libxfs/xfs_trans_resv.h
> +++ b/fs/xfs/libxfs/xfs_trans_resv.h
> @@ -35,10 +35,7 @@ struct xfs_trans_resv {
>  	struct xfs_trans_res	tr_writeid;	/* write setuid/setgid file */
>  	struct xfs_trans_res	tr_attrinval;	/* attr fork buffer
>  						 * invalidation */
> -	struct xfs_trans_res	tr_attrsetm;	/* set/create an attribute at
> -						 * mount time */
> -	struct xfs_trans_res	tr_attrsetrt;	/* set/create an attribute at
> -						 * runtime */
> +	struct xfs_trans_res	tr_attrset;	/* set/create an attribute */
>  	struct xfs_trans_res	tr_attrrm;	/* remove an attribute */
>  	struct xfs_trans_res	tr_clearagi;	/* clear agi unlinked bucket */
>  	struct xfs_trans_res	tr_growrtalloc;	/* grow realtime allocations */
> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index 88221c7a04ccf..6a22ad11b3825 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -38,8 +38,14 @@
>  
>  #define	XFS_DAENTER_1B(mp,w)	\
>  	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
> +/*
> + * xattr set operation can cause the da btree to split twice in the
> + * worst case. The double split is actually an extra leaf node rather
> + * than a complete split of blocks in the path from root to a
> + * leaf. The '1' in the macro below accounts for the extra leaf node.
> + */
>  #define	XFS_DAENTER_DBS(mp,w)	\
> -	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
> +	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
>  #define	XFS_DAENTER_BLOCKS(mp,w)	\
>  	(XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
>  #define	XFS_DAENTER_BMAP1B(mp,w)	\
> -- 
> 2.19.1
> 



* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
@ 2020-04-06 16:45   ` Brian Foster
  2020-04-08 12:40     ` Chandan Rajendra
  2020-04-06 17:06   ` Darrick J. Wong
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 37+ messages in thread
From: Brian Foster @ 2020-04-06 16:45 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, david, chandan, darrick.wong

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
>  > 100 million hardlinks. With parent pointers being stored in xattr,
> we will overflow the 16-bits wide xattr extent counter when large
> number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 

Just a couple of high-level comments on the first pass...

It looks like the feature bit is only set by mkfs. That raises a couple
questions. First, what about a fix for older/existing filesystems? Even
if we can't exceed the 16bit extent count, I would think we should be
able to fail more gracefully than allowing a write verifier to fail and
shut down the fs. What happens when/if we run into a data fork extent
count limit, for example?
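
As a rough idea of what "fail more gracefully" could look like, here is a
minimal standalone sketch of a check made before extents are added, so
that the operation returns an error instead of tripping a verifier later.
The helper name, its parameters and the choice of -EFBIG are assumptions
made for illustration; nothing like this is in the posted patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Mirrors the 16-bit MAXAEXTNUM limit (0x7fff) from xfs_types.h. */
#define ATTR_EXTENTS_MAX_16BIT	((uint32_t)INT16_MAX)

/*
 * Hypothetical pre-check: return 0 if nr_to_add more extents still fit
 * under max_extents, or a negative errno if the counter would overflow.
 */
static int
attr_extent_count_check(uint32_t cur_extents, uint32_t nr_to_add,
			uint32_t max_extents)
{
	/* This sketch only models signed 16-bit and 32-bit limits. */
	assert(max_extents <= (uint32_t)INT32_MAX);

	/* Written so that the comparison itself cannot wrap around. */
	if (cur_extents > max_extents ||
	    nr_to_add > max_extents - cur_extents)
		return -EFBIG;
	return 0;
}
```

A caller would run such a check while it can still back out cleanly and
hand the error to userspace, rather than letting the counter wrap.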

Second, I also wonder if enabling an incompat feature bit by default in
mkfs is a bit extreme. Perhaps this should be tied to a mkfs flag for a
period of time? Maybe others have thoughts on that, but I'd at minimum
request to introduce and enable said bit by default in separate patches
to make it a bit easier for distro releases to identify and manage the
incompatibility.

Brian

> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> +	__u8		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;
> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;
> +}
>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
>  
>  /*
>   * Minimum and maximum blocksize and sectorsize.
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index 6d483ab29e639..3b624e24ae868 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -371,10 +371,12 @@ xchk_dinode(
>  		break;
>  	}
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +
>  	/* di_forkoff */
>  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
>  		xchk_ino_set_corrupt(sc, ino);
> -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +	if (nextents != 0 && dip->di_forkoff == 0)
>  		xchk_ino_set_corrupt(sc, ino);
>  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
>  		xchk_ino_set_corrupt(sc, ino);
> @@ -386,7 +388,6 @@ xchk_dinode(
>  		xchk_ino_set_corrupt(sc, ino);
>  
>  	/* di_anextents */
> -	nextents = be16_to_cpu(dip->di_anextents);
>  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
>  			&nextents, &acount);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents != be16_to_cpu(dip->di_anextents))
> +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	/* Check nblocks against the inode. */
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..dff20f2b368ea 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = from->di_dmevmask;
> @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
>  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
>  		to->di_flags2 = from->di_flags2;
>  		to->di_cowextsize = from->di_cowextsize;
> +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;
> +
> +	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
>  				     XFS_ERRLEVEL_LOW, mp, ldip,
>  				     sizeof(*ldip));
> @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
>  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
>  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
>  			__func__, item, dip, bp, in_f->ilf_ino,
> -			ldip->di_nextents + ldip->di_anextents,
> -			ldip->di_nblocks);
> +			nextents, ldip->di_nblocks);
>  		error = -EFSCORRUPTED;
>  		goto out_release;
>  	}
> -- 
> 2.19.1
> 



* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
  2020-04-06 16:45   ` Brian Foster
@ 2020-04-06 17:06   ` Darrick J. Wong
  2020-04-06 23:30     ` Dave Chinner
  2020-04-08 12:42     ` Chandan Rajendra
  2020-04-07  1:20   ` Dave Chinner
  2020-04-27  7:39   ` Christoph Hellwig
  3 siblings, 2 replies; 37+ messages in thread
From: Darrick J. Wong @ 2020-04-06 17:06 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, david, chandan, bfoster

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
>  > 100 million hardlinks. With parent pointers being stored in xattr,
> we will overflow the 16-bits wide xattr extent counter when large
> number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 
> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)

If you're going to introduce an INCOMPAT feature, please also use the
opportunity to convert xattrs to something resembling the dir v3 format,
where we index free space within each block so that we can speed up attr
setting with 100 million attrs.

>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */

I was expecting you to use di_pad, not di_pad2... :)

> +	__u8		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{

XFS style indenting, please.

> +	int32_t anextents;

When would we have negative extent count?

(Yes, this is a bug in the xfs_extnum/xfs_aextnum typedefs, bah...)
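
To make the signedness problem concrete, here is a small standalone sketch
of what happens when a count larger than 0x7fff lands in a signed 16-bit
counter such as the old xfs_aextnum_t. The function name is hypothetical,
and the value 45620 used below is only an illustrative input that happens
to map to -19916. The "total extents" figure in the console message also
includes the data fork count, so the real attr fork count is not known
from the report.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Reinterpret an extent count as a signed 16-bit value, the way an
 * int16_t counter would present it after growing past INT16_MAX.
 * The wraparound is computed by hand so the result does not depend on
 * implementation-defined integer conversion.
 */
static int16_t
as_signed_16bit_counter(uint32_t nr_extents)
{
	/* Only the first wrap is modelled here. */
	assert(nr_extents <= UINT16_MAX);

	if (nr_extents > (uint32_t)INT16_MAX)
		return (int16_t)((int32_t)nr_extents - 65536);
	return (int16_t)nr_extents;
}
```

For example, as_signed_16bit_counter(32767) is still 32767, but 32768
becomes -32768 and 45620 becomes -19916.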

> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))

v3inode?  I thought this had a separate incompat flag?

> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);

/me would have thought you'd do the splitting and endian conversion in
the opposite order, e.g.:

	be32 x = dip->di_anextents_lo;
	if (has32bitattrcount)
		x |= (be32)dip->di_anextents_hi << 16;
	return be32_to_cpu(x);
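
For anyone wanting to experiment with the lo/hi split outside the kernel,
below is a standalone sketch. It uses explicit big-endian byte arrays in
place of the kernel's __be16 fields and be16_to_cpu()/cpu_to_be16()
helpers, so it behaves the same on any host. The struct, the function
names and the has_hi flag are hypothetical; has_hi stands in for whichever
feature check ends up gating the high half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the two on-disk fields; both are stored big-endian. */
struct disk_counter {
	uint8_t	lo[2];	/* low 16 bits of the extent count */
	uint8_t	hi[2];	/* high 16 bits, only valid when has_hi */
};

/* Split a 32-bit count into the two big-endian 16-bit halves. */
static void
counter_to_disk(struct disk_counter *d, uint32_t v, bool has_hi)
{
	/* Without the high half, only 16 bits can be stored. */
	assert(has_hi || v <= UINT16_MAX);

	d->lo[0] = (v >> 8) & 0xff;
	d->lo[1] = v & 0xff;
	d->hi[0] = has_hi ? (v >> 24) & 0xff : 0;
	d->hi[1] = has_hi ? (v >> 16) & 0xff : 0;
}

/* Reassemble the count; the high half is ignored unless has_hi. */
static uint32_t
counter_from_disk(const struct disk_counter *d, bool has_hi)
{
	uint32_t	v = ((uint32_t)d->lo[0] << 8) | d->lo[1];

	if (has_hi)
		v |= ((uint32_t)d->hi[0] << 24) | ((uint32_t)d->hi[1] << 16);
	return v;
}
```

Each half is converted to host order before the halves are combined, which
is the order the posted patch uses. A reader that does not know about the
high half sees only the low 16 bits of the count, which is why the format
change needs a feature flag.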

> +
> +	return anextents;
> +}
>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */

Need to preserve both limits so that we can do the correct check for the
given feature set.
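
A minimal sketch of what I mean, not a drop-in for the patch. The
SKETCH_* names and the has_32bit_aext argument are made up for
illustration; the real check would key off whatever feature bit ends
up being used. It keeps both limits around and only trusts the high
16 bits when the feature is present:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the patch only has a single MAXAEXTNUM. */
#define SKETCH_MAXAEXTNUM_16	((int32_t)0x7fff)	/* signed short */
#define SKETCH_MAXAEXTNUM_32	((int32_t)0x7fffffff)	/* signed int */

/* Pick the limit that matches the filesystem's feature set. */
static inline int32_t
sketch_max_aextnum(bool has_32bit_aext)
{
	return has_32bit_aext ? SKETCH_MAXAEXTNUM_32 : SKETCH_MAXAEXTNUM_16;
}

/* Split the in-core count into the two 16-bit on-disk halves. */
static inline uint16_t
sketch_aext_lo(int32_t count)
{
	assert(count >= 0);
	return (uint16_t)((uint32_t)count & 0xffff);
}

static inline uint16_t
sketch_aext_hi(int32_t count)
{
	assert(count >= 0);
	return (uint16_t)((uint32_t)count >> 16);
}

/* Rebuild the count; the high half only counts when the feature is set. */
static inline int32_t
sketch_aext_join(uint16_t lo, uint16_t hi, bool has_32bit_aext)
{
	uint32_t	count = lo;

	if (has_32bit_aext)
		count |= (uint32_t)hi << 16;
	return (int32_t)count;
}

/* Range check against the limit for the given feature set. */
static inline bool
sketch_aext_valid(int32_t count, bool has_32bit_aext)
{
	return count >= 0 && count <= sketch_max_aextnum(has_32bit_aext);
}
```

Endian conversion is left out on purpose; the point is only that the
verifier compares against the limit that matches the feature set.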

>  
>  /*
>   * Minimum and maximum blocksize and sectorsize.
> diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> index 6d483ab29e639..3b624e24ae868 100644
> --- a/fs/xfs/scrub/inode.c
> +++ b/fs/xfs/scrub/inode.c
> @@ -371,10 +371,12 @@ xchk_dinode(
>  		break;
>  	}
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +
>  	/* di_forkoff */
>  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
>  		xchk_ino_set_corrupt(sc, ino);
> -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> +	if (nextents != 0 && dip->di_forkoff == 0)
>  		xchk_ino_set_corrupt(sc, ino);
>  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
>  		xchk_ino_set_corrupt(sc, ino);
> @@ -386,7 +388,6 @@ xchk_dinode(
>  		xchk_ino_set_corrupt(sc, ino);
>  
>  	/* di_anextents */
> -	nextents = be16_to_cpu(dip->di_anextents);
>  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
>  	switch (dip->di_aformat) {
>  	case XFS_DINODE_FMT_EXTENTS:
> @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
>  			&nextents, &acount);
>  	if (!xchk_should_check_xref(sc, &error, NULL))
>  		return;
> -	if (nextents != be16_to_cpu(dip->di_anextents))
> +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
>  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
>  
>  	/* Check nblocks against the inode. */
> diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> index 4a3d13d4a0228..dff20f2b368ea 100644
> --- a/fs/xfs/xfs_inode_item.c
> +++ b/fs/xfs/xfs_inode_item.c
> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = from->di_dmevmask;
> @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
>  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
>  		to->di_flags2 = from->di_flags2;
>  		to->di_cowextsize = from->di_cowextsize;
> +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
>  		to->di_ino = ip->i_ino;
>  		to->di_lsn = lsn;
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;
> +
> +	if (unlikely(nextents > ldip->di_nblocks)) {
>  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
>  				     XFS_ERRLEVEL_LOW, mp, ldip,
>  				     sizeof(*ldip));
> @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
>  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
>  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
>  			__func__, item, dip, bp, in_f->ilf_ino,
> -			ldip->di_nextents + ldip->di_anextents,
> -			ldip->di_nblocks);
> +			nextents, ldip->di_nblocks);
>  		error = -EFSCORRUPTED;
>  		goto out_release;
>  	}
> -- 
> 2.19.1
> 

* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-06 15:25   ` Brian Foster
@ 2020-04-06 22:57     ` Dave Chinner
  2020-04-07  5:11       ` Chandan Rajendra
  2020-04-07 12:59       ` Brian Foster
  0 siblings, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2020-04-06 22:57 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong

On Mon, Apr 06, 2020 at 11:25:40AM -0400, Brian Foster wrote:
> On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > Log space reservation for xattr insert operation is divided into two
> > parts,
> > 1. Mount time
> >    - Inode
> >    - Superblock for accounting space allocations
> >    - AGF for accounting space used by count, block number, rmap and refcnt
> >      btrees.
> > 
> > 2. The remaining log space can only be calculated at run time because,
> >    - A local xattr can be large enough to cause a double split of the da
> >      btree.
> >    - The value of the xattr can be large enough to be stored in remote
> >      blocks. The contents of the remote blocks are not logged.
> > 
> >    The log space reservation could be,
> >    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
> >      case xattr is large enough to cause another split of the da btree path.
> >    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
> >      entries.
> >    - Space for logging blocks of count, block number, rmap and refcnt btrees.
> > 
> > At present, mount time log reservation includes block count required for a
> > single split of the dabtree. The dabtree block count is also taken into
> > account by xfs_attr_calc_size().
> > 
> > Also, AGF log space reservation isn't accounted for.
> > 
> > Due to the reasons mentioned above, log reservation calculation for xattr
> > insert operation gives an incorrect value.
> > 
> > Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
> > an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.
> > 
> > The above mentioned inconsistencies were discoverd when trying to mount a
> > modified XFS filesystem which uses a 32-bit value as xattr extent counter
> > caused the following warning messages to be printed on the console,
> > 
> > XFS (loop0): Mounting V4 Filesystem
> > XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> > XFS (loop0): Log size out of supported range.
> > XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> > XFS (loop0): Ending clean mount
> > 
> > To fix the inconsistencies described above, this commit replaces 'mount'
> > and 'runtime' components with just one static reservation. The new
> > reservation calculates the log space for the worst case possible i.e. it
> > considers,
> > 1. Double split of the da btree.
> >    This happens for large local xattrs.
> > 2. Bmbt blocks required for mapping the contents of a maximum
> >    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> > 
> 
> Hmm.. so the last I recall looking at this, the change was more around
> refactoring the mount vs. runtime portions of the xattr reservation to
> be more accurate. This approach eliminates the runtime portion for a
> 100% mount time reservation calculation. Can you elaborate on why the
> change in approach? Also, it looks like at least one tradeoff here is
> reservation size (to the point where we increase min log size on small
> filesystems?).

What's not in this commit message is that this was actually my idea
that I had when Chandan contacted me off list about his refactoring
of the reservation blowing out reservations for attribute operations
by a factor of 10.

My fault, I should have pushed the discussion back to the mailing
list rather than answering directly.  I'll repeat a lot of my
analysis from that discussion below to get everyone up to speed.

[ Chandan, in future I'm going to insist that all your XFS questions
need to be on the list, so that everyone sees the discussions and
understands the reasons why things are suggested. It's also a good
idea to use "suggested-by" when presenting code based on other
people's ideas, just so that everyone knows that there were more
people involved than just yourself... ]

So, when I went through all the reservations changes that Chandan
had made I realised that the current code was wrong in lots of ways,
and when I looked at it from the fundamental changes being made the
mount vs runtime split made no sense at all.

Such as:

- the dabtree double split was a double _leaf block split only_. It
  is not a full tree split, and can only result in a single parent
  split because there is only one path update after the double leaf
  split has been done. Hence it can only do one full dabtree split
  and the code in xfs_attr_calc_size() that doubles the block count
  reservation for the double leaf split is wrong.  We only need one
  extra block, and that is just:

dgc>  #define XFS_DAENTER_DBS(mp,w)        \
dgc> -     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
dgc> +     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))

[ Note: The "+ 2" for the data fork reservation is for the dir data
block and a potential free space index block that get added in a
typical directory entry addition. ]

- remote attributes are not logged, so only the BMBT block
  reservation is needed for that extent allocation. i.e. we need to
  reserve the blocks for the xattr, but we don't need log space
  for them.  xfs_attr_calc_size() gets this right,
  xfs_log_calc_max_attrsetm_res() gets this wrong in that it does
  not take into account remote attr BMBT blocks at all.

- xfs_attr_calc_size() calculates the number of blocks we need to
  allocate for the attr operation, not the number of blocks we need
  to log, hence can't be used to replace
  xfs_log_calc_max_attrsetm_res().

- the runtime reservation is just a BMBT block reservation for a
  single block to be allocated. Multiplying the number of blocks we
  need to allocate by M_RES(mp)->tr_attrsetrt.tr_logres to get the
  log reservation is wrong. We are not doing a full BMBT split for
  every block in the attribute we modify, so the log reservation is
  massively oversized by xfs_log_calc_max_attrsetm_res() and
  xfs_attr_set() multiplying the block count (including BMBT
  blocks we allocate) by a full bmbt split reservation.

IOWs, the code as it stands now is just wrong. It works because it
massively oversizes the runtime reservations, but that in itself is
a problem.  To quote myself again from that analysis:

dgc> The log reservation that covers both local and remote attributes:
dgc>
dgc> blks =  full dabtree split + 1 leaf block + bmbt blocks
dgc> blks += nextent_res(MAX_ATTR_LEN/block size) // bmbt blocks only
dgc> resv =	inode + sb + agf +
dgc>		xfs_calc_buf_res(blks) +
dgc>		allocfree_log_count(blks);

The first line takes into account the blocks we modify in a local
attribute tree modification. The second line takes into account the
BMBT logging overhead of a remote attribute. The "resv" calculation
converts that modified block count into a log reservation and adds
the freespace tree logging overhead of allocating all those blocks.
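
To make the block counting concrete, here is a rough sketch in plain
C. It is not the kernel code: the struct, its fields and the sketch_*
helpers are hypothetical stand-ins for values the kernel derives from
the mount, and how the "bmbt blocks" term for the dabtree blocks is
counted is an assumption of this sketch. Converting the block count
into bytes (the "resv" line) is left out:

```c
#include <assert.h>

/* All fields are hypothetical stand-ins for values derived from the mount. */
struct sketch_attr_geom {
	unsigned int	da_maxdepth;	/* cf. XFS_DA_NODE_MAXDEPTH */
	unsigned int	bm_maxlevels;	/* attr fork BMBT max height */
	unsigned int	ext_per_blk;	/* extents mapped per BMBT block */
	unsigned int	max_rmt_blks;	/* largest remote value, in fs blocks */
};

/* Same shape as XFS_NEXTENTADD_SPACE_RES(): BMBT blocks to map nblks. */
static unsigned int
sketch_nextent_res(const struct sketch_attr_geom *g, unsigned int nblks)
{
	assert(g->ext_per_blk > 0 && g->bm_maxlevels > 0);
	return ((nblks + g->ext_per_blk - 1) / g->ext_per_blk) *
			(g->bm_maxlevels - 1);
}

/* Blocks whose modification must fit in the log reservation. */
static unsigned int
sketch_attrset_log_blks(const struct sketch_attr_geom *g)
{
	unsigned int	da_blks = g->da_maxdepth + 1;	/* split + 1 leaf */
	unsigned int	blks;

	/* Assumption: one nextent_res() covers mapping the dabtree blocks. */
	blks = da_blks + sketch_nextent_res(g, da_blks);
	/* Remote value: only the BMBT blocks, the data itself is not logged. */
	blks += sketch_nextent_res(g, g->max_rmt_blks);
	return blks;
}
```

Nothing in there depends on the size of the attribute being set, only
on the maximum remote value size, which is why the result can be
computed once.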

The only thing that is variable at runtime now is the size of the
remote attribute, but we already have a log reservation for the
allocation and BMBT block modification side of that and so we only
need to physically reserve the block space (i.e. via
block count passed to xfs_trans_alloc()).

IOWs, the log reservation does not need to change at runtime now.

It also makes it clear that changing the attr fork extent count from
16 to 32 bits should only impact the BMBT reservations.  The dabtree
reservations already use XFS_DA_NODE_MAXDEPTH for the attr fork and
hence they are already sized for max dabtree depth reservations.

As a result, the attr reservation itself should not grow excessively
for 32bit attribute fork extent counts. It should grow by maybe 20-30
blocks on a 4kb block size filesystem as we add 4-5 levels to the max
depth of the BMBT on the attribute fork. It should certainly not grow
by 400 blocks as the original reworking resulted in...

Again, sorry for not getting this discussion out onto the mailing
list originally, it should have been there.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-06 17:06   ` Darrick J. Wong
@ 2020-04-06 23:30     ` Dave Chinner
  2020-04-08 12:43       ` Chandan Rajendra
  2020-04-08 15:45       ` Darrick J. Wong
  2020-04-08 12:42     ` Chandan Rajendra
  1 sibling, 2 replies; 37+ messages in thread
From: Dave Chinner @ 2020-04-06 23:30 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Rajendra, linux-xfs, chandan, bfoster

On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> 
> If you're going to introduce an INCOMPAT feature, please also use the
> opportunity to convert xattrs to something resembling the dir v3 format,
> where we index free space within each block so that we can speed up attr
> setting with 100 million attrs.

Not necessary. Chandan has already spent a lot of time investigating
that - I suggested doing the investigation probably a year ago when
he was looking for stuff to do knowing that this could be a problem
parent pointers hit. Long story short - there's no degradation in
performance in the dabtree out to tens of millions of records with
different fixed size or random sized attributes, nor do various
combinations of insert/lookup/remove/replace operations seem to
impact the tree performance at scale. IOWs, we hit the 16 bit extent
limits of the attribute trees without finding any degradation in
performance.

Hence we concluded that the dabtree structure does not require
significant modification or optimisation to work well with typical
parent pointer attribute demands...

As for free space indexes....

The issue with the directory structure that requires external free
space is that the directory data is not part of the dabtree itself.
The attribute fork stores all the attributes at the leaves of the
dabtree, while the directory structure stores the directory data in
external blocks and the dabtree only contains the name hash index
that points to the external data.

i.e. When we add an attribute to the dabtree, we split/merge leaves
of the tree based on where the name hash index tells us it needs to
be inserted/removed from. i.e. we make space available or collapse
sparse leaves of the dabtree as a side effect of inserting or
removing objects.

The directory structure is very different. The dirents cannot change
location as their logical offset into the dir data segment is used
as the readdir/seekdir/telldir cookie. Therefore that location is
not allowed to change for the life of the dirent and so we can't
store them in the leaves of a dabtree indexed in hash order because
the offset into the tree would change as other entries are inserted
and removed.  Hence when we remove dirents, we must leave holes in
the data segment so the rest of the dirent data does not change
logical offset.

The directory name hash index - the dabtree bit - is in a separate
segment (the 2nd one). Because it only stores pointers to dirents in
the data segment, it doesn't need to leave holes - the dabtree just
merge/splits as required as pointers to the dir data segment are
added/removed - and has no free space tracking.

Hence when we go to add a dirent, we need to find the best free
space in the dir data segment to add that dirent. This requires a
dir data segment free space index, and that is held in the 3rd dir
segment.  Once we've found the best free space via lookup in the
free space index, we go modify the dir data block it points to, then
update the dabtree to point the name hash at that new dirent.

IOWs, the requirement for a free space map in the directory
structure results from storing the dirent data externally to the
dabtree. Attributes are stored directly in the leaves of the
dabtree - except for remote attributes which can be anywhere in the
BMBT address space - and hence do not need external free space
tracking to determine where to best insert them...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
  2020-04-06 15:25   ` Brian Foster
@ 2020-04-07  0:49   ` Dave Chinner
  2020-04-08  8:47     ` Chandan Rajendra
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-04-07  0:49 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, chandan, darrick.wong, bfoster

[chopped bits out of the diff to get the whole reservation in one
 obvious piece of code.]

On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> @@ -698,42 +699,36 @@ xfs_calc_attrinval_reservation(
>  }
>  
>  /*
> + * Setting an attribute.
>   *	the inode getting the attribute
>   *	the superblock for allocations
> + *	the agf extents are allocated from
>   *	the attribute btree * max depth
> + *	the bmbt entries for da btree blocks
> + *	the bmbt entries for remote blocks (if any)
> + *	the allocation btrees.
>   */
>  STATIC uint
> -xfs_calc_attrsetm_reservation(
> +xfs_calc_attrset_reservation(
>  	struct xfs_mount	*mp)
>  {
> +	int			max_rmt_blks;
> +	int			da_blks;
> +	int			bmbt_blks;
> +
> +	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);

#define XFS_DAENTER_BLOCKS(mp,w)        \
        (XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
#define XFS_DAENTER_1B(mp,w)    \
        ((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
#define XFS_DAENTER_DBS(mp,w)   \
	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))

So, da_blks contains the full da btree split depth * 1 block. i.e.

	da_blks = XFS_DA_NODE_MAXDEPTH;

> +	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);

#define XFS_DAENTER_BMAPS(mp,w)         \
        (XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w))

#define XFS_DAENTER_BMAP1B(mp,w)        \
        XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)

So, bmbt_blks contains the full da btree split depth * the BMBT
overhead for a single block allocation:

#define XFS_EXTENTADD_SPACE_RES(mp,w)   (XFS_BM_MAXLEVELS(mp,w) - 1)
#define XFS_NEXTENTADD_SPACE_RES(mp,b,w)\
        (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
	          XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
		            XFS_EXTENTADD_SPACE_RES(mp,w))

XFS_NEXTENTADD_SPACE_RES(1) = ((1 + N - 1) / N) * (XFS_BM_MAXLEVELS - 1)
		= (XFS_BM_MAXLEVELS - 1)

So, bmbt_blks = XFS_DA_NODE_MAXDEPTH * (XFS_BM_MAXLEVELS - 1)

IOWs, this bmbt reservation is assuming a full height BMBT
modification on *every* dabtree node allocation. IOWs, we're
reserving several times more log space for potential bmbt
modifications than we are for the entire dabtree modification.
That's why the individual dabtree reservations are so big....
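
The macro expansion above can be checked with plain integers. This is
only a sketch: the sketch_* functions are stand-ins for the macros,
and the numbers plugged into them are whatever the caller picks (the
BMBT max height in particular depends on the filesystem geometry):

```c
#include <assert.h>

/* Stand-in for XFS_NEXTENTADD_SPACE_RES(mp, b, w) with plain integers. */
static unsigned int
sketch_nextentadd_space_res(unsigned int b, unsigned int n,
			    unsigned int bm_maxlevels)
{
	/* n plays the role of XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp). */
	assert(n > 0 && bm_maxlevels > 0);
	return ((b + n - 1) / n) * (bm_maxlevels - 1);
}

/* Stand-in for XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK). */
static unsigned int
sketch_daenter_bmaps(unsigned int da_maxdepth, unsigned int n,
		     unsigned int bm_maxlevels)
{
	/* One single-block BMBT reservation per dabtree level. */
	return da_maxdepth * sketch_nextentadd_space_res(1, n, bm_maxlevels);
}
```

For b = 1 the rounding term is always 1 whatever n is, so the result
is just da_maxdepth * (bm_maxlevels - 1). With a dabtree depth of 5
and, say, a 9 level BMBT, that is 40 BMBT blocks reserved against 5
dabtree blocks.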

> +	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
> +	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);

And this assumes we are going to log at least another full bmbt
modification.

It seems to me that the worst case here is one full split and then
every other allocation inserts at the start of an existing block and
so updates pointers all the way up to the root. The impact is
limited, though, because XFS_DA_NODE_MAXDEPTH = 5 and so the attr
fork BMBT tree is not likely to reach anywhere near its max depth
on large filesystems.....

>  	return XFS_DQUOT_LOGRES(mp) +
>  		xfs_calc_inode_res(mp, 1) +
>  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> +		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> +		xfs_calc_buf_res(da_blks, XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_buf_res(bmbt_blks, XFS_FSB_TO_B(mp, 1)) +
> +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, da_blks),
>  				 XFS_FSB_TO_B(mp, 1));

Given the above, this looks OK. Worst case BMBT usage looks
excessive, but there is a chance it could be required...

> diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> index 88221c7a04ccf..6a22ad11b3825 100644
> --- a/fs/xfs/libxfs/xfs_trans_space.h
> +++ b/fs/xfs/libxfs/xfs_trans_space.h
> @@ -38,8 +38,14 @@
>  
>  #define	XFS_DAENTER_1B(mp,w)	\
>  	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
> +/*
> + * xattr set operation can cause the da btree to split twice in the
> + * worst case. The double split is actually an extra leaf node rather
> + * than a complete split of blocks in the path from root to a
> + * leaf. The '1' in the macro below accounts for the extra leaf node.

It's not a double tree split, so don't describe it that way and then
have to explain that it's not a double tree split!

/*
 * When inserting a large local record into the dabtree leaf, we may
 * need to split the leaf twice to make room to fit the new record
 * into the new leaf. This double leaf split still only requires a
 * single dabtree path update as the inserted leaves are at adjacent
 * indexes. Hence we only need to account for the extra leaf
 * block in the attribute fork here.
 */

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
  2020-04-06 16:45   ` Brian Foster
  2020-04-06 17:06   ` Darrick J. Wong
@ 2020-04-07  1:20   ` Dave Chinner
  2020-04-08 12:45     ` Chandan Rajendra
                       ` (2 more replies)
  2020-04-27  7:39   ` Christoph Hellwig
  3 siblings, 3 replies; 37+ messages in thread
From: Dave Chinner @ 2020-04-07  1:20 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, chandan, darrick.wong, bfoster

On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> which
> 1. Creates 1,000,000 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to create 400,000 new 255-byte sized xattrs
> causes the following message to be printed on the console,
> 
> XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> 
> This indicates that we overflowed the 16-bits wide xattr extent counter.
> 
> I have been informed that there are instances where a single file has
>  > 100 million hardlinks. With parent pointers being stored in xattr,
> we will overflow the 16-bits wide xattr extent counter when large
> number of hardlinks are created.
> 
> Hence this commit extends xattr extent counter to 32-bits. It also introduces
> an incompat flag to prevent older kernels from mounting newer filesystems with
> 32-bit wide xattr extent counter.
> 
> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
>  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
>  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
>  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
>  fs/xfs/libxfs/xfs_types.h      |  4 ++--
>  fs/xfs/scrub/inode.c           |  7 ++++---
>  fs/xfs/xfs_inode_item.c        |  3 ++-
>  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
>  8 files changed, 63 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 045556e78ee2c..0a4266b0d46e1 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
>  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
>  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
>  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
>  #define XFS_SB_FEAT_INCOMPAT_ALL \
>  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
>  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
>  
>  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
>  static inline bool
> @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>  	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be16		di_anextents_lo;/* lower part of xattr extent count */
>  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	__s8		di_aformat;	/* format of attr fork's data */
>  	__be32		di_dmevmask;	/* DMIG event mask */
> @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
>  	__be64		di_lsn;		/* flush sequence */
>  	__be64		di_flags2;	/* more random flags */
>  	__be32		di_cowextsize;	/* basic cow extent size for file */
> -	__u8		di_pad2[12];	/* more padding for future expansion */
> +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> +	__u8		di_pad2[10];	/* more padding for future expansion */

Ok, I think you've limited what we can do here by using this "fill
holes" variable split. I've never liked doing this, and we've only
done it in the past when we haven't had space in the inode to create
a new 32 bit variable.

IOWs, this is a v5 format feature only, so we should just create a
new variable:

	__be32		di_attr_nextents;

With that in place, we can now do what we did extending the v1 inode
link count (16 bits) to the v2 inode link count (32 bits).

That is, when the attribute count is going to overflow, we set an
inode flag on disk to indicate that it now has a 32 bit extent count
and uses that field in the inode, and we set a RO-compat feature
flag in the superblock to indicate that there are 32 bit attr fork
extent counts in use.

Old kernels can still read the filesystem, but see the extent count
as "max" (65535); they can't modify the attr fork and hence can't
corrupt the 32 bit count they know nothing about.

If the kernel sees the RO feature bit set, it can set the inode flag
on inodes it is modifying and update both the old and new counters
appropriately when flushing the inode to disk (i.e. transparent
conversion).

In future, mkfs can then set the RO feature flag by default so all
new filesystems use the 32 bit counter.
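
A minimal sketch of that scheme, assuming made-up names throughout (the
struct, the flag and both helpers are hypothetical and not part of this
series) and leaving out the big-endian on-disk conversion:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; on-disk endian conversion is omitted for brevity. */
#define DEMO_DIFLAG2_ATTR32	(1ULL << 0)	/* inode uses the 32-bit counter */
#define DEMO_OLD_AEXT_MAX	0xffffU		/* largest value of the 16-bit field */

struct demo_dinode {
	uint16_t	di_anextents;		/* legacy 16-bit attr extent count */
	uint64_t	di_flags2;		/* per-inode feature flags */
	uint32_t	di_attr_nextents;	/* new 32-bit attr extent count */
};

/* Read the attr fork extent count, using the new field only when flagged. */
static inline uint32_t demo_attr_nextents(const struct demo_dinode *dip)
{
	if (dip->di_flags2 & DEMO_DIFLAG2_ATTR32)
		return dip->di_attr_nextents;
	return dip->di_anextents;
}

/*
 * Store a new count. ro_feature models the superblock RO-compat bit.
 * Returns true when the caller still has to set that superblock bit.
 */
static inline bool demo_set_attr_nextents(struct demo_dinode *dip,
					  uint32_t count, bool ro_feature)
{
	bool use32 = ro_feature || count > DEMO_OLD_AEXT_MAX ||
		     (dip->di_flags2 & DEMO_DIFLAG2_ATTR32);

	if (!use32) {
		/* Legacy layout: the count still fits in 16 bits. */
		dip->di_anextents = (uint16_t)count;
		return false;
	}

	/* Convert the inode and keep both counters consistent. */
	dip->di_flags2 |= DEMO_DIFLAG2_ATTR32;
	dip->di_attr_nextents = count;
	dip->di_anextents = count > DEMO_OLD_AEXT_MAX ?
			    DEMO_OLD_AEXT_MAX : (uint16_t)count;
	assert(demo_attr_nextents(dip) == count);
	return !ro_feature;
}
```

The old field saturates at 65535 once the real count no longer fits, which
is the "max" value an old kernel would see on a read-only mount.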

>  	/* fields only written to during inode creation */
>  	xfs_timestamp_t	di_crtime;	/* time created */
> @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,

If you are converting a macro to static inline, then all the caller
sites should be converted to lower case at the same time.

> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;

Extent counts should be unsigned, as they are on disk.
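
To illustrate why the signedness matters, here is a minimal sketch with
hypothetical helper names (neither function exists in the patch). It shows
that a 16-bit on-disk count read through a signed type goes negative above
32767, and that joining the lo/hi halves only stays correct in an unsigned
32-bit type:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reinterpret the 16 on-disk bits as the old signed in-memory type. */
static inline int16_t demo_as_signed16(uint16_t raw)
{
	int16_t v;

	memcpy(&v, &raw, sizeof(v));	/* bit-for-bit copy, no value conversion */
	return v;
}

/* Join the lower and higher 16-bit halves into one unsigned count. */
static inline uint32_t demo_join_halves(uint16_t lo, uint16_t hi)
{
	return (uint32_t)lo | ((uint32_t)hi << 16);
}
```

For example, demo_as_signed16(45620) yields -19916, while
demo_join_halves(0, 0x8000) yields 0x80000000, a value that an int32_t
cannot represent as a positive count.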

> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;
> +}

No feature bit to indicate that 32 bit attribute extent counts are
valid?

>  
>  /*
>   * For block and character special files the 32bit dev_t is stored at the
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 39c5a6e24915c..ced8195bd8c22 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -232,7 +232,8 @@ xfs_inode_from_disk(
>  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
>  	to->di_extsize = be32_to_cpu(from->di_extsize);
>  	to->di_nextents = be32_to_cpu(from->di_nextents);
> -	to->di_anextents = be16_to_cpu(from->di_anextents);
> +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> +				XFS_ATTR_FORK);

This should be open coded, but I'd prefer a completely separate
variable...

>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat	= from->di_aformat;
>  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> @@ -282,7 +283,7 @@ xfs_inode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -296,6 +297,8 @@ xfs_inode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi
> +			= cpu_to_be16((u32)(from->di_anextents) >> 16);

Again, feature bit for on-disk format modifications needed...

>  		to->di_ino = cpu_to_be64(ip->i_ino);
>  		to->di_lsn = cpu_to_be64(lsn);
>  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
>  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
>  	to->di_extsize = cpu_to_be32(from->di_extsize);
>  	to->di_nextents = cpu_to_be32(from->di_nextents);
> -	to->di_anextents = cpu_to_be16(from->di_anextents);
> +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
>  	to->di_forkoff = from->di_forkoff;
>  	to->di_aformat = from->di_aformat;
>  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
>  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
>  		to->di_flags2 = cpu_to_be64(from->di_flags2);
>  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
>  		to->di_ino = cpu_to_be64(from->di_ino);
>  		to->di_lsn = cpu_to_be64(from->di_lsn);
>  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	uint32_t		di_nextents;
> +
> +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -436,6 +442,9 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	int32_t			nextents;
> +	int32_t			anextents;
> +	int64_t			nblocks;

Extent counts need to be converted to unsigned in memory - they are
unsigned on disk....

>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
>  		return __this_address;
> @@ -466,10 +475,12 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> +	nblocks = be64_to_cpu(dip->di_nblocks);
> +
>  	/* Fork checks carried over from xfs_iformat_fork */
> -	if (mode &&
> -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> -			be64_to_cpu(dip->di_nblocks))
> +	if (mode && nextents + anextents > nblocks)
>  		return __this_address;
>  
>  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> @@ -526,7 +537,7 @@ xfs_dinode_verify(
>  		default:
>  			return __this_address;
>  		}
> -		if (dip->di_anextents)
> +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
>  			return __this_address;
>  	}
>  
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> index 518c6f0ec3a61..080fd0c156a1e 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.c
> +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> @@ -207,9 +207,10 @@ xfs_iformat_extents(
>  	int			whichfork)
>  {
>  	struct xfs_mount	*mp = ip->i_mount;
> +	struct xfs_sb		*sb = &mp->m_sb;
>  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
>  	int			state = xfs_bmap_fork_to_state(whichfork);
> -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
>  	int			size = nex * sizeof(xfs_bmbt_rec_t);
>  	struct xfs_iext_cursor	icur;
>  	struct xfs_bmbt_rec	*dp;
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index e3400c9c71cdb..5db92aa508bc5 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -397,7 +397,7 @@ struct xfs_log_dinode {
>  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
>  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
>  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
>  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
>  	int8_t		di_aformat;	/* format of attr fork's data */
>  	uint32_t	di_dmevmask;	/* DMIG event mask */
> @@ -414,7 +414,8 @@ struct xfs_log_dinode {
>  	xfs_lsn_t	di_lsn;		/* flush sequence */
>  	uint64_t	di_flags2;	/* more random flags */
>  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */

So, unsigned in the log, as on disk...

> +	uint8_t		di_pad2[10];	/* more padding for future expansion */
>  
>  	/* fields only written to during inode creation */
>  	xfs_ictimestamp_t di_crtime;	/* time created */
> diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> index 397d94775440d..01669aa65745a 100644
> --- a/fs/xfs/libxfs/xfs_types.h
> +++ b/fs/xfs/libxfs/xfs_types.h
> @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */

.... but not in memory?

>  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
>  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
>  
> @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
>   */
>  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
>  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */

What about for older filesystems where MAXAEXTNUM is unchanged?

> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index 11c3502b07b13..ba3fae95b2260 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
>  	struct xfs_log_dinode	*ldip;
>  	uint			isize;
>  	int			need_free = 0;
> +	uint32_t		nextents;
>  
>  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
>  		in_f = item->ri_buf[0].i_addr;
> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);

What happens if we are recovering from a filesystem that doesn't
know anything about di_anextents_hi and never wrote anything to
the log for this field?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-06 22:57     ` Dave Chinner
@ 2020-04-07  5:11       ` Chandan Rajendra
  2020-04-07 12:59       ` Brian Foster
  1 sibling, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-07  5:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong

On Tuesday, April 7, 2020 4:27 AM Dave Chinner wrote: 
> On Mon, Apr 06, 2020 at 11:25:40AM -0400, Brian Foster wrote:
> > On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > > Log space reservation for xattr insert operation is divided into two
> > > parts,
> > > 1. Mount time
> > >    - Inode
> > >    - Superblock for accounting space allocations
> > >    - AGF for accounting space used by count, block number, rmap and refcnt
> > >      btrees.
> > > 
> > > 2. The remaining log space can only be calculated at run time because,
> > >    - A local xattr can be large enough to cause a double split of the da
> > >      btree.
> > >    - The value of the xattr can be large enough to be stored in remote
> > >      blocks. The contents of the remote blocks are not logged.
> > > 
> > >    The log space reservation could be,
> > >    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
> > >      case xattr is large enough to cause another split of the da btree path.
> > >    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
> > >      entries.
> > >    - Space for logging blocks of count, block number, rmap and refcnt btrees.
> > > 
> > > At present, mount time log reservation includes block count required for a
> > > single split of the dabtree. The dabtree block count is also taken into
> > > account by xfs_attr_calc_size().
> > > 
> > > Also, AGF log space reservation isn't accounted for.
> > > 
> > > Due to the reasons mentioned above, log reservation calculation for xattr
> > > insert operation gives an incorrect value.
> > > 
> > > Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
> > > an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.
> > > 
> > > The above mentioned inconsistencies were discovered when mounting a
> > > modified XFS filesystem, which uses a 32-bit value as its xattr extent
> > > counter, caused the following warning messages to be printed on the console,
> > > 
> > > XFS (loop0): Mounting V4 Filesystem
> > > XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> > > XFS (loop0): Log size out of supported range.
> > > XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> > > XFS (loop0): Ending clean mount
> > > 
> > > To fix the inconsistencies described above, this commit replaces 'mount'
> > > and 'runtime' components with just one static reservation. The new
> > > reservation calculates the log space for the worst case possible i.e. it
> > > considers,
> > > 1. Double split of the da btree.
> > >    This happens for large local xattrs.
> > > 2. Bmbt blocks required for mapping the contents of a maximum
> > >    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> > > 
> > 
> > Hmm.. so the last I recall looking at this, the change was more around
> > refactoring the mount vs. runtime portions of the xattr reservation to
> > be more accurate. This approach eliminates the runtime portion for a
> > 100% mount time reservation calculation. Can you elaborate on why the
> > change in approach? Also, it looks like at least one tradeoff here is
> > reservation size (to the point where we increase min log size on small
> > filesystems?).
> 
> What's not in this commit message is that this was actually my idea
> that I had when Chandan contacted me off list about his refactoring
> of the reservation blowing out reservations for attribute operations
> by a factor of 10.
> 
> My fault, I should have pushed the discussion back to the mailing
> list rather than answering directly.  I'll repeat a lot of my
> analysis from that discussion below to get everyone up to speed.
> 
> [ Chandan, in future I'm going to insist that all your XFS questions
> need to be on the list, so that everyone sees the discussions and
> understands the reasons why things are suggested. It's also a good
> idea to use "suggested-by" when presenting code based on other
> people's ideas, just so that everyone knows that there were more
> people involved than just yourself... ]
> 
> So, when I went through all the reservations changes that Chandan
> had made I realised that the current code was wrong in lots of ways,
> and when I looked at it from the fundamental changes being made the
> mount vs runtime split made no sense at all.
> 
> Such as:
> 
> - the dabtree double split was a double _leaf block split only_. It
>   is not a full tree split, and can only result in a single parent
>   split because there is only one path update after the double leaf
>   split has been done. Hence it can only do one full dabtree split
>   and the code in xfs_attr_calc_size() that doubles the block count
>   reservation for the double leaf split is wrong.  We only need one
>   extra block, and that is just:
> 
> dgc>  #define XFS_DAENTER_DBS(mp,w)        \
> dgc> -     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
> dgc> +     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
> 
> [ Note: The "+ 2" for the data fork reservation is for the dir data
> block and a potential free space index block that get added in a
> typical directory entry addition. ]
> 
> - remote attributes are not logged, so only the BMBT block
>   reservation is needed for that extent allocation. i.e. we need to
>   reserve the blocks for the xattr, but we don't need log space
>   for them.  xfs_attr_calc_size() gets this right,
>   xfs_log_calc_max_attrsetm_res() gets this wrong in that it does
>   not take into account remote attr BMBT blocks at all.
> 
> - xfs_attr_calc_size() calculates the number of blocks we need to
>   allocate for the attr operation, not the number of blocks we need
>   to log, hence can't be used to replace
>   xfs_log_calc_max_attrsetm_res().
> 
> - the runtime reservation is just a BMBT block reservation for a
>   single block to be allocated. multiplying the number of blocks we
>   need to allocate by M_RES(mp)->tr_attrsetrt.tr_logres to get the
>   log reservation is wrong. We are not doing a full BMBT split for
>   every block in the attribute we modify, so the log reservation is
>   massively oversized by xfs_log_calc_max_attrsetm_res() and
>   xfs_attr_set() by multiplying the block count (including BMBT
>   blocks we allocate) by a full bmbt split reservation.
> 
> IOWs, the code as it stands now is just wrong. It works because it
> massively oversizes the runtime reservations, but that in itself is
> a problem.  To quote myself again from that analysis:
> 
> dgc> The log reservation that covers both local and remote attributes:
> dgc>
> dgc> blks =  full dabtree split + 1 leaf block + bmbt blocks
> dgc> blks += nextent_res(MAX_ATTR_LEN/block size) // bmbt blocks only
> dgc> resv =	inode + sb + agf +
> dgc>		xfs_calc_buf_res(blks) +
> dgc>		allocfree_log_count(blks);
> 
> The first line takes into account the blocks we modify in a local
> attribute tree modification. The second line takes into account the
> BMBT logging overhead of a remote attribute. The "resv" calculation
> converts that modified block count into a log reservation and adds
> the freespace tree logging overhead of allocating all those blocks.
> 
> The only thing that is variable at runtime now is the size of the
> remote attribute, but we already have a log reservation for the
> allocation and BMBT block modification side of that and so we only
> need to physically reserve the block space (i.e. via
> block count passed to xfs_trans_alloc()).
> 
> IOWs, the log reservation does not need to change at runtime now.
> 
> It also makes it clear that changing the attr fork extent count from
> 16 to 32 bits should only impact the BMBT reservations.  The dabtree
> reservations already use XFS_DA_NODE_MAXDEPTH for the attr fork and
> hence they are already sized for max dabtree depth reservations.
> 
> As a result, the attr reservation itself should not grow excessively
> for 32bit attribute fork extent counts. It should be maybe 20-30 blocks
> on a 4kb block size filesystem as we add 4-5 levels to the max depth
> of the BMBT on the attribute fork. It should certainly not grow by
> 400 blocks as the original reworking resulted in...
> 
> Again, sorry for not getting this discussion out onto the mailing
> list originally, it should have been there.

It is actually me at fault here. I am sorry for not having the conversation on
the mailing list in the first place.

-- 
chandan





* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-06 22:57     ` Dave Chinner
  2020-04-07  5:11       ` Chandan Rajendra
@ 2020-04-07 12:59       ` Brian Foster
  1 sibling, 0 replies; 37+ messages in thread
From: Brian Foster @ 2020-04-07 12:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong

On Tue, Apr 07, 2020 at 08:57:58AM +1000, Dave Chinner wrote:
> On Mon, Apr 06, 2020 at 11:25:40AM -0400, Brian Foster wrote:
> > On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > > Log space reservation for xattr insert operation is divided into two
> > > parts,
> > > 1. Mount time
> > >    - Inode
> > >    - Superblock for accounting space allocations
> > >    - AGF for accounting space used by count, block number, rmap and refcnt
> > >      btrees.
> > > 
> > > 2. The remaining log space can only be calculated at run time because,
> > >    - A local xattr can be large enough to cause a double split of the da
> > >      btree.
> > >    - The value of the xattr can be large enough to be stored in remote
> > >      blocks. The contents of the remote blocks are not logged.
> > > 
> > >    The log space reservation could be,
> > >    - (XFS_DA_NODE_MAXDEPTH + 1) number of blocks. The "+ 1" is required in
> > >      case xattr is large enough to cause another split of the da btree path.
> > >    - BMBT blocks for storing (XFS_DA_NODE_MAXDEPTH + 1) record
> > >      entries.
> > >    - Space for logging blocks of count, block number, rmap and refcnt btrees.
> > > 
> > > At present, mount time log reservation includes block count required for a
> > > single split of the dabtree. The dabtree block count is also taken into
> > > account by xfs_attr_calc_size().
> > > 
> > > Also, AGF log space reservation isn't accounted for.
> > > 
> > > Due to the reasons mentioned above, log reservation calculation for xattr
> > > insert operation gives an incorrect value.
> > > 
> > > Apart from the above, xfs_log_calc_max_attrsetm_res() passes byte count as
> > > an argument to XFS_NEXTENTADD_SPACE_RES() instead of block count.
> > > 
> > > The above mentioned inconsistencies were discovered when mounting a
> > > modified XFS filesystem, which uses a 32-bit value as its xattr extent
> > > counter, caused the following warning messages to be printed on the console,
> > > 
> > > XFS (loop0): Mounting V4 Filesystem
> > > XFS (loop0): Log size 2560 blocks too small, minimum size is 4035 blocks
> > > XFS (loop0): Log size out of supported range.
> > > XFS (loop0): Continuing onwards, but if log hangs are experienced then please report this message in the bug report.
> > > XFS (loop0): Ending clean mount
> > > 
> > > To fix the inconsistencies described above, this commit replaces 'mount'
> > > and 'runtime' components with just one static reservation. The new
> > > reservation calculates the log space for the worst case possible i.e. it
> > > considers,
> > > 1. Double split of the da btree.
> > >    This happens for large local xattrs.
> > > 2. Bmbt blocks required for mapping the contents of a maximum
> > >    sized (i.e. XATTR_SIZE_MAX bytes in size) remote attribute.
> > > 
> > 
> > Hmm.. so the last I recall looking at this, the change was more around
> > refactoring the mount vs. runtime portions of the xattr reservation to
> > be more accurate. This approach eliminates the runtime portion for a
> > 100% mount time reservation calculation. Can you elaborate on why the
> > change in approach? Also, it looks like at least one tradeoff here is
> > reservation size (to the point where we increase min log size on small
> > filesystems?).
> 
> What's not in this commit message is that this was actually my idea
> that I had when Chandan contacted me off list about his refactoring
> of the reservation blowing out reservations for attribute operations
> by a factor of 10.
> 
> My fault, I should have pushed the discussion back to the mailing
> list rather than answering directly.  I'll repeat a lot of my
> analysis from that discussion below to get everyone up to speed.
> 
> [ Chandan, in future I'm going to insist that all your XFS questions
> need to be on the list, so that everyone sees the discussions and
> understands the reasons why things are suggested. It's also a good
> idea to use "suggested-by" when presenting code based on other
> people's ideas, just so that everyone knows that there were more
> people involved than just yourself... ]
> 
> So, when I went through all the reservations changes that Chandan
> had made I realised that the current code was wrong in lots of ways,
> and when I looked at it from the fundamental changes being made the
> mount vs runtime split made no sense at all.
> 
> Such as:
> 
> - the dabtree double split was a double _leaf block split only_. It
>   is not a full tree split, and can only result in a single parent
>   split because there is only one path update after the double leaf
>   split has been done. Hence it can only do one full dabtree split
>   and the code in xfs_attr_calc_size() that doubles the block count
>   reservation for the double leaf split is wrong.  We only need one
>   extra block, and that is just:
> 
> dgc>  #define XFS_DAENTER_DBS(mp,w)        \
> dgc> -     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
> dgc> +     (XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 1))
> 
> [ Note: The "+ 2" for the data fork reservation is for the dir data
> block and a potential free space index block that get added in a
> typical directory entry addition. ]
> 
> - remote attributes are not logged, so only the BMBT block
>   reservation is needed for that extent allocation. i.e. we need to
>   reserve the blocks for the xattr, but we don't need log space
>   for them.  xfs_attr_calc_size() gets this right,
>   xfs_log_calc_max_attrsetm_res() gets this wrong in that it does
>   not take into account remote attr BMBT blocks at all.
> 
> - xfs_attr_calc_size() calculates the number of blocks we need to
>   allocate for the attr operation, not the number of blocks we need
>   to log, hence can't be used to replace
>   xfs_log_calc_max_attrsetm_res().
> 
> - the runtime reservation is just a BMBT block reservation for a
>   single block to be allocated. multiplying the number of blocks we
>   need to allocate by M_RES(mp)->tr_attrsetrt.tr_logres to get the
>   log reservation is wrong. We are not doing a full BMBT split for
>   every block in the attribute we modify, so the log reservation is
>   massively oversized by xfs_log_calc_max_attrsetm_res() and
>   xfs_attr_set() by multiplying the block count (including BMBT
>   blocks we allocate) by a full bmbt split reservation.
> 
> IOWs, the code as it stands now is just wrong. It works because it
> massively oversizes the runtime reservations, but that in itself is
> a problem.  To quote myself again from that analysis:
> 
> dgc> The log reservation that covers both local and remote attributes:
> dgc>
> dgc> blks =  full dabtree split + 1 leaf block + bmbt blocks
> dgc> blks += nextent_res(MAX_ATTR_LEN/block size) // bmbt blocks only
> dgc> resv =	inode + sb + agf +
> dgc>		xfs_calc_buf_res(blks) +
> dgc>		allocfree_log_count(blks);
> 
> The first line takes into account the blocks we modify in a local
> attribute tree modification. The second line takes into account the
> BMBT logging overhead of a remote attribute. The "resv" calculation
> converts that modified block count into a log reservation and adds
> the freespace tree logging overhead of allocating all those blocks.
> 
> The only thing that is variable at runtime now is the size of the
> remote attribute, but we already have a log reservation for the
> allocation and BMBT block modification side of that and so we only
> need to physically reserve the block space (i.e. via
> block count passed to xfs_trans_alloc()).
> 
> IOWs, the log reservation does not need to change at runtime now.
> 

Ok, that looks reasonable and nicely simplified from the current
approach. My primary concern on first seeing this patch was reservation
size. The remote blocks were the portion that stood out because it's
obviously not universal for xattr operations, but having taken a closer
look I see that it's actually a small portion of the overall
reservation. E.g., on one of my test volumes the reservation comes out
to 467448 bytes. It only drops to 459000 if I comment out the remote
block bmbt bits so we'd only save 8k or so by making that conditional.
Of course the unit reservation is still large (though the full
reservation is not actually the largest when you factor in logcount),
but that seems mostly a side effect of making it correct. ;P
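
For reference, the block-count side of the quoted formula can be sketched
as below. This is only an approximation under stated assumptions: the
helper names and all parameters are hypothetical, the first helper merely
mirrors the shape of XFS_NEXTENTADD_SPACE_RES(), and the conversion from
blocks to log bytes (buffer headers, inode, sb, agf, free space btrees) is
not modeled.

```c
#include <assert.h>
#include <stdint.h>

/*
 * BMBT blocks that may be logged when mapping nblks new blocks:
 * ceil(nblks / extents_per_block) walks of (bm_maxlevels - 1) blocks.
 */
static inline uint32_t demo_nextent_res(uint32_t nblks,
					uint32_t extents_per_block,
					uint32_t bm_maxlevels)
{
	assert(extents_per_block > 0 && bm_maxlevels > 0);
	return ((nblks + extents_per_block - 1) / extents_per_block) *
	       (bm_maxlevels - 1);
}

/*
 * Modified block count for an xattr set: a full dabtree split plus one
 * extra leaf, the BMBT overhead of each of those blocks (an assumption
 * about what "bmbt blocks" covers), and the BMBT overhead of mapping a
 * maximum sized remote value.
 */
static inline uint32_t demo_attrset_blks(uint32_t da_maxdepth,
					 uint32_t rmt_blks,
					 uint32_t extents_per_block,
					 uint32_t bm_maxlevels)
{
	uint32_t da_blks = da_maxdepth + 1;
	uint32_t bmbt_blks;

	bmbt_blks = da_blks * demo_nextent_res(1, extents_per_block,
					       bm_maxlevels);
	bmbt_blks += demo_nextent_res(rmt_blks, extents_per_block,
				      bm_maxlevels);
	return da_blks + bmbt_blks;
}
```

With made-up inputs (dabtree depth 5, 17 remote blocks, 100 extents per
block), raising the maximum BMBT height from 5 to 9 levels grows the result
from 34 to 62 blocks, i.e. by 28 blocks, which is in the same range as the
20-30 block estimate quoted in this thread.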

> It also makes it clear that changing the attr fork extent count from
> 16 to 32 bits should only impact the BMBT reservations.  The dabtree
> reservations already use XFS_DA_NODE_MAXDEPTH for the attr fork and
> hence they are already sized for max dabtree depth reservations.
> 
> As a result, the attr reservation itself should not grow excessively
> for 32bit attribute fork extent counts. It should be maybe 20-30 blocks
> on a 4kb block size filesystem as we add 4-5 levels to the max depth
> of the BMBT on the attribute fork. It should certainly not grow by
> 400 blocks as the original reworking resulted in...
> 

Indeed.

> Again, sorry for not getting this discussion out onto the mailing
> list originally, it should have been there.
> 

No problem, thanks for the background/context.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation
  2020-04-07  0:49   ` Dave Chinner
@ 2020-04-08  8:47     ` Chandan Rajendra
  0 siblings, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-08  8:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, bfoster

On Tuesday, April 7, 2020 6:19 AM Dave Chinner wrote: 
> [chopped bits out of the diff to get the whole reservation in one
>  obvious piece of code.]
> 
> On Sat, Apr 04, 2020 at 02:22:02PM +0530, Chandan Rajendra wrote:
> > @@ -698,42 +699,36 @@ xfs_calc_attrinval_reservation(
> >  }
> >  
> >  /*
> > + * Setting an attribute.
> >   *	the inode getting the attribute
> >   *	the superblock for allocations
> > + *	the agf extents are allocated from
> >   *	the attribute btree * max depth
> > + *	the bmbt entries for da btree blocks
> > + *	the bmbt entries for remote blocks (if any)
> > + *	the allocation btrees.
> >   */
> >  STATIC uint
> > -xfs_calc_attrsetm_reservation(
> > +xfs_calc_attrset_reservation(
> >  	struct xfs_mount	*mp)
> >  {
> > +	int			max_rmt_blks;
> > +	int			da_blks;
> > +	int			bmbt_blks;
> > +
> > +	da_blks = XFS_DAENTER_BLOCKS(mp, XFS_ATTR_FORK);
> 
> #define XFS_DAENTER_BLOCKS(mp,w)        \
>         (XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w))
> #define XFS_DAENTER_1B(mp,w)    \
>         ((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
> #define XFS_DAENTER_DBS(mp,w)   \
> 	(XFS_DA_NODE_MAXDEPTH + (((w) == XFS_DATA_FORK) ? 2 : 0))
> 
> So, da_blks contains the full da btree split depth * 1 block. i.e.
> 
> 	da_blks = XFS_DA_NODE_MAXDEPTH;
> 
> > +	bmbt_blks = XFS_DAENTER_BMAPS(mp, XFS_ATTR_FORK);
> 
> #define XFS_DAENTER_BMAPS(mp,w)         \
>         (XFS_DAENTER_DBS(mp,w) * XFS_DAENTER_BMAP1B(mp,w))
> 
> #define XFS_DAENTER_BMAP1B(mp,w)        \
>         XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
> 
> So, bmbt_blks contains the full da btree split depth * the BMBT
> overhead for a single block allocation:
> 
> #define XFS_EXTENTADD_SPACE_RES(mp,w)   (XFS_BM_MAXLEVELS(mp,w) - 1)
> #define XFS_NEXTENTADD_SPACE_RES(mp,b,w)\
>         (((b + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) / \
> 	          XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * \
> 		            XFS_EXTENTADD_SPACE_RES(mp,w))
> 
> XFS_NEXTENTADD_SPACE_RES(1) = ((1 + N - 1) / N) * (XFS_BM_MAXLEVELS - 1)
> 		= (XFS_BM_MAXLEVELS - 1)
> 
> So, bmbt_blks = XFS_DA_NODE_MAXDEPTH * (XFS_BM_MAXLEVELS - 1)
> 
> IOWs, this bmbt reservation is assuming a full height BMBT
> modification on *every* dabtree node allocation. IOWs, we're
> reserving multiple times the log space for potential bmbt
> modifications than we are for the entire dabtree modification.
> That's why the individual dabtree reservations are so big....
> 
> > +	max_rmt_blks = xfs_attr3_rmt_blocks(mp, XATTR_SIZE_MAX);
> > +	bmbt_blks += XFS_NEXTENTADD_SPACE_RES(mp, max_rmt_blks, XFS_ATTR_FORK);
> 
> And this assumes we are going to log at least another full bmbt
> modification.
> 
> It seems to me that the worst case here is one full split and then
> every other allocation inserts at the start of an existing block and
> so updates pointers all the way up to the root. The impact is
> limited, though, because XFS_DA_NODE_MAXDEPTH = 5 and so the attr
> fork BMBT tree is not likely to reach anywhere near its max depth
> on large filesystems.....
> 
> >  	return XFS_DQUOT_LOGRES(mp) +
> >  		xfs_calc_inode_res(mp, 1) +
> >  		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > +		xfs_calc_buf_res(1, mp->m_sb.sb_sectsize) +
> > +		xfs_calc_buf_res(da_blks, XFS_FSB_TO_B(mp, 1)) +
> > +		xfs_calc_buf_res(bmbt_blks, XFS_FSB_TO_B(mp, 1)) +
> > +		xfs_calc_buf_res(xfs_allocfree_log_count(mp, da_blks),
> >  				 XFS_FSB_TO_B(mp, 1));
> 
> Given the above, this looks OK. Worst case BMBT usage looks
> excessive, but there is a chance it could be required...
> 
> > diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
> > index 88221c7a04ccf..6a22ad11b3825 100644
> > --- a/fs/xfs/libxfs/xfs_trans_space.h
> > +++ b/fs/xfs/libxfs/xfs_trans_space.h
> > @@ -38,8 +38,14 @@
> >  
> >  #define	XFS_DAENTER_1B(mp,w)	\
> >  	((w) == XFS_DATA_FORK ? (mp)->m_dir_geo->fsbcount : 1)
> > +/*
> > + * xattr set operation can cause the da btree to split twice in the
> > + * worst case. The double split is actually an extra leaf node rather
> > + * than a complete split of blocks in the path from root to a
> > + * leaf. The '1' in the macro below accounts for the extra leaf node.
> 
> It's not a double tree split, so don't describe it that way and then
> have to explain that it's not a double tree split!
> 
> /*
>  * When inserting a large local record into the dabtree leaf, we may
>  * need to split the leaf twice to make room to fit the new record
>  * into the new leaf. This double leaf split still only requires a
>  * single dabtree path update as the inserted leaves are at adjacent
>  * indexes. Hence we only need to account for the extra leaf
>  * block in the attribute fork here.
>  */

Sure. I will change the comment. Thanks for the review.
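
For reference, the reservation arithmetic quoted above can be sketched as a
small standalone C helper. This is only an illustration, not kernel code: the
function names are made up, and the BMBT height and the number of contiguous
extent records per block are passed in as plain parameters, since the real
values depend on the filesystem geometry. Only the dabtree depth of 5 comes
from the review.

```c
/* Stated in the review above: XFS_DA_NODE_MAXDEPTH = 5. */
#define DA_NODE_MAXDEPTH 5

/*
 * Mirrors the shape of XFS_NEXTENTADD_SPACE_RES(): the number of BMBT blocks
 * that may be modified when mapping 'nblks' new blocks, given 'per_blk'
 * contiguous extent records per BMBT block and a BMBT that is 'bm_maxlevels'
 * levels tall. Both parameters are assumptions supplied by the caller.
 */
static unsigned int nextentadd_space_res(unsigned int nblks,
					 unsigned int per_blk,
					 unsigned int bm_maxlevels)
{
	return ((nblks + per_blk - 1) / per_blk) * (bm_maxlevels - 1);
}

/*
 * bmbt_blks for the attr fork dabtree: one single-block allocation per
 * dabtree level, each assumed to modify a full-height BMBT path.
 */
static unsigned int attr_dabtree_bmbt_blks(unsigned int per_blk,
					   unsigned int bm_maxlevels)
{
	return DA_NODE_MAXDEPTH * nextentadd_space_res(1, per_blk, bm_maxlevels);
}
```

For a single block the rounding term collapses to 1, so the result is just
(bm_maxlevels - 1) per dabtree level, which is the multiplication Dave points
out.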

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-06 16:45   ` Brian Foster
@ 2020-04-08 12:40     ` Chandan Rajendra
  0 siblings, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-08 12:40 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chandan Rajendra, linux-xfs, david, darrick.wong

On Monday, April 6, 2020 10:15 PM Brian Foster wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> 
> Just a couple high level comments on the first pass...
> 
> It looks like the feature bit is only set by mkfs. That raises a couple
> questions. First, what about a fix for older/existing filesystems? Even
> if we can't exceed the 16bit extent count, I would think we should be
> able to fail more gracefully than allowing a write verifier to fail and
> shutdown the fs. What happens when/if we run into a data fork extent
> count limit, for example?

Yes, I agree that for older filesystems I should write a separate patch to
check for the 16-bit overflow case.

This applies to the data fork extent counter as well. Dave was suggesting that
we should change that to a 64-bit value. That would be my next work item.
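
A rough sketch of what such a check could look like, as standalone C rather
than an actual patch. The helper name and the choice of -EFBIG are assumptions
made for illustration; only the 0x7fff limit is taken from the existing
MAXAEXTNUM definition.

```c
#include <errno.h>
#include <stdint.h>

/* Largest value the existing signed 16-bit attr fork extent counter holds. */
#define OLD_MAXAEXTNUM 0x7fff

/*
 * Hypothetical pre-check: return 0 if 'nr_to_add' more extents still fit in
 * the 16-bit counter, or -EFBIG so that the caller can fail the operation
 * before the counter wraps and the write verifier shuts down the fs.
 */
static int attr_extent_count_check(uint32_t cur_extents, uint32_t nr_to_add)
{
	if (cur_extents > OLD_MAXAEXTNUM ||
	    nr_to_add > OLD_MAXAEXTNUM - cur_extents)
		return -EFBIG;
	return 0;
}
```

The comparison is written as a subtraction from the limit so that the check
itself cannot overflow.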

> 
> Second, I also wonder if enabling an incompat feature bit by default in
> mkfs is a bit extreme. Perhaps this should be tied to a mkfs flag for a
> period of time? Maybe others have thoughts on that, but I'd at minimum
> request to introduce and enable said bit by default in separate patches
> to make it a bit easier for distro releases to identify and manage the
> incompatibility.

Dave has suggested that we should have a new 32-bit field in the inode. When
we are about to overflow the existing 16-bit counter limit, we set a per-inode
flag and also a RO-compat feature flag in the superblock.

When flushing an inode to disk, if the RO-compat feature flag is set, then we
set the corresponding inode flag and move the 16-bit counter over to the new
32-bit counter. Also, the RO-compat feature flag can be set by default by mkfs
sometime in the future.
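
To make the idea concrete, here is a minimal standalone C sketch of that
scheme as I understand it. All structure and field names below are
hypothetical stand-ins and do not exist in XFS. Zeroing the old 16-bit field
after the move is also just an assumption of this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; none of these names exist in XFS. */
struct sketch_sb {
	bool		ro_compat_big_acount;	/* RO-compat feature bit */
};

struct sketch_dinode {
	bool		has_big_acount;		/* per-inode flag */
	uint16_t	anextents16;		/* existing 16-bit counter */
	uint32_t	anextents32;		/* new 32-bit counter */
};

/* Read the attr fork extent count from whichever field is in use. */
static uint32_t sketch_anextents(const struct sketch_dinode *dip)
{
	return dip->has_big_acount ? dip->anextents32 : dip->anextents16;
}

/*
 * On flush: once the feature bit is set in the superblock, set the inode
 * flag and move the count over to the wide field.
 */
static void sketch_flush(const struct sketch_sb *sb, struct sketch_dinode *dip)
{
	if (!sb->ro_compat_big_acount || dip->has_big_acount)
		return;
	dip->anextents32 = dip->anextents16;
	dip->anextents16 = 0;	/* assumption: old field is cleared */
	dip->has_big_acount = true;
}
```

Inodes that are never flushed after the feature bit is set keep using the
16-bit field, so readers have to look at the per-inode flag rather than the
superblock bit.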

> 
> Brian
> 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> >  
> >  /*
> >   * Minimum and maximum blocksize and sectorsize.
> > diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
> > index 6d483ab29e639..3b624e24ae868 100644
> > --- a/fs/xfs/scrub/inode.c
> > +++ b/fs/xfs/scrub/inode.c
> > @@ -371,10 +371,12 @@ xchk_dinode(
> >  		break;
> >  	}
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +
> >  	/* di_forkoff */
> >  	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
> >  		xchk_ino_set_corrupt(sc, ino);
> > -	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
> > +	if (nextents != 0 && dip->di_forkoff == 0)
> >  		xchk_ino_set_corrupt(sc, ino);
> >  	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
> >  		xchk_ino_set_corrupt(sc, ino);
> > @@ -386,7 +388,6 @@ xchk_dinode(
> >  		xchk_ino_set_corrupt(sc, ino);
> >  
> >  	/* di_anextents */
> > -	nextents = be16_to_cpu(dip->di_anextents);
> >  	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
> >  	switch (dip->di_aformat) {
> >  	case XFS_DINODE_FMT_EXTENTS:
> > @@ -484,7 +485,7 @@ xchk_inode_xref_bmap(
> >  			&nextents, &acount);
> >  	if (!xchk_should_check_xref(sc, &error, NULL))
> >  		return;
> > -	if (nextents != be16_to_cpu(dip->di_anextents))
> > +	if (nextents != XFS_DFORK_NEXTENTS(&sc->mp->m_sb, dip, XFS_ATTR_FORK))
> >  		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
> >  
> >  	/* Check nblocks against the inode. */
> > diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
> > index 4a3d13d4a0228..dff20f2b368ea 100644
> > --- a/fs/xfs/xfs_inode_item.c
> > +++ b/fs/xfs/xfs_inode_item.c
> > @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> >  	to->di_nextents = from->di_nextents;
> > -	to->di_anextents = from->di_anextents;
> > +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = from->di_dmevmask;
> > @@ -344,6 +344,7 @@ xfs_inode_to_log_dinode(
> >  		to->di_crtime.t_nsec = from->di_crtime.tv_nsec;
> >  		to->di_flags2 = from->di_flags2;
> >  		to->di_cowextsize = from->di_cowextsize;
> > +		to->di_anextents_hi = ((u32)(from->di_anextents)) >> 16;
> >  		to->di_ino = ip->i_ino;
> >  		to->di_lsn = lsn;
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 11c3502b07b13..ba3fae95b2260 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
> >  	struct xfs_log_dinode	*ldip;
> >  	uint			isize;
> >  	int			need_free = 0;
> > +	uint32_t		nextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> > +
> > +	nextents += ldip->di_nextents;
> > +
> > +	if (unlikely(nextents > ldip->di_nblocks)) {
> >  		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
> >  				     XFS_ERRLEVEL_LOW, mp, ldip,
> >  				     sizeof(*ldip));
> > @@ -3052,8 +3060,7 @@ xlog_recover_inode_pass2(
> >  	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
> >  	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
> >  			__func__, item, dip, bp, in_f->ilf_ino,
> > -			ldip->di_nextents + ldip->di_anextents,
> > -			ldip->di_nblocks);
> > +			nextents, ldip->di_nblocks);
> >  		error = -EFSCORRUPTED;
> >  		goto out_release;
> >  	}
> 
> 

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-06 17:06   ` Darrick J. Wong
  2020-04-06 23:30     ` Dave Chinner
@ 2020-04-08 12:42     ` Chandan Rajendra
  1 sibling, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-08 12:42 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Rajendra, linux-xfs, david, bfoster

On Monday, April 6, 2020 10:36 PM Darrick J. Wong wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> 
> If you're going to introduce an INCOMPAT feature, please also use the
> opportunity to convert xattrs to something resembling the dir v3 format,
> where we index free space within each block so that we can speed up attr
> setting with 100 million attrs.
> 
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> 
> I was expecting you to use di_pad, not di_pad2... :)

Dave has suggested that a new 32-bit field be introduced. The kernel will
switch over to using this field when we are about to overflow the existing
16-bit counter.

> 
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> 
> XFS style indenting, please.

Sorry about that. I will fix it up.

> 
> > +	int32_t anextents;
> 
> When would we have negative extent count?
> 
> (Yes, this is a bug in the xfs_extnum/xfs_aextnum typedefs, bah...)
> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> 
> v3inode?  I thought this had a separate incompat flag?
> 
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
>

Yes, I will fix this up. With the new ro-compat feature bit and an extra
32-bit field to track the xattr extent counter, the above logic will change.

> /me would have thought you'd do the splitting and endian conversion in
> the opposite order, e.g.:
> 
> 	be32 x = dip->di_anextents_lo;
> 	if (has32bitattrcount)
> 		x |= (be32)dip->di_anextents_hi << 16;
> 	return be32_to_cpu(x);

I actually followed what was being done w.r.t projid i.e.

     to->di_projid = (prid_t)be16_to_cpu(from->di_projid_hi) << 16 |
                             be16_to_cpu(from->di_projid_lo);

But with the new 32-bit extent counter, we won't have to do that either.
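
For completeness, the hi/lo split used in the posted patch boils down to the
following, shown here as standalone C on host-order values. The function names
are made up for this example, and the be16_to_cpu()/cpu_to_be16() conversions
are deliberately left out; on disk each half would be converted separately, as
in the projid case.

```c
#include <stdint.h>

/* Split a 32-bit host-order extent count into two 16-bit halves. */
static uint16_t count_lo(uint32_t count)
{
	return count & 0xffff;
}

static uint16_t count_hi(uint32_t count)
{
	return count >> 16;
}

/*
 * Recombine the halves. The high half is cast to 32 bits before shifting so
 * that it is not promoted to a signed int first.
 */
static uint32_t count_combine(uint16_t hi, uint16_t lo)
{
	return ((uint32_t)hi << 16) | lo;
}
```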

> 
> > +
> > +	return anextents;
> > +}
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> 
> Need to preserve both limits so that we can do the correct check for the
> given feature set.

True. I will fix that.
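
To make the "preserve both limits" point concrete, here is a minimal sketch. It assumes a boolean that says whether the filesystem has the wider counter enabled. The names MAXAEXTNUM_LEGACY, MAXAEXTNUM_32BIT and the demo_* helpers are hypothetical and appear nowhere in the posted patch; only the two limit values (0x7fff and 0x7fffffff) come from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; only the two values are taken from the diff. */
#define MAXAEXTNUM_LEGACY	((uint32_t)0x7fff)	/* old 16-bit limit */
#define MAXAEXTNUM_32BIT	((uint32_t)0x7fffffff)	/* limit in the patch */

/* Pick the attr fork extent limit that matches the feature set. */
static inline uint32_t
demo_max_attr_extents(bool has_32bit_aext)
{
	return has_32bit_aext ? MAXAEXTNUM_32BIT : MAXAEXTNUM_LEGACY;
}

/* Verifier-style check: is this count legal for this filesystem? */
static inline bool
demo_attr_extent_count_valid(uint32_t anextents, bool has_32bit_aext)
{
	return anextents <= demo_max_attr_extents(has_32bit_aext);
}

/* Can one more extent be added without exceeding the limit? */
static inline bool
demo_attr_extent_can_add(uint32_t anextents, bool has_32bit_aext)
{
	uint32_t max = demo_max_attr_extents(has_32bit_aext);

	/* The caller is assumed to have verified the inode already. */
	assert(anextents <= max);
	return anextents < max;
}
```

With this shape, a count of 0x8000 is rejected on a filesystem without the feature and accepted on one with it.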

Thank you for providing the above review comments.

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-06 23:30     ` Dave Chinner
@ 2020-04-08 12:43       ` Chandan Rajendra
  2020-04-08 15:38         ` Darrick J. Wong
  2020-04-08 22:43         ` Dave Chinner
  2020-04-08 15:45       ` Darrick J. Wong
  1 sibling, 2 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-08 12:43 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > 
> > If you're going to introduce an INCOMPAT feature, please also use the
> > opportunity to convert xattrs to something resembling the dir v3 format,
> > where we index free space within each block so that we can speed up attr
> > setting with 100 million attrs.
> 
> Not necessary. Chandan has already spent a lot of time investigating
> that - I suggested doing the investigation probably a year ago when
> he was looking for stuff to do knowing that this could be a problem
> parent pointers hit. Long story short - there's no degradation in
> performance in the dabtree out to tens of millions of records with
> different fixed size or random sized attributes, nor do various
> combinations of insert/lookup/remove/replace operations seem to
> impact the tree performance at scale. IOWs, we hit the 16 bit extent
> limits of the attribute trees without finding any degradation in
> performance.

My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I
will address the review comments provided on this patchset and then run the
benchmarks once again ... but this time I will increase the upper limit to 100
million xattrs (since we will have a 32-bit extent counter). I will post the
results of the benchmarking (along with the benchmarking programs/scripts) to
the mailing list before I post the patchset itself.

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-07  1:20   ` Dave Chinner
@ 2020-04-08 12:45     ` Chandan Rajendra
  2020-04-10  7:46     ` Chandan Rajendra
  2020-04-27  7:42     ` Christoph Hellwig
  2 siblings, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-08 12:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, bfoster

On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> 
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> That is, when the attribute extent count is going to overflow, we set an
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem and see the extent count
> as "max" (65535), but they can't modify the attr fork and hence can't
> corrupt the 32 bit count they know nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.

Sure. I will make the changes suggested above.
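
As a rough sketch of how I read the scheme above (not an implementation of it): a per-inode flag selects which counter is authoritative, and the legacy 16-bit field saturates at 65535 once the real count no longer fits. The flag bit, the structure and the demo_* helpers are all hypothetical. Only the field name di_attr_nextents comes from the suggestion above. Fields are host-endian here for brevity, so the be16/be32 conversions are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit; not an existing di_flags2 value. */
#define DEMO_DIFLAG2_ATTR_NEXT32	(1ULL << 0)

/* Hypothetical, host-endian stand-in for the relevant inode fields. */
struct demo_dinode {
	uint16_t	di_anextents;		/* legacy 16-bit counter */
	uint32_t	di_attr_nextents;	/* new 32-bit counter */
	uint64_t	di_flags2;
};

/* Read the attr fork extent count from whichever field is in use. */
static inline uint32_t
demo_attr_nextents(const struct demo_dinode *dip)
{
	if (dip->di_flags2 & DEMO_DIFLAG2_ATTR_NEXT32)
		return dip->di_attr_nextents;
	return dip->di_anextents;
}

/* Flush-time update: write both counters when the feature is enabled. */
static inline void
demo_set_attr_nextents(struct demo_dinode *dip, uint32_t count,
		       bool sb_has_feature)
{
	if (!sb_has_feature) {
		/* Without the feature the count must fit in 16 bits. */
		assert(count <= UINT16_MAX);
		dip->di_anextents = (uint16_t)count;
		return;
	}

	/* Transparent conversion: mark the inode and fill both fields. */
	dip->di_flags2 |= DEMO_DIFLAG2_ATTR_NEXT32;
	dip->di_attr_nextents = count;
	dip->di_anextents = count > UINT16_MAX ? UINT16_MAX : (uint16_t)count;
}
```

The saturating write to the old field is an assumption on my part about what "update both the old and new counters appropriately" means.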

> 
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> 
> If you are converting a macro to static inline, then all the caller
> sites should be converted to lower case at the same time.

Ok.

> 
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> 
> Extent counts should be unsigned, as they are on disk.

Ok.

> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> 
> No feature bit to indicate that 32 bit attribute extent counts are
> valid?

The incompat feature flag (i.e. XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR) that I
introduced was meant to prevent older kernels from mounting filesystems whose
inodes have the di_anextents_hi field. As you have explained above, this
approach is incorrect. I will add the appropriate checks once I implement the
new "RO feature bit" method.

> 
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> 
> This should be open coded, but I'd prefer a completely separate
> variable...

Ok. I will change that.

> 
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> 
> Again, feature bit for on-disk format modifications needed...

Sure. I will change this.

> 
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> 
> Extent counts need to be converted to unsigned in memory - they are
> unsigned on disk....

Ok.

> 
> >  
> >  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> >  		return __this_address;
> > @@ -466,10 +475,12 @@ xfs_dinode_verify(
> >  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
> >  		return __this_address;
> >  
> > +	nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_DATA_FORK);
> > +	anextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK);
> > +	nblocks = be64_to_cpu(dip->di_nblocks);
> > +
> >  	/* Fork checks carried over from xfs_iformat_fork */
> > -	if (mode &&
> > -	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
> > -			be64_to_cpu(dip->di_nblocks))
> > +	if (mode && nextents + anextents > nblocks)
> >  		return __this_address;
> >  
> >  	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
> > @@ -526,7 +537,7 @@ xfs_dinode_verify(
> >  		default:
> >  			return __this_address;
> >  		}
> > -		if (dip->di_anextents)
> > +		if (XFS_DFORK_NEXTENTS(&mp->m_sb, dip, XFS_ATTR_FORK))
> >  			return __this_address;
> >  	}
> >  
> > diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
> > index 518c6f0ec3a61..080fd0c156a1e 100644
> > --- a/fs/xfs/libxfs/xfs_inode_fork.c
> > +++ b/fs/xfs/libxfs/xfs_inode_fork.c
> > @@ -207,9 +207,10 @@ xfs_iformat_extents(
> >  	int			whichfork)
> >  {
> >  	struct xfs_mount	*mp = ip->i_mount;
> > +	struct xfs_sb		*sb = &mp->m_sb;
> >  	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
> >  	int			state = xfs_bmap_fork_to_state(whichfork);
> > -	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	int			nex = XFS_DFORK_NEXTENTS(sb, dip, whichfork);
> >  	int			size = nex * sizeof(xfs_bmbt_rec_t);
> >  	struct xfs_iext_cursor	icur;
> >  	struct xfs_bmbt_rec	*dp;
> > diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> > index e3400c9c71cdb..5db92aa508bc5 100644
> > --- a/fs/xfs/libxfs/xfs_log_format.h
> > +++ b/fs/xfs/libxfs/xfs_log_format.h
> > @@ -397,7 +397,7 @@ struct xfs_log_dinode {
> >  	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
> >  	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
> >  	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
> > -	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
> > +	uint16_t	di_anextents_lo;/* lower part of xattr extent count */
> >  	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	int8_t		di_aformat;	/* format of attr fork's data */
> >  	uint32_t	di_dmevmask;	/* DMIG event mask */
> > @@ -414,7 +414,8 @@ struct xfs_log_dinode {
> >  	xfs_lsn_t	di_lsn;		/* flush sequence */
> >  	uint64_t	di_flags2;	/* more random flags */
> >  	uint32_t	di_cowextsize;	/* basic cow extent size for file */
> > -	uint8_t		di_pad2[12];	/* more padding for future expansion */
> > +	uint16_t	di_anextents_hi;/* higher part of xattr extent count */
> 
> So, unsigned in the log, as on disk...
> 
> > +	uint8_t		di_pad2[10];	/* more padding for future expansion */
> >  
> >  	/* fields only written to during inode creation */
> >  	xfs_ictimestamp_t di_crtime;	/* time created */
> > diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
> > index 397d94775440d..01669aa65745a 100644
> > --- a/fs/xfs/libxfs/xfs_types.h
> > +++ b/fs/xfs/libxfs/xfs_types.h
> > @@ -13,7 +13,7 @@ typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> 
> .... but not in memory?

I will change this. I did notice the data type inconsistency in the
existing code,

typedef int16_t         xfs_aextnum_t;  /* # extents in an attribute fork */

... and I thought there could be some purpose behind this. I was wrong. I will
fix the data type inconsistencies.
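
For illustration, the helper below mimics what a signed 16-bit in-memory counter reports once the real count passes 0x7fff. The name demo_wrap_to_int16 is hypothetical. The wrap is computed arithmetically rather than through a cast, so it does not depend on how a compiler converts out-of-range values.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return the value a signed 16-bit counter would show for a real
 * extent count, assuming two's complement wraparound.
 */
static inline int32_t
demo_wrap_to_int16(uint32_t count)
{
	uint32_t low = count & 0xffff;	/* only 16 bits are stored */
	int32_t seen = low >= 0x8000 ? (int32_t)low - 0x10000 : (int32_t)low;

	/* The result always fits the range of int16_t. */
	assert(seen >= INT16_MIN && seen <= INT16_MAX);
	return seen;
}
```

A real count of 45620 comes back as -19916 here. That matches the "total extents = -19916" in the console message only if the data fork contributed no extents, which the log does not show.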

> 
> >  typedef int64_t		xfs_fsize_t;	/* bytes in a file */
> >  typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
> >  
> > @@ -60,7 +60,7 @@ typedef void *		xfs_failaddr_t;
> >   */
> >  #define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
> >  #define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
> > -#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
> > +#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fffffff)	/* signed int */
> 
> What about for older filesystems where MAXAEXTNUM is unchanged?

I had again depended on the XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR incompat
feature flag to keep older kernels from accessing a filesystem with a 32-bit
xattr extent counter. I will fix this as well.

> 
> > diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> > index 11c3502b07b13..ba3fae95b2260 100644
> > --- a/fs/xfs/xfs_log_recover.c
> > +++ b/fs/xfs/xfs_log_recover.c
> > @@ -2922,6 +2922,7 @@ xlog_recover_inode_pass2(
> >  	struct xfs_log_dinode	*ldip;
> >  	uint			isize;
> >  	int			need_free = 0;
> > +	uint32_t		nextents;
> >  
> >  	if (item->ri_buf[0].i_len == sizeof(struct xfs_inode_log_format)) {
> >  		in_f = item->ri_buf[0].i_addr;
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> 
> What happens if we are recovering from a filesystem that doesn't
> know anything about di_anextents_hi and never wrote anything to
> the log for this field?

I had again depended on the XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR incompat
feature flag to keep older kernels from accessing a filesystem with a 32-bit
xattr extent counter. I will fix this as well.
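
A minimal sketch of the gating this calls for, assuming a boolean that reflects whether the filesystem advertises the wider counter: the high half is only consulted when the feature is present, because otherwise those bytes are padding that an older kernel never logged as a counter. The demo_* names are hypothetical and the values are host-endian.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Rebuild the attr fork extent count from its two 16-bit halves. */
static inline uint32_t
demo_join_anextents(uint16_t lo, uint16_t hi, bool has_32bit_aext)
{
	uint32_t count = lo;

	/* Ignore the high half unless the feature says it is valid. */
	if (has_32bit_aext)
		count |= (uint32_t)hi << 16;
	return count;
}

/* Split a count into the two halves for logging or writeback. */
static inline void
demo_split_anextents(uint32_t count, bool has_32bit_aext,
		     uint16_t *lo, uint16_t *hi)
{
	/* Without the feature the count has to fit in the low half. */
	if (!has_32bit_aext)
		assert(count <= UINT16_MAX);

	*lo = (uint16_t)(count & 0xffff);
	*hi = has_32bit_aext ? (uint16_t)(count >> 16) : 0;
}
```

With the feature off, whatever sits in the high half is ignored on read and written as zero.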

Thanks for the review comments. I will post the next version which will
address them.

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-08 12:43       ` Chandan Rajendra
@ 2020-04-08 15:38         ` Darrick J. Wong
  2020-04-08 22:43         ` Dave Chinner
  1 sibling, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2020-04-08 15:38 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Wed, Apr 08, 2020 at 06:13:45PM +0530, Chandan Rajendra wrote:
> On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> > On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > which
> > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > causes the following message to be printed on the console,
> > > > 
> > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > 
> > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > 
> > > > I have been informed that there are instances where a single file has
> > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > number of hardlinks are created.
> > > > 
> > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > 32-bit wide xattr extent counter.
> > > > 
> > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > 
> > > If you're going to introduce an INCOMPAT feature, please also use the
> > > opportunity to convert xattrs to something resembling the dir v3 format,
> > > where we index free space within each block so that we can speed up attr
> > > setting with 100 million attrs.
> > 
> > Not necessary. Chandan has already spent a lot of time investigating
> > that - I suggested doing the investigation probably a year ago when
> > he was looking for stuff to do knowing that this could be a problem
> > parent pointers hit. Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > > different fixed size or random sized attributes, nor do various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I
> will address the review comments provided on this patchset and then run the
> benchmarks once again ... but this time I will increase the upper limit to 100
> million xattrs (since we will have a 32-bit extent counter). I will post the
> results of the benchmarking (along with the benchmarking programs/scripts) to
> the mailing list before I post the patchset itself.

Ok.  Thanks for doing that work. :)

--D

> -- 
> chandan
> 
> 
> 


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-06 23:30     ` Dave Chinner
  2020-04-08 12:43       ` Chandan Rajendra
@ 2020-04-08 15:45       ` Darrick J. Wong
  2020-04-08 22:45         ` Dave Chinner
  1 sibling, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2020-04-08 15:45 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, chandan, bfoster

On Tue, Apr 07, 2020 at 09:30:02AM +1000, Dave Chinner wrote:
> On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > 
> > If you're going to introduce an INCOMPAT feature, please also use the
> > opportunity to convert xattrs to something resembling the dir v3 format,
> > where we index free space within each block so that we can speed up attr
> > setting with 100 million attrs.
> 
> Not necessary. Chandan has already spent a lot of time investigating
> that - I suggested doing the investigation probably a year ago when
> he was looking for stuff to do knowing that this could be a problem
> parent pointers hit.

Oh, I didn't realize that analysis work had already been done.

Chandan, could you please mention that somewhere in the cover letter?
It does mention that you tried creating 1M xattrs, but I guess it needed
to be more explicit about not uncovering any gigantic performance holes.

> Long story short - there's no degradation in
> performance in the dabtree out to tens of millions of records with
> different fixed size or random sized attributes, nor do various
> combinations of insert/lookup/remove/replace operations seem to
> impact the tree performance at scale. IOWs, we hit the 16 bit extent
> limits of the attribute trees without finding any degradation in
> performance.

Ok.  I'll take "attr v3 upgrade" off my list of things to look out for.

> Hence we concluded that the dabtree structure does not require
> significant modification or optimisation to work well with typical
> parent pointer attribute demands...
> 
> As for free space indexes....
> 
> The issue with the directory structure that requires external free
> space is that the directory data is not part of the dabtree itself.
> The attribute fork stores all the attributes at the leaves of the
> dabtree, while the directory structure stores the directory data in
> external blocks and the dabtree only contains the name hash index
> that points to the external data.
> 
> i.e. When we add an attribute to the dabtree, we split/merge leaves
> of the tree based on where the name hash index tells us it needs to
> be inserted/removed from. i.e. we make space available or collapse
> sparse leaves of the dabtree as a side effect of inserting or
> removing objects.
> 
> The directory structure is very different. The dirents cannot change
> location as their logical offset into the dir data segment is used
> as the readdir/seekdir/telldir cookie. Therefore that location is
> not allowed to change for the life of the dirent and so we can't
> store them in the leaves of a dabtree indexed in hash order because
> the offset into the tree would change as other entries are inserted
> and removed.  Hence when we remove dirents, we must leave holes in
> the data segment so the rest of the dirent data does not change
> logical offset.
> 
> The directory name hash index - the dabtree bit - is in a separate
> segment (the 2nd one). Because it only stores pointers to dirents in
> the data segment, it doesn't need to leave holes - the dabtree just
> merge/splits as required as pointers to the dir data segment are
> added/removed - and has no free space tracking.
> 
> Hence when we go to add a dirent, we need to find the best free
> space in the dir data segment to add that dirent. This requires a
> dir data segment free space index, and that is held in the 3rd dir
> segment.  Once we've found the best free space via lookup in the
> free space index, we go modify the dir data block it points to, then
> update the dabtree to point the name hash at that new dirent.
> 
> IOWs, the requirement for a free space map in the directory
> structure results from storing the dirent data externally to the
> dabtree. Attributes are stored directly in the leaves of the
> dabtree - except for remote attributes which can be anywhere in the
> BMBT address space - and hence do not need external free space
> tracking to determine where to best insert them...
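
The contrast described above can be sketched with two toy structures. This
is only an illustration under stated assumptions: toy_dir, toy_leaf and
every function below are hypothetical names, not XFS code, and the
fixed-size slot array stands in for the real variable-length dirent and
attr entry layouts. The directory side never moves an entry once placed
(its slot index plays the role of the readdir cookie), so removal leaves a
hole that some free space tracking has to find again. The attr leaf side
keeps entries packed in hash order, so space is made simply by shifting
(or, in a real dabtree, splitting) and no separate free space index is
needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 8

/* Toy "dir data segment": a slot's index acts as the readdir cookie. */
struct toy_dir {
	uint32_t hash[TOY_SLOTS];
	uint8_t  used[TOY_SLOTS];	/* stands in for the free space index */
};

/* Place a dirent in the first free slot; return its offset, or -1 if full. */
static int toy_dir_add(struct toy_dir *d, uint32_t hash)
{
	for (int i = 0; i < TOY_SLOTS; i++) {
		if (!d->used[i]) {
			d->used[i] = 1;
			d->hash[i] = hash;
			return i;
		}
	}
	return -1;
}

/* Removal only punches a hole; no other dirent changes offset. */
static void toy_dir_remove(struct toy_dir *d, int off)
{
	assert(off >= 0 && off < TOY_SLOTS);
	d->used[off] = 0;
}

/* Toy "attr leaf": entries stay packed in hash order and shift on insert. */
struct toy_leaf {
	uint32_t hash[TOY_SLOTS];
	int      count;
};

/* Insert at the sorted position; return the index used, or -1 if full. */
static int toy_leaf_insert(struct toy_leaf *l, uint32_t hash)
{
	int pos = 0;

	if (l->count == TOY_SLOTS)
		return -1;	/* a real dabtree would split the leaf here */
	while (pos < l->count && l->hash[pos] < hash)
		pos++;
	memmove(&l->hash[pos + 1], &l->hash[pos],
		(size_t)(l->count - pos) * sizeof(l->hash[0]));
	l->hash[pos] = hash;
	l->count++;
	return pos;
}
```

In the toy directory, removing the middle entry and adding a new one
reuses the hole while the neighbouring offsets stay put; in the toy leaf,
each insert may change the index of every entry with a larger hash, which
is harmless because nothing outside the tree remembers those indexes.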

<nod> Got it.  I've suspected this property about the xattr structures
for a long time, so I'm glad to hear someone else echo that. :)

Dave: May I try to rework the above into something suitable for the
ondisk format documentation?

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-08 12:43       ` Chandan Rajendra
  2020-04-08 15:38         ` Darrick J. Wong
@ 2020-04-08 22:43         ` Dave Chinner
  1 sibling, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2020-04-08 22:43 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Wed, Apr 08, 2020 at 06:13:45PM +0530, Chandan Rajendra wrote:
> On Tuesday, April 7, 2020 5:00 AM Dave Chinner wrote: 
> > On Mon, Apr 06, 2020 at 10:06:03AM -0700, Darrick J. Wong wrote:
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > which
> > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > causes the following message to be printed on the console,
> > > > 
> > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > 
> > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > 
> > > > I have been informed that there are instances where a single file has
> > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > number of hardlinks are created.
> > > > 
> > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > 32-bit wide xattr extent counter.
> > > > 
> > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > 
> > > If you're going to introduce an INCOMPAT feature, please also use the
> > > opportunity to convert xattrs to something resembling the dir v3 format,
> > > where we index free space within each block so that we can speed up attr
> > > setting with 100 million attrs.
> > 
> > Not necessary. Chandan has already spent a lot of time investigating
> > that - I suggested doing the investigation probably a year ago when
> > he was looking for stuff to do knowing that this could be a problem
> > parent pointers hit. Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > different fixed size or random sized attributes, nor does various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> My benchmarking was limited to working with a maximum of 1,000,000 xattrs. I

/me goes and reviews old emails

Yes, there were a lot of experiments limited to 1M xattrs because
of the 16bit extent count limitations once the tree modifications
started removing blocks and allocating new ones, but:

| Dave, I have experimented and found that xattr insertion and deletion
| operations consume cpu time in a O(N) manner. Below is a sample of such an
| experiment,
|
| | Nr attributes | Create | Delete |
| |---------------+--------+--------|
| |         10000 |   0.07 |   0.06 |
| |         20000 |   0.14 |   0.13 |
| |        100000 |   0.73 |   0.69 |
| |        200000 |   1.50 |   1.30 |
| |       1000000 |   7.87 |   6.39 |
| |       2000000 |  15.76 |  12.56 |
| |      10000000 |  78.68 |  66.53 |

There's 10M attributes with expected scalability behaviour.
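
A quick way to check the "expected scalability" reading of the table above
is to divide each total by its attribute count and compare rows. The sketch
below does that; the struct and function names are made up for this
example, and the time unit is left unspecified because the table does not
state one. If the per-attribute cost is roughly constant from 10k to 10M
attributes, total time is growing linearly.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the timing table above; the time unit is not stated there. */
struct attr_timing {
	long	nr_attrs;
	double	create_time;
	double	delete_time;
};

static const struct attr_timing timings[] = {
	{    10000,  0.07,  0.06 },
	{    20000,  0.14,  0.13 },
	{   100000,  0.73,  0.69 },
	{   200000,  1.50,  1.30 },
	{  1000000,  7.87,  6.39 },
	{  2000000, 15.76, 12.56 },
	{ 10000000, 78.68, 66.53 },
};

#define NR_TIMINGS (sizeof(timings) / sizeof(timings[0]))

/*
 * Ratio of the highest to the lowest per-attribute cost across all rows.
 * A value close to 1.0 means total time grows linearly with the count.
 */
static double cost_spread(int use_delete)
{
	double lo = 0.0, hi = 0.0;

	for (size_t i = 0; i < NR_TIMINGS; i++) {
		double total = use_delete ? timings[i].delete_time
					  : timings[i].create_time;
		double per_attr;

		assert(timings[i].nr_attrs > 0);
		per_attr = total / (double)timings[i].nr_attrs;
		if (i == 0 || per_attr < lo)
			lo = per_attr;
		if (i == 0 || per_attr > hi)
			hi = per_attr;
	}
	return hi / lo;
}
```

On these numbers the spread comes out at roughly 1.13 for create and 1.15
for delete across three orders of magnitude, i.e. no sign of the
per-attribute cost climbing with tree size.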

Space efficiency for parent-pointer style xattrs out to 10 million
xattrs:

| I extracted some more data from the experiments,
| 
|     1. 13 to 20 bytes name length; Zero length value
|        | Nr xattr | 4kAvg | 4kmin | 4kmax | stddev | Total Nr leaves | Below avg space used | Percentage |
|        |----------+-------+-------+-------+--------+-----------------+----------------------+------------|
|        |    10000 |  3156 |  2100 |  4080 |    978 |             122 |                   56 |      45.90 |
|        |    20000 |  3358 |  2100 |  4080 |    945 |             255 |                  135 |      52.94 |
|        |   100000 |  3469 |  2080 |  4080 |    910 |            1349 |                  802 |      59.45 |
|        |   200000 |  2842 |  2080 |  4080 |    747 |            2649 |                 1264 |      47.72 |
|        |   300000 |  2739 |  2080 |  4080 |    699 |            3907 |                 2045 |      52.34 |
|        |   400000 |  2949 |  2080 |  4080 |    699 |            5349 |                 2692 |      50.33 |
|        |   500000 |  2947 |  2080 |  4080 |    714 |            6795 |                 3709 |      54.58 |
|        |   600000 |  2947 |  2080 |  4080 |    588 |            7726 |                 5214 |      67.49 |
|        |   700000 |  2858 |  2080 |  4088 |    619 |            9331 |                 4821 |      51.67 |
|        |   800000 |  3076 |  2080 |  4088 |    626 |           11148 |                 6241 |      55.98 |
|        |   900000 |  3060 |  2080 |  4088 |    715 |           11355 |                 5907 |      52.02 |
|        |  1000000 |  2726 |  2080 |  4080 |    602 |           11757 |                 5422 |      46.12 |
|        |  2000000 |  2707 |  2080 |  4088 |    530 |           24508 |                10877 |      44.38 |
|        |  3000000 |  2637 |  2080 |  4088 |    506 |           36842 |                15983 |      43.38 |
|        |  4000000 |  2639 |  2080 |  4088 |    509 |           49502 |                22745 |      45.95 |
|        |  5000000 |  2609 |  2080 |  4088 |    504 |           62102 |                28536 |      45.95 |
|        |  6000000 |  2622 |  2080 |  4088 |    525 |           74640 |                34797 |      46.62 |
|        |  7000000 |  2601 |  2080 |  4088 |    511 |           87232 |                40565 |      46.50 |
|        |  8000000 |  2593 |  2080 |  4088 |    513 |           99924 |                47249 |      47.28 |
|        |  9000000 |  2584 |  2080 |  4088 |    511 |          112551 |                48683 |      43.25 |
|        | 10000000 |  2597 |  2080 |  4088 |    527 |          125158 |                54245 |      43.34 |
| 
| 
|     2. 13 to 20 bytes name length; Value length is 13 bytes
|        | Nr xattr | 4kAvg | 4kmin | 4kmax | stddev | Total Nr leaves | Below avg space used | Percentage |
|        |----------+-------+-------+-------+--------+-----------------+----------------------+------------|
|        |    10000 |  2702 |  2096 |  3536 |    564 |              65 |                   30 |      46.15 |
|        |    20000 |  2746 |  2096 |  3968 |    687 |             122 |                   44 |      36.07 |
|        |   100000 |  2718 |  2092 |  3968 |    746 |             590 |                  180 |      30.51 |
|        |   200000 |  2782 |  2092 |  3968 |    690 |            1593 |                 1166 |      73.20 |
|        |   300000 |  2834 |  2092 |  4040 |    708 |            2557 |                 1473 |      57.61 |
|        |   400000 |  2764 |  2092 |  3968 |    536 |            3206 |                 1393 |      43.45 |
|        |   500000 |  2723 |  2092 |  4040 |    651 |            4045 |                 2449 |      60.54 |
|        |   600000 |  2870 |  2092 |  4040 |    594 |            4883 |                 2727 |      55.85 |
|        |   700000 |  2776 |  2092 |  4076 |    564 |            5903 |                 2647 |      44.84 |
|        |   800000 |  2659 |  2092 |  4076 |    510 |            6275 |                 3224 |      51.38 |
|        |   900000 |  2929 |  2092 |  3968 |    491 |            7113 |                 4207 |      59.15 |
|        |  1000000 |  3138 |  2092 |  4076 |    552 |            8916 |                 5746 |      64.45 |
|        |  2000000 |  3016 |  2096 |  4076 |    615 |           18119 |                11540 |      63.69 |
|        |  3000000 |  3010 |  2096 |  4076 |    642 |           27995 |                18411 |      65.77 |
|        |  4000000 |  2988 |  2096 |  4076 |    667 |           37346 |                22439 |      60.08 |
|        |  5000000 |  2977 |  2096 |  4076 |    670 |           47275 |                28745 |      60.80 |
|        |  6000000 |  2973 |  2096 |  4076 |    680 |           56479 |                33075 |      58.56 |
|        |  7000000 |  2968 |  2096 |  4076 |    680 |           66472 |                40288 |      60.61 |
|        |  8000000 |  2961 |  2096 |  4076 |    684 |           76241 |                45640 |      59.86 |
|        |  9000000 |  2958 |  2096 |  4076 |    684 |           86070 |                52306 |      60.77 |
|        | 10000000 |  2956 |  2096 |  4076 |    688 |           95179 |                56395 |      59.25 |

And there's a couple of logarithmic overhead data tables that go out
as far as 37 million xattrs...
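
To relate the leaf counts in the first table to the 16 bit counter, here is
a rough back-of-the-envelope sketch. It rests on two loud assumptions that
are not measurements from this thread: leaf count scales linearly with the
number of xattrs, and in the worst case every leaf block sits in its own
extent (contiguous blocks normally merge into far fewer extents). The
function names are hypothetical; the limit is the MAXAEXTNUM value from the
current code.

```c
#include <assert.h>
#include <stdint.h>

/* Same value as MAXAEXTNUM in the current code: the signed 16-bit maximum. */
#define TOY_MAXAEXTNUM	0x7fffULL

/* Linearly scale a measured (xattrs, leaves) sample to another xattr count. */
static uint64_t toy_estimate_leaves(uint64_t nr_xattrs,
				    uint64_t sample_xattrs,
				    uint64_t sample_leaves)
{
	assert(sample_xattrs > 0);
	return nr_xattrs * sample_leaves / sample_xattrs;
}

/* Worst case only: pretend every leaf block ends up in its own extent. */
static int toy_fits_16bit_counter(uint64_t nr_leaves)
{
	return nr_leaves <= TOY_MAXAEXTNUM;
}
```

With the 10M row (125158 leaves) as the sample, 100M xattrs extrapolates to
1251580 leaves. Under the worst-case assumption even the 10M case is past
the 16 bit limit, while the extrapolated 100M case still sits comfortably
inside a 32 bit counter.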

> will address the review comments provided on this patchset and then run the
> benchmarks once again ... but this time I will increase the upper limit to 100
> million xattrs (since we will have a 32-bit extent counter). I will post the
> results of the benchmarking (along with the benchmarking programs/scripts) to
> the mailing list before I post the patchset itself.

Sounds good.

Though given the results of what you have done so far, I don't
expect to see any scalability issues until we hit on machine memory
limits (i.e.  can't cache all the dabtree metadata in memory) or
maximum dabtree depths.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-08 15:45       ` Darrick J. Wong
@ 2020-04-08 22:45         ` Dave Chinner
  0 siblings, 0 replies; 37+ messages in thread
From: Dave Chinner @ 2020-04-08 22:45 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Chandan Rajendra, linux-xfs, chandan, bfoster

On Wed, Apr 08, 2020 at 08:45:12AM -0700, Darrick J. Wong wrote:
> On Tue, Apr 07, 2020 at 09:30:02AM +1000, Dave Chinner wrote:
> > Long story short - there's no degradation in
> > performance in the dabtree out to tens of millions of records with
> > different fixed size or random sized attributes, nor does various
> > combinations of insert/lookup/remove/replace operations seem to
> > impact the tree performance at scale. IOWs, we hit the 16 bit extent
> > limits of the attribute trees without finding any degradation in
> > performance.
> 
> Ok.  I'll take "attr v3 upgrade" off my list of things to look out for.
> 
> > Hence we concluded that the dabtree structure does not require
> > significant modification or optimisation to work well with typical
> > parent pointer attribute demands...
> > 
> > As for free space indexes....
> > 
> > The issue with the directory structure that requires external free
> > space is that the directory data is not part of the dabtree itself.
> > The attribute fork stores all the attributes at the leaves of the
> > dabtree, while the directory structure stores the directory data in
> > external blocks and the dabtree only contains the name hash index
> > that points to the external data.
> > 
> > i.e. When we add an attribute to the dabtree, we split/merge leaves
> > of the tree based on where the name hash index tells us it needs to
> > be inserted/removed from. i.e. we make space available or collapse
> > sparse leaves of the dabtree as a side effect of inserting or
> > removing objects.
> > 
> > The directory structure is very different. The dirents cannot change
> > location as their logical offset into the dir data segment is used
> > as the readdir/seekdir/telldir cookie. Therefore that location is
> > not allowed to change for the life of the dirent and so we can't
> > store them in the leaves of a dabtree indexed in hash order because
> > the offset into the tree would change as other entries are inserted
> > and removed.  Hence when we remove dirents, we must leave holes in
> > the data segment so the rest of the dirent data does not change
> > logical offset.
> > 
> > The directory name hash index - the dabtree bit - is in a separate
> > segment (the 2nd one). Because it only stores pointers to dirents in
> > the data segment, it doesn't need to leave holes - the dabtree just
> > merge/splits as required as pointers to the dir data segment are
> > added/removed - and has no free space tracking.
> > 
> > Hence when we go to add a dirent, we need to find the best free
> > space in the dir data segment to add that dirent. This requires a
> > dir data segment free space index, and that is held in the 3rd dir
> > segment.  Once we've found the best free space via lookup in the
> > free space index, we go modify the dir data block it points to, then
> > update the dabtree to point the name hash at that new dirent.
> > 
> > IOWs, the requirement for a free space map in the directory
> > structure results from storing the dirent data externally to the
> > dabtree. Attributes are stored directly in the leaves of the
> > dabtree - except for remote attributes which can be anywhere in the
> > BMBT address space - and hence do not need external free space
> > tracking to determine where to best insert them...
> 
> <nod> Got it.  I've suspected this property about the xattr structures
> for a long time, so I'm glad to hear someone else echo that. :)
> 
> Dave: May I try to rework the above into something suitable for the
> ondisk format documentation?

Sure. Anything that helps people understand the complexity of the
directory data structure is a good thing :)

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-07  1:20   ` Dave Chinner
  2020-04-08 12:45     ` Chandan Rajendra
@ 2020-04-10  7:46     ` Chandan Rajendra
  2020-04-12  6:34       ` Chandan Rajendra
  2020-04-27  7:42     ` Christoph Hellwig
  2 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-10  7:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, bfoster

On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > which
> > 1. Creates 1,000,000 255-byte sized xattrs,
> > 2. Deletes 50% of these xattrs in an alternating manner,
> > 3. Tries to create 400,000 new 255-byte sized xattrs
> > causes the following message to be printed on the console,
> > 
> > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > 
> > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > 
> > I have been informed that there are instances where a single file has
> >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > we will overflow the 16-bits wide xattr extent counter when large
> > number of hardlinks are created.
> > 
> > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > an incompat flag to prevent older kernels from mounting newer filesystems with
> > 32-bit wide xattr extent counter.
> > 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> >  fs/xfs/scrub/inode.c           |  7 ++++---
> >  fs/xfs/xfs_inode_item.c        |  3 ++-
> >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> >  8 files changed, 63 insertions(+), 27 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index 045556e78ee2c..0a4266b0d46e1 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> >  
> >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> >  static inline bool
> > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >  	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> >  	__s8		di_aformat;	/* format of attr fork's data */
> >  	__be32		di_dmevmask;	/* DMIG event mask */
> > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_lsn;		/* flush sequence */
> >  	__be64		di_flags2;	/* more random flags */
> >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > +	__u8		di_pad2[10];	/* more padding for future expansion */
> 
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> That is, when the attribute count is going to overflow, we set a
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem, but see the extent count
> as "max" (65535) but can't modify the attr fork and hence corrupt
> the 32 bit count it knows nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.
> 
> >  	/* fields only written to during inode creation */
> >  	xfs_timestamp_t	di_crtime;	/* time created */
> > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> >  	((w) == XFS_DATA_FORK ? \
> >  		(dip)->di_format : \
> >  		(dip)->di_aformat)
> > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > -	((w) == XFS_DATA_FORK ? \
> > -		be32_to_cpu((dip)->di_nextents) : \
> > -		be16_to_cpu((dip)->di_anextents))
> > +
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> 
> If you are converting a macro to static inline, then all the caller
> sites should be converted to lower case at the same time.
> 
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> 
> Extent counts should be unsigned, as they are on disk.
> 
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> > +}
> 
> No feature bit to indicate that 32 bit attribute extent counts are
> valid?
> 
> >  
> >  /*
> >   * For block and character special files the 32bit dev_t is stored at the
> > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > index 39c5a6e24915c..ced8195bd8c22 100644
> > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > +				XFS_ATTR_FORK);
> 
> This should be open coded, but I'd prefer a completely separate
> variable...
> 
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat	= from->di_aformat;
> >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi
> > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> 
> Again, feature bit for on-disk format modifications needed...
> 
> >  		to->di_ino = cpu_to_be64(ip->i_ino);
> >  		to->di_lsn = cpu_to_be64(lsn);
> >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> >  	to->di_forkoff = from->di_forkoff;
> >  	to->di_aformat = from->di_aformat;
> >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> >  		to->di_ino = cpu_to_be64(from->di_ino);
> >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> >  	struct xfs_mount	*mp,
> >  	int			whichfork)
> >  {
> > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > +	uint32_t		di_nextents;
> > +
> > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> >  
> >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> >  	case XFS_DINODE_FMT_LOCAL:
> > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> >  	uint16_t		flags;
> >  	uint64_t		flags2;
> >  	uint64_t		di_size;
> > +	int32_t			nextents;
> > +	int32_t			anextents;
> > +	int64_t			nblocks;
> 
> Extent counts need to be converted to unsigned in memory - they are
> unsigned on disk....

In the current code, we have,

#define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
#define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */

i.e. the maximum allowed data extent count and xattr extent count are the
largest values representable in a signed int and a signed short respectively.

Can you please explain why signed maximum values were chosen when the
corresponding on-disk data types are unsigned?
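
For reference, the negative total in the commit message is consistent with
the counter being viewed through a signed 16 bit type. The sketch below is
only an illustration: toy_as_signed16() and toy_check_wrap() are
hypothetical helpers, and it assumes the data fork contributed no extents
to the reported total, which the log line does not actually say. Under that
assumption, an attr fork extent count of 45620 reads back as -19916.

```c
#include <assert.h>
#include <stdint.h>

/* View a raw 16-bit counter value through a signed 16-bit type. */
static int32_t toy_as_signed16(uint16_t raw)
{
	return raw >= 0x8000 ? (int32_t)raw - 0x10000 : (int32_t)raw;
}

/* Worked values; toy_check_wrap() is a hypothetical helper, not kernel code. */
static void toy_check_wrap(void)
{
	assert(toy_as_signed16(0x7fff) == 32767);	/* MAXAEXTNUM */
	assert(toy_as_signed16(0x8000) == -32768);	/* one more extent wraps */
	assert(toy_as_signed16(45620) == -19916);	/* total from the log */
}
```

So anything above MAXAEXTNUM is already unusable with the signed in-memory
type, even though the on-disk field could hold up to 65535.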

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-10  7:46     ` Chandan Rajendra
@ 2020-04-12  6:34       ` Chandan Rajendra
  2020-04-13 18:55         ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-12  6:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, bfoster

On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > which
> > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > causes the following message to be printed on the console,
> > > 
> > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > 
> > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > 
> > > I have been informed that there are instances where a single file has
> > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > we will overflow the 16-bits wide xattr extent counter when large
> > > number of hardlinks are created.
> > > 
> > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > 32-bit wide xattr extent counter.
> > > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > --- a/fs/xfs/libxfs/xfs_format.h
> > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > >  
> > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > >  static inline bool
> > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > >  	__s8		di_aformat;	/* format of attr fork's data */
> > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > >  	__be64		di_lsn;		/* flush sequence */
> > >  	__be64		di_flags2;	/* more random flags */
> > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > 
> > Ok, I think you've limited what we can do here by using this "fill
> > holes" variable split. I've never liked doing this, and we've only
> > done it in the past when we haven't had space in the inode to create
> > a new 32 bit variable.
> > 
> > IOWs, this is a v5 format feature only, so we should just create a
> > new variable:
> > 
> > 	__be32		di_attr_nextents;
> > 
> > With that in place, we can now do what we did extending the v1 inode
> > link count (16 bits) to the v2 inode link count (32 bits).
> > 
> > That is, when the attribute count is going to overflow, we set an
> > inode flag on disk to indicate that it now has a 32 bit extent count
> > and uses that field in the inode, and we set a RO-compat feature
> > flag in the superblock to indicate that there are 32 bit attr fork
> > extent counts in use.
> > 
> > Old kernels can still read the filesystem and see the extent count
> > as "max" (65535), but they can't modify the attr fork and hence can't
> > corrupt the 32 bit count they know nothing about.
> > 
> > If the kernel sees the RO feature bit set, it can set the inode flag
> > on inodes it is modifying and update both the old and new counters
> > appropriately when flushing the inode to disk (i.e. transparent
> > conversion).
> > 
> > In future, mkfs can then set the RO feature flag by default so all
> > new filesystems use the 32 bit counter.
> > 
> > >  	/* fields only written to during inode creation */
> > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > >  	((w) == XFS_DATA_FORK ? \
> > >  		(dip)->di_format : \
> > >  		(dip)->di_aformat)
> > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > -	((w) == XFS_DATA_FORK ? \
> > > -		be32_to_cpu((dip)->di_nextents) : \
> > > -		be16_to_cpu((dip)->di_anextents))
> > > +
> > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > 
> > If you are converting a macro to static inline, then all the caller
> > sites should be converted to lower case at the same time.
> > 
> > > +					struct xfs_dinode *dip, int whichfork)
> > > +{
> > > +	int32_t anextents;
> > 
> > Extent counts should be unsigned, as they are on disk.
> > 
> > > +
> > > +	if (whichfork == XFS_DATA_FORK)
> > > +		return be32_to_cpu((dip)->di_nextents);
> > > +
> > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > +
> > > +	return anextents;
> > > +}
> > 
> > No feature bit to indicate that 32 bit attribute extent counts are
> > valid?
> > 
> > >  
> > >  /*
> > >   * For block and character special files the 32bit dev_t is stored at the
> > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > +				XFS_ATTR_FORK);
> > 
> > This should be open coded, but I'd prefer a completely separate
> > variable...
> > 
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat	= from->di_aformat;
> > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi
> > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > 
> > Again, feature bit for on-disk format modifications needed...
> > 
> > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > >  		to->di_lsn = cpu_to_be64(lsn);
> > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > >  	to->di_forkoff = from->di_forkoff;
> > >  	to->di_aformat = from->di_aformat;
> > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > >  	struct xfs_mount	*mp,
> > >  	int			whichfork)
> > >  {
> > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > +	uint32_t		di_nextents;
> > > +
> > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > >  
> > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > >  	case XFS_DINODE_FMT_LOCAL:
> > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > >  	uint16_t		flags;
> > >  	uint64_t		flags2;
> > >  	uint64_t		di_size;
> > > +	int32_t			nextents;
> > > +	int32_t			anextents;
> > > +	int64_t			nblocks;
> > 
> > Extent counts need to be converted to unsigned in memory - they are
> > unsigned on disk....
> 
> In the current code, we have,
> 
> #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> 
> i.e. the maximum allowed data extent counter and xattr extent counter are
> maximum possible values w.r.t signed int and signed short.
> 
> Can you please explain as to why signed maximum values were considered when
> the corresponding on-disk data types are unsigned?
> 
> 

Ok. So the reason I asked that question was because I was wondering if
changing the maximum number of extents for data and attr would cause a change
in the height of the corresponding bmbt trees (which in turn could change the
log reservation values). The following calculations show that it does not:

- 5 levels deep data bmbt tree.
  |-------+------------------------+-------------------------------|
  | level | number of nodes/leaves | Total Nr recs                 |
  |-------+------------------------+-------------------------------|
  |     0 |                      1 | 3 (max root recs)             |
  |     1 |                      3 | 125 * 3 = 375                 |
  |     2 |                    375 | 125 * 375 = 46875             |
  |     3 |                  46875 | 125 * 46875 = 5859375         |
  |     4 |                5859375 | 125 * 5859375 = 732421875     |
  |     5 |              732421875 | 125 * 732421875 = 91552734375 |
  |-------+------------------------+-------------------------------|

- 3 levels deep attr bmbt tree.
  |-------+------------------------+-----------------------|
  | level | number of nodes/leaves | Total Nr recs         |
  |-------+------------------------+-----------------------|
  |     0 |                      1 | 2 (max root recs)     |
  |     1 |                      2 | 125 * 2 = 250         |
  |     2 |                    250 | 125 * 250 = 31250     |
  |     3 |                  31250 | 125 * 31250 = 3906250 |
  |-------+------------------------+-----------------------|

- Data type to number of records
  |-----------+-------------+-----------------|
  | data type | max extents | max leaf blocks |
  |-----------+-------------+-----------------|
  | int32     |  2147483647 |        17179870 |
  | uint32    |  4294967295 |        34359739 |
  | int16     |       32767 |             263 |
  | uint16    |       65535 |             525 |
  |-----------+-------------+-----------------|

So data bmbt will still have a height of 5 and attr bmbt will continue to have
a height of 3.
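
As a rough cross-check of the tables above, the same arithmetic can be
written as a small C sketch. It assumes, as the tables do, 125 records per
bmbt block and 3 (data) or 2 (attr) records in the in-inode root; the
function names are made up for this illustration and are not kernel helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of levels below the in-inode root needed before the tree can
 * index max_extents records. root_recs is the number of records the
 * root holds; recs_per_block is the fan-out of every other block
 * (assumed to be 125, matching the tables above).
 */
unsigned int bmbt_levels(uint64_t max_extents, uint64_t root_recs,
			 uint64_t recs_per_block)
{
	uint64_t capacity = root_recs;	/* level 0: records held by the root */
	unsigned int level = 0;

	assert(root_recs > 0 && recs_per_block > 1);

	while (capacity < max_extents) {
		capacity *= recs_per_block;	/* one more level of blocks */
		level++;
	}
	return level;
}

/* Leaf blocks needed to hold max_extents records, rounded up. */
uint64_t max_leaf_blocks(uint64_t max_extents, uint64_t recs_per_block)
{
	return (max_extents + recs_per_block - 1) / recs_per_block;
}
```

With these assumptions, bmbt_levels() gives 5 for both 0x7fffffff and
0xffffffff data extents with a 3-record root, and 3 for both 0x7fff and
0xffff attr extents with a 2-record root.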

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-12  6:34       ` Chandan Rajendra
@ 2020-04-13 18:55         ` Darrick J. Wong
  2020-04-20  4:38           ` Chandan Rajendra
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2020-04-13 18:55 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > which
> > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > causes the following message to be printed on the console,
> > > > 
> > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > 
> > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > 
> > > > I have been informed that there are instances where a single file has
> > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > number of hardlinks are created.
> > > > 
> > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > 32-bit wide xattr extent counter.
> > > > 
> > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > 
> > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > >  
> > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > >  static inline bool
> > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > >  	__be64		di_lsn;		/* flush sequence */
> > > >  	__be64		di_flags2;	/* more random flags */
> > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > 
> > > Ok, I think you've limited what we can do here by using this "fill
> > > holes" variable split. I've never liked doing this, and we've only
> > > done it in the past when we haven't had space in the inode to create
> > > a new 32 bit variable.
> > > 
> > > IOWs, this is a v5 format feature only, so we should just create a
> > > new variable:
> > > 
> > > 	__be32		di_attr_nextents;
> > > 
> > > With that in place, we can now do what we did extending the v1 inode
> > > link count (16 bits) to the v2 inode link count (32 bits).
> > > 
> > > That is, when the attribute count is going to overflow, we set an
> > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > and uses that field in the inode, and we set a RO-compat feature
> > > flag in the superblock to indicate that there are 32 bit attr fork
> > > extent counts in use.
> > > 
> > > Old kernels can still read the filesystem and see the extent count
> > > as "max" (65535), but they can't modify the attr fork and hence can't
> > > corrupt the 32 bit count they know nothing about.
> > > 
> > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > on inodes it is modifying and update both the old and new counters
> > > appropriately when flushing the inode to disk (i.e. transparent
> > > conversion).
> > > 
> > > In future, mkfs can then set the RO feature flag by default so all
> > > new filesystems use the 32 bit counter.
> > > 
> > > >  	/* fields only written to during inode creation */
> > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > >  	((w) == XFS_DATA_FORK ? \
> > > >  		(dip)->di_format : \
> > > >  		(dip)->di_aformat)
> > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > -	((w) == XFS_DATA_FORK ? \
> > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > -		be16_to_cpu((dip)->di_anextents))
> > > > +
> > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > 
> > > If you are converting a macro to static inline, then all the caller
> > > sites should be converted to lower case at the same time.
> > > 
> > > > +					struct xfs_dinode *dip, int whichfork)
> > > > +{
> > > > +	int32_t anextents;
> > > 
> > > Extent counts should be unsigned, as they are on disk.
> > > 
> > > > +
> > > > +	if (whichfork == XFS_DATA_FORK)
> > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > +
> > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > +
> > > > +	return anextents;
> > > > +}
> > > 
> > > No feature bit to indicate that 32 bit attribute extent counts are
> > > valid?
> > > 
> > > >  
> > > >  /*
> > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > +				XFS_ATTR_FORK);
> > > 
> > > This should be open coded, but I'd prefer a completely separate
> > > variable...
> > > 
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat	= from->di_aformat;
> > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat = from->di_aformat;
> > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > +		to->di_anextents_hi
> > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > 
> > > Again, feature bit for on-disk format modifications needed...
> > > 
> > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > >  	to->di_forkoff = from->di_forkoff;
> > > >  	to->di_aformat = from->di_aformat;
> > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > >  	struct xfs_mount	*mp,
> > > >  	int			whichfork)
> > > >  {
> > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > +	uint32_t		di_nextents;
> > > > +
> > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > >  
> > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > >  	uint16_t		flags;
> > > >  	uint64_t		flags2;
> > > >  	uint64_t		di_size;
> > > > +	int32_t			nextents;
> > > > +	int32_t			anextents;
> > > > +	int64_t			nblocks;
> > > 
> > > Extent counts need to be converted to unsigned in memory - they are
> > > unsigned on disk....
> > 
> > In the current code, we have,
> > 
> > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > 
> > i.e. the maximum allowed data extent counter and xattr extent counter are
> > maximum possible values w.r.t signed int and signed short.
> > 
> > Can you please explain as to why signed maximum values were considered when
> > the corresponding on-disk data types are unsigned?
> > 
> > 
> 
> Ok. So the reason I asked that question was because I was wondering if
> changing the maximum number of extents for data and attr would cause a change
> in the height of the corresponding bmbt trees (which in turn could change the
> log reservation values). The following calculations show that it does not:
> 
> - 5 levels deep data bmbt tree.
>   |-------+------------------------+-------------------------------|
>   | level | number of nodes/leaves | Total Nr recs                 |
>   |-------+------------------------+-------------------------------|
>   |     0 |                      1 | 3 (max root recs)             |
>   |     1 |                      3 | 125 * 3 = 375                 |
>   |     2 |                    375 | 125 * 375 = 46875             |
>   |     3 |                  46875 | 125 * 46875 = 5859375         |
>   |     4 |                5859375 | 125 * 5859375 = 732421875     |
>   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
>   |-------+------------------------+-------------------------------|
> 
> - 3 levels deep attr bmbt tree.
>   |-------+------------------------+-----------------------|
>   | level | number of nodes/leaves | Total Nr recs         |
>   |-------+------------------------+-----------------------|
>   |     0 |                      1 | 2 (max root recs)     |
>   |     1 |                      2 | 125 * 2 = 250         |
>   |     2 |                    250 | 125 * 250 = 31250     |
>   |     3 |                  31250 | 125 * 31250 = 3906250 |
>   |-------+------------------------+-----------------------|
> 
> - Data type to number of records
>   |-----------+-------------+-----------------|
>   | data type | max extents | max leaf blocks |
>   |-----------+-------------+-----------------|
>   | int32     |  2147483647 |        17179870 |
>   | uint32    |  4294967295 |        34359739 |
>   | int16     |       32767 |             263 |
>   | uint16    |       65535 |             525 |
>   |-----------+-------------+-----------------|
> 
> So data bmbt will still have a height of 5 and attr bmbt will continue to have
> a height of 3.

I think extent count variables should be unsigned because there's no
meaning for a negative extent count.  ("You have -3 extents." "Ehh???")

That said, it was very helpful to point out that the current MAXEXTNUM /
MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.

Can we use this new feature flag + inode flag to allow 4294967295
extents in either fork?
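
To illustrate the signedness point, here is a minimal C sketch of how a
16-bit counter past 0x7fff shows up as a negative number when it is viewed
through a signed type. The helper name is hypothetical, and the raw value
45620 is only an assumption chosen to match the "-19916" in the shutdown
message quoted above; the thread does not state the actual in-memory count.

```c
#include <stdint.h>
#include <string.h>

/*
 * Reinterpret the bits of a 16-bit counter as a signed 16-bit value.
 * memcpy() is used so the result does not depend on implementation-
 * defined integer conversion; int16_t is two's complement by definition.
 */
int16_t as_signed16(uint16_t raw)
{
	int16_t v;

	memcpy(&v, &raw, sizeof(v));
	return v;
}
```

Under that assumption, as_signed16(45620) yields -19916, while the same bits
read as uint16_t stay a meaningful count up to 65535.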

--D

> 
> -- 
> chandan
> 
> 
> 


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-13 18:55         ` Darrick J. Wong
@ 2020-04-20  4:38           ` Chandan Rajendra
  2020-04-22  9:38             ` Chandan Rajendra
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-20  4:38 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > which
> > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > causes the following message to be printed on the console,
> > > > > 
> > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > 
> > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > 
> > > > > I have been informed that there are instances where a single file has
> > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > number of hardlinks are created.
> > > > > 
> > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > 32-bit wide xattr extent counter.
> > > > > 
> > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > ---
> > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > 
> > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > >  
> > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > >  static inline bool
> > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > >  	__be64		di_flags2;	/* more random flags */
> > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > 
> > > > Ok, I think you've limited what we can do here by using this "fill
> > > > holes" variable split. I've never liked doing this, and we've only
> > > > done it in the past when we haven't had space in the inode to create
> > > > a new 32 bit variable.
> > > > 
> > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > new variable:
> > > > 
> > > > 	__be32		di_attr_nextents;
> > > > 
> > > > With that in place, we can now do what we did extending the v1 inode
> > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > 
> > > > That is, when the attribute count is going to overflow, we set an
> > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > and uses that field in the inode, and we set a RO-compat feature
> > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > extent counts in use.
> > > > 
> > > > Old kernels can still read the filesystem and see the extent count
> > > > as "max" (65535), but they can't modify the attr fork and hence can't
> > > > corrupt the 32 bit count they know nothing about.
> > > > 
> > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > on inodes it is modifying and update both the old and new counters
> > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > conversion).
> > > > 
> > > > In future, mkfs can then set the RO feature flag by default so all
> > > > new filesystems use the 32 bit counter.
> > > > 
> > > > >  	/* fields only written to during inode creation */
> > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > >  	((w) == XFS_DATA_FORK ? \
> > > > >  		(dip)->di_format : \
> > > > >  		(dip)->di_aformat)
> > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > +
> > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > 
> > > > If you are converting a macro to static inline, then all the caller
> > > > sites should be converted to lower case at the same time.
> > > > 
> > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > +{
> > > > > +	int32_t anextents;
> > > > 
> > > > Extent counts should be unsigned, as they are on disk.
> > > > 
> > > > > +
> > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > +
> > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > +
> > > > > +	return anextents;
> > > > > +}
> > > > 
> > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > valid?
> > > > 
> > > > >  
> > > > >  /*
> > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > +				XFS_ATTR_FORK);
> > > > 
> > > > This should be open coded, but I'd prefer a completely separate
> > > > variable...
> > > > 
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat	= from->di_aformat;
> > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat = from->di_aformat;
> > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > +		to->di_anextents_hi
> > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > 
> > > > Again, feature bit for on-disk format modifications needed...
> > > > 
> > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > >  	to->di_forkoff = from->di_forkoff;
> > > > >  	to->di_aformat = from->di_aformat;
> > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > >  	struct xfs_mount	*mp,
> > > > >  	int			whichfork)
> > > > >  {
> > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > +	uint32_t		di_nextents;
> > > > > +
> > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > >  
> > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > >  	uint16_t		flags;
> > > > >  	uint64_t		flags2;
> > > > >  	uint64_t		di_size;
> > > > > +	int32_t			nextents;
> > > > > +	int32_t			anextents;
> > > > > +	int64_t			nblocks;
> > > > 
> > > > Extent counts need to be converted to unsigned in memory - they are
> > > > unsigned on disk....
> > > 
> > > In the current code, we have,
> > > 
> > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > 
> > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > maximum possible values w.r.t signed int and signed short.
> > > 
> > > Can you please explain as to why signed maximum values were considered when
> > > the corresponding on-disk data types are unsigned?
> > > 
> > > 
> > 
> > Ok. So the reason I asked that question was because I was wondering if
> > changing the maximum number of extents for data and attr would cause a change
> > in the height of the corresponding bmbt trees (which in-turn could change the log
> > reservation values). The following calculations prove otherwise,
> > 
> > - 5 levels deep data bmbt tree.
> >   |-------+------------------------+-------------------------------|
> >   | level | number of nodes/leaves | Total Nr recs                 |
> >   |-------+------------------------+-------------------------------|
> >   |     0 |                      1 | 3 (max root recs)             |
> >   |     1 |                      3 | 125 * 3 = 375                 |
> >   |     2 |                    375 | 125 * 375 = 46875             |
> >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> >   |-------+------------------------+-------------------------------|
> > 
> > - 3 levels deep attr bmbt tree.
> >   |-------+------------------------+-----------------------|
> >   | level | number of nodes/leaves | Total Nr recs         |
> >   |-------+------------------------+-----------------------|
> >   |     0 |                      1 | 2 (max root recs)     |
> >   |     1 |                      2 | 125 * 2 = 250         |
> >   |     2 |                    250 | 125 * 250 = 31250     |
> >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> >   |-------+------------------------+-----------------------|
> > 
> > - Data type to number of records
> >   |-----------+-------------+-----------------|
> >   | data type | max extents | max leaf blocks |
> >   |-----------+-------------+-----------------|
> >   | int32     |  2147483647 |        17179870 |
> >   | uint32    |  4294967295 |        34359739 |
> >   | int16     |       32767 |             263 |
> >   | uint16    |       65535 |             525 |
> >   |-----------+-------------+-----------------|
> > 
> > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > a height of 3.
> 
> I think extent count variables should be unsigned because there's no
> meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> 
> That said, it was very helpful to point out that the current MAXEXTNUM /
> MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> 
> Can we use this new feature flag + inode flag to allow 4294967295
> extents in either fork?

Sure.

I have already tested that having 4294967295 as the maximum data extent count
does not cause any regressions.

Also, Dave was of the opinion that data extent counter be increased to
64-bit. I think I should include that change along with this feature flag
rather than adding a new one in the near future.

-- 
chandan




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-20  4:38           ` Chandan Rajendra
@ 2020-04-22  9:38             ` Chandan Rajendra
  2020-04-22 22:30               ` Dave Chinner
  2020-04-22 22:51               ` Darrick J. Wong
  0 siblings, 2 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-22  9:38 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, bfoster

On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > > which
> > > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > > causes the following message to be printed on the console,
> > > > > > 
> > > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > > 
> > > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > > 
> > > > > > I have been informed that there are instances where a single file has
> > > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > > number of hardlinks are created.
> > > > > > 
> > > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > > 32-bit wide xattr extent counter.
> > > > > > 
> > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > ---
> > > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > > 
> > > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > > >  
> > > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > > >  static inline bool
> > > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > > >  	__be64		di_flags2;	/* more random flags */
> > > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > > 
> > > > > Ok, I think you've limited what we can do here by using this "fill
> > > > > holes" variable split. I've never liked doing this, and we've only
> > > > > done it in the past when we haven't had space in the inode to create
> > > > > a new 32 bit variable.
> > > > > 
> > > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > > new variable:
> > > > > 
> > > > > 	__be32		di_attr_nextents;
> > > > > 
> > > > > With that in place, we can now do what we did extending the v1 inode
> > > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > > 
> > > > > > That is, when the attribute count is going to overflow, we set an
> > > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > > and uses that field in the inode, and we set a RO-compat feature
> > > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > > extent counts in use.
> > > > > 
> > > > > > Old kernels can still read the filesystem and see the extent count
> > > > > > as "max" (65535), but they can't modify the attr fork and hence can't
> > > > > > corrupt the 32 bit count they know nothing about.
> > > > > 
> > > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > > on inodes it is modifying and update both the old and new counters
> > > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > > conversion).
> > > > > 
> > > > > In future, mkfs can then set the RO feature flag by default so all
> > > > > new filesystems use the 32 bit counter.
> > > > > 
> > > > > >  	/* fields only written to during inode creation */
> > > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > > >  	((w) == XFS_DATA_FORK ? \
> > > > > >  		(dip)->di_format : \
> > > > > >  		(dip)->di_aformat)
> > > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > > +
> > > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > > 
> > > > > If you are converting a macro to static inline, then all the caller
> > > > > sites should be converted to lower case at the same time.
> > > > > 
> > > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > > +{
> > > > > > +	int32_t anextents;
> > > > > 
> > > > > Extent counts should be unsigned, as they are on disk.
> > > > > 
> > > > > > +
> > > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > > +
> > > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > > +
> > > > > > +	return anextents;
> > > > > > +}
> > > > > 
> > > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > > valid?
> > > > > 
> > > > > >  
> > > > > >  /*
> > > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > > +				XFS_ATTR_FORK);
> > > > > 
> > > > > > This should be open coded, but I'd prefer a completely separate
> > > > > > variable...
> > > > > 
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat	= from->di_aformat;
> > > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat = from->di_aformat;
> > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > +		to->di_anextents_hi
> > > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > > 
> > > > > Again, feature bit for on-disk format modifications needed...
> > > > > 
> > > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > >  	to->di_aformat = from->di_aformat;
> > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > > >  	struct xfs_mount	*mp,
> > > > > >  	int			whichfork)
> > > > > >  {
> > > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > > +	uint32_t		di_nextents;
> > > > > > +
> > > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > > >  
> > > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > > >  	uint16_t		flags;
> > > > > >  	uint64_t		flags2;
> > > > > >  	uint64_t		di_size;
> > > > > > +	int32_t			nextents;
> > > > > > +	int32_t			anextents;
> > > > > > +	int64_t			nblocks;
> > > > > 
> > > > > Extent counts need to be converted to unsigned in memory - they are
> > > > > unsigned on disk....
> > > > 
> > > > In the current code, we have,
> > > > 
> > > > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > > 
> > > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > > maximum possible values w.r.t signed int and signed short.
> > > > 
> > > > Can you please explain as to why signed maximum values were considered when
> > > > the corresponding on-disk data types are unsigned?
> > > > 
> > > > 
> > > 
> > > Ok. So the reason I asked that question was because I was wondering if
> > > > changing the maximum number of extents for data and attr would cause a change
> > > > in the height of the corresponding bmbt trees (which in-turn could change the log
> > > reservation values). The following calculations prove otherwise,
> > > 
> > > - 5 levels deep data bmbt tree.
> > >   |-------+------------------------+-------------------------------|
> > >   | level | number of nodes/leaves | Total Nr recs                 |
> > >   |-------+------------------------+-------------------------------|
> > >   |     0 |                      1 | 3 (max root recs)             |
> > >   |     1 |                      3 | 125 * 3 = 375                 |
> > >   |     2 |                    375 | 125 * 375 = 46875             |
> > >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> > >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> > >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> > >   |-------+------------------------+-------------------------------|
> > > 
> > > - 3 levels deep attr bmbt tree.
> > >   |-------+------------------------+-----------------------|
> > >   | level | number of nodes/leaves | Total Nr recs         |
> > >   |-------+------------------------+-----------------------|
> > >   |     0 |                      1 | 2 (max root recs)     |
> > >   |     1 |                      2 | 125 * 2 = 250         |
> > >   |     2 |                    250 | 125 * 250 = 31250     |
> > >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> > >   |-------+------------------------+-----------------------|
> > > 
> > > - Data type to number of records
> > >   |-----------+-------------+-----------------|
> > >   | data type | max extents | max leaf blocks |
> > >   |-----------+-------------+-----------------|
> > >   | int32     |  2147483647 |        17179870 |
> > >   | uint32    |  4294967295 |        34359739 |
> > >   | int16     |       32767 |             263 |
> > > >   | uint16    |       65535 |             525 |
> > >   |-----------+-------------+-----------------|
> > > 
> > > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > > a height of 3.
> > 
> > I think extent count variables should be unsigned because there's no
> > meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> > 
> > That said, it was very helpful to point out that the current MAXEXTNUM /
> > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > 
> > Can we use this new feature flag + inode flag to allow 4294967295
> > extents in either fork?
> 
> Sure.
> 
> I have already tested that having 4294967295 as the maximum data extent count
> does not cause any regressions.
> 
> Also, Dave was of the opinion that data extent counter be increased to
> 64-bit. I think I should include that change along with this feature flag
> rather than adding a new one in the near future.
> 
> 

Hello Dave & Darrick,

Can you please look into the following design decision w.r.t using 32-bit and
64-bit unsigned counters for xattr and data extents.

Maximum extent counts.
|-----------------------+----------------------|
| Field width (in bits) |          Max extents |
|-----------------------+----------------------|
|                    32 |           4294967295 |
|                    48 |      281474976710655 |
|                    64 | 18446744073709551615 |
|-----------------------+----------------------|

|-------------------+-----|
| Minimum node recs | 125 |
| Minimum leaf recs | 125 |
|-------------------+-----|

Data bmbt tree height (MINDBTPTRS == 3)
|-------+------------------------+-------------------------|
| Level | Number of nodes/leaves |           Total Nr recs |
|       |                        | (nr nodes/leaves * 125) |
|-------+------------------------+-------------------------|
|     0 |                      1 |                       3 |
|     1 |                      3 |                     375 |
|     2 |                    375 |                   46875 |
|     3 |                  46875 |                 5859375 |
|     4 |                5859375 |               732421875 |
|     5 |              732421875 |             91552734375 |
|     6 |            91552734375 |          11444091796875 |
|     7 |         11444091796875 |        1430511474609375 |
|     8 |       1430511474609375 |      178813934326171875 |
|     9 |     178813934326171875 |    22351741790771484375 |
|-------+------------------------+-------------------------|

For counting data extents, even though we theoretically have 64 bits at our
disposal, I think we should have (2 ** 48) - 1 as the maximum number of
extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
bmbt tree's height grows by just two more levels (i.e. it grows from the
current maximum height of 5 to 7). Please let me know your opinion on this.

Attr bmbt tree height (MINABTPTRS == 2)
|-------+------------------------+-------------------------|
| Level | Number of nodes/leaves |           Total Nr recs |
|       |                        | (nr nodes/leaves * 125) |
|-------+------------------------+-------------------------|
|     0 |                      1 |                       2 |
|     1 |                      2 |                     250 |
|     2 |                    250 |                   31250 |
|     3 |                  31250 |                 3906250 |
|     4 |                3906250 |               488281250 |
|     5 |              488281250 |             61035156250 |
|-------+------------------------+-------------------------|

For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
will cause the corresponding bmbt's maximum height to go from 3 to 5.
This probably won't cause any regression.
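
The level numbers in the two height tables above can be reproduced with a
short loop. The following is only a sketch: bmbt_levels_needed() is a
hypothetical helper, not an existing kernel function, and it assumes, as the
tables do, that the inode root holds MINDBTPTRS/MINABTPTRS records and that
every other block is filled to just the minimum of 125 records.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): return the deepest level index, in
 * the numbering used by the tables above, that a bmbt needs in order to map
 * nextents extents when the inode root holds root_recs records and every
 * other block holds only min_recs records.
 */
static unsigned int bmbt_levels_needed(uint64_t nextents, uint64_t root_recs,
				       uint64_t min_recs)
{
	uint64_t capacity = root_recs;	/* records reachable at level 0 */
	unsigned int level = 0;

	assert(root_recs > 0 && min_recs > 1);

	while (capacity < nextents) {
		/*
		 * If the next multiplication would overflow 64 bits, the
		 * next level already covers any 64-bit extent count.
		 */
		if (capacity > UINT64_MAX / min_recs)
			return level + 1;
		capacity *= min_recs;
		level++;
	}
	return level;
}
```

For example, bmbt_levels_needed((1ULL << 48) - 1, 3, 125) gives 7 for the data
fork and bmbt_levels_needed(UINT32_MAX, 2, 125) gives 5 for the attr fork,
matching the figures quoted above.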

Meanwhile, I will work on finding the impact of increasing the height of these
two trees on log reservation.

-- 
chandan




^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-22  9:38             ` Chandan Rajendra
@ 2020-04-22 22:30               ` Dave Chinner
  2020-04-25 12:07                 ` Chandan Rajendra
  2020-04-22 22:51               ` Darrick J. Wong
  1 sibling, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-04-22 22:30 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > 
> > > Can we use this new feature flag + inode flag to allow 4294967295
> > > extents in either fork?
> > 
> > Sure.
> > 
> > I have already tested that having 4294967295 as the maximum data extent count
> > does not cause any regressions.
> > 
> > Also, Dave was of the opinion that data extent counter be increased to
> > 64-bit. I think I should include that change along with this feature flag
> > rather than adding a new one in the near future.
> > 
> > 
> 
> Hello Dave & Darrick,
> 
> Can you please look into the following design decision w.r.t using 32-bit and
> 64-bit unsigned counters for xattr and data extents.
> 
> Maximum extent counts.
> |-----------------------+----------------------|
> | Field width (in bits) |          Max extents |
> |-----------------------+----------------------|
> |                    32 |           4294967295 |
> |                    48 |      281474976710655 |
> |                    64 | 18446744073709551615 |
> |-----------------------+----------------------|

These huge numbers are impossible to compare visually.  Once numbers
go beyond 7-9 digits, you need to start condensing them in reports.
Humans are, in general, unable to handle strings of digits longer
than 7-9 digits at all well...

Can you condense them by using scientific representation i.e. XEy,
which gives:

|-----------------------+-------------|
| Field width (in bits) | Max extents |
|-----------------------+-------------|
|                    32 |      4.3E09 |
|                    48 |      2.8E14 |
|                    64 |      1.8E19 |
|-----------------------+-------------|

It's much easier to compare differences visually because it's now
only 4 digits, not 20. The other alternative is to use k,m,g,t,p,e
suffixes to indicate magnitude (4.3g, 280t, 18e), but using
exponentials make the numbers easier to do calculations on
directly...

> |-------------------+-----|
> | Minimum node recs | 125 |
> | Minimum leaf recs | 125 |
> |-------------------+-----|

Please show your working. I'm assuming this is 50% * 4kB /
sizeof(bmbt_rec), so you are working out limits based on 4kB block
size? Realistically, worst case behaviour will be with the minimum
supported block size, which in this case will be 1kB....
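
One possible way to show that working is sketched below. The two constants
are assumptions, not values taken from this thread: a 72-byte header for a
v5 (CRC-enabled) long-format btree block and a 16-byte bmbt record. The
helper names are hypothetical. The minimum is taken as half of the maximum
record count, rounded down.

```c
#include <assert.h>

#define LBLOCK_CRC_HDR_LEN	72	/* assumed v5 long-format btree block header */
#define BMBT_REC_LEN		16	/* assumed size of one on-disk extent record */

/* Maximum number of extent records that fit in one bmbt leaf block. */
static unsigned int bmbt_leaf_maxrecs(unsigned int blocksize)
{
	assert(blocksize > LBLOCK_CRC_HDR_LEN);
	return (blocksize - LBLOCK_CRC_HDR_LEN) / BMBT_REC_LEN;
}

/* Minimum fill of a non-root leaf block: half the maximum, rounded down. */
static unsigned int bmbt_leaf_minrecs(unsigned int blocksize)
{
	return bmbt_leaf_maxrecs(blocksize) / 2;
}
```

Under those assumptions a 4kB block gives a minimum of 125 records, which
agrees with the table quoted above, while a 1kB block gives only 29.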

> Data bmbt tree height (MINDBTPTRS == 3)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       3 |
> |     1 |                      3 |                     375 |
> |     2 |                    375 |                   46875 |
> |     3 |                  46875 |                 5859375 |
> |     4 |                5859375 |               732421875 |
> |     5 |              732421875 |             91552734375 |
> |     6 |            91552734375 |          11444091796875 |
> |     7 |         11444091796875 |        1430511474609375 |
> |     8 |       1430511474609375 |      178813934326171875 |
> |     9 |     178813934326171875 |    22351741790771484375 |
> |-------+------------------------+-------------------------|
> 
> For counting data extents, even though we theoretically have 64 bits at our
> disposal, I think we should have (2 ** 48) - 1 as the maximum number of
> extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> bmbt tree's height grows by just two more levels (i.e. it grows from the
> current maximum height of 5 to 7). Please let me know your opinion on this.

We shouldn't make up arbitrary limits when we can calculate them exactly.
i.e. 2^63 max file size, 1kB block size (2^10), means max fragments
is 2^53 entries. On a 64kB block size (2^16), we have a max extent
count of 2^47....

i.e. 2^48 would be an acceptable limit for 1kB block size, but it is
not correct for 64kB block size filesystems....
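
The arithmetic behind those two figures is just a subtraction of exponents:
in the worst case every filesystem block of a maximally sized file is its
own extent. As a sketch (max_extent_bits() is a hypothetical helper, not
kernel code):

```c
#include <assert.h>

/*
 * log2 of the worst-case extent count for one fork, assuming one extent per
 * filesystem block: log2(max file size) - log2(block size).
 */
static unsigned int max_extent_bits(unsigned int file_size_bits,
				    unsigned int block_size_bits)
{
	assert(file_size_bits >= block_size_bits);
	return file_size_bits - block_size_bits;
}
```

max_extent_bits(63, 10) is 53 for 1kB blocks and max_extent_bits(63, 16) is 47
for 64kB blocks.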

> Attr bmbt tree height (MINABTPTRS == 2)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       2 |
> |     1 |                      2 |                     250 |
> |     2 |                    250 |                   31250 |
> |     3 |                  31250 |                 3906250 |
> |     4 |                3906250 |               488281250 |
> |     5 |              488281250 |             61035156250 |
> |-------+------------------------+-------------------------|
> 
> For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> will cause the corresponding bmbt's maximum height to go from 3 to 5.
> This probably won't cause any regression.

We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
attr fork extent count makes no difference to the attribute fork
bmbt reservations. i.e. the bmbt reservations are defined by the
dabtree structure limits, not the maximum extent count the fork can
hold.

Extending the data fork to 64 bits has no impact on the directory
reservations, either, because the number of extents in the directory
is bound by the directory segment size of 32GB. i.e. a directory can
hold, at most, 32GB of dirent data, which means there's a hard limit
on the number of dabtree entries somewhere in the order of a few
hundred million. That's where XFS_DA_NODE_MAXDEPTH comes from - it's
large enough to index a max sized directory, and the BMBT overhead
is derived from that...

> Meanwhile, I will work on finding the impact of increasing the
> height of these two trees on log reservation.

It should not change it substantially - 2 blocks per bmbt
reservation per transaction is what I'd expect from the numbers
presented...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-22  9:38             ` Chandan Rajendra
  2020-04-22 22:30               ` Dave Chinner
@ 2020-04-22 22:51               ` Darrick J. Wong
  1 sibling, 0 replies; 37+ messages in thread
From: Darrick J. Wong @ 2020-04-22 22:51 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > On Sun, Apr 12, 2020 at 12:04:13PM +0530, Chandan Rajendra wrote:
> > > > On Friday, April 10, 2020 1:16 PM Chandan Rajendra wrote: 
> > > > > On Tuesday, April 7, 2020 6:50 AM Dave Chinner wrote: 
> > > > > > On Sat, Apr 04, 2020 at 02:22:03PM +0530, Chandan Rajendra wrote:
> > > > > > > XFS has a per-inode xattr extent counter which is 16 bits wide. A workload
> > > > > > > which
> > > > > > > 1. Creates 1,000,000 255-byte sized xattrs,
> > > > > > > 2. Deletes 50% of these xattrs in an alternating manner,
> > > > > > > 3. Tries to create 400,000 new 255-byte sized xattrs
> > > > > > > causes the following message to be printed on the console,
> > > > > > > 
> > > > > > > XFS (loop0): xfs_iflush_int: detected corrupt incore inode 131, total extents = -19916, nblocks = 102937, ptr ffff9ce33b098c00
> > > > > > > XFS (loop0): xfs_do_force_shutdown(0x8) called from line 3739 of file fs/xfs/xfs_inode.c. Return address = ffffffffa4a94173
> > > > > > > 
> > > > > > > This indicates that we overflowed the 16-bits wide xattr extent counter.
> > > > > > > 
> > > > > > > I have been informed that there are instances where a single file has
> > > > > > >  > 100 million hardlinks. With parent pointers being stored in xattr,
> > > > > > > we will overflow the 16-bits wide xattr extent counter when large
> > > > > > > number of hardlinks are created.
> > > > > > > 
> > > > > > > Hence this commit extends xattr extent counter to 32-bits. It also introduces
> > > > > > > an incompat flag to prevent older kernels from mounting newer filesystems with
> > > > > > > 32-bit wide xattr extent counter.
> > > > > > > 
> > > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > > ---
> > > > > > >  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++-------
> > > > > > >  fs/xfs/libxfs/xfs_inode_buf.c  | 27 +++++++++++++++++++--------
> > > > > > >  fs/xfs/libxfs/xfs_inode_fork.c |  3 ++-
> > > > > > >  fs/xfs/libxfs/xfs_log_format.h |  5 +++--
> > > > > > >  fs/xfs/libxfs/xfs_types.h      |  4 ++--
> > > > > > >  fs/xfs/scrub/inode.c           |  7 ++++---
> > > > > > >  fs/xfs/xfs_inode_item.c        |  3 ++-
> > > > > > >  fs/xfs/xfs_log_recover.c       | 13 ++++++++++---
> > > > > > >  8 files changed, 63 insertions(+), 27 deletions(-)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > > > > > > index 045556e78ee2c..0a4266b0d46e1 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_format.h
> > > > > > > +++ b/fs/xfs/libxfs/xfs_format.h
> > > > > > > @@ -465,10 +465,12 @@ xfs_sb_has_ro_compat_feature(
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_FTYPE	(1 << 0)	/* filetype in dirent */
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_SPINODES	(1 << 1)	/* sparse inode chunks */
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_META_UUID	(1 << 2)	/* metadata UUID */
> > > > > > > +#define XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR (1 << 3)
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_ALL \
> > > > > > >  		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
> > > > > > >  		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
> > > > > > > -		 XFS_SB_FEAT_INCOMPAT_META_UUID)
> > > > > > > +		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
> > > > > > > +		 XFS_SB_FEAT_INCOMPAT_32BIT_AEXT_CNTR)
> > > > > > >  
> > > > > > >  #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
> > > > > > >  static inline bool
> > > > > > > @@ -874,7 +876,7 @@ typedef struct xfs_dinode {
> > > > > > >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > > > > > >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > > > > > >  	__be32		di_nextents;	/* number of extents in data fork */
> > > > > > > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > > > > > > +	__be16		di_anextents_lo;/* lower part of xattr extent count */
> > > > > > >  	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
> > > > > > >  	__s8		di_aformat;	/* format of attr fork's data */
> > > > > > >  	__be32		di_dmevmask;	/* DMIG event mask */
> > > > > > > @@ -891,7 +893,8 @@ typedef struct xfs_dinode {
> > > > > > >  	__be64		di_lsn;		/* flush sequence */
> > > > > > >  	__be64		di_flags2;	/* more random flags */
> > > > > > >  	__be32		di_cowextsize;	/* basic cow extent size for file */
> > > > > > > -	__u8		di_pad2[12];	/* more padding for future expansion */
> > > > > > > +	__be16		di_anextents_hi;/* higher part of xattr extent count */
> > > > > > > +	__u8		di_pad2[10];	/* more padding for future expansion */
> > > > > > 
> > > > > > Ok, I think you've limited what we can do here by using this "fill
> > > > > > holes" variable split. I've never liked doing this, and we've only
> > > > > > done it in the past when we haven't had space in the inode to create
> > > > > > a new 32 bit variable.
> > > > > > 
> > > > > > IOWs, this is a v5 format feature only, so we should just create a
> > > > > > new variable:
> > > > > > 
> > > > > > 	__be32		di_attr_nextents;
> > > > > > 
> > > > > > With that in place, we can now do what we did extending the v1 inode
> > > > > > link count (16 bits) to the v2 inode link count (32 bits).
> > > > > > 
> > > > > > That is, when the attribute count is going to overflow, we set an
> > > > > > inode flag on disk to indicate that it now has a 32 bit extent count
> > > > > > and uses that field in the inode, and we set a RO-compat feature
> > > > > > flag in the superblock to indicate that there are 32 bit attr fork
> > > > > > extent counts in use.
> > > > > > 
> > > > > > Old kernels can still read the filesystem, but see the extent count
> > > > > > as "max" (65535) but can't modify the attr fork and hence corrupt
> > > > > > the 32 bit count it knows nothing about.
> > > > > > 
> > > > > > If the kernel sees the RO feature bit set, it can set the inode flag
> > > > > > on inodes it is modifying and update both the old and new counters
> > > > > > appropriately when flushing the inode to disk (i.e. transparent
> > > > > > conversion).
> > > > > > 
> > > > > > In future, mkfs can then set the RO feature flag by default so all
> > > > > > new filesystems use the 32 bit counter.
> > > > > > 
> > > > > > >  	/* fields only written to during inode creation */
> > > > > > >  	xfs_timestamp_t	di_crtime;	/* time created */
> > > > > > > @@ -993,10 +996,21 @@ enum xfs_dinode_fmt {
> > > > > > >  	((w) == XFS_DATA_FORK ? \
> > > > > > >  		(dip)->di_format : \
> > > > > > >  		(dip)->di_aformat)
> > > > > > > -#define XFS_DFORK_NEXTENTS(dip,w) \
> > > > > > > -	((w) == XFS_DATA_FORK ? \
> > > > > > > -		be32_to_cpu((dip)->di_nextents) : \
> > > > > > > -		be16_to_cpu((dip)->di_anextents))
> > > > > > > +
> > > > > > > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > > > > > 
> > > > > > If you are converting a macro to static inline, then all the caller
> > > > > > sites should be converted to lower case at the same time.
> > > > > > 
> > > > > > > +					struct xfs_dinode *dip, int whichfork)
> > > > > > > +{
> > > > > > > +	int32_t anextents;
> > > > > > 
> > > > > > Extent counts should be unsigned, as they are on disk.
> > > > > > 
> > > > > > > +
> > > > > > > +	if (whichfork == XFS_DATA_FORK)
> > > > > > > +		return be32_to_cpu((dip)->di_nextents);
> > > > > > > +
> > > > > > > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > > > > > > +	if (xfs_sb_version_has_v3inode(sbp))
> > > > > > > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > > > > > > +
> > > > > > > +	return anextents;
> > > > > > > +}
> > > > > > 
> > > > > > No feature bit to indicate that 32 bit attribute extent counts are
> > > > > > valid?
> > > > > > 
> > > > > > >  
> > > > > > >  /*
> > > > > > >   * For block and character special files the 32bit dev_t is stored at the
> > > > > > > diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > index 39c5a6e24915c..ced8195bd8c22 100644
> > > > > > > --- a/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> > > > > > > @@ -232,7 +232,8 @@ xfs_inode_from_disk(
> > > > > > >  	to->di_nblocks = be64_to_cpu(from->di_nblocks);
> > > > > > >  	to->di_extsize = be32_to_cpu(from->di_extsize);
> > > > > > >  	to->di_nextents = be32_to_cpu(from->di_nextents);
> > > > > > > -	to->di_anextents = be16_to_cpu(from->di_anextents);
> > > > > > > +	to->di_anextents = XFS_DFORK_NEXTENTS(&ip->i_mount->m_sb, from,
> > > > > > > +				XFS_ATTR_FORK);
> > > > > > 
> > > > > > This should be open coded, but I'd prefer a completely separate
> > > > > > variable...
> > > > > > 
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat	= from->di_aformat;
> > > > > > >  	to->di_dmevmask	= be32_to_cpu(from->di_dmevmask);
> > > > > > > @@ -282,7 +283,7 @@ xfs_inode_to_disk(
> > > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > > +	to->di_anextents_lo = cpu_to_be16((u32)(from->di_anextents) & 0xffff);
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat = from->di_aformat;
> > > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > > @@ -296,6 +297,8 @@ xfs_inode_to_disk(
> > > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.tv_nsec);
> > > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > > +		to->di_anextents_hi
> > > > > > > +			= cpu_to_be16((u32)(from->di_anextents) >> 16);
> > > > > > 
> > > > > > Again, feature bit for on-disk format modifications needed...
> > > > > > 
> > > > > > >  		to->di_ino = cpu_to_be64(ip->i_ino);
> > > > > > >  		to->di_lsn = cpu_to_be64(lsn);
> > > > > > >  		memset(to->di_pad2, 0, sizeof(to->di_pad2));
> > > > > > > @@ -335,7 +338,7 @@ xfs_log_dinode_to_disk(
> > > > > > >  	to->di_nblocks = cpu_to_be64(from->di_nblocks);
> > > > > > >  	to->di_extsize = cpu_to_be32(from->di_extsize);
> > > > > > >  	to->di_nextents = cpu_to_be32(from->di_nextents);
> > > > > > > -	to->di_anextents = cpu_to_be16(from->di_anextents);
> > > > > > > +	to->di_anextents_lo = cpu_to_be16(from->di_anextents_lo);
> > > > > > >  	to->di_forkoff = from->di_forkoff;
> > > > > > >  	to->di_aformat = from->di_aformat;
> > > > > > >  	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
> > > > > > > @@ -349,6 +352,7 @@ xfs_log_dinode_to_disk(
> > > > > > >  		to->di_crtime.t_nsec = cpu_to_be32(from->di_crtime.t_nsec);
> > > > > > >  		to->di_flags2 = cpu_to_be64(from->di_flags2);
> > > > > > >  		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
> > > > > > > +		to->di_anextents_hi = cpu_to_be16(from->di_anextents_hi);
> > > > > > >  		to->di_ino = cpu_to_be64(from->di_ino);
> > > > > > >  		to->di_lsn = cpu_to_be64(from->di_lsn);
> > > > > > >  		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
> > > > > > > @@ -365,7 +369,9 @@ xfs_dinode_verify_fork(
> > > > > > >  	struct xfs_mount	*mp,
> > > > > > >  	int			whichfork)
> > > > > > >  {
> > > > > > > -	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> > > > > > > +	uint32_t		di_nextents;
> > > > > > > +
> > > > > > > +	di_nextents = XFS_DFORK_NEXTENTS(&mp->m_sb, dip, whichfork);
> > > > > > >  
> > > > > > >  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
> > > > > > >  	case XFS_DINODE_FMT_LOCAL:
> > > > > > > @@ -436,6 +442,9 @@ xfs_dinode_verify(
> > > > > > >  	uint16_t		flags;
> > > > > > >  	uint64_t		flags2;
> > > > > > >  	uint64_t		di_size;
> > > > > > > +	int32_t			nextents;
> > > > > > > +	int32_t			anextents;
> > > > > > > +	int64_t			nblocks;
> > > > > > 
> > > > > > Extent counts need to be converted to unsigned in memory - they are
> > > > > > unsigned on disk....
> > > > > 
> > > > > In the current code, we have,
> > > > > 
> > > > > #define MAXEXTNUM       ((xfs_extnum_t)0x7fffffff)      /* signed int */
> > > > > #define MAXAEXTNUM      ((xfs_aextnum_t)0x7fff)         /* signed short */
> > > > > 
> > > > > i.e. the maximum allowed data extent counter and xattr extent counter are
> > > > > maximum possible values w.r.t signed int and signed short.
> > > > > 
> > > > > Can you please explain as to why signed maximum values were considered when
> > > > > the corresponding on-disk data types are unsigned?
> > > > > 
> > > > > 
> > > > 
> > > > Ok. So the reason I asked that question was because I was wondering if
> > > > changing the maximum number of extents for data and attr would cause a change
> > > > in the height of the corresponding bmbt trees (which in turn could change the log
> > > > reservation values). The following calculations prove otherwise,
> > > > 
> > > > - 5 levels deep data bmbt tree.
> > > >   |-------+------------------------+-------------------------------|
> > > >   | level | number of nodes/leaves | Total Nr recs                 |
> > > >   |-------+------------------------+-------------------------------|
> > > >   |     0 |                      1 | 3 (max root recs)             |
> > > >   |     1 |                      3 | 125 * 3 = 375                 |
> > > >   |     2 |                    375 | 125 * 375 = 46875             |
> > > >   |     3 |                  46875 | 125 * 46875 = 5859375         |
> > > >   |     4 |                5859375 | 125 * 5859375 = 732421875     |
> > > >   |     5 |              732421875 | 125 * 732421875 = 91552734375 |
> > > >   |-------+------------------------+-------------------------------|
> > > > 
> > > > - 3 levels deep attr bmbt tree.
> > > >   |-------+------------------------+-----------------------|
> > > >   | level | number of nodes/leaves | Total Nr recs         |
> > > >   |-------+------------------------+-----------------------|
> > > >   |     0 |                      1 | 2 (max root recs)     |
> > > >   |     1 |                      2 | 125 * 2 = 250         |
> > > >   |     2 |                    250 | 125 * 250 = 31250     |
> > > >   |     3 |                  31250 | 125 * 31250 = 3906250 |
> > > >   |-------+------------------------+-----------------------|
> > > > 
> > > > - Data type to number of records
> > > >   |-----------+-------------+-----------------|
> > > >   | data type | max extents | max leaf blocks |
> > > >   |-----------+-------------+-----------------|
> > > >   | int32     |  2147483647 |        17179870 |
> > > >   | uint32    |  4294967295 |        34359739 |
> > > >   | int16     |       32767 |             263 |
> > > >   | uint16    |       65535 |             525 |
> > > >   |-----------+-------------+-----------------|
> > > > 
> > > > So data bmbt will still have a height of 5 and attr bmbt will continue to have
> > > > a height of 3.
> > > 
> > > I think extent count variables should be unsigned because there's no
> > > meaning for a negative extent count.  ("You have -3 extents." "Ehh???")
> > > 
> > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > 
> > > Can we use this new feature flag + inode flag to allow 4294967295
> > > extents in either fork?
> > 
> > Sure.
> > 
> > I have already tested that having 4294967295 as the maximum data extent count
> > does not cause any regressions.
> > 
> > Also, Dave was of the opinion that data extent counter be increased to
> > 64-bit. I think I should include that change along with this feature flag
> > rather than adding a new one in the near future.
> > 
> > 
> 
> Hello Dave & Darrick,
> 
> Can you please look into the following design decision w.r.t using 32-bit and
> 64-bit unsigned counters for xattr and data extents.
> 
> Maximum extent counts.
> |-----------------------+----------------------|
> | Field width (in bits) |          Max extents |
> |-----------------------+----------------------|
> |                    32 |           4294967295 |
> |                    48 |      281474976710655 |
> |                    64 | 18446744073709551615 |
> |-----------------------+----------------------|
> 
> |-------------------+-----|
> | Minimum node recs | 125 |
> | Minimum leaf recs | 125 |
> |-------------------+-----|
> 
> Data bmbt tree height (MINDBTPTRS == 3)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       3 |
> |     1 |                      3 |                     375 |
> |     2 |                    375 |                   46875 |
> |     3 |                  46875 |                 5859375 |
> |     4 |                5859375 |               732421875 |
> |     5 |              732421875 |             91552734375 |
> |     6 |            91552734375 |          11444091796875 |
> |     7 |         11444091796875 |        1430511474609375 |
> |     8 |       1430511474609375 |      178813934326171875 |
> |     9 |     178813934326171875 |    22351741790771484375 |
> |-------+------------------------+-------------------------|
> 
> For counting data extents, even though we theoretically have 64 bits at our
> disposal, I think we should have (2 ** 48) - 1 as the maximum number of

Why not 2^54-1, since that's the maximum value you can put in
br_startoff?  Granted I might just use a u64 and not have to deal with
bit masking :P

Hmm, so 2^54-1 = 18,014,398,509,481,983.

BMBT blocks have a 72-byte header, so on a 1k block filesystem that's...

(1024-72) = 952 bytes for records, and 16 bytes per record.

Assuming the block is half full, that's ... 952 / (16 * 2) = 29 records
per leaf.

Assuming the max records, that's 621,186,155,499,379 leaf blocks.

Node blocks require 16 bytes per keyptr pair, so they also store 29
records per node block.

Node level 1 would need 21,420,212,258,600 blocks.
Node level 2 would need 738,628,008,918 blocks.
Node level 3 would need 25,469,931,342 blocks.
Node level 4 would need 878,273,495 blocks.
Node level 5 would need 30,285,293 blocks.
Node level 6 would need 1,044,321 blocks.
Node level 7 would need 36,012 blocks.
Node level 8 would need 1,242 blocks.
Node level 9 would need 43 blocks.
Node level 10 would need 2 blocks.
Node level 11 could hold that in the ifork.

So I guess we'd need to bump XFS_BTREE_MAXLEVELS to 11 to support that.
Though we'd run out of global RAM and disk supply long before we
actually hit that, so perhaps we don't care.  Certainly increasing
XFS_BM_MAXLEVELS will make log reservation requirements grow even more.
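
For anyone who wants to redo the arithmetic above, here is a rough
Python sketch. It assumes half-full 1k blocks with the 72-byte header
and 16-byte records mentioned above, and stops once the remaining
pointers fit in the inode fork root. ROOT_PTRS = 3 is borrowed from
MINDBTPTRS in the quoted table, and the function name is invented for
this example.

```python
BLOCK_SIZE = 1024
HEADER_SIZE = 72   # bmbt block header size, as stated above
RECORD_SIZE = 16   # bytes per record or key/pointer pair
ROOT_PTRS = 3      # assumption: pointers that fit in the inode fork root

# Half-full blocks: 952 // 32 == 29 records per block.
RECS_PER_BLOCK = (BLOCK_SIZE - HEADER_SIZE) // (RECORD_SIZE * 2)


def blocks_per_level(max_extents):
    """Block counts from the leaf level upward, until the root can index them."""
    counts = []
    n = max_extents
    while True:
        n = -(-n // RECS_PER_BLOCK)  # ceiling division
        counts.append(n)
        if n <= ROOT_PTRS:
            return counts


counts = blocks_per_level(2**54 - 1)
print(f"leaf level: {counts[0]:,} blocks")
for level, blocks in enumerate(counts[1:], start=1):
    print(f"node level {level}: {blocks:,} blocks")
print(f"the root in the ifork would be level {len(counts)}")
```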

> extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> bmbt tree's height grows by just two more levels (i.e. it grows from the
> current maximum height of 5 to 7). Please let me know your opinion on this.
> 
> Attr bmbt tree height (MINABTPTRS == 2)
> |-------+------------------------+-------------------------|
> | Level | Number of nodes/leaves |           Total Nr recs |
> |       |                        | (nr nodes/leaves * 125) |
> |-------+------------------------+-------------------------|
> |     0 |                      1 |                       2 |
> |     1 |                      2 |                     250 |
> |     2 |                    250 |                   31250 |
> |     3 |                  31250 |                 3906250 |
> |     4 |                3906250 |               488281250 |
> |     5 |              488281250 |             61035156250 |
> |-------+------------------------+-------------------------|
> 
> For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> will cause the corresponding bmbt's maximum height to go from 3 to 5.
> This probably won't cause any regression.
>
> Meanwhile, I will work on finding the impact of increasing the height of these
> two trees on log reservation.

Heh.  xfs_db log reservation dump command can be your friend for that. :)

--D

> -- 
> chandan
> 
> 
> 

* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-22 22:30               ` Dave Chinner
@ 2020-04-25 12:07                 ` Chandan Rajendra
  2020-04-26 22:08                   ` Dave Chinner
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-25 12:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > On Monday, April 20, 2020 10:08 AM Chandan Rajendra wrote: 
> > > On Tuesday, April 14, 2020 12:25 AM Darrick J. Wong wrote: 
> > > > That said, it was very helpful to point out that the current MAXEXTNUM /
> > > > MAXAEXTNUM symbols stop short of using all 32 (or 16) bits.
> > > > 
> > > > Can we use this new feature flag + inode flag to allow 4294967295
> > > > extents in either fork?
> > > 
> > > Sure.
> > > 
> > > I have already tested that having 4294967295 as the maximum data extent count
> > > does not cause any regressions.
> > > 
> > > Also, Dave was of the opinion that data extent counter be increased to
> > > 64-bit. I think I should include that change along with this feature flag
> > > rather than adding a new one in the near future.
> > > 
> > > 
> > 
> > Hello Dave & Darrick,
> > 
> > Can you please look into the following design decision w.r.t using 32-bit and
> > 64-bit unsigned counters for xattr and data extents.
> > 
> > Maximum extent counts.
> > |-----------------------+----------------------|
> > | Field width (in bits) |          Max extents |
> > |-----------------------+----------------------|
> > |                    32 |           4294967295 |
> > |                    48 |      281474976710655 |
> > |                    64 | 18446744073709551615 |
> > |-----------------------+----------------------|
> 
> These huge numbers are impossible to compare visually.  Once numbers
> go beyond 7-9 digits, you need to start condensing them in reports.
> Humans are, in general, unable to handle strings of digits longer
> than 7-9 digits at all well...
> 
> Can you condense them by using scientific representation i.e. XEy,
> which gives:
> 
> |-----------------------+-------------|
> | Field width (in bits) | Max extents |
> |-----------------------+-------------|
> |                    32 |      4.3E09 |
> |                    48 |      2.8E14 |
> |                    64 |      1.8E19 |
> |-----------------------+-------------|
> 
> It's much easier to compare differences visually because it's now
> only 4 digits, not 20. The other alternative is to use k,m,g,t,p,e
> suffixes to indicate magnitude (4.3g, 280t, 18e), but using
> exponentials make the numbers easier to do calculations on
> directly...
>

Sorry about that. I will use scientific notation for representing large
numbers.

> > |-------------------+-----|
> > | Minimum node recs | 125 |
> > | Minimum leaf recs | 125 |
> > |-------------------+-----|
>
> Please show your working. I'm assuming this is 50% * 4kB /
> sizeof(bmbt_rec), so you are working out limits based on 4kB block
> size? Realistically, worst case behaviour will be with the minimum
> supported block size, which in this case will be 1kB....

Yes, your assumption of 4k block size is correct. I will include detailed
calculation steps in my future mails.

> 
> > Data bmbt tree height (MINDBTPTRS == 3)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves |           Total Nr recs |
> > |       |                        | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > |     0 |                      1 |                       3 |
> > |     1 |                      3 |                     375 |
> > |     2 |                    375 |                   46875 |
> > |     3 |                  46875 |                 5859375 |
> > |     4 |                5859375 |               732421875 |
> > |     5 |              732421875 |             91552734375 |
> > |     6 |            91552734375 |          11444091796875 |
> > |     7 |         11444091796875 |        1430511474609375 |
> > |     8 |       1430511474609375 |      178813934326171875 |
> > |     9 |     178813934326171875 |    22351741790771484375 |
> > |-------+------------------------+-------------------------|
> > 
> > For counting data extents, even though we theoretically have 64 bits at our
> > disposal, I think we should have (2 ** 48) - 1 as the maximum number of
> > extents. This gives 281474976710655 (i.e. ~281 trillion extents). With this,
> > bmbt tree's height grows by just two more levels (i.e. it grows from the
> > current maximum height of 5 to 7). Please let me know your opinion on this.
> 
> We shouldn't make up arbitrary limits when we can calculate them exactly.
> i.e. 2^63 max file size, 1kB block size (2^10), means max fragments
> is 2^53 entries. On a 64kB block size (2^16), we have a max extent
> count of 2^47....
> 
> i.e. 2^48 would be an acceptable limit for 1kB block size, but it is
> not correct for 64kB block size filesystems....

You are right about this. I will set the max data extent count to 2^47.
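
As a cross-check of the heights in the tables quoted above, a small
Python sketch follows. It assumes 125 records per 4k bmbt block and the
root record counts from the tables (3 for the data fork, 2 for the attr
fork); the helper name is hypothetical.

```python
RECS_PER_BLOCK = 125  # minimum records per 4k bmbt block, from the tables above


def bmbt_levels(max_extents, root_recs):
    """Smallest level L such that root_recs * 125**L >= max_extents."""
    level, capacity = 0, root_recs
    while capacity < max_extents:
        capacity *= RECS_PER_BLOCK
        level += 1
    return level


print("data fork, 2^31 - 1 extents:", bmbt_levels(2**31 - 1, 3))
print("data fork, 2^47 extents:    ", bmbt_levels(2**47, 3))
print("attr fork, 2^15 - 1 extents:", bmbt_levels(2**15 - 1, 2))
print("attr fork, 2^32 - 1 extents:", bmbt_levels(2**32 - 1, 2))
```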

> 
> > Attr bmbt tree height (MINABTPTRS == 2)
> > |-------+------------------------+-------------------------|
> > | Level | Number of nodes/leaves |           Total Nr recs |
> > |       |                        | (nr nodes/leaves * 125) |
> > |-------+------------------------+-------------------------|
> > |     0 |                      1 |                       2 |
> > |     1 |                      2 |                     250 |
> > |     2 |                    250 |                   31250 |
> > |     3 |                  31250 |                 3906250 |
> > |     4 |                3906250 |               488281250 |
> > |     5 |              488281250 |             61035156250 |
> > |-------+------------------------+-------------------------|
> > 
> > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > This probably won't cause any regression.
> 
> We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> attr fork extent count makes no difference to the attribute fork
> bmbt reservations. i.e. the bmbt reservations are defined by the
> dabtree structure limits, not the maximum extent count the fork can
> hold.

I think the dabtree structure limits are due to the following ...

How many levels of dabtree would be needed to hold ~100 million xattrs?
- name len = 16 bytes
    struct xfs_parent_name_rec {
            __be64  p_ino;
            __be32  p_gen;
            __be32  p_diroffset;
    };
  i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
- Value len = file name length = Assume ~40 bytes
- Formula for number of node entries (used in column 3 in the table given
  below) at any level of the dabtree,
  nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
  xfs_da_node_entry))
  i.e. nr_blocks * ((block size - 64) / 8)
- Formula for number of leaf entries (used in column 4 in the table given
  below),
  nr_blocks * ((block size - sizeof(xfs_attr_leaf_hdr_t)) /
  (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval))
  i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))

Here I have assumed block size to be 4k.

|-------+------------------+--------------------------+--------------------------|
| Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
|-------+------------------+--------------------------+--------------------------|
|     0 |              1.0 |                      5e2 |                    6.1e1 |
|     1 |              5e2 |                    2.5e5 |                    3.0e4 |
|     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
|     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
|-------+------------------+--------------------------+--------------------------|

Hence we would need a tree of height 3.
Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
... which is < 2^32 (4.3e9)
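
The table above can be regenerated with the following Python sketch. It
uses the same assumed sizes (4k blocks, 64-byte node header, 8-byte node
entries, 32-byte leaf header, 16-byte names and ~40-byte values) and
rounds the per-block leaf capacity down to a whole entry; the function
name is made up for this example.

```python
BLOCK_SIZE = 4096
NODE_HDR, NODE_ENTRY = 64, 8       # sizes used in the node formula above
LEAF_HDR = 32
LEAF_ENTRY = 8 + 2 + 1 + 16 + 40   # entry + valuelen + namelen + name + value

NODE_FANOUT = (BLOCK_SIZE - NODE_HDR) // NODE_ENTRY   # 504 entries per node block
LEAF_FANOUT = (BLOCK_SIZE - LEAF_HDR) // LEAF_ENTRY   # 60 entries per leaf block


def dabtree_levels(nr_xattrs):
    """Return (level, total_blocks) where leaf capacity first reaches nr_xattrs."""
    blocks, total, level = 1, 0, 0
    while True:
        total += blocks
        if blocks * LEAF_FANOUT >= nr_xattrs:
            return level, total
        blocks *= NODE_FANOUT
        level += 1


level, total = dabtree_levels(100_000_000)
print(f"height {level}, about {total:.1e} blocks, fits in 32 bits: {total < 2**32}")
```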

> 
> Extending the data fork to 64 bits has no impact on the directory
> reservations, either, because the number of extents in the directory
> is bound by the directory segment size of 32GB. i.e. a directory can
> hold, at most, 32GB of dirent data, which means there's a hard limit
> on the number of dabtree entries somewhere in the order of a few
> hundred million. That's where XFS_DA_NODE_MAXDEPTH comes from - it's
> large enough to index a max sized directory, and the BMBT overhead
> is derived from that...

Ok. Thanks for explaining that.

> 
> > Meanwhile, I will work on finding the impact of increasing the
> > height of these two trees on log reservation.
> 
> It should not change it substantially - 2 blocks per bmbt
> reservation per transaction is what I'd expect from the numbers
> presented...

I haven't gotten to this task yet. I will respond soon. I spent time
figuring out how directories are organized in XFS and arriving at the
calculations above for the xattr extent counter.

-- 
chandan




* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-25 12:07                 ` Chandan Rajendra
@ 2020-04-26 22:08                   ` Dave Chinner
  2020-04-29 15:35                     ` Chandan Rajendra
  0 siblings, 1 reply; 37+ messages in thread
From: Dave Chinner @ 2020-04-26 22:08 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > Attr bmbt tree height (MINABTPTRS == 2)
> > > |-------+------------------------+-------------------------|
> > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > |       |                        | (nr nodes/leaves * 125) |
> > > |-------+------------------------+-------------------------|
> > > |     0 |                      1 |                       2 |
> > > |     1 |                      2 |                     250 |
> > > |     2 |                    250 |                   31250 |
> > > |     3 |                  31250 |                 3906250 |
> > > |     4 |                3906250 |               488281250 |
> > > |     5 |              488281250 |             61035156250 |
> > > |-------+------------------------+-------------------------|
> > > 
> > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > This probably won't cause any regression.
> > 
> > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > attr fork extent count makes no difference to the attribute fork
> > bmbt reservations. i.e. the bmbt reservations are defined by the
> > dabtree structure limits, not the maximum extent count the fork can
> > hold.
> 
> I think the dabtree structure limits is because of the following ...
> 
> How many levels of dabtree would be needed to hold ~100 million xattrs?
> - name len = 16 bytes
>          struct xfs_parent_name_rec {
>                __be64  p_ino;
>                __be32  p_gen;
>                __be32  p_diroffset;
>        };
>   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> - Value len = file name length = Assume ~40 bytes

That's quite long for a file name, but let's run with it...

> - Formula for number of node entries (used in column 3 in the table given
>   below) at any level of the dabtree,
>   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
>   xfs_da_node_entry))
>   i.e. nr_blocks * ((block size - 64) / 8)
> - Formula for number of leaf entries (used in column 4 in the table given
>   below),
>   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
>   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
>   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> 
> Here I have assumed block size to be 4k.
> 
> |-------+------------------+--------------------------+--------------------------|
> | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> |-------+------------------+--------------------------+--------------------------|
> |     0 |              1.0 |                      5e2 |                    6.1e1 |
> |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> |-------+------------------+--------------------------+--------------------------|

I'm not sure what this table actually represents.

> 
> Hence we would need a tree of height 3.
> Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8

130 million blocks to hold 100 million xattrs? That doesn't pass the
smell test.

I think you are trying to do these calculations from the wrong
direction. Calculate the number of leaf blocks needed to hold the
xattr data first, then work out the height of the pointer tree from
that. e.g:

If we need 100m xattrs, we need this many 100% full 4k blocks to
hold them all:

blocks	= 100m / entries per leaf
	= 100m / 61
	= 1.64m

and if we assume 37% for the least populated (because magic
split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
in 4k blocks.

That makes a lot more sense. Now the tree itself:

ptrs per node ^ N = 5m
ptrs per node ^ (N-1) = 5m / 500 = 10k
ptrs per node ^ (N-2) = 10k / 500 = 20
ptrs per node ^ (N-3) = 20 / 500 = 1

So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
and the pointer tree requires ~12000 blocks which is noise compared
to the number of leaf blocks...

As for the bmbt, we've got ~5m extents worst case, which is

ptrs per node ^ N = 5m
ptrs per node ^ (N-1) = 5m / 125 = 40k
ptrs per node ^ (N-2) = 40k / 125 = 320
ptrs per node ^ (N-3) = 320 / 125 = 3

As 3 bmbt records should fit in the inode fork, we'd only need a 4
level bmbt tree to hold this, too. It's at the lower limit of a 4
level tree, but 100m xattrs is the extreme case we are talking about
here...
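
The bottom-up estimate above can be sketched in C as follows. This is only
an illustration of the arithmetic in this mail, not kernel code:
div_round_up(), tree_height() and bmbt_height() are made-up helper names,
and the fanouts (500 dabtree pointers per node, 125 bmbt records/pointers
per block) are the round numbers used in this thread for 4k blocks.

```c
#include <assert.h>
#include <stdint.h>

/* Round-up integer division. */
static uint64_t div_round_up(uint64_t n, uint64_t d)
{
	assert(d != 0);
	return (n + d - 1) / d;
}

/*
 * Height of a tree with nr_leaf_blocks blocks at the bottom level and
 * 'fanout' pointers per interior block, counted bottom-up: keep adding
 * a level of pointer blocks until a single (root) block remains.
 */
static unsigned int tree_height(uint64_t nr_leaf_blocks, uint64_t fanout)
{
	unsigned int height = 1;
	uint64_t blocks = nr_leaf_blocks;

	assert(fanout > 1);
	while (blocks > 1) {
		blocks = div_round_up(blocks, fanout);
		height++;
	}
	return height;
}

/*
 * Hypothetical helper: bmbt height for nr_extents extent records,
 * assuming recs_per_block records per leaf and the same number of
 * pointers per node. The root counts as one level whether or not it
 * lives in the inode fork.
 */
static unsigned int bmbt_height(uint64_t nr_extents, uint64_t recs_per_block)
{
	return tree_height(div_round_up(nr_extents, recs_per_block),
			   recs_per_block);
}
```

With ~5m leaf blocks, tree_height(5000000, 500) gives 4 for the dabtree, and
bmbt_height(5000000, 125) gives 4 for the bmbt, which agrees with the
heights worked out by hand above.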

FWIW, repeat this with a directory data segment size of 32GB w/ 40
byte names, and the numbers aren't much different to a worst case
xattr tree of this shape. You'll see the reason for the dabtree
height being limited to 5, and that neither the directory structure
nor the xattr structure is anywhere near the 2^32 extent count
limit...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
                     ` (2 preceding siblings ...)
  2020-04-07  1:20   ` Dave Chinner
@ 2020-04-27  7:39   ` Christoph Hellwig
  2020-04-30  2:29     ` Chandan Rajendra
  3 siblings, 1 reply; 37+ messages in thread
From: Christoph Hellwig @ 2020-04-27  7:39 UTC
  To: Chandan Rajendra; +Cc: linux-xfs, david, chandan, darrick.wong, bfoster

FYI, I have had a series in the works for a while, not quite finished
yet, that moves the in-memory nextents and format fields into the ifork
structure.  I feared this might conflict badly, but so far this seems
relatively harmless.  Note that your patch creates a not so nice layout
in struct xfs_icdinode, so maybe I need to rush and finish that series
ASAP.

> +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> +					struct xfs_dinode *dip, int whichfork)
> +{
> +	int32_t anextents;
> +
> +	if (whichfork == XFS_DATA_FORK)
> +		return be32_to_cpu((dip)->di_nextents);
> +
> +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> +	if (xfs_sb_version_has_v3inode(sbp))
> +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> +
> +	return anextents;

No need for any of the parentheses around dip.  Also this function really
deserves a proper lower case name now, and probably should be moved out
of line.

>  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
>  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
>  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */

We can just retire xfs_aextnum_t.  It only has 4 uses anyway.

> @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
>  	to->di_nblocks = from->di_nblocks;
>  	to->di_extsize = from->di_extsize;
>  	to->di_nextents = from->di_nextents;
> -	to->di_anextents = from->di_anextents;
> +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;

No need for any of the casting here.

> @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
>  			goto out_release;
>  		}
>  	}
> -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> +
> +	nextents = ldip->di_anextents_lo;
> +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> +
> +	nextents += ldip->di_nextents;

Little helpers to get/set the attr extents in the log inode would be nice.
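
A minimal sketch of what such get/set helpers could look like, assuming
the log dinode keeps the count in host byte order (the hunk above does no
endian conversion). struct log_dinode_sketch and both function names are
hypothetical; only the di_anextents_lo/di_anextents_hi field names come
from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two 16-bit halves in the log dinode. */
struct log_dinode_sketch {
	uint16_t	di_anextents_lo;	/* low 16 bits */
	uint16_t	di_anextents_hi;	/* high 16 bits, v3 inodes only */
};

/* Combine the halves; the high half is ignored without v3 inodes. */
static uint32_t log_dinode_anextents_get(const struct log_dinode_sketch *ldip,
					 bool has_v3inode)
{
	uint32_t nextents = ldip->di_anextents_lo;

	if (has_v3inode)
		nextents |= (uint32_t)ldip->di_anextents_hi << 16;
	return nextents;
}

/* Split a 32-bit count into the two halves stored in the log dinode. */
static void log_dinode_anextents_set(struct log_dinode_sketch *ldip,
				     bool has_v3inode, uint32_t nextents)
{
	/* Without v3 inodes only the low 16 bits are available. */
	assert(has_v3inode || nextents <= 0xffff);

	ldip->di_anextents_lo = (uint16_t)(nextents & 0xffff);
	ldip->di_anextents_hi = has_v3inode ? (uint16_t)(nextents >> 16) : 0;
}
```

Keeping the split and the recombination in one place would let both
xfs_inode_to_log_dinode() and the recovery path share the same logic.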


Last but not least:  This seems like a feature flag we could just lazily
set once needed, similar to attr2.


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-07  1:20   ` Dave Chinner
  2020-04-08 12:45     ` Chandan Rajendra
  2020-04-10  7:46     ` Chandan Rajendra
@ 2020-04-27  7:42     ` Christoph Hellwig
  2 siblings, 0 replies; 37+ messages in thread
From: Christoph Hellwig @ 2020-04-27  7:42 UTC
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong, bfoster

On Tue, Apr 07, 2020 at 11:20:00AM +1000, Dave Chinner wrote:
> Ok, I think you've limited what we can do here by using this "fill
> holes" variable split. I've never liked doing this, and we've only
> done it in the past when we haven't had space in the inode to create
> a new 32 bit variable.
> 
> IOWs, this is a v5 format feature only, so we should just create a
> new variable:
> 
> 	__be32		di_attr_nextents;
> 
> With that in place, we can now do what we did extending the v1 inode
> link count (16 bits) to the v2 inode link count (32 bits).
> 
> That is, when the attribute count is going to overflow, we set an
> inode flag on disk to indicate that it now has a 32 bit extent count
> and uses that field in the inode, and we set a RO-compat feature
> flag in the superblock to indicate that there are 32 bit attr fork
> extent counts in use.
> 
> Old kernels can still read the filesystem and see the extent count
> as "max" (65535), but they can't modify the attr fork and hence can't
> corrupt the 32 bit count they know nothing about.
> 
> If the kernel sees the RO feature bit set, it can set the inode flag
> on inodes it is modifying and update both the old and new counters
> appropriately when flushing the inode to disk (i.e. transparent
> conversion).
> 
> In future, mkfs can then set the RO feature flag by default so all
> new filesystems use the 32 bit counter.

I don't like just moving to a new counter.  This wastes precious
space that is going to be really confusing to reuse later, and doesn't
really help with performance.  And we can do the RO_COMPAT trick
even without that.


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-26 22:08                   ` Dave Chinner
@ 2020-04-29 15:35                     ` Chandan Rajendra
  2020-05-01  7:08                       ` Chandan Rajendra
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-29 15:35 UTC
  To: Dave Chinner; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > |-------+------------------------+-------------------------|
> > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > |       |                        | (nr nodes/leaves * 125) |
> > > > |-------+------------------------+-------------------------|
> > > > |     0 |                      1 |                       2 |
> > > > |     1 |                      2 |                     250 |
> > > > |     2 |                    250 |                   31250 |
> > > > |     3 |                  31250 |                 3906250 |
> > > > |     4 |                3906250 |               488281250 |
> > > > |     5 |              488281250 |             61035156250 |
> > > > |-------+------------------------+-------------------------|
> > > > 
> > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > This probably won't cause any regression.
> > > 
> > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > attr fork extent count makes no difference to the attribute fork
> > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > dabtree structure limits, not the maximum extent count the fork can
> > > hold.
> > 
> > I think the dabtree structure limits is because of the following ...
> > 
> > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > - name len = 16 bytes
> >          struct xfs_parent_name_rec {
> >                __be64  p_ino;
> >                __be32  p_gen;
> >                __be32  p_diroffset;
> >        };
> >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > - Value len = file name length = Assume ~40 bytes
> 
> > That's quite long for a file name, but let's run with it...
> 
> > - Formula for number of node entries (used in column 3 in the table given
> >   below) at any level of the dabtree,
> >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> >   xfs_da_node_entry))
> >   i.e. nr_blocks * ((block size - 64) / 8)
> > - Formula for number of leaf entries (used in column 4 in the table given
> >   below),
> >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > 
> > Here I have assumed block size to be 4k.
> > 
> > |-------+------------------+--------------------------+--------------------------|
> > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > |-------+------------------+--------------------------+--------------------------|
> > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > |-------+------------------+--------------------------+--------------------------|
> 
> I'm not sure what this table actually represents.
> 
> > 
> > Hence we would need a tree of height 3.
> > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> 
> 130 million blocks to hold 100 million xattrs? That doesn't pass the
> smell test.
> 
> I think you are trying to do these calculations from the wrong
> direction.

You are right. Btrees grow in height by adding a new root
node. Hence the btree space usage should be calculated in bottom-to-top
direction.

> Calculate the number of leaf blocks needed to hold the
> xattr data first, then work out the height of the pointer tree from
> that. e.g:
> 
> If we need 100m xattrs, we need this many 100% full 4k blocks to
> hold them all:
> 
> blocks	= 100m / entries per leaf
> 	= 100m / 61
> 	= 1.64m
> 
> and if we assume 37% for the least populated (because magic
> split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> in 4k blocks.
> 
> That makes a lot more sense. Now the tree itself:
> 
> ptrs per node ^ N = 5m
> ptrs per node ^ (N-1) = 5m / 500 = 10k
> ptrs per node ^ (N-2) = 10k / 500 = 20
> ptrs per node ^ (N-3) = 20 / 500 = 1
> 
> So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> and the pointer tree requires ~12000 blocks which is noise compared
> to the number of leaf blocks...
> 
> As for the bmbt, we've got ~5m extents worst case, which is
> 
> ptrs per node ^ N = 5m
> ptrs per node ^ (N-1) = 5m / 125 = 40k
> ptrs per node ^ (N-2) = 40k / 125 = 320
> ptrs per node ^ (N-3) = 320 / 125 = 3
> 
> As 3 bmbt records should fit in the inode fork, we'd only need a 4
> level bmbt tree to hold this, too. It's at the lower limit of a 4
> level tree, but 100m xattrs is the extreme case we are talking about
> here...
> 
> FWIW, repeat this with a directory data segment size of 32GB w/ 40
> byte names, and the numbers aren't much different to a worst case
> xattr tree of this shape. You'll see the reason for the dabtree
> height being limited to 5, and that neither the directory structure
> nor the xattr structure is anywhere near the 2^32 extent count
> limit...

Directory segment size is 32GB
  - Number of directory entries required for indexing 32GiB.
    - 32GiB is divided into 4k data blocks. 
    - Number of 4k blocks = 32GB / 4k = 8M
    - Each 4k data block has,
      - struct xfs_dir3_data_hdr = 64 bytes
      - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
                                   = 52 bytes
      - Number of 'struct xfs_dir2_data_entry' in a 4k block
        (4096 - 64) / 52 = 78
    - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
      8m * 78 = 654m
  - Contents of a single dabtree leaf
    - struct xfs_dir3_leaf_hdr = 64 bytes
    - struct xfs_dir2_leaf_entry = 8 bytes
    - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
    - 37% of 504 = 186 entries
  - Contents of a single dabtree node
    - struct xfs_da3_node_hdr = 64 bytes
    - struct xfs_da_node_entry = 8 bytes
    - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
  - Nr leaves
    Level (N) = 654m / 186 = 3m leaves
    Level (N-1) = 3m / 504 = 6k
    Level (N-2) = 6k / 504 = 12
    Level (N-3) = 12 / 504 = 1
    Dabtree having 4 levels is sufficient.

Hence a dabtree with 5 levels should be more than enough to index a 32GiB
directory segment containing directory entries with even shorter names.
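
The same estimate as a small C sketch, assuming 4k blocks and the structure
sizes listed above (64-byte headers, 52-byte data entries, 8-byte leaf and
node entries). dir_max_entries() and dir_dabtree_levels() are hypothetical
helper names. With truncating integer division a data block holds 77
entries rather than the rounded 78, so the total comes out at ~646m
instead of ~654m; the level count is the same either way.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE	4096ULL		/* assumed 4k blocks */
#define DIR_SEGMENT	(32ULL << 30)	/* 32GiB directory data segment */
#define HDR_SIZE	64ULL		/* data/leaf/node header size used above */
#define DATA_ENTRY	(12ULL + 40ULL)	/* 12 bytes metadata + 40 byte name */
#define INDEX_ENTRY	8ULL		/* leaf entry and node entry size */

/* Number of directory entries that fit in the whole data segment. */
static uint64_t dir_max_entries(void)
{
	uint64_t per_block = (BLOCK_SIZE - HDR_SIZE) / DATA_ENTRY;

	return (DIR_SEGMENT / BLOCK_SIZE) * per_block;
}

/*
 * Dabtree levels needed to index nr_entries directory entries when
 * every leaf is only 37% full and every node is completely full.
 */
static unsigned int dir_dabtree_levels(uint64_t nr_entries)
{
	uint64_t per_node = (BLOCK_SIZE - HDR_SIZE) / INDEX_ENTRY; /* 504 */
	uint64_t per_leaf = per_node * 37 / 100;		    /* 186 */
	uint64_t blocks;
	unsigned int levels = 1;

	assert(nr_entries > 0);
	blocks = (nr_entries + per_leaf - 1) / per_leaf;
	while (blocks > 1) {
		blocks = (blocks + per_node - 1) / per_node;
		levels++;
	}
	return levels;
}
```

dir_dabtree_levels(dir_max_entries()) evaluates to 4, i.e. one level less
than XFS_DA_NODE_MAXDEPTH.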

Even with 5m extents (as in the xattr tree example above) consumed by a
dabtree, this is still far below the limit of 2^32 (i.e. ~4 billion)
extents.

Hence the actual log space consumed for logging bmbt blocks is limited by the
height of the dabtree.

My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
2^32 respectively, gave me the following results,
- For 1k block size, bmbt tree height increased by 3.
- For 4k block size, bmbt tree height increased by 2.

This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
entries in the worst case.
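
A rough model of that worst-case calculation, as I understand it, is
sketched below. bmbt_maxlevels() is a made-up name and this is not the
kernel function: the kernel derives the minimum record counts from the
mount structure, whereas here they are plain parameters. The values 125
(minimum records per 4k block) and 3 / 2 (root pointers for the data /
attr fork) are assumptions taken from the figures used in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Worst-case bmbt height for max_leaf_ents extent records, assuming every
 * block holds only the minimum number of records and the root lives in
 * the inode fork with room for max_root_recs pointers.
 */
static unsigned int bmbt_maxlevels(uint64_t max_leaf_ents,
				   uint64_t min_leaf_recs,
				   uint64_t min_node_recs,
				   uint64_t max_root_recs)
{
	uint64_t max_blocks;
	unsigned int level;

	assert(min_leaf_recs > 0 && min_node_recs > 1);

	/* Leaf blocks needed if each one is minimally populated. */
	max_blocks = (max_leaf_ents + min_leaf_recs - 1) / min_leaf_recs;

	for (level = 1; max_blocks > 1; level++) {
		if (max_blocks <= max_root_recs)
			max_blocks = 1;	/* remaining pointers fit in the inode root */
		else
			max_blocks = (max_blocks + min_node_recs - 1) /
				     min_node_recs;
	}
	return level;
}
```

Under these assumptions the data fork goes from 5 levels (2^31 - 1 extents)
to 7 levels (2^47 - 1 extents), and the attr fork from 3 levels (2^15 - 1)
to 5 levels (2^32 - 1), which is the increase of 2 seen with 4k blocks.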

For the attr fork bmbt, do you think the calculation should be changed to
consider the number of extents occupied by a dabtree holding > 100 million
xattrs?

The increase in bmbt height in turn causes the static reservation values
to increase. In the worst case, the maximum increase observed was 118k bytes
(4k block size, reflink=0, tr_rename).

The experiment was executed after applying "xfsprogs: Fix log reservation
calculation for xattr insert operation" patch
(https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)

I am attaching the output of "xfs_db -c logres <dev>" executed on the
following configurations of the XFS filesystem.
- -b size=1k -m reflink=0
- -b size=1k -m rmapbt=1,reflink=1
- -b size=4k -m reflink=0
- -b size=4k -m rmapbt=1,reflink=1
- -b size=1k -m crc=0
- -b size=4k -m crc=0

I will go through the code which calculates the log reservations of the
entries which have a drastic increase in their values.

-- 
chandan

[-- Attachment #2: xfs-db-logres.tar.gz --]
[-- Type: application/x-compressed-tar, Size: 1564 bytes --]


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-27  7:39   ` Christoph Hellwig
@ 2020-04-30  2:29     ` Chandan Rajendra
  0 siblings, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-04-30  2:29 UTC
  To: Christoph Hellwig
  Cc: Chandan Rajendra, linux-xfs, david, darrick.wong, bfoster

On Monday, April 27, 2020 1:09 PM Christoph Hellwig wrote: 
> FYI, I have had a series in the works for a while but not quite 
> finished yet that moves the in-memory nextents and format fields
> into the ifork structure.  I feared this might conflict badly, but
> so far this seems relatively harmless.  Note that your patch creates
> some not so nice layout in struct xfs_icdinode, so maybe I need to
> rush and finish that series ASAP.
> 
> > +static inline int32_t XFS_DFORK_NEXTENTS(struct xfs_sb *sbp,
> > +					struct xfs_dinode *dip, int whichfork)
> > +{
> > +	int32_t anextents;
> > +
> > +	if (whichfork == XFS_DATA_FORK)
> > +		return be32_to_cpu((dip)->di_nextents);
> > +
> > +	anextents = be16_to_cpu((dip)->di_anextents_lo);
> > +	if (xfs_sb_version_has_v3inode(sbp))
> > +		anextents |= ((u32)(be16_to_cpu((dip)->di_anextents_hi)) << 16);
> > +
> > +	return anextents;
> 
> No need for any of the parentheses around dip.  Also this function really
> deserves a proper lower case name now, and probably should be moved out
> of line.

Sure, I will implement that.

> 
> >  typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
> >  typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
> >  typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> > -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> > +typedef int32_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> 
> We can just retire xfs_aextnum_t.  It only has 4 uses anyway.
> 
> > @@ -327,7 +327,7 @@ xfs_inode_to_log_dinode(
> >  	to->di_nblocks = from->di_nblocks;
> >  	to->di_extsize = from->di_extsize;
> >  	to->di_nextents = from->di_nextents;
> > -	to->di_anextents = from->di_anextents;
> > +	to->di_anextents_lo = ((u32)(from->di_anextents)) & 0xffff;
> 
> No need for any of the casting here.

Ok.

> 
> > @@ -3044,7 +3045,14 @@ xlog_recover_inode_pass2(
> >  			goto out_release;
> >  		}
> >  	}
> > -	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
> > +
> > +	nextents = ldip->di_anextents_lo;
> > +	if (xfs_sb_version_has_v3inode(&mp->m_sb))
> > +		nextents |= ((u32)(ldip->di_anextents_hi) << 16);
> > +
> > +	nextents += ldip->di_nextents;
> 
> Little helpers to get/set the attr extents in the log inode would be nice.
>

Ok. I will implement the helper functions.

> 
> Last but not least:  This seems like a feature flag we could just lazily
> set once needed, similar to attr2.
> 

Yes, I will implement this change before posting the next version of the
patchset.

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-04-29 15:35                     ` Chandan Rajendra
@ 2020-05-01  7:08                       ` Chandan Rajendra
  2020-05-12 23:53                         ` Darrick J. Wong
  0 siblings, 1 reply; 37+ messages in thread
From: Chandan Rajendra @ 2020-05-01  7:08 UTC
  To: Dave Chinner; +Cc: Darrick J. Wong, Chandan Rajendra, linux-xfs, bfoster

On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > |-------+------------------------+-------------------------|
> > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > |-------+------------------------+-------------------------|
> > > > > |     0 |                      1 |                       2 |
> > > > > |     1 |                      2 |                     250 |
> > > > > |     2 |                    250 |                   31250 |
> > > > > |     3 |                  31250 |                 3906250 |
> > > > > |     4 |                3906250 |               488281250 |
> > > > > |     5 |              488281250 |             61035156250 |
> > > > > |-------+------------------------+-------------------------|
> > > > > 
> > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > This probably won't cause any regression.
> > > > 
> > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > attr fork extent count makes no difference to the attribute fork
> > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > dabtree structure limits, not the maximum extent count the fork can
> > > > hold.
> > > 
> > > I think the dabtree structure limits is because of the following ...
> > > 
> > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > - name len = 16 bytes
> > >          struct xfs_parent_name_rec {
> > >                __be64  p_ino;
> > >                __be32  p_gen;
> > >                __be32  p_diroffset;
> > >        };
> > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > - Value len = file name length = Assume ~40 bytes
> > 
> > > That's quite long for a file name, but let's run with it...
> > 
> > > - Formula for number of node entries (used in column 3 in the table given
> > >   below) at any level of the dabtree,
> > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > >   xfs_da_node_entry))
> > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > - Formula for number of leaf entries (used in column 4 in the table given
> > >   below),
> > >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > 
> > > Here I have assumed block size to be 4k.
> > > 
> > > |-------+------------------+--------------------------+--------------------------|
> > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > |-------+------------------+--------------------------+--------------------------|
> > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > |-------+------------------+--------------------------+--------------------------|
> > 
> > I'm not sure what this table actually represents.
> > 
> > > 
> > > Hence we would need a tree of height 3.
> > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > 
> > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > smell test.
> > 
> > I think you are trying to do these calculations from the wrong
> > direction.
> 
> You are right. Btrees grow in height by adding a new root
> node. Hence the btree space usage should be calculated in bottom-to-top
> direction.
> 
> > Calculate the number of leaf blocks needed to hold the
> > xattr data first, then work out the height of the pointer tree from
> > that. e.g:
> > 
> > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > hold them all:
> > 
> > blocks	= 100m / entries per leaf
> > 	= 100m / 61
> > 	= 1.64m
> > 
> > and if we assume 37% for the least populated (because magic
> > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > in 4k blocks.
> > 
> > That makes a lot more sense. Now the tree itself:
> > 
> > ptrs per node ^ N = 5m
> > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > ptrs per node ^ (N-2) = 10k / 500 = 20
> > ptrs per node ^ (N-3) = 20 / 500 = 1
> > 
> > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > and the pointer tree requires ~12000 blocks which is noise compared
> > to the number of leaf blocks...
> > 
> > As for the bmbt, we've got ~5m extents worst case, which is
> > 
> > ptrs per node ^ N = 5m
> > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > ptrs per node ^ (N-2) = 40k / 125 = 320
> > ptrs per node ^ (N-3) = 320 / 125 = 3
> > 
> > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > level tree, but 100m xattrs is the extreme case we are talking about
> > here...
> > 
> > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > byte names, and the numbers aren't much different to a worst case
> > xattr tree of this shape. You'll see the reason for the dabtree
> > height being limited to 5, and that neither the directory structure
> > nor the xattr structure is anywhere near the 2^32 extent count
> > limit...
> 
> Directory segment size is 32GB
>   - Number of directory entries required for indexing 32GiB.
>     - 32GiB is divided into 4k data blocks. 
>     - Number of 4k blocks = 32GB / 4k = 8M
>     - Each 4k data block has,
>       - struct xfs_dir3_data_hdr = 64 bytes
>       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
>                                    = 52 bytes
>       - Number of 'struct xfs_dir2_data_entry' in a 4k block
>         (4096 - 64) / 52 = 78
>     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
>       8m * 78 = 654m
>   - Contents of a single dabtree leaf
>     - struct xfs_dir3_leaf_hdr = 64 bytes
>     - struct xfs_dir2_leaf_entry = 8 bytes
>     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
>     - 37% of 504 = 186 entries
>   - Contents of a single dabtree node
>     - struct xfs_da3_node_hdr = 64 bytes
>     - struct xfs_da_node_entry = 8 bytes
>     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
>   - Nr leaves
>     Level (N) = 654m / 186 = 3m leaves
>     Level (N-1) = 3m / 504 = 6k
>     Level (N-2) = 6k / 504 = 12
>     Level (N-3) = 12 / 504 = 1
>     Dabtree having 4 levels is sufficient.
> 
> Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> directory segment containing directory entries with even shorter names.
> 
> Even with 5m extents (used in xattr tree example above) consumed by a da
> btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> billion) extents.
> 
> Hence the actual log space consumed for logging bmbt blocks is limited by the
> height of da btree.
> 
> My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> 2^32 respectively, gave me the following results,
> - For 1k block size, bmbt tree height increased by 3.
> - For 4k block size, bmbt tree height increased by 2.
> 
> This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> entries in the worst case.
> 
> For Attr fork Bmbt , Do you think the calculation should be changed to
> consider the number of extents occupied by a dabtree holding > 100 million
> xattrs?
> 
> The new increase in Bmbt height in turn causes the static reservation values
> to increase. In the worst case, the maximum increase observed was 118k bytes
> (4k block size, reflink=0, tr_rename).
> 
> The experiment was executed after applying "xfsprogs: Fix log reservation
> calculation for xattr insert operation" patch
> (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> 
> I am attaching the output of "xfs_db -c logres <dev>" executed on the
> following configurations of the XFS filesystem.
> - -b size=1k -m reflink=0
> - -b size=1k -m rmapbt=1,reflink=1
> - -b size=4k -m reflink=0
> - -b size=4k -m rmapbt=1,reflink=1
> - -b size=1k -m crc=0
> - -b size=4k -m crc=0
> 
> I will go through the code which calculates the log reservations of the
> entries which have a drastic increase in their values.
> 

The highest increase (i.e. an increase of 118k) in log reservation was
associated with the rename operation,

STATIC uint
xfs_calc_rename_reservation(
        struct xfs_mount        *mp)
{
        return XFS_DQUOT_LOGRES(mp) +
                max((xfs_calc_inode_res(mp, 4) +
                     xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
                                      XFS_FSB_TO_B(mp, 1))),
                    (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
                     xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
                                      XFS_FSB_TO_B(mp, 1))));
}

The first argument to max() contributes the highest value.

xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))

The inode reservation part is a constant.

The number of blocks computed by the second operand of the '+' operator is,

2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))

= 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))

When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
evaluates to,

2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
= 70 blocks

When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
evaluates to,

2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
= 98 blocks
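
The arithmetic above can be reproduced with a small sketch. The helper names
below are hypothetical and are not kernel functions. The per-buffer overhead
argument is an assumption, since this mail does not state it; 128 bytes is
used in the note after the code purely as an illustrative value.

```c
#include <assert.h>
#include <stdint.h>

#define DA_NODE_MAXDEPTH	5	/* XFS_DA_NODE_MAXDEPTH, as used above */

/* Blocks logged for one directory operation: dabtree blocks plus bmbt blocks. */
static uint64_t dirop_log_blocks(unsigned int bmbt_height)
{
	uint64_t da_blocks = DA_NODE_MAXDEPTH + 2;

	assert(bmbt_height >= 1);
	return da_blocks + da_blocks * (bmbt_height - 1);
}

/* The factor of 2 comes from the 2 * XFS_DIROP_LOG_COUNT(mp) term. */
static uint64_t rename_log_blocks(unsigned int bmbt_height)
{
	return 2 * dirop_log_blocks(bmbt_height);
}

/*
 * Growth in bytes of the buffer part of the rename reservation when the
 * bmbt height goes from old_height to new_height. per_buf_overhead is an
 * assumed per-buffer logging overhead in bytes.
 */
static uint64_t rename_res_delta(unsigned int old_height,
				 unsigned int new_height,
				 uint64_t block_size,
				 uint64_t per_buf_overhead)
{
	uint64_t extra = rename_log_blocks(new_height) -
			 rename_log_blocks(old_height);

	return extra * (block_size + per_buf_overhead);
}
```

rename_log_blocks(5) and rename_log_blocks(7) give 70 and 98 blocks. With 4k
blocks the 28 extra blocks alone account for 114688 bytes; if a 128-byte
per-buffer overhead is assumed, the figure becomes 118272 bytes, which is in
line with the observed 118k increase.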

However, I don't see any extraneous space reserved by the above calculation
that could be removed. Also, IMHO an increase by 118k is most likely not going
to introduce any bugs. I will execute xfstests to make sure that no
regressions get added.

-- 
chandan





* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-05-01  7:08                       ` Chandan Rajendra
@ 2020-05-12 23:53                         ` Darrick J. Wong
  2020-05-13 12:19                           ` Chandan Rajendra
  0 siblings, 1 reply; 37+ messages in thread
From: Darrick J. Wong @ 2020-05-12 23:53 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> > On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > > |-------+------------------------+-------------------------|
> > > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > > |-------+------------------------+-------------------------|
> > > > > > |     0 |                      1 |                       2 |
> > > > > > |     1 |                      2 |                     250 |
> > > > > > |     2 |                    250 |                   31250 |
> > > > > > |     3 |                  31250 |                 3906250 |
> > > > > > |     4 |                3906250 |               488281250 |
> > > > > > |     5 |              488281250 |             61035156250 |
> > > > > > |-------+------------------------+-------------------------|
> > > > > > 
> > > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > > This probably won't cause any regression.
> > > > > 
> > > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > > attr fork extent count makes no difference to the attribute fork
> > > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > > dabtree structure limits, not the maximum extent count the fork can
> > > > > hold.
> > > > 
> > > > I think the dabtree structure limits is because of the following ...
> > > > 
> > > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > > - name len = 16 bytes
> > > >          struct xfs_parent_name_rec {
> > > >                __be64  p_ino;
> > > >                __be32  p_gen;
> > > >                __be32  p_diroffset;
> > > >        };
> > > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > > - Value len = file name length = Assume ~40 bytes
> > > 
> > > That's quite long for a file name, but lets run with it...
> > > 
> > > > - Formula for number of node entries (used in column 3 in the table given
> > > >   below) at any level of the dabtree,
> > > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > > >   xfs_da_node_entry))
> > > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > > - Formula for number of leaf entries (used in column 4 in the table given
> > > >   below),
> > > >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> > > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > > 
> > > > Here I have assumed block size to be 4k.
> > > > 
> > > > |-------+------------------+--------------------------+--------------------------|
> > > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > > |-------+------------------+--------------------------+--------------------------|
> > > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > > |-------+------------------+--------------------------+--------------------------|
> > > 
> > > I'm not sure what this table actually represents.
> > > 
> > > > 
> > > > Hence we would need a tree of height 3.
> > > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > > 
> > > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > > smell test.
> > > 
> > > I think you are trying to do these calculations from the wrong
> > > direction.
> > 
> > You are right. Btrees grow in height by adding a new root
> > node. Hence the btree space usage should be calculated in bottom-to-top
> > direction.
> > 
> > > Calculate the number of leaf blocks needed to hold the
> > > xattr data first, then work out the height of the pointer tree from
> > > that. e.g:
> > > 
> > > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > > hold them all:
> > > 
> > > blocks	= 100m / entries per leaf
> > > 	= 100m / 61
> > > 	= 1.64m
> > > 
> > > and if we assume 37% for the least populated (because magic
> > > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > > in 4k blocks.
> > > 
> > > That makes a lot more sense. Now the tree itself:
> > > 
> > > ptrs per node ^ N = 5m
> > > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > > ptrs per node ^ (N-2) = 10k / 500 = 200
> > > ptrs per node ^ (N-3) = 200 / 500 = 1
> > > 
> > > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > > and the pointer tree requires ~12000 blocks which is noise compared
> > > to the number of leaf blocks...
> > > 
> > > As for the bmbt, we've got ~5m extents worst case, which is
> > > 
> > > ptrs per node ^ N = 5m
> > > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > > ptrs per node ^ (N-2) = 40k / 125 = 320
> > > ptrs per node ^ (N-3) = 320 / 125 = 3
> > > 
> > > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > > level tree, but 100m xattrs is the extreme case we are talking about
> > > here...
> > > 
> > > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > > byte names, and the numbers aren't much different to a worst case
> > > xattr tree of this shape. You'll see the reason for the dabtree
> > > height being limited to 5, and that neither the directory structure
> > > nor the xattr structure is anywhere near the 2^32 bit extent count
> > > limit...
> > 
> > Directory segment size is 32 GB
> >   - Number of directory entries required for indexing 32GiB.
> >     - 32GiB is divided into 4k data blocks. 
> >     - Number of 4k blocks = 32GB / 4k = 8M
> >     - Each 4k data block has,
> >       - struct xfs_dir3_data_hdr = 64 bytes
> >       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
> >                                    = 52 bytes
> >       - Number of 'struct xfs_dir2_data_entry' in a 4k block
> >         (4096 - 64) / 52 = 78
> >     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
> >       8m * 78 = 654m
> >   - Contents of a single dabtree leaf
> >     - struct xfs_dir3_leaf_hdr = 64 bytes
> >     - struct xfs_dir2_leaf_entry = 8 bytes
> >     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
> >     - 37% of 504 = 186 entries
> >   - Contents of a single dabtree node
> >     - struct xfs_da3_node_hdr = 64 bytes
> >     - struct xfs_da_node_entry = 8 bytes
> >     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
> >   - Nr leaves
> >     Level (N) = 654m / 186 = 3m leaves
> >     Level (N-1) = 3m / 504 = 6k
> >     Level (N-2) = 6k / 504 = 12
> >     Level (N-3) = 12 / 504 = 1
> >     Dabtree having 4 levels is sufficient.
> > 
> > Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> > directory segment containing directory entries with even shorter names.
> > 
> > Even with 5m extents (used in xattr tree example above) consumed by a da
> > btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> > billion) extents.
> > 
> > Hence the actual log space consumed for logging bmbt blocks is limited by the
> > height of da btree.
> > 
> > My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> > 2^32 respectively, gave me the following results,
> > - For 1k block size, bmbt tree height increased by 3.
> > - For 4k block size, bmbt tree height increased by 2.
> > 
> > This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> > height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> > entries in the worst case.
> > 
> > For the attr fork bmbt, do you think the calculation should be changed to
> > consider the number of extents occupied by a dabtree holding > 100 million
> > xattrs?
> > 
> > The new increase in Bmbt height in turn causes the static reservation values
> > to increase. In the worst case, the maximum increase observed was 118k bytes
> > (4k block size, reflink=0, tr_rename).
> > 
> > The experiment was executed after applying "xfsprogs: Fix log reservation
> > calculation for xattr insert operation" patch
> > (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> > 
> > I am attaching the output of "xfs_db -c logres <dev>" executed on the
> > following configurations of the XFS filesystem.
> > - -b size=1k -m reflink=0
> > - -b size=1k -m rmapbt=1,reflink=1
> > - -b size=4k -m reflink=0
> > - -b size=4k -m rmapbt=1,reflink=1
> > - -b size=1k -m crc=0
> > - -b size=4k -m crc=0
> > 
> > I will go through the code which calculates the log reservations of the
> > entries which have a drastic increase in their values.
> > 
> 
> The highest increase (i.e. an increase of 118k) in log reservation was
> associated with the rename operation,
> 
> STATIC uint
> xfs_calc_rename_reservation(
>         struct xfs_mount        *mp)
> {
>         return XFS_DQUOT_LOGRES(mp) +
>                 max((xfs_calc_inode_res(mp, 4) +
>                      xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
>                                       XFS_FSB_TO_B(mp, 1))),
>                     (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
>                      xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
>                                       XFS_FSB_TO_B(mp, 1))));
> }
> 
> The first argument to max() contributes the highest value.
> 
> xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))
> 
> The inode reservation part is a constant.
> 
> The number of blocks computed by the second operand of the '+' operator is,
> 
> 2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))
> 
> = 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))
> 
> When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
> evaluates to,
> 
> 2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
> = 70 blocks
> 
> When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
> evaluates to,
> 
> 2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
> = 98 blocks
> 
> However, I don't see any extraneous space reserved by the above calculation
> that could be removed. Also, IMHO an increase by 118k is most likely not going
> to introduce any bugs. I will execute xfstests to make sure that no
> regressions get added.

(Did fstests pass?)

--D

> -- 
> chandan
> 
> 
> 


* Re: [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits
  2020-05-12 23:53                         ` Darrick J. Wong
@ 2020-05-13 12:19                           ` Chandan Rajendra
  0 siblings, 0 replies; 37+ messages in thread
From: Chandan Rajendra @ 2020-05-13 12:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, bfoster

On Wednesday, May 13, 2020 5:23 AM Darrick J. Wong wrote: 
> On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> > On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> > > On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > > > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > |     0 |                      1 |                       2 |
> > > > > > > |     1 |                      2 |                     250 |
> > > > > > > |     2 |                    250 |                   31250 |
> > > > > > > |     3 |                  31250 |                 3906250 |
> > > > > > > |     4 |                3906250 |               488281250 |
> > > > > > > |     5 |              488281250 |             61035156250 |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > 
> > > > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > > > This probably won't cause any regression.
> > > > > > 
> > > > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > > > attr fork extent count makes no difference to the attribute fork
> > > > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > > > dabtree structure limits, not the maximum extent count the fork can
> > > > > > hold.
> > > > > 
> > > > > I think the dabtree structure limits is because of the following ...
> > > > > 
> > > > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > > > - name len = 16 bytes
> > > > >          struct xfs_parent_name_rec {
> > > > >                __be64  p_ino;
> > > > >                __be32  p_gen;
> > > > >                __be32  p_diroffset;
> > > > >        };
> > > > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > > > - Value len = file name length = Assume ~40 bytes
> > > > 
> > > > That's quite long for a file name, but lets run with it...
> > > > 
> > > > > - Formula for number of node entries (used in column 3 in the table given
> > > > >   below) at any level of the dabtree,
> > > > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > > > >   xfs_da_node_entry))
> > > > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > > > - Formula for number of leaf entries (used in column 4 in the table given
> > > > >   below),
> > > > >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > > > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> > > > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > > > 
> > > > > Here I have assumed block size to be 4k.
> > > > > 
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > 
> > > > I'm not sure what this table actually represents.
> > > > 
> > > > > 
> > > > > Hence we would need a tree of height 3.
> > > > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > > > 
> > > > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > > > smell test.
> > > > 
> > > > I think you are trying to do these calculations from the wrong
> > > > direction.
> > > 
> > > You are right. Btrees grow in height by adding a new root
> > > node. Hence the btree space usage should be calculated in bottom-to-top
> > > direction.
> > > 
> > > > Calculate the number of leaf blocks needed to hold the
> > > > xattr data first, then work out the height of the pointer tree from
> > > > that. e.g:
> > > > 
> > > > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > > > hold them all:
> > > > 
> > > > blocks	= 100m / entries per leaf
> > > > 	= 100m / 61
> > > > 	= 1.64m
> > > > 
> > > > and if we assume 37% for the least populated (because magic
> > > > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > > > in 4k blocks.
> > > > 
> > > > That makes a lot more sense. Now the tree itself:
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > > > ptrs per node ^ (N-2) = 10k / 500 = 200
> > > > ptrs per node ^ (N-3) = 200 / 500 = 1
> > > > 
> > > > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > > > and the pointer tree requires ~12000 blocks which is noise compared
> > > > to the number of leaf blocks...
> > > > 
> > > > As for the bmbt, we've got ~5m extents worst case, which is
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > > > ptrs per node ^ (N-2) = 40k / 125 = 320
> > > > ptrs per node ^ (N-3) = 320 / 125 = 3
> > > > 
> > > > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > > > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > > > level tree, but 100m xattrs is the extreme case we are talking about
> > > > here...
> > > > 
> > > > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > > > byte names, and the numbers aren't much different to a worst case
> > > > xattr tree of this shape. You'll see the reason for the dabtree
> > > > height being limited to 5, and that neither the directory structure
> > > > nor the xattr structure is anywhere near the 2^32 bit extent count
> > > > limit...
> > > 
> > > Directory segment size is 32 GB
> > >   - Number of directory entries required for indexing 32GiB.
> > >     - 32GiB is divided into 4k data blocks. 
> > >     - Number of 4k blocks = 32GB / 4k = 8M
> > >     - Each 4k data block has,
> > >       - struct xfs_dir3_data_hdr = 64 bytes
> > >       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
> > >                                    = 52 bytes
> > >       - Number of 'struct xfs_dir2_data_entry' in a 4k block
> > >         (4096 - 64) / 52 = 78
> > >     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
> > >       8m * 78 = 654m
> > >   - Contents of a single dabtree leaf
> > >     - struct xfs_dir3_leaf_hdr = 64 bytes
> > >     - struct xfs_dir2_leaf_entry = 8 bytes
> > >     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
> > >     - 37% of 504 = 186 entries
> > >   - Contents of a single dabtree node
> > >     - struct xfs_da3_node_hdr = 64 bytes
> > >     - struct xfs_da_node_entry = 8 bytes
> > >     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
> > >   - Nr leaves
> > >     Level (N) = 654m / 186 = 3m leaves
> > >     Level (N-1) = 3m / 504 = 6k
> > >     Level (N-2) = 6k / 504 = 12
> > >     Level (N-3) = 12 / 504 = 1
> > >     Dabtree having 4 levels is sufficient.
> > > 
> > > Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> > > directory segment containing directory entries with even shorter names.
> > > 
> > > Even with 5m extents (used in xattr tree example above) consumed by a da
> > > btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> > > billion) extents.
> > > 
> > > Hence the actual log space consumed for logging bmbt blocks is limited by the
> > > height of da btree.
> > > 
> > > My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> > > 2^32 respectively, gave me the following results,
> > > - For 1k block size, bmbt tree height increased by 3.
> > > - For 4k block size, bmbt tree height increased by 2.
> > > 
> > > This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> > > height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> > > entries in the worst case.
> > > 
> > > For the attr fork bmbt, do you think the calculation should be changed to
> > > consider the number of extents occupied by a dabtree holding > 100 million
> > > xattrs?
> > > 
> > > The new increase in Bmbt height in turn causes the static reservation values
> > > to increase. In the worst case, the maximum increase observed was 118k bytes
> > > (4k block size, reflink=0, tr_rename).
> > > 
> > > The experiment was executed after applying "xfsprogs: Fix log reservation
> > > calculation for xattr insert operation" patch
> > > (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> > > 
> > > I am attaching the output of "xfs_db -c logres <dev>" executed on the
> > > following configurations of the XFS filesystem.
> > > - -b size=1k -m reflink=0
> > > - -b size=1k -m rmapbt=1,reflink=1
> > > - -b size=4k -m reflink=0
> > > - -b size=4k -m rmapbt=1,reflink=1
> > > - -b size=1k -m crc=0
> > > - -b size=4k -m crc=0
> > > 
> > > I will go through the code which calculates the log reservations of the
> > > entries which have a drastic increase in their values.
> > > 
> > 
> > The highest increase (i.e. an increase of 118k) in log reservation was
> > associated with the rename operation,
> > 
> > STATIC uint
> > xfs_calc_rename_reservation(
> >         struct xfs_mount        *mp)
> > {
> >         return XFS_DQUOT_LOGRES(mp) +
> >                 max((xfs_calc_inode_res(mp, 4) +
> >                      xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
> >                                       XFS_FSB_TO_B(mp, 1))),
> >                     (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
> >                      xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
> >                                       XFS_FSB_TO_B(mp, 1))));
> > }
> > 
> > The first argument to max() contributes the highest value.
> > 
> > xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))
> > 
> > The inode reservation part is a constant.
> > 
> > The number of blocks computed by the second operand of the '+' operator is,
> > 
> > 2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))
> > 
> > = 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))
> > 
> > When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
> > = 70 blocks
> > 
> > When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
> > = 98 blocks
> > 
> > However, I don't see any extraneous space reserved by the above calculation
> > that could be removed. Also, IMHO an increase by 118k is most likely not going
> > to introduce any bugs. I will execute xfstests to make sure that no
> > regressions get added.
> 
> (Did fstests pass?)
>

On Wednesday, May 13, 2020 5:23:22 AM IST you wrote:
> On Fri, May 01, 2020 at 12:38:30PM +0530, Chandan Rajendra wrote:
> > On Wednesday, April 29, 2020 9:05 PM Chandan Rajendra wrote: 
> > > On Monday, April 27, 2020 3:38 AM Dave Chinner wrote: 
> > > > On Sat, Apr 25, 2020 at 05:37:39PM +0530, Chandan Rajendra wrote:
> > > > > On Thursday, April 23, 2020 4:00 AM Dave Chinner wrote: 
> > > > > > On Wed, Apr 22, 2020 at 03:08:00PM +0530, Chandan Rajendra wrote:
> > > > > > > Attr bmbt tree height (MINABTPTRS == 2)
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > | Level | Number of nodes/leaves |           Total Nr recs |
> > > > > > > |       |                        | (nr nodes/leaves * 125) |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > |     0 |                      1 |                       2 |
> > > > > > > |     1 |                      2 |                     250 |
> > > > > > > |     2 |                    250 |                   31250 |
> > > > > > > |     3 |                  31250 |                 3906250 |
> > > > > > > |     4 |                3906250 |               488281250 |
> > > > > > > |     5 |              488281250 |             61035156250 |
> > > > > > > |-------+------------------------+-------------------------|
> > > > > > > 
> > > > > > > For xattr extents, (2 ** 32) - 1 = 4294967295 (~ 4 billion extents). So this
> > > > > > > will cause the corresponding bmbt's maximum height to go from 3 to 5.
> > > > > > > This probably won't cause any regression.
> > > > > > 
> > > > > > We already have the XFS_DA_NODE_MAXDEPTH set to 5, so changing the
> > > > > > attr fork extent count makes no difference to the attribute fork
> > > > > > bmbt reservations. i.e. the bmbt reservations are defined by the
> > > > > > dabtree structure limits, not the maximum extent count the fork can
> > > > > > hold.
> > > > > 
> > > > > I think the dabtree structure limits is because of the following ...
> > > > > 
> > > > > How many levels of dabtree would be needed to hold ~100 million xattrs?
> > > > > - name len = 16 bytes
> > > > >          struct xfs_parent_name_rec {
> > > > >                __be64  p_ino;
> > > > >                __be32  p_gen;
> > > > >                __be32  p_diroffset;
> > > > >        };
> > > > >   i.e. 64 + 32 + 32 = 128 bits = 16 bytes;
> > > > > - Value len = file name length = Assume ~40 bytes
> > > > 
> > > > That's quite long for a file name, but lets run with it...
> > > > 
> > > > > - Formula for number of node entries (used in column 3 in the table given
> > > > >   below) at any level of the dabtree,
> > > > >   nr_blocks * ((block size - sizeof(struct xfs_da3_node_hdr)) / sizeof(struct
> > > > >   xfs_da_node_entry))
> > > > >   i.e. nr_blocks * ((block size - 64) / 8)
> > > > > - Formula for number of leaf entries (used in column 4 in the table given
> > > > >   below),
> > > > >   (block size - sizeof(xfs_attr_leaf_hdr_t)) /
> > > > >   (sizeof(xfs_attr_leaf_entry_t) + valuelen + namelen + nameval)
> > > > >   i.e. nr_blocks * ((block size - 32) / (8 + 2 + 1 + 16 + 40))
> > > > > 
> > > > > Here I have assumed block size to be 4k.
> > > > > 
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > | Level | Number of blocks | Number of entries (node) | Number of entries (leaf) |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > > |     0 |              1.0 |                      5e2 |                    6.1e1 |
> > > > > |     1 |              5e2 |                    2.5e5 |                    3.0e4 |
> > > > > |     2 |            2.5e5 |                    1.3e8 |                    1.5e7 |
> > > > > |     3 |            1.3e8 |                   6.6e10 |                    7.9e9 |
> > > > > |-------+------------------+--------------------------+--------------------------|
> > > > 
> > > > I'm not sure what this table actually represents.
> > > > 
> > > > > 
> > > > > Hence we would need a tree of height 3.
> > > > > Total number of blocks = 1 + 5e2 + 2.5e5 + 1.3e8 = ~1.3e8
> > > > 
> > > > 130 million blocks to hold 100 million xattrs? That doesn't pass the
> > > > smell test.
> > > > 
> > > > I think you are trying to do these calculations from the wrong
> > > > direction.
> > > 
> > > You are right. Btrees grow in height by adding a new root
> > > node. Hence the btree space usage should be calculated in bottom-to-top
> > > direction.
> > > 
> > > > Calculate the number of leaf blocks needed to hold the
> > > > xattr data first, then work out the height of the pointer tree from
> > > > that. e.g:
> > > > 
> > > > If we need 100m xattrs, we need this many 100% full 4k blocks to
> > > > hold them all:
> > > > 
> > > > blocks	= 100m / entries per leaf
> > > > 	= 100m / 61
> > > > 	= 1.64m
> > > > 
> > > > and if we assume 37% for the least populated (because magic
> > > > split/merge number), multiply by 3, so blocks ~= 5m for 100m xattrs
> > > > in 4k blocks.
> > > > 
> > > > That makes a lot more sense. Now the tree itself:
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 500 = 10k
> > > > ptrs per node ^ (N-2) = 10k / 500 = 200
> > > > ptrs per node ^ (N-3) = 200 / 500 = 1
> > > > 
> > > > So, N-3 = level 0, so we've got a tree of height 4 for 100m xattrs,
> > > > and the pointer tree requires ~12000 blocks which is noise compared
> > > > to the number of leaf blocks...
> > > > 
> > > > As for the bmbt, we've got ~5m extents worst case, which is
> > > > 
> > > > ptrs per node ^ N = 5m
> > > > ptrs per node ^ (N-1) = 5m / 125 = 40k
> > > > ptrs per node ^ (N-2) = 40k / 125 = 320
> > > > ptrs per node ^ (N-3) = 320 / 125 = 3
> > > > 
> > > > As 3 bmbt records should fit in the inode fork, we'd only need a 4
> > > > level bmbt tree to hold this, too. It's at the lower limit of a 4
> > > > level tree, but 100m xattrs is the extreme case we are talking about
> > > > here...
> > > > 
> > > > FWIW, repeat this with a directory data segment size of 32GB w/ 40
> > > > byte names, and the numbers aren't much different to a worst case
> > > > xattr tree of this shape. You'll see the reason for the dabtree
> > > > height being limited to 5, and that neither the directory structure
> > > > nor the xattr structure is anywhere near the 2^32 bit extent count
> > > > limit...
> > > 
> > > Directory segment size is 32 GB
> > >   - Number of directory entries required for indexing 32GiB.
> > >     - 32GiB is divided into 4k data blocks. 
> > >     - Number of 4k blocks = 32GB / 4k = 8M
> > >     - Each 4k data block has,
> > >       - struct xfs_dir3_data_hdr = 64 bytes
> > >       - struct xfs_dir2_data_entry = 12 bytes (metadata) + 40 bytes (name)
> > >                                    = 52 bytes
> > >       - Number of 'struct xfs_dir2_data_entry' in a 4k block
> > >         (4096 - 64) / 52 = 78
> > >     - Number of 'struct xfs_dir2_data_entry' in 32-GiB space
> > >       8m * 78 = 654m
> > >   - Contents of a single dabtree leaf
> > >     - struct xfs_dir3_leaf_hdr = 64 bytes
> > >     - struct xfs_dir2_leaf_entry = 8 bytes
> > >     - Number of 'struct xfs_dir2_leaf_entry' = (4096 - 64) / 8 = 504
> > >     - 37% of 504 = 186 entries
> > >   - Contents of a single dabtree node
> > >     - struct xfs_da3_node_hdr = 64 bytes
> > >     - struct xfs_da_node_entry = 8 bytes
> > >     - Number of 'struct xfs_da_node_entry' = (4096 - 64) / 8 = 504
> > >   - Nr leaves
> > >     Level (N) = 654m / 186 = 3m leaves
> > >     Level (N-1) = 3m / 504 = 6k
> > >     Level (N-2) = 6k / 504 = 12
> > >     Level (N-3) = 12 / 504 = 1
> > >     Dabtree having 4 levels is sufficient.
> > > 
> > > Hence a dabtree with 5 levels should be more than enough to index a 32GiB
> > > directory segment containing directory entries with even shorter names.
> > > 
> > > Even with 5m extents (used in xattr tree example above) consumed by a da
> > > btree, this is still much less than the limit imposed by 2^32 (i.e. ~4
> > > billion) extents.
> > > 
> > > Hence the actual log space consumed for logging bmbt blocks is limited by the
> > > height of da btree.
> > > 
> > > My experiment with changing the values of MAXEXTNUM and MAXAEXTNUM to 2^47 and
> > > 2^32 respectively, gave me the following results,
> > > - For 1k block size, bmbt tree height increased by 3.
> > > - For 4k block size, bmbt tree height increased by 2.
> > > 
> > > This happens because xfs_bmap_compute_maxlevels() calculates the BMBT tree
> > > height by assuming that there will be MAXEXTNUM/MAXAEXTNUM worth of leaf
> > > entries in the worst case.
> > > 
> > > For the attr fork bmbt, do you think the calculation should be changed to
> > > consider the number of extents occupied by a dabtree holding > 100 million
> > > xattrs?
> > > 
> > > The new increase in Bmbt height in turn causes the static reservation values
> > > to increase. In the worst case, the maximum increase observed was 118k bytes
> > > (4k block size, reflink=0, tr_rename).
> > > 
> > > The experiment was executed after applying "xfsprogs: Fix log reservation
> > > calculation for xattr insert operation" patch
> > > (https://lore.kernel.org/linux-xfs/20200404085229.2034-2-chandanrlinux@gmail.com/)
> > > 
> > > I am attaching the output of "xfs_db -c logres <dev>" executed on the
> > > following configurations of the XFS filesystem.
> > > - -b size=1k -m reflink=0
> > > - -b size=1k -m rmapbt=1,reflink=1
> > > - -b size=4k -m reflink=0
> > > - -b size=4k -m rmapbt=1,reflink=1
> > > - -b size=1k -m crc=0
> > > - -b size=4k -m crc=0
> > > 
> > > I will go through the code which calculates the log reservations of the
> > > entries which have a drastic increase in their values.
> > > 
> > 
> > The highest increase (i.e. an increase of 118k) in log reservation was
> > associated with the rename operation,
> > 
> > STATIC uint
> > xfs_calc_rename_reservation(
> >         struct xfs_mount        *mp)
> > {
> >         return XFS_DQUOT_LOGRES(mp) +
> >                 max((xfs_calc_inode_res(mp, 4) +
> >                      xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),
> >                                       XFS_FSB_TO_B(mp, 1))),
> >                     (xfs_calc_buf_res(7, mp->m_sb.sb_sectsize) +
> >                      xfs_calc_buf_res(xfs_allocfree_log_count(mp, 3),
> >                                       XFS_FSB_TO_B(mp, 1))));
> > }
> > 
> > The first argument to max() contributes the highest value.
> > 
> > xfs_calc_inode_res(mp, 4) + xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp),XFS_FSB_TO_B(mp, 1))
> > 
> > The inode reservation part is a constant.
> > 
> > The number of blocks computed by the second operand of the '+' operator is,
> > 
> > 2 * ((XFS_DA_NODE_MAXDEPTH + 2) + ((XFS_DA_NODE_MAXDEPTH + 2) * (bmbt_height - 1)))
> > 
> > = 2 * ((5 + 2) + ((5 + 2) * (bmbt_height - 1)))
> > 
> > When bmbt height is 5 (i.e. when using the original 2^31 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (5 - 1)))
> > = 70 blocks
> > 
> > When bmbt height is 7 (i.e. when using the new 2^47 extent count limit) this
> > evaluates to,
> > 
> > 2 * ((5 + 2) + ((5 + 2) * (7 - 1)))
> > = 98 blocks
> > 
> > However, I don't see any extraneous space reserved by the above calculation
> > that could be removed. Also, IMHO an increase by 118k is most likely not going
> > to introduce any bugs. I will execute xfstests to make sure that no
> > regressions get added.
> 
> (Did fstests pass?)

I had executed fstests with the following 5 mkfs configurations:
1. -m crc=0 -b size=1k
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

The only test that regressed was xfs/306.  It failed when using "-m
rmapbt=1,reflink=1 -b size=1k" mkfs configuration.

The changes were made only to the kernel and I had used upstream xfsprogs since
the newer kernel is supposed to mount older filesystems as well.

The dmesg log had the following,

[  702.273340] XFS (loop0): Mounting V5 Filesystem
[  702.275511] XFS (loop0): Log size 8906 blocks too small, minimum size is 9075 blocks
[  702.277764] XFS (loop0): AAIEEE! Log failed size checks. Abort!
[  702.279615] XFS: Assertion failed: 0, file: fs/xfs/xfs_log.c, line: 711
[  702.283679] ------------[ cut here ]------------
[  702.285170] WARNING: CPU: 0 PID: 12821 at fs/xfs/xfs_message.c:112 assfail+0x25/0x28
[  702.287651] Modules linked in:
[  702.288654] CPU: 0 PID: 12821 Comm: mount Tainted: G        W         5.6.0-rc6-next-20200320-chandan-00003-g071c2af3f4de #1
[  702.291995] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  702.294159] RIP: 0010:assfail+0x25/0x28
[  702.295176] Code: ff ff 0f 0b c3 0f 1f 44 00 00 41 89 c8 48 89 d1 48 89 f2 48 c7 c6 40 b7 4b b3 e8 82 f9 ff ff 80 3d 83 d6 64 01 00 74 02 0f $
[  702.300079] RSP: 0018:ffffb05b414cbd78 EFLAGS: 00010246
[  702.301463] RAX: 0000000000000000 RBX: ffff9d9d501d5000 RCX: 0000000000000000
[  702.303293] RDX: 00000000ffffffc0 RSI: 000000000000000a RDI: ffffffffb346dc65
[  702.304976] RBP: ffff9da444b49a80 R08: 0000000000000000 R09: 0000000000000000
[  702.306747] R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffea
[  702.308417] R13: 000000000000000e R14: 0000000000004594 R15: ffff9d9d501d5628
[  702.310138] FS:  00007fd6c5d17c80(0000) GS:ffff9da44d800000(0000) knlGS:0000000000000000
[  702.312078] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  702.313421] CR2: 0000000000000002 CR3: 00000008a48c0000 CR4: 00000000000006f0
[  702.315210] Call Trace:
[  702.315807]  xfs_log_mount+0xf8/0x300
[  702.316741]  xfs_mountfs+0x46e/0x950
[  702.317640]  xfs_fc_fill_super+0x318/0x510
[  702.318739]  ? xfs_mount_free+0x30/0x30
[  702.319669]  get_tree_bdev+0x15c/0x250
[  702.320579]  vfs_get_tree+0x25/0xb0
[  702.321417]  do_mount+0x740/0x9b0
[  702.322220]  ? memdup_user+0x41/0x80
[  702.323135]  __x64_sys_mount+0x8e/0xd0
[  702.324033]  do_syscall_64+0x48/0x110
[  702.324918]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  702.326133] RIP: 0033:0x7fd6c5f2ccda
[  702.327105] Code: 48 8b 0d b9 e1 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 49 89 ca b8 a5 00 00 00 0f $
[  702.331596] RSP: 002b:00007ffe00dfb9f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[  702.333430] RAX: ffffffffffffffda RBX: 0000560c1aaa92c0 RCX: 00007fd6c5f2ccda
[  702.335146] RDX: 0000560c1aaae110 RSI: 0000560c1aaad040 RDI: 0000560c1aaa94d0
[  702.336843] RBP: 00007fd6c607d204 R08: 0000000000000000 R09: 0000560c1aaadde0
[  702.338618] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  702.340314] R13: 0000000000000000 R14: 0000560c1aaa94d0 R15: 0000560c1aaae110
[  702.342039] ---[ end trace 6436391b468bc652 ]---
[  702.343308] XFS (loop0): log mount failed

xfs/306 has,

_scratch_mkfs_xfs -d size=20m -n size=64k >> $seqres.full 2>&1

i.e. it creates a filesystem of size 20MiB, data block size of 1KiB and
directory block size of 64KiB. Filesystems of size < 1GiB can have a log
smaller than 10MiB (please refer to calculate_log_size() in xfsprogs).

The highest reservation space was used by tr_rename. The calculation is done
by xfs_calc_rename_reservation(). In this case, the value returned by this
function was accounted for by

xfs_calc_inode_res(mp, 4)
+ xfs_calc_buf_res(2 * XFS_DIROP_LOG_COUNT(mp), XFS_FSB_TO_B(mp, 1))

xfs_calc_inode_res(mp, 4) returns a constant value (i.e. 3040).

The largest contribution to the value returned by the above
calculation was by 2 * XFS_DIROP_LOG_COUNT(mp).

XFS_DIROP_LOG_COUNT() is the sum of
1. The maximum number of dabtree blocks that need to be logged
   i.e. XFS_DAENTER_BLOCKS() = XFS_DAENTER_1B(mp,w) * XFS_DAENTER_DBS(mp,w).
   For directories, this evaluates to (64 * (XFS_DA_NODE_MAXDEPTH + 2)) = (64
   * (5 + 2)) = 448.
   NOTE: I still don't know why we add the "2" to XFS_DA_NODE_MAXDEPTH in the
   above calculation.
2. The corresponding maximum number of bmap btree blocks that need to be
   logged i.e. XFS_DAENTER_BMAPS() = XFS_DAENTER_DBS(mp,w) *
   XFS_DAENTER_BMAP1B(mp,w)

   XFS_DAENTER_DBS(mp,w) = XFS_DA_NODE_MAXDEPTH + 2 = 7
   XFS_DAENTER_BMAP1B(mp,w)
   = XFS_NEXTENTADD_SPACE_RES(mp, XFS_DAENTER_1B(mp, w), w)
   = XFS_NEXTENTADD_SPACE_RES(mp, 64, w)
   = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)

   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK() = (((mp)->m_alloc_mxr[0]) -
   ((mp)->m_alloc_mnr[0])) = 121 - 60 = 61

   XFS_DAENTER_BMAP1B(mp,w) = ((64 + XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp) - 1) /
   XFS_MAX_CONTIG_EXTENTS_PER_BLOCK(mp)) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = ((64 + 61 - 1) / 61) * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * XFS_EXTENTADD_SPACE_RES(mp, w)
   = 2 * (XFS_BM_MAXLEVELS(mp,w) - 1)
   = 2 * (8 - 1) ;; Notice that the height of the bmap btree has increased to 8.
   = 14

   With 2^32 as the maximum extent count the maximum height of the bmap btree
   was 7. Now with 2^47 maximum extent count the height is 8.

   Therefore, XFS_DAENTER_BMAPS() = 7 * 14 = 98.

Also, XFS_DIROP_LOG_COUNT() = 448 + 98 = 546.
2 * XFS_DIROP_LOG_COUNT() = 2 * 546 = 1092.
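
For anyone who wants to re-check the arithmetic above, here is a throwaway
Python sketch of it. The function and parameter names (dirop_log_count,
dir_fsbs, contig_per_block, bm_maxlevels) are made up for illustration; only
the macro names in the comments come from the kernel sources, and the inputs
are simply the numbers quoted above for the failing xfs/306 configuration, not
values read from a real filesystem.

```python
# Hypothetical helper mirroring the XFS_DIROP_LOG_COUNT() walk-through above.
XFS_DA_NODE_MAXDEPTH = 5


def dirop_log_count(dir_fsbs, contig_per_block, bm_maxlevels):
    """Return the number of fs blocks logged by one directory operation.

    dir_fsbs         -- fs blocks per directory block (XFS_DAENTER_1B)
    contig_per_block -- XFS_MAX_CONTIG_EXTENTS_PER_BLOCK
    bm_maxlevels     -- maximum bmap btree height (XFS_BM_MAXLEVELS)
    """
    daenter_dbs = XFS_DA_NODE_MAXDEPTH + 2        # XFS_DAENTER_DBS
    daenter_blocks = dir_fsbs * daenter_dbs       # XFS_DAENTER_BLOCKS
    extentadd = bm_maxlevels - 1                  # XFS_EXTENTADD_SPACE_RES
    # XFS_NEXTENTADD_SPACE_RES: round dir_fsbs up to whole groups of
    # contig_per_block extents, each group costing 'extentadd' blocks.
    bmap1b = ((dir_fsbs + contig_per_block - 1) // contig_per_block) * extentadd
    daenter_bmaps = daenter_dbs * bmap1b          # XFS_DAENTER_BMAPS
    return daenter_blocks + daenter_bmaps


count = dirop_log_count(dir_fsbs=64, contig_per_block=61, bm_maxlevels=8)
print(count, 2 * count)  # 546 1092
```

Only bm_maxlevels changes when the extent count limit is raised, so the sketch
also shows that each extra bmbt level costs 7 * 2 = 14 blocks per
XFS_DIROP_LOG_COUNT() in this configuration.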

With 2^32 max extent count, XFS_DIROP_LOG_COUNT() evaluates to 533. Hence 2 *
XFS_DIROP_LOG_COUNT() = 2 * 533 = 1066.

This small difference of 1092 - 1066 = 26 fs blocks is sufficient to trip us
over the minimum log size check.

I could not find a way to reduce the number of blocks that gets logged.

Hence I thought of the following alternate approach.

The maximum number of extents that can be occupied by a directory is ~2^27.
The following steps show this (I assumed an fs block size of 512 bytes,
since that is the block size which yields a bmap btree of the maximum
possible height).

Maximum number of extents in data space = 32GiB (i.e. XFS_DIR2_SPACE_SIZE) / 2^9 = 2^26.
Maximum number (theoretically) of extents in leaf space = 32GiB / 2^9 = 2^26.

Maximum number of entries in a free space index block
= (512 - sizeof(struct xfs_dir3_free_hdr)) / sizeof(xfs_dir2_data_off_t)
= (512 - 64) / 2 = 224
Maximum number of extents in free space index = (Maximum number of extents in
data segment) / 224 = (2^26) / 224 = ~2^18

Maximum number of extents in a directory = 2^26 + 2^26 + 2^18 = ~2^27
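
The same estimate as a small Python sketch, in case someone wants to plug in
other block sizes. The constant and variable names are hypothetical; the
inputs are the ones used above (32GiB per directory segment, 512 byte fs
blocks, a 64 byte free index header and 2 byte free index entries), and the
sketch assumes the worst case of one extent per fs block.

```python
import math

DIR_SPACE = 32 * 2**30   # XFS_DIR2_SPACE_SIZE: 32GiB per directory segment
FSB = 512                # assumed fs block size (worst case for bmbt height)
FREE_HDR = 64            # free index block header size, as used above
FREE_ENTRY = 2           # size of one xfs_dir2_data_off_t entry

# Worst case: every fs block of a segment is its own extent.
data_extents = DIR_SPACE // FSB
leaf_extents = DIR_SPACE // FSB

# Each free index block tracks free_per_block data blocks.
free_per_block = (FSB - FREE_HDR) // FREE_ENTRY
free_extents = math.ceil(data_extents / free_per_block)

total = data_extents + leaf_extents + free_extents
print(total)  # 134517322, i.e. just above 2^27
```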

Hence my idea was to have a new entry in xfs_mount->m_bm_maxlevels[]
array to hold the maximum height of a bmap btree belonging to a
directory and use that for calculating reservations associated with
directories.

Please let me know your opinion on this.

PS: I had started making the changes in the kernel and was planning to
test the changes before posting this idea on the mailing list.
   
-- 
chandan





end of thread, other threads:[~2020-05-13 12:16 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-04-04  8:52 [PATCH 0/2] Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-04  8:52 ` [PATCH 1/2] xfs: Fix log reservation calculation for xattr insert operation Chandan Rajendra
2020-04-06 15:25   ` Brian Foster
2020-04-06 22:57     ` Dave Chinner
2020-04-07  5:11       ` Chandan Rajendra
2020-04-07 12:59       ` Brian Foster
2020-04-07  0:49   ` Dave Chinner
2020-04-08  8:47     ` Chandan Rajendra
2020-04-04  8:52 ` [PATCH 2/2] xfs: Extend xattr extent counter to 32-bits Chandan Rajendra
2020-04-06 16:45   ` Brian Foster
2020-04-08 12:40     ` Chandan Rajendra
2020-04-06 17:06   ` Darrick J. Wong
2020-04-06 23:30     ` Dave Chinner
2020-04-08 12:43       ` Chandan Rajendra
2020-04-08 15:38         ` Darrick J. Wong
2020-04-08 22:43         ` Dave Chinner
2020-04-08 15:45       ` Darrick J. Wong
2020-04-08 22:45         ` Dave Chinner
2020-04-08 12:42     ` Chandan Rajendra
2020-04-07  1:20   ` Dave Chinner
2020-04-08 12:45     ` Chandan Rajendra
2020-04-10  7:46     ` Chandan Rajendra
2020-04-12  6:34       ` Chandan Rajendra
2020-04-13 18:55         ` Darrick J. Wong
2020-04-20  4:38           ` Chandan Rajendra
2020-04-22  9:38             ` Chandan Rajendra
2020-04-22 22:30               ` Dave Chinner
2020-04-25 12:07                 ` Chandan Rajendra
2020-04-26 22:08                   ` Dave Chinner
2020-04-29 15:35                     ` Chandan Rajendra
2020-05-01  7:08                       ` Chandan Rajendra
2020-05-12 23:53                         ` Darrick J. Wong
2020-05-13 12:19                           ` Chandan Rajendra
2020-04-22 22:51               ` Darrick J. Wong
2020-04-27  7:42     ` Christoph Hellwig
2020-04-27  7:39   ` Christoph Hellwig
2020-04-30  2:29     ` Chandan Rajendra
