linux-xfs.vger.kernel.org archive mirror
* [PATCH V3 00/12] xfs: Extend per-inode extent counters
@ 2021-09-16 10:06 Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
                   ` (12 more replies)
  0 siblings, 13 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

The commit "xfs: fix inode fork extent count overflow"
(3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that it should be
possible to create 10 billion data fork extents. However, the
corresponding on-disk field has a signed 32-bit type. Hence this
patchset extends the per-inode data extent counter to 64 bits, of
which 48 bits are used to store the extent count.

Also, XFS has an attr fork extent counter which is 16 bits wide. A
workload which performs the following steps causes the xattr extent
counter to overflow:
1. Creates 1 million 255-byte sized xattrs.
2. Deletes 50% of these xattrs in an alternating manner.
3. Tries to insert 400,000 new 255-byte sized xattrs.

Dave tells me that there are instances where a single file has more
than 100 million hardlinks. With parent pointers being stored in
xattrs, we will overflow the signed 16-bit wide xattr extent counter
when a large number of hardlinks are created. Hence this patchset
extends the on-disk field to 32 bits.
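
The counter widths discussed above can be checked with simple
arithmetic. The following is only a userspace sketch, not code from
the patchset: the helper names max_count_for_bits() and
extent_count_fits() are hypothetical and exist purely for
illustration. A signed 32-bit field has 31 usable bits and a signed
16-bit field has 15.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: largest count that fits in `bits` usable bits. */
static uint64_t max_count_for_bits(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);

	if (bits == 64)
		return UINT64_MAX;

	return (UINT64_C(1) << bits) - 1;
}

/* Hypothetical helper: non-zero if `wanted` extents fit in `bits` bits. */
static int extent_count_fits(uint64_t wanted, unsigned int bits)
{
	return wanted <= max_count_for_bits(bits);
}
```

For example, extent_count_fits(10000000000ULL, 31) is false while
extent_count_fits(10000000000ULL, 48) is true, which is the gap the
data fork change closes. Similarly, max_count_for_bits(15) is 32767,
the ceiling of the existing xattr extent counter.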

The following changes are made to accomplish this:
1. A new incompat superblock flag is added to prevent older kernels from
   mounting the filesystem. This flag has to be set at mkfs time.
2. A new 64-bit inode field is created to hold the data extent
   counter.
3. The existing 32-bit inode data extent counter will be used to hold
   the attr fork extent counter.

The patchset has been tested by executing xfstests with the following
mkfs.xfs options,
1. -m crc=0 -b size=1k
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

Each of the above test scenarios was executed on the following
combinations (for the V4 FS test scenarios, the last combination,
i.e. "Patched (enable extcnt64bit)", was omitted).
|-------------------------------+-----------|
| Xfsprogs                      | Kernel    |
|-------------------------------+-----------|
| Unpatched                     | Patched   |
| Patched (disable extcnt64bit) | Unpatched |
| Patched (disable extcnt64bit) | Patched   |
| Patched (enable extcnt64bit)  | Patched   |
|-------------------------------+-----------|

I have also written a test (yet to be converted into xfstests format)
to check if the correct extent counter fields are updated with/without
the new incompat flag. I have also fixed some of the existing fstests
to work with the new extent counter fields.

Increasing the data extent counter width also increases the maximum
height of the BMBT. This requires that the macro XFS_BTREE_MAXLEVELS be
updated with a larger value. However, such a change increases the value
of mp->m_rmap_maxlevels, which in turn increases log reservation sizes.
As a result, a modified XFS driver would fail to mount filesystems
created by older versions of mkfs.xfs.

Hence this patchset is built on top of Darrick's btree-dynamic-depth
branch which removes the macro XFS_BTREE_MAXLEVELS and computes
mp->m_rmap_maxlevels based on the size of an AG.
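
To illustrate why a wider extent counter makes the BMBT taller, here
is a userspace sketch modelled on the level-counting loop in
xfs_bmap_compute_maxlevels(). It is an approximation for illustration
only: the function name bmbt_max_levels() is hypothetical, and the
record counts used in the example below (3 root records, 125 records
per leaf and per node block) are made-up values, not real filesystem
geometry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: worst-case btree height for `maxleafents`
 * extents, assuming every block holds only its minimum record count.
 */
static unsigned int bmbt_max_levels(uint64_t maxleafents,
		unsigned int maxrootrecs, unsigned int minleafrecs,
		unsigned int minnoderecs)
{
	uint64_t	maxblocks;
	unsigned int	level;

	assert(maxrootrecs >= 1 && minleafrecs >= 1 && minnoderecs >= 2);

	/* Leaf blocks needed when each leaf is minimally filled. */
	maxblocks = (maxleafents + minleafrecs - 1) / minleafrecs;

	for (level = 1; maxblocks > 1; level++) {
		if (maxblocks <= maxrootrecs)
			maxblocks = 1;	/* pointers fit in the inode root */
		else
			maxblocks = (maxblocks + minnoderecs - 1) /
					minnoderecs;
	}

	return level;
}
```

With these illustrative parameters,
bmbt_max_levels((1ULL << 31) - 1, 3, 125, 125) evaluates to 5 while
bmbt_max_levels((1ULL << 48) - 1, 3, 125, 125) evaluates to 7. The
exact numbers depend on block size and inode geometry, but the growth
in height is what forces any fixed maximum-levels constant upwards.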

These patches can also be obtained from
https://github.com/chandanr/linux.git at branch
xfs-incompat-extend-extcnt-v3.

I will be posting the changes associated with xfsprogs separately.

Changelog:
V2 -> V3:
1. Define maximum extent length as a function of
   BMBT_BLOCKCOUNT_BITLEN.
2. Introduce xfs_iext_max_nextents() function in the patch series
   before renaming MAXEXTNUM/MAXAEXTNUM. This is done to reduce
   proliferation of macros indicating maximum extent count for data
   and attribute forks.
3. Define xfs_dfork_nextents() as an inline function.
4. Use xfs_rfsblock_t as the data type for variables that hold block
   count.
5. xfs_dfork_nextents() now returns -EFSCORRUPTED when an invalid fork
   is passed as an argument.
6. The following changes are done to enable bulkstat ioctl to report
   64-bit extent counters,
   - Carve out a new 64-bit field xfs_bulkstat->bs_extents64 from
     xfs_bulkstat->bs_pad[]. 
   - Carve out a new 64-bit field xfs_bulk_ireq->bulkstat_flags from
     xfs_bulk_ireq->reserved[] to hold bulkstat specific operational
     flags. Introduce XFS_IBULK_NREXT64 flag to indicate that
     userspace has the necessary infrastructure to receive 64-bit
     extent counters.
   - Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to
     indicate that xfs_bulk_ireq->bulkstat_flags has valid flags set.
7. Rename the incompat flag from XFS_SB_FEAT_INCOMPAT_EXTCOUNT_64BIT
   to XFS_SB_FEAT_INCOMPAT_NREXT64.
8. Add a new helper function xfs_inode_to_disk_iext_counters() to
   convert from incore inode extent counters to ondisk inode extent
   counters.
9. Reuse XFS_ERRTAG_REDUCE_MAX_IEXTENTS error tag to skip reporting
   inodes with more than 10 extents when bulkstat ioctl is invoked by
   userspace.
10. Introduce the new per-inode XFS_DIFLAG2_NREXT64 flag to indicate
    that the inode uses 64-bit extent counter. This is used to allow
    administrators to upgrade existing filesystems.
11. Export presence of XFS_SB_FEAT_INCOMPAT_NREXT64 feature to
    userspace via XFS_IOC_FSGEOMETRY ioctl.

V1 -> V2:
1. Rebase patches on top of Darrick's btree-dynamic-depth branch.
2. Add new bulkstat ioctl version to support 64-bit data fork extent
   counter field.
3. Introduce new error tag to verify if the old bulkstat ioctls skip
   reporting inodes with large data fork extent counters.

Chandan Babu R (12):
  xfs: Move extent count limits to xfs_format.h
  xfs: Introduce xfs_iext_max_nextents() helper
  xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32,
    XFS_IFORK_EXTCNT_MAXS16
  xfs: Use xfs_extnum_t instead of basic data types
  xfs: Introduce xfs_dfork_nextents() helper
  xfs: xfs_dfork_nextents: Return extent count via an out argument
  xfs: Rename inode's extent counter fields based on their width
  xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits
    respectively
  xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  xfs: Extend per-inode extent counter widths
  xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL
  xfs: Define max extent length based on on-disk format definition

 fs/xfs/libxfs/xfs_bmap.c        | 80 ++++++++++++++-------------
 fs/xfs/libxfs/xfs_format.h      | 80 +++++++++++++++++++++++----
 fs/xfs/libxfs/xfs_fs.h          | 20 +++++--
 fs/xfs/libxfs/xfs_ialloc.c      |  2 +
 fs/xfs/libxfs/xfs_inode_buf.c   | 61 ++++++++++++++++-----
 fs/xfs/libxfs/xfs_inode_fork.c  | 32 +++++++----
 fs/xfs/libxfs/xfs_inode_fork.h  | 23 +++++++-
 fs/xfs/libxfs/xfs_log_format.h  |  7 +--
 fs/xfs/libxfs/xfs_rtbitmap.c    |  4 +-
 fs/xfs/libxfs/xfs_sb.c          |  4 ++
 fs/xfs/libxfs/xfs_swapext.c     |  6 +--
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++
 fs/xfs/libxfs/xfs_trans_resv.c  | 10 ++--
 fs/xfs/libxfs/xfs_types.h       | 11 +---
 fs/xfs/scrub/attr_repair.c      |  2 +-
 fs/xfs/scrub/bmap.c             |  2 +-
 fs/xfs/scrub/bmap_repair.c      |  2 +-
 fs/xfs/scrub/inode.c            | 96 ++++++++++++++++++++-------------
 fs/xfs/scrub/inode_repair.c     | 71 +++++++++++++++++-------
 fs/xfs/scrub/repair.c           |  2 +-
 fs/xfs/scrub/trace.h            | 16 +++---
 fs/xfs/xfs_bmap_util.c          | 14 ++---
 fs/xfs/xfs_inode.c              |  4 +-
 fs/xfs/xfs_inode.h              |  5 ++
 fs/xfs/xfs_inode_item.c         | 21 +++++++-
 fs/xfs/xfs_inode_item_recover.c | 26 ++++++---
 fs/xfs/xfs_ioctl.c              |  7 +++
 fs/xfs/xfs_iomap.c              | 28 +++++-----
 fs/xfs/xfs_itable.c             | 25 ++++++++-
 fs/xfs/xfs_itable.h             |  2 +
 fs/xfs/xfs_iwalk.h              |  7 ++-
 fs/xfs/xfs_mount.h              |  2 +
 fs/xfs/xfs_trace.h              |  6 +--
 33 files changed, 478 insertions(+), 206 deletions(-)

-- 
2.30.2



* [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper Chandan Babu R
                   ` (11 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

Maximum values associated with extent counters, i.e. maximum extent length,
maximum data extents and maximum xattr extents, are dictated by the on-disk
format. Hence move these definitions over to xfs_format.h.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h | 7 +++++++
 fs/xfs/libxfs/xfs_types.h  | 7 -------
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 0bc5410491ac..bef1727bb182 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -872,6 +872,13 @@ enum xfs_dinode_fmt {
 	{ XFS_DINODE_FMT_BTREE,		"btree" }, \
 	{ XFS_DINODE_FMT_UUID,		"uuid" }
 
+/*
+ * Max values for extlen, extnum, aextnum.
+ */
+#define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
+#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
+#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
+
 /*
  * Inode minimum and maximum sizes.
  */
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index d0afc3d11e37..dbe5bb56f31f 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -56,13 +56,6 @@ typedef void *		xfs_failaddr_t;
 #define	NULLFSINO	((xfs_ino_t)-1)
 #define	NULLAGINO	((xfs_agino_t)-1)
 
-/*
- * Max values for extlen, extnum, aextnum.
- */
-#define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
-#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
-#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
-
 /*
  * Minimum and maximum blocksize and sectorsize.
  * The blocksize upper limit is pretty much arbitrary.
-- 
2.30.2



* [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16 Chandan Babu R
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

xfs_iext_max_nextents() returns the maximum number of extents possible for a
given fork (data, CoW or attribute). This helper will be extended further in a
future commit when the maximum extent counts associated with data/attribute
forks are increased.
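
The idea can be tried outside the kernel with the following sketch.
It is a simplified userspace model, not the kernel code: every
demo_* and DEMO_* name is hypothetical, the mount argument is dropped,
and demo_count_may_overflow() only imitates the shape of a caller such
as xfs_iext_count_may_overflow() without its CoW fork and error
injection handling.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's fork identifiers. */
enum demo_fork { DEMO_DATA_FORK, DEMO_ATTR_FORK, DEMO_COW_FORK };

/* Same limits as MAXEXTNUM and MAXAEXTNUM in the patch. */
#define DEMO_MAXEXTNUM	INT32_C(0x7fffffff)	/* signed 32-bit max */
#define DEMO_MAXAEXTNUM	INT32_C(0x7fff)		/* signed 16-bit max */

/* One place that answers "how many extents may this fork have?" */
static int32_t demo_iext_max_nextents(enum demo_fork whichfork)
{
	assert(whichfork == DEMO_DATA_FORK ||
	       whichfork == DEMO_ATTR_FORK ||
	       whichfork == DEMO_COW_FORK);

	if (whichfork == DEMO_DATA_FORK || whichfork == DEMO_COW_FORK)
		return DEMO_MAXEXTNUM;

	return DEMO_MAXAEXTNUM;
}

/* Simplified caller: would adding nr_to_add extents exceed the limit? */
static int demo_count_may_overflow(enum demo_fork whichfork,
		uint64_t cur, uint64_t nr_to_add)
{
	return cur + nr_to_add > (uint64_t)demo_iext_max_nextents(whichfork);
}
```

Because callers ask the helper rather than naming a constant, a later
change to the limits only has to touch the helper.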

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 9 ++++-----
 fs/xfs/libxfs/xfs_inode_buf.c  | 8 +++-----
 fs/xfs/libxfs/xfs_inode_fork.c | 5 +++--
 fs/xfs/libxfs/xfs_inode_fork.h | 9 +++++++++
 fs/xfs/scrub/inode_repair.c    | 2 +-
 5 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index cd97288f6abc..88d4d17821b6 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -76,13 +76,12 @@ xfs_bmap_compute_maxlevels(
 	 * ATTR2 we have to assume the worst case scenario of a minimum size
 	 * available.
 	 */
-	if (whichfork == XFS_DATA_FORK) {
-		maxleafents = MAXEXTNUM;
+	maxleafents = xfs_iext_max_nextents(mp, whichfork);
+	if (whichfork == XFS_DATA_FORK)
 		sz = xfs_bmdr_space_calc(MINDBTPTRS);
-	} else {
-		maxleafents = MAXAEXTNUM;
+	else
 		sz = xfs_bmdr_space_calc(MINABTPTRS);
-	}
+
 	maxrootrecs = xfs_bmdr_maxrecs(sz, 0);
 	minleafrecs = mp->m_bmap_dmnr[0];
 	minnoderecs = mp->m_bmap_dmnr[1];
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 5834d46762d4..51d91ad98b50 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -343,6 +343,7 @@ xfs_dinode_verify_fork(
 	int			whichfork)
 {
 	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		max_extents;
 
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
 	case XFS_DINODE_FMT_LOCAL:
@@ -364,12 +365,9 @@ xfs_dinode_verify_fork(
 			return __this_address;
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if (whichfork == XFS_ATTR_FORK) {
-			if (di_nextents > MAXAEXTNUM)
-				return __this_address;
-		} else if (di_nextents > MAXEXTNUM) {
+		max_extents = xfs_iext_max_nextents(mp, whichfork);
+		if (di_nextents > max_extents)
 			return __this_address;
-		}
 		break;
 	default:
 		return __this_address;
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 801a6f7dbd0c..bc12d85df6e1 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -729,6 +729,7 @@ xfs_iext_count_may_overflow(
 	int			whichfork,
 	int			nr_to_add)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	uint64_t		max_exts;
 	uint64_t		nr_exts;
@@ -736,9 +737,9 @@ xfs_iext_count_may_overflow(
 	if (whichfork == XFS_COW_FORK)
 		return 0;
 
-	max_exts = (whichfork == XFS_ATTR_FORK) ? MAXAEXTNUM : MAXEXTNUM;
+	max_exts = xfs_iext_max_nextents(mp, whichfork);
 
-	if (XFS_TEST_ERROR(false, ip->i_mount, XFS_ERRTAG_REDUCE_MAX_IEXTENTS))
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS))
 		max_exts = 10;
 
 	nr_exts = ifp->if_nextents + nr_to_add;
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index cf82be263b48..6ba38c154647 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -133,6 +133,15 @@ static inline int8_t xfs_ifork_format(struct xfs_ifork *ifp)
 	return ifp->if_format;
 }
 
+static inline xfs_extnum_t xfs_iext_max_nextents(struct xfs_mount *mp,
+		int whichfork)
+{
+	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
+		return MAXEXTNUM;
+
+	return MAXAEXTNUM;
+}
+
 struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format,
 				xfs_extnum_t nextents);
 struct xfs_ifork *xfs_iext_state_to_fork(struct xfs_inode *ip, int state);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index f3140991ee5b..b58820d22304 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -1202,7 +1202,7 @@ xrep_inode_blockcounts(
 			return error;
 		if (count >= sc->mp->m_sb.sb_dblocks)
 			return -EFSCORRUPTED;
-		if (nextents >= MAXAEXTNUM)
+		if (nextents >= xfs_iext_max_nextents(sc->mp, XFS_ATTR_FORK))
 			return -EFSCORRUPTED;
 		ifp->if_nextents = nextents;
 	} else {
-- 
2.30.2



* [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types Chandan Babu R
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

In preparation for introducing larger extent count limits, this commit renames
existing extent count limits based on their signedness and width.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h     | 9 +++++----
 fs/xfs/libxfs/xfs_inode_fork.h | 4 ++--
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index bef1727bb182..ed8a5354bcbf 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -873,11 +873,12 @@ enum xfs_dinode_fmt {
 	{ XFS_DINODE_FMT_UUID,		"uuid" }
 
 /*
- * Max values for extlen, extnum, aextnum.
+ * Max values for extlen and disk inode's extent counters.
  */
-#define	MAXEXTLEN	((xfs_extlen_t)0x001fffff)	/* 21 bits */
-#define	MAXEXTNUM	((xfs_extnum_t)0x7fffffff)	/* signed int */
-#define	MAXAEXTNUM	((xfs_aextnum_t)0x7fff)		/* signed short */
+#define	MAXEXTLEN		((xfs_extlen_t)0x1fffff)	/* 21 bits */
+#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
+#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
+
 
 /*
  * Inode minimum and maximum sizes.
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 6ba38c154647..e8fe5b477b50 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -137,9 +137,9 @@ static inline xfs_extnum_t xfs_iext_max_nextents(struct xfs_mount *mp,
 		int whichfork)
 {
 	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
-		return MAXEXTNUM;
+		return XFS_IFORK_EXTCNT_MAXS32;
 
-	return MAXAEXTNUM;
+	return XFS_IFORK_EXTCNT_MAXS16;
 }
 
 struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format,
-- 
2.30.2



* [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (2 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16 Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

xfs_extnum_t is the type to use to declare variables which have values
obtained from xfs_dinode->di_[a]nextents. This commit replaces basic
types (e.g. uint32_t) with xfs_extnum_t for such variables.
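
A short userspace sketch of why this matters once a later patch in
the series promotes xfs_extnum_t to 64 bits. The names below
(demo_extnum_t and the two helpers) are hypothetical and only model
the effect; they are not part of the patchset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the extent count type after promotion. */
typedef uint64_t demo_extnum_t;

static_assert(sizeof(demo_extnum_t) == 8, "model assumes a 64-bit type");

/* A local declared as uint32_t silently drops the upper bits. */
static int survives_in_u32(uint64_t count)
{
	uint32_t	narrow = (uint32_t)count;

	return (uint64_t)narrow == count;
}

/* A local declared with the typedef follows the type's width. */
static int survives_in_typedef(uint64_t count)
{
	demo_extnum_t	wide = count;

	return (uint64_t)wide == count;
}
```

A count of 10 billion extents passes survives_in_typedef() but not
survives_in_u32(), so variables declared with the typedef pick up the
wider range without further edits.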

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 2 +-
 fs/xfs/libxfs/xfs_inode_buf.c  | 2 +-
 fs/xfs/libxfs/xfs_inode_fork.c | 2 +-
 fs/xfs/scrub/inode.c           | 2 +-
 fs/xfs/scrub/inode_repair.c    | 2 +-
 fs/xfs/xfs_trace.h             | 2 +-
 6 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 88d4d17821b6..e5485b5c99a0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -56,7 +56,7 @@ xfs_bmap_compute_maxlevels(
 {
 	int		level;		/* btree level */
 	uint		maxblocks;	/* max blocks at this level */
-	uint		maxleafents;	/* max leaf entries possible */
+	xfs_extnum_t	maxleafents;	/* max leaf entries possible */
 	int		maxrootrecs;	/* max records in root block */
 	int		minleafrecs;	/* min records in leaf block */
 	int		minnoderecs;	/* min records in node block */
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 51d91ad98b50..ea4469b5114e 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -342,7 +342,7 @@ xfs_dinode_verify_fork(
 	struct xfs_mount	*mp,
 	int			whichfork)
 {
-	uint32_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
 	xfs_extnum_t		max_extents;
 
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index bc12d85df6e1..e7bb3ba22912 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -107,7 +107,7 @@ xfs_iformat_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	int			state = xfs_bmap_fork_to_state(whichfork);
-	int			nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		nex = XFS_DFORK_NEXTENTS(dip, whichfork);
 	int			size = nex * sizeof(xfs_bmbt_rec_t);
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_rec	*dp;
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index eac15af7b08c..87925761e174 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -232,7 +232,7 @@ xchk_dinode(
 	size_t			fork_recs;
 	unsigned long long	isize;
 	uint64_t		flags2;
-	uint32_t		nextents;
+	xfs_extnum_t		nextents;
 	prid_t			prid;
 	uint16_t		flags;
 	uint16_t		mode;
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index b58820d22304..bebc1fd33667 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -603,7 +603,7 @@ xrep_dinode_bad_extents_fork(
 	struct xfs_bmbt_rec	*dp;
 	bool			isrt;
 	int			i;
-	int			nex;
+	xfs_extnum_t		nex;
 	int			fork_size;
 
 	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f1fe5156b3b5..fb1033de7003 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2183,7 +2183,7 @@ DECLARE_EVENT_CLASS(xfs_swap_extent_class,
 		__field(int, which)
 		__field(xfs_ino_t, ino)
 		__field(int, format)
-		__field(int, nex)
+		__field(xfs_extnum_t, nex)
 		__field(int, broot_size)
 		__field(int, fork_off)
 	),
-- 
2.30.2



* [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (3 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-27 22:46   ` Dave Chinner
  2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
                   ` (7 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
counter fields will add more logic to this helper.

This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
with calls to xfs_dfork_nextents().

No functional changes have been made.
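
The behaviour of the helper can be modelled in userspace as follows.
This is a sketch under stated assumptions, not the kernel
implementation: struct demo_dinode contains only the two counter
fields, stored as raw big-endian bytes, and all demo_* names are
hypothetical replacements for the kernel's types and byte-order
helpers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's fork identifiers. */
enum demo_fork { DEMO_DATA_FORK, DEMO_ATTR_FORK };

/* Only the two on-disk counters, kept as big-endian byte arrays. */
struct demo_dinode {
	uint8_t	di_nextents[4];		/* 32-bit data fork counter */
	uint8_t	di_anextents[2];	/* 16-bit attr fork counter */
};

static uint32_t demo_be32_to_cpu(const uint8_t b[4])
{
	return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	       ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

static uint16_t demo_be16_to_cpu(const uint8_t b[2])
{
	return (uint16_t)((b[0] << 8) | b[1]);
}

/* Single accessor for both forks, mirroring the switch in the patch. */
static uint64_t demo_dfork_nextents(const struct demo_dinode *dip,
		enum demo_fork whichfork)
{
	uint64_t	nextents = 0;

	switch (whichfork) {
	case DEMO_DATA_FORK:
		nextents = demo_be32_to_cpu(dip->di_nextents);
		break;
	case DEMO_ATTR_FORK:
		nextents = demo_be16_to_cpu(dip->di_anextents);
		break;
	default:
		assert(0);
		break;
	}

	return nextents;
}
```

Routing every read through one accessor means that a later change to
where the counters live on disk needs to be made in one function
rather than at each call site.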

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++----
 fs/xfs/libxfs/xfs_inode_buf.c  | 16 +++++++++-----
 fs/xfs/libxfs/xfs_inode_fork.c | 10 +++++----
 fs/xfs/scrub/inode.c           | 18 +++++++++-------
 fs/xfs/scrub/inode_repair.c    | 38 +++++++++++++++++++++-------------
 5 files changed, 75 insertions(+), 35 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index ed8a5354bcbf..b4638052801f 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -930,10 +930,30 @@ enum xfs_dinode_fmt {
 	((w) == XFS_DATA_FORK ? \
 		(dip)->di_format : \
 		(dip)->di_aformat)
-#define XFS_DFORK_NEXTENTS(dip,w) \
-	((w) == XFS_DATA_FORK ? \
-		be32_to_cpu((dip)->di_nextents) : \
-		be16_to_cpu((dip)->di_anextents))
+
+static inline xfs_extnum_t
+xfs_dfork_nextents(
+	struct xfs_dinode	*dip,
+	int			whichfork)
+{
+	xfs_extnum_t		nextents = 0;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		nextents = be32_to_cpu(dip->di_nextents);
+		break;
+
+	case XFS_ATTR_FORK:
+		nextents = be16_to_cpu(dip->di_anextents);
+		break;
+
+	default:
+		ASSERT(0);
+		break;
+	}
+
+	return nextents;
+}
 
 /*
  * For block and character special files the 32bit dev_t is stored at the
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index ea4469b5114e..176c98798aa4 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -342,9 +342,11 @@ xfs_dinode_verify_fork(
 	struct xfs_mount	*mp,
 	int			whichfork)
 {
-	xfs_extnum_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		di_nextents;
 	xfs_extnum_t		max_extents;
 
+	di_nextents = xfs_dfork_nextents(dip, whichfork);
+
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
 	case XFS_DINODE_FMT_LOCAL:
 		/*
@@ -474,6 +476,8 @@ xfs_dinode_verify(
 	uint16_t		flags;
 	uint64_t		flags2;
 	uint64_t		di_size;
+	xfs_extnum_t            nextents;
+	xfs_rfsblock_t		nblocks;
 
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
 		return __this_address;
@@ -504,10 +508,12 @@ xfs_dinode_verify(
 	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
 		return __this_address;
 
+	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
+	nextents += xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	nblocks = be64_to_cpu(dip->di_nblocks);
+
 	/* Fork checks carried over from xfs_iformat_fork */
-	if (mode &&
-	    be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
-			be64_to_cpu(dip->di_nblocks))
+	if (mode && nextents > nblocks)
 		return __this_address;
 
 	if (mode && XFS_DFORK_BOFF(dip) > mp->m_sb.sb_inodesize)
@@ -564,7 +570,7 @@ xfs_dinode_verify(
 		default:
 			return __this_address;
 		}
-		if (dip->di_anextents)
+		if (xfs_dfork_nextents(dip, XFS_ATTR_FORK))
 			return __this_address;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index e7bb3ba22912..7d1efccfea59 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -107,7 +107,7 @@ xfs_iformat_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	int			state = xfs_bmap_fork_to_state(whichfork);
-	xfs_extnum_t		nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	xfs_extnum_t		nex = xfs_dfork_nextents(dip, whichfork);
 	int			size = nex * sizeof(xfs_bmbt_rec_t);
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_rec	*dp;
@@ -234,7 +234,7 @@ xfs_iformat_data_fork(
 	 * depend on it.
 	 */
 	ip->i_df.if_format = dip->di_format;
-	ip->i_df.if_nextents = be32_to_cpu(dip->di_nextents);
+	ip->i_df.if_nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFIFO:
@@ -301,14 +301,16 @@ xfs_iformat_attr_fork(
 	struct xfs_inode	*ip,
 	struct xfs_dinode	*dip)
 {
+	xfs_extnum_t		naextents;
 	int			error = 0;
 
+	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+
 	/*
 	 * Initialize the extent count early, as the per-format routines may
 	 * depend on it.
 	 */
-	ip->i_afp = xfs_ifork_alloc(dip->di_aformat,
-				be16_to_cpu(dip->di_anextents));
+	ip->i_afp = xfs_ifork_alloc(dip->di_aformat, naextents);
 
 	switch (ip->i_afp->if_format) {
 	case XFS_DINODE_FMT_LOCAL:
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 87925761e174..4177b85c941d 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -233,6 +233,7 @@ xchk_dinode(
 	unsigned long long	isize;
 	uint64_t		flags2;
 	xfs_extnum_t		nextents;
+	xfs_extnum_t		naextents;
 	prid_t			prid;
 	uint16_t		flags;
 	uint16_t		mode;
@@ -391,7 +392,7 @@ xchk_dinode(
 	xchk_inode_extsize(sc, dip, ino, mode, flags);
 
 	/* di_nextents */
-	nextents = be32_to_cpu(dip->di_nextents);
+	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
 	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
 	switch (dip->di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
@@ -408,10 +409,12 @@ xchk_dinode(
 		break;
 	}
 
+	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+
 	/* di_forkoff */
 	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
 		xchk_ino_set_corrupt(sc, ino);
-	if (dip->di_anextents != 0 && dip->di_forkoff == 0)
+	if (naextents != 0 && dip->di_forkoff == 0)
 		xchk_ino_set_corrupt(sc, ino);
 	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
 		xchk_ino_set_corrupt(sc, ino);
@@ -423,19 +426,18 @@ xchk_dinode(
 		xchk_ino_set_corrupt(sc, ino);
 
 	/* di_anextents */
-	nextents = be16_to_cpu(dip->di_anextents);
 	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
 	switch (dip->di_aformat) {
 	case XFS_DINODE_FMT_EXTENTS:
-		if (nextents > fork_recs)
+		if (naextents > fork_recs)
 			xchk_ino_set_corrupt(sc, ino);
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		if (nextents <= fork_recs)
+		if (naextents <= fork_recs)
 			xchk_ino_set_corrupt(sc, ino);
 		break;
 	default:
-		if (nextents != 0)
+		if (naextents != 0)
 			xchk_ino_set_corrupt(sc, ino);
 	}
 
@@ -513,14 +515,14 @@ xchk_inode_xref_bmap(
 			&nextents, &count);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents < be32_to_cpu(dip->di_nextents))
+	if (nextents < xfs_dfork_nextents(dip, XFS_DATA_FORK))
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
 			&nextents, &acount);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents != be16_to_cpu(dip->di_anextents))
+	if (nextents != xfs_dfork_nextents(dip, XFS_ATTR_FORK))
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	/* Check nblocks against the inode. */
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index bebc1fd33667..ec8360b3b13b 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -606,7 +606,7 @@ xrep_dinode_bad_extents_fork(
 	xfs_extnum_t		nex;
 	int			fork_size;
 
-	nex = XFS_DFORK_NEXTENTS(dip, whichfork);
+	nex = xfs_dfork_nextents(dip, whichfork);
 	fork_size = nex * sizeof(struct xfs_bmbt_rec);
 	if (fork_size < 0 || fork_size > dfork_size)
 		return true;
@@ -640,7 +640,7 @@ xrep_dinode_bad_btree_fork(
 	int			nrecs;
 	int			level;
 
-	if (XFS_DFORK_NEXTENTS(dip, whichfork) <=
+	if (xfs_dfork_nextents(dip, whichfork) <=
 			dfork_size / sizeof(struct xfs_bmbt_rec))
 		return true;
 
@@ -835,12 +835,16 @@ xrep_dinode_ensure_forkoff(
 	struct xrep_dinode_stats	*dis)
 {
 	struct xfs_bmdr_block		*bmdr;
+	xfs_extnum_t			anextents, dnextents;
 	size_t				bmdr_minsz = xfs_bmdr_space_calc(1);
 	unsigned int			lit_sz = XFS_LITINO(sc->mp);
 	unsigned int			afork_min, dfork_min;
 
 	trace_xrep_dinode_ensure_forkoff(sc, dip);
 
+	dnextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
+	anextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+
 	/*
 	 * Before calling this function, xrep_dinode_core ensured that both
 	 * forks actually fit inside their respective literal areas.  If this
@@ -861,15 +865,14 @@ xrep_dinode_ensure_forkoff(
 		afork_min = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
 		break;
 	case XFS_DINODE_FMT_EXTENTS:
-		if (dip->di_anextents) {
+		if (anextents) {
 			/*
 			 * We must maintain sufficient space to hold the entire
 			 * extent map array in the data fork.  Note that we
 			 * previously zapped the fork if it had no chance of
 			 * fitting in the inode.
 			 */
-			afork_min = sizeof(struct xfs_bmbt_rec) *
-						be16_to_cpu(dip->di_anextents);
+			afork_min = sizeof(struct xfs_bmbt_rec) * anextents;
 		} else if (dis->attr_extents > 0) {
 			/*
 			 * The attr fork thinks it has zero extents, but we
@@ -912,15 +915,14 @@ xrep_dinode_ensure_forkoff(
 		dfork_min = be64_to_cpu(dip->di_size);
 		break;
 	case XFS_DINODE_FMT_EXTENTS:
-		if (dip->di_nextents) {
+		if (dnextents) {
 			/*
 			 * We must maintain sufficient space to hold the entire
 			 * extent map array in the data fork.  Note that we
 			 * previously zapped the fork if it had no chance of
 			 * fitting in the inode.
 			 */
-			dfork_min = sizeof(struct xfs_bmbt_rec) *
-						be32_to_cpu(dip->di_nextents);
+			dfork_min = sizeof(struct xfs_bmbt_rec) * dnextents;
 		} else if (dis->data_extents > 0 || dis->rt_extents > 0) {
 			/*
 			 * The data fork thinks it has zero extents, but we
@@ -960,7 +962,7 @@ xrep_dinode_ensure_forkoff(
 	 * recovery fork, move the attr fork up.
 	 */
 	if (dip->di_format == XFS_DINODE_FMT_EXTENTS &&
-	    dip->di_nextents == 0 &&
+	    dnextents == 0 &&
 	    (dis->data_extents > 0 || dis->rt_extents > 0) &&
 	    bmdr_minsz > XFS_DFORK_DSIZE(dip, sc->mp)) {
 		if (bmdr_minsz + afork_min > lit_sz) {
@@ -986,7 +988,7 @@ xrep_dinode_ensure_forkoff(
 	 * recovery fork, move the attr fork down.
 	 */
 	if (dip->di_aformat == XFS_DINODE_FMT_EXTENTS &&
-	    dip->di_anextents == 0 &&
+	    anextents == 0 &&
 	    dis->attr_extents > 0 &&
 	    bmdr_minsz > XFS_DFORK_ASIZE(dip, sc->mp)) {
 		if (dip->di_format == XFS_DINODE_FMT_BTREE) {
@@ -1023,6 +1025,9 @@ xrep_dinode_zap_forks(
 	struct xfs_dinode		*dip,
 	struct xrep_dinode_stats	*dis)
 {
+	xfs_rfsblock_t			nblocks;
+	xfs_extnum_t			nextents;
+	xfs_extnum_t			naextents;
 	uint16_t			mode;
 	bool				zap_datafork = false;
 	bool				zap_attrfork = false;
@@ -1032,12 +1037,17 @@ xrep_dinode_zap_forks(
 	mode = be16_to_cpu(dip->di_mode);
 
 	/* Inode counters don't make sense? */
-	if (be32_to_cpu(dip->di_nextents) > be64_to_cpu(dip->di_nblocks))
+	nblocks = be64_to_cpu(dip->di_nblocks);
+
+	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
+	if (nextents > nblocks)
 		zap_datafork = true;
-	if (be16_to_cpu(dip->di_anextents) > be64_to_cpu(dip->di_nblocks))
+
+	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	if (naextents > nblocks)
 		zap_attrfork = true;
-	if (be32_to_cpu(dip->di_nextents) + be16_to_cpu(dip->di_anextents) >
-			be64_to_cpu(dip->di_nblocks))
+
+	if (nextents + naextents > nblocks)
 		zap_datafork = zap_attrfork = true;
 
 	if (!zap_datafork)
-- 
2.30.2


* [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (4 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-30  1:19   ` Dave Chinner
  2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
                   ` (6 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

This commit changes xfs_dfork_nextents() to return an error code. The extent
count itself is now returned through an out argument. This facility will be
used by a future commit to indicate an inconsistent ondisk extent count.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h     | 14 ++---
 fs/xfs/libxfs/xfs_inode_buf.c  | 16 ++++--
 fs/xfs/libxfs/xfs_inode_fork.c | 21 ++++++--
 fs/xfs/scrub/inode.c           | 94 +++++++++++++++++++++-------------
 fs/xfs/scrub/inode_repair.c    | 34 ++++++++----
 5 files changed, 118 insertions(+), 61 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index b4638052801f..dba868f2c3e3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -931,28 +931,30 @@ enum xfs_dinode_fmt {
 		(dip)->di_format : \
 		(dip)->di_aformat)
 
-static inline xfs_extnum_t
+static inline int
 xfs_dfork_nextents(
 	struct xfs_dinode	*dip,
-	int			whichfork)
+	int			whichfork,
+	xfs_extnum_t		*nextents)
 {
-	xfs_extnum_t		nextents = 0;
+	int			error = 0;
 
 	switch (whichfork) {
 	case XFS_DATA_FORK:
-		nextents = be32_to_cpu(dip->di_nextents);
+		*nextents = be32_to_cpu(dip->di_nextents);
 		break;
 
 	case XFS_ATTR_FORK:
-		nextents = be16_to_cpu(dip->di_anextents);
+		*nextents = be16_to_cpu(dip->di_anextents);
 		break;
 
 	default:
 		ASSERT(0);
+		error = -EFSCORRUPTED;
 		break;
 	}
 
-	return nextents;
+	return error;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 176c98798aa4..dc511630cc7a 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -345,7 +345,8 @@ xfs_dinode_verify_fork(
 	xfs_extnum_t		di_nextents;
 	xfs_extnum_t		max_extents;
 
-	di_nextents = xfs_dfork_nextents(dip, whichfork);
+	if (xfs_dfork_nextents(dip, whichfork, &di_nextents))
+		return __this_address;
 
 	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
 	case XFS_DINODE_FMT_LOCAL:
@@ -477,6 +478,7 @@ xfs_dinode_verify(
 	uint64_t		flags2;
 	uint64_t		di_size;
 	xfs_extnum_t            nextents;
+	xfs_extnum_t            naextents;
 	xfs_rfsblock_t		nblocks;
 
 	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
@@ -508,8 +510,13 @@ xfs_dinode_verify(
 	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
 		return __this_address;
 
-	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
-	nextents += xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	if (xfs_dfork_nextents(dip, XFS_DATA_FORK, &nextents))
+		return __this_address;
+
+	if (xfs_dfork_nextents(dip, XFS_ATTR_FORK, &naextents))
+		return __this_address;
+
+	nextents += naextents;
 	nblocks = be64_to_cpu(dip->di_nblocks);
 
 	/* Fork checks carried over from xfs_iformat_fork */
@@ -570,7 +577,8 @@ xfs_dinode_verify(
 		default:
 			return __this_address;
 		}
-		if (xfs_dfork_nextents(dip, XFS_ATTR_FORK))
+
+		if (naextents)
 			return __this_address;
 	}
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 7d1efccfea59..435c343612e2 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -107,13 +107,20 @@ xfs_iformat_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
 	int			state = xfs_bmap_fork_to_state(whichfork);
-	xfs_extnum_t		nex = xfs_dfork_nextents(dip, whichfork);
-	int			size = nex * sizeof(xfs_bmbt_rec_t);
+	xfs_extnum_t		nex;
+	int			size;
 	struct xfs_iext_cursor	icur;
 	struct xfs_bmbt_rec	*dp;
 	struct xfs_bmbt_irec	new;
+	int			error;
 	int			i;
 
+	error = xfs_dfork_nextents(dip, whichfork, &nex);
+	if (error)
+		return error;
+
+	size = nex * sizeof(struct xfs_bmbt_rec);
+
 	/*
 	 * If the number of extents is unreasonable, then something is wrong and
 	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
@@ -234,7 +241,9 @@ xfs_iformat_data_fork(
 	 * depend on it.
 	 */
 	ip->i_df.if_format = dip->di_format;
-	ip->i_df.if_nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
+	error = xfs_dfork_nextents(dip, XFS_DATA_FORK, &ip->i_df.if_nextents);
+	if (error)
+		return error;
 
 	switch (inode->i_mode & S_IFMT) {
 	case S_IFIFO:
@@ -302,9 +311,11 @@ xfs_iformat_attr_fork(
 	struct xfs_dinode	*dip)
 {
 	xfs_extnum_t		naextents;
-	int			error = 0;
+	int			error;
 
-	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	error = xfs_dfork_nextents(dip, XFS_ATTR_FORK, &naextents);
+	if (error)
+		return error;
 
 	/*
 	 * Initialize the extent count early, as the per-format routines may
diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c
index 4177b85c941d..be43bd6be1ed 100644
--- a/fs/xfs/scrub/inode.c
+++ b/fs/xfs/scrub/inode.c
@@ -221,6 +221,38 @@ xchk_dinode_nsec(
 		xchk_ino_set_corrupt(sc, ino);
 }
 
+STATIC void
+xchk_dinode_fork_recs(
+	struct xfs_scrub	*sc,
+	struct xfs_dinode	*dip,
+	xfs_ino_t		ino,
+	xfs_extnum_t		nextents,
+	int			whichfork)
+{
+	struct xfs_mount	*mp = sc->mp;
+	size_t			fork_recs;
+	unsigned char		format;
+
+	fork_recs = XFS_DFORK_SIZE(dip, mp, whichfork) /
+		sizeof(struct xfs_bmbt_rec);
+	format = XFS_DFORK_FORMAT(dip, whichfork);
+
+	switch (format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		if (nextents > fork_recs)
+			xchk_ino_set_corrupt(sc, ino);
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		if (nextents <= fork_recs)
+			xchk_ino_set_corrupt(sc, ino);
+		break;
+	default:
+		if (nextents != 0)
+			xchk_ino_set_corrupt(sc, ino);
+		break;
+	}
+}
+
 /* Scrub all the ondisk inode fields. */
 STATIC void
 xchk_dinode(
@@ -229,7 +261,6 @@ xchk_dinode(
 	xfs_ino_t		ino)
 {
 	struct xfs_mount	*mp = sc->mp;
-	size_t			fork_recs;
 	unsigned long long	isize;
 	uint64_t		flags2;
 	xfs_extnum_t		nextents;
@@ -237,6 +268,7 @@ xchk_dinode(
 	prid_t			prid;
 	uint16_t		flags;
 	uint16_t		mode;
+	int			error;
 
 	flags = be16_to_cpu(dip->di_flags);
 	if (dip->di_version >= 3)
@@ -392,33 +424,30 @@ xchk_dinode(
 	xchk_inode_extsize(sc, dip, ino, mode, flags);
 
 	/* di_nextents */
-	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
-	fork_recs =  XFS_DFORK_DSIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
-	switch (dip->di_format) {
-	case XFS_DINODE_FMT_EXTENTS:
-		if (nextents > fork_recs)
-			xchk_ino_set_corrupt(sc, ino);
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		if (nextents <= fork_recs)
-			xchk_ino_set_corrupt(sc, ino);
-		break;
-	default:
-		if (nextents != 0)
-			xchk_ino_set_corrupt(sc, ino);
-		break;
+	error = xfs_dfork_nextents(dip, XFS_DATA_FORK, &nextents);
+	if (error) {
+		xchk_ino_set_corrupt(sc, ino);
+		return;
 	}
-
-	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	xchk_dinode_fork_recs(sc, dip, ino, nextents, XFS_DATA_FORK);
 
 	/* di_forkoff */
 	if (XFS_DFORK_APTR(dip) >= (char *)dip + mp->m_sb.sb_inodesize)
 		xchk_ino_set_corrupt(sc, ino);
-	if (naextents != 0 && dip->di_forkoff == 0)
-		xchk_ino_set_corrupt(sc, ino);
 	if (dip->di_forkoff == 0 && dip->di_aformat != XFS_DINODE_FMT_EXTENTS)
 		xchk_ino_set_corrupt(sc, ino);
 
+	error = xfs_dfork_nextents(dip, XFS_ATTR_FORK, &naextents);
+	if (error) {
+		xchk_ino_set_corrupt(sc, ino);
+		return;
+	}
+
+	if (naextents != 0 && dip->di_forkoff == 0) {
+		xchk_ino_set_corrupt(sc, ino);
+		return;
+	}
+
 	/* di_aformat */
 	if (dip->di_aformat != XFS_DINODE_FMT_LOCAL &&
 	    dip->di_aformat != XFS_DINODE_FMT_EXTENTS &&
@@ -426,20 +455,8 @@ xchk_dinode(
 		xchk_ino_set_corrupt(sc, ino);
 
 	/* di_anextents */
-	fork_recs =  XFS_DFORK_ASIZE(dip, mp) / sizeof(struct xfs_bmbt_rec);
-	switch (dip->di_aformat) {
-	case XFS_DINODE_FMT_EXTENTS:
-		if (naextents > fork_recs)
-			xchk_ino_set_corrupt(sc, ino);
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		if (naextents <= fork_recs)
-			xchk_ino_set_corrupt(sc, ino);
-		break;
-	default:
-		if (naextents != 0)
-			xchk_ino_set_corrupt(sc, ino);
-	}
+	if (!error)
+		xchk_dinode_fork_recs(sc, dip, ino, naextents, XFS_ATTR_FORK);
 
 	if (dip->di_version >= 3) {
 		xchk_dinode_nsec(sc, ino, dip, dip->di_crtime);
@@ -502,6 +519,7 @@ xchk_inode_xref_bmap(
 	struct xfs_scrub	*sc,
 	struct xfs_dinode	*dip)
 {
+	xfs_extnum_t		dip_nextents;
 	xfs_extnum_t		nextents;
 	xfs_filblks_t		count;
 	xfs_filblks_t		acount;
@@ -515,14 +533,18 @@ xchk_inode_xref_bmap(
 			&nextents, &count);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents < xfs_dfork_nextents(dip, XFS_DATA_FORK))
+
+	error = xfs_dfork_nextents(dip, XFS_DATA_FORK, &dip_nextents);
+	if (error || nextents < dip_nextents)
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	error = xfs_bmap_count_blocks(sc->tp, sc->ip, XFS_ATTR_FORK,
 			&nextents, &acount);
 	if (!xchk_should_check_xref(sc, &error, NULL))
 		return;
-	if (nextents != xfs_dfork_nextents(dip, XFS_ATTR_FORK))
+
+	error = xfs_dfork_nextents(dip, XFS_ATTR_FORK, &dip_nextents);
+	if (error || nextents != dip_nextents)
 		xchk_ino_xref_set_corrupt(sc, sc->ip->i_ino);
 
 	/* Check nblocks against the inode. */
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index ec8360b3b13b..4133a91c9a57 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -606,7 +606,9 @@ xrep_dinode_bad_extents_fork(
 	xfs_extnum_t		nex;
 	int			fork_size;
 
-	nex = xfs_dfork_nextents(dip, whichfork);
+	if (xfs_dfork_nextents(dip, whichfork, &nex))
+		return true;
+
 	fork_size = nex * sizeof(struct xfs_bmbt_rec);
 	if (fork_size < 0 || fork_size > dfork_size)
 		return true;
@@ -637,11 +639,14 @@ xrep_dinode_bad_btree_fork(
 	int			whichfork)
 {
 	struct xfs_bmdr_block	*dfp;
+	xfs_extnum_t		nextents;
 	int			nrecs;
 	int			level;
 
-	if (xfs_dfork_nextents(dip, whichfork) <=
-			dfork_size / sizeof(struct xfs_bmbt_rec))
+	if (xfs_dfork_nextents(dip, whichfork, &nextents))
+		return true;
+
+	if (nextents <= dfork_size / sizeof(struct xfs_bmbt_rec))
 		return true;
 
 	if (dfork_size < sizeof(struct xfs_bmdr_block))
@@ -778,11 +783,15 @@ xrep_dinode_check_afork(
 	struct xfs_dinode		*dip)
 {
 	struct xfs_attr_shortform	*sfp;
+	xfs_extnum_t			nextents;
 	int				size;
 
+	if (xfs_dfork_nextents(dip, XFS_ATTR_FORK, &nextents))
+		return true;
+
 	if (XFS_DFORK_BOFF(dip) == 0)
 		return dip->di_aformat != XFS_DINODE_FMT_EXTENTS ||
-		       dip->di_anextents != 0;
+		       nextents != 0;
 
 	size = XFS_DFORK_SIZE(dip, sc->mp, XFS_ATTR_FORK);
 	switch (XFS_DFORK_FORMAT(dip, XFS_ATTR_FORK)) {
@@ -839,11 +848,15 @@ xrep_dinode_ensure_forkoff(
 	size_t				bmdr_minsz = xfs_bmdr_space_calc(1);
 	unsigned int			lit_sz = XFS_LITINO(sc->mp);
 	unsigned int			afork_min, dfork_min;
+	int				error;
 
 	trace_xrep_dinode_ensure_forkoff(sc, dip);
 
-	dnextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
-	anextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
+	error = xfs_dfork_nextents(dip, XFS_DATA_FORK, &dnextents);
+	ASSERT(error == 0);
+
+	error = xfs_dfork_nextents(dip, XFS_ATTR_FORK, &anextents);
+	ASSERT(error == 0);
 
 	/*
 	 * Before calling this function, xrep_dinode_core ensured that both
@@ -1031,6 +1044,7 @@ xrep_dinode_zap_forks(
 	uint16_t			mode;
 	bool				zap_datafork = false;
 	bool				zap_attrfork = false;
+	int				error;
 
 	trace_xrep_dinode_zap_forks(sc, dip);
 
@@ -1039,12 +1053,12 @@ xrep_dinode_zap_forks(
 	/* Inode counters don't make sense? */
 	nblocks = be64_to_cpu(dip->di_nblocks);
 
-	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
-	if (nextents > nblocks)
+	error = xfs_dfork_nextents(dip, XFS_DATA_FORK, &nextents);
+	if (error || nextents > nblocks)
 		zap_datafork = true;
 
-	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);
-	if (naextents > nblocks)
+	error = xfs_dfork_nextents(dip, XFS_ATTR_FORK, &naextents);
+	if (error || naextents > nblocks)
 		zap_attrfork = true;
 
 	if (nextents + naextents > nblocks)
-- 
2.30.2


* [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (5 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-27 23:46   ` Dave Chinner
  2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
                   ` (5 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

This commit renames extent counter fields in "struct xfs_dinode" and "struct
xfs_log_dinode" based on the width of the fields. As of this commit, the
32-bit field will be used to count data fork extents and the 16-bit field will
be used to count attr fork extents.

This change is done to enable a future commit to introduce a new 64-bit extent
counter field.

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h      |  8 ++++----
 fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
 fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
 fs/xfs/scrub/inode_repair.c     |  4 ++--
 fs/xfs/scrub/trace.h            | 14 +++++++-------
 fs/xfs/xfs_inode_item.c         |  4 ++--
 fs/xfs/xfs_inode_item_recover.c |  8 ++++----
 7 files changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index dba868f2c3e3..87c927d912f6 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -802,8 +802,8 @@ typedef struct xfs_dinode {
 	__be64		di_size;	/* number of bytes in file */
 	__be64		di_nblocks;	/* # of direct & btree blocks used */
 	__be32		di_extsize;	/* basic/minimum extent size for file */
-	__be32		di_nextents;	/* number of extents in data fork */
-	__be16		di_anextents;	/* number of extents in attribute fork*/
+	__be32		di_nextents32;	/* number of extents in data fork */
+	__be16		di_nextents16;	/* number of extents in attribute fork*/
 	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	__s8		di_aformat;	/* format of attr fork's data */
 	__be32		di_dmevmask;	/* DMIG event mask */
@@ -941,11 +941,11 @@ xfs_dfork_nextents(
 
 	switch (whichfork) {
 	case XFS_DATA_FORK:
-		*nextents = be32_to_cpu(dip->di_nextents);
+		*nextents = be32_to_cpu(dip->di_nextents32);
 		break;
 
 	case XFS_ATTR_FORK:
-		*nextents = be16_to_cpu(dip->di_anextents);
+		*nextents = be16_to_cpu(dip->di_nextents16);
 		break;
 
 	default:
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index dc511630cc7a..882ed4873afe 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -313,8 +313,8 @@ xfs_inode_to_disk(
 	to->di_size = cpu_to_be64(ip->i_disk_size);
 	to->di_nblocks = cpu_to_be64(ip->i_nblocks);
 	to->di_extsize = cpu_to_be32(ip->i_extsize);
-	to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
-	to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
+	to->di_nextents32 = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
+	to->di_nextents16 = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
 	to->di_forkoff = ip->i_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_flags = cpu_to_be16(ip->i_diflags);
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index bd711d244c4b..9f352ff4352b 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -402,8 +402,8 @@ struct xfs_log_dinode {
 	xfs_fsize_t	di_size;	/* number of bytes in file */
 	xfs_rfsblock_t	di_nblocks;	/* # of direct & btree blocks used */
 	xfs_extlen_t	di_extsize;	/* basic/minimum extent size for file */
-	xfs_extnum_t	di_nextents;	/* number of extents in data fork */
-	xfs_aextnum_t	di_anextents;	/* number of extents in attribute fork*/
+	uint32_t	di_nextents32;	/* number of extents in data fork */
+	uint16_t	di_nextents16;	/* number of extents in attribute fork*/
 	uint8_t		di_forkoff;	/* attr fork offs, <<3 for 64b align */
 	int8_t		di_aformat;	/* format of attr fork's data */
 	uint32_t	di_dmevmask;	/* DMIG event mask */
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 4133a91c9a57..19ea86aa9fd0 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -740,7 +740,7 @@ xrep_dinode_zap_dfork(
 {
 	trace_xrep_dinode_zap_dfork(sc, dip);
 
-	dip->di_nextents = 0;
+	dip->di_nextents32 = 0;
 
 	/* Special files always get reset to DEV */
 	switch (mode & S_IFMT) {
@@ -827,7 +827,7 @@ xrep_dinode_zap_afork(
 	trace_xrep_dinode_zap_afork(sc, dip);
 
 	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
-	dip->di_anextents = 0;
+	dip->di_nextents16 = 0;
 
 	dip->di_forkoff = 0;
 	dip->di_mode = cpu_to_be16(mode & ~0777);
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index e44ab2d9f85f..92888a6a6e51 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1218,8 +1218,8 @@ DECLARE_EVENT_CLASS(xrep_dinode_class,
 		__field(uint64_t, size)
 		__field(uint64_t, nblocks)
 		__field(uint32_t, extsize)
-		__field(uint32_t, nextents)
-		__field(uint16_t, anextents)
+		__field(uint32_t, nextents32)
+		__field(uint16_t, nextents16)
 		__field(uint8_t, forkoff)
 		__field(uint8_t, aformat)
 		__field(uint16_t, flags)
@@ -1238,8 +1238,8 @@ DECLARE_EVENT_CLASS(xrep_dinode_class,
 		__entry->size = be64_to_cpu(dip->di_size);
 		__entry->nblocks = be64_to_cpu(dip->di_nblocks);
 		__entry->extsize = be32_to_cpu(dip->di_extsize);
-		__entry->nextents = be32_to_cpu(dip->di_nextents);
-		__entry->anextents = be16_to_cpu(dip->di_anextents);
+		__entry->nextents32 = be32_to_cpu(dip->di_nextents32);
+		__entry->nextents16 = be16_to_cpu(dip->di_nextents16);
 		__entry->forkoff = dip->di_forkoff;
 		__entry->aformat = dip->di_aformat;
 		__entry->flags = be16_to_cpu(dip->di_flags);
@@ -1247,7 +1247,7 @@ DECLARE_EVENT_CLASS(xrep_dinode_class,
 		__entry->flags2 = be64_to_cpu(dip->di_flags2);
 		__entry->cowextsize = be32_to_cpu(dip->di_cowextsize);
 	),
-	TP_printk("dev %d:%d ino 0x%llx mode 0x%x version %u format %u uid %u gid %u disize 0x%llx nblocks 0x%llx extsize %u nextents %u anextents %u forkoff 0x%x aformat %u flags 0x%x gen 0x%x flags2 0x%llx cowextsize %u",
+	TP_printk("dev %d:%d ino 0x%llx mode 0x%x version %u format %u uid %u gid %u disize 0x%llx nblocks 0x%llx extsize %u nextents32 %u nextents16 %u forkoff 0x%x aformat %u flags 0x%x gen 0x%x flags2 0x%llx cowextsize %u",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->mode,
@@ -1258,8 +1258,8 @@ DECLARE_EVENT_CLASS(xrep_dinode_class,
 		  __entry->size,
 		  __entry->nblocks,
 		  __entry->extsize,
-		  __entry->nextents,
-		  __entry->anextents,
+		  __entry->nextents32,
+		  __entry->nextents16,
 		  __entry->forkoff,
 		  __entry->aformat,
 		  __entry->flags,
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index 0659d19c211e..e4800a965670 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -385,8 +385,8 @@ xfs_inode_to_log_dinode(
 	to->di_size = ip->i_disk_size;
 	to->di_nblocks = ip->i_nblocks;
 	to->di_extsize = ip->i_extsize;
-	to->di_nextents = xfs_ifork_nextents(&ip->i_df);
-	to->di_anextents = xfs_ifork_nextents(ip->i_afp);
+	to->di_nextents32 = xfs_ifork_nextents(&ip->i_df);
+	to->di_nextents16 = xfs_ifork_nextents(ip->i_afp);
 	to->di_forkoff = ip->i_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_flags = ip->i_diflags;
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index 239dd2e3384e..c21fb3d2ddca 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -167,8 +167,8 @@ xfs_log_dinode_to_disk(
 	to->di_size = cpu_to_be64(from->di_size);
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
-	to->di_nextents = cpu_to_be32(from->di_nextents);
-	to->di_anextents = cpu_to_be16(from->di_anextents);
+	to->di_nextents32 = cpu_to_be32(from->di_nextents32);
+	to->di_nextents16 = cpu_to_be16(from->di_nextents16);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -342,7 +342,7 @@ xlog_recover_inode_commit_pass2(
 			goto out_release;
 		}
 	}
-	if (unlikely(ldip->di_nextents + ldip->di_anextents > ldip->di_nblocks)){
+	if (unlikely(ldip->di_nextents32 + ldip->di_nextents16 > ldip->di_nblocks)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
 				     XFS_ERRLEVEL_LOW, mp, ldip,
 				     sizeof(*ldip));
@@ -350,7 +350,7 @@ xlog_recover_inode_commit_pass2(
 	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
 	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
 			__func__, item, dip, bp, in_f->ilf_ino,
-			ldip->di_nextents + ldip->di_anextents,
+			ldip->di_nextents32 + ldip->di_nextents16,
 			ldip->di_nblocks);
 		error = -EFSCORRUPTED;
 		goto out_release;
-- 
2.30.2


* [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (6 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-28  0:47   ` Dave Chinner
  2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
                   ` (4 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

A future commit will introduce a 64-bit on-disk data extent counter and a
32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
xfs_aextnum_t to unsigned 64-bit and 32-bit types respectively so that the
in-core versions of these quantities can be handled correctly.
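
The effect of the wider type can be sketched in userspace as follows. This is
an illustration only: demo_extnum_t and both helpers are hypothetical names,
and PRIu64 stands in for the %llu conversions that the kernel format strings
switch to.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t demo_extnum_t;		/* mirrors the promoted xfs_extnum_t */

/* True if the count is representable in the old signed 32-bit type. */
static bool demo_fits_old_extnum(demo_extnum_t nextents)
{
	return nextents <= (demo_extnum_t)INT32_MAX;
}

/* Format a count with a 64-bit conversion; returns snprintf()'s result. */
static int demo_format_extents(char *buf, size_t len, demo_extnum_t nextents)
{
	return snprintf(buf, len, "%" PRIu64 " extents", nextents);
}
```

A count of 10 billion extents, the figure quoted in the cover letter, fails
the first check, which is why every %d and %u that prints an extent count has
to be widened along with the type.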

Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 4 ++--
 fs/xfs/libxfs/xfs_inode_fork.c | 2 +-
 fs/xfs/libxfs/xfs_inode_fork.h | 2 +-
 fs/xfs/libxfs/xfs_types.h      | 4 ++--
 fs/xfs/scrub/attr_repair.c     | 2 +-
 fs/xfs/scrub/inode_repair.c    | 2 +-
 fs/xfs/scrub/trace.h           | 2 +-
 fs/xfs/xfs_inode.c             | 4 ++--
 fs/xfs/xfs_trace.h             | 4 ++--
 9 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index e5485b5c99a0..1a716067901f 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -54,9 +54,9 @@ xfs_bmap_compute_maxlevels(
 	xfs_mount_t	*mp,		/* file system mount structure */
 	int		whichfork)	/* data or attr fork */
 {
+	xfs_extnum_t	maxleafents;	/* max leaf entries possible */
 	int		level;		/* btree level */
 	uint		maxblocks;	/* max blocks at this level */
-	xfs_extnum_t	maxleafents;	/* max leaf entries possible */
 	int		maxrootrecs;	/* max records in root block */
 	int		minleafrecs;	/* min records in leaf block */
 	int		minnoderecs;	/* min records in node block */
@@ -473,7 +473,7 @@ xfs_bmap_check_leaf_extents(
 	if (bp_release)
 		xfs_trans_brelse(NULL, bp);
 error_norelse:
-	xfs_warn(mp, "%s: BAD after btree leaves for %d extents",
+	xfs_warn(mp, "%s: BAD after btree leaves for %llu extents",
 		__func__, i);
 	xfs_err(mp, "%s: CORRUPTED BTREE OR SOMETHING", __func__);
 	xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 435c343612e2..feabe2da63e6 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -126,7 +126,7 @@ xfs_iformat_extents(
 	 * we just bail out rather than crash in kmem_alloc() or memcpy() below.
 	 */
 	if (unlikely(size < 0 || size > XFS_DFORK_SIZE(dip, mp, whichfork))) {
-		xfs_warn(ip->i_mount, "corrupt inode %Lu ((a)extents = %d).",
+		xfs_warn(ip->i_mount, "corrupt inode %llu ((a)extents = %llu).",
 			(unsigned long long) ip->i_ino, nex);
 		xfs_inode_verifier_error(ip, -EFSCORRUPTED,
 				"xfs_iformat_extents(1)", dip, sizeof(*dip),
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index e8fe5b477b50..4b9df10e8eea 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -21,9 +21,9 @@ struct xfs_ifork {
 		void		*if_root;	/* extent tree root */
 		char		*if_data;	/* inline file data */
 	} if_u1;
+	xfs_extnum_t		if_nextents;	/* # of extents in this fork */
 	short			if_broot_bytes;	/* bytes allocated for root */
 	int8_t			if_format;	/* format of this fork */
-	xfs_extnum_t		if_nextents;	/* # of extents in this fork */
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_types.h b/fs/xfs/libxfs/xfs_types.h
index dbe5bb56f31f..a3af29b7d9f2 100644
--- a/fs/xfs/libxfs/xfs_types.h
+++ b/fs/xfs/libxfs/xfs_types.h
@@ -12,8 +12,8 @@ typedef uint32_t	xfs_agblock_t;	/* blockno in alloc. group */
 typedef uint32_t	xfs_agino_t;	/* inode # within allocation grp */
 typedef uint32_t	xfs_extlen_t;	/* extent length in blocks */
 typedef uint32_t	xfs_agnumber_t;	/* allocation group number */
-typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
-typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
+typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
+typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
 typedef int64_t		xfs_fsize_t;	/* bytes in a file */
 typedef uint64_t	xfs_ufsize_t;	/* unsigned bytes in a file */
 
diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c
index d7f7afb71a70..02983d037d3b 100644
--- a/fs/xfs/scrub/attr_repair.c
+++ b/fs/xfs/scrub/attr_repair.c
@@ -770,7 +770,7 @@ xrep_xattr_fork_remove(
 		unsigned int		i = 0;
 
 		xfs_emerg(sc->mp,
-	"inode 0x%llx attr fork still has %u attr extents, format %d?!",
+	"inode 0x%llx attr fork still has %llu attr extents, format %d?!",
 				ip->i_ino, ifp->if_nextents, ifp->if_format);
 		for_each_xfs_iext(ifp, &icur, &irec) {
 			xfs_err(sc->mp, "[%u]: startoff %llu startblock %llu blockcount %llu state %u", i++, irec.br_startoff, irec.br_startblock, irec.br_blockcount, irec.br_state);
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 19ea86aa9fd0..133109d84b98 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -601,9 +601,9 @@ xrep_dinode_bad_extents_fork(
 {
 	struct xfs_bmbt_irec	new;
 	struct xfs_bmbt_rec	*dp;
+	xfs_extnum_t		nex;
 	bool			isrt;
 	int			i;
-	xfs_extnum_t		nex;
 	int			fork_size;
 
 	if (xfs_dfork_nextents(dip, whichfork, &nex))
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 92888a6a6e51..14e4ac8eebce 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -1372,7 +1372,7 @@ TRACE_EVENT(xrep_dinode_count_rmaps,
 		__entry->attr_extents = attr_extents;
 		__entry->block0 = block0;
 	),
-	TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %u rtextents %u aextents %u startblock0 0x%llx",
+	TP_printk("dev %d:%d ino 0x%llx dblocks 0x%llx rtblocks 0x%llx ablocks 0x%llx dextents %llu rtextents %llu aextents %u startblock0 0x%llx",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
 		  __entry->data_blocks,
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4875d5e843f6..6338a93b975c 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2538,8 +2538,8 @@ xfs_iflush(
 	if (XFS_TEST_ERROR(ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp) >
 				ip->i_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
 		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
-			"%s: detected corrupt incore inode %Lu, "
-			"total extents = %d, nblocks = %Ld, ptr "PTR_FMT,
+			"%s: detected corrupt incore inode %llu, "
+			"total extents = %llu, nblocks = %lld, ptr "PTR_FMT,
 			__func__, ip->i_ino,
 			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
 			ip->i_nblocks, ip);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index fb1033de7003..dde8c98ac195 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -2196,7 +2196,7 @@ DECLARE_EVENT_CLASS(xfs_swap_extent_class,
 		__entry->broot_size = ip->i_df.if_broot_bytes;
 		__entry->fork_off = XFS_IFORK_BOFF(ip);
 	),
-	TP_printk("dev %d:%d ino 0x%llx (%s), %s format, num_extents %d, "
+	TP_printk("dev %d:%d ino 0x%llx (%s), %s format, num_extents %llu, "
 		  "broot size %d, forkoff 0x%x",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino,
@@ -4557,7 +4557,7 @@ TRACE_EVENT(xfs_swapext_delta_nextents,
 		__entry->d_nexts1 = d_nexts1;
 		__entry->d_nexts2 = d_nexts2;
 	),
-	TP_printk("dev %d:%d ino1 0x%llx nexts %u ino2 0x%llx nexts %u delta1 %lld delta2 %lld",
+	TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->ino1, __entry->nexts1,
 		  __entry->ino2, __entry->nexts2,
-- 
2.30.2


* [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (7 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-27 23:06   ` Dave Chinner
  2021-09-16 10:06 ` [PATCH V3 10/12] xfs: Extend per-inode extent counter widths Chandan Babu R
                   ` (3 subsequent siblings)
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

The following changes are made to enable userspace to obtain 64-bit extent
counters:
1. Carve out a new 64-bit field, xfs_bulkstat->bs_extents64, from
   xfs_bulkstat->bs_pad[] to hold 64-bit extent counters. The existing
   bs_extents field is renamed to bs_extents32.
2. Carve out a new 64-bit field, xfs_bulk_ireq->bulkstat_flags, from
   xfs_bulk_ireq->reserved[] to hold bulkstat-specific operational flags.  As of
   this commit, XFS_BULK_IREQ_BULKSTAT_NREXT64 is the only valid flag that this
   field can hold. It indicates that userspace has the necessary infrastructure
   to receive 64-bit extent counters.
3. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
   xfs_bulk_ireq->bulkstat_flags has valid flags set.

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_fs.h | 19 ++++++++++++++-----
 fs/xfs/xfs_ioctl.c     |  7 +++++++
 fs/xfs/xfs_itable.c    | 25 +++++++++++++++++++++++--
 fs/xfs/xfs_itable.h    |  2 ++
 fs/xfs/xfs_iwalk.h     |  7 +++++--
 5 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 2594fb647384..b76906914d89 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -394,7 +394,7 @@ struct xfs_bulkstat {
 	uint32_t	bs_extsize_blks; /* extent size hint, blocks	*/
 
 	uint32_t	bs_nlink;	/* number of links		*/
-	uint32_t	bs_extents;	/* number of extents		*/
+	uint32_t	bs_extents32;	/* 32-bit data fork extent counter */
 	uint32_t	bs_aextents;	/* attribute number of extents	*/
 	uint16_t	bs_version;	/* structure version		*/
 	uint16_t	bs_forkoff;	/* inode fork offset in bytes	*/
@@ -403,8 +403,9 @@ struct xfs_bulkstat {
 	uint16_t	bs_checked;	/* checked inode metadata	*/
 	uint16_t	bs_mode;	/* type and mode		*/
 	uint16_t	bs_pad2;	/* zeroed			*/
+	uint64_t	bs_extents64;	/* 64-bit data fork extent counter */
 
-	uint64_t	bs_pad[7];	/* zeroed			*/
+	uint64_t	bs_pad[6];	/* zeroed			*/
 };
 
 #define XFS_BULKSTAT_VERSION_V1	(1)
@@ -469,7 +470,8 @@ struct xfs_bulk_ireq {
 	uint32_t	icount;		/* I: count of entries in buffer */
 	uint32_t	ocount;		/* O: count of entries filled out */
 	uint32_t	agno;		/* I: see comment for IREQ_AGNO	*/
-	uint64_t	reserved[5];	/* must be zero			*/
+	uint64_t	bulkstat_flags; /* I: Bulkstat operation flags */
+	uint64_t	reserved[4];	/* must be zero			*/
 };
 
 /*
@@ -492,9 +494,16 @@ struct xfs_bulk_ireq {
  */
 #define XFS_BULK_IREQ_METADIR	(1 << 2)
 
-#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO | \
+#define XFS_BULK_IREQ_BULKSTAT	(1 << 3)
+
+#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO |	 \
 				 XFS_BULK_IREQ_SPECIAL | \
-				 XFS_BULK_IREQ_METADIR)
+				 XFS_BULK_IREQ_METADIR | \
+				 XFS_BULK_IREQ_BULKSTAT)
+
+#define XFS_BULK_IREQ_BULKSTAT_NREXT64 (1 << 0)
+
+#define XFS_BULK_IREQ_BULKSTAT_FLAGS_ALL (XFS_BULK_IREQ_BULKSTAT_NREXT64)
 
 /* Operate on the root directory inode. */
 #define XFS_BULK_IREQ_SPECIAL_ROOT	(1)
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 4077862fa806..207c96bbc729 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -839,6 +839,10 @@ xfs_bulk_ireq_setup(
 {
 	if (hdr->icount == 0 ||
 	    (hdr->flags & ~XFS_BULK_IREQ_FLAGS_ALL) ||
+	    ((hdr->flags & XFS_BULK_IREQ_BULKSTAT) &&
+	     (hdr->bulkstat_flags & ~XFS_BULK_IREQ_BULKSTAT_FLAGS_ALL)) ||
+	    (!(hdr->flags & XFS_BULK_IREQ_BULKSTAT) &&
+	     (hdr->bulkstat_flags != 0)) ||
 	    memchr_inv(hdr->reserved, 0, sizeof(hdr->reserved)))
 		return -EINVAL;
 
@@ -897,6 +901,9 @@ xfs_bulk_ireq_setup(
 	if (hdr->flags & XFS_BULK_IREQ_METADIR)
 		breq->flags |= XFS_IWALK_METADIR;
 
+	if (hdr->flags & XFS_BULK_IREQ_BULKSTAT)
+		if (hdr->bulkstat_flags & XFS_BULK_IREQ_BULKSTAT_NREXT64)
+			breq->flags |= XFS_IBULK_NREXT64;
 	return 0;
 }
 
diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c
index f92057ad686b..5dce090f8f65 100644
--- a/fs/xfs/xfs_itable.c
+++ b/fs/xfs/xfs_itable.c
@@ -20,6 +20,7 @@
 #include "xfs_icache.h"
 #include "xfs_health.h"
 #include "xfs_trans.h"
+#include "xfs_errortag.h"
 
 /*
  * Bulk Stat
@@ -74,6 +75,7 @@ xfs_bulkstat_one_int(
 	struct xfs_inode	*ip;		/* incore inode pointer */
 	struct inode		*inode;
 	struct xfs_bulkstat	*buf = bc->buf;
+	xfs_extnum_t		nextents;
 	int			error = -EINVAL;
 
 	error = xfs_iget(mp, tp, ino,
@@ -134,7 +136,26 @@ xfs_bulkstat_one_int(
 
 	buf->bs_xflags = xfs_ip2xflags(ip);
 	buf->bs_extsize_blks = ip->i_extsize;
-	buf->bs_extents = xfs_ifork_nextents(&ip->i_df);
+
+	nextents = xfs_ifork_nextents(&ip->i_df);
+	if (!(bc->breq->flags & XFS_IBULK_NREXT64)) {
+		xfs_extnum_t max_nextents = XFS_IFORK_EXTCNT_MAXS32;
+
+		if (unlikely(XFS_TEST_ERROR(false, mp,
+				XFS_ERRTAG_REDUCE_MAX_IEXTENTS)))
+			max_nextents = 10;
+
+		if (nextents > max_nextents) {
+			xfs_iunlock(ip, XFS_ILOCK_SHARED);
+			xfs_irele(ip);
+			error = -EINVAL;
+			goto out_advance;
+		}
+		buf->bs_extents32 = nextents;
+	} else {
+		buf->bs_extents64 = nextents;
+	}
+
 	xfs_bulkstat_health(ip, buf);
 	buf->bs_aextents = xfs_ifork_nextents(ip->i_afp);
 	buf->bs_forkoff = XFS_IFORK_BOFF(ip);
@@ -356,7 +377,7 @@ xfs_bulkstat_to_bstat(
 	bs1->bs_blocks = bstat->bs_blocks;
 	bs1->bs_xflags = bstat->bs_xflags;
 	bs1->bs_extsize = XFS_FSB_TO_B(mp, bstat->bs_extsize_blks);
-	bs1->bs_extents = bstat->bs_extents;
+	bs1->bs_extents = bstat->bs_extents32;
 	bs1->bs_gen = bstat->bs_gen;
 	bs1->bs_projid_lo = bstat->bs_projectid & 0xFFFF;
 	bs1->bs_forkoff = bstat->bs_forkoff;
diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
index f5a13f69883a..f61685da3837 100644
--- a/fs/xfs/xfs_itable.h
+++ b/fs/xfs/xfs_itable.h
@@ -22,6 +22,8 @@ struct xfs_ibulk {
 /* Signal that we can return metadata directories. */
 #define XFS_IBULK_METADIR	(XFS_IWALK_METADIR)
 
+#define XFS_IBULK_NREXT64	(XFS_IWALK_NREXT64)
+
 /*
  * Advance the user buffer pointer by one record of the given size.  If the
  * buffer is now full, return the appropriate error code.
diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
index d7a082e45cbf..27a6842a1bb5 100644
--- a/fs/xfs/xfs_iwalk.h
+++ b/fs/xfs/xfs_iwalk.h
@@ -31,8 +31,11 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
 /* Signal that we can return metadata directories. */
 #define XFS_IWALK_METADIR	(0x2)
 
-#define XFS_IWALK_FLAGS_ALL	(XFS_IWALK_SAME_AG | \
-				 XFS_IWALK_METADIR)
+#define XFS_IWALK_NREXT64	(0x4)
+
+#define XFS_IWALK_FLAGS_ALL	(XFS_IWALK_SAME_AG |	\
+				 XFS_IWALK_METADIR |	\
+				 XFS_IWALK_NREXT64)
 
 /* Walk all inode btree records in the filesystem starting from @startino. */
 typedef int (*xfs_inobt_walk_fn)(struct xfs_mount *mp, struct xfs_trans *tp,
-- 
2.30.2



* [PATCH V3 10/12] xfs: Extend per-inode extent counter widths
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (8 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL Chandan Babu R
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

This commit adds a new 64-bit per-inode data extent counter. However, the
maximum number of extents that a data fork can hold is limited to 2^48 - 1
(XFS_IFORK_EXTCNT_MAXU48). This feature is available only when the
XFS_SB_FEAT_INCOMPAT_NREXT64 feature bit is enabled on the filesystem. Enabling
this feature bit also causes the attr fork extent counter to use the 32-bit
field that previously held the data fork extent counter, so the attr fork can
now hold a maximum of 2^32 - 1 extents (XFS_IFORK_EXTCNT_MAXU32).

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c        |  8 ++-
 fs/xfs/libxfs/xfs_format.h      | 87 ++++++++++++++++++++++-----------
 fs/xfs/libxfs/xfs_fs.h          |  1 +
 fs/xfs/libxfs/xfs_ialloc.c      |  2 +
 fs/xfs/libxfs/xfs_inode_buf.c   | 25 +++++++++-
 fs/xfs/libxfs/xfs_inode_fork.h  | 18 +++++--
 fs/xfs/libxfs/xfs_log_format.h  |  3 +-
 fs/xfs/libxfs/xfs_sb.c          |  4 ++
 fs/xfs/libxfs/xfs_trans_inode.c |  6 +++
 fs/xfs/scrub/inode_repair.c     | 11 ++++-
 fs/xfs/xfs_inode.c              |  2 +-
 fs/xfs/xfs_inode.h              |  5 ++
 fs/xfs/xfs_inode_item.c         | 21 +++++++-
 fs/xfs/xfs_inode_item_recover.c | 26 +++++++---
 fs/xfs/xfs_mount.h              |  2 +
 15 files changed, 171 insertions(+), 50 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 1a716067901f..a77cf8619ec0 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -55,18 +55,16 @@ xfs_bmap_compute_maxlevels(
 	int		whichfork)	/* data or attr fork */
 {
 	xfs_extnum_t	maxleafents;	/* max leaf entries possible */
+	xfs_rfsblock_t	maxblocks;	/* max blocks at this level */
 	int		level;		/* btree level */
-	uint		maxblocks;	/* max blocks at this level */
 	int		maxrootrecs;	/* max records in root block */
 	int		minleafrecs;	/* min records in leaf block */
 	int		minnoderecs;	/* min records in node block */
 	int		sz;		/* root block size */
 
 	/*
-	 * The maximum number of extents in a file, hence the maximum number of
-	 * leaf entries, is controlled by the size of the on-disk extent count,
-	 * either a signed 32-bit number for the data fork, or a signed 16-bit
-	 * number for the attr fork.
+	 * The maximum number of extents in a fork, hence the maximum number of
+	 * leaf entries, is controlled by the size of the on-disk extent count.
 	 *
 	 * Note that we can no longer assume that if we are in ATTR1 that the
 	 * fork offset of all the inodes will be
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 87c927d912f6..7373ac8b890d 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -388,6 +388,7 @@ xfs_sb_has_ro_compat_feature(
 #define XFS_SB_FEAT_INCOMPAT_BIGTIME	(1 << 3)	/* large timestamps */
 #define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4)	/* needs xfs_repair */
 #define XFS_SB_FEAT_INCOMPAT_METADIR	(1 << 5)	/* metadata dir tree */
+#define XFS_SB_FEAT_INCOMPAT_NREXT64	(1 << 6)	/* 64-bit data fork extent counter */
 #define XFS_SB_FEAT_INCOMPAT_ALL \
 		(XFS_SB_FEAT_INCOMPAT_FTYPE|	\
 		 XFS_SB_FEAT_INCOMPAT_SPINODES|	\
@@ -802,6 +803,16 @@ typedef struct xfs_dinode {
 	__be64		di_size;	/* number of bytes in file */
 	__be64		di_nblocks;	/* # of direct & btree blocks used */
 	__be32		di_extsize;	/* basic/minimum extent size for file */
+
+	/*
+	 * On an nrext64 filesystem, di_nextents64 holds the data fork
+	 * extent count, di_nextents32 holds the attr fork extent count,
+	 * and di_nextents16 must be zero.
+	 *
+	 * Otherwise, di_nextents32 holds the data fork extent count,
+	 * di_nextents16 holds the attr fork extent count, and di_nextents64
+	 * must be zero.
+	 */
 	__be32		di_nextents32;	/* number of extents in data fork */
 	__be16		di_nextents16;	/* number of extents in attribute fork*/
 	__u8		di_forkoff;	/* attr fork offs, <<3 for 64b align */
@@ -820,7 +831,8 @@ typedef struct xfs_dinode {
 	__be64		di_lsn;		/* flush sequence */
 	__be64		di_flags2;	/* more random flags */
 	__be32		di_cowextsize;	/* basic cow extent size for file */
-	__u8		di_pad2[12];	/* more padding for future expansion */
+	__u8		di_pad2[4];	/* more padding for future expansion */
+	__be64		di_nextents64;	/* data fork extent count (nrext64) */
 
 	/* fields only written to during inode creation */
 	xfs_timestamp_t	di_crtime;	/* time created */
@@ -876,6 +888,8 @@ enum xfs_dinode_fmt {
  * Max values for extlen and disk inode's extent counters.
  */
 #define	MAXEXTLEN		((xfs_extlen_t)0x1fffff)	/* 21 bits */
+#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
+#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
 #define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
 #define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
 
@@ -931,32 +945,6 @@ enum xfs_dinode_fmt {
 		(dip)->di_format : \
 		(dip)->di_aformat)
 
-static inline int
-xfs_dfork_nextents(
-	struct xfs_dinode	*dip,
-	int			whichfork,
-	xfs_extnum_t		*nextents)
-{
-	int			error = 0;
-
-	switch (whichfork) {
-	case XFS_DATA_FORK:
-		*nextents = be32_to_cpu(dip->di_nextents32);
-		break;
-
-	case XFS_ATTR_FORK:
-		*nextents = be16_to_cpu(dip->di_nextents16);
-		break;
-
-	default:
-		ASSERT(0);
-		error = -EFSCORRUPTED;
-		break;
-	}
-
-	return error;
-}
-
 /*
  * For block and character special files the 32bit dev_t is stored at the
  * beginning of the data fork.
@@ -1023,6 +1011,7 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
 #define XFS_DIFLAG2_COWEXTSIZE_BIT   2  /* copy on write extent size hint */
 #define XFS_DIFLAG2_BIGTIME_BIT	3	/* big timestamps */
 #define XFS_DIFLAG2_METADATA_BIT 4	/* filesystem metadata */
+#define XFS_DIFLAG2_NREXT64_BIT 5	/* 64-bit extent counter enabled */
 
 #define XFS_DIFLAG2_DAX		(1 << XFS_DIFLAG2_DAX_BIT)
 #define XFS_DIFLAG2_REFLINK     (1 << XFS_DIFLAG2_REFLINK_BIT)
@@ -1053,10 +1042,12 @@ static inline void xfs_dinode_put_rdev(struct xfs_dinode *dip, xfs_dev_t rdev)
  * - Metadata directory entries must have correct ftype.
  */
 #define XFS_DIFLAG2_METADATA	(1 << XFS_DIFLAG2_METADATA_BIT)
+#define XFS_DIFLAG2_NREXT64	(1 << XFS_DIFLAG2_NREXT64_BIT)
+
 
 #define XFS_DIFLAG2_ANY \
 	(XFS_DIFLAG2_DAX | XFS_DIFLAG2_REFLINK | XFS_DIFLAG2_COWEXTSIZE | \
-	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_METADATA)
+	 XFS_DIFLAG2_BIGTIME | XFS_DIFLAG2_METADATA | XFS_DIFLAG2_NREXT64)
 
 static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 {
@@ -1064,6 +1055,46 @@ static inline bool xfs_dinode_has_bigtime(const struct xfs_dinode *dip)
 	       (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_BIGTIME));
 }
 
+static inline bool xfs_dinode_has_nrext64(const struct xfs_dinode *dip)
+{
+	return dip->di_version >= 3 &&
+	       (dip->di_flags2 & cpu_to_be64(XFS_DIFLAG2_NREXT64));
+}
+
+static inline int
+xfs_dfork_nextents(
+	struct xfs_dinode	*dip,
+	int			whichfork,
+	xfs_extnum_t		*nextents)
+{
+	int			error = 0;
+	bool			inode_has_nrext64;
+
+	inode_has_nrext64 = xfs_dinode_has_nrext64(dip);
+
+	if (inode_has_nrext64 && dip->di_nextents16 != 0)
+		return -EFSCORRUPTED;
+
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+		*nextents = inode_has_nrext64 ? be64_to_cpu(dip->di_nextents64) :
+			be32_to_cpu(dip->di_nextents32);
+		break;
+
+	case XFS_ATTR_FORK:
+		*nextents = inode_has_nrext64 ? be32_to_cpu(dip->di_nextents32) :
+			be16_to_cpu(dip->di_nextents16);
+		break;
+
+	default:
+		ASSERT(0);
+		error = -EFSCORRUPTED;
+		break;
+	}
+
+	return error;
+}
+
 /*
  * Inode number format:
  * low inopblog bits - offset in block
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index b76906914d89..3d0b679d96d7 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -254,6 +254,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
 #define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1 << 23) /* atomic swapext */
 #define XFS_FSOP_GEOM_FLAGS_METADIR	(1 << 24) /* metadata directories */
+#define XFS_FSOP_GEOM_FLAGS_NREXT64	(1 << 25) /* 64-bit extent counter */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 77119ea7d1ce..585743208392 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -2836,6 +2836,8 @@ xfs_ialloc_setup_geometry(
 	igeo->new_diflags2 = 0;
 	if (xfs_has_bigtime(mp))
 		igeo->new_diflags2 |= XFS_DIFLAG2_BIGTIME;
+	if (xfs_has_nrext64(mp))
+		igeo->new_diflags2 |= XFS_DIFLAG2_NREXT64;
 
 	/* Compute inode btree geometry. */
 	igeo->agino_log = sbp->sb_inopblog + sbp->sb_agblklog;
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 882ed4873afe..0ab332c913c4 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -285,6 +285,27 @@ xfs_inode_to_disk_ts(
 	return ts;
 }
 
+static inline void
+xfs_inode_to_disk_iext_counters(
+	struct xfs_inode	*ip,
+	struct xfs_dinode	*to)
+{
+	if (xfs_inode_has_nrext64(ip)) {
+		to->di_nextents64 = cpu_to_be64(xfs_ifork_nextents(&ip->i_df));
+		to->di_nextents32 = cpu_to_be32(xfs_ifork_nextents(ip->i_afp));
+		/*
+		 * We might be upgrading the inode to use wider extent counters
+		 * than was previously used. Hence zero the unused field.
+		 */
+		to->di_nextents16 = cpu_to_be16(0);
+	} else {
+		if (xfs_has_v3inodes(ip->i_mount))
+			to->di_nextents64 = 0;
+		to->di_nextents32 = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
+		to->di_nextents16 = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
+	}
+}
+
 void
 xfs_inode_to_disk(
 	struct xfs_inode	*ip,
@@ -313,8 +334,6 @@ xfs_inode_to_disk(
 	to->di_size = cpu_to_be64(ip->i_disk_size);
 	to->di_nblocks = cpu_to_be64(ip->i_nblocks);
 	to->di_extsize = cpu_to_be32(ip->i_extsize);
-	to->di_nextents32 = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
-	to->di_nextents16 = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
 	to->di_forkoff = ip->i_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_flags = cpu_to_be16(ip->i_diflags);
@@ -334,6 +353,8 @@ xfs_inode_to_disk(
 		to->di_version = 2;
 		to->di_flushiter = cpu_to_be16(ip->i_flushiter);
 	}
+
+	xfs_inode_to_disk_iext_counters(ip, to);
 }
 
 static xfs_failaddr_t
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 4b9df10e8eea..f8a85ba6e9e9 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -136,10 +136,22 @@ static inline int8_t xfs_ifork_format(struct xfs_ifork *ifp)
 static inline xfs_extnum_t xfs_iext_max_nextents(struct xfs_mount *mp,
 		int whichfork)
 {
-	if (whichfork == XFS_DATA_FORK || whichfork == XFS_COW_FORK)
-		return XFS_IFORK_EXTCNT_MAXS32;
+	bool has_64bit_extcnt = xfs_has_nrext64(mp);
 
-	return XFS_IFORK_EXTCNT_MAXS16;
+	switch (whichfork) {
+	case XFS_DATA_FORK:
+	case XFS_COW_FORK:
+		return has_64bit_extcnt ? XFS_IFORK_EXTCNT_MAXU48
+			: XFS_IFORK_EXTCNT_MAXS32;
+
+	case XFS_ATTR_FORK:
+		return has_64bit_extcnt ? XFS_IFORK_EXTCNT_MAXU32
+			: XFS_IFORK_EXTCNT_MAXS16;
+
+	default:
+		ASSERT(0);
+		return 0;
+	}
 }
 
 struct xfs_ifork *xfs_ifork_alloc(enum xfs_dinode_fmt format,
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 9f352ff4352b..de4bcb94c732 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -429,7 +429,8 @@ struct xfs_log_dinode {
 
 	uint64_t	di_flags2;	/* more random flags */
 	uint32_t	di_cowextsize;	/* basic cow extent size for file */
-	uint8_t		di_pad2[12];	/* more padding for future expansion */
+	uint8_t		di_pad2[4];	/* more padding for future expansion */
+	uint64_t	di_nextents64; /* data fork extent count (nrext64) */
 
 	/* fields only written to during inode creation */
 	xfs_log_timestamp_t di_crtime;	/* time created */
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index ffff91081036..a6b84893ebda 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -126,6 +126,8 @@ xfs_sb_version_to_features(
 		features |= XFS_FEAT_NEEDSREPAIR;
 	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_METADIR)
 		features |= XFS_FEAT_METADIR;
+	if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NREXT64)
+		features |= XFS_FEAT_NREXT64;
 
 	if (sbp->sb_features_log_incompat & XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP)
 		features |= XFS_FEAT_ATOMIC_SWAP;
@@ -1175,6 +1177,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
 	if (xfs_has_metadir(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_METADIR;
+	if (xfs_has_nrext64(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
diff --git a/fs/xfs/libxfs/xfs_trans_inode.c b/fs/xfs/libxfs/xfs_trans_inode.c
index 6a3a869635bf..ac622097243a 100644
--- a/fs/xfs/libxfs/xfs_trans_inode.c
+++ b/fs/xfs/libxfs/xfs_trans_inode.c
@@ -144,6 +144,12 @@ xfs_trans_log_inode(
 		flags |= XFS_ILOG_CORE;
 	}
 
+	if ((flags & XFS_ILOG_CORE) &&
+	    xfs_has_nrext64(ip->i_mount) &&
+	    !xfs_inode_has_nrext64(ip)) {
+		ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	}
+
 	/*
 	 * Inode verifiers do not check that the extent size hint is an integer
 	 * multiple of the rt extent size on a directory with both rtinherit
diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c
index 133109d84b98..995bad2cedd6 100644
--- a/fs/xfs/scrub/inode_repair.c
+++ b/fs/xfs/scrub/inode_repair.c
@@ -740,7 +740,10 @@ xrep_dinode_zap_dfork(
 {
 	trace_xrep_dinode_zap_dfork(sc, dip);
 
-	dip->di_nextents32 = 0;
+	if (xfs_dinode_has_nrext64(dip))
+		dip->di_nextents64 = 0;
+	else
+		dip->di_nextents32 = 0;
 
 	/* Special files always get reset to DEV */
 	switch (mode & S_IFMT) {
@@ -827,7 +830,11 @@ xrep_dinode_zap_afork(
 	trace_xrep_dinode_zap_afork(sc, dip);
 
 	dip->di_aformat = XFS_DINODE_FMT_EXTENTS;
-	dip->di_nextents16 = 0;
+
+	if (xfs_dinode_has_nrext64(dip))
+		dip->di_nextents32 = 0;
+	else
+		dip->di_nextents16 = 0;
 
 	dip->di_forkoff = 0;
 	dip->di_mode = cpu_to_be16(mode & ~0777);
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 6338a93b975c..3c969803e671 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2539,7 +2539,7 @@ xfs_iflush(
 				ip->i_nblocks, mp, XFS_ERRTAG_IFLUSH_5)) {
 		xfs_alert_tag(mp, XFS_PTAG_IFLUSH,
 			"%s: detected corrupt incore inode %llu, "
-			"total extents = %llu nblocks = %lld, ptr "PTR_FMT,
+			"total extents = %llu, nblocks = %lld, ptr "PTR_FMT,
 			__func__, ip->i_ino,
 			ip->i_df.if_nextents + xfs_ifork_nextents(ip->i_afp),
 			ip->i_nblocks, ip);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index b0114c8cef76..348b1dbe42c0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -230,6 +230,11 @@ static inline bool xfs_inode_has_bigrtextents(struct xfs_inode *ip)
 	return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
 }
 
+static inline bool xfs_inode_has_nrext64(struct xfs_inode *ip)
+{
+	return ip->i_diflags2 & XFS_DIFLAG2_NREXT64;
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_inode_item.c b/fs/xfs/xfs_inode_item.c
index e4800a965670..5c318aaecff4 100644
--- a/fs/xfs/xfs_inode_item.c
+++ b/fs/xfs/xfs_inode_item.c
@@ -358,6 +358,23 @@ xfs_copy_dm_fields_to_log_dinode(
 	}
 }
 
+static inline void
+xfs_inode_to_log_dinode_iext_counters(
+	struct xfs_inode	*ip,
+	struct xfs_log_dinode	*to)
+{
+	if (xfs_inode_has_nrext64(ip)) {
+		to->di_nextents64 = xfs_ifork_nextents(&ip->i_df);
+		to->di_nextents32 = xfs_ifork_nextents(ip->i_afp);
+		to->di_nextents16 = 0;
+	} else {
+		if (xfs_has_v3inodes(ip->i_mount))
+			to->di_nextents64 = 0;
+		to->di_nextents32 = xfs_ifork_nextents(&ip->i_df);
+		to->di_nextents16 = xfs_ifork_nextents(ip->i_afp);
+	}
+}
+
 static void
 xfs_inode_to_log_dinode(
 	struct xfs_inode	*ip,
@@ -385,8 +402,6 @@ xfs_inode_to_log_dinode(
 	to->di_size = ip->i_disk_size;
 	to->di_nblocks = ip->i_nblocks;
 	to->di_extsize = ip->i_extsize;
-	to->di_nextents32 = xfs_ifork_nextents(&ip->i_df);
-	to->di_nextents16 = xfs_ifork_nextents(ip->i_afp);
 	to->di_forkoff = ip->i_forkoff;
 	to->di_aformat = xfs_ifork_format(ip->i_afp);
 	to->di_flags = ip->i_diflags;
@@ -411,6 +426,8 @@ xfs_inode_to_log_dinode(
 		to->di_version = 2;
 		to->di_flushiter = ip->i_flushiter;
 	}
+
+	xfs_inode_to_log_dinode_iext_counters(ip, to);
 }
 
 /*
diff --git a/fs/xfs/xfs_inode_item_recover.c b/fs/xfs/xfs_inode_item_recover.c
index c21fb3d2ddca..980d6615f6f2 100644
--- a/fs/xfs/xfs_inode_item_recover.c
+++ b/fs/xfs/xfs_inode_item_recover.c
@@ -167,8 +167,6 @@ xfs_log_dinode_to_disk(
 	to->di_size = cpu_to_be64(from->di_size);
 	to->di_nblocks = cpu_to_be64(from->di_nblocks);
 	to->di_extsize = cpu_to_be32(from->di_extsize);
-	to->di_nextents32 = cpu_to_be32(from->di_nextents32);
-	to->di_nextents16 = cpu_to_be16(from->di_nextents16);
 	to->di_forkoff = from->di_forkoff;
 	to->di_aformat = from->di_aformat;
 	to->di_dmevmask = cpu_to_be32(from->di_dmevmask);
@@ -182,12 +180,17 @@ xfs_log_dinode_to_disk(
 							  from->di_crtime);
 		to->di_flags2 = cpu_to_be64(from->di_flags2);
 		to->di_cowextsize = cpu_to_be32(from->di_cowextsize);
+		to->di_nextents64 = cpu_to_be64(from->di_nextents64);
+		to->di_nextents32 = cpu_to_be32(from->di_nextents32);
+		to->di_nextents16 = cpu_to_be16(from->di_nextents16);
 		to->di_ino = cpu_to_be64(from->di_ino);
 		to->di_lsn = cpu_to_be64(lsn);
 		memcpy(to->di_pad2, from->di_pad2, sizeof(to->di_pad2));
 		uuid_copy(&to->di_uuid, &from->di_uuid);
 		to->di_flushiter = 0;
 	} else {
+		to->di_nextents32 = cpu_to_be32(from->di_nextents32);
+		to->di_nextents16 = cpu_to_be16(from->di_nextents16);
 		to->di_flushiter = cpu_to_be16(from->di_flushiter);
 	}
 }
@@ -203,6 +206,8 @@ xlog_recover_inode_commit_pass2(
 	struct xfs_mount		*mp = log->l_mp;
 	struct xfs_buf			*bp;
 	struct xfs_dinode		*dip;
+	xfs_extnum_t                    nextents;
+	xfs_aextnum_t                   anextents;
 	int				len;
 	char				*src;
 	char				*dest;
@@ -342,16 +347,25 @@ xlog_recover_inode_commit_pass2(
 			goto out_release;
 		}
 	}
-	if (unlikely(ldip->di_nextents32 + ldip->di_nextents16 > ldip->di_nblocks)) {
+
+	if (xfs_has_v3inodes(mp) &&
+		ldip->di_flags2 & XFS_DIFLAG2_NREXT64) {
+		nextents = ldip->di_nextents64;
+		anextents = ldip->di_nextents32;
+	} else {
+		nextents = ldip->di_nextents32;
+		anextents = ldip->di_nextents16;
+	}
+
+	if (unlikely(nextents + anextents > ldip->di_nblocks)) {
 		XFS_CORRUPTION_ERROR("xlog_recover_inode_pass2(5)",
 				     XFS_ERRLEVEL_LOW, mp, ldip,
 				     sizeof(*ldip));
 		xfs_alert(mp,
 	"%s: Bad inode log record, rec ptr "PTR_FMT", dino ptr "PTR_FMT", "
-	"dino bp "PTR_FMT", ino %Ld, total extents = %d, nblocks = %Ld",
+	"dino bp "PTR_FMT", ino %Ld, total extents = %llu, nblocks = %Ld",
 			__func__, item, dip, bp, in_f->ilf_ino,
-			ldip->di_nextents32 + ldip->di_nextents16,
-			ldip->di_nblocks);
+			nextents + anextents, ldip->di_nblocks);
 		error = -EFSCORRUPTED;
 		goto out_release;
 	}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index a4c149670476..f558d5c4a5f1 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -288,6 +288,7 @@ typedef struct xfs_mount {
 #define XFS_FEAT_NEEDSREPAIR	(1ULL << 25)	/* needs xfs_repair */
 #define XFS_FEAT_ATOMIC_SWAP	(1ULL << 26)	/* extent swap log items */
 #define XFS_FEAT_METADIR	(1ULL << 27)	/* metadata directory tree */
+#define XFS_FEAT_NREXT64	(1ULL << 28)	/* 64-bit inode extent counters */
 
 /* Mount features */
 #define XFS_FEAT_NOATTR2	(1ULL << 48)	/* disable attr2 creation */
@@ -370,6 +371,7 @@ __XFS_HAS_FEAT(bigtime, BIGTIME)
 __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR)
 __XFS_LOG_FEAT(atomicswap, ATOMIC_SWAP)
 __XFS_HAS_FEAT(metadir, METADIR)
+__XFS_HAS_FEAT(nrext64, NREXT64)
 
 /*
  * Decide if this filesystem can use log-assisted ("atomic") extent swapping.
-- 
2.30.2



* [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (9 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 10/12] xfs: Extend per-inode extent counter widths Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
  2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
  12 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

This commit adds XFS_SB_FEAT_INCOMPAT_NREXT64 to the list of supported
incompat feature flags.
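
For illustration only (not part of the patch), the effect of extending the
supported mask can be sketched as below. The sketch_* names are hypothetical,
and the "old" mask assumes the previously supported incompat flags occupy
bits 0-5. The NREXT64 bit value is the one defined earlier in this series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit value as defined for XFS_SB_FEAT_INCOMPAT_NREXT64 in this series. */
#define SKETCH_INCOMPAT_NREXT64		(1u << 6)

/* Assumption: the flags supported before this patch occupy bits 0-5. */
#define SKETCH_INCOMPAT_OLD_ALL		0x3fu
#define SKETCH_INCOMPAT_NEW_ALL		(SKETCH_INCOMPAT_OLD_ALL | \
					 SKETCH_INCOMPAT_NREXT64)

/* True when the superblock carries an incompat bit outside "supported". */
static bool sketch_has_unknown_incompat(uint32_t sb_features_incompat,
					uint32_t supported)
{
	return (sb_features_incompat & ~supported) != 0;
}
```

Under these assumptions, a superblock with the NREXT64 bit set is flagged as
carrying an unknown feature when checked against the old mask and passes
against the new one.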

Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 7373ac8b890d..7d08bb0fe510 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -395,7 +395,8 @@ xfs_sb_has_ro_compat_feature(
 		 XFS_SB_FEAT_INCOMPAT_META_UUID| \
 		 XFS_SB_FEAT_INCOMPAT_BIGTIME| \
 		 XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR| \
-		 XFS_SB_FEAT_INCOMPAT_METADIR)
+		 XFS_SB_FEAT_INCOMPAT_METADIR| \
+		 XFS_SB_FEAT_INCOMPAT_NREXT64)
 
 #define XFS_SB_FEAT_INCOMPAT_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_ALL
 static inline bool
-- 
2.30.2



* [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (10 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL Chandan Babu R
@ 2021-09-16 10:06 ` Chandan Babu R
  2021-09-28  0:33   ` Dave Chinner
  2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
  12 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-16 10:06 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, djwong

The maximum extent length depends on the maximum block count that can be stored
in a BMBT record. Hence this commit defines MAXEXTLEN based on
BMBT_BLOCKCOUNT_BITLEN.

While at it, the commit also renames MAXEXTLEN to XFS_MAX_EXTLEN.
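
For illustration only, deriving the limit from the bit width can be sketched
as below. The sketch_* names are hypothetical, the 21-bit width is taken from
the "21 bits" note on the old MAXEXTLEN definition, and the exact form of the
macro in the patch may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 21 bits, matching the "21 bits" note on MAXEXTLEN in xfs_format.h. */
#define SKETCH_BLOCKCOUNT_BITLEN	21

/* Largest value that a SKETCH_BLOCKCOUNT_BITLEN-bit field can store. */
#define SKETCH_MAX_EXTLEN \
	((uint32_t)((1ULL << SKETCH_BLOCKCOUNT_BITLEN) - 1))

/* Two adjacent extents may merge only if the result fits in one record. */
static bool sketch_can_merge(uint64_t left_len, uint64_t new_len)
{
	assert(left_len <= SKETCH_MAX_EXTLEN && new_len <= SKETCH_MAX_EXTLEN);
	return left_len + new_len <= SKETCH_MAX_EXTLEN;
}
```

With a 21-bit width the derived value is 0x1fffff, the same as the
hard-coded constant it replaces, so the contiguity checks in the hunks below
behave as before.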

Suggested-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 59 +++++++++++++++++-----------------
 fs/xfs/libxfs/xfs_format.h     | 20 ++++++------
 fs/xfs/libxfs/xfs_inode_buf.c  |  4 +--
 fs/xfs/libxfs/xfs_rtbitmap.c   |  4 +--
 fs/xfs/libxfs/xfs_swapext.c    |  6 ++--
 fs/xfs/libxfs/xfs_trans_resv.c | 10 +++---
 fs/xfs/scrub/bmap.c            |  2 +-
 fs/xfs/scrub/bmap_repair.c     |  2 +-
 fs/xfs/scrub/repair.c          |  2 +-
 fs/xfs/xfs_bmap_util.c         | 14 ++++----
 fs/xfs/xfs_iomap.c             | 28 ++++++++--------
 11 files changed, 77 insertions(+), 74 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index a77cf8619ec0..fb10ea078361 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -548,7 +548,7 @@ __xfs_bmap_add_free(
 
 	ASSERT(bno != NULLFSBLOCK);
 	ASSERT(len > 0);
-	ASSERT(len <= MAXEXTLEN);
+	ASSERT(len <= XFS_MAX_EXTLEN);
 	ASSERT(!isnullstartblock(bno));
 	agno = XFS_FSB_TO_AGNO(mp, bno);
 	agbno = XFS_FSB_TO_AGBNO(mp, bno);
@@ -1504,7 +1504,7 @@ xfs_bmap_add_extent_delay_real(
 	    LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&
 	    LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&
 	    LEFT.br_state == new->br_state &&
-	    LEFT.br_blockcount + new->br_blockcount <= MAXEXTLEN)
+	    LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_EXTLEN)
 		state |= BMAP_LEFT_CONTIG;
 
 	/*
@@ -1522,13 +1522,13 @@ xfs_bmap_add_extent_delay_real(
 	    new_endoff == RIGHT.br_startoff &&
 	    new->br_startblock + new->br_blockcount == RIGHT.br_startblock &&
 	    new->br_state == RIGHT.br_state &&
-	    new->br_blockcount + RIGHT.br_blockcount <= MAXEXTLEN &&
+	    new->br_blockcount + RIGHT.br_blockcount <= XFS_MAX_EXTLEN &&
 	    ((state & (BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
 		       BMAP_RIGHT_FILLING)) !=
 		      (BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
 		       BMAP_RIGHT_FILLING) ||
 	     LEFT.br_blockcount + new->br_blockcount + RIGHT.br_blockcount
-			<= MAXEXTLEN))
+			<= XFS_MAX_EXTLEN))
 		state |= BMAP_RIGHT_CONTIG;
 
 	error = 0;
@@ -2067,7 +2067,7 @@ xfs_bmap_add_extent_unwritten_real(
 	    LEFT.br_startoff + LEFT.br_blockcount == new->br_startoff &&
 	    LEFT.br_startblock + LEFT.br_blockcount == new->br_startblock &&
 	    LEFT.br_state == new->br_state &&
-	    LEFT.br_blockcount + new->br_blockcount <= MAXEXTLEN)
+	    LEFT.br_blockcount + new->br_blockcount <= XFS_MAX_EXTLEN)
 		state |= BMAP_LEFT_CONTIG;
 
 	/*
@@ -2085,13 +2085,13 @@ xfs_bmap_add_extent_unwritten_real(
 	    new_endoff == RIGHT.br_startoff &&
 	    new->br_startblock + new->br_blockcount == RIGHT.br_startblock &&
 	    new->br_state == RIGHT.br_state &&
-	    new->br_blockcount + RIGHT.br_blockcount <= MAXEXTLEN &&
+	    new->br_blockcount + RIGHT.br_blockcount <= XFS_MAX_EXTLEN &&
 	    ((state & (BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
 		       BMAP_RIGHT_FILLING)) !=
 		      (BMAP_LEFT_CONTIG | BMAP_LEFT_FILLING |
 		       BMAP_RIGHT_FILLING) ||
 	     LEFT.br_blockcount + new->br_blockcount + RIGHT.br_blockcount
-			<= MAXEXTLEN))
+			<= XFS_MAX_EXTLEN))
 		state |= BMAP_RIGHT_CONTIG;
 
 	/*
@@ -2600,15 +2600,15 @@ xfs_bmap_add_extent_hole_delay(
 	 */
 	if ((state & BMAP_LEFT_VALID) && (state & BMAP_LEFT_DELAY) &&
 	    left.br_startoff + left.br_blockcount == new->br_startoff &&
-	    left.br_blockcount + new->br_blockcount <= MAXEXTLEN)
+	    left.br_blockcount + new->br_blockcount <= XFS_MAX_EXTLEN)
 		state |= BMAP_LEFT_CONTIG;
 
 	if ((state & BMAP_RIGHT_VALID) && (state & BMAP_RIGHT_DELAY) &&
 	    new->br_startoff + new->br_blockcount == right.br_startoff &&
-	    new->br_blockcount + right.br_blockcount <= MAXEXTLEN &&
+	    new->br_blockcount + right.br_blockcount <= XFS_MAX_EXTLEN &&
 	    (!(state & BMAP_LEFT_CONTIG) ||
 	     (left.br_blockcount + new->br_blockcount +
-	      right.br_blockcount <= MAXEXTLEN)))
+	      right.br_blockcount <= XFS_MAX_EXTLEN)))
 		state |= BMAP_RIGHT_CONTIG;
 
 	/*
@@ -2751,17 +2751,17 @@ xfs_bmap_add_extent_hole_real(
 	    left.br_startoff + left.br_blockcount == new->br_startoff &&
 	    left.br_startblock + left.br_blockcount == new->br_startblock &&
 	    left.br_state == new->br_state &&
-	    left.br_blockcount + new->br_blockcount <= MAXEXTLEN)
+	    left.br_blockcount + new->br_blockcount <= XFS_MAX_EXTLEN)
 		state |= BMAP_LEFT_CONTIG;
 
 	if ((state & BMAP_RIGHT_VALID) && !(state & BMAP_RIGHT_DELAY) &&
 	    new->br_startoff + new->br_blockcount == right.br_startoff &&
 	    new->br_startblock + new->br_blockcount == right.br_startblock &&
 	    new->br_state == right.br_state &&
-	    new->br_blockcount + right.br_blockcount <= MAXEXTLEN &&
+	    new->br_blockcount + right.br_blockcount <= XFS_MAX_EXTLEN &&
 	    (!(state & BMAP_LEFT_CONTIG) ||
 	     left.br_blockcount + new->br_blockcount +
-	     right.br_blockcount <= MAXEXTLEN))
+	     right.br_blockcount <= XFS_MAX_EXTLEN))
 		state |= BMAP_RIGHT_CONTIG;
 
 	error = 0;
@@ -3003,15 +3003,15 @@ xfs_bmap_extsize_align(
 
 	/*
 	 * For large extent hint sizes, the aligned extent might be larger than
-	 * MAXEXTLEN. In that case, reduce the size by an extsz so that it pulls
-	 * the length back under MAXEXTLEN. The outer allocation loops handle
-	 * short allocation just fine, so it is safe to do this. We only want to
-	 * do it when we are forced to, though, because it means more allocation
-	 * operations are required.
+	 * XFS_MAX_EXTLEN. In that case, reduce the size by an extsz so that it
+	 * pulls the length back under XFS_MAX_EXTLEN. The outer allocation
+	 * loops handle short allocation just fine, so it is safe to do this. We
+	 * only want to do it when we are forced to, though, because it means
+	 * more allocation operations are required.
 	 */
-	while (align_alen > MAXEXTLEN)
+	while (align_alen > XFS_MAX_EXTLEN)
 		align_alen -= extsz;
-	ASSERT(align_alen <= MAXEXTLEN);
+	ASSERT(align_alen <= XFS_MAX_EXTLEN);
 
 	/*
 	 * If the previous block overlaps with this proposed allocation
@@ -3101,9 +3101,9 @@ xfs_bmap_extsize_align(
 			return -EINVAL;
 	} else {
 		ASSERT(orig_off >= align_off);
-		/* see MAXEXTLEN handling above */
+		/* see XFS_MAX_EXTLEN handling above */
 		ASSERT(orig_end <= align_off + align_alen ||
-		       align_alen + extsz > MAXEXTLEN);
+		       align_alen + extsz > XFS_MAX_EXTLEN);
 	}
 
 #ifdef DEBUG
@@ -4070,7 +4070,7 @@ xfs_bmapi_reserve_delalloc(
 	 * Cap the alloc length. Keep track of prealloc so we know whether to
 	 * tag the inode before we return.
 	 */
-	alen = XFS_FILBLKS_MIN(len + prealloc, MAXEXTLEN);
+	alen = XFS_FILBLKS_MIN(len + prealloc, XFS_MAX_EXTLEN);
 	if (!eof)
 		alen = XFS_FILBLKS_MIN(alen, got->br_startoff - aoff);
 	if (prealloc && alen >= len)
@@ -4203,7 +4203,7 @@ xfs_bmapi_allocate(
 		if (!xfs_iext_peek_prev_extent(ifp, &bma->icur, &bma->prev))
 			bma->prev.br_startoff = NULLFILEOFF;
 	} else {
-		bma->length = XFS_FILBLKS_MIN(bma->length, MAXEXTLEN);
+		bma->length = XFS_FILBLKS_MIN(bma->length, XFS_MAX_EXTLEN);
 		if (!bma->eof)
 			bma->length = XFS_FILBLKS_MIN(bma->length,
 					bma->got.br_startoff - bma->offset);
@@ -4524,8 +4524,8 @@ xfs_bmapi_write(
 			 * xfs_extlen_t and therefore 32 bits. Hence we have to
 			 * check for 32-bit overflows and handle them here.
 			 */
-			if (len > (xfs_filblks_t)MAXEXTLEN)
-				bma.length = MAXEXTLEN;
+			if (len > (xfs_filblks_t)XFS_MAX_EXTLEN)
+				bma.length = XFS_MAX_EXTLEN;
 			else
 				bma.length = len;
 
@@ -4660,7 +4660,8 @@ xfs_bmapi_convert_delalloc(
 	bma.ip = ip;
 	bma.wasdel = true;
 	bma.offset = bma.got.br_startoff;
-	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount, MAXEXTLEN);
+	bma.length = max_t(xfs_filblks_t, bma.got.br_blockcount,
+			XFS_MAX_EXTLEN);
 	bma.minleft = xfs_bmapi_minleft(tp, ip, whichfork);
 
 	/*
@@ -4743,7 +4744,7 @@ xfs_bmapi_remap(
 
 	ifp = XFS_IFORK_PTR(ip, whichfork);
 	ASSERT(len > 0);
-	ASSERT(len <= (xfs_filblks_t)MAXEXTLEN);
+	ASSERT(len <= (xfs_filblks_t)XFS_MAX_EXTLEN);
 	ASSERT(xfs_isilocked(ip, XFS_ILOCK_EXCL));
 	ASSERT(!(flags & ~(XFS_BMAPI_ATTRFORK | XFS_BMAPI_PREALLOC |
 			   XFS_BMAPI_NORMAP)));
@@ -5716,7 +5717,7 @@ xfs_bmse_can_merge(
 	if ((left->br_startoff + left->br_blockcount != startoff) ||
 	    (left->br_startblock + left->br_blockcount != got->br_startblock) ||
 	    (left->br_state != got->br_state) ||
-	    (left->br_blockcount + got->br_blockcount > MAXEXTLEN))
+	    (left->br_blockcount + got->br_blockcount > XFS_MAX_EXTLEN))
 		return false;
 
 	return true;
diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 7d08bb0fe510..e0fb19761669 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -885,16 +885,6 @@ enum xfs_dinode_fmt {
 	{ XFS_DINODE_FMT_BTREE,		"btree" }, \
 	{ XFS_DINODE_FMT_UUID,		"uuid" }
 
-/*
- * Max values for extlen and disk inode's extent counters.
- */
-#define	MAXEXTLEN		((xfs_extlen_t)0x1fffff)	/* 21 bits */
-#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
-#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
-#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
-#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
-
-
 /*
  * Inode minimum and maximum sizes.
  */
@@ -1701,6 +1691,16 @@ typedef struct xfs_bmbt_rec {
 typedef uint64_t	xfs_bmbt_rec_base_t;	/* use this for casts */
 typedef xfs_bmbt_rec_t xfs_bmdr_rec_t;
 
+/*
+ * Max values for extlen and disk inode's extent counters.
+ */
+#define XFS_MAX_EXTLEN		((xfs_extlen_t)(1 << BMBT_BLOCKCOUNT_BITLEN) - 1)
+#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
+#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
+#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
+#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
+
+
 /*
  * Values and macros for delayed-allocation startblock fields.
  */
diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
index 0ab332c913c4..1b27095b423d 100644
--- a/fs/xfs/libxfs/xfs_inode_buf.c
+++ b/fs/xfs/libxfs/xfs_inode_buf.c
@@ -747,7 +747,7 @@ xfs_inode_validate_extsize(
 	if (extsize_bytes % blocksize_bytes)
 		return __this_address;
 
-	if (extsize > MAXEXTLEN)
+	if (extsize > XFS_MAX_EXTLEN)
 		return __this_address;
 
 	if (!rt_flag && extsize > mp->m_sb.sb_agblocks / 2)
@@ -804,7 +804,7 @@ xfs_inode_validate_cowextsize(
 	if (cowextsize_bytes % mp->m_sb.sb_blocksize)
 		return __this_address;
 
-	if (cowextsize > MAXEXTLEN)
+	if (cowextsize > XFS_MAX_EXTLEN)
 		return __this_address;
 
 	if (cowextsize > mp->m_sb.sb_agblocks / 2)
diff --git a/fs/xfs/libxfs/xfs_rtbitmap.c b/fs/xfs/libxfs/xfs_rtbitmap.c
index 7b70ac58a1dc..0ed94079f72c 100644
--- a/fs/xfs/libxfs/xfs_rtbitmap.c
+++ b/fs/xfs/libxfs/xfs_rtbitmap.c
@@ -1015,7 +1015,7 @@ xfs_rtfree_extent(
 /*
  * Free some blocks in the realtime subvolume.  rtbno and rtlen are in units of
  * rt blocks, not rt extents; must be aligned to the rt extent size; and rtlen
- * cannot exceed MAXEXTLEN.
+ * cannot exceed XFS_MAX_EXTLEN.
  */
 int
 xfs_rtfree_blocks(
@@ -1028,7 +1028,7 @@ xfs_rtfree_blocks(
 	xfs_filblks_t		len;
 	xfs_extlen_t		mod;
 
-	ASSERT(rtlen <= MAXEXTLEN);
+	ASSERT(rtlen <= XFS_MAX_EXTLEN);
 
 	len = div_u64_rem(rtlen, mp->m_sb.sb_rextsize, &mod);
 	if (mod) {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 36c918776ba0..6a5564e534eb 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -757,7 +757,7 @@ can_merge(
 	if (b1->br_startoff   + b1->br_blockcount == b2->br_startoff &&
 	    b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
 	    b1->br_state			  == b2->br_state &&
-	    b1->br_blockcount + b2->br_blockcount <= MAXEXTLEN)
+	    b1->br_blockcount + b2->br_blockcount <= XFS_MAX_EXTLEN)
 		return true;
 
 	return false;
@@ -799,7 +799,7 @@ delta_nextents_step(
 		state |= CRIGHT_CONTIG;
 	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
 	    left->br_startblock + curr->br_startblock +
-					right->br_startblock > MAXEXTLEN)
+					right->br_startblock > XFS_MAX_EXTLEN)
 		state &= ~CRIGHT_CONTIG;
 
 	if (nhole)
@@ -810,7 +810,7 @@ delta_nextents_step(
 		state |= NRIGHT_CONTIG;
 	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
 	    left->br_startblock + new->br_startblock +
-					right->br_startblock > MAXEXTLEN)
+					right->br_startblock > XFS_MAX_EXTLEN)
 		state &= ~NRIGHT_CONTIG;
 
 	switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
diff --git a/fs/xfs/libxfs/xfs_trans_resv.c b/fs/xfs/libxfs/xfs_trans_resv.c
index b3de538ea7ce..0c165bf3c357 100644
--- a/fs/xfs/libxfs/xfs_trans_resv.c
+++ b/fs/xfs/libxfs/xfs_trans_resv.c
@@ -209,8 +209,8 @@ xfs_calc_inode_chunk_res(
 /*
  * Per-extent log reservation for the btree changes involved in freeing or
  * allocating a realtime extent.  We have to be able to log as many rtbitmap
- * blocks as needed to mark inuse MAXEXTLEN blocks' worth of realtime extents,
- * as well as the realtime summary block.
+ * blocks as needed to mark inuse XFS_MAX_EXTLEN blocks' worth of realtime
+ * extents, as well as the realtime summary block.
  */
 static unsigned int
 xfs_rtalloc_log_count(
@@ -220,7 +220,7 @@ xfs_rtalloc_log_count(
 	unsigned int		blksz = XFS_FSB_TO_B(mp, 1);
 	unsigned int		rtbmp_bytes;
 
-	rtbmp_bytes = (MAXEXTLEN / mp->m_sb.sb_rextsize) / NBBY;
+	rtbmp_bytes = (XFS_MAX_EXTLEN / mp->m_sb.sb_rextsize) / NBBY;
 	return (howmany(rtbmp_bytes, blksz) + 1) * num_ops;
 }
 
@@ -279,7 +279,7 @@ xfs_refcount_log_reservation(
  *    the inode's bmap btree: max depth * block size
  *    the agfs of the ags from which the extents are allocated: 2 * sector
  *    the superblock free block counter: sector size
- *    the realtime bitmap: ((MAXEXTLEN / rtextsize) / NBBY) bytes
+ *    the realtime bitmap: ((XFS_MAX_EXTLEN / rtextsize) / NBBY) bytes
  *    the realtime summary: 1 block
  *    the allocation btrees: 2 trees * (2 * max depth - 1) * block size
  * And the bmap_finish transaction can free bmap blocks in a join (t3):
@@ -334,7 +334,7 @@ xfs_calc_write_reservation(
  *    the agf for each of the ags: 2 * sector size
  *    the agfl for each of the ags: 2 * sector size
  *    the super block to reflect the freed blocks: sector size
- *    the realtime bitmap: 2 exts * ((MAXEXTLEN / rtextsize) / NBBY) bytes
+ *    the realtime bitmap: 2 exts * ((XFS_MAX_EXTLEN / rtextsize) / NBBY) bytes
  *    the realtime summary: 2 exts * 1 block
  *    worst case split in allocation btrees per extent assuming 2 extents:
  *		2 exts * 2 trees * (2 * max depth - 1) * block size
diff --git a/fs/xfs/scrub/bmap.c b/fs/xfs/scrub/bmap.c
index 5fcaa2518799..d57509090788 100644
--- a/fs/xfs/scrub/bmap.c
+++ b/fs/xfs/scrub/bmap.c
@@ -427,7 +427,7 @@ xchk_bmap_iextent(
 				irec->br_startoff);
 
 	/* Make sure the extent points to a valid place. */
-	if (irec->br_blockcount > MAXEXTLEN)
+	if (irec->br_blockcount > XFS_MAX_EXTLEN)
 		xchk_fblock_set_corrupt(info->sc, info->whichfork,
 				irec->br_startoff);
 	if (info->is_rt &&
diff --git a/fs/xfs/scrub/bmap_repair.c b/fs/xfs/scrub/bmap_repair.c
index 471f67d7acb1..6f1edadbadac 100644
--- a/fs/xfs/scrub/bmap_repair.c
+++ b/fs/xfs/scrub/bmap_repair.c
@@ -92,7 +92,7 @@ xrep_bmap_from_rmap(
 
 	do {
 		irec.br_blockcount = min_t(xfs_filblks_t, blockcount,
-				MAXEXTLEN);
+				XFS_MAX_EXTLEN);
 		xfs_bmbt_disk_set_all(&rbe, &irec);
 
 		trace_xrep_bmap_found(rb->sc->ip, rb->whichfork, &irec);
diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c
index 5ea55a4f4c2b..67268a534a83 100644
--- a/fs/xfs/scrub/repair.c
+++ b/fs/xfs/scrub/repair.c
@@ -1127,7 +1127,7 @@ xrep_reap_extent(
 	xfs_agblock_t		agbno_next = agbno + len;
 	int			error = 0;
 
-	ASSERT(len <= MAXEXTLEN);
+	ASSERT(len <= XFS_MAX_EXTLEN);
 
 	if (sc->ip != NULL) {
 		/*
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index f43f1d434fe2..45a86b36c9dc 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -120,14 +120,14 @@ xfs_bmap_rtalloc(
 	 */
 	ralen = ap->length / mp->m_sb.sb_rextsize;
 	/*
-	 * If the old value was close enough to MAXEXTLEN that
+	 * If the old value was close enough to XFS_MAX_EXTLEN that
 	 * we rounded up to it, cut it back so it's valid again.
 	 * Note that if it's a really large request (bigger than
-	 * MAXEXTLEN), we don't hear about that number, and can't
+	 * XFS_MAX_EXTLEN), we don't hear about that number, and can't
 	 * adjust the starting point to match it.
 	 */
-	if (ralen * mp->m_sb.sb_rextsize >= MAXEXTLEN)
-		ralen = MAXEXTLEN / mp->m_sb.sb_rextsize;
+	if (ralen * mp->m_sb.sb_rextsize >= XFS_MAX_EXTLEN)
+		ralen = XFS_MAX_EXTLEN / mp->m_sb.sb_rextsize;
 
 	/*
 	 * Lock out modifications to both the RT bitmap and summary inodes
@@ -841,9 +841,11 @@ xfs_alloc_file_space(
 		 * count, hence we need to limit the number of blocks we are
 		 * trying to reserve to avoid an overflow. We can't allocate
 		 * more than @nimaps extents, and an extent is limited on disk
-		 * to MAXEXTLEN (21 bits), so use that to enforce the limit.
+		 * to XFS_MAX_EXTLEN (21 bits), so use that to enforce the
+		 * limit.
 		 */
-		resblks = min_t(xfs_fileoff_t, (e - s), (MAXEXTLEN * nimaps));
+		resblks = min_t(xfs_fileoff_t, (e - s),
+				(XFS_MAX_EXTLEN * nimaps));
 		if (unlikely(rt)) {
 			dblocks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
 			rblocks = resblks;
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2c8eee2fe5be..e5e5d1482ff2 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -399,7 +399,7 @@ xfs_iomap_prealloc_size(
 	 */
 	plen = prev.br_blockcount;
 	while (xfs_iext_prev_extent(ifp, &ncur, &got)) {
-		if (plen > MAXEXTLEN / 2 ||
+		if (plen > XFS_MAX_EXTLEN / 2 ||
 		    isnullstartblock(got.br_startblock) ||
 		    got.br_startoff + got.br_blockcount != prev.br_startoff ||
 		    got.br_startblock + got.br_blockcount != prev.br_startblock)
@@ -411,23 +411,23 @@ xfs_iomap_prealloc_size(
 	/*
 	 * If the size of the extents is greater than half the maximum extent
 	 * length, then use the current offset as the basis.  This ensures that
-	 * for large files the preallocation size always extends to MAXEXTLEN
-	 * rather than falling short due to things like stripe unit/width
-	 * alignment of real extents.
+	 * for large files the preallocation size always extends to
+	 * XFS_MAX_EXTLEN rather than falling short due to things like stripe
+	 * unit/width alignment of real extents.
 	 */
 	alloc_blocks = plen * 2;
-	if (alloc_blocks > MAXEXTLEN)
+	if (alloc_blocks > XFS_MAX_EXTLEN)
 		alloc_blocks = XFS_B_TO_FSB(mp, offset);
 	qblocks = alloc_blocks;
 
 	/*
-	 * MAXEXTLEN is not a power of two value but we round the prealloc down
-	 * to the nearest power of two value after throttling. To prevent the
-	 * round down from unconditionally reducing the maximum supported
-	 * prealloc size, we round up first, apply appropriate throttling,
-	 * round down and cap the value to MAXEXTLEN.
+	 * XFS_MAX_EXTLEN is not a power of two value but we round the prealloc
+	 * down to the nearest power of two value after throttling. To prevent
+	 * the round down from unconditionally reducing the maximum supported
+	 * prealloc size, we round up first, apply appropriate throttling, round
+	 * down and cap the value to XFS_MAX_EXTLEN.
 	 */
-	alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(MAXEXTLEN),
+	alloc_blocks = XFS_FILEOFF_MIN(roundup_pow_of_two(XFS_MAX_EXTLEN),
 				       alloc_blocks);
 
 	freesp = percpu_counter_read_positive(&mp->m_fdblocks);
@@ -475,14 +475,14 @@ xfs_iomap_prealloc_size(
 	 */
 	if (alloc_blocks)
 		alloc_blocks = rounddown_pow_of_two(alloc_blocks);
-	if (alloc_blocks > MAXEXTLEN)
-		alloc_blocks = MAXEXTLEN;
+	if (alloc_blocks > XFS_MAX_EXTLEN)
+		alloc_blocks = XFS_MAX_EXTLEN;
 
 	/*
 	 * If we are still trying to allocate more space than is
 	 * available, squash the prealloc hard. This can happen if we
 	 * have a large file on a small filesystem and the above
-	 * lowspace thresholds are smaller than MAXEXTLEN.
+	 * lowspace thresholds are smaller than XFS_MAX_EXTLEN.
 	 */
 	while (alloc_blocks && alloc_blocks >= freesp)
 		alloc_blocks >>= 4;
-- 
2.30.2
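
The format change above swaps the literal 0x1fffff for a value derived from
BMBT_BLOCKCOUNT_BITLEN. The following is a minimal standalone sketch of that
arithmetic, not kernel code. It assumes BMBT_BLOCKCOUNT_BITLEN is 21 (as the
"21 bits" comment on the removed macro indicates) and uses a uint32_t
stand-in for xfs_extlen_t. clamp_extlen() is a hypothetical helper that only
mirrors the clamping pattern seen in xfs_bmapi_write().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel typedef; assumed here to be 32-bit unsigned. */
typedef uint32_t xfs_extlen_t;

/* Assumed width of the block count field in a bmbt record. */
#define BMBT_BLOCKCOUNT_BITLEN	21

/* Old hard-coded definition removed by the patch. */
#define MAXEXTLEN	((xfs_extlen_t)0x1fffff)

/* New definition added by the patch. */
#define XFS_MAX_EXTLEN	((xfs_extlen_t)(1 << BMBT_BLOCKCOUNT_BITLEN) - 1)

/* Compile-time check that the rename leaves the limit unchanged. */
static_assert(XFS_MAX_EXTLEN == MAXEXTLEN,
	      "derived limit must match the old literal");

/* Hypothetical helper: cap a 64-bit block count at the extent length limit. */
static inline xfs_extlen_t
clamp_extlen(
	uint64_t		len)
{
	if (len > (uint64_t)XFS_MAX_EXTLEN)
		return XFS_MAX_EXTLEN;
	return (xfs_extlen_t)len;
}
```

With these assumptions both macros evaluate to 2097151, so the callers
converted by the patch should see no behavioural change.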



* Re: [PATCH V3 00/12] xfs: Extend per-inode extent counters
  2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
                   ` (11 preceding siblings ...)
  2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
@ 2021-09-18  0:03 ` Darrick J. Wong
  2021-09-18  3:36   ` [External] : " Chandan Babu R
  12 siblings, 1 reply; 42+ messages in thread
From: Darrick J. Wong @ 2021-09-18  0:03 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs

On Thu, Sep 16, 2021 at 03:36:35PM +0530, Chandan Babu R wrote:
> The commit xfs: fix inode fork extent count overflow
> (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
> data fork extents should be possible to create. However the
> corresponding on-disk field has a signed 32-bit type. Hence this
> patchset extends the per-inode data extent counter to 64 bits out of
> which 48 bits are used to store the extent count. 
> 
> Also, XFS has an attr fork extent counter which is 16 bits wide. A
> workload which,
> 1. Creates 1 million 255-byte sized xattrs,
> 2. Deletes 50% of these xattrs in an alternating manner,
> 3. Tries to insert 400,000 new 255-byte sized xattrs
>    causes the xattr extent counter to overflow.
> 
> Dave tells me that there are instances where a single file has more
> than 100 million hardlinks. With parent pointers being stored in
> xattrs, we will overflow the signed 16-bits wide xattr extent counter
> when large number of hardlinks are created. Hence this patchset
> extends the on-disk field to 32-bits.
> 
> The following changes are made to accomplish this,
> 1. A new incompat superblock flag to prevent older kernels from mounting
>    the filesystem. This flag has to be set during mkfs time.
> 2. A new 64-bit inode field is created to hold the data extent
>    counter.
> 3. The existing 32-bit inode data extent counter will be used to hold
>    the attr fork extent counter.
> 
> The patchset has been tested by executing xfstests with the following
> mkfs.xfs options,
> 1. -m crc=0 -b size=1k
> 2. -m crc=0 -b size=4k
> 3. -m crc=0 -b size=512
> 4. -m rmapbt=1,reflink=1 -b size=1k
> 5. -m rmapbt=1,reflink=1 -b size=4k
> 
> Each of the above test scenarios were executed on the following
> combinations (For V4 FS test scenario, the last combination
> i.e. "Patched (enable extcnt64bit)", was omitted).
> |-------------------------------+-----------|
> | Xfsprogs                      | Kernel    |
> |-------------------------------+-----------|
> | Unpatched                     | Patched   |
> | Patched (disable extcnt64bit) | Unpatched |
> | Patched (disable extcnt64bit) | Patched   |
> | Patched (enable extcnt64bit)  | Patched   |
> |-------------------------------+-----------|
> 
> I have also written a test (yet to be converted into xfstests format)
> to check if the correct extent counter fields are updated with/without
> the new incompat flag. I have also fixed some of the existing fstests
> to work with the new extent counter fields.
> 
> Increasing data extent counter width also causes the maximum height of
> BMBT to increase. This requires that the macro XFS_BTREE_MAXLEVELS be
> updated with a larger value. However such a change causes the value of
> mp->m_rmap_maxlevels to increase which in turn causes log reservation
> sizes to increase and hence a modified XFS driver will fail to mount
> filesystems created by older versions of mkfs.xfs.
> 
> Hence this patchset is built on top of Darrick's btree-dynamic-depth
> branch which removes the macro XFS_BTREE_MAXLEVELS and computes
> mp->m_rmap_maxlevels based on the size of an AG.

I forward-ported /just/ that branch to a 5.16 dev branch and will send
that out, in case you wanted to add it to the head of your dev branch
and thereby escape relying on the bajillion patches in djwong-dev.

--D

> These patches can also be obtained from
> https://github.com/chandanr/linux.git at branch
> xfs-incompat-extend-extcnt-v3.
> 
> I will be posting the changes associated with xfsprogs separately.
> 
> Changelog:
> V2 -> V3:
> 1. Define maximum extent length as a function of
>    BMBT_BLOCKCOUNT_BITLEN.
> 2. Introduce xfs_iext_max_nextents() function in the patch series
>    before renaming MAXEXTNUM/MAXAEXTNUM. This is done to reduce
>    proliferation of macros indicating maximum extent count for data
>    and attribute forks.
> 3. Define xfs_dfork_nextents() as an inline function.
> 4. Use xfs_rfsblock_t as the data type for variables that hold block
>    count.
> 5. xfs_dfork_nextents() now returns -EFSCORRUPTED when an invalid fork
>    is passed as an argument.
> 6. The following changes are done to enable bulkstat ioctl to report
>    64-bit extent counters,
>    - Carve out a new 64-bit field xfs_bulkstat->bs_extents64 from
>      xfs_bulkstat->bs_pad[]. 
>    - Carve out a new 64-bit field xfs_bulk_ireq->bulkstat_flags from
>      xfs_bulk_ireq->reserved[] to hold bulkstat specific operational
>      flags. Introduce XFS_IBULK_NREXT64 flag to indicate that
>      userspace has the necessary infrastructure to receive 64-bit
>      extent counters.
>    - Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to
>      indicate that xfs_bulk_ireq->bulkstat_flags has valid flags set.
> 7. Rename the incompat flag from XFS_SB_FEAT_INCOMPAT_EXTCOUNT_64BIT
>    to XFS_SB_FEAT_INCOMPAT_NREXT64.
> 8. Add a new helper function xfs_inode_to_disk_iext_counters() to
>    convert from incore inode extent counters to ondisk inode extent
>    counters.
> 9. Reuse XFS_ERRTAG_REDUCE_MAX_IEXTENTS error tag to skip reporting
>    inodes with more than 10 extents when bulkstat ioctl is invoked by
>    userspace.
> 10. Introduce the new per-inode XFS_DIFLAG2_NREXT64 flag to indicate
>     that the inode uses 64-bit extent counter. This is used to allow
>     administrators to upgrade existing filesystems.
> 11. Export presence of XFS_SB_FEAT_INCOMPAT_NREXT64 feature to
>     userspace via XFS_IOC_FSGEOMETRY ioctl.
> 
> V1 -> V2:
> 1. Rebase patches on top of Darrick's btree-dynamic-depth branch.
> 2. Add new bulkstat ioctl version to support 64-bit data fork extent
>    counter field.
> 3. Introduce new error tag to verify if the old bulkstat ioctls skip
>    reporting inodes with large data fork extent counters.
> 
> Chandan Babu R (12):
>   xfs: Move extent count limits to xfs_format.h
>   xfs: Introduce xfs_iext_max_nextents() helper
>   xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32,
>     XFS_IFORK_EXTCNT_MAXS16
>   xfs: Use xfs_extnum_t instead of basic data types
>   xfs: Introduce xfs_dfork_nextents() helper
>   xfs: xfs_dfork_nextents: Return extent count via an out argument
>   xfs: Rename inode's extent counter fields based on their width
>   xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits
>     respectively
>   xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
>   xfs: Extend per-inode extent counter widths
>   xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL
>   xfs: Define max extent length based on on-disk format definition
> 
>  fs/xfs/libxfs/xfs_bmap.c        | 80 ++++++++++++++-------------
>  fs/xfs/libxfs/xfs_format.h      | 80 +++++++++++++++++++++++----
>  fs/xfs/libxfs/xfs_fs.h          | 20 +++++--
>  fs/xfs/libxfs/xfs_ialloc.c      |  2 +
>  fs/xfs/libxfs/xfs_inode_buf.c   | 61 ++++++++++++++++-----
>  fs/xfs/libxfs/xfs_inode_fork.c  | 32 +++++++----
>  fs/xfs/libxfs/xfs_inode_fork.h  | 23 +++++++-
>  fs/xfs/libxfs/xfs_log_format.h  |  7 +--
>  fs/xfs/libxfs/xfs_rtbitmap.c    |  4 +-
>  fs/xfs/libxfs/xfs_sb.c          |  4 ++
>  fs/xfs/libxfs/xfs_swapext.c     |  6 +--
>  fs/xfs/libxfs/xfs_trans_inode.c |  6 +++
>  fs/xfs/libxfs/xfs_trans_resv.c  | 10 ++--
>  fs/xfs/libxfs/xfs_types.h       | 11 +---
>  fs/xfs/scrub/attr_repair.c      |  2 +-
>  fs/xfs/scrub/bmap.c             |  2 +-
>  fs/xfs/scrub/bmap_repair.c      |  2 +-
>  fs/xfs/scrub/inode.c            | 96 ++++++++++++++++++++-------------
>  fs/xfs/scrub/inode_repair.c     | 71 +++++++++++++++++-------
>  fs/xfs/scrub/repair.c           |  2 +-
>  fs/xfs/scrub/trace.h            | 16 +++---
>  fs/xfs/xfs_bmap_util.c          | 14 ++---
>  fs/xfs/xfs_inode.c              |  4 +-
>  fs/xfs/xfs_inode.h              |  5 ++
>  fs/xfs/xfs_inode_item.c         | 21 +++++++-
>  fs/xfs/xfs_inode_item_recover.c | 26 ++++++---
>  fs/xfs/xfs_ioctl.c              |  7 +++
>  fs/xfs/xfs_iomap.c              | 28 +++++-----
>  fs/xfs/xfs_itable.c             | 25 ++++++++-
>  fs/xfs/xfs_itable.h             |  2 +
>  fs/xfs/xfs_iwalk.h              |  7 ++-
>  fs/xfs/xfs_mount.h              |  2 +
>  fs/xfs/xfs_trace.h              |  6 +--
>  33 files changed, 478 insertions(+), 206 deletions(-)
> 
> -- 
> 2.30.2
> 


* Re: [External] : Re: [PATCH V3 00/12] xfs: Extend per-inode extent counters
  2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
@ 2021-09-18  3:36   ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-18  3:36 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On 18 Sep 2021 at 05:33, Darrick J. Wong wrote:
> On Thu, Sep 16, 2021 at 03:36:35PM +0530, Chandan Babu R wrote:
>> The commit xfs: fix inode fork extent count overflow
>> (3f8a4f1d876d3e3e49e50b0396eaffcc4ba71b08) mentions that 10 billion
>> data fork extents should be possible to create. However the
>> corresponding on-disk field has a signed 32-bit type. Hence this
>> patchset extends the per-inode data extent counter to 64 bits out of
>> which 48 bits are used to store the extent count. 
>> 
>> Also, XFS has an attr fork extent counter which is 16 bits wide. A
>> workload which,
>> 1. Creates 1 million 255-byte sized xattrs,
>> 2. Deletes 50% of these xattrs in an alternating manner,
>> 3. Tries to insert 400,000 new 255-byte sized xattrs
>>    causes the xattr extent counter to overflow.
>> 
>> Dave tells me that there are instances where a single file has more
>> than 100 million hardlinks. With parent pointers being stored in
>> xattrs, we will overflow the signed 16-bits wide xattr extent counter
>> when large number of hardlinks are created. Hence this patchset
>> extends the on-disk field to 32-bits.
>> 
>> The following changes are made to accomplish this,
>> 1. A new incompat superblock flag to prevent older kernels from mounting
>>    the filesystem. This flag has to be set during mkfs time.
>> 2. A new 64-bit inode field is created to hold the data extent
>>    counter.
>> 3. The existing 32-bit inode data extent counter will be used to hold
>>    the attr fork extent counter.
>> 
>> The patchset has been tested by executing xfstests with the following
>> mkfs.xfs options,
>> 1. -m crc=0 -b size=1k
>> 2. -m crc=0 -b size=4k
>> 3. -m crc=0 -b size=512
>> 4. -m rmapbt=1,reflink=1 -b size=1k
>> 5. -m rmapbt=1,reflink=1 -b size=4k
>> 
>> Each of the above test scenarios were executed on the following
>> combinations (For V4 FS test scenario, the last combination
>> i.e. "Patched (enable extcnt64bit)", was omitted).
>> |-------------------------------+-----------|
>> | Xfsprogs                      | Kernel    |
>> |-------------------------------+-----------|
>> | Unpatched                     | Patched   |
>> | Patched (disable extcnt64bit) | Unpatched |
>> | Patched (disable extcnt64bit) | Patched   |
>> | Patched (enable extcnt64bit)  | Patched   |
>> |-------------------------------+-----------|
>> 
>> I have also written a test (yet to be converted into xfstests format)
>> to check if the correct extent counter fields are updated with/without
>> the new incompat flag. I have also fixed some of the existing fstests
>> to work with the new extent counter fields.
>> 
>> Increasing data extent counter width also causes the maximum height of
>> BMBT to increase. This requires that the macro XFS_BTREE_MAXLEVELS be
>> updated with a larger value. However such a change causes the value of
>> mp->m_rmap_maxlevels to increase which in turn causes log reservation
>> sizes to increase and hence a modified XFS driver will fail to mount
>> filesystems created by older versions of mkfs.xfs.
>> 
>> Hence this patchset is built on top of Darrick's btree-dynamic-depth
>> branch which removes the macro XFS_BTREE_MAXLEVELS and computes
>> mp->m_rmap_maxlevels based on the size of an AG.
>
> I forward-ported /just/ that branch to a 5.16 dev branch and will send
> that out, in case you wanted to add it to the head of your dev branch
> and thereby escape relying on the bajillion patches in djwong-dev.
>

Thanks for doing that. I will rebase my patchset on top of "xfs: support
dynamic btree cursor height" series.

-- 
chandan


* Re: [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper
  2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
@ 2021-09-27 22:46   ` Dave Chinner
  2021-09-28  9:46     ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-27 22:46 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:40PM +0530, Chandan Babu R wrote:
> This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
> xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
> value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
> counter fields will add more logic to this helper.
> 
> This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
> with calls to xfs_dfork_nextents().
> 
> No functional changes have been made.
> 
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++----
>  fs/xfs/libxfs/xfs_inode_buf.c  | 16 +++++++++-----
>  fs/xfs/libxfs/xfs_inode_fork.c | 10 +++++----
>  fs/xfs/scrub/inode.c           | 18 +++++++++-------
>  fs/xfs/scrub/inode_repair.c    | 38 +++++++++++++++++++++-------------
>  5 files changed, 75 insertions(+), 35 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index ed8a5354bcbf..b4638052801f 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -930,10 +930,30 @@ enum xfs_dinode_fmt {
>  	((w) == XFS_DATA_FORK ? \
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
> -#define XFS_DFORK_NEXTENTS(dip,w) \
> -	((w) == XFS_DATA_FORK ? \
> -		be32_to_cpu((dip)->di_nextents) : \
> -		be16_to_cpu((dip)->di_anextents))
> +
> +static inline xfs_extnum_t
> +xfs_dfork_nextents(
> +	struct xfs_dinode	*dip,
> +	int			whichfork)
> +{
> +	xfs_extnum_t		nextents = 0;
> +
> +	switch (whichfork) {
> +	case XFS_DATA_FORK:
> +		nextents = be32_to_cpu(dip->di_nextents);
> +		break;
> +

No need for whitespace line after the break, and this could just
return the value directly.

> +	case XFS_ATTR_FORK:
> +		nextents = be16_to_cpu(dip->di_anextents);
> +		break;
> +
> +	default:
> +		ASSERT(0);
> +		break;
> +	}
> +
> +	return nextents;
> +}

I think that all the conditional inode fork macros
should be moved to libxfs/xfs_inode_fork.h as they are converted.

These macros are not actually part of the on-disk format definition
(which is what xfs_format.h is supposed to contain) - it's code that
parses the on-disk format and that is supposed to be in
libxfs/xfs_inode_fork.[ch]....

Next thing: the caller almost always knows what fork it wants
the extents for - only 3 callers have a whichfork variable. So,
perhaps:

static inline xfs_extnum_t
xfs_dfork_data_extents(
	struct xfs_dinode	*dip)
{
	return be32_to_cpu(dip->di_nextents);
}

static inline xfs_extnum_t
xfs_dfork_attr_extents(
	struct xfs_dinode	*dip)
{
	return be16_to_cpu(dip->di_anextents);
}

static inline xfs_extnum_t
xfs_dfork_extents(
	struct xfs_dinode	*dip,
	int			whichfork)
{
	switch (whichfork) {
	case XFS_DATA_FORK:
		return xfs_dfork_data_extents(dip);
	case XFS_ATTR_FORK:
		return xfs_dfork_attr_extents(dip);
	default:
		ASSERT(0);
		break;
	}
	return 0;
}

So we don't have to rely on the compiler optimising away the switch
statement correctly to produce optimal code.

> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -342,9 +342,11 @@ xfs_dinode_verify_fork(
>  	struct xfs_mount	*mp,
>  	int			whichfork)
>  {
> -	xfs_extnum_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
> +	xfs_extnum_t		di_nextents;
>  	xfs_extnum_t		max_extents;
>  
> +	di_nextents = xfs_dfork_nextents(dip, whichfork);
> +
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
>  		/*
> @@ -474,6 +476,8 @@ xfs_dinode_verify(
>  	uint16_t		flags;
>  	uint64_t		flags2;
>  	uint64_t		di_size;
> +	xfs_extnum_t            nextents;
> +	xfs_rfsblock_t		nblocks;

That's a block number type, not a block count:

typedef uint64_t        xfs_rfsblock_t; /* blockno in filesystem (raw) */
....
typedef uint64_t        xfs_filblks_t;  /* number of blocks in a file */

The latter is the appropriate type to use here.

Oh, the struct xfs_inode and the struct xfs_log_dinode make
this same type mistake. Ok, that's a cleanup for another day....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
@ 2021-09-27 23:06   ` Dave Chinner
  2021-09-28  9:49     ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-27 23:06 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:44PM +0530, Chandan Babu R wrote:
> The following changes are made to enable userspace to obtain 64-bit extent
> counters,
> 1. To hold 64-bit extent counters, carve out the new 64-bit field
>    xfs_bulkstat->bs_extents64 from xfs_bulkstat->bs_pad[].
> 2. Carve out a new 64-bit field xfs_bulk_ireq->bulkstat_flags from
>    xfs_bulk_ireq->reserved[] to hold bulkstat specific operational flags.  As of
>    this commit, XFS_IBULK_NREXT64 is the only valid flag that this field can
>    hold. It indicates that userspace has the necessary infrastructure to
>    receive 64-bit extent counters.
> 3. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
>    xfs_bulk_ireq->bulkstat_flags has valid flags set.

This seems unnecessarily complex. It adds a new flag to signal that a
new flag field in the same structure is valid, and then defines a new
flag in that new flag field to select the new behaviour.

Why can't this be done with just a single new flag in the existing
flags field?
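
A rough sketch of what the single-flag alternative could look like. All
SKETCH_* names and bit values here are stand-ins made up for
illustration, not the names or values used by the patch or by xfs_fs.h:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the existing request flags; values are illustrative. */
#define SKETCH_IREQ_AGNO	(1U << 0)
#define SKETCH_IREQ_SPECIAL	(1U << 1)
#define SKETCH_IREQ_METADIR	(1U << 2)
/* Hypothetical: one new bit in the existing flags field. */
#define SKETCH_IREQ_NREXT64	(1U << 3)

#define SKETCH_IREQ_FLAGS_ALL	(SKETCH_IREQ_AGNO | \
				 SKETCH_IREQ_SPECIAL | \
				 SKETCH_IREQ_METADIR | \
				 SKETCH_IREQ_NREXT64)

/*
 * Reject unknown bits, then report whether the caller has asked for
 * 64-bit extent counters. No second flags field is needed.
 */
static int
sketch_ireq_check_flags(
	uint32_t	flags,
	int		*want_nrext64)
{
	assert(want_nrext64 != NULL);

	if (flags & ~SKETCH_IREQ_FLAGS_ALL)
		return -EINVAL;

	*want_nrext64 = !!(flags & SKETCH_IREQ_NREXT64);
	return 0;
}
```

With this shape, old userspace that never sets the new bit keeps getting
the 32-bit behaviour, and reserved[] stays untouched.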

> Suggested-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_fs.h | 19 ++++++++++++++-----
>  fs/xfs/xfs_ioctl.c     |  7 +++++++
>  fs/xfs/xfs_itable.c    | 25 +++++++++++++++++++++++--
>  fs/xfs/xfs_itable.h    |  2 ++
>  fs/xfs/xfs_iwalk.h     |  7 +++++--
>  5 files changed, 51 insertions(+), 9 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index 2594fb647384..b76906914d89 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -394,7 +394,7 @@ struct xfs_bulkstat {
>  	uint32_t	bs_extsize_blks; /* extent size hint, blocks	*/
>  
>  	uint32_t	bs_nlink;	/* number of links		*/
> -	uint32_t	bs_extents;	/* number of extents		*/
> +	uint32_t	bs_extents32;	/* 32-bit data fork extent counter */
>  	uint32_t	bs_aextents;	/* attribute number of extents	*/
>  	uint16_t	bs_version;	/* structure version		*/
>  	uint16_t	bs_forkoff;	/* inode fork offset in bytes	*/

I don't think renaming structure members is a good idea - it breaks
the user API and forces applications to require source level
modifications just to compile on both old and new xfsprogs installs.

> @@ -403,8 +403,9 @@ struct xfs_bulkstat {
>  	uint16_t	bs_checked;	/* checked inode metadata	*/
>  	uint16_t	bs_mode;	/* type and mode		*/
>  	uint16_t	bs_pad2;	/* zeroed			*/
> +	uint64_t	bs_extents64;	/* 64-bit data fork extent counter */
>  
> -	uint64_t	bs_pad[7];	/* zeroed			*/
> +	uint64_t	bs_pad[6];	/* zeroed			*/
>  };
>  
>  #define XFS_BULKSTAT_VERSION_V1	(1)
> @@ -469,7 +470,8 @@ struct xfs_bulk_ireq {
>  	uint32_t	icount;		/* I: count of entries in buffer */
>  	uint32_t	ocount;		/* O: count of entries filled out */
>  	uint32_t	agno;		/* I: see comment for IREQ_AGNO	*/
> -	uint64_t	reserved[5];	/* must be zero			*/
> +	uint64_t	bulkstat_flags; /* I: Bulkstat operation flags */
> +	uint64_t	reserved[4];	/* must be zero			*/
>  };
>  
>  /*
> @@ -492,9 +494,16 @@ struct xfs_bulk_ireq {
>   */
>  #define XFS_BULK_IREQ_METADIR	(1 << 2)
>  
> -#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO | \
> +#define XFS_BULK_IREQ_BULKSTAT	(1 << 3)
> +
> +#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO |	 \
>  				 XFS_BULK_IREQ_SPECIAL | \
> -				 XFS_BULK_IREQ_METADIR)
> +				 XFS_BULK_IREQ_METADIR | \
> +				 XFS_BULK_IREQ_BULKSTAT)

What's this XFS_BULK_IREQ_METADIR thing? I haven't noticed that when
scanning any recent proposed patch series....

> +#define XFS_BULK_IREQ_BULKSTAT_NREXT64 (1 << 0)
> +
> +#define XFS_BULK_IREQ_BULKSTAT_FLAGS_ALL (XFS_BULK_IREQ_BULKSTAT_NREXT64)

As per above, this seems unnecessarily complex.

> @@ -134,7 +136,26 @@ xfs_bulkstat_one_int(
>  
>  	buf->bs_xflags = xfs_ip2xflags(ip);
>  	buf->bs_extsize_blks = ip->i_extsize;
> -	buf->bs_extents = xfs_ifork_nextents(&ip->i_df);
> +
> +	nextents = xfs_ifork_nextents(&ip->i_df);
> +	if (!(bc->breq->flags & XFS_IBULK_NREXT64)) {
> +		xfs_extnum_t max_nextents = XFS_IFORK_EXTCNT_MAXS32;
> +
> +		if (unlikely(XFS_TEST_ERROR(false, mp,
> +				XFS_ERRTAG_REDUCE_MAX_IEXTENTS)))
> +			max_nextents = 10;
> +
> +		if (nextents > max_nextents) {
> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> +			xfs_irele(ip);
> +			error = -EINVAL;
> +			goto out_advance;
> +		}

So we return an EINVAL error if any extent count overflows the 32 bit
counter? Why isn't this -EOVERFLOW?
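
For illustration only, a userspace sketch of the narrowing check with
-EOVERFLOW as the error. The function and macro names are hypothetical;
the limit is the signed 32-bit maximum that the patch calls
XFS_IFORK_EXTCNT_MAXS32:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Largest count the old signed 32-bit field can report (0x7fffffff). */
#define SKETCH_MAX_EXTCNT_32	((1ULL << 31) - 1)

/*
 * Store a 64-bit in-core extent count into a 32-bit output field.
 * Returns -EOVERFLOW and leaves *out untouched when the value cannot
 * be represented, so the caller can tell this apart from bad input.
 */
static int
sketch_report_extents32(
	uint64_t	nextents,
	uint32_t	*out)
{
	assert(out != NULL);

	if (nextents > SKETCH_MAX_EXTCNT_32)
		return -EOVERFLOW;

	*out = (uint32_t)nextents;
	return 0;
}
```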

> +		buf->bs_extents32 = nextents;
> +	} else {
> +		buf->bs_extents64 = nextents;
> +	}
> +
>  	xfs_bulkstat_health(ip, buf);
>  	buf->bs_aextents = xfs_ifork_nextents(ip->i_afp);
>  	buf->bs_forkoff = XFS_IFORK_BOFF(ip);
> @@ -356,7 +377,7 @@ xfs_bulkstat_to_bstat(
>  	bs1->bs_blocks = bstat->bs_blocks;
>  	bs1->bs_xflags = bstat->bs_xflags;
>  	bs1->bs_extsize = XFS_FSB_TO_B(mp, bstat->bs_extsize_blks);
> -	bs1->bs_extents = bstat->bs_extents;
> +	bs1->bs_extents = bstat->bs_extents32;
>  	bs1->bs_gen = bstat->bs_gen;
>  	bs1->bs_projid_lo = bstat->bs_projectid & 0xFFFF;
>  	bs1->bs_forkoff = bstat->bs_forkoff;
> diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
> index f5a13f69883a..f61685da3837 100644
> --- a/fs/xfs/xfs_itable.h
> +++ b/fs/xfs/xfs_itable.h
> @@ -22,6 +22,8 @@ struct xfs_ibulk {
>  /* Signal that we can return metadata directories. */
>  #define XFS_IBULK_METADIR	(XFS_IWALK_METADIR)
>  
> +#define XFS_IBULK_NREXT64	(XFS_IWALK_NREXT64)
> +
>  /*
>   * Advance the user buffer pointer by one record of the given size.  If the
>   * buffer is now full, return the appropriate error code.
> diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
> index d7a082e45cbf..27a6842a1bb5 100644
> --- a/fs/xfs/xfs_iwalk.h
> +++ b/fs/xfs/xfs_iwalk.h
> @@ -31,8 +31,11 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
>  /* Signal that we can return metadata directories. */
>  #define XFS_IWALK_METADIR	(0x2)
>  
> -#define XFS_IWALK_FLAGS_ALL	(XFS_IWALK_SAME_AG | \
> -				 XFS_IWALK_METADIR)
> +#define XFS_IWALK_NREXT64	(0x4)

Can we use '(1 << 2)' style notation for new bit field defines?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
@ 2021-09-27 23:46   ` Dave Chinner
  2021-09-28  4:04     ` Dave Chinner
  2021-09-28  9:47     ` Chandan Babu R
  0 siblings, 2 replies; 42+ messages in thread
From: Dave Chinner @ 2021-09-27 23:46 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
> This commit renames extent counter fields in "struct xfs_dinode" and "struct
> xfs_log_dinode" based on the width of the fields. As of this commit, the
> 32-bit field will be used to count data fork extents and the 16-bit field will
> be used to count attr fork extents.
> 
> This change is done to enable a future commit to introduce a new 64-bit extent
> counter field.
> 
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
>  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
>  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
>  fs/xfs/scrub/inode_repair.c     |  4 ++--
>  fs/xfs/scrub/trace.h            | 14 +++++++-------
>  fs/xfs/xfs_inode_item.c         |  4 ++--
>  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
>  7 files changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index dba868f2c3e3..87c927d912f6 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
>  	__be64		di_size;	/* number of bytes in file */
>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>  	__be32		di_extsize;	/* basic/minimum extent size for file */
> -	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be32		di_nextents32;	/* number of extents in data fork */
> +	__be16		di_nextents16;	/* number of extents in attribute fork*/


Hmmm. Having the same field in the inode hold the extent count
for different inode forks based on a bit in the superblock means the
on-disk inode format is not self describing. i.e. we can't decode
the on-disk contents of an inode correctly without knowing whether a
specific feature bit is set in the superblock or not.

Right now we don't have to use external configs to decode the inode.
Feature level conditional fields are determined by inode version,
not superblock bits. Optional feature fields are easy to deal with -
zero if the feature is not in use, otherwise we assume it is in use
and can validity check it appropriately. IOWs, we don't need
to look at sb feature bits to decode and validate inode fields.

This change means that we can't determine if the extent counts are
correct just by looking at the on-disk inode. If we just have
di_nextents32 set to a non-zero value, does that mean we should have
data fork extents or attribute fork extents present?

Just looking at whether the attr fork is initialised is not
sufficient - it can be initialised with zero attr fork extents
present.  We can't look at the literal area contents, either,
because we don't zero that when we shrink it. We can't look at
di_nblocks, because that counts both attr and data fork blocks. We
can't look at di_size, because we can have data extents beyond EOF
and hence a size of zero doesn't mean the data fork is empty.

So if both forks are in extent format, they could be either both
empty, both contain extents or only one fork contains extents but we
can't tell which state is the correct one. Hence
if di_nextents64 is zero, we don't know if di_nextents32 is a count
of attribute extents or data extents without first looking at the
superblock feature bit to determine if di_nextents64 is in use or
not. The inode format is not self describing anymore.

When XFS introduced 32 bit link counts, the inode version was bumped
from v1 to v2 because it redefined fields in the inode structure
similar to this proposal[1]. The version number was then used to
determine if the inode was in old or new format - it was a self
describing format change. Hence if we are going to redefine
di_nextents to be able to hold either data fork extent count (old
format) or attr fork extent count (new format) we really need to
bump the inode version so that we can discriminate between the two
inode formats just by looking at the inode itself.

If we don't want to bump the version, then we need to do something
like:

-	__be32		di_nextents;	/* number of extents in data fork */
-	__be16		di_anextents;	/* number of extents in attribute fork*/
+	__be32		di_nextents_old;/* old number of extents in data fork */
+	__be16		di_anextents_old;/* old number of extents in attribute fork*/
.....
-	__u8            di_pad2[12];
+	__be64		di_nextents;	/* number of extents in data fork */
+	__be32		di_anextents;	/* number of extents in attribute fork*/
+	__u8            di_pad2[4];

So that there is no ambiguity in the on-disk format between the two
formats - if one set is non-zero, the other set must be zero in this
sort of setup.
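
To make the "one set non-zero means the other set is zero" rule
concrete, here is a minimal sketch of the check a verifier could apply.
The struct is a host-endian stand-in, not the real on-disk layout, and
every name in it is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-endian stand-in for the four counters discussed above. */
struct sketch_counters {
	uint32_t	nextents_old;	/* old data fork extent count */
	uint16_t	anextents_old;	/* old attr fork extent count */
	uint64_t	nextents;	/* new data fork extent count */
	uint32_t	anextents;	/* new attr fork extent count */
};

/* At most one of the two sets may hold non-zero values. */
static bool
sketch_counters_valid(
	const struct sketch_counters	*c)
{
	bool	old_used;
	bool	new_used;

	assert(c != NULL);

	old_used = c->nextents_old || c->anextents_old;
	new_used = c->nextents || c->anextents;

	return !(old_used && new_used);
}
```

An inode with all four counters zero passes, which matches the case of
both forks being empty.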

However, I think that redefining the fields and bumping the inode
version is the better long term strategy, as it allows future reuse
of the di_anextents_old field, and it uses less of the small amount
of unused padding we have remaining in the on-disk inode core.

At which point, the feature bit in the superblock becomes "has v4
inodes", not "has big extent counts". We then use v4 inode format in
memory for everything (i.e. 64 bit extent counts) and convert
to/from the ondisk format at IO time like we do with v1/v2 inodes.

Thoughts?

-Dave.

[1] The change to v2 inodes back in 1995 removed the filesystem UUID
from the inode and replaced it with a 32 bit link counter, a project ID
value and padding:

@@ -36,10 +38,12 @@ typedef struct xfs_dinode_core
        __uint16_t      di_mode;        /* mode and type of file */
        __int8_t        di_version;     /* inode version */
        __int8_t        di_format;      /* format of di_c data */
-       __uint16_t      di_nlink;       /* number of links to file */
+       __uint16_t      di_onlink;      /* old number of links to file */
        __uint32_t      di_uid;         /* owner's user id */
        __uint32_t      di_gid;         /* owner's group id */
-       uuid_t          di_uuid;        /* file unique id */
+       __uint32_t      di_nlink;       /* number of links to file */
+       __uint16_t      di_projid;      /* owner's project id */
+       __uint8_t       di_pad[10];     /* unused, zeroed space */
        xfs_timestamp_t di_atime;       /* time last accessed */
        xfs_timestamp_t di_mtime;       /* time last modified */
        xfs_timestamp_t di_ctime;       /* time created/inode modified */
@@ -81,7 +85,13 @@ typedef struct xfs_dinode

it was the redefinition of the di_uuid variable space that required
the bumping of the inode version...
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition
  2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
@ 2021-09-28  0:33   ` Dave Chinner
  2021-09-28 10:07     ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-28  0:33 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:47PM +0530, Chandan Babu R wrote:
> The maximum extent length depends on maximum block count that can be stored in
> a BMBT record. Hence this commit defines MAXEXTLEN based on
> BMBT_BLOCKCOUNT_BITLEN.
> 
> While at it, the commit also renames MAXEXTLEN to XFS_MAX_EXTLEN.

hmmmm. So you reimplemented:

#define BMBT_BLOCKCOUNT_MASK    ((1ULL << BMBT_BLOCKCOUNT_BITLEN) - 1)

and defined it as XFS_MAX_EXTLEN?

One of these two defines needs to go away. :)

Also, this macro really defines the maximum extent length a BMBT
record can hold, not the maximum XFS extent length supported. I
think it should be named XFS_BMBT_MAX_EXTLEN and also used to
replace BMBT_BLOCKCOUNT_MASK.

The counter example is free space btree records - they can hold
extent lengths up to 2^31 blocks long:

typedef struct xfs_alloc_rec {
        __be32          ar_startblock;  /* starting block number */
        __be32          ar_blockcount;  /* count of free blocks */
} xfs_alloc_rec_t, xfs_alloc_key_t;

So, yes, I think MAXEXTLEN needs cleaning up, but it needs some more
work to make it explicit in what it refers to.
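
A small sketch of the "one define instead of two" idea. The SKETCH_*
names are stand-ins; the 21-bit width comes from the comment on the old
MAXEXTLEN definition, and the assumption that the block count sits in
the low bits of the second record word is mine:

```c
#include <assert.h>
#include <stdint.h>

/* Width of the block count field in a BMBT record. */
#define SKETCH_BMBT_BLOCKCOUNT_BITLEN	21

/* One definition serving as both the mask and the maximum length. */
#define SKETCH_BMBT_MAX_EXTLEN \
	((1ULL << SKETCH_BMBT_BLOCKCOUNT_BITLEN) - 1)

/* The derived value must match the old hard-coded 0x1fffff. */
static_assert(SKETCH_BMBT_MAX_EXTLEN == 0x1fffff,
	      "BMBT max extent length is 21 bits");

/* Extract the block count from the low bits of a raw record word. */
static inline uint64_t
sketch_bmbt_blockcount(
	uint64_t	l1)
{
	return l1 & SKETCH_BMBT_MAX_EXTLEN;
}
```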

Also:

> -/*
> - * Max values for extlen and disk inode's extent counters.
> - */
> -#define	MAXEXTLEN		((xfs_extlen_t)0x1fffff)	/* 21 bits */
> -#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
> -#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
> -#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
> -#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
> -
> -
>  /*
>   * Inode minimum and maximum sizes.
>   */
> @@ -1701,6 +1691,16 @@ typedef struct xfs_bmbt_rec {
>  typedef uint64_t	xfs_bmbt_rec_base_t;	/* use this for casts */
>  typedef xfs_bmbt_rec_t xfs_bmdr_rec_t;
>  
> +/*
> + * Max values for extlen and disk inode's extent counters.
> + */
> +#define XFS_MAX_EXTLEN		((xfs_extlen_t)(1 << BMBT_BLOCKCOUNT_BITLEN) - 1)
> +#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
> +#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
> +#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
> +#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */

At the end of the patch series, I still really don't like these
names. Hungarian notation is ugly, and they don't tell me what type
they apply to. Hence I don't know what limit is the correct one to
apply to which fork and which format....

These would be much better as

#define XFS_MAX_EXTCNT_DATA_FORK	((1ULL << 48) - 1)
#define XFS_MAX_EXTCNT_ATTR_FORK	((1ULL << 32) - 1)

#define XFS_MAX_EXTCNT_DATA_FORK_OLD	((1ULL << 31) - 1)
#define XFS_MAX_EXTCNT_ATTR_FORK_OLD	((1ULL << 15) - 1)

The name tells me what object/format they apply to, and the
implementation tells me the exact size without needing a comment
to make it readable. And it doesn't need casts that just add noise
to the implementation...
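
As a quick cross-check, the shift-based limits can be compared against
the hex constants from the patch at compile time. This is a standalone
sketch; the SKETCH_* names and the helper are hypothetical:

```c
#include <assert.h>

#define SKETCH_MAX_EXTCNT_DATA_FORK	((1ULL << 48) - 1)
#define SKETCH_MAX_EXTCNT_ATTR_FORK	((1ULL << 32) - 1)

#define SKETCH_MAX_EXTCNT_DATA_FORK_OLD	((1ULL << 31) - 1)
#define SKETCH_MAX_EXTCNT_ATTR_FORK_OLD	((1ULL << 15) - 1)

/* Same values as the XFS_IFORK_EXTCNT_MAX* constants in the patch. */
static_assert(SKETCH_MAX_EXTCNT_DATA_FORK == 0xffffffffffffULL, "U48");
static_assert(SKETCH_MAX_EXTCNT_ATTR_FORK == 0xffffffffULL, "U32");
static_assert(SKETCH_MAX_EXTCNT_DATA_FORK_OLD == 0x7fffffffULL, "S32");
static_assert(SKETCH_MAX_EXTCNT_ATTR_FORK_OLD == 0x7fffULL, "S16");

/* Pick the limit from the fork and the format, not from a type width. */
static inline unsigned long long
sketch_max_extcnt(
	int	is_attr_fork,
	int	has_large_counters)
{
	if (has_large_counters)
		return is_attr_fork ? SKETCH_MAX_EXTCNT_ATTR_FORK :
				      SKETCH_MAX_EXTCNT_DATA_FORK;
	return is_attr_fork ? SKETCH_MAX_EXTCNT_ATTR_FORK_OLD :
			      SKETCH_MAX_EXTCNT_DATA_FORK_OLD;
}
```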

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
  2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
@ 2021-09-28  0:47   ` Dave Chinner
  2021-09-28  9:47     ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-28  0:47 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:43PM +0530, Chandan Babu R wrote:
> A future commit will introduce a 64-bit on-disk data extent counter and a
> 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
> xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
> of these quantities.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>

So while I was auditing extent lengths w.r.t. the last patch of the
series, I noticed that xfs_extnum_t is used in the struct
xfs_log_dinode and so changing the size of these types changes the
layout of this structure:

/*
 * Define the format of the inode core that is logged. This structure must be
 * kept identical to struct xfs_dinode except for the endianness annotations.
 */
struct xfs_log_dinode {
....
        xfs_rfsblock_t  di_nblocks;     /* # of direct & btree blocks used */
        xfs_extlen_t    di_extsize;     /* basic/minimum extent size for file */
        xfs_extnum_t    di_nextents;    /* number of extents in data fork */
        xfs_aextnum_t   di_anextents;   /* number of extents in attribute fork*/
....

Which means this:

> -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */

creates an incompatible log format change that will cause silent
inode corruption during log recovery if inodes logged with this
change are replayed on an older kernel without this change. It's not
just the type size change that matters here - it also changes the
implicit padding in this structure because xfs_extlen_t is a 32 bit
object and so:

Old					New
64 bit object (di_nblocks)		64 bit object (di_nblocks)
32 bit object (di_extsize)		32 bit object (di_extsize)
					32 bit pad (implicit)
32 bit object (di_nextents)		64 bit object (di_nextents)
16 bit object (di_anextents)		32 bit object (di_anextents)
8 bit object (di_forkoff)		8 bit object (di_forkoff)
8 bit object (di_aformat)		8 bit object (di_aformat)
					16 bit pad (implicit)
32 bit object (di_dmevmask)		32 bit object (di_dmevmask)


That's quite the layout change, and that's something we must not do
without a feature bit being set. Hence I think we need to rev the
struct xfs_log_dinode version for large extent count support, too,
so that the struct xfs_log_dinode does not change size for
filesystems without the large extent count feature.
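
The layout shift can be demonstrated in userspace with offsetof(). The
two structs below are cut-down stand-ins covering only the fields in
the table above, using plain stdint types in place of the xfs typedefs.
The exact new offsets depend on the ABI's alignment of 64-bit integers,
so the sketch only asserts the direction of the change:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fragment with the old 32-bit and 16-bit counter types. */
struct sketch_log_dinode_old {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	int32_t		di_nextents;
	int16_t		di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;
};

/* Same fragment after promoting the counters to 64 and 32 bits. */
struct sketch_log_dinode_new {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	uint64_t	di_nextents;
	uint32_t	di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;
};

/* The old fragment packs with no holes: 8 + 4 + 4 + 2 + 1 + 1 = 20. */
static_assert(offsetof(struct sketch_log_dinode_old, di_dmevmask) == 20,
	      "old layout has no implicit padding before di_dmevmask");

/* Widening the counters pushes every later field further out. */
static_assert(offsetof(struct sketch_log_dinode_new, di_dmevmask) >
	      offsetof(struct sketch_log_dinode_old, di_dmevmask),
	      "new layout moves di_dmevmask");

/* Number of bytes by which di_dmevmask moves on this ABI. */
static inline size_t
sketch_dmevmask_shift(void)
{
	return offsetof(struct sketch_log_dinode_new, di_dmevmask) -
	       offsetof(struct sketch_log_dinode_old, di_dmevmask);
}
```

A log recovery path that reads the new layout with the old struct
definition would pick up di_dmevmask and everything after it from the
wrong bytes, which is the silent corruption described above.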

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-27 23:46   ` Dave Chinner
@ 2021-09-28  4:04     ` Dave Chinner
  2021-09-29 17:03       ` Chandan Babu R
  2021-09-28  9:47     ` Chandan Babu R
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-28  4:04 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
> > xfs_log_dinode" based on the width of the fields. As of this commit, the
> > 32-bit field will be used to count data fork extents and the 16-bit field will
> > be used to count attr fork extents.
> > 
> > This change is done to enable a future commit to introduce a new 64-bit extent
> > counter field.
> > 
> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> > ---
> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
> >  fs/xfs/xfs_inode_item.c         |  4 ++--
> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
> >  7 files changed, 23 insertions(+), 23 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > index dba868f2c3e3..87c927d912f6 100644
> > --- a/fs/xfs/libxfs/xfs_format.h
> > +++ b/fs/xfs/libxfs/xfs_format.h
> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
> >  	__be64		di_size;	/* number of bytes in file */
> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > -	__be32		di_nextents;	/* number of extents in data fork */
> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > +	__be32		di_nextents32;	/* number of extents in data fork */
> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
> 
> 
> Hmmm. Having the same field in the inode hold the extent count
> for different inode forks based on a bit in the superblock means the
> on-disk inode format is not self describing. i.e. we can't decode
> the on-disk contents of an inode correctly without knowing whether a
> specific feature bit is set in the superblock or not.

Hmmmm - I just realised that there is an inode flag that indicates
the format is different. It's just that most of the code doing
conditional behaviour is using the superblock flag, not the inode
flag as the conditional.

So it is self describing, but I still don't like the way the same
field is used for the different forks. It just feels like we are
placing a landmine that we are going to forget about and step
on in the future....
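
For reference, a sketch of decoding the counters from the inode alone,
keyed on a per-inode flag rather than the superblock bit. The field
reuse follows the cover letter (the 32-bit field holds the attr fork
count once the 64-bit field is in use); the struct, the flag name and
its bit position are hypothetical, and the fields are host-endian for
simplicity:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-inode "large extent counters" flag. */
#define SKETCH_DIFLAG2_NREXT64	(1ULL << 4)

/* Stand-in for the relevant on-disk inode fields. */
struct sketch_dinode {
	uint64_t	flags2;
	uint64_t	nextents64;
	uint32_t	nextents32;
	uint16_t	nextents16;
};

/* Data fork count: 64-bit field if flagged, else the 32-bit field. */
static inline uint64_t
sketch_data_extents(
	const struct sketch_dinode	*dip)
{
	assert(dip != NULL);

	if (dip->flags2 & SKETCH_DIFLAG2_NREXT64)
		return dip->nextents64;
	return dip->nextents32;
}

/* Attr fork count: 32-bit field if flagged, else the 16-bit field. */
static inline uint32_t
sketch_attr_extents(
	const struct sketch_dinode	*dip)
{
	assert(dip != NULL);

	if (dip->flags2 & SKETCH_DIFLAG2_NREXT64)
		return dip->nextents32;
	return dip->nextents16;
}
```

The same nextents32 field answers a different question depending on the
flag, which is the reuse being objected to here.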

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper
  2021-09-27 22:46   ` Dave Chinner
@ 2021-09-28  9:46     ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-28  9:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 04:16, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:40PM +0530, Chandan Babu R wrote:
>> This commit replaces the macro XFS_DFORK_NEXTENTS() with the helper function
>> xfs_dfork_nextents(). As of this commit, xfs_dfork_nextents() returns the same
>> value as XFS_DFORK_NEXTENTS(). A future commit which extends inode's extent
>> counter fields will add more logic to this helper.
>> 
>> This commit also replaces direct accesses to xfs_dinode->di_[a]nextents
>> with calls to xfs_dfork_nextents().
>> 
>> No functional changes have been made.
>> 
>> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> ---
>>  fs/xfs/libxfs/xfs_format.h     | 28 +++++++++++++++++++++----
>>  fs/xfs/libxfs/xfs_inode_buf.c  | 16 +++++++++-----
>>  fs/xfs/libxfs/xfs_inode_fork.c | 10 +++++----
>>  fs/xfs/scrub/inode.c           | 18 +++++++++-------
>>  fs/xfs/scrub/inode_repair.c    | 38 +++++++++++++++++++++-------------
>>  5 files changed, 75 insertions(+), 35 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index ed8a5354bcbf..b4638052801f 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -930,10 +930,30 @@ enum xfs_dinode_fmt {
>>  	((w) == XFS_DATA_FORK ? \
>>  		(dip)->di_format : \
>>  		(dip)->di_aformat)
>> -#define XFS_DFORK_NEXTENTS(dip,w) \
>> -	((w) == XFS_DATA_FORK ? \
>> -		be32_to_cpu((dip)->di_nextents) : \
>> -		be16_to_cpu((dip)->di_anextents))
>> +
>> +static inline xfs_extnum_t
>> +xfs_dfork_nextents(
>> +	struct xfs_dinode	*dip,
>> +	int			whichfork)
>> +{
>> +	xfs_extnum_t		nextents = 0;
>> +
>> +	switch (whichfork) {
>> +	case XFS_DATA_FORK:
>> +		nextents = be32_to_cpu(dip->di_nextents);
>> +		break;
>> +
>
> No need for whitespace line after the break, and this could just
> return the value directly.
>

Ok. I will fix this.

>> +	case XFS_ATTR_FORK:
>> +		nextents = be16_to_cpu(dip->di_anextents);
>> +		break;
>> +
>> +	default:
>> +		ASSERT(0);
>> +		break;
>> +	}
>> +
>> +	return nextents;
>> +}
>
> I think that all the conditional inode fork macros
> should be moved to libxfs/xfs_inode_fork.h as they are converted.
>
> These macros are not actually part of the on-disk format definition
> (which is what xfs_format.h is supposed to contain) - it's code that
> parses the on-disk format and that is supposed to be in
> libxfs/xfs_inode_fork.[ch]....
>
> Next thing: the caller almost always knows what fork it wants
> the extents for - only 3 callers have a whichfork variable. So,
> perhaps:
>
> static inline xfs_extnum_t
> xfs_dfork_data_extents(
> 	struct xfs_dinode	*dip)
> {
> 	return be32_to_cpu(dip->di_nextents);
> }
>
> static inline xfs_extnum_t
> xfs_dfork_attr_extents(
> 	struct xfs_dinode	*dip)
> {
> 	return be16_to_cpu(dip->di_anextents);
> }
>
> static inline xfs_extnum_t
> xfs_dfork_extents(
> 	struct xfs_dinode	*dip,
> 	int			whichfork)
> {
> 	switch (whichfork) {
> 	case XFS_DATA_FORK:
> 		return xfs_dfork_data_extents(dip);
> 	case XFS_ATTR_FORK:
> 		return xfs_dfork_attr_extents(dip);
> 	default:
> 		ASSERT(0);
> 		break;
> 	}
> 	return 0;
> }
>
> So we don't have to rely on the compiler optimising away the switch
> statement correctly to produce optimal code.
>

I will fix this too.

>> --- a/fs/xfs/libxfs/xfs_inode_buf.c
>> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
>> @@ -342,9 +342,11 @@ xfs_dinode_verify_fork(
>>  	struct xfs_mount	*mp,
>>  	int			whichfork)
>>  {
>> -	xfs_extnum_t		di_nextents = XFS_DFORK_NEXTENTS(dip, whichfork);
>> +	xfs_extnum_t		di_nextents;
>>  	xfs_extnum_t		max_extents;
>>  
>> +	di_nextents = xfs_dfork_nextents(dip, whichfork);
>> +
>>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>>  	case XFS_DINODE_FMT_LOCAL:
>>  		/*
>> @@ -474,6 +476,8 @@ xfs_dinode_verify(
>>  	uint16_t		flags;
>>  	uint64_t		flags2;
>>  	uint64_t		di_size;
>> +	xfs_extnum_t            nextents;
>> +	xfs_rfsblock_t		nblocks;
>
> That's a block number type, not a block count:
>
> typedef uint64_t        xfs_rfsblock_t; /* blockno in filesystem (raw) */
> ....
> typedef uint64_t        xfs_filblks_t;  /* number of blocks in a file */
>
> The latter is the appropriate type to use here.
>
> Oh, the struct xfs_inode and the struct xfs_log_dinode make
> this same type mistake. Ok, that's a cleanup for another day....
>

I will add this cleanup to my todo list. 

-- 
chandan

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-27 23:46   ` Dave Chinner
  2021-09-28  4:04     ` Dave Chinner
@ 2021-09-28  9:47     ` Chandan Babu R
  1 sibling, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-28  9:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 05:16, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
>> This commit renames extent counter fields in "struct xfs_dinode" and "struct
>> xfs_log_dinode" based on the width of the fields. As of this commit, the
>> 32-bit field will be used to count data fork extents and the 16-bit field will
>> be used to count attr fork extents.
>> 
>> This change is done to enable a future commit to introduce a new 64-bit extent
>> counter field.
>> 
>> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> ---
>>  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
>>  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
>>  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
>>  fs/xfs/scrub/inode_repair.c     |  4 ++--
>>  fs/xfs/scrub/trace.h            | 14 +++++++-------
>>  fs/xfs/xfs_inode_item.c         |  4 ++--
>>  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
>>  7 files changed, 23 insertions(+), 23 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> index dba868f2c3e3..87c927d912f6 100644
>> --- a/fs/xfs/libxfs/xfs_format.h
>> +++ b/fs/xfs/libxfs/xfs_format.h
>> @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
>>  	__be64		di_size;	/* number of bytes in file */
>>  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>>  	__be32		di_extsize;	/* basic/minimum extent size for file */
>> -	__be32		di_nextents;	/* number of extents in data fork */
>> -	__be16		di_anextents;	/* number of extents in attribute fork*/
>> +	__be32		di_nextents32;	/* number of extents in data fork */
>> +	__be16		di_nextents16;	/* number of extents in attribute fork*/
>
>
> Hmmm. Having the same field in the inode hold the extent count
> for different inode forks based on a bit in the superblock means the
> on-disk inode format is not self describing. i.e. we can't decode
> the on-disk contents of an inode correctly without knowing whether a
> specific feature bit is set in the superblock or not.
>
> Right now we don't have to use external configs to decode the inode.
> Feature level conditional fields are determined by inode version,
> not superblock bits. Optional feature fields are easy to deal with -
> zero if the feature is not in use, otherwise we assume it is in use
> and can validity check it appropriately. IOWs, we don't need
> to look at sb feature bits to decode and validate inode fields.
>
> This change means that we can't determine if the extent counts are
> correct just by looking at the on-disk inode. If we just have
> di_nextents32 set to a non-zero value, does that mean we should have
> data fork extents or attribute fork extents present?
>
> Just looking at whether the attr fork is initialised is not
> sufficient - it can be initialised with zero attr fork extents
> present.  We can't look at the literal area contents, either,
> because we don't zero that when we shrink it. We can't look at
> di_nblocks, because that counts both attr and data fork blocks. We
> can't look at di_size, because we can have data extents beyond EOF
> and hence a size of zero doesn't mean the data fork is empty.
>
> So if both forks are in extent format, they could be either both
> empty, both contain extents or only one fork contains extents but we
> can't tell which state is the correct one. Hence
> if di_nextents64 is zero, we don't know if di_nextents32 is a count
> of attribute extents or data extents without first looking at the
> superblock feature bit to determine if di_nextents64 is in use or
> not. The inode format is not self describing anymore.
>
> When XFS introduced 32 bit link counts, the inode version was bumped
> from v1 to v2 because it redefined fields in the inode structure
> similar to this proposal[1]. The version number was then used to
> determine if the inode was in old or new format - it was a self
> describing format change. Hence if we are going to redefine
> di_nextents to be able to hold either data fork extent count (old
> format) or attr fork extent count (new format) we really need to
> bump the inode version so that we can discriminate between the two
> inode formats just by looking at the inode itself.
>
> If we don't want to bump the version, then we need to do something
> like:
>
> -	__be32		di_nextents;	/* number of extents in data fork */
> -	__be16		di_anextents;	/* number of extents in attribute fork*/
> +	__be32		di_nextents_old;/* old number of extents in data fork */
> +	__be16		di_anextents_old;/* old number of extents in attribute fork*/
> .....
> -	__u8            di_pad2[12];
> +	__be64		di_nextents;	/* number of extents in data fork */
> +	__be32		di_anextents;	/* number of extents in attribute fork*/
> +	__u8            di_pad2[4];
>
> So that there is no ambiguity in the on-disk format between the two
> formats - if one set is non-zero, the other set must be zero in this
> sort of setup.
>
> However, I think that redefining the fields and bumping the inode
> version is the better long term strategy, as it allows future reuse
> of the di_anextents_old field, and it uses less of the small amount
> of unused padding we have remaining in the on-disk inode core.
>
> At which point, the feature bit in the superblock becomes "has v4
> inodes", not "has big extent counts". We then use v4 inode format in
> memory for everything (i.e. 64 bit extent counts) and convert
> to/from the ondisk format at IO time like we do with v1/v2 inodes.
>
> Thoughts?

The patch "xfs: Extend per-inode extent counter widths" (which appears later
in the series) adds the new per-inode flag XFS_DIFLAG2_NREXT64. This flag is
set on inodes which use 64-bit data fork extent counter and 32-bit attr fork
extent counter fields. Verifiers can check for the presence/absence of this
flag to determine which extent counter fields to use for verification of an
xfs_dinode structure.

Hence, XFS_DIFLAG2_NREXT64 flag should be sufficient for maintaining the self
describing nature of XFS inodes right?
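
A minimal sketch of how a decoder could pick the counter fields from the
per-inode flag alone, without consulting the superblock. It assumes host-endian
fields and a made-up bit value; struct sketch_dinode, SKETCH_DIFLAG2_NREXT64 and
the sketch_* helpers are hypothetical names for illustration, not code from the
series. The "unused counter must be zero" rule is likewise an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit position; the real XFS_DIFLAG2_NREXT64 value comes from the series. */
#define SKETCH_DIFLAG2_NREXT64	(1ULL << 4)

/* Host-endian stand-in for the relevant on-disk inode fields. */
struct sketch_dinode {
	uint64_t	di_flags2;
	uint64_t	di_nextents64;	/* data fork count when the flag is set */
	uint32_t	di_nextents32;	/* data count (flag clear) or attr count (flag set) */
	uint16_t	di_nextents16;	/* attr fork count when the flag is clear */
};

static inline int sketch_has_nrext64(const struct sketch_dinode *dip)
{
	return (dip->di_flags2 & SKETCH_DIFLAG2_NREXT64) != 0;
}

static inline uint64_t sketch_data_extents(const struct sketch_dinode *dip)
{
	return sketch_has_nrext64(dip) ? dip->di_nextents64 : dip->di_nextents32;
}

static inline uint64_t sketch_attr_extents(const struct sketch_dinode *dip)
{
	return sketch_has_nrext64(dip) ? dip->di_nextents32 : dip->di_nextents16;
}

/* Assumed verifier rule: the counter not selected by the flag must be zero. */
static inline int sketch_counters_valid(const struct sketch_dinode *dip)
{
	if (sketch_has_nrext64(dip))
		return dip->di_nextents16 == 0;
	return dip->di_nextents64 == 0;
}

/* Usage: the flag alone selects which field holds which count. */
void sketch_nrext64_selfcheck(void)
{
	struct sketch_dinode d = { .di_nextents32 = 7, .di_nextents16 = 3 };

	assert(sketch_data_extents(&d) == 7 && sketch_attr_extents(&d) == 3);
	assert(sketch_counters_valid(&d));

	d.di_flags2 |= SKETCH_DIFLAG2_NREXT64;
	d.di_nextents16 = 0;
	d.di_nextents64 = 1ULL << 40;
	assert(sketch_data_extents(&d) == (1ULL << 40));
	assert(sketch_attr_extents(&d) == 7);
	assert(sketch_counters_valid(&d));
}
```

With a flag like this, the 32-bit field is still context dependent, but the
context lives in the inode itself rather than in the superblock.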

>
> -Dave.
>
> [1] The change to v2 inodes back in 1995 removed the filesystem UUID
> from the inode and was replaced with a 32 bit link counter, a project ID
> value and padding:
>
> @@ -36,10 +38,12 @@ typedef struct xfs_dinode_core
>         __uint16_t      di_mode;        /* mode and type of file */
>         __int8_t        di_version;     /* inode version */
>         __int8_t        di_format;      /* format of di_c data */
> -       __uint16_t      di_nlink;       /* number of links to file */
> +       __uint16_t      di_onlink;      /* old number of links to file */
>         __uint32_t      di_uid;         /* owner's user id */
>         __uint32_t      di_gid;         /* owner's group id */
> -       uuid_t          di_uuid;        /* file unique id */
> +       __uint32_t      di_nlink;       /* number of links to file */
> +       __uint16_t      di_projid;      /* owner's project id */
> +       __uint8_t       di_pad[10];     /* unused, zeroed space */
>         xfs_timestamp_t di_atime;       /* time last accessed */
>         xfs_timestamp_t di_mtime;       /* time last modified */
>         xfs_timestamp_t di_ctime;       /* time created/inode modified */
> @@ -81,7 +85,13 @@ typedef struct xfs_dinode
>
> it was the redefinition of the di_uuid variable space that required
> the bumping of the inode version...


-- 
chandan


* Re: [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
  2021-09-28  0:47   ` Dave Chinner
@ 2021-09-28  9:47     ` Chandan Babu R
  2021-09-28 23:08       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-28  9:47 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 06:17, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:43PM +0530, Chandan Babu R wrote:
>> A future commit will introduce a 64-bit on-disk data extent counter and a
>> 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
>> xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
>> of these quantities.
>> 
>> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>
> So while I was auditing extent lengths w.r.t. the last patch of the
> series, I noticed that xfs_extnum_t is used in the struct
> xfs_log_dinode and so changing the size of these types changes the
> layout of this structure:
>
> /*
>  * Define the format of the inode core that is logged. This structure must be
>  * kept identical to struct xfs_dinode except for the endianness annotations.
>  */
> struct xfs_log_dinode {
> ....
>         xfs_rfsblock_t  di_nblocks;     /* # of direct & btree blocks used */
>         xfs_extlen_t    di_extsize;     /* basic/minimum extent size for file */
>         xfs_extnum_t    di_nextents;    /* number of extents in data fork */
>         xfs_aextnum_t   di_anextents;   /* number of extents in attribute fork*/
> ....
>
> Which means this:
>
>> -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
>> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>> +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
>> +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
>
> creates an incompatible log format change that will cause silent
> inode corruption during log recovery if inodes logged with this
> change are replayed on an older kernel without this change. It's not
> just the type size change that matters here - it also changes the
> implicit padding in this structure because xfs_extlen_t is a 32 bit
> object and so:
>
> Old					New
> 64 bit object (di_nblocks)		64 bit object (di_nblocks)
> 32 bit object (di_extsize)		32 bit object (di_extsize)
> 					32 bit pad (implicit)
> 32 bit object (di_nextents)		64 bit object (di_nextents)
> 16 bit object (di_anextents)		32 bit object (di_anextents)
> 8 bit object (di_forkoff)		8 bit object (di_forkoff)
> 8 bit object (di_aformat)		8 bit object (di_aformat)
> 					16 bit pad (implicit)
> 32 bit object (di_dmevmask)		32 bit object (di_dmevmask)
>
>
> That's quite the layout change, and that's something we must not do
> without a feature bit being set. Hence I think we need to rev the
> struct xfs_log_dinode version for large extent count support, too,
> so that the struct xfs_log_dinode does not change size for
> filesystems without the large extent count feature.

Actually, the current patch replaces the data types xfs_extnum_t and
xfs_aextnum_t inside "struct xfs_log_dinode" with the basic integral types
uint32_t and uint16_t respectively. The patch "xfs: Extend per-inode extent
counter widths" which arrives later in the series adds the new field
di_nextents64 to "struct xfs_log_dinode" and uint64_t is used as its data
type.

So in a scenario where we have a filesystem which does not have support for
64-bit extent counters and a kernel which does not support 64-bit extent
counters is replaying a log created by a kernel supporting 64-bit extent
counters, the contents of the 16-bit and 32-bit extent counter fields should
be replayed correctly into xfs_inode's attr and data fork extent counters
respectively. The contents of the 64-bit extent counter (whose value will be
zero) in the logged inode will be replayed back into di_pad2[] field of the
inode.
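
The layout concern can be illustrated with a small sketch. The three structs
below are cut-down, hypothetical stand-ins for the relevant part of "struct
xfs_log_dinode" (they are not the real definition), and the offsets assume an
ABI that aligns uint64_t to 8 bytes, such as x86-64. The first two mirror the
Old/New columns in the table quoted above; the third uses fixed-width types as
described in this reply.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counters declared with the old typedef widths (int32_t / int16_t). */
struct sketch_log_dinode_old {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	int32_t		di_nextents;
	int16_t		di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;
};

/* Same fields if the counters silently followed the promoted typedefs. */
struct sketch_log_dinode_promoted {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	uint64_t	di_nextents;	/* 32 bits of implicit padding precede this */
	uint32_t	di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;	/* 16 bits of implicit padding precede this */
};

/* Fixed-width types keep the logged layout independent of the typedefs. */
struct sketch_log_dinode_fixed {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	uint32_t	di_nextents32;
	uint16_t	di_nextents16;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;
};

/* Usage: compare field offsets and sizes across the three variants. */
void sketch_layout_selfcheck(void)
{
	assert(offsetof(struct sketch_log_dinode_old, di_nextents) == 12);
	assert(offsetof(struct sketch_log_dinode_promoted, di_nextents) == 16);
	assert(offsetof(struct sketch_log_dinode_old, di_dmevmask) == 20);
	assert(offsetof(struct sketch_log_dinode_promoted, di_dmevmask) == 32);

	/* The fixed-width variant matches the old layout exactly. */
	assert(offsetof(struct sketch_log_dinode_fixed, di_nextents32) == 12);
	assert(offsetof(struct sketch_log_dinode_fixed, di_dmevmask) == 20);
	assert(sizeof(struct sketch_log_dinode_fixed) ==
	       sizeof(struct sketch_log_dinode_old));
}
```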

Please do let me know if my explanation is incorrect.

-- 
chandan


* Re: [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  2021-09-27 23:06   ` Dave Chinner
@ 2021-09-28  9:49     ` Chandan Babu R
  2021-09-28 23:39       ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-28  9:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 04:36, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:44PM +0530, Chandan Babu R wrote:
>> The following changes are made to enable userspace to obtain 64-bit extent
>> counters,
>> 1. To hold 64-bit extent counters, carve out the new 64-bit field
>>    xfs_bulkstat->bs_extents64 from xfs_bulkstat->bs_pad[].
>> 2. Carve out a new 64-bit field xfs_bulk_ireq->bulkstat_flags from
>>    xfs_bulk_ireq->reserved[] to hold bulkstat specific operational flags.  As of
>>    this commit, XFS_IBULK_NREXT64 is the only valid flag that this field can
>>    hold. It indicates that userspace has the necessary infrastructure to
>>    receive 64-bit extent counters.
>> 3. Define the new flag XFS_BULK_IREQ_BULKSTAT for userspace to indicate that
>>    xfs_bulk_ireq->bulkstat_flags has valid flags set.
>
> This seems unnecessarily complex. It adds a new flag to define a new
> flag field in the same structure, and then defines a new flag in
> the new flag field to define a new behaviour.
>
> Why can't this be done with just a single new flag in the existing
> flags field?
>

Yes, this can be implemented with just one flag. I will make the relevant
changes before posting the next version.
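
As a rough sketch of the single-flag approach, assuming one new bit in the
existing flags word: the names and bit values below (SKETCH_IREQ_*, struct
sketch_ireq, sketch_ireq_check) are hypothetical and simplified, not the real
xfs_fs.h definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical flag values for illustration only. */
#define SKETCH_IREQ_AGNO	(1 << 0)
#define SKETCH_IREQ_SPECIAL	(1 << 1)
#define SKETCH_IREQ_NREXT64	(1 << 2)	/* caller can receive 64-bit counts */

#define SKETCH_IREQ_FLAGS_ALL	(SKETCH_IREQ_AGNO | \
				 SKETCH_IREQ_SPECIAL | \
				 SKETCH_IREQ_NREXT64)

/* Cut-down request header: one flags word, reserved space left untouched. */
struct sketch_ireq {
	uint32_t	flags;
	uint64_t	reserved[5];	/* must be zero */
};

/*
 * Validate the request and report whether 64-bit extent counts were asked
 * for. Returns 0 on success or -EINVAL for unknown flags or non-zero
 * reserved space.
 */
static int sketch_ireq_check(const struct sketch_ireq *req, int *want64)
{
	unsigned int	i;

	if (req->flags & ~(uint32_t)SKETCH_IREQ_FLAGS_ALL)
		return -EINVAL;
	for (i = 0; i < 5; i++)
		if (req->reserved[i])
			return -EINVAL;

	*want64 = (req->flags & SKETCH_IREQ_NREXT64) != 0;
	return 0;
}

/* Usage: old callers pass no new flag, new callers set exactly one bit. */
void sketch_ireq_selfcheck(void)
{
	struct sketch_ireq req = { .flags = SKETCH_IREQ_AGNO };
	int want64 = -1;

	assert(sketch_ireq_check(&req, &want64) == 0 && want64 == 0);

	req.flags |= SKETCH_IREQ_NREXT64;
	assert(sketch_ireq_check(&req, &want64) == 0 && want64 == 1);
}
```

No reserved[] slot is consumed this way, so the request structure keeps its
existing shape.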

>> Suggested-by: Darrick J. Wong <djwong@kernel.org>
>> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> ---
>>  fs/xfs/libxfs/xfs_fs.h | 19 ++++++++++++++-----
>>  fs/xfs/xfs_ioctl.c     |  7 +++++++
>>  fs/xfs/xfs_itable.c    | 25 +++++++++++++++++++++++--
>>  fs/xfs/xfs_itable.h    |  2 ++
>>  fs/xfs/xfs_iwalk.h     |  7 +++++--
>>  5 files changed, 51 insertions(+), 9 deletions(-)
>> 
>> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
>> index 2594fb647384..b76906914d89 100644
>> --- a/fs/xfs/libxfs/xfs_fs.h
>> +++ b/fs/xfs/libxfs/xfs_fs.h
>> @@ -394,7 +394,7 @@ struct xfs_bulkstat {
>>  	uint32_t	bs_extsize_blks; /* extent size hint, blocks	*/
>>  
>>  	uint32_t	bs_nlink;	/* number of links		*/
>> -	uint32_t	bs_extents;	/* number of extents		*/
>> +	uint32_t	bs_extents32;	/* 32-bit data fork extent counter */
>>  	uint32_t	bs_aextents;	/* attribute number of extents	*/
>>  	uint16_t	bs_version;	/* structure version		*/
>>  	uint16_t	bs_forkoff;	/* inode fork offset in bytes	*/
>
> I don't think renaming structure members is a good idea - it breaks
> the user API and forces applications to require source level
> modifications just to compile on both old and new xfsprogs installs.
>

Ok. I will revert the rename.

>> @@ -403,8 +403,9 @@ struct xfs_bulkstat {
>>  	uint16_t	bs_checked;	/* checked inode metadata	*/
>>  	uint16_t	bs_mode;	/* type and mode		*/
>>  	uint16_t	bs_pad2;	/* zeroed			*/
>> +	uint64_t	bs_extents64;	/* 64-bit data fork extent counter */
>>  
>> -	uint64_t	bs_pad[7];	/* zeroed			*/
>> +	uint64_t	bs_pad[6];	/* zeroed			*/
>>  };
>>  
>>  #define XFS_BULKSTAT_VERSION_V1	(1)
>> @@ -469,7 +470,8 @@ struct xfs_bulk_ireq {
>>  	uint32_t	icount;		/* I: count of entries in buffer */
>>  	uint32_t	ocount;		/* O: count of entries filled out */
>>  	uint32_t	agno;		/* I: see comment for IREQ_AGNO	*/
>> -	uint64_t	reserved[5];	/* must be zero			*/
>> +	uint64_t	bulkstat_flags; /* I: Bulkstat operation flags */
>> +	uint64_t	reserved[4];	/* must be zero			*/
>>  };
>>  
>>  /*
>> @@ -492,9 +494,16 @@ struct xfs_bulk_ireq {
>>   */
>>  #define XFS_BULK_IREQ_METADIR	(1 << 2)
>>  
>> -#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO | \
>> +#define XFS_BULK_IREQ_BULKSTAT	(1 << 3)
>> +
>> +#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO |	 \
>>  				 XFS_BULK_IREQ_SPECIAL | \
>> -				 XFS_BULK_IREQ_METADIR)
>> +				 XFS_BULK_IREQ_METADIR | \
>> +				 XFS_BULK_IREQ_BULKSTAT)
>
> What's this XFS_BULK_IREQ_METADIR thing? I haven't noticed that when
> scanning any recent proposed patch series....
>

XFS_BULK_IREQ_METADIR is from Darrick's tree. His "Kill XFS_BTREE_MAXLEVELS"
patch series is based on his other patchsets. His recent "xfs: support dynamic
btree cursor height" patch series rebases only the required patchset on top of
v5.15-rc1 kernel eliminating the others.

>> +#define XFS_BULK_IREQ_BULKSTAT_NREXT64 (1 << 0)
>> +
>> +#define XFS_BULK_IREQ_BULKSTAT_FLAGS_ALL (XFS_BULK_IREQ_BULKSTAT_NREXT64)
>
> As per above, this seems unnecessarily complex.
>
>> @@ -134,7 +136,26 @@ xfs_bulkstat_one_int(
>>  
>>  	buf->bs_xflags = xfs_ip2xflags(ip);
>>  	buf->bs_extsize_blks = ip->i_extsize;
>> -	buf->bs_extents = xfs_ifork_nextents(&ip->i_df);
>> +
>> +	nextents = xfs_ifork_nextents(&ip->i_df);
>> +	if (!(bc->breq->flags & XFS_IBULK_NREXT64)) {
>> +		xfs_extnum_t max_nextents = XFS_IFORK_EXTCNT_MAXS32;
>> +
>> +		if (unlikely(XFS_TEST_ERROR(false, mp,
>> +				XFS_ERRTAG_REDUCE_MAX_IEXTENTS)))
>> +			max_nextents = 10;
>> +
>> +		if (nextents > max_nextents) {
>> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>> +			xfs_irele(ip);
>> +			error = -EINVAL;
>> +			goto out_advance;
>> +		}
>
> So we return an EINVAL error if any extent overflows the 32 bit
> counter? Why isn't this -EOVERFLOW?
>

Returning -EINVAL causes xfs_bulkstat_iwalk() to skip inodes whose extent
count is larger than that which can be fitted into a 32-bit field. Returning
-EOVERFLOW causes the bulkstat ioctl to stop reporting remaining inodes.

>> +		buf->bs_extents32 = nextents;
>> +	} else {
>> +		buf->bs_extents64 = nextents;
>> +	}
>> +
>>  	xfs_bulkstat_health(ip, buf);
>>  	buf->bs_aextents = xfs_ifork_nextents(ip->i_afp);
>>  	buf->bs_forkoff = XFS_IFORK_BOFF(ip);
>> @@ -356,7 +377,7 @@ xfs_bulkstat_to_bstat(
>>  	bs1->bs_blocks = bstat->bs_blocks;
>>  	bs1->bs_xflags = bstat->bs_xflags;
>>  	bs1->bs_extsize = XFS_FSB_TO_B(mp, bstat->bs_extsize_blks);
>> -	bs1->bs_extents = bstat->bs_extents;
>> +	bs1->bs_extents = bstat->bs_extents32;
>>  	bs1->bs_gen = bstat->bs_gen;
>>  	bs1->bs_projid_lo = bstat->bs_projectid & 0xFFFF;
>>  	bs1->bs_forkoff = bstat->bs_forkoff;
>> diff --git a/fs/xfs/xfs_itable.h b/fs/xfs/xfs_itable.h
>> index f5a13f69883a..f61685da3837 100644
>> --- a/fs/xfs/xfs_itable.h
>> +++ b/fs/xfs/xfs_itable.h
>> @@ -22,6 +22,8 @@ struct xfs_ibulk {
>>  /* Signal that we can return metadata directories. */
>>  #define XFS_IBULK_METADIR	(XFS_IWALK_METADIR)
>>  
>> +#define XFS_IBULK_NREXT64	(XFS_IWALK_NREXT64)
>> +
>>  /*
>>   * Advance the user buffer pointer by one record of the given size.  If the
>>   * buffer is now full, return the appropriate error code.
>> diff --git a/fs/xfs/xfs_iwalk.h b/fs/xfs/xfs_iwalk.h
>> index d7a082e45cbf..27a6842a1bb5 100644
>> --- a/fs/xfs/xfs_iwalk.h
>> +++ b/fs/xfs/xfs_iwalk.h
>> @@ -31,8 +31,11 @@ int xfs_iwalk_threaded(struct xfs_mount *mp, xfs_ino_t startino,
>>  /* Signal that we can return metadata directories. */
>>  #define XFS_IWALK_METADIR	(0x2)
>>  
>> -#define XFS_IWALK_FLAGS_ALL	(XFS_IWALK_SAME_AG | \
>> -				 XFS_IWALK_METADIR)
>> +#define XFS_IWALK_NREXT64	(0x4)
>
> Can we use '(1 << 2)' style notation for new bit field defines?

Sure, I will change this.

-- 
chandan


* Re: [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition
  2021-09-28  0:33   ` Dave Chinner
@ 2021-09-28 10:07     ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-28 10:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 06:03, Dave Chinner wrote:
> On Thu, Sep 16, 2021 at 03:36:47PM +0530, Chandan Babu R wrote:
>> The maximum extent length depends on maximum block count that can be stored in
>> a BMBT record. Hence this commit defines MAXEXTLEN based on
>> BMBT_BLOCKCOUNT_BITLEN.
>> 
>> While at it, the commit also renames MAXEXTLEN to XFS_MAX_EXTLEN.
>
> hmmmm. So you reimplemented:
>
> #define BMBT_BLOCKCOUNT_MASK    ((1ULL << BMBT_BLOCKCOUNT_BITLEN) - 1)
>
> and defined it as XFS_MAX_EXTLEN?
>
> One of these two defines needs to go away. :)
>
> Also, this macro really defines the maximum extent length a BMBT
> record can hold, not the maximum XFS extent length supported. I
> think it should be  named XFS_BMBT_MAX_EXTLEN and also used to
> replace BMBT_BLOCKCOUNT_MASK.

Thanks for the suggestion. I will incorporate this before posting the next
version of the patchset.

>
> The counter example is free space btree records - they can hold
> extent lengths up to 2^31 blocks long:
>
> typedef struct xfs_alloc_rec {
>         __be32          ar_startblock;  /* starting block number */
>         __be32          ar_blockcount;  /* count of free blocks */
> } xfs_alloc_rec_t, xfs_alloc_key_t;
>
> So, yes, I think MAXEXTLEN needs cleaning up, but it needs some more
> work to make it explicit in what it refers to.
>
> Also:
>
>> -/*
>> - * Max values for extlen and disk inode's extent counters.
>> - */
>> -#define	MAXEXTLEN		((xfs_extlen_t)0x1fffff)	/* 21 bits */
>> -#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
>> -#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
>> -#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
>> -#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
>> -
>> -
>>  /*
>>   * Inode minimum and maximum sizes.
>>   */
>> @@ -1701,6 +1691,16 @@ typedef struct xfs_bmbt_rec {
>>  typedef uint64_t	xfs_bmbt_rec_base_t;	/* use this for casts */
>>  typedef xfs_bmbt_rec_t xfs_bmdr_rec_t;
>>  
>> +/*
>> + * Max values for extlen and disk inode's extent counters.
>> + */
>> +#define XFS_MAX_EXTLEN		((xfs_extlen_t)(1 << BMBT_BLOCKCOUNT_BITLEN) - 1)
>> +#define XFS_IFORK_EXTCNT_MAXU48	((xfs_extnum_t)0xffffffffffff)	/* Unsigned 48-bits */
>> +#define XFS_IFORK_EXTCNT_MAXU32	((xfs_aextnum_t)0xffffffff)	/* Unsigned 32-bits */
>> +#define XFS_IFORK_EXTCNT_MAXS32 ((xfs_extnum_t)0x7fffffff)	/* Signed 32-bits */
>> +#define XFS_IFORK_EXTCNT_MAXS16 ((xfs_aextnum_t)0x7fff)		/* Signed 16-bits */
>
> At the end of the patch series, I still really don't like these
> names. Hungarian notation is ugly, and they don't tell me what type
> they apply to. Hence I don't know what limit is the correct one to
> apply to which fork and which format....
>
> These would be much better as
>
> #define XFS_MAX_EXTCNT_DATA_FORK	((1ULL << 48) - 1)
> #define XFS_MAX_EXTCNT_ATTR_FORK	((1ULL << 32) - 1)
>
> #define XFS_MAX_EXTCNT_DATA_FORK_OLD	((1ULL << 31) - 1)
> #define XFS_MAX_EXTCNT_ATTR_FORK_OLD	((1ULL << 15) - 1)
>
> The name tells me what object/format they apply to, and the
> implementation tells me the exact size without needing a comment
> to make it readable. And it doesn't need casts that just add noise
> to the implementation...

I agree. I will include this change in the next version of the patchset.
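
For reference, a small sketch of the shift-based limits, checked against the
hex constants they would replace. BMBT_BLOCKCOUNT_BITLEN is taken as 21 to
match the "21 bits" comment on MAXEXTLEN; sketch_max_extcnt() and its
parameters are hypothetical and only show how the fork/format naming could be
used by a caller.

```c
#include <assert.h>
#include <stdint.h>

#define BMBT_BLOCKCOUNT_BITLEN		21
#define XFS_BMBT_MAX_EXTLEN		((1ULL << BMBT_BLOCKCOUNT_BITLEN) - 1)

#define XFS_MAX_EXTCNT_DATA_FORK	((1ULL << 48) - 1)
#define XFS_MAX_EXTCNT_ATTR_FORK	((1ULL << 32) - 1)

#define XFS_MAX_EXTCNT_DATA_FORK_OLD	((1ULL << 31) - 1)
#define XFS_MAX_EXTCNT_ATTR_FORK_OLD	((1ULL << 15) - 1)

/* Hypothetical helper: pick the limit from the fork and the inode format. */
static inline uint64_t sketch_max_extcnt(int attr_fork, int has_nrext64)
{
	if (has_nrext64)
		return attr_fork ? XFS_MAX_EXTCNT_ATTR_FORK
				 : XFS_MAX_EXTCNT_DATA_FORK;
	return attr_fork ? XFS_MAX_EXTCNT_ATTR_FORK_OLD
			 : XFS_MAX_EXTCNT_DATA_FORK_OLD;
}

/* Usage: the shift forms equal the hex constants used by the older macros. */
void sketch_limits_selfcheck(void)
{
	assert(XFS_BMBT_MAX_EXTLEN == 0x1fffffULL);
	assert(XFS_MAX_EXTCNT_DATA_FORK == 0xffffffffffffULL);
	assert(XFS_MAX_EXTCNT_ATTR_FORK == 0xffffffffULL);
	assert(XFS_MAX_EXTCNT_DATA_FORK_OLD == 0x7fffffffULL);
	assert(XFS_MAX_EXTCNT_ATTR_FORK_OLD == 0x7fffULL);
}
```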

-- 
chandan


* Re: [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
  2021-09-28  9:47     ` Chandan Babu R
@ 2021-09-28 23:08       ` Dave Chinner
  2021-09-29 17:04         ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-28 23:08 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Tue, Sep 28, 2021 at 03:17:59PM +0530, Chandan Babu R wrote:
> On 28 Sep 2021 at 06:17, Dave Chinner wrote:
> > On Thu, Sep 16, 2021 at 03:36:43PM +0530, Chandan Babu R wrote:
> >> A future commit will introduce a 64-bit on-disk data extent counter and a
> >> 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
> >> xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
> >> of these quantities.
> >> 
> >> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> >> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> >
> > So while I was auditing extent lengths w.r.t. the last patch of the
> > series, I noticed that xfs_extnum_t is used in the struct
> > xfs_log_dinode and so changing the size of these types changes the
> > layout of this structure:
> >
> > /*
> >  * Define the format of the inode core that is logged. This structure must be
> >  * kept identical to struct xfs_dinode except for the endianness annotations.
> >  */
> > struct xfs_log_dinode {
> > ....
> >         xfs_rfsblock_t  di_nblocks;     /* # of direct & btree blocks used */
> >         xfs_extlen_t    di_extsize;     /* basic/minimum extent size for file */
> >         xfs_extnum_t    di_nextents;    /* number of extents in data fork */
> >         xfs_aextnum_t   di_anextents;   /* number of extents in attribute fork*/
> > ....
> >
> > Which means this:
> >
> >> -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
> >> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
> >> +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
> >> +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
> >
> > creates an incompatible log format change that will cause silent
> > inode corruption during log recovery if inodes logged with this
> > change are replayed on an older kernel without this change. It's not
> > just the type size change that matters here - it also changes the
> > implicit padding in this structure because xfs_extlen_t is a 32 bit
> > object and so:
> >
> > Old					New
> > 64 bit object (di_nblocks)		64 bit object (di_nblocks)
> > 32 bit object (di_extsize)		32 bit object (di_extsize)
> > 					32 bit pad (implicit)
> > 32 bit object (di_nextents)		64 bit object (di_nextents)
> > 16 bit object (di_anextents)		32 bit object (di_anextents)
> > 8 bit object (di_forkoff)		8 bit object (di_forkoff)
> > 8 bit object (di_aformat)		8 bit object (di_aformat)
> > 					16 bit pad (implicit)
> > 32 bit object (di_dmevmask)		32 bit object (di_dmevmask)
> >
> >
> > That's quite the layout change, and that's something we must not do
> > without a feature bit being set. Hence I think we need to rev the
> > struct xfs_log_dinode version for large extent count support, too,
> > so that the struct xfs_log_dinode does not change size for
> > filesystems without the large extent count feature.
> 
> Actually, the current patch replaces the data types xfs_extnum_t and
> xfs_aextnum_t inside "struct xfs_log_dinode" with the basic integral types
> uint32_t and uint16_t respectively. The patch "xfs: Extend per-inode extent
> counter widths" which arrives later in the series adds the new field
> di_nextents64 to "struct xfs_log_dinode" and uint64_t is used as its data
> type.

Arggh.

Perhaps now you might see why I really don't like naming things by
size and having the contents of those fields based on context? It
is so easy to miss things like when the wrong variable or type is
used for a given context because the code itself gives you no hint
as to what the correct usage is.

I suspect part of the problem I had here is that the change of
the type in the xfs_log_dinode is done in a -variable rename- patch
that names variables by size, not in the patch that -actually
changes the variable size-.

IOWs, the type change in the xfs_log_dinode should
either be in this patch where the log_dinode structure shape would
change, or in its own standalone patch with a description that says
"we need to avoid changing the on-disk structure shape".

Making sure that the on-disk format changes (or things that avoid
them!) are clear and explicit in a patchset is critical as these are
things we really need to get right.

I missed the per-inode extent size flag for a similar reason - it
was buried in a larger patch that made lots of different
modifications to support the on-disk extent count format change, so
it wasn't clearly defined/called out as a separate on-disk format
change necessary for correct functioning.

> So in a scenario where we have a filesystem which does not have support for
> 64-bit extent counters and a kernel which does not support 64-bit extent
> counters is replaying a log created by a kernel supporting 64-bit extent
> counters, the contents of the 16-bit and 32-bit extent counter fields should
> be replayed correctly into xfs_inode's attr and data fork extent counters
> respectively. The contents of the 64-bit extent counter (whose value will be
> zero) in the logged inode will be replayed back into di_pad2[] field of the
> inode.

I think that's correct, because the superblock bit will prevent
mount on old kernels that don't support the 64 bit extent counter
and so the zeroes in di_pad2 won't get overwritten incorrectly.
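
A sketch of the gating this relies on, assuming the usual rule that a kernel
refuses to mount when the superblock carries an incompat bit it does not know.
SKETCH_INCOMPAT_NREXT64, its bit position and sketch_check_incompat() are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical bit for the large extent counter feature. */
#define SKETCH_INCOMPAT_NREXT64	(1U << 5)

/*
 * Refuse the mount if the superblock has any incompat feature bit outside
 * the set this kernel understands.
 */
static int sketch_check_incompat(uint32_t sb_features_incompat, uint32_t known)
{
	if (sb_features_incompat & ~known)
		return -EINVAL;
	return 0;
}

/* Usage: an old kernel rejects the new filesystem, a new kernel accepts both. */
void sketch_incompat_selfcheck(void)
{
	uint32_t old_kernel = SKETCH_INCOMPAT_NREXT64 - 1;	/* all lower bits */
	uint32_t new_kernel = old_kernel | SKETCH_INCOMPAT_NREXT64;

	assert(sketch_check_incompat(SKETCH_INCOMPAT_NREXT64, old_kernel) == -EINVAL);
	assert(sketch_check_incompat(SKETCH_INCOMPAT_NREXT64, new_kernel) == 0);
	assert(sketch_check_incompat(0, old_kernel) == 0);
	assert(sketch_check_incompat(0, new_kernel) == 0);
}
```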

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  2021-09-28  9:49     ` Chandan Babu R
@ 2021-09-28 23:39       ` Dave Chinner
  2021-09-29 17:04         ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-28 23:39 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Tue, Sep 28, 2021 at 03:19:29PM +0530, Chandan Babu R wrote:
> On 28 Sep 2021 at 04:36, Dave Chinner wrote:
> > On Thu, Sep 16, 2021 at 03:36:44PM +0530, Chandan Babu R wrote:
> >> @@ -492,9 +494,16 @@ struct xfs_bulk_ireq {
> >>   */
> >>  #define XFS_BULK_IREQ_METADIR	(1 << 2)
> >>  
> >> -#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO | \
> >> +#define XFS_BULK_IREQ_BULKSTAT	(1 << 3)
> >> +
> >> +#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO |	 \
> >>  				 XFS_BULK_IREQ_SPECIAL | \
> >> -				 XFS_BULK_IREQ_METADIR)
> >> +				 XFS_BULK_IREQ_METADIR | \
> >> +				 XFS_BULK_IREQ_BULKSTAT)
> >
> > What's this XFS_BULK_IREQ_METADIR thing? I haven't noticed that when
> > scanning any recent proposed patch series....
> >
> 
> XFS_BULK_IREQ_METADIR is from Darrick's tree. His "Kill XFS_BTREE_MAXLEVELS"
> patch series is based on his other patchsets. His recent "xfs: support dynamic
> btree cursor height" patch series rebases only the required patchset on top of
> v5.15-rc1 kernel eliminating the others.

OK, so how much testing has this had on just a straight v5.15-rcX
kernel?

> >> @@ -134,7 +136,26 @@ xfs_bulkstat_one_int(
> >>  
> >>  	buf->bs_xflags = xfs_ip2xflags(ip);
> >>  	buf->bs_extsize_blks = ip->i_extsize;
> >> -	buf->bs_extents = xfs_ifork_nextents(&ip->i_df);
> >> +
> >> +	nextents = xfs_ifork_nextents(&ip->i_df);
> >> +	if (!(bc->breq->flags & XFS_IBULK_NREXT64)) {
> >> +		xfs_extnum_t max_nextents = XFS_IFORK_EXTCNT_MAXS32;
> >> +
> >> +		if (unlikely(XFS_TEST_ERROR(false, mp,
> >> +				XFS_ERRTAG_REDUCE_MAX_IEXTENTS)))
> >> +			max_nextents = 10;
> >> +
> >> +		if (nextents > max_nextents) {
> >> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
> >> +			xfs_irele(ip);
> >> +			error = -EINVAL;
> >> +			goto out_advance;
> >> +		}
> >
> > So we return an EINVAL error if any extent overflows the 32 bit
> > counter? Why isn't this -EOVERFLOW?
> >
> 
> Returning -EINVAL causes xfs_bulkstat_iwalk() to skip inodes whose extent
> count is larger than that which can be fitted into a 32-bit field. Returning
> -EOVERFLOW causes the bulkstat ioctl to stop reporting remaining inodes.

Ok, that's a bad behaviour we need to fix because it will cause
things like old versions of xfs_dump to miss inodes that
have overflowing extent counts. i.e. it will cause incomplete
backups, and the failure will likely be silent.

I asked about -EOVERFLOW because that's what stat() returns when an
inode attribute value doesn't fit in the stat_buf field (e.g. 64 bit
inode number on 32 bit kernel), and if we are overflowing the
bulkstat field then we really should be telling userspace that an
overflow occurred.

/me has a sudden realisation that the xfs_dump format may not
support large extent counts and goes looking...

Yeah, xfsdump doesn't support extent counts greater than 2^32. So
that means we really do need -EOVERFLOW errors here.  i.e, if we get
an extent count overflow with a !(bc->breq->flags &
XFS_IBULK_NREXT64) bulkstat walk, xfs_dump needs bulkstat to fill
out the inode with the overflow with all the fields that aren't
overflowed, then error out with -EOVERFLOW.

Bulkstat itself should not silently skip the inode because it would
overflow a field in struct xfs_bstat - the decision of what to
do with the overflow is something xfsdump needs to handle, not the
kernel.  Hence we need to return -EOVERFLOW here so that userspace
can decide what to do with an inode it can't handle...
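
As a rough illustration of the behaviour being asked for, here is a
minimal userspace sketch. The struct and function names (demo_bstat,
demo_fill_bstat) are hypothetical stand-ins, not the real bulkstat
code, and clamping the overflowed field is an assumption of the
sketch rather than something decided in this thread:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for a legacy bulkstat record; not the real struct. */
struct demo_bstat {
	uint64_t	bs_ino;
	int32_t		bs_extents;	/* 32-bit data fork extent count */
};

/*
 * Fill in every field that fits. If the extent count does not fit,
 * still report the inode and return -EOVERFLOW so the caller can
 * decide what to do, instead of silently skipping the inode.
 */
static inline int
demo_fill_bstat(
	struct demo_bstat	*buf,
	uint64_t		ino,
	uint64_t		nextents)
{
	buf->bs_ino = ino;
	if (nextents > INT32_MAX) {
		/* Clamping is an assumption made for this sketch. */
		buf->bs_extents = INT32_MAX;
		return -EOVERFLOW;
	}
	buf->bs_extents = (int32_t)nextents;
	return 0;
}
```

A caller such as a backup tool would then see the inode number even
when the return value is -EOVERFLOW, and can choose to abort or warn.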

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-28  4:04     ` Dave Chinner
@ 2021-09-29 17:03       ` Chandan Babu R
  2021-09-30  0:40         ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-29 17:03 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 28 Sep 2021 at 09:34, Dave Chinner wrote:
> On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
>> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
>> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
>> > xfs_log_dinode" based on the width of the fields. As of this commit, the
>> > 32-bit field will be used to count data fork extents and the 16-bit field will
>> > be used to count attr fork extents.
>> > 
>> > This change is done to enable a future commit to introduce a new 64-bit extent
>> > counter field.
>> > 
>> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> > ---
>> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
>> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
>> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
>> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
>> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
>> >  fs/xfs/xfs_inode_item.c         |  4 ++--
>> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
>> >  7 files changed, 23 insertions(+), 23 deletions(-)
>> > 
>> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> > index dba868f2c3e3..87c927d912f6 100644
>> > --- a/fs/xfs/libxfs/xfs_format.h
>> > +++ b/fs/xfs/libxfs/xfs_format.h
>> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
>> >  	__be64		di_size;	/* number of bytes in file */
>> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
>> > -	__be32		di_nextents;	/* number of extents in data fork */
>> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
>> > +	__be32		di_nextents32;	/* number of extents in data fork */
>> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
>> 
>> 
>> Hmmm. Having the same field in the inode hold the extent count
>> for different inode forks based on a bit in the superblock means the
>> on-disk inode format is not self describing. i.e. we can't decode
>> the on-disk contents of an inode correctly without knowing whether a
>> specific feature bit is set in the superblock or not.
>
> Hmmmm - I just realised that there is an inode flag that indicates
> the format is different. It's just that most of the code doing
> conditional behaviour is using the superblock flag, not the inode
> flag as the conditional.
>
> So it is self describing, but I still don't like the way the same
> field is used for the different forks. It just feels like we are
> placing a landmine that we are going to forget about and step
> on in the future....
>

Sorry, I missed this response from you.

I agree with your suggestion. I will use the inode version number to help in
deciding which extent counter fields are valid for a specific inode.

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively
  2021-09-28 23:08       ` Dave Chinner
@ 2021-09-29 17:04         ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-29 17:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 29 Sep 2021 at 04:38, Dave Chinner wrote:
> On Tue, Sep 28, 2021 at 03:17:59PM +0530, Chandan Babu R wrote:
>> On 28 Sep 2021 at 06:17, Dave Chinner wrote:
>> > On Thu, Sep 16, 2021 at 03:36:43PM +0530, Chandan Babu R wrote:
>> >> A future commit will introduce a 64-bit on-disk data extent counter and a
>> >> 32-bit on-disk attr extent counter. This commit promotes xfs_extnum_t and
>> >> xfs_aextnum_t to 64 and 32-bits in order to correctly handle in-core versions
>> >> of these quantities.
>> >> 
>> >> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>> >> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> >
>> > So while I was auditing extent lengths w.r.t. the last patch of the
>> > series, I noticed that xfs_extnum_t is used in the struct
>> > xfs_log_dinode and so changing the size of these types changes the
>> > layout of this structure:
>> >
>> > /*
>> >  * Define the format of the inode core that is logged. This structure must be
>> >  * kept identical to struct xfs_dinode except for the endianness annotations.
>> >  */
>> > struct xfs_log_dinode {
>> > ....
>> >         xfs_rfsblock_t  di_nblocks;     /* # of direct & btree blocks used */
>> >         xfs_extlen_t    di_extsize;     /* basic/minimum extent size for file */
>> >         xfs_extnum_t    di_nextents;    /* number of extents in data fork */
>> >         xfs_aextnum_t   di_anextents;   /* number of extents in attribute fork*/
>> > ....
>> >
>> > Which means this:
>> >
>> >> -typedef int32_t		xfs_extnum_t;	/* # of extents in a file */
>> >> -typedef int16_t		xfs_aextnum_t;	/* # extents in an attribute fork */
>> >> +typedef uint64_t	xfs_extnum_t;	/* # of extents in a file */
>> >> +typedef uint32_t	xfs_aextnum_t;	/* # extents in an attribute fork */
>> >
>> > creates an incompatible log format change that will cause silent
>> > inode corruption during log recovery if inodes logged with this
>> > change are replayed on an older kernel without this change. It's not
>> > just the type size change that matters here - it also changes the
>> > implicit padding in this structure because xfs_extlen_t is a 32 bit
>> > object and so:
>> >
>> > Old					New
>> > 64 bit object (di_nblocks)		64 bit object (di_nblocks)
>> > 32 bit object (di_extsize)		32 bit object (di_extsize)
>> > 					32 bit pad (implicit)
>> > 32 bit object (di_nextents)		64 bit object (di_nextents)
>> > 16 bit object (di_anextents)		32 bit object (di_anextents)
>> > 8 bit object (di_forkoff)		8 bit object (di_forkoff)
>> > 8 bit object (di_aformat)		8 bit object (di_aformat)
>> > 					16 bit pad (implicit)
>> > 32 bit object (di_dmevmask)		32 bit object (di_dmevmask)
>> >
>> >
>> > That's quite the layout change, and that's something we must not do
>> > without a feature bit being set. hence I think we need to rev the
>> > struct xfs_log_dinode version for large extent count support, too,
>> > so that the struct xfs_log_dinode does not change size for
>> > filesystems without the large extent count feature.
>> 
>> Actually, the current patch replaces the data types xfs_extnum_t and
>> xfs_aextnum_t inside "struct xfs_log_dinode" with the basic integral types
>> uint32_t and uint16_t respectively. The patch "xfs: Extend per-inode extent
>> counter widths" which arrives later in the series adds the new field
>> di_nextents64 to "struct xfs_log_dinode" and uint64_t is used as its data
>> type.

Sorry, the previous patch is the one which changes the data type of the extent
counter fields in "struct xfs_log_dinode".
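
To make the layout concern quoted above concrete, here is a small
sketch that compares the two shapes with offsetof(). The struct names
(demo_old_core, demo_new_core) are hypothetical and only model the few
fields listed in the table, using fixed-width types in place of the
XFS typedefs. The offsets assume a typical 64-bit ABI where uint64_t
is 8-byte aligned:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shape with 32-bit and 16-bit counters (the "Old" column). */
struct demo_old_core {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	int32_t		di_nextents;
	int16_t		di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	uint32_t	di_dmevmask;
};

/* Shape after widening the typedefs (the "New" column). */
struct demo_new_core {
	uint64_t	di_nblocks;
	uint32_t	di_extsize;
	/* 32 bits of implicit padding land here */
	uint64_t	di_nextents;
	uint32_t	di_anextents;
	uint8_t		di_forkoff;
	int8_t		di_aformat;
	/* 16 bits of implicit padding land here */
	uint32_t	di_dmevmask;
};

/* How far di_dmevmask moves between the two shapes, in bytes. */
static inline size_t
demo_dmevmask_shift(void)
{
	return offsetof(struct demo_new_core, di_dmevmask) -
	       offsetof(struct demo_old_core, di_dmevmask);
}
```

Every field after di_extsize sits at a different offset in the two
shapes, which is why a log written with one shape cannot be replayed
safely by code that expects the other.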

>
> Arggh.
>
> Perhaps now you might see why I really don't like naming things by
> size and having the contents of those fields based on context? It
> is so easy to miss things like when the wrong variable or type is
> used for a given context because the code itself gives you no hint
> as to what the correct usage is.

I agree. I will go with the "Increment inode version" suggestion.

>
> I suspect part of the problem I've had here is that the change of
> the type in the xfs_log_dinode is done in a -variable rename- patch
> that names variables by size, not in the patch that -actually
> changes the variable size-.
>
> IOWs, the type change in the xfs_log_dinode should
> either be in this patch where the log_dinode structure shape would
> change, or in its own standalone patch with a description that says
> "we need to avoid changing the on-disk structure shape".

I think I will put the data type change in a separate patch to make it much
easier to spot. Thanks for suggesting that.

>
> Making sure that the on-disk format changes (or things that avoid
> them!) are clear and explicit in a patchset is critical as these are
> things we really need to get right.
>
> I missed the per-inode extent size flag for a similar reason - it
> was buried in a larger patch that made lots of different
> modifications to support the on-disk extent count format change, so
> it wasn't clearly defined/called out as a separate on-disk format
> change necessary for correct functioning.
>

You are right. I will pull out critical parts of the "xfs: Extend per-inode
extent counter widths" into as many separate patches as possible.

>> So in a scenario where we have a filesystem which does not have support for
>> 64-bit extent counters and a kernel which does not support 64-bit extent
>> counters is replaying a log created by a kernel supporting 64-bit extent
>> counters, the contents of the 16-bit and 32-bit extent counter fields should
>> be replayed correctly into xfs_inode's attr and data fork extent counters
>> respectively. The contents of the 64-bit extent counter (whose value will be
>> zero) in the logged inode will be replayed back into di_pad2[] field of the
>> inode.
>
> I think that's correct, because the superblock bit will prevent
> mount on old kernels that don't support the 64 bit extent counter
> and so the zeroes in di_pad2 won't get overwritten incorrectly.
>
> Cheers,
>
> Dave.

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters
  2021-09-28 23:39       ` Dave Chinner
@ 2021-09-29 17:04         ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-09-29 17:04 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 29 Sep 2021 at 05:09, Dave Chinner wrote:
> On Tue, Sep 28, 2021 at 03:19:29PM +0530, Chandan Babu R wrote:
>> On 28 Sep 2021 at 04:36, Dave Chinner wrote:
>> > On Thu, Sep 16, 2021 at 03:36:44PM +0530, Chandan Babu R wrote:
>> >> @@ -492,9 +494,16 @@ struct xfs_bulk_ireq {
>> >>   */
>> >>  #define XFS_BULK_IREQ_METADIR	(1 << 2)
>> >>  
>> >> -#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO | \
>> >> +#define XFS_BULK_IREQ_BULKSTAT	(1 << 3)
>> >> +
>> >> +#define XFS_BULK_IREQ_FLAGS_ALL	(XFS_BULK_IREQ_AGNO |	 \
>> >>  				 XFS_BULK_IREQ_SPECIAL | \
>> >> -				 XFS_BULK_IREQ_METADIR)
>> >> +				 XFS_BULK_IREQ_METADIR | \
>> >> +				 XFS_BULK_IREQ_BULKSTAT)
>> >
>> > What's this XFS_BULK_IREQ_METADIR thing? I haven't noticed that when
>> > scanning any recent proposed patch series....
>> >
>> 
>> XFS_BULK_IREQ_METADIR is from Darrick's tree. His "Kill XFS_BTREE_MAXLEVELS"
>> patch series is based on his other patchsets. His recent "xfs: support dynamic
>> btree cursor height" patch series rebases only the required patchset on top of
>> v5.15-rc1 kernel eliminating the others.
>
> OK, so how much testing has this had on just a straight v5.15-rcX
> kernel?
>

I haven't tested this patchset on v5.15-rcX yet. I will have to rebase my
patchset on top of Darrick's patchset, and I will also need the xfsprogs
version of "xfs: support dynamic btree cursor height".

>> >> @@ -134,7 +136,26 @@ xfs_bulkstat_one_int(
>> >>  
>> >>  	buf->bs_xflags = xfs_ip2xflags(ip);
>> >>  	buf->bs_extsize_blks = ip->i_extsize;
>> >> -	buf->bs_extents = xfs_ifork_nextents(&ip->i_df);
>> >> +
>> >> +	nextents = xfs_ifork_nextents(&ip->i_df);
>> >> +	if (!(bc->breq->flags & XFS_IBULK_NREXT64)) {
>> >> +		xfs_extnum_t max_nextents = XFS_IFORK_EXTCNT_MAXS32;
>> >> +
>> >> +		if (unlikely(XFS_TEST_ERROR(false, mp,
>> >> +				XFS_ERRTAG_REDUCE_MAX_IEXTENTS)))
>> >> +			max_nextents = 10;
>> >> +
>> >> +		if (nextents > max_nextents) {
>> >> +			xfs_iunlock(ip, XFS_ILOCK_SHARED);
>> >> +			xfs_irele(ip);
>> >> +			error = -EINVAL;
>> >> +			goto out_advance;
>> >> +		}
>> >
>> > So we return an EINVAL error if any extent overflows the 32 bit
>> > counter? Why isn't this -EOVERFLOW?
>> >
>> 
>> Returning -EINVAL causes xfs_bulkstat_iwalk() to skip inodes whose extent
>> count is larger than that which can be fitted into a 32-bit field. Returning
>> -EOVERFLOW causes the bulkstat ioctl to stop reporting remaining inodes.
>
> Ok, that's a bad behaviour we need to fix because it will cause
> things like old versions of xfs_dump to miss inodes that
> have overflowing extent counts. i.e. it will cause incomplete
> backups, and the failure will likely be silent.
>
> I asked about -EOVERFLOW because that's what stat() returns when an
> inode attribute value doesn't fit in the stat_buf field (e.g. 64 bit
> inode number on 32 bit kernel), and if we are overflowing the
> bulkstat field then we really should be telling userspace that an
> overflow occurred.
>
> /me has a sudden realisation that the xfs_dump format may not
> support large extent counts and goes looking...
>
> Yeah, xfsdump doesn't support extent counts greater than 2^32. So
> that means we really do need -EOVERFLOW errors here.  i.e., if we get
> an extent count overflow with a !(bc->breq->flags &
> XFS_IBULK_NREXT64) bulkstat walk, xfs_dump needs bulkstat to fill
> out the overflowing inode with all the fields that aren't
> overflowed, then error out with -EOVERFLOW.
>
> Bulkstat itself should not silently skip the inode because it would
> overflow a field in struct xfs_bstat - the decision of what to
> do with the overflow is something xfsdump needs to handle, not the
> kernel.  Hence we need to return -EOVERFLOW here so that userspace
> can decide what to do with an inode it can't handle...
>

Ok. I had never thought of the xfsdump use case. I will fix this issue as
well.

I guess adding the ability for xfsdump to work with 64-bit extent counters
can be done after I address all the issues pointed out with the current
patchset.

Thanks a lot for reviewing this patchset.

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-29 17:03       ` Chandan Babu R
@ 2021-09-30  0:40         ` Dave Chinner
  2021-09-30  4:31           ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-30  0:40 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Wed, Sep 29, 2021 at 10:33:23PM +0530, Chandan Babu R wrote:
> On 28 Sep 2021 at 09:34, Dave Chinner wrote:
> > On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
> >> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
> >> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
> >> > xfs_log_dinode" based on the width of the fields. As of this commit, the
> >> > 32-bit field will be used to count data fork extents and the 16-bit field will
> >> > be used to count attr fork extents.
> >> > 
> >> > This change is done to enable a future commit to introduce a new 64-bit extent
> >> > counter field.
> >> > 
> >> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> >> > ---
> >> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
> >> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
> >> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
> >> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
> >> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
> >> >  fs/xfs/xfs_inode_item.c         |  4 ++--
> >> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
> >> >  7 files changed, 23 insertions(+), 23 deletions(-)
> >> > 
> >> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> >> > index dba868f2c3e3..87c927d912f6 100644
> >> > --- a/fs/xfs/libxfs/xfs_format.h
> >> > +++ b/fs/xfs/libxfs/xfs_format.h
> >> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
> >> >  	__be64		di_size;	/* number of bytes in file */
> >> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >> > -	__be32		di_nextents;	/* number of extents in data fork */
> >> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> >> > +	__be32		di_nextents32;	/* number of extents in data fork */
> >> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
> >> 
> >> 
> >> Hmmm. Having the same field in the inode hold the extent count
> >> for different inode forks based on a bit in the superblock means the
> >> on-disk inode format is not self describing. i.e. we can't decode
> >> the on-disk contents of an inode correctly without knowing whether a
> >> specific feature bit is set in the superblock or not.
> >
> > Hmmmm - I just realised that there is an inode flag that indicates
> > the format is different. It's just that most of the code doing
> > conditional behaviour is using the superblock flag, not the inode
> > flag as the conditional.
> >
> > So it is self describing, but I still don't like the way the same
> > field is used for the different forks. It just feels like we are
> > placing a landmine that we are going to forget about and step
> > on in the future....
> >
> 
> Sorry, I missed this response from you.
> 
> I agree with your suggestion. I will use the inode version number to help in
> deciding which extent counter fields are valid for a specific inode.

No, don't do something I suggested with a flawed understanding of
the code.

Just because *I* suggest something, it doesn't mean you have to make
that change. That is reacting to *who* said something, not *what was
said*.

So, I may have reservations about the way the storage definitions
are being redefined, but if I had a valid, technical argument I
could give right now I would have said so directly. I can't put my
finger on why this worries me in this case but didn't for something
like, say, the BIGTIME feature which redefined the contents of
various fields in the inode.

IOWs, I haven't really had time to think and go back over the rest
of the patchset since I realised my mistake and determine if that
changes what I think about this, so don't go turning the patchset
upside down just because *I suggested something*.

Think critically about what is said and respond to that, not look
at who said it and respond based on their reputation.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument
  2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
@ 2021-09-30  1:19   ` Dave Chinner
  0 siblings, 0 replies; 42+ messages in thread
From: Dave Chinner @ 2021-09-30  1:19 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 16, 2021 at 03:36:41PM +0530, Chandan Babu R wrote:
> This commit changes xfs_dfork_nextents() to return an error code. The extent
> count itself is now returned through an out argument. This facility will be
> used by a future commit to indicate an inconsistent ondisk extent count.
> 
> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> ---
>  fs/xfs/libxfs/xfs_format.h     | 14 ++---
>  fs/xfs/libxfs/xfs_inode_buf.c  | 16 ++++--
>  fs/xfs/libxfs/xfs_inode_fork.c | 21 ++++++--
>  fs/xfs/scrub/inode.c           | 94 +++++++++++++++++++++-------------
>  fs/xfs/scrub/inode_repair.c    | 34 ++++++++----
>  5 files changed, 118 insertions(+), 61 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index b4638052801f..dba868f2c3e3 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -931,28 +931,30 @@ enum xfs_dinode_fmt {
>  		(dip)->di_format : \
>  		(dip)->di_aformat)
>  
> -static inline xfs_extnum_t
> +static inline int
>  xfs_dfork_nextents(
>  	struct xfs_dinode	*dip,
> -	int			whichfork)
> +	int			whichfork,
> +	xfs_extnum_t		*nextents)
>  {
> -	xfs_extnum_t		nextents = 0;
> +	int			error = 0;
>  
>  	switch (whichfork) {
>  	case XFS_DATA_FORK:
> -		nextents = be32_to_cpu(dip->di_nextents);
> +		*nextents = be32_to_cpu(dip->di_nextents);
>  		break;
>  
>  	case XFS_ATTR_FORK:
> -		nextents = be16_to_cpu(dip->di_anextents);
> +		*nextents = be16_to_cpu(dip->di_anextents);
>  		break;
>  
>  	default:
>  		ASSERT(0);
> +		error = -EFSCORRUPTED;
>  		break;
>  	}
>  
> -	return nextents;
> +	return error;
>  }

So why do we need to do this? AFAICT, the only check that can return
errors that is added by the end of the patch series is an
on-disk-format check that does:

	if (inode_has_nrext64 && dip->di_nextents16 != 0)
		return -EFSCORRUPTED;

This doesn't belong here - it is conflating verification with
extraction. Verification of the on-disk format belongs in the
verifiers or verification code, not in the function that extracts
the extent count.

>  /*
> diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c
> index 176c98798aa4..dc511630cc7a 100644
> --- a/fs/xfs/libxfs/xfs_inode_buf.c
> +++ b/fs/xfs/libxfs/xfs_inode_buf.c
> @@ -345,7 +345,8 @@ xfs_dinode_verify_fork(
>  	xfs_extnum_t		di_nextents;
>  	xfs_extnum_t		max_extents;
>  
> -	di_nextents = xfs_dfork_nextents(dip, whichfork);
> +	if (xfs_dfork_nextents(dip, whichfork, &di_nextents))
> +		return __this_address;
>  
>  	switch (XFS_DFORK_FORMAT(dip, whichfork)) {
>  	case XFS_DINODE_FMT_LOCAL:
> @@ -477,6 +478,7 @@ xfs_dinode_verify(
>  	uint64_t		flags2;
>  	uint64_t		di_size;
>  	xfs_extnum_t            nextents;
> +	xfs_extnum_t            naextents;
>  	xfs_rfsblock_t		nblocks;
>  
>  	if (dip->di_magic != cpu_to_be16(XFS_DINODE_MAGIC))
> @@ -508,8 +510,13 @@ xfs_dinode_verify(
>  	if ((S_ISLNK(mode) || S_ISDIR(mode)) && di_size == 0)
>  		return __this_address;
>  
> -	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
> -	nextents += xfs_dfork_nextents(dip, XFS_ATTR_FORK);
> +	if (xfs_dfork_nextents(dip, XFS_DATA_FORK, &nextents))
> +		return __this_address;
> +
> +	if (xfs_dfork_nextents(dip, XFS_ATTR_FORK, &naextents))
> +		return __this_address;

Yeah, so this should end up being:

xfs_failaddr_t
xfs_dfork_nextents_verify(
	... )
{
	if (dip->di_flags2 & NREXT64) {
		if (!xfs_has_nrext64(mp))
			return __this_address;
		if (dip->di_nextents16 != 0)
			return __this_address;
	} else if (dip->di_nextents64 != 0) {
		return __this_address;
	}
	return NULL;
}

and
	faddr = xfs_dfork_nextents_verify(dip, mp);
	if (faddr)
		return faddr;
	nextents = xfs_dfork_nextents(dip, XFS_DATA_FORK);
	naextents = xfs_dfork_nextents(dip, XFS_ATTR_FORK);

Now all the verification can be removed from xfs_dfork_nextents(),
and anything that needs to verify that the extent counts are in a
valid format can call xfs_dfork_nextents_verify() to perform this
check (i.e. the dinode verifiers and scrub checking code).
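
The split between verification and extraction can be sketched in a
self-contained way as follows. Everything named demo_* or DEMO_* is
hypothetical, the flag bit position is made up, and the fields are
host-endian for simplicity, so this only illustrates the shape of the
idea and is not the kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DIFLAG2_NREXT64	(1ULL << 4)	/* hypothetical bit position */

/* Hypothetical, host-endian model of the relevant dinode fields. */
struct demo_dinode {
	uint64_t	di_flags2;
	uint64_t	di_nextents64;
	uint32_t	di_nextents32;
	uint16_t	di_nextents16;
};

enum demo_fork { DEMO_DATA_FORK, DEMO_ATTR_FORK };

/* Verification only: true when the counter fields are consistent. */
static inline bool
demo_nextents_ok(
	const struct demo_dinode	*dip,
	bool				fs_has_nrext64)
{
	if (dip->di_flags2 & DEMO_DIFLAG2_NREXT64) {
		if (!fs_has_nrext64)
			return false;
		return dip->di_nextents16 == 0;
	}
	return dip->di_nextents64 == 0;
}

/* Extraction only: assumes the dinode already passed demo_nextents_ok(). */
static inline uint64_t
demo_nextents(
	const struct demo_dinode	*dip,
	enum demo_fork			whichfork)
{
	bool	big = (dip->di_flags2 & DEMO_DIFLAG2_NREXT64) != 0;

	if (whichfork == DEMO_DATA_FORK)
		return big ? dip->di_nextents64 : dip->di_nextents32;
	return big ? dip->di_nextents32 : dip->di_nextents16;
}
```

With this shape the extraction helper cannot fail, so callers do not
need an out argument or an error path just to read a counter.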

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-30  0:40         ` Dave Chinner
@ 2021-09-30  4:31           ` Dave Chinner
  2021-09-30  7:30             ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-30  4:31 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
> On Wed, Sep 29, 2021 at 10:33:23PM +0530, Chandan Babu R wrote:
> > On 28 Sep 2021 at 09:34, Dave Chinner wrote:
> > > On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
> > >> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
> > >> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
> > >> > xfs_log_dinode" based on the width of the fields. As of this commit, the
> > >> > 32-bit field will be used to count data fork extents and the 16-bit field will
> > >> > be used to count attr fork extents.
> > >> > 
> > >> > This change is done to enable a future commit to introduce a new 64-bit extent
> > >> > counter field.
> > >> > 
> > >> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> > >> > ---
> > >> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
> > >> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
> > >> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
> > >> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
> > >> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
> > >> >  fs/xfs/xfs_inode_item.c         |  4 ++--
> > >> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
> > >> >  7 files changed, 23 insertions(+), 23 deletions(-)
> > >> > 
> > >> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> > >> > index dba868f2c3e3..87c927d912f6 100644
> > >> > --- a/fs/xfs/libxfs/xfs_format.h
> > >> > +++ b/fs/xfs/libxfs/xfs_format.h
> > >> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
> > >> >  	__be64		di_size;	/* number of bytes in file */
> > >> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> > >> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> > >> > -	__be32		di_nextents;	/* number of extents in data fork */
> > >> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> > >> > +	__be32		di_nextents32;	/* number of extents in data fork */
> > >> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
> > >> 
> > >> 
> > >> Hmmm. Having the same field in the inode hold the extent count
> > >> for different inode forks based on a bit in the superblock means the
> > >> on-disk inode format is not self describing. i.e. we can't decode
> > >> the on-disk contents of an inode correctly without knowing whether a
> > >> specific feature bit is set in the superblock or not.
> > >
> > > Hmmmm - I just realised that there is an inode flag that indicates
> > > the format is different. It's just that most of the code doing
> > > conditional behaviour is using the superblock flag, not the inode
> > > flag as the conditional.
> > >
> > > So it is self describing, but I still don't like the way the same
> > > field is used for the different forks. It just feels like we are
> > > placing a landmine that we are going to forget about and step
> > > on in the future....
> > >
> > 
> > Sorry, I missed this response from you.
> > 
> > I agree with your suggestion. I will use the inode version number to help in
> > deciding which extent counter fields are valid for a specific inode.
> 
> No, don't do something I suggested with a flawed understanding of
> the code.
> 
> Just because *I* suggest something, it doesn't mean you have to make
> that change. That is reacting to *who* said something, not *what was
> said*.
> 
> So, I may have reservations about the way the storage definitions
> are being redefined, but if I had a valid, technical argument I
> could give right now I would have said so directly. I can't put my
> finger on why this worries me in this case but didn't for something
> like, say, the BIGTIME feature which redefined the contents of
> various fields in the inode.
> 
> IOWs, I haven't really had time to think and go back over the rest
> of the patchset since I realised my mistake and determine if that
> changes what I think about this, so don't go turning the patchset
> upside down just because *I suggested something*.

So, looking over the patchset more, I think I understand my feeling
a bit better. Inconsistency is a big part of it.

The in-memory extent counts are held in the struct xfs_inode_fork
and not the inode. The type is xfs_extnum_t - it's not a size
dependent type. Indeed, there are actually no users of the
xfs_aextnum_t type in XFS at all any more. It should be removed.

What this means is that in-memory inode extent counting just doesn't
discriminate between inode fork types. They are all 64 bit counters,
and all the limits applied to them should be 64 bit types. Even the
checks for overflow are abstracted away by
xfs_iext_count_may_overflow(), so none of the extent manipulation
code has any idea there are different types and limits in the
on-disk format.

That's good.

The only place the actual type matters is when looking at the raw
disk inode and, unfortunately, that's where it gets messy. Anything
accessing the on-disk inode directly has to look at inode version
number, and an inode feature flag to interpret the inode format
correctly.  That format is then reflected in an in-memory inode
feature flag, and then there's the superblock feature flag on top of
that to indicate that there are NREXT64 format inodes in the
filesystem.

Then there's implied dynamic upgrades of the on-disk inode format.
We see that being implied in xfs_inode_to_disk_iext_counters() and
xfs_trans_log_inode() but the filesystem format can't be changed
dynamically. i.e. we can't create new NREXT64 inodes if the
superblock flag is not set, so there is no code in this patchset
that I can see that provides a trigger for a dynamic upgrade to
start. IOWs, the filesystem has to be taken offline to change the
superblock feature bit, and the setup of the default NREXT64 inode
flag at mount time re-inforces this.

With this in mind, I started to see inconsistent use of inode
feature flag vs superblock feature flag to determine on-disk inode
extent count limits. e.g. look at xfs_iext_count_may_overflow() and
xfs_iext_max_nextents(). Both of these are determining the maximum
number of extents that are valid for an inode, and they look at the
-superblock feature bit- to determine the limits.

This only works if all inodes in the filesystem have the same
format, which is not true if we are doing dynamic upgrades of the
inode features. The most obvious case here is that scrub needs to
determine the layout and limits based on the current feature bits in
the inode, not the superblock feature bit.

Then we have to look at how the upgrade is performed - by changing
the in-memory inode flag during xfs_trans_log_inode() when the inode
is dirtied. When we are modifying the inode for extent allocation,
we check the extent count limits on the inode *before* we dirty the
inode. Hence the only way an "upgrade at overflow thresholds" can
actually work is if we don't use the inode flag for determining
limits but instead use the superblock feature bit limits. But as
I've already pointed out, that leads to other problems.

When we are converting an inode format, we currently do it when the
inode is first brought into memory and read from disk (i.e.
xfs_inode_from_disk()). We do the full conversion at this point in
time, such that if the inode is dirtied in memory all the correct
behaviour for the new format occurs and the writeback is done in the
new format.

This would allow xfs_iext_count_may_overflow/xfs_iext_max_nextents
to actually return the correct limits for the inode as it is being
modified and not have to rely on superblock feature bits. If the
inode is not being modified, then the in-memory format changes are
discarded when the inode is reclaimed from memory and nothing
changes on disk.

This means that once we've read the inode in from disk and set up
ip->i_diflags2 according to the superblock feature bit, we can use
the in-memory inode flag -everywhere- we need to find and/or check
limits during modifications. Yes, I know that the BIGTIME upgrade
path does this, but that doesn't have limits that prevent
modifications from taking place before we can log the inode and set
the BIGTIME flag....
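
A minimal sketch of that idea, with hypothetical names throughout
(demo_inode, demo_inode_from_disk, demo_max_nextents and so on). The
limit values are assumptions taken from the cover letter (48 bits and
32 bits for NREXT64 inodes, signed 32-bit and signed 16-bit
otherwise), not from the patches themselves:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed limits, derived from the cover letter's description. */
#define DEMO_MAX_DATA_LARGE	((1ULL << 48) - 1)
#define DEMO_MAX_ATTR_LARGE	((1ULL << 32) - 1)
#define DEMO_MAX_DATA_SMALL	((uint64_t)INT32_MAX)
#define DEMO_MAX_ATTR_SMALL	((uint64_t)INT16_MAX)

/* Hypothetical in-memory inode carrying only the feature flag. */
struct demo_inode {
	bool	has_nrext64;
};

/* Decide the in-memory format once, when the inode is read in. */
static inline void
demo_inode_from_disk(
	struct demo_inode	*ip,
	bool			disk_flag,
	bool			sb_has_nrext64)
{
	ip->has_nrext64 = disk_flag || sb_has_nrext64;
}

/* Limits depend only on the in-memory inode flag. */
static inline uint64_t
demo_max_nextents(
	const struct demo_inode	*ip,
	bool			attr_fork)
{
	if (ip->has_nrext64)
		return attr_fork ? DEMO_MAX_ATTR_LARGE : DEMO_MAX_DATA_LARGE;
	return attr_fork ? DEMO_MAX_ATTR_SMALL : DEMO_MAX_DATA_SMALL;
}

/* True if adding nr_to_add extents would exceed the inode's limit. */
static inline bool
demo_count_may_overflow(
	const struct demo_inode	*ip,
	bool			attr_fork,
	uint64_t		cur,
	uint64_t		nr_to_add)
{
	uint64_t	max = demo_max_nextents(ip, attr_fork);

	if (cur > max)
		return true;
	return nr_to_add > max - cur;
}
```

Because the flag is settled at read time, the overflow check made
before the inode is dirtied already sees the limits of the format
that will eventually be written back.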

So, yeah, I think the biggest problem I've been having is that the
way the inode flags, the limits and the on-disk format are juggled
has resulted in me taking some time to understand where the problems
lie. Cleaning up the initialisation, conversion and consistency in
using the inode flags rather than the superblock flag will go a long
way to addressing my concerns.

---

FWIW, I also think doing something like this would make the code
easier to read and easier to confirm as obviously correct when
reading it:

	__be32          di_gid;         /* owner's group id */
	__be32          di_nlink;       /* number of links to file */
	__be16          di_projid_lo;   /* lower part of owner's project id */
	__be16          di_projid_hi;   /* higher part owner's project id */
	union {
		__be64	di_big_dextcnt;	/* NREXT64 data extents */
		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
		struct {
			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
			__be16	di_flushiter;	/* V2 inode incremented on flush */
		};
	};
	xfs_timestamp_t di_atime;       /* time last accessed */
	xfs_timestamp_t di_mtime;       /* time last modified */
	xfs_timestamp_t di_ctime;       /* time created/inode modified */
	__be64          di_size;        /* number of bytes in file */
	__be64          di_nblocks;     /* # of direct & btree blocks used */
	__be32          di_extsize;     /* basic/minimum extent size for file */
	union {
		struct {
			__be32	di_big_aextcnt; /* NREXT64 attr extents */
			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
		};
		struct {
			__be32	di_nextents;    /* !NREXT64 data extents */
			__be16	di_anextents;   /* !NREXT64 attr extents */
		};
	};
	__u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
	__s8            di_aformat;     /* format of attr fork's data */
...

Then we get something like:

static inline void
xfs_inode_to_disk_iext_counters(
       struct xfs_inode        *ip,
       struct xfs_dinode       *to)
{
       if (xfs_inode_has_nrext64(ip)) {
               to->di_big_dextcnt = cpu_to_be64(xfs_ifork_nextents(&ip->i_df));
               to->di_big_aextcnt = cpu_to_be32(xfs_ifork_nextents(ip->i_afp));
               to->di_nrext64_pad = 0;
       } else {
               to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
               to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
       }
}

It is now obvious that we are writing to the correct fields
in the inode for the feature bits that are set, and we don't need
to zero the di_big_dextcnt field because that's been taken care of
by the existing di_v2_pad/flushiter zeroing. That bit could probably
be improved by unwinding and open coding this in xfs_inode_to_disk(),
but I think what I'm proposing should be obvious now...
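
As a standalone illustration of the overlay idea (again not kernel code:
the sketch_* names are hypothetical, only the two overlaid regions of
the inode are modelled, and plain host-endian integers stand in for the
__be types so the endian conversion is left out), something like this
compiles with a C11 compiler and round-trips both formats:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Only the two overlaid regions of the on-disk inode are modelled. */
struct sketch_dinode {
	union {
		uint64_t	big_dextcnt;	/* NREXT64 data extents */
		uint8_t		v3_pad[8];	/* !NREXT64 V3 zeroed space */
		struct {
			uint8_t		v2_pad[6];	/* V2 zeroed space */
			uint16_t	flushiter;	/* V2 flush counter */
		};
	};
	union {
		struct {
			uint32_t	big_aextcnt;	/* NREXT64 attr extents */
			uint16_t	nrext64_pad;	/* NREXT64 unused, zero */
		};
		struct {
			uint32_t	nextents;	/* !NREXT64 data extents */
			uint16_t	anextents;	/* !NREXT64 attr extents */
		};
	};
};

/* Both views of each region must start at the same offset. */
static_assert(offsetof(struct sketch_dinode, big_dextcnt) ==
	      offsetof(struct sketch_dinode, v2_pad), "data overlay");
static_assert(offsetof(struct sketch_dinode, big_aextcnt) ==
	      offsetof(struct sketch_dinode, nextents), "attr overlay");
static_assert(offsetof(struct sketch_dinode, nrext64_pad) ==
	      offsetof(struct sketch_dinode, anextents), "pad overlay");

/* Write the in-memory counts into whichever fields the format uses. */
static inline void
sketch_inode_to_disk_iext_counters(
	bool			has_nrext64,
	uint64_t		data_extents,
	uint64_t		attr_extents,
	struct sketch_dinode	*to)
{
	if (has_nrext64) {
		to->big_dextcnt = data_extents;
		to->big_aextcnt = (uint32_t)attr_extents;
		to->nrext64_pad = 0;
	} else {
		to->nextents = (uint32_t)data_extents;
		to->anextents = (uint16_t)attr_extents;
	}
}

/* Read the counts back, widening everything to 64 bits in memory. */
static inline void
sketch_inode_from_disk_iext_counters(
	bool				has_nrext64,
	const struct sketch_dinode	*from,
	uint64_t			*data_extents,
	uint64_t			*attr_extents)
{
	if (has_nrext64) {
		*data_extents = from->big_dextcnt;
		*attr_extents = from->big_aextcnt;
	} else {
		*data_extents = from->nextents;
		*attr_extents = from->anextents;
	}
}
```

The callers only ever pass and receive 64 bit counts; which field a
count lands in is decided inside these two helpers and nowhere else.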

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-30  4:31           ` Dave Chinner
@ 2021-09-30  7:30             ` Chandan Babu R
  2021-09-30 22:55               ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-09-30  7:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 30 Sep 2021 at 10:01, Dave Chinner wrote:
> On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> On Wed, Sep 29, 2021 at 10:33:23PM +0530, Chandan Babu R wrote:
>> > On 28 Sep 2021 at 09:34, Dave Chinner wrote:
>> > > On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
>> > >> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
>> > >> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
>> > >> > xfs_log_dinode" based on the width of the fields. As of this commit, the
>> > >> > 32-bit field will be used to count data fork extents and the 16-bit field will
>> > >> > be used to count attr fork extents.
>> > >> > 
>> > >> > This change is done to enable a future commit to introduce a new 64-bit extent
>> > >> > counter field.
>> > >> > 
>> > >> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> > >> > ---
>> > >> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
>> > >> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
>> > >> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
>> > >> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
>> > >> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
>> > >> >  fs/xfs/xfs_inode_item.c         |  4 ++--
>> > >> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
>> > >> >  7 files changed, 23 insertions(+), 23 deletions(-)
>> > >> > 
>> > >> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> > >> > index dba868f2c3e3..87c927d912f6 100644
>> > >> > --- a/fs/xfs/libxfs/xfs_format.h
>> > >> > +++ b/fs/xfs/libxfs/xfs_format.h
>> > >> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
>> > >> >  	__be64		di_size;	/* number of bytes in file */
>> > >> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>> > >> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
>> > >> > -	__be32		di_nextents;	/* number of extents in data fork */
>> > >> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
>> > >> > +	__be32		di_nextents32;	/* number of extents in data fork */
>> > >> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
>> > >> 
>> > >> 
>> > >> Hmmm. Having the same field in the inode hold the extent count
>> > >> for different inode forks based on a bit in the superblock means the
>> > >> on-disk inode format is not self describing. i.e. we can't decode
>> > >> the on-disk contents of an inode correctly without knowing whether a
>> > >> specific feature bit is set in the superblock or not.
>> > >
>> > > Hmmmm - I just realised that there is an inode flag that indicates
>> > > the format is different. It's just that most of the code doing
>> > > conditional behaviour is using the superblock flag, not the inode
>> > > flag as the conditional.
>> > >
>> > > So it is self describing, but I still don't like the way the same
>> > > field is used for the different forks. It just feels like we are
>> > > placing a landmine that we are going to forget about and step
>> > > on in the future....
>> > >
>> > 
>> > Sorry, I missed this response from you.
>> > 
>> > I agree with your suggestion. I will use the inode version number to help in
>> > deciding which extent counter fields are valid for a specific inode.
>> 
>> No, don't do something I suggested with a flawed understanding of
>> the code.
>> 
>> Just because *I* suggest something, it doesn't mean you have to make that
>> change. That is reacting to *who* said something, not *what was
>> said*.
>> 
>> So, I may have reservations about the way the storage definitions
>> are being redefined, but if I had a valid, technical argument I
>> could give right now I would have said so directly. I can't put my
>> finger on why this worries me in this case but didn't for something
>> like, say, the BIGTIME feature which redefined the contents of
>> various fields in the inode.
>> 
>> IOWs, I haven't really had time to think and go back over the rest
>> of the patchset since I realised my mistake and determine if that
>> changes what I think about this, so don't go turning the patchset
>> upside down just because *I suggested something*.
>
> So, looking over the patchset more, I think I understand my feeling
> a bit better. Inconsistency is a big part of it.
>
> The in-memory extent counts are held in the struct xfs_inode_fork
> and not the inode. The type is a xfs_extcnt_t - it's not a size
> dependent type. Indeed, there are actually no users of the
> xfs_aextcnt_t variable in XFS at all any more. It should be removed.
>
> What this means is that in-memory inode extent counting just doesn't
> discriminate between inode fork types. They are all 64 bit counters,
> and all the limits applied to them should be 64 bit types. Even the
> checks for overflow are abstracted away by
> xfs_iext_count_may_overflow(), so none of the extent manipulation
> code has any idea there are different types and limits in the
> on-disk format.
>
> That's good.
>
> The only place the actual type matters is when looking at the raw
> disk inode and, unfortunately, that's where it gets messy. Anything
> accessing the on-disk inode directly has to look at inode version
> number, and an inode feature flag to interpret the inode format
> correctly.  That format is then reflected in an in-memory inode
> feature flag, and then there's the superblock feature flag on top of
> that to indicate that there are NREXT64 format inodes in the
> filesystem.
>
> Then there's implied dynamic upgrades of the on-disk inode format.
> We see that being implied in xfs_inode_to_disk_iext_counters() and
> xfs_trans_log_inode() but the filesystem format can't be changed
> dynamically. i.e. we can't create new NREXT64 inodes if the
> superblock flag is not set, so there is no code in this patchset
> that I can see that provides a trigger for a dynamic upgrade to
> start. IOWs, the filesystem has to be taken offline to change the
> superblock feature bit, and the setup of the default NREXT64 inode
> flag at mount time re-inforces this.
>
> With this in mind, I started to see inconsistent use of inode
> feature flag vs superblock feature flag to determine on-disk inode
> extent count limits. e.g. look at xfs_iext_count_may_overflow() and
> xfs_iext_max_nextents(). Both of these are determining the maximum
> number of extents that are valid for an inode, and they look at the
> -superblock feature bit- to determine the limits.
>
> This only works if all inodes in the filesystem have the same
> format, which is not true if we are doing dynamic upgrades of the
> inode features. The most obvious case here is that scrub needs to
> determine the layout and limits based on the current feature bits in
> the inode, not the superblock feature bit.
>
> Then we have to look at how the upgrade is performed - by changing
> the in-memory inode flag during xfs_trans_log_inode() when the inode
> is dirtied. When we are modifying the inode for extent allocation,
> we check the extent count limits on the inode *before* we dirty the
> inode. Hence the only way an "upgrade at overflow thresholds" can
> actually work is if we don't use the inode flag for determining
> limits but instead use the superblock feature bit limits. But as
> I've already pointed out, that leads to other problems.
>
> When we are converting an inode format, we currently do it when the
> inode is first brought into memory and read from disk (i.e.
> xfs_inode_from_disk()). We do the full conversion at this point in
> time, such that if the inode is dirtied in memory all the correct
> behaviour for the new format occurs and the writeback is done in the
> new format.
>
> This would allow xfs_iext_count_may_overflow/xfs_iext_max_nextents
> to actually return the correct limits for the inode as it is being
> modified and not have to rely on superblock feature bits. If the
> inode is not being modified, then the in-memory format changes are
> discarded when the inode is reclaimed from memory and nothing
> changes on disk.
>
> This means that once we've read the inode in from disk and set up
> ip->i_diflags2 according to the superblock feature bit, we can use
> the in-memory inode flag -everywhere- we need to find and/or check
> limits during modifications. Yes, I know that the BIGTIME upgrade
> path does this, but that doesn't have limits that prevent
> modifications from taking place before we can log the inode and set
> the BIGTIME flag....
>

Ok. The above solution looks logically correct. I haven't been able to come up
with a scenario where the solution wouldn't work. I will implement it and see
if anything breaks.

> So, yeah, I think the biggest problem I've been having is that the
> way the inode flags, the limits and the on-disk format are juggled
> has resulted in me taking some time to understand where the problems
> lie. Cleaning up the initialisation, conversion and consistency in
> using the inode flags rather than the superblock flag will go a long
> way to addressing my concerns.
>
> ---
>
> FWIW, I also think doing something like this would help make the
> code be easier to read and confirm that it is obviously correct when
> reading it:
>
> 	__be32          di_gid;         /* owner's group id */
> 	__be32          di_nlink;       /* number of links to file */
> 	__be16          di_projid_lo;   /* lower part of owner's project id */
> 	__be16          di_projid_hi;   /* higher part owner's project id */
> 	union {
> 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
> 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
> 		struct {
> 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
> 			__be16	di_flushiter;	/* V2 inode incremented on flush */
> 		};
> 	};
> 	xfs_timestamp_t di_atime;       /* time last accessed */
> 	xfs_timestamp_t di_mtime;       /* time last modified */
> 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
> 	__be64          di_size;        /* number of bytes in file */
> 	__be64          di_nblocks;     /* # of direct & btree blocks used */
> 	__be32          di_extsize;     /* basic/minimum extent size for file */
> 	union {
> 		struct {
> 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
> 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
> 		};
> 		struct {
> 			__be32	di_nextents;    /* !NREXT64 data extents */
> 			__be16	di_anextents;   /* !NREXT64 attr extents */
> 		};
> 	};
> 	__u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
> 	__s8            di_aformat;     /* format of attr fork's data */
> ...
>
> Then we get something like:
>
> static inline void
> xfs_inode_to_disk_iext_counters(
>        struct xfs_inode        *ip,
>        struct xfs_dinode       *to)
> {
>        if (xfs_inode_has_nrext64(ip)) {
>                to->di_big_dextcnt = cpu_to_be64(xfs_ifork_nextents(&ip->i_df));
>                to->di_big_aextcnt = cpu_to_be32(xfs_ifork_nextents(ip->i_afp));
>                to->di_nrext64_pad = 0;
>        } else {
>                to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
>                to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
>        }
> }
>
> It is now obvious that we are writing to the correct fields
> in the inode for the feature bits that are set, and we don't need
> to zero the di_big_dextcnt field because that's been taken care of
> by the existing di_v2_pad/flushiter zeroing. That bit could probably
> be improved by unwinding and open coding this in xfs_inode_to_disk(),
> but I think what I'm proposing should be obvious now...
>

Yes, the explanation you provided is very clear. I will implement these
suggestions.

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-30  7:30             ` Chandan Babu R
@ 2021-09-30 22:55               ` Dave Chinner
  2021-10-07 10:52                 ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-09-30 22:55 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
> >> On Wed, Sep 29, 2021 at 10:33:23PM +0530, Chandan Babu R wrote:
> >> > On 28 Sep 2021 at 09:34, Dave Chinner wrote:
> >> > > On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
> >> > >> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
> >> > >> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
> >> > >> > xfs_log_dinode" based on the width of the fields. As of this commit, the
> >> > >> > 32-bit field will be used to count data fork extents and the 16-bit field will
> >> > >> > be used to count attr fork extents.
> >> > >> > 
> >> > >> > This change is done to enable a future commit to introduce a new 64-bit extent
> >> > >> > counter field.
> >> > >> > 
> >> > >> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
> >> > >> > ---
> >> > >> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
> >> > >> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
> >> > >> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
> >> > >> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
> >> > >> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
> >> > >> >  fs/xfs/xfs_inode_item.c         |  4 ++--
> >> > >> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
> >> > >> >  7 files changed, 23 insertions(+), 23 deletions(-)
> >> > >> > 
> >> > >> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> >> > >> > index dba868f2c3e3..87c927d912f6 100644
> >> > >> > --- a/fs/xfs/libxfs/xfs_format.h
> >> > >> > +++ b/fs/xfs/libxfs/xfs_format.h
> >> > >> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
> >> > >> >  	__be64		di_size;	/* number of bytes in file */
> >> > >> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
> >> > >> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
> >> > >> > -	__be32		di_nextents;	/* number of extents in data fork */
> >> > >> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
> >> > >> > +	__be32		di_nextents32;	/* number of extents in data fork */
> >> > >> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
> >> > >> 
> >> > >> 
> >> > >> Hmmm. Having the same field in the inode hold the extent count
> >> > >> for different inode forks based on a bit in the superblock means the
> >> > >> on-disk inode format is not self describing. i.e. we can't decode
> >> > >> the on-disk contents of an inode correctly without knowing whether a
> >> > >> specific feature bit is set in the superblock or not.
> >> > >
> >> > > Hmmmm - I just realised that there is an inode flag that indicates
> >> > > the format is different. It's just that most of the code doing
> >> > > conditional behaviour is using the superblock flag, not the inode
> >> > > flag as the conditional.
> >> > >
> >> > > So it is self describing, but I still don't like the way the same
> >> > > field is used for the different forks. It just feels like we are
> >> > > placing a landmine that we are going to forget about and step
> >> > > on in the future....
> >> > >
> >> > 
> >> > Sorry, I missed this response from you.
> >> > 
> >> > I agree with your suggestion. I will use the inode version number to help in
> >> > deciding which extent counter fields are valid for a specific inode.
> >> 
> >> No, don't do something I suggested with a flawed understanding of
> >> the code.
> >> 
> >> Just because *I* suggest something, it doesn't mean you have to make that
> >> change. That is reacting to *who* said something, not *what was
> >> said*.
> >> 
> >> So, I may have reservations about the way the storage definitions
> >> are being redefined, but if I had a valid, technical argument I
> >> could give right now I would have said so directly. I can't put my
> >> finger on why this worries me in this case but didn't for something
> >> like, say, the BIGTIME feature which redefined the contents of
> >> various fields in the inode.
> >> 
> >> IOWs, I haven't really had time to think and go back over the rest
> >> of the patchset since I realised my mistake and determine if that
> >> changes what I think about this, so don't go turning the patchset
> >> upside down just because *I suggested something*.
> >
> > So, looking over the patchset more, I think I understand my feeling
> > a bit better. Inconsistency is a big part of it.
> >
> > The in-memory extent counts are held in the struct xfs_inode_fork
> > and not the inode. The type is a xfs_extcnt_t - it's not a size
> > dependent type. Indeed, there are actually no users of the
> > xfs_aextcnt_t variable in XFS at all any more. It should be removed.
> >
> > What this means is that in-memory inode extent counting just doesn't
> > discriminate between inode fork types. They are all 64 bit counters,
> > and all the limits applied to them should be 64 bit types. Even the
> > checks for overflow are abstracted away by
> > xfs_iext_count_may_overflow(), so none of the extent manipulation
> > code has any idea there are different types and limits in the
> > on-disk format.
> >
> > That's good.
> >
> > The only place the actual type matters is when looking at the raw
> > disk inode and, unfortunately, that's where it gets messy. Anything
> > accessing the on-disk inode directly has to look at inode version
> > number, and an inode feature flag to interpret the inode format
> > correctly.  That format is then reflected in an in-memory inode
> > feature flag, and then there's the superblock feature flag on top of
> > that to indicate that there are NREXT64 format inodes in the
> > filesystem.
> >
> > Then there's implied dynamic upgrades of the on-disk inode format.
> > We see that being implied in xfs_inode_to_disk_iext_counters() and
> > xfs_trans_log_inode() but the filesystem format can't be changed
> > dynamically. i.e. we can't create new NREXT64 inodes if the
> > superblock flag is not set, so there is no code in this patchset
> > that I can see that provides a trigger for a dynamic upgrade to
> > start. IOWs, the filesystem has to be taken offline to change the
> > superblock feature bit, and the setup of the default NREXT64 inode
> > flag at mount time re-inforces this.
> >
> > With this in mind, I started to see inconsistent use of inode
> > feature flag vs superblock feature flag to determine on-disk inode
> > extent count limits. e.g. look at xfs_iext_count_may_overflow() and
> > xfs_iext_max_nextents(). Both of these are determining the maximum
> > number of extents that are valid for an inode, and they look at the
> > -superblock feature bit- to determine the limits.
> >
> > This only works if all inodes in the filesystem have the same
> > format, which is not true if we are doing dynamic upgrades of the
> > inode features. The most obvious case here is that scrub needs to
> > determine the layout and limits based on the current feature bits in
> > the inode, not the superblock feature bit.
> >
> > Then we have to look at how the upgrade is performed - by changing
> > the in-memory inode flag during xfs_trans_log_inode() when the inode
> > is dirtied. When we are modifying the inode for extent allocation,
> > we check the extent count limits on the inode *before* we dirty the
> > inode. Hence the only way an "upgrade at overflow thresholds" can
> > actually work is if we don't use the inode flag for determining
> > limits but instead use the superblock feature bit limits. But as
> > I've already pointed out, that leads to other problems.
> >
> > When we are converting an inode format, we currently do it when the
> > inode is first brought into memory and read from disk (i.e.
> > xfs_inode_from_disk()). We do the full conversion at this point in
> > time, such that if the inode is dirtied in memory all the correct
> > behaviour for the new format occurs and the writeback is done in the
> > new format.
> >
> > This would allow xfs_iext_count_may_overflow/xfs_iext_max_nextents
> > to actually return the correct limits for the inode as it is being
> > modified and not have to rely on superblock feature bits. If the
> > inode is not being modified, then the in-memory format changes are
> > discarded when the inode is reclaimed from memory and nothing
> > changes on disk.
> >
> > This means that once we've read the inode in from disk and set up
> > ip->i_diflags2 according to the superblock feature bit, we can use
> > the in-memory inode flag -everywhere- we need to find and/or check
> > limits during modifications. Yes, I know that the BIGTIME upgrade
> > path does this, but that doesn't have limits that prevent
> > modifications from taking place before we can log the inode and set
> > the BIGTIME flag....
> >
> 
> Ok. The above solution looks logically correct. I haven't been able to come up
> with a scenario where the solution wouldn't work. I will implement it and see
> if anything breaks.

I think I can poke one hole in it - I missed the fact that if we
upgrade at inode read time, and then we modify the inode without
modifying the inode core (can we even do that - metadata mods should
at least change timestamps right?) then we don't log the format
change or the NREXT64 inode flag change and they only appear in the
on-disk inode at writeback.

Log recovery needs to be checked for correct behaviour here. I think
that if the inode is in NREXT64 format when read in and the log
inode core is not, then the on disk LSN must be more recent than
what is being recovered from the log and should be skipped. If
NREXT64 is present in the log inode, then we logged the core
properly and we just don't care what format is on disk because we
replay it into NREXT64 format and write that back.

So I *think* we're ok here, but it needs closer inspection to
determine whether the behaviour is actually safe. If it is safe, then maybe in
future we can do the same thing for BIGTIME and get that upgrade out
of xfs_trans_log_inode() as well....
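
Written out as a decision table, the recovery behaviour I'm expecting
looks something like the sketch below. This is only a restatement of
the reasoning above as standalone C, not the actual recovery code: the
sketch_* names are hypothetical, and the "skip" case rests on the
unverified assumption that an NREXT64 inode on disk with a non-NREXT64
log inode core always means the on-disk LSN is newer.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_recover_action {
	SKETCH_SKIP_STALE_LOG_ITEM,	/* on-disk inode is newer, don't replay */
	SKETCH_REPLAY_AS_NREXT64,	/* core was logged in the new format */
	SKETCH_REPLAY_AS_LEGACY,	/* nothing was upgraded anywhere */
};

/*
 * Decide what to do with a logged inode core, given which format the
 * on-disk inode and the log inode core are each in.
 */
static enum sketch_recover_action
sketch_recover_inode_format(
	bool	disk_has_nrext64,
	bool	log_has_nrext64)
{
	/* Logged in NREXT64 format: the on-disk format doesn't matter. */
	if (log_has_nrext64)
		return SKETCH_REPLAY_AS_NREXT64;

	/*
	 * Assumption: the disk copy can only be NREXT64 here if it was
	 * written back after this log item, so the item must be stale.
	 */
	if (disk_has_nrext64)
		return SKETCH_SKIP_STALE_LOG_ITEM;

	return SKETCH_REPLAY_AS_LEGACY;
}

/* Walk all four combinations so the table is easy to eyeball. */
static void
sketch_recover_selfcheck(void)
{
	assert(sketch_recover_inode_format(false, false) ==
	       SKETCH_REPLAY_AS_LEGACY);
	assert(sketch_recover_inode_format(false, true) ==
	       SKETCH_REPLAY_AS_NREXT64);
	assert(sketch_recover_inode_format(true, true) ==
	       SKETCH_REPLAY_AS_NREXT64);
	assert(sketch_recover_inode_format(true, false) ==
	       SKETCH_SKIP_STALE_LOG_ITEM);
}
```

The case that needs the closer inspection is the last one in the
self-check; the other three follow directly from what was logged.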

> > ---
> >
> > FWIW, I also think doing something like this would help make the
> > code be easier to read and confirm that it is obviously correct when
> > reading it:
> >
> > 	__be32          di_gid;         /* owner's group id */
> > 	__be32          di_nlink;       /* number of links to file */
> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
> > 	__be16          di_projid_hi;   /* higher part owner's project id */
> > 	union {
> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
> > 		struct {
> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
> > 		};
> > 	};
> > 	xfs_timestamp_t di_atime;       /* time last accessed */
> > 	xfs_timestamp_t di_mtime;       /* time last modified */
> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
> > 	__be64          di_size;        /* number of bytes in file */
> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
> > 	union {
> > 		struct {
> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
> > 		};
> > 		struct {
> > 			__be32	di_nextents;    /* !NREXT64 data extents */
> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
> > 		};
> > 	};
> > 	__u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
> > 	__s8            di_aformat;     /* format of attr fork's data */
> > ...
> >
> > Then we get something like:
> >
> > static inline void
> > xfs_inode_to_disk_iext_counters(
> >        struct xfs_inode        *ip,
> >        struct xfs_dinode       *to)
> > {
> >        if (xfs_inode_has_nrext64(ip)) {
> >                to->di_big_dextcnt = cpu_to_be64(xfs_ifork_nextents(&ip->i_df));
> >                to->di_big_aextcnt = cpu_to_be32(xfs_ifork_nextents(ip->i_afp));
> >                to->di_nrext64_pad = 0;
> >        } else {
> >                to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
> >                to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
> >        }
> > }
> >
> > It is now obvious that we are writing to the correct fields
> > in the inode for the feature bits that are set, and we don't need
> > to zero the di_big_dextcnt field because that's been taken care of
> > by the existing di_v2_pad/flushiter zeroing. That bit could probably
> > be improved by unwinding and open coding this in xfs_inode_to_disk(),
> > but I think what I'm proposing should be obvious now...
> >
> 
> Yes, the explanation you provided is very clear. I will implement these
> suggestions.

Don't forget to try to poke holes in it and look for complexity that
can be removed before you try to implement or optimise anything.

FWIW, the code design concept I'm basing this on is that complexity
should be contained within the structures that store the data,
rather than be directly exposed to the code that manipulates the
data.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-09-30 22:55               ` Dave Chinner
@ 2021-10-07 10:52                 ` Chandan Babu R
  2021-10-10 21:49                   ` Dave Chinner
  0 siblings, 1 reply; 42+ messages in thread
From: Chandan Babu R @ 2021-10-07 10:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 01 Oct 2021 at 04:25, Dave Chinner wrote:
> On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> On Wed, Sep 29, 2021 at 10:33:23PM +0530, Chandan Babu R wrote:
>> >> > On 28 Sep 2021 at 09:34, Dave Chinner wrote:
>> >> > > On Tue, Sep 28, 2021 at 09:46:37AM +1000, Dave Chinner wrote:
>> >> > >> On Thu, Sep 16, 2021 at 03:36:42PM +0530, Chandan Babu R wrote:
>> >> > >> > This commit renames extent counter fields in "struct xfs_dinode" and "struct
>> >> > >> > xfs_log_dinode" based on the width of the fields. As of this commit, the
>> >> > >> > 32-bit field will be used to count data fork extents and the 16-bit field will
>> >> > >> > be used to count attr fork extents.
>> >> > >> > 
>> >> > >> > This change is done to enable a future commit to introduce a new 64-bit extent
>> >> > >> > counter field.
>> >> > >> > 
>> >> > >> > Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
>> >> > >> > ---
>> >> > >> >  fs/xfs/libxfs/xfs_format.h      |  8 ++++----
>> >> > >> >  fs/xfs/libxfs/xfs_inode_buf.c   |  4 ++--
>> >> > >> >  fs/xfs/libxfs/xfs_log_format.h  |  4 ++--
>> >> > >> >  fs/xfs/scrub/inode_repair.c     |  4 ++--
>> >> > >> >  fs/xfs/scrub/trace.h            | 14 +++++++-------
>> >> > >> >  fs/xfs/xfs_inode_item.c         |  4 ++--
>> >> > >> >  fs/xfs/xfs_inode_item_recover.c |  8 ++++----
>> >> > >> >  7 files changed, 23 insertions(+), 23 deletions(-)
>> >> > >> > 
>> >> > >> > diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
>> >> > >> > index dba868f2c3e3..87c927d912f6 100644
>> >> > >> > --- a/fs/xfs/libxfs/xfs_format.h
>> >> > >> > +++ b/fs/xfs/libxfs/xfs_format.h
>> >> > >> > @@ -802,8 +802,8 @@ typedef struct xfs_dinode {
>> >> > >> >  	__be64		di_size;	/* number of bytes in file */
>> >> > >> >  	__be64		di_nblocks;	/* # of direct & btree blocks used */
>> >> > >> >  	__be32		di_extsize;	/* basic/minimum extent size for file */
>> >> > >> > -	__be32		di_nextents;	/* number of extents in data fork */
>> >> > >> > -	__be16		di_anextents;	/* number of extents in attribute fork*/
>> >> > >> > +	__be32		di_nextents32;	/* number of extents in data fork */
>> >> > >> > +	__be16		di_nextents16;	/* number of extents in attribute fork*/
>> >> > >> 
>> >> > >> 
>> >> > >> Hmmm. Having the same field in the inode hold the extent count
>> >> > >> for different inode forks based on a bit in the superblock means the
>> >> > >> on-disk inode format is not self describing. i.e. we can't decode
>> >> > >> the on-disk contents of an inode correctly without knowing whether a
>> >> > >> specific feature bit is set in the superblock or not.
>> >> > >
>> >> > > Hmmmm - I just realised that there is an inode flag that indicates
>> >> > > the format is different. It's just that most of the code doing
>> >> > > conditional behaviour is using the superblock flag, not the inode
>> >> > > flag as the conditional.
>> >> > >
>> >> > > So it is self describing, but I still don't like the way the same
>> >> > > field is used for the different forks. It just feels like we are
>> >> > > placing a landmine that we are going to forget about and step
>> >> > > on in the future....
>> >> > >
>> >> > 
>> >> > Sorry, I missed this response from you.
>> >> > 
>> >> > I agree with your suggestion. I will use the inode version number to help in
>> >> > deciding which extent counter fields are valid for a specific inode.
>> >> 
>> >> No, don't do something I suggested with a flawed understanding of
>> >> the code.
>> >> 
>> >> Just because *I* suggest something, it doesn't mean you have to make
>> >> that change. That is reacting to *who* said something, not *what was
>> >> said*.
>> >> 
>> >> So, I may have reservations about the way the storage definitions
>> >> are being redefined, but if I had a valid, technical argument I
>> >> could give right now I would have said so directly. I can't put my
>> >> finger on why this worries me in this case but didn't for something
>> >> like, say, the BIGTIME feature which redefined the contents of
>> >> various fields in the inode.
>> >> 
>> >> IOWs, I haven't really had time to think and go back over the rest
>> >> of the patchset since I realised my mistake and determine if that
>> >> changes what I think about this, so don't go turning the patchset
>> >> upside down just because *I suggested something*.
>> >
>> > So, looking over the patchset more, I think I understand my feeling
>> > a bit better. Inconsistency is a big part of it.
>> >
>> > The in-memory extent counts are held in the struct xfs_inode_fork
>> > and not the inode. The type is a xfs_extcnt_t - it's not a size
>> > dependent type. Indeed, there are actually no users of the
>> > xfs_aextcnt_t variable in XFS at all any more. It should be removed.
>> >
>> > What this means is that in-memory inode extent counting just doesn't
>> > discriminate between inode fork types. They are all 64 bit counters,
>> > and all the limits applied to them should be 64 bit types. Even the
>> > checks for overflow are abstracted away by
>> > xfs_iext_count_may_overflow(), so none of the extent manipulation
>> > code has any idea there are different types and limits in the
>> > on-disk format.
>> >
>> > That's good.
>> >
>> > The only place the actual type matters is when looking at the raw
>> > disk inode and, unfortunately, that's where it gets messy. Anything
>> > accessing the on-disk inode directly has to look at inode version
>> > number, and an inode feature flag to interpret the inode format
>> > correctly.  That format is then reflected in an in-memory inode
>> > feature flag, and then there's the superblock feature flag on top of
>> > that to indicate that there are NREXT64 format inodes in the
>> > filesystem.
>> >
>> > Then there's implied dynamic upgrades of the on-disk inode format.
>> > We see that being implied in xfs_inode_to_disk_iext_counters() and
>> > xfs_trans_log_inode() but the filesystem format can't be changed
>> > dynamically. i.e. we can't create new NREXT64 inodes if the
>> > superblock flag is not set, so there is no code in this patchset
>> > that I can see that provides a trigger for a dynamic upgrade to
>> > start. IOWs, the filesystem has to be taken offline to change the
>> > superblock feature bit, and the setup of the default NREXT64 inode
>> > flag at mount time re-inforces this.
>> >
>> > With this in mind, I started to see inconsistent use of inode
>> > feature flag vs superblock feature flag to determine on-disk inode
>> > extent count limits. e.g. look at xfs_iext_count_may_overflow() and
>> > xfs_iext_max_nextents(). Both of these are determining the maximum
>> > number of extents that are valid for an inode, and they look at the
>> > -superblock feature bit- to determine the limits.
>> >
>> > This only works if all inodes in the filesystem have the same
>> > format, which is not true if we are doing dynamic upgrades of the
>> > inode features. The most obvious case here is that scrub needs to
>> > determine the layout and limits based on the current feature bits in
>> > the inode, not the superblock feature bit.
>> >
>> > Then we have to look at how the upgrade is performed - by changing
>> > the in-memory inode flag during xfs_trans_log_inode() when the inode
>> > is dirtied. When we are modifying the inode for extent allocation,
>> > we check the extent count limits on the inode *before* we dirty the
>> > inode. Hence the only way an "upgrade at overflow thresholds" can
>> > actually work is if we don't use the inode flag for determining
>> > limits but instead use the superblock feature bit limits. But as
>> > I've already pointed out, that leads to other problems.
>> >
>> > When we are converting an inode format, we currently do it when the
>> > inode is first brought into memory and read from disk (i.e.
>> > xfs_inode_from_disk()). We do the full conversion at this point in
>> > time, such that if the inode is dirtied in memory all the correct
>> > behaviour for the new format occurs and the writeback is done in the
>> > new format.
>> >
>> > This would allow xfs_iext_count_may_overflow/xfs_iext_max_nextents
>> > to actually return the correct limits for the inode as it is being
>> > modified and not have to rely on superblock feature bits. If the
>> > inode is not being modified, then the in-memory format changes are
>> > discarded when the inode is reclaimed from memory and nothing
>> > changes on disk.
>> >
>> > This means that once we've read the inode in from disk and set up
>> > ip->i_diflags2 according to the superblock feature bit, we can use
>> > the in-memory inode flag -everywhere- we need to find and/or check
>> > limits during modifications. Yes, I know that the BIGTIME upgrade
>> > path does this, but that doesn't have limits that prevent
>> > modifications from taking place before we can log the inode and set
>> > the BIGTIME flag....
>> >
>> 
>> Ok. The above solution looks logically correct. I haven't been able to come up
>> with a scenario where the solution wouldn't work. I will implement it and see
>> if anything breaks.
>
> I think I can poke one hole in it - I missed the fact that if we
> upgrade at inode read time, and then we modify the inode without
> modifying the inode core (can we even do that - metadata mods should
> at least change timestamps right?) then we don't log the format
> change or the NREXT64 inode flag change and they only appear in the
> on-disk inode at writeback.
>
> Log recovery needs to be checked for correct behaviour here. I think
> that if the inode is in NREXT64 format when read in and the log
> inode core is not, then the on disk LSN must be more recent than
> what is being recovered from the log and should be skipped. If
> NREXT64 is present in the log inode, then we logged the core
> properly and we just don't care what format is on disk because we
> replay it into NREXT64 format and write that back.

xfs_inode_item_format() logs the inode core regardless of whether
XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
should not result in a scenario where the corresponding
xfs_log_dinode->di_flags2 will not have NREXT64 bit set.

If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
field, then we can safely conclude that the ondisk inode has to be updated to
reflect this change i.e. there is no need to compare LSNs of the checkpoint
transaction being replayed and that of the disk inode.

>
> So I *think* we're ok here, but it needs closer inspection to
> determine behaviour is actually safe. If it is safe, then maybe in
> future we can do the same thing for BIGTIME and get that upgrade out
> of xfs_trans_log_inode() as well....
>
>> > ---
>> >
>> > FWIW, I also think doing something like this would help make the
>> > code be easier to read and confirm that it is obviously correct when
>> > reading it:
>> >
>> > 	__be32          di_gid;         /* owner's group id */
>> > 	__be32          di_nlink;       /* number of links to file */
>> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
>> > 	__be16          di_projid_hi;   /* higher part owner's project id */
>> > 	union {
>> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
>> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
>> > 		struct {
>> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
>> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
>> > 		};
>> > 	};
>> > 	xfs_timestamp_t di_atime;       /* time last accessed */
>> > 	xfs_timestamp_t di_mtime;       /* time last modified */
>> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
>> > 	__be64          di_size;        /* number of bytes in file */
>> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
>> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
>> > 	union {
>> > 		struct {
>> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
>> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
>> > 		};
>> > 		struct {
>> > 			__be32	di_nextents;    /* !NREXT64 data extents */
>> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
>> > 		};
>> > 	};

The two structures above result in padding and hence result in a hole being
introduced. The entire union above can be replaced with the following,

        union {
                __be32  di_big_aextcnt; /* NREXT64 attr extents */
                __be32  di_nextents;    /* !NREXT64 data extents */
        };
        union {
                __be16  di_nrext64_pad; /* NREXT64 unused, zero */
                __be16  di_anextents;   /* !NREXT64 attr extents */
        };
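
To make the hole concrete, here is a minimal userspace sketch. It uses
uint32_t/uint16_t as stand-ins for __be32/__be16, the struct names are
hypothetical, and it assumes a typical ABI where uint32_t is 4-byte aligned
and no packing attribute is applied:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Grouped by feature: two anonymous structs inside one union. */
struct grouped_layout {
	uint32_t extsize;
	union {
		struct {
			uint32_t big_aextcnt;	/* NREXT64 attr extents */
			uint16_t nrext64_pad;	/* NREXT64 unused, zero */
		};
		struct {
			uint32_t nextents;	/* !NREXT64 data extents */
			uint16_t anextents;	/* !NREXT64 attr extents */
		};
	};
	uint8_t forkoff;	/* the field that follows the counters */
	int8_t  aformat;
};

/* Grouped by field: one union per on-disk slot. */
struct split_layout {
	uint32_t extsize;
	union {
		uint32_t big_aextcnt;
		uint32_t nextents;
	};
	union {
		uint16_t nrext64_pad;
		uint16_t anextents;
	};
	uint8_t forkoff;
	int8_t  aformat;
};

/* Bytes by which the grouped layout pushes forkoff further out. */
static size_t hole_bytes(void)
{
	return offsetof(struct grouped_layout, forkoff) -
	       offsetof(struct split_layout, forkoff);
}
```

Each anonymous struct is rounded up from 6 to 8 bytes by its 4-byte alignment,
so the field following the union moves by two bytes in the grouped layout.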

>> > 	__u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
>> > 	__s8            di_aformat;     /* format of attr fork's data */
>> > ...
>> >
>> > Then we get something like:
>> >
>> > static inline void
>> > xfs_inode_to_disk_iext_counters(
>> >        struct xfs_inode        *ip,
>> >        struct xfs_dinode       *to)
>> > {
>> >        if (xfs_inode_has_nrext64(ip)) {
>> >                to->di_big_dextcnt = cpu_to_be64(xfs_ifork_nextents(&ip->i_df));
>> >                to->di_big_aextcnt = cpu_to_be32(xfs_ifork_nextents(ip->i_afp));
>> >                to->di_nrext64_pad = 0;
>> >        } else {
>> >                to->di_nextents = cpu_to_be32(xfs_ifork_nextents(&ip->i_df));
>> >                to->di_anextents = cpu_to_be16(xfs_ifork_nextents(ip->i_afp));
>> >        }
>> > }
>> >
>> > This is now obvious that we are writing to the correct fields
>> > in the inode for the feature bits that are set, and we don't need
>> > to zero the di_big_dextcnt field because that's been taken care of
>> > by the existing di_v2_pad/flushiter zeroing. That bit could probably
>> > be improved by unwinding and open coding this in xfs_inode_to_disk(),
>> > but I think what I'm proposing should be obvious now...
>> >
>> 
>> Yes, the explanation provided by you is very clear. I will implement these
>> suggestions.
>
> Don't forget to try to poke holes in it and look for complexity that
> can be removed before you try to implement or optimise anything.
>
> FWIW, the code design concept I'm basing this on is that complexity
> should be contained within the structures that store the data,
> rather than be directly exposed to the code that manipulates the
> data.
>

To summarize the design,

- We need both the per-inode flag (for satisfying the requirement of
  self-describing metadata) and superblock flag (since an older kernel should
  not be allowed to mount an fs containing inodes with large extent counters).

- When an allocated inode is read from disk, the incore inode's NREXT64 bit in
  di_flags2 field should be set if the superblock has NREXT64 feature enabled.

- Any modification to an inode is guaranteed to cause logging of its di_flags2
  field. Hence xfs_iext_max_nextents() can depend on an inode's di_flags2
  field's NREXT64 bit to determine the maximum extent count.

- Newly allocated inodes will have NREXT64 bit set in di_flags2 field by
  default due to xfs_ino_geometry->new_diflags2 having XFS_DIFLAG2_NREXT64 bit
  set.
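
The points above can be sketched in userspace C roughly as follows. All names
here are hypothetical stand-ins, not the patchset's code. The limits are taken
from the cover letter (48-bit data and 32-bit attr counters with NREXT64,
signed 32-bit and signed 16-bit counters without it):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum fork_type { DATA_FORK = 0, ATTR_FORK = 1 };

#define MAX_EXTCNT_DATA_LARGE	((UINT64_C(1) << 48) - 1)
#define MAX_EXTCNT_ATTR_LARGE	((UINT64_C(1) << 32) - 1)
#define MAX_EXTCNT_DATA_SMALL	((UINT64_C(1) << 31) - 1)
#define MAX_EXTCNT_ATTR_SMALL	((UINT64_C(1) << 15) - 1)

/* Hypothetical in-core inode: one flag plus per-fork extent counts. */
struct inode_sketch {
	bool		nrext64;	/* models the NREXT64 bit in di_flags2 */
	uint64_t	nextents[2];	/* indexed by enum fork_type */
};

/* At read time the in-core flag follows the superblock feature bit. */
static void inode_read_upgrade(struct inode_sketch *ip, bool sb_has_nrext64)
{
	if (sb_has_nrext64)
		ip->nrext64 = true;
}

/* The limit depends only on the per-inode flag, never on the superblock. */
static uint64_t max_extcnt(const struct inode_sketch *ip, enum fork_type fork)
{
	if (ip->nrext64)
		return fork == DATA_FORK ? MAX_EXTCNT_DATA_LARGE
					 : MAX_EXTCNT_ATTR_LARGE;
	return fork == DATA_FORK ? MAX_EXTCNT_DATA_SMALL
				 : MAX_EXTCNT_ATTR_SMALL;
}

/* True if adding nr_to_add extents would exceed the fork's limit. */
static bool extcnt_may_overflow(const struct inode_sketch *ip,
				enum fork_type fork, uint64_t nr_to_add)
{
	uint64_t max = max_extcnt(ip, fork);
	uint64_t cur = ip->nextents[fork];

	return cur > max || nr_to_add > max - cur;
}
```

Because the flag is set before any modification is attempted, the overflow
check sees the large limits even for an inode that is still in the old format
on disk.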

Apart from the regular fs operations, the on-disk format changes introduced
above seem to work well with log replay, scrub and xfs_repair.

-- 
chandan


* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-10-07 10:52                 ` Chandan Babu R
@ 2021-10-10 21:49                   ` Dave Chinner
  2021-10-13 14:44                     ` Chandan Babu R
  0 siblings, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-10-10 21:49 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
> >> >
> >> 
> >> Ok. The above solution looks logically correct. I haven't been able to come up
> >> with a scenario where the solution wouldn't work. I will implement it and see
> >> if anything breaks.
> >
> > I think I can poke one hole in it - I missed the fact that if we
> > upgrade at inode read time, and then we modify the inode without
> > modifying the inode core (can we even do that - metadata mods should
> > at least change timestamps right?) then we don't log the format
> > change or the NREXT64 inode flag change and they only appear in the
> > on-disk inode at writeback.
> >
> > Log recovery needs to be checked for correct behaviour here. I think
> > that if the inode is in NREXT64 format when read in and the log
> > inode core is not, then the on disk LSN must be more recent than
> > what is being recovered from the log and should be skipped. If
> > NREXT64 is present in the log inode, then we logged the core
> > properly and we just don't care what format is on disk because we
> > replay it into NREXT64 format and write that back.
> 
> xfs_inode_item_format() logs the inode core regardless of whether
> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
> should not result in a scenario where the corresponding
> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.

Except that log recovery might be replaying lots of inode changes
such as:

log inode
commit A
log inode
commit B
log inode
set NREXT64
commit C
writeback inode
<crash before log tail moves>

Recovery will then replay commit A, B and C, in which case we *must
not recover the log inode* in commit A or B because the LSN in the
on-disk inode points at commit C. Hence replaying A or B will result
in the on-disk inode going backwards in time and hence resulting in
an inconsistent state on disk until commit C is recovered.
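
A small userspace model of that ordering rule, as a sketch only. The helper
names are hypothetical, and the LSN layout (cycle number in the upper 32 bits,
block number in the lower 32 bits) is stated here as an assumption of the
model:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* stand-in for xfs_lsn_t */

/* Pack a log cycle number and a block number into one LSN. */
static lsn_t make_lsn(uint32_t cycle, uint32_t block)
{
	return (lsn_t)(((uint64_t)cycle << 32) | block);
}

/* Order LSNs by cycle first, then by block: returns <0, 0 or >0. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	uint32_t ca = (uint32_t)((uint64_t)a >> 32);
	uint32_t cb = (uint32_t)((uint64_t)b >> 32);
	uint32_t ba = (uint32_t)a;
	uint32_t bb = (uint32_t)b;

	if (ca != cb)
		return ca < cb ? -1 : 1;
	if (ba != bb)
		return ba < bb ? -1 : 1;
	return 0;
}

/* True when the on-disk inode is newer than the commit being replayed. */
static bool skip_replay(lsn_t disk_lsn, lsn_t commit_lsn)
{
	return lsn_cmp(disk_lsn, commit_lsn) > 0;
}

/* Count the commits that would be replayed onto the on-disk inode. */
static int count_replayed(lsn_t disk_lsn, const lsn_t *commits, int n)
{
	int replayed = 0;

	for (int i = 0; i < n; i++)
		if (!skip_replay(disk_lsn, commits[i]))
			replayed++;
	return replayed;
}
```

With the on-disk inode stamped with commit C's LSN, A and B are skipped and
only C is replayed, so the inode never goes backwards in time.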

> i.e. there is no need to compare LSNs of the checkpoint
> transaction being replayed and that of the disk inode.

Incorrect: we -always- have to do this, regardless of the change
being made.

> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
> field, then we can safely conclude that the ondisk inode has to be updated to
> reflect this change

We can't assume that. This makes an assumption that NREXT64 is
only ever a one-way transition. There's nothing in the disk format that
prevents us from -removing- NREXT64 for inodes that don't need large
extent counts.

Yes, the -current implementation- does not allow going back to small
extent counts, but the on-disk format design still needs to allow
for such things to be done as we may need such functionality and
flexibility in the on-disk format in the future.

Hence we have to ensure that log recovery handles both set and reset
transitions from the start. If we don't ensure that log recovery
handles reset conditions when we first add the feature bit, then
we are going to have to add a log incompat or another feature bit
to stop older kernels from trying to recover reset operations.

IOWs, the only determining factor as to whether we should replay an
inode is the LSN of the on-disk inode vs the LSN of the transaction
being replayed. Feature bits in either the on-disk or log inode are
not reliable indicators of whether a dynamically set feature is
active or not at the time the inode item is being replayed...

> >> > FWIW, I also think doing something like this would help make the
> >> > code be easier to read and confirm that it is obviously correct when
> >> > reading it:
> >> >
> >> > 	__be32          di_gid;         /* owner's group id */
> >> > 	__be32          di_nlink;       /* number of links to file */
> >> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
> >> > 	__be16          di_projid_hi;   /* higher part owner's project id */
> >> > 	union {
> >> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
> >> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
> >> > 		struct {
> >> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
> >> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
> >> > 		};
> >> > 	};
> >> > 	xfs_timestamp_t di_atime;       /* time last accessed */
> >> > 	xfs_timestamp_t di_mtime;       /* time last modified */
> >> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
> >> > 	__be64          di_size;        /* number of bytes in file */
> >> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
> >> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
> >> > 	union {
> >> > 		struct {
> >> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
> >> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
> >> > 		};
> >> > 		struct {
> >> > 			__be32	di_nextents;    /* !NREXT64 data extents */
> >> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
> >> > 		};
> >> > 	};
> 
> The two structures above result in padding and hence result in a hole being
> introduced. The entire union above can be replaced with the following,
> 
>         union {
>                 __be32  di_big_aextcnt; /* NREXT64 attr extents */
>                 __be32  di_nextents;    /* !NREXT64 data extents */
>         };
>         union {
>                 __be16  di_nrext64_pad; /* NREXT64 unused, zero */
>                 __be16  di_anextents;   /* !NREXT64 attr extents */
>         };

I don't think this makes sense. This groups by field rather than
by feature layout. It doesn't make it clear at all that these
variables both change definition at the same time - they are either
{di_nexts, di_anexts} pair or a {di_big_aexts, pad} pair. That's the
whole point of using anonymous structs here - it defines and
documents the relationship between the layouts when certain features
are set rather than relying on people to parse the comments
correctly to determine the relationship....
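
One possible way to keep the per-feature grouping without growing the
structure is sketched below. This is an illustration under assumptions, not
what the patchset does: it relies on the GCC/Clang packed attribute, uses
uint*_t in place of the __be* types, and every name is hypothetical. The
writer also zeroes the 64-bit field explicitly in the old format instead of
relying on the existing pad zeroing:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct disk_counters {
	uint64_t big_dextcnt;	/* NREXT64 data extents, else zeroed pad */
	uint32_t extsize;
	union {
		struct {	/* NREXT64 layout */
			uint32_t big_aextcnt;
			uint16_t nrext64_pad;
		} __attribute__((packed));
		struct {	/* !NREXT64 layout */
			uint32_t nextents;
			uint16_t anextents;
		} __attribute__((packed));
	};
	uint8_t forkoff;
};

/* Return v laid out in memory as big-endian, whatever the host order. */
static uint64_t to_be64(uint64_t v)
{
	uint8_t b[8];
	uint64_t out;

	for (int i = 0; i < 8; i++)
		b[i] = (uint8_t)(v >> (56 - 8 * i));
	memcpy(&out, b, sizeof(out));
	return out;
}

static uint32_t to_be32(uint32_t v)
{
	uint8_t b[4] = { (uint8_t)(v >> 24), (uint8_t)(v >> 16),
			 (uint8_t)(v >> 8), (uint8_t)v };
	uint32_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

static uint16_t to_be16(uint16_t v)
{
	uint8_t b[2] = { (uint8_t)(v >> 8), (uint8_t)v };
	uint16_t out;

	memcpy(&out, b, sizeof(out));
	return out;
}

/* Write both counters using the field pair that matches the format. */
static void counters_to_disk(struct disk_counters *to, bool nrext64,
			     uint64_t data_extents, uint32_t attr_extents)
{
	if (nrext64) {
		to->big_dextcnt = to_be64(data_extents);
		to->big_aextcnt = to_be32(attr_extents);
		to->nrext64_pad = 0;
	} else {
		to->big_dextcnt = 0;
		to->nextents = to_be32((uint32_t)data_extents);
		to->anextents = to_be16((uint16_t)attr_extents);
	}
}
```

Each branch touches exactly one anonymous struct, so the pairing of the fields
is visible in the type rather than only in the comments.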

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-10-10 21:49                   ` Dave Chinner
@ 2021-10-13 14:44                     ` Chandan Babu R
  2021-10-14  2:00                       ` Dave Chinner
  2021-10-21 10:27                       ` Chandan Babu R
  0 siblings, 2 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-10-13 14:44 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 11 Oct 2021 at 03:19, Dave Chinner wrote:
> On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> >
>> >> 
>> >> Ok. The above solution looks logically correct. I haven't been able to come up
>> >> with a scenario where the solution wouldn't work. I will implement it and see
>> >> if anything breaks.
>> >
>> > I think I can poke one hole in it - I missed the fact that if we
>> > upgrade at inode read time, and then we modify the inode without
>> > modifying the inode core (can we even do that - metadata mods should
>> > at least change timestamps right?) then we don't log the format
>> > change or the NREXT64 inode flag change and they only appear in the
>> > on-disk inode at writeback.
>> >
>> > Log recovery needs to be checked for correct behaviour here. I think
>> > that if the inode is in NREXT64 format when read in and the log
>> > inode core is not, then the on disk LSN must be more recent than
>> > what is being recovered from the log and should be skipped. If
>> > NREXT64 is present in the log inode, then we logged the core
>> > properly and we just don't care what format is on disk because we
>> > replay it into NREXT64 format and write that back.
>> 
>> xfs_inode_item_format() logs the inode core regardless of whether
>> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
>> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
>> should not result in a scenario where the corresponding
>> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
>
> Except that log recovery might be replaying lots of inode changes
> such as:
>
> log inode
> commit A
> log inode
> commit B
> log inode
> set NREXT64
> commit C
> writeback inode
> <crash before log tail moves>
>
> Recovery will then replay commit A, B and C, in which case we *must
> not recover the log inode* in commit A or B because the LSN in the
> on-disk inode points at commit C. Hence replaying A or B will result
> in the on-disk inode going backwards in time and hence resulting in
> an inconsistent state on disk until commit C is recovered.
>
>> i.e. there is no need to compare LSNs of the checkpoint
>> transaction being replayed and that of the disk inode.
>
> Incorrect: we -always- have to do this, regardless of the change
> being made.
>
>> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
>> field, then we can safely conclude that the ondisk inode has to be updated to
>> reflect this change
>
> We can't assume that. This makes an assumption that NREXT64 is
> only ever a one-way transition. There's nothing in the disk format that
> prevents us from -removing- NREXT64 for inodes that don't need large
> extent counts.
>
> Yes, the -current implementation- does not allow going back to small
> extent counts, but the on-disk format design still needs to allow
> for such things to be done as we may need such functionality and
> flexibility in the on-disk format in the future.
>
> Hence we have to ensure that log recovery handles both set and reset
> transitions from the start. If we don't ensure that log recovery
> handles reset conditions when we first add the feature bit, then
> we are going to have to add a log incompat or another feature bit
> to stop older kernels from trying to recover reset operations.
>

Ok. I had never considered the possibility of transitioning an inode back into
32-bit data fork extent count format. With this new requirement, I now
understand the reasoning behind comparing ondisk inode's LSN and checkpoint
transaction's LSN.

As you have mentioned earlier, comparing LSNs is required not only for the
change introduced in this patch, but also for any other change in value of any
of the inode's fields. Without such a comparison, the inode can temporarily
end up being in an inconsistent state during log replay.

To that end, the following code snippet from xlog_recover_inode_commit_pass2()
skips playing back xfs_log_dinode entries when the ondisk inode's LSN is greater
than the checkpoint transaction's LSN,

        if (dip->di_version >= 3) {
                xfs_lsn_t       lsn = be64_to_cpu(dip->di_lsn);

                if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
                        trace_xfs_log_recover_inode_skip(log, in_f);
                        error = 0;
                        goto out_owner_change;
                }
        }


However, if the commits in the sequence below belong to three different
checkpoint transactions having the same LSN,

log inode
commit A
log inode
commit B
set NREXT64
log inode
commit C
writeback inode
<crash before log tail moves>

Then the above code snippet won't prevent an inode from becoming temporarily
inconsistent due to commits A and B being replayed. To handle this, we should
probably go with the additional rule of "Replay log inode if both the log
inode and the ondisk inode have the same value for NREXT64 bit".

With that additional rule in place, the following sequence will result in a
consistent inode state even if all the three checkpoint transactions have the
same LSN,

log inode
commit A
set NREXT64
log inode
commit B
clear NREXT64
log inode
commit C
writeback inode
<crash before log tail moves>

i.e. Commit B won't be replayed.
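
The combined check can be sketched as below. This models the rule proposed in
this mail only; it is not existing recovery behaviour, and the function names
are hypothetical. A plain integer comparison stands in for XFS_LSN_CMP(),
which is equivalent for the non-negative LSNs used here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Existing rule: skip when the on-disk inode carries a newer LSN. */
static bool disk_is_newer(int64_t disk_lsn, int64_t commit_lsn)
{
	return disk_lsn != 0 && disk_lsn != -1 && disk_lsn > commit_lsn;
}

/*
 * Existing LSN rule plus the proposed one: replay only if the logged
 * inode and the on-disk inode agree on the NREXT64 bit.
 */
static bool should_replay(int64_t disk_lsn, bool disk_nrext64,
			  int64_t commit_lsn, bool logged_nrext64)
{
	if (disk_is_newer(disk_lsn, commit_lsn))
		return false;
	return disk_nrext64 == logged_nrext64;
}
```

For the second sequence above, with all three checkpoints sharing one LSN and
the on-disk inode written back with NREXT64 cleared, commits A and C pass the
check and commit B does not.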

Please let me know if my understanding is incorrect.

> IOWs, the only determining factor as to whether we should replay an
> inode is the LSN of the on-disk inode vs the LSN of the transaction
> being replayed. Feature bits in either the on-disk or log inode are
> not reliable indicators of whether a dynamically set feature is
> active or not at the time the inode item is being replayed...
>
>> >> > FWIW, I also think doing something like this would help make the
>> >> > code be easier to read and confirm that it is obviously correct when
>> >> > reading it:
>> >> >
>> >> > 	__be32          di_gid;         /* owner's group id */
>> >> > 	__be32          di_nlink;       /* number of links to file */
>> >> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
>> >> > 	__be16          di_projid_hi;   /* higher part owner's project id */
>> >> > 	union {
>> >> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
>> >> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
>> >> > 		struct {
>> >> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
>> >> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
>> >> > 		};
>> >> > 	};
>> >> > 	xfs_timestamp_t di_atime;       /* time last accessed */
>> >> > 	xfs_timestamp_t di_mtime;       /* time last modified */
>> >> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
>> >> > 	__be64          di_size;        /* number of bytes in file */
>> >> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
>> >> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
>> >> > 	union {
>> >> > 		struct {
>> >> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
>> >> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
>> >> > 		};
>> >> > 		struct {
>> >> > 			__be32	di_nextents;    /* !NREXT64 data extents */
>> >> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
>> >> > 		};
>> >> > 	};
>> 
>> The two structures above result in padding and hence result in a hole being
>> introduced. The entire union above can be replaced with the following,
>> 
>>         union {
>>                 __be32  di_big_aextcnt; /* NREXT64 attr extents */
>>                 __be32  di_nextents;    /* !NREXT64 data extents */
>>         };
>>         union {
>>                 __be16  di_nrext64_pad; /* NREXT64 unused, zero */
>>                 __be16  di_anextents;   /* !NREXT64 attr extents */
>>         };
>
> I don't think this makes sense. This groups by field rather than
> by feature layout. It doesn't make it clear at all that these
> variables both change definition at the same time - they are either
> {di_nexts, di_anexts} pair or a {di_big_aexts, pad} pair. That's the
> whole point of using anonymous structs here - it defines and
> documents the relationship between the layouts when certain features
> are set rather than relying on people to parse the comments
> correctly to determine the relationship....

Ok. I will need to check if there are alternative ways of arranging the fields
to accomplish the goal stated above. I will think about this and get back as
soon as possible.

-- 
chandan


* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-10-13 14:44                     ` Chandan Babu R
@ 2021-10-14  2:00                       ` Dave Chinner
  2021-10-14 10:07                         ` Chandan Babu R
  2021-10-21 10:27                       ` Chandan Babu R
  1 sibling, 1 reply; 42+ messages in thread
From: Dave Chinner @ 2021-10-14  2:00 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, djwong

On Wed, Oct 13, 2021 at 08:14:01PM +0530, Chandan Babu R wrote:
> On 11 Oct 2021 at 03:19, Dave Chinner wrote:
> > On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
> >> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
> >> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
> >> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
> >> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
> >> >> >
> >> >> 
> >> >> Ok. The above solution looks logically correct. I haven't been able to come up
> >> >> with a scenario where the solution wouldn't work. I will implement it and see
> >> >> if anything breaks.
> >> >
> >> > I think I can poke one hole in it - I missed the fact that if we
> >> > upgrade at inode read time, and then we modify the inode without
> >> > modifying the inode core (can we even do that - metadata mods should
> >> > at least change timestamps right?) then we don't log the format
> >> > change or the NREXT64 inode flag change and they only appear in the
> >> > on-disk inode at writeback.
> >> >
> >> > Log recovery needs to be checked for correct behaviour here. I think
> >> > that if the inode is in NREXT64 format when read in and the log
> >> > inode core is not, then the on disk LSN must be more recent than
> >> > what is being recovered from the log and should be skipped. If
> >> > NREXT64 is present in the log inode, then we logged the core
> >> > properly and we just don't care what format is on disk because we
> >> > replay it into NREXT64 format and write that back.
> >> 
> >> xfs_inode_item_format() logs the inode core regardless of whether
> >> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
> >> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
> >> should not result in a scenario where the corresponding
> >> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
> >
> > Except that log recovery might be replaying lots of inode changes
> > such as:
> >
> > log inode
> > commit A
> > log inode
> > commit B
> > log inode
> > set NREXT64
> > commit C
> > writeback inode
> > <crash before log tail moves>
> >
> > Recovery will then replay commit A, B and C, in which case we *must
> > not recover the log inode* in commit A or B because the LSN in the
> > on-disk inode points at commit C. Hence replaying A or B will result
> > in the on-disk inode going backwards in time and hence resulting in
> > an inconsistent state on disk until commit C is recovered.
> >
> >> i.e. there is no need to compare LSNs of the checkpoint
> >> transaction being replayed and that of the disk inode.
> >
> > Incorrect: we -always- have to do this, regardless of the change
> > being made.
> >
> >> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
> >> field, then we can safely conclude that the ondisk inode has to be updated to
> >> reflect this change
> >
> > We can't assume that. This makes an assumption that NREXT64 is
> > only ever a one-way transition. There's nothing in the disk format that
> > prevents us from -removing- NREXT64 for inodes that don't need large
> > extent counts.
> >
> > Yes, the -current implementation- does not allow going back to small
> > extent counts, but the on-disk format design still needs to allow
> > for such things to be done as we may need such functionality and
> > flexibility in the on-disk format in the future.
> >
> > Hence we have to ensure that log recovery handles both set and reset
> > transitions from the start. If we don't ensure that log recovery
> > handles reset conditions when we first add the feature bit, then
> > we are going to have to add a log incompat or another feature bit
> > to stop older kernels from trying to recover reset operations.
> >
> 
> Ok. I had never considered the possibility of transitioning an inode back into
> 32-bit data fork extent count format. With this new requirement, I now
> understand the reasoning behind comparing ondisk inode's LSN and checkpoint
> transaction's LSN.
> 
> As you have mentioned earlier, comparing LSNs is required not only for the
> change introduced in this patch, but also for any other change in value of any
> of the inode's fields. Without such a comparison, the inode can temporarily
> end up being in an inconsistent state during log replay.
> 
> To that end, the following code snippet from xlog_recover_inode_commit_pass2()
> skips playing back xfs_log_dinode entries when ondisk inode's LSN is greater
> than checkpoint transaction's LSN,
> 
>         if (dip->di_version >= 3) {
>                 xfs_lsn_t       lsn = be64_to_cpu(dip->di_lsn);
> 
>                 if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>                         trace_xfs_log_recover_inode_skip(log, in_f);
>                         error = 0;
>                         goto out_owner_change;
>                 }
>         }
> 
> 
> However, if the commits in the sequence below belong to three different
> checkpoint transactions having the same LSN,
> 
> log inode
> commit A
> log inode
> commit B
> set NREXT64
> log inode
> commit C
> writeback inode
> <crash before log tail moves>
> 
> Then the above code snippet won't prevent an inode from becoming temporarily
> inconsistent due to commits A and B being replayed.

Ah, this is a very special corner case.  You snipped out the most
important part of the comment above that code:

	/*
         * If the inode has an LSN in it, recover the inode only if the on-disk
         * inode's LSN is older than the lsn of the transaction we are
         * replaying. We can have multiple checkpoints with the same start LSN,
         * so the current LSN being equal to the on-disk LSN doesn't necessarily
         * mean that the on-disk inode is more recent than the change being
         * replayed.
....

This is exactly the situation you are asking about here - what
happens in recovery when the LSNs are the same and there are
multiple checkpoints with the same LSN.

The first thing to understand here is "how do we get checkpoints
with the same LSN" and then understand what it implies.

We get checkpoints with the same start/commit LSNs when multiple
checkpoints are written in the same iclog. The start/commit LSNs are
determined by the LSN of the iclog they are written in, and hence if
they are the same they were written to the journal in a single
"atomic" IO.

I say "atomic" because it's not an atomic IO at the hardware level.
It's atomic in that the entire iclog is protected by a CRC and hence
if the CRC check for the iclog passes at recovery, then the iclog
write has been recovered intact. If the write was torn, misdirected
or some other physical media failure occurred, then we don't
recover the iclog at all. IOWs, none of the changes in the iclog
are recovered: we have atomic "all or nothing" iclog recovery
semantics.
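
As a rough illustration of that all-or-nothing check, here is a
standalone sketch. It is not the xlog code: struct model_iclog and
iclog_recoverable() are made-up names, the record layout is invented,
and CRC32C is assumed as the checksum.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const void *buf, size_t len)
{
	const uint8_t *p = buf;
	uint32_t crc = 0xFFFFFFFFu;

	while (len--) {
		crc ^= *p++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;
}

/* Hypothetical journal record: a payload plus the CRC stored at write time. */
struct model_iclog {
	uint32_t	crc;
	size_t		len;
	uint8_t		payload[64];
};

/*
 * All or nothing: either the whole record verifies and every change in
 * it is eligible for replay, or the record is rejected as a unit.
 */
bool iclog_recoverable(const struct model_iclog *rec)
{
	assert(rec->len <= sizeof(rec->payload));
	return crc32c(rec->payload, rec->len) == rec->crc;
}
```

A single flipped bit anywhere in the payload makes iclog_recoverable()
return false, so no checkpoint stored in that record is replayed.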

Next, the fact that the inode has been written back and is up to
date on disk means that the iclog is entirely on stable storage.
The inode isn't unpinned until the flush/FUA associated with the
iclog was completed, which happens before the iclog IO is completed
and the callbacks to unpin the inode are run. Hence ordering tells
us the entire iclog is on disk and should be recovered.

What this really means is that we cannot possibly see the
intermediate commit A or commit B states on disk at runtime or
before recovery is run. The metadata is not unpinned until the iclog
that also contains commit C is written to the journal. Hence from
the POV of the on-disk inode, we go from the original version to
commit C in one step and we never, ever see A or B as intermediate
states. IOWs, the iclog contents defines old -> C as an atomic
on-disk modification, even though the contents are spread across
multiple checkpoints.[1]

Hence in this specific case, we have 3 individual modifications to
the inode and its related metadata sitting in the journal waiting
for log recovery to replay them as an atomic unit. They will all get
recovered, and each change that is replayed will be internally
consistent. Therefore, after replaying commit A, the inode and its
metadata will be reverted to whatever was in that commit and it will
be consistent in that context. Then replay of commit B and commit C
brings it back up to date on disk, providing the step change from
old -> C as the runtime code would have also done.

Hence at the end of replay, the inode and all its related metadata
will be consistent with commit C and so this special transient
corner case should resolve itself correctly (at least, as far as my
poor dumb brain can reason about it being correct).

> To handle this, we should
> probably go with the additional rule of "Replay log inode if both the log
> inode and the ondisk inode have the same value for NREXT64 bit".

No, we do not want case specific logic in recovery code like this
because inode core updates are simply overwrites. As long as the
overwrites are all replayed from A to C, we end up with the correct
result of an "atomic" step change from old to C on disk...
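
To make the same-LSN case concrete, here is a small standalone model
of the replay decision. It is only a sketch: struct model_inode,
lsn_cmp(), should_replay() and replay() are illustrative names, the
inode core is reduced to two fields, and stamping the on-disk LSN
with the LSN being replayed is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t lsn_t;	/* model LSN: cycle in the high 32 bits, block in the low 32 */

/* Hypothetical stand-in for the inode core; only what the example needs. */
struct model_inode {
	lsn_t		lsn;	/* LSN stamped when the inode was last written */
	uint64_t	flags2;	/* bit 0 stands in for NREXT64 */
	uint64_t	size;
};

/* Compare cycle first, then block. */
static int lsn_cmp(lsn_t a, lsn_t b)
{
	uint32_t cycle_a = (uint32_t)((uint64_t)a >> 32);
	uint32_t cycle_b = (uint32_t)((uint64_t)b >> 32);
	uint32_t block_a = (uint32_t)a;
	uint32_t block_b = (uint32_t)b;

	if (cycle_a != cycle_b)
		return cycle_a < cycle_b ? -1 : 1;
	if (block_a != block_b)
		return block_a < block_b ? -1 : 1;
	return 0;
}

/* Skip only if the on-disk inode is strictly newer; equal LSNs replay. */
static bool should_replay(const struct model_inode *disk, lsn_t current_lsn)
{
	if (disk->lsn && disk->lsn != -1 && lsn_cmp(disk->lsn, current_lsn) > 0)
		return false;
	return true;
}

/* Replaying a logged inode core is a plain overwrite of the fields. */
static void replay(struct model_inode *disk, const struct model_inode *logged,
		   lsn_t current_lsn)
{
	if (!should_replay(disk, current_lsn))
		return;
	disk->flags2 = logged->flags2;
	disk->size = logged->size;
	disk->lsn = current_lsn;	/* model assumption, see above */
}

/* Commits A, B and C share one LSN; the inode was written back at C. */
struct model_inode replay_same_lsn_checkpoints(void)
{
	const lsn_t lsn = ((lsn_t)1 << 32) | 100;
	struct model_inode disk = { .lsn = lsn, .flags2 = 1, .size = 300 };
	const struct model_inode a = { .flags2 = 0, .size = 100 };
	const struct model_inode b = { .flags2 = 0, .size = 200 };
	const struct model_inode c = { .flags2 = 1, .size = 300 };

	/* A checkpoint from an older LSN would be skipped outright. */
	assert(!should_replay(&disk, lsn - 1));

	replay(&disk, &a, lsn);	/* disk now looks like A */
	replay(&disk, &b, lsn);	/* ... then like B */
	replay(&disk, &c, lsn);	/* ... and finally like C again */
	return disk;
}
```

In this model the inode steps back to A's contents during replay, but
because A, B and C are all replayed in order it ends up with C's
contents again, which is the old -> C step change described above.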

Cheers,

Dave.

[1] There's more really subtle, complex details around start LSN vs
commit LSN ordering with AIL, iclog and recovery LSNs and how to
treat same start/different commit LSNs, different start/same commit
LSNs, etc, but that's way beyond the scope of what is needed to be
understood here. These play into why we replay all the changes at
the same LSN as per above rather than skip them. Commit 32baa63d82ee
("xfs: logging the on disk inode LSN can make it go backwards")
might give you some more insight into the complexities here.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-10-14  2:00                       ` Dave Chinner
@ 2021-10-14 10:07                         ` Chandan Babu R
  0 siblings, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-10-14 10:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 14 Oct 2021 at 07:30, Dave Chinner wrote:
> On Wed, Oct 13, 2021 at 08:14:01PM +0530, Chandan Babu R wrote:
>> On 11 Oct 2021 at 03:19, Dave Chinner wrote:
>> > On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>> >> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>> >> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>> >> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>> >> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>> >> >> >
>> >> >> 
>> >> >> Ok. The above solution looks logically correct. I haven't been able to come up
>> >> >> with a scenario where the solution wouldn't work. I will implement it and see
>> >> >> if anything breaks.
>> >> >
>> >> > I think I can poke one hole in it - I missed the fact that if we
>> >> > upgrade at inode read time, and then we modify the inode without
>> >> > modifying the inode core (can we even do that - metadata mods should
>> >> > at least change timestamps right?) then we don't log the format
>> >> > change or the NREXT64 inode flag change and they only appear in the
>> >> > on-disk inode at writeback.
>> >> >
>> >> > Log recovery needs to be checked for correct behaviour here. I think
>> >> > that if the inode is in NREXT64 format when read in and the log
>> >> > inode core is not, then the on disk LSN must be more recent than
>> >> > what is being recovered from the log and should be skipped. If
>> >> > NREXT64 is present in the log inode, then we logged the core
>> >> > properly and we just don't care what format is on disk because we
>> >> > replay it into NREXT64 format and write that back.
>> >> 
>> >> xfs_inode_item_format() logs the inode core regardless of whether
>> >> XFS_ILOG_CORE flag is set in xfs_inode_log_item->ili_fields. Hence, setting
>> >> the NREXT64 bit in xfs_dinode->di_flags2 just after reading an inode from disk
>> >> should not result in a scenario where the corresponding
>> >> xfs_log_dinode->di_flags2 will not have NREXT64 bit set.
>> >
>> > Except that log recovery might be replaying lots of inode changes
>> > such as:
>> >
>> > log inode
>> > commit A
>> > log inode
>> > commit B
>> > log inode
>> > set NREXT64
>> > commit C
>> > writeback inode
>> > <crash before log tail moves>
>> >
>> > Recovery will then replay commit A, B and C, in which case we *must
>> > not recover the log inode* in commit A or B because the LSN in the
>> > on-disk inode points at commit C. Hence replaying A or B will result
>> > in the on-disk inode going backwards in time and hence resulting in
>> > an inconsistent state on disk until commit C is recovered.
>> >
>> >> i.e. there is no need to compare LSNs of the checkpoint
>> >> transaction being replayed and that of the disk inode.
>> >
>> > Incorrect: we -always- have to do this, regardless of the change
>> > being made.
>> >
>> >> If log recovery comes across a log inode with NREXT64 bit set in its di_flags2
>> >> field, then we can safely conclude that the ondisk inode has to be updated to
>> >> reflect this change
>> >
>> > We can't assume that. This makes an assumption that NREXT64 is
>> > only ever a one-way transition. There's nothing in the disk format that
>> > prevents us from -removing- NREXT64 for inodes that don't need large
>> > extent counts.
>> >
>> > Yes, the -current implementation- does not allow going back to small
>> > extent counts, but the on-disk format design still needs to allow
>> > for such things to be done as we may need such functionality and
>> > flexibility in the on-disk format in the future.
>> >
>> > Hence we have to ensure that log recovery handles both set and reset
>> > transitions from the start. If we don't ensure that log recovery
>> > handles reset conditions when we first add the feature bit, then
>> > we are going to have to add a log incompat or another feature bit
>> > to stop older kernels from trying to recover reset operations.
>> >
>> 
>> Ok. I had never considered the possibility of transitioning an inode back into
>> 32-bit data fork extent count format. With this new requirement, I now
>> understand the reasoning behind comparing ondisk inode's LSN and checkpoint
>> transaction's LSN.
>> 
>> As you have mentioned earlier, comparing LSNs is required not only for the
>> change introduced in this patch, but also for any other change in value of any
>> of the inode's fields. Without such a comparison, the inode can temporarily
>> end up being in an inconsistent state during log replay.
>> 
>> To that end, the following code snippet from xlog_recover_inode_commit_pass2()
>> skips playing back xfs_log_dinode entries when ondisk inode's LSN is greater
>> than checkpoint transaction's LSN,
>> 
>>         if (dip->di_version >= 3) {
>>                 xfs_lsn_t       lsn = be64_to_cpu(dip->di_lsn);
>> 
>>                 if (lsn && lsn != -1 && XFS_LSN_CMP(lsn, current_lsn) > 0) {
>>                         trace_xfs_log_recover_inode_skip(log, in_f);
>>                         error = 0;
>>                         goto out_owner_change;
>>                 }
>>         }
>> 
>> 
>> However, if the commits in the sequence below belong to three different
>> checkpoint transactions having the same LSN,
>> 
>> log inode
>> commit A
>> log inode
>> commit B
>> set NREXT64
>> log inode
>> commit C
>> writeback inode
>> <crash before log tail moves>
>> 
>> Then the above code snippet won't prevent an inode from becoming temporarily
>> inconsistent due to commits A and B being replayed.
>
> Ah, this is a very special corner case.  You snipped out the most
> important part of the comment above that code:
>
> 	/*
>          * If the inode has an LSN in it, recover the inode only if the on-disk
>          * inode's LSN is older than the lsn of the transaction we are
>          * replaying. We can have multiple checkpoints with the same start LSN,
>          * so the current LSN being equal to the on-disk LSN doesn't necessarily
>          * mean that the on-disk inode is more recent than the change being
>          * replayed.
> ....
>
> This is exactly the situation you are asking about here - what
> happens in recovery when the LSNs are the same and there are
> multiple checkpoints with the same LSN.
>
> The first thing to understand here is "how do we get checkpoints
> with the same LSN" and then understand what it implies.
>
> We get checkpoints with the same start/commit LSNs when multiple
> checkpoints are written in the same iclog. The start/commit LSNs are
> determined by the LSN of the iclog they are written in, and hence if
> they are the same they were written to the journal in a single
> "atomic" IO.
>
> I say "atomic" because it's not an atomic IO at the hardware level.
> It's atomic in that the entire iclog is protected by a CRC and hence
> if the CRC check for the iclog passes at recovery, then the iclog
> write has been recovered intact. If the write was torn, misdirected
> or some other physical media failure occurred, then we don't
> recover the iclog at all. IOWs, none of the changes in the iclog
> are recovered: we have atomic "all or nothing" iclog recovery
> semantics.
>
> Next, the fact that the inode has been written back and is up to
> date on disk means that the iclog is entirely on stable storage.
> The inode isn't unpinned until the flush/FUA associated with the
> iclog was completed, which happens before the iclog IO is completed
> and the callbacks to unpin the inode are run. Hence ordering tells
> us the entire iclog is on disk and should be recovered.
>
> What this really means is that we cannot possibly see the
> intermediate commit A or commit B states on disk at runtime or
> before recovery is run. The metadata is not unpinned until the iclog
> that also contains commit C is written to the journal. Hence from
> the POV of the on-disk inode, we go from the original version to
> commit C in one step and we never, ever see A or B as intermediate
> states. IOWs, the iclog contents defines old -> C as an atomic
> on-disk modification, even though the contents are spread across
> multiple checkpoints.[1]
>
> Hence in this specific case, we have 3 individual modifications to
> the inode and its related metadata sitting in the journal waiting
> for log recovery to replay them as an atomic unit. They will all get
> recovered, and each change that is replayed will be internally
> consistent. Therefore, after replaying commit A, the inode and its
> metadata will be reverted to whatever was in that commit and it will
> be consistent in that context. Then replay of commit B and commit C
> brings it back up to date on disk, providing the step change from
> old -> C as the runtime code would have also done.
>
> Hence at the end of replay, the inode and all its related metadata
> will be consistent with commit C and so this special transient
> corner case should resolve itself correctly (at least, as far as my
> poor dumb brain can reason about it being correct).
>

Thanks for the detailed explanation. I had figured out that multiple
checkpoints can end up having the same LSN because they were written to the
same iclog. The value of cil->xc_push_commit_stable is one of the things that
determine if the iclog is supposed to be flushed or not just after writing the
contents of a CIL context into it.

However, the "atomic replay" behaviour had not occurred to me.

>> To handle this, we should
>> probably go with the additional rule of "Replay log inode if both the log
>> inode and the ondisk inode have the same value for NREXT64 bit".
>
> No, we do not want case specific logic in recovery code like this
> because inode core updates are simply overwrites. As long as the
> overwrites are all replayed from A to C, we end up with the correct
> result of an "atomic" step change from old to C on disk...
>

W.r.t processing per-inode NREXT64 bit status during log recovery, I think we
can depend on the LSN comparison that is already implemented in
xlog_recover_inode_commit_pass2() to skip checkpoint transactions (with
different LSNs) which can make an ondisk inode enter an inconsistent state.


> Cheers,
>
> Dave.
>
> [1] There's more really subtle, complex details around start LSN vs
> commit LSN ordering with AIL, iclog and recovery LSNs and how to
> treat same start/different commit LSNs, different start/same commit
> LSNs, etc, but that's way beyond the scope of what is needed to be
> understood here. These play into why we replay all the changes at
> the same LSN as per above rather than skip them. Commit 32baa63d82ee
> ("xfs: logging the on disk inode LSN can make it go backwards")
> might give you some more insight into the complexities here.

Thanks for the commit ID. I will add this to my list of items to read.

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

* Re: [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width
  2021-10-13 14:44                     ` Chandan Babu R
  2021-10-14  2:00                       ` Dave Chinner
@ 2021-10-21 10:27                       ` Chandan Babu R
  1 sibling, 0 replies; 42+ messages in thread
From: Chandan Babu R @ 2021-10-21 10:27 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, djwong

On 13 Oct 2021 at 20:14, Chandan Babu R wrote:
> On 11 Oct 2021 at 03:19, Dave Chinner wrote:
>> On Thu, Oct 07, 2021 at 04:22:25PM +0530, Chandan Babu R wrote:
>>> On 01 Oct 2021 at 04:25, Dave Chinner wrote:
>>> > On Thu, Sep 30, 2021 at 01:00:00PM +0530, Chandan Babu R wrote:
>>> >> On 30 Sep 2021 at 10:01, Dave Chinner wrote:
>>> >> > On Thu, Sep 30, 2021 at 10:40:15AM +1000, Dave Chinner wrote:
>>> >> >
>>> >> 
[...]
>>> >> > FWIW, I also think doing something like this would help make the
>>> >> > code be easier to read and confirm that it is obviously correct when
>>> >> > reading it:
>>> >> >
>>> >> > 	__be32          di_gid;         /* owner's group id */
>>> >> > 	__be32          di_nlink;       /* number of links to file */
>>> >> > 	__be16          di_projid_lo;   /* lower part of owner's project id */
>>> >> > 	__be16          di_projid_hi;   /* higher part owner's project id */
>>> >> > 	union {
>>> >> > 		__be64	di_big_dextcnt;	/* NREXT64 data extents */
>>> >> > 		__u8	di_v3_pad[8];	/* !NREXT64 V3 inode zeroed space */
>>> >> > 		struct {
>>> >> > 			__u8	di_v2_pad[6];	/* V2 inode zeroed space */
>>> >> > 			__be16	di_flushiter;	/* V2 inode incremented on flush */
>>> >> > 		};
>>> >> > 	};
>>> >> > 	xfs_timestamp_t di_atime;       /* time last accessed */
>>> >> > 	xfs_timestamp_t di_mtime;       /* time last modified */
>>> >> > 	xfs_timestamp_t di_ctime;       /* time created/inode modified */
>>> >> > 	__be64          di_size;        /* number of bytes in file */
>>> >> > 	__be64          di_nblocks;     /* # of direct & btree blocks used */
>>> >> > 	__be32          di_extsize;     /* basic/minimum extent size for file */
>>> >> > 	union {
>>> >> > 		struct {
>>> >> > 			__be32	di_big_aextcnt; /* NREXT64 attr extents */
>>> >> > 			__be16	di_nrext64_pad;	/* NREXT64 unused, zero */
>>> >> > 		};
>>> >> > 		struct {
>>> >> > 			__be32	di_nextents;    /* !NREXT64 data extents */
>>> >> > 			__be16	di_anextents;   /* !NREXT64 attr extents */
>>> >> > 		};
>>> >> > 	};
>>> 
>>> The two structures above result in padding and hence result in a hole being
>>> introduced. The entire union above can be replaced with the following,
>>> 
>>>         union {
>>>                 __be32  di_big_aextcnt; /* NREXT64 attr extents */
>>>                 __be32  di_nextents;    /* !NREXT64 data extents */
>>>         };
>>>         union {
>>>                 __be16  di_nrext64_pad; /* NREXT64 unused, zero */
>>>                 __be16  di_anextents;   /* !NREXT64 attr extents */
>>>         };
>>
>> I don't think this makes sense. This groups by field rather than
>> by feature layout. It doesn't make it clear at all that these
>> variables both change definition at the same time - they are either
>> {di_nexts, di_anexts} pair or a {di_big_aexts, pad} pair. That's the
>> whole point of using anonymous structs here - it defines and
>> documents the relationship between the layouts when certain features
>> are set rather than relying on people to parse the comments
>> correctly to determine the relationship....
>
> Ok. I will need to check if there are alternative ways of arranging the fields
> to accomplish the goal stated above. I will think about this and get back as
> soon as possible.

The padding that results from the following structure layout,

typedef struct xfs_dinode {
        __be16          di_magic;       /* inode magic # = XFS_DINODE_MAGIC */
        __be16          di_mode;        /* mode and type of file */
        __u8            di_version;     /* inode version */
        __u8            di_format;      /* format of di_c data */
        __be16          di_onlink;      /* old number of links to file */
        __be32          di_uid;         /* owner's user id */
        __be32          di_gid;         /* owner's group id */
        __be32          di_nlink;       /* number of links to file */
        __be16          di_projid_lo;   /* lower part of owner's project id */
        __be16          di_projid_hi;   /* higher part owner's project id */
        __u8            di_pad[6];      /* unused, zeroed space */
        __be16          di_flushiter;   /* incremented on flush */
        xfs_timestamp_t di_atime;       /* time last accessed */
        xfs_timestamp_t di_mtime;       /* time last modified */
        xfs_timestamp_t di_ctime;       /* time created/inode modified */
        __be64          di_size;        /* number of bytes in file */
        __be64          di_nblocks;     /* # of direct & btree blocks used */
        __be32          di_extsize;     /* basic/minimum extent size for file */
        union {
                struct {
                        __be32  di_big_aextcnt; /* NREXT64 attr extents */
                        __be16  di_nrext64_pad; /* NREXT64 unused, zero */
                };
                struct {
                        __be32  di_nextents;    /* !NREXT64 data extents */
                        __be16  di_anextents;   /* !NREXT64 attr extents */
                };
        };
        __u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
        __s8            di_aformat;     /* format of attr fork's data */

... can be solved by packing the two structures contained within the union i.e.

        union {
                struct {
                        __be32  di_big_aextcnt; /* NREXT64 attr extents */
                        __be16  di_nrext64_pad; /* NREXT64 unused, zero */
                } __packed;
                struct {
                        __be32  di_nextents;    /* !NREXT64 data extents */
                        __be16  di_anextents;   /* !NREXT64 attr extents */
                } __packed;
        };
        __u8            di_forkoff;     /* attr fork offs, <<3 for 64b align */
        __s8            di_aformat;     /* format of attr fork's data */

Each of the two structures starts at a 4-byte aligned offset (right after
di_extsize), and the two 1-byte fields (di_forkoff & di_aformat) defined later
in the structure prevent the introduction of holes inside the dinode structure.

Also, an exception shouldn't be generated if the address of any of the packed
structure members is assigned to another pointer variable and later the
pointer variable is dereferenced. This is because such an address would still
be a 4-byte aligned address (in the case of di_big_aextcnt/di_nextents) or a
2-byte aligned address (in the case of di_nrext64_pad/di_anextents).
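
The layout can be checked at compile time with a standalone model such
as the one below. This is a sketch rather than the kernel header:
struct model_dinode and the be16/be32/be64 typedefs are stand-ins,
xfs_timestamp_t is modelled as a plain 64-bit field, and __packed is
spelled out as the GCC/Clang attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's big-endian types; only the widths matter. */
typedef uint16_t be16;
typedef uint32_t be32;
typedef uint64_t be64;

/* A 32-bit + 16-bit pair without packing, for comparison. */
struct unpacked_pair {
	be32	a;
	be16	b;
};

/* Hypothetical model of the dinode core up to di_aformat. */
struct model_dinode {
	be16	di_magic;
	be16	di_mode;
	uint8_t	di_version;
	uint8_t	di_format;
	be16	di_onlink;
	be32	di_uid;
	be32	di_gid;
	be32	di_nlink;
	be16	di_projid_lo;
	be16	di_projid_hi;
	uint8_t	di_pad[6];
	be16	di_flushiter;
	be64	di_atime;	/* xfs_timestamp_t modelled as 64 bits */
	be64	di_mtime;
	be64	di_ctime;
	be64	di_size;
	be64	di_nblocks;
	be32	di_extsize;
	union {
		struct {
			be32	di_big_aextcnt;	/* NREXT64 attr extents */
			be16	di_nrext64_pad;	/* NREXT64 unused, zero */
		} __attribute__((packed));
		struct {
			be32	di_nextents;	/* !NREXT64 data extents */
			be16	di_anextents;	/* !NREXT64 attr extents */
		} __attribute__((packed));
	};
	uint8_t	di_forkoff;
	int8_t	di_aformat;
};

/* Unpacked, the pair is padded to 8 bytes on common ABIs. */
static_assert(sizeof(struct unpacked_pair) == 8, "expected tail padding");

/* Both layouts overlay the same bytes and keep their natural alignment. */
static_assert(offsetof(struct model_dinode, di_big_aextcnt) == 76, "aextcnt");
static_assert(offsetof(struct model_dinode, di_nextents) == 76, "nextents");
static_assert(offsetof(struct model_dinode, di_nrext64_pad) == 80, "pad");
static_assert(offsetof(struct model_dinode, di_anextents) == 80, "anextents");

/* No hole: the 1-byte fields follow the 6-byte union immediately. */
static_assert(offsetof(struct model_dinode, di_forkoff) == 82, "forkoff");
static_assert(offsetof(struct model_dinode, di_aformat) == 83, "aformat");
```

Dropping the two packed attributes from the model moves di_forkoff to
offset 84 on common ABIs, which is the 2-byte hole being discussed.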

-- 
chandan

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2021-10-21 10:27 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
2021-09-16 10:06 [PATCH V3 00/12] xfs: Extend per-inode extent counters Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 01/12] xfs: Move extent count limits to xfs_format.h Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 02/12] xfs: Introduce xfs_iext_max_nextents() helper Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 03/12] xfs: Rename MAXEXTNUM, MAXAEXTNUM to XFS_IFORK_EXTCNT_MAXS32, XFS_IFORK_EXTCNT_MAXS16 Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 04/12] xfs: Use xfs_extnum_t instead of basic data types Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 05/12] xfs: Introduce xfs_dfork_nextents() helper Chandan Babu R
2021-09-27 22:46   ` Dave Chinner
2021-09-28  9:46     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 06/12] xfs: xfs_dfork_nextents: Return extent count via an out argument Chandan Babu R
2021-09-30  1:19   ` Dave Chinner
2021-09-16 10:06 ` [PATCH V3 07/12] xfs: Rename inode's extent counter fields based on their width Chandan Babu R
2021-09-27 23:46   ` Dave Chinner
2021-09-28  4:04     ` Dave Chinner
2021-09-29 17:03       ` Chandan Babu R
2021-09-30  0:40         ` Dave Chinner
2021-09-30  4:31           ` Dave Chinner
2021-09-30  7:30             ` Chandan Babu R
2021-09-30 22:55               ` Dave Chinner
2021-10-07 10:52                 ` Chandan Babu R
2021-10-10 21:49                   ` Dave Chinner
2021-10-13 14:44                     ` Chandan Babu R
2021-10-14  2:00                       ` Dave Chinner
2021-10-14 10:07                         ` Chandan Babu R
2021-10-21 10:27                       ` Chandan Babu R
2021-09-28  9:47     ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 08/12] xfs: Promote xfs_extnum_t and xfs_aextnum_t to 64 and 32-bits respectively Chandan Babu R
2021-09-28  0:47   ` Dave Chinner
2021-09-28  9:47     ` Chandan Babu R
2021-09-28 23:08       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 09/12] xfs: Enable bulkstat ioctl to support 64-bit per-inode extent counters Chandan Babu R
2021-09-27 23:06   ` Dave Chinner
2021-09-28  9:49     ` Chandan Babu R
2021-09-28 23:39       ` Dave Chinner
2021-09-29 17:04         ` Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 10/12] xfs: Extend per-inode extent counter widths Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 11/12] xfs: Add XFS_SB_FEAT_INCOMPAT_NREXT64 to XFS_SB_FEAT_INCOMPAT_ALL Chandan Babu R
2021-09-16 10:06 ` [PATCH V3 12/12] xfs: Define max extent length based on on-disk format definition Chandan Babu R
2021-09-28  0:33   ` Dave Chinner
2021-09-28 10:07     ` Chandan Babu R
2021-09-18  0:03 ` [PATCH V3 00/12] xfs: Extend per-inode extent counters Darrick J. Wong
2021-09-18  3:36   ` [External] : " Chandan Babu R
