* [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
@ 2021-01-10 16:07 Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 01/16] xfs: Add helper for checking per-inode extent count overflow Chandan Babu R
                   ` (16 more replies)
  0 siblings, 17 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

XFS does not check for possible overflow of per-inode extent counter
fields when adding extents to either data or attr fork.

For example:
1. Insert 5 million xattrs (each having a value size of 255 bytes) and
   then delete 50% of them in an alternating manner.

2. On a 4k block sized XFS filesystem instance, the above causes 98511
   extents to be created in the attr fork of the inode.

   xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131

3. The incore inode fork extent counter is a signed 32-bit
   quantity. However, the on-disk attr fork extent counter is a signed
   16-bit quantity and hence cannot hold 98511 extents.

4. The following incorrect value is stored in the xattr extent counter,
   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561

This patchset adds a new helper function
(i.e. xfs_iext_count_may_overflow()) to check for overflow of the
per-inode data and xattr extent counters and invokes it before
starting an fs operation (e.g. creating a new directory entry). With
this patchset applied, XFS detects counter overflows and returns with
an error rather than causing a silent corruption.
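
For illustration only (not part of this series), here is a minimal
user-space sketch of the wraparound shown in the example above, where an
extent count that needs more than 16 bits is stored into a 16-bit field:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t incore_nextents = 98511;	/* fits in the 32-bit incore counter */
	int16_t ondisk_naextents = (int16_t)incore_nextents; /* 16-bit on-disk field */

	/* Prints -32561 on common two's-complement targets, matching the
	 * xfs_db output shown above. */
	printf("core.naextents = %d\n", ondisk_naextents);
	return 0;
}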

The patchset has been tested by executing xfstests with the following
mkfs.xfs options,
1. -m crc=0 -b size=1k
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

The patches can also be obtained from
https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.

I have two patches that define the newly introduced error injection
tags in xfsprogs
(https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).

I have also written tests
(https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
for verifying the checks introduced in the kernel.

Changelog:
V13 -> V14:
  1. Fix incorrect comparison of xfs_iext_count_may_overflow()'s
     return value with -ENOSPC in xfs_bmap_del_extent_real().
  Also, for quick reference, the following are the patches that
  need to be reviewed,
  - [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries
  - [PATCH V14 05/16] xfs: Check for extent overflow when removing dir entries
  - [PATCH V14 06/16] xfs: Check for extent overflow when renaming dir entries

V12 -> V13:
  1. xfs_rename():
     - Add comment explaining why we do not check for extent count
       overflow for the source directory entry of a rename operation.
     - Fix grammatical nit in a comment.
  2. xfs_bmap_del_extent_real():
     Replace explicit checks for the inode's mode and fork with an
     ASSERT() call, since the extent count overflow check here applies
     only to directory entry remove/rename operations.
  
V11 -> V12:
  1. Rebase patches on top of Linux v5.11-rc1.
  2. Revert back to using a pseudo max inode extent count of 10.
     Hence the patches
     - [PATCH V12 05/14] xfs: Check for extent overflow when adding/removing xattrs
     - [PATCH V12 10/14] xfs: Introduce error injection to reduce maximum
     have been reverted (retaining the corresponding RVB tags) to how
     they were under V10 of the patchset.

     V11 of the patchset had increased the pseudo max extent count to
     35 to allow the "directory entry remove" operation to always
     succeed. However, the corresponding logic was incorrect. Please
     refer to "[PATCH V12 04/14] xfs: Check for extent overflow when
     adding/removing dir entries" for an explanation of the newer
     logic.

     "[PATCH V12 04/14] xfs: Check for extent overflow when
     adding/removing dir entries" is the only patch yet to be reviewed.

V10 -> V11:
  1. For directory/xattr insert operations we now reserve a sufficient
     number of extents to guarantee that a future directory/xattr
     remove operation will succeed.
  2. The pseudo max extent count value has been increased to 35.

V9 -> V10:
  1. Pull back changes which cause xfs_bmap_compute_alignments() to
     return "stripe alignment" into 12th patch i.e. "xfs: Compute bmap
     extent alignments in a separate function".

V8 -> V9:
  1. Enabling the XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag now
     causes single block sized free extents to always be allocated (if
     available).
  2. xfs_bmap_compute_alignments() now returns stripe alignment as its
     return value.
  3. Dropped Allison's RVB tag for "xfs: Compute bmap extent
     alignments in a separate function" and "xfs: Introduce error
     injection to allocate only minlen size extents for files".

V7 -> V8:
  1. Rename local variable in xfs_alloc_fix_freelist() from "i" to "stat".

V6 -> V7:
  1. Create a new function xfs_bmap_exact_minlen_extent_alloc() (enabled
     only when CONFIG_XFS_DEBUG is set to y) which issues allocation
     requests for minlen sized extents only. In order to achieve this,
     common code from xfs_bmap_btalloc() has been refactored into new
     functions.
  2. All major functions implementing logic associated with
     XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag are compiled only
     when CONFIG_XFS_DEBUG is set to y.
  3. Remove XFS_IEXT_REFLINK_REMAP_CNT macro and replace it with an
     integer which holds the number of new extents to be
     added to the data fork.

V5 -> V6:
  1. Rebased the patchset on xfs-linux/for-next branch.
  2. Drop "xfs: Set tp->t_firstblock only once during a transaction's
     lifetime" patch from the patchset.
  3. Add a comment to xfs_bmap_btalloc() describing why it was chosen
     to start "free space extent search" from AG 0 when
     XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT is enabled and when the
     transaction is allocating its first extent.
  4. Fix review comments associated with coding style.

V4 -> V5:
  1. Introduce the new error tag XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT to
     let user space programs guarantee that free space requests for
     files are satisfied by allocating minlen sized extents.
  2. Change xfs_bmap_btalloc() and xfs_alloc_vextent() to allocate
     minlen sized extents when XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT is
     enabled.
  3. Introduce a new patch that causes tp->t_firstblock to be assigned
     a value only when its previous value is NULLFSBLOCK.
  4. Replace the previously introduced MAXERRTAGEXTNUM (maximum inode
     fork extent count) with the hardcoded value of 10.
  5. xfs_bui_item_recover(): Use XFS_IEXT_ADD_NOSPLIT_CNT when mapping
     an extent.
  6. xfs_swap_extent_rmap(): Use xfs_bmap_is_real_extent() instead of
     xfs_bmap_is_update_needed() to assess if the extent really needs
     to be swapped.

V3 -> V4:
  1. Introduce a new patch which lets userspace programs test "extent
     count overflow detection" by injecting an error tag. The new
     error tag reduces the maximum allowed extent count to 10.
  2. Injecting the newly defined error tag prevents
     xfs_bmap_add_extent_hole_real() from merging a new extent with
     its neighbours, allowing deterministic tests to be written for
     extent count overflow of directories, xattrs and growing realtime
     devices. This is required because the new extent being allocated
     can be contiguous with its neighbours (w.r.t. both file and disk
     offsets).
  3. Injecting the newly defined error tag forces block sized extents
     to be allocated for summary/bitmap files when growing a realtime
     device. This is required because xfs_growfs_rt_alloc() allocates
     as large an extent as possible for summary/bitmap files and hence
     it would be impossible to write deterministic tests.
  4. Rename XFS_IEXT_REMOVE_CNT to XFS_IEXT_PUNCH_HOLE_CNT to reflect
     the actual meaning of the fs operation.
  5. Fold XFS_IEXT_INSERT_HOLE_CNT code into that associated with
     XFS_IEXT_PUNCH_HOLE_CNT since both perform the same job.
  6. xfs_swap_extent_rmap(): Check for extent overflow should be made
     on the source file only if the donor file extent has a valid
     on-disk mapping and vice versa.

V2 -> V3:
  1. Move the definition of xfs_iext_count_may_overflow() from
     libxfs/xfs_trans_resv.c to libxfs/xfs_inode_fork.c. Also, I tried
     to make xfs_iext_count_may_overflow() an inline function by
     placing the definition in libxfs/xfs_inode_fork.h. However, this
     required that the definition of 'struct xfs_inode' be available,
     since xfs_iext_count_may_overflow() uses a 'struct xfs_inode *'
     type variable.
  2. Handle XFS_COW_FORK within xfs_iext_count_may_overflow() by
     returning a success value.
  3. Rename XFS_IEXT_ADD_CNT to XFS_IEXT_ADD_NOSPLIT_CNT. Thanks to
     Darrick for suggesting the new name.
  4. Expand comments to make use of 80 columns.

V1 -> V2:
  1. Rename helper function from xfs_trans_resv_ext_cnt() to
     xfs_iext_count_may_overflow().
  2. Define and use macros to represent fs operations and the
     corresponding increase in extent count.
  3. Split the patches based on the fs operation being performed.

Chandan Babu R (16):
  xfs: Add helper for checking per-inode extent count overflow
  xfs: Check for extent overflow when trivially adding a new extent
  xfs: Check for extent overflow when punching a hole
  xfs: Check for extent overflow when adding dir entries
  xfs: Check for extent overflow when removing dir entries
  xfs: Check for extent overflow when renaming dir entries
  xfs: Check for extent overflow when adding/removing xattrs
  xfs: Check for extent overflow when writing to unwritten extent
  xfs: Check for extent overflow when moving extent from cow to data
    fork
  xfs: Check for extent overflow when remapping an extent
  xfs: Check for extent overflow when swapping extents
  xfs: Introduce error injection to reduce maximum inode fork extent
    count
  xfs: Remove duplicate assert statement in xfs_bmap_btalloc()
  xfs: Compute bmap extent alignments in a separate function
  xfs: Process allocated extent in a separate function
  xfs: Introduce error injection to allocate only minlen size extents
    for files

 fs/xfs/libxfs/xfs_alloc.c      |  50 ++++++
 fs/xfs/libxfs/xfs_alloc.h      |   3 +
 fs/xfs/libxfs/xfs_attr.c       |  13 ++
 fs/xfs/libxfs/xfs_bmap.c       | 285 ++++++++++++++++++++++++---------
 fs/xfs/libxfs/xfs_errortag.h   |   6 +-
 fs/xfs/libxfs/xfs_inode_fork.c |  27 ++++
 fs/xfs/libxfs/xfs_inode_fork.h |  63 ++++++++
 fs/xfs/xfs_bmap_item.c         |  10 ++
 fs/xfs/xfs_bmap_util.c         |  31 ++++
 fs/xfs/xfs_dquot.c             |   8 +-
 fs/xfs/xfs_error.c             |   6 +
 fs/xfs/xfs_inode.c             |  54 ++++++-
 fs/xfs/xfs_iomap.c             |  10 ++
 fs/xfs/xfs_reflink.c           |  16 ++
 fs/xfs/xfs_rtalloc.c           |   5 +
 fs/xfs/xfs_symlink.c           |   5 +
 16 files changed, 513 insertions(+), 79 deletions(-)

-- 
2.29.2



* [PATCH V14 01/16] xfs: Add helper for checking per-inode extent count overflow
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 02/16] xfs: Check for extent overflow when trivially adding a new extent Chandan Babu R
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

XFS does not check for possible overflow of per-inode extent counter
fields when adding extents to either data or attr fork.

For example:
1. Insert 5 million xattrs (each having a value size of 255 bytes) and
   then delete 50% of them in an alternating manner.

2. On a 4k block sized XFS filesystem instance, the above causes 98511
   extents to be created in the attr fork of the inode.

   xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131

3. The incore inode fork extent counter is a signed 32-bit
   quantity. However, the on-disk attr fork extent counter is a signed
   16-bit quantity and hence cannot hold 98511 extents.

4. The following incorrect value is stored in the attr extent counter,
   # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
   core.naextents = -32561

This commit adds a new helper function (i.e.
xfs_iext_count_may_overflow()) to check for overflow of the per-inode
data and xattr extent counters. Future patches will use this function to
make sure that an FS operation won't cause the extent counter to
overflow.
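
For illustration, here is a minimal user-space model of the check (the
real kernel implementation follows in the diff below); the MAXEXTNUM and
MAXAEXTNUM values are the data and attr fork limits from xfs_format.h:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define MAXEXTNUM	((uint64_t)0x7fffffff)	/* data fork limit */
#define MAXAEXTNUM	((uint64_t)0x7fff)	/* attr fork limit */

/* Returns true if adding nr_to_add extents would exceed the fork's limit. */
static bool iext_count_may_overflow(uint64_t nextents, uint64_t max_exts,
				    uint64_t nr_to_add)
{
	uint64_t nr_exts = nextents + nr_to_add;

	/* The first clause catches wraparound of the 64-bit sum itself. */
	return nr_exts < nextents || nr_exts > max_exts;
}

int main(void)
{
	/* The attr fork from the example above overflows its 16-bit limit. */
	printf("%d\n", iext_count_may_overflow(98510, MAXAEXTNUM, 1));	/* 1 */
	printf("%d\n", iext_count_may_overflow(10, MAXAEXTNUM, 1));	/* 0 */
	printf("%d\n", iext_count_may_overflow(10, MAXEXTNUM, 1));	/* 0 */
	return 0;
}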

Suggested-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.c | 23 +++++++++++++++++++++++
 fs/xfs/libxfs/xfs_inode_fork.h |  2 ++
 2 files changed, 25 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 7575de5cecb1..8d48716547e5 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -23,6 +23,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_attr_leaf.h"
+#include "xfs_types.h"
 
 kmem_zone_t *xfs_ifork_zone;
 
@@ -728,3 +729,25 @@ xfs_ifork_verify_local_attr(
 
 	return 0;
 }
+
+int
+xfs_iext_count_may_overflow(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	int			nr_to_add)
+{
+	struct xfs_ifork	*ifp = XFS_IFORK_PTR(ip, whichfork);
+	uint64_t		max_exts;
+	uint64_t		nr_exts;
+
+	if (whichfork == XFS_COW_FORK)
+		return 0;
+
+	max_exts = (whichfork == XFS_ATTR_FORK) ? MAXAEXTNUM : MAXEXTNUM;
+
+	nr_exts = ifp->if_nextents + nr_to_add;
+	if (nr_exts < ifp->if_nextents || nr_exts > max_exts)
+		return -EFBIG;
+
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index a4953e95c4f3..0beb8e2a00be 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -172,5 +172,7 @@ extern void xfs_ifork_init_cow(struct xfs_inode *ip);
 
 int xfs_ifork_verify_local_data(struct xfs_inode *ip);
 int xfs_ifork_verify_local_attr(struct xfs_inode *ip);
+int xfs_iext_count_may_overflow(struct xfs_inode *ip, int whichfork,
+		int nr_to_add);
 
 #endif	/* __XFS_INODE_FORK_H__ */
-- 
2.29.2



* [PATCH V14 02/16] xfs: Check for extent overflow when trivially adding a new extent
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 01/16] xfs: Add helper for checking per-inode extent count overflow Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 03/16] xfs: Check for extent overflow when punching a hole Chandan Babu R
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

When adding a new data extent (without modifying an inode's existing
extents) the extent count increases only by 1. This commit checks for
extent count overflow in such cases.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c       | 6 ++++++
 fs/xfs/libxfs/xfs_inode_fork.h | 6 ++++++
 fs/xfs/xfs_bmap_item.c         | 7 +++++++
 fs/xfs/xfs_bmap_util.c         | 5 +++++
 fs/xfs/xfs_dquot.c             | 8 +++++++-
 fs/xfs/xfs_iomap.c             | 5 +++++
 fs/xfs/xfs_rtalloc.c           | 5 +++++
 7 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bc446418e227..32aeacf6f055 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -4527,6 +4527,12 @@ xfs_bmapi_convert_delalloc(
 		return error;
 
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
+
+	error = xfs_iext_count_may_overflow(ip, whichfork,
+			XFS_IEXT_ADD_NOSPLIT_CNT);
+	if (error)
+		goto out_trans_cancel;
+
 	xfs_trans_ijoin(tp, ip, 0);
 
 	if (!xfs_iext_lookup_extent(ip, ifp, offset_fsb, &bma.icur, &bma.got) ||
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 0beb8e2a00be..7fc2b129a2e7 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -34,6 +34,12 @@ struct xfs_ifork {
 #define	XFS_IFEXTENTS	0x02	/* All extent pointers are read in */
 #define	XFS_IFBROOT	0x04	/* i_broot points to the bmap b-tree root */
 
+/*
+ * Worst-case increase in the fork extent count when we're adding a single
+ * extent to a fork and there's no possibility of splitting an existing mapping.
+ */
+#define XFS_IEXT_ADD_NOSPLIT_CNT	(1)
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 93e4d8ae6e92..0534304ed0a7 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -508,6 +508,13 @@ xfs_bui_item_recover(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
+	if (bui_type == XFS_BMAP_MAP) {
+		error = xfs_iext_count_may_overflow(ip, whichfork,
+				XFS_IEXT_ADD_NOSPLIT_CNT);
+		if (error)
+			goto err_cancel;
+	}
+
 	count = bmap->me_len;
 	error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip,
 			whichfork, bmap->me_startoff, bmap->me_startblock,
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7371a7f7c652..db44bfaabe88 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -822,6 +822,11 @@ xfs_alloc_file_space(
 		if (error)
 			goto error1;
 
+		error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+				XFS_IEXT_ADD_NOSPLIT_CNT);
+		if (error)
+			goto error0;
+
 		xfs_trans_ijoin(tp, ip, 0);
 
 		error = xfs_bmapi_write(tp, ip, startoffset_fsb,
diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c
index 1d95ed387d66..175f544f7c45 100644
--- a/fs/xfs/xfs_dquot.c
+++ b/fs/xfs/xfs_dquot.c
@@ -314,8 +314,14 @@ xfs_dquot_disk_alloc(
 		return -ESRCH;
 	}
 
-	/* Create the block mapping. */
 	xfs_trans_ijoin(tp, quotip, XFS_ILOCK_EXCL);
+
+	error = xfs_iext_count_may_overflow(quotip, XFS_DATA_FORK,
+			XFS_IEXT_ADD_NOSPLIT_CNT);
+	if (error)
+		return error;
+
+	/* Create the block mapping. */
 	error = xfs_bmapi_write(tp, quotip, dqp->q_fileoffset,
 			XFS_DQUOT_CLUSTER_SIZE_FSB, XFS_BMAPI_METADATA, 0, &map,
 			&nmaps);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 7b9ff824e82d..f53690febb22 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -250,6 +250,11 @@ xfs_iomap_write_direct(
 	if (error)
 		goto out_trans_cancel;
 
+	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+			XFS_IEXT_ADD_NOSPLIT_CNT);
+	if (error)
+		goto out_trans_cancel;
+
 	xfs_trans_ijoin(tp, ip, 0);
 
 	/*
diff --git a/fs/xfs/xfs_rtalloc.c b/fs/xfs/xfs_rtalloc.c
index b4999fb01ff7..161b0e8992ba 100644
--- a/fs/xfs/xfs_rtalloc.c
+++ b/fs/xfs/xfs_rtalloc.c
@@ -804,6 +804,11 @@ xfs_growfs_rt_alloc(
 		xfs_ilock(ip, XFS_ILOCK_EXCL);
 		xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL);
 
+		error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+				XFS_IEXT_ADD_NOSPLIT_CNT);
+		if (error)
+			goto out_trans_cancel;
+
 		/*
 		 * Allocate blocks to the bitmap file.
 		 */
-- 
2.29.2



* [PATCH V14 03/16] xfs: Check for extent overflow when punching a hole
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 01/16] xfs: Add helper for checking per-inode extent count overflow Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 02/16] xfs: Check for extent overflow when trivially adding a new extent Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries Chandan Babu R
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

The extent mapping the file offset at which a hole has to be
inserted will be split into two extents causing extent count to
increase by 1.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.h |  7 +++++++
 fs/xfs/xfs_bmap_item.c         | 15 +++++++++------
 fs/xfs/xfs_bmap_util.c         | 10 ++++++++++
 3 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 7fc2b129a2e7..bcac769a7df6 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -40,6 +40,13 @@ struct xfs_ifork {
  */
 #define XFS_IEXT_ADD_NOSPLIT_CNT	(1)
 
+/*
+ * Punching out an extent from the middle of an existing extent can cause the
+ * extent count to increase by 1.
+ * i.e. | Old extent | Hole | Old extent |
+ */
+#define XFS_IEXT_PUNCH_HOLE_CNT		(1)
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 0534304ed0a7..2344757ede63 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -471,6 +471,7 @@ xfs_bui_item_recover(
 	xfs_exntst_t			state;
 	unsigned int			bui_type;
 	int				whichfork;
+	int				iext_delta;
 	int				error = 0;
 
 	if (!xfs_bui_validate(mp, buip)) {
@@ -508,12 +509,14 @@ xfs_bui_item_recover(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
-	if (bui_type == XFS_BMAP_MAP) {
-		error = xfs_iext_count_may_overflow(ip, whichfork,
-				XFS_IEXT_ADD_NOSPLIT_CNT);
-		if (error)
-			goto err_cancel;
-	}
+	if (bui_type == XFS_BMAP_MAP)
+		iext_delta = XFS_IEXT_ADD_NOSPLIT_CNT;
+	else
+		iext_delta = XFS_IEXT_PUNCH_HOLE_CNT;
+
+	error = xfs_iext_count_may_overflow(ip, whichfork, iext_delta);
+	if (error)
+		goto err_cancel;
 
 	count = bmap->me_len;
 	error = xfs_trans_log_finish_bmap_update(tp, budp, bui_type, ip,
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index db44bfaabe88..6ac7a6ac2658 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -891,6 +891,11 @@ xfs_unmap_extent(
 
 	xfs_trans_ijoin(tp, ip, 0);
 
+	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+			XFS_IEXT_PUNCH_HOLE_CNT);
+	if (error)
+		goto out_trans_cancel;
+
 	error = xfs_bunmapi(tp, ip, startoffset_fsb, len_fsb, 0, 2, done);
 	if (error)
 		goto out_trans_cancel;
@@ -1168,6 +1173,11 @@ xfs_insert_file_space(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
+	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+			XFS_IEXT_PUNCH_HOLE_CNT);
+	if (error)
+		goto out_trans_cancel;
+
 	/*
 	 * The extent shifting code works on extent granularity. So, if stop_fsb
 	 * is not the starting block of extent, we need to split the extent at
-- 
2.29.2



* [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (2 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 03/16] xfs: Check for extent overflow when punching a hole Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-12  1:34   ` Darrick J. Wong
  2021-01-10 16:07 ` [PATCH V14 05/16] xfs: Check for extent overflow when removing " Chandan Babu R
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Directory entry addition can cause the following:
1. Data block can be added/removed.
   A new extent can cause the extent count to increase by 1.
2. Free disk block can be added/removed.
   Same behaviour as described above for the data block.
3. Dabtree blocks.
   XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these can be a
   new extent. Hence the extent count can increase by
   XFS_DA_NODE_MAXDEPTH (see the worked example below).
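
A worked example of the reservation implied by the above, illustrative
only: XFS_DA_NODE_MAXDEPTH is 5 in the kernel sources, and DIR_FSBCOUNT
is a stand-in for (mp)->m_dir_geo->fsbcount, assumed to be 1 as it is
when the directory block size equals a 4k filesystem block size:

#include <stdio.h>

#define XFS_DA_NODE_MAXDEPTH	5
#define DIR_FSBCOUNT		1	/* stands in for (mp)->m_dir_geo->fsbcount */

int main(void)
{
	/* 5 dabtree blocks + 1 data block + 1 free block, each of which
	 * may end up mapped as a separate new extent. */
	printf("%d\n", (XFS_DA_NODE_MAXDEPTH + 1 + 1) * DIR_FSBCOUNT);	/* 7 */
	return 0;
}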

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.h | 13 +++++++++++++
 fs/xfs/xfs_inode.c             | 10 ++++++++++
 fs/xfs/xfs_symlink.c           |  5 +++++
 3 files changed, 28 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index bcac769a7df6..ea1a9dd8a763 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -47,6 +47,19 @@ struct xfs_ifork {
  */
 #define XFS_IEXT_PUNCH_HOLE_CNT		(1)
 
+/*
+ * Directory entry addition can cause the following,
+ * 1. Data block can be added/removed.
+ *    A new extent can cause extent count to increase by 1.
+ * 2. Free disk block can be added/removed.
+ *    Same behaviour as described above for Data block.
+ * 3. Dabtree blocks.
+ *    XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these can be new
+ *    extents. Hence extent count can increase by XFS_DA_NODE_MAXDEPTH.
+ */
+#define XFS_IEXT_DIR_MANIP_CNT(mp) \
+	((XFS_DA_NODE_MAXDEPTH + 1 + 1) * (mp)->m_dir_geo->fsbcount)
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index b7352bc4c815..4cc787cc4eee 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1042,6 +1042,11 @@ xfs_create(
 	if (error)
 		goto out_trans_cancel;
 
+	error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK,
+			XFS_IEXT_DIR_MANIP_CNT(mp));
+	if (error)
+		goto out_trans_cancel;
+
 	/*
 	 * A newly created regular or special file just has one directory
 	 * entry pointing to them, but a directory also the "." entry
@@ -1258,6 +1263,11 @@ xfs_link(
 	xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
 
+	error = xfs_iext_count_may_overflow(tdp, XFS_DATA_FORK,
+			XFS_IEXT_DIR_MANIP_CNT(mp));
+	if (error)
+		goto error_return;
+
 	/*
 	 * If we are using project inheritance, we only allow hard link
 	 * creation in our tree when the project IDs are the same; else
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index 1f43fd7f3209..0b8136a32484 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -220,6 +220,11 @@ xfs_symlink(
 	if (error)
 		goto out_trans_cancel;
 
+	error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK,
+			XFS_IEXT_DIR_MANIP_CNT(mp));
+	if (error)
+		goto out_trans_cancel;
+
 	/*
 	 * Allocate an inode for the symlink.
 	 */
-- 
2.29.2



* [PATCH V14 05/16] xfs: Check for extent overflow when removing dir entries
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (3 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-12  1:38   ` Darrick J. Wong
  2021-01-10 16:07 ` [PATCH V14 06/16] xfs: Check for extent overflow when renaming " Chandan Babu R
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Directory entry removal must always succeed; hence XFS does the
following during a low disk space scenario:
1. Data/Free blocks linger until a future remove operation.
2. Dabtree blocks would be swapped with the last block in the leaf space
   and then the new last block will be unmapped.

This facility is reused during the low inode extent count scenario,
i.e. this commit causes xfs_bmap_del_extent_real() to return the
-ENOSPC error code so that the above mentioned behaviour is exercised,
causing no change to the directory's extent count.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 32aeacf6f055..6c8f17a0e247 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5151,6 +5151,24 @@ xfs_bmap_del_extent_real(
 		/*
 		 * Deleting the middle of the extent.
 		 */
+
+		/*
+		 * For directories, -ENOSPC is returned since a directory entry
+		 * remove operation must not fail due to low extent count
+		 * availability. -ENOSPC will be handled by higher layers of XFS
+		 * by letting the corresponding empty Data/Free blocks to linger
+		 * until a future remove operation. Dabtree blocks would be
+		 * swapped with the last block in the leaf space and then the
+		 * new last block will be unmapped.
+		 */
+		error = xfs_iext_count_may_overflow(ip, whichfork, 1);
+		if (error) {
+			ASSERT(S_ISDIR(VFS_I(ip)->i_mode) &&
+				whichfork == XFS_DATA_FORK);
+			error = -ENOSPC;
+			goto done;
+		}
+
 		old = got;
 
 		got.br_blockcount = del->br_startoff - got.br_startoff;
-- 
2.29.2



* [PATCH V14 06/16] xfs: Check for extent overflow when renaming dir entries
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (4 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 05/16] xfs: Check for extent overflow when removing " Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-12  1:37   ` Darrick J. Wong
  2021-01-10 16:07 ` [PATCH V14 07/16] xfs: Check for extent overflow when adding/removing xattrs Chandan Babu R
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

A rename operation is essentially a directory entry remove operation
from the perspective of parent directory (i.e. src_dp) of rename's
source. Hence the only place where we check for extent count overflow
for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real()
returns -ENOSPC when it detects a possible extent count overflow and in
response, the higher layers of directory handling code do the following:
1. Data/Free blocks: XFS lets these blocks linger until a future remove
   operation removes them.
2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf
   space and unmaps the last block.

For target_dp, there are two cases depending on whether the destination
directory entry exists or not.

When destination directory entry does not exist (i.e. target_ip ==
NULL), extent count overflow check is performed only when transaction
has a non-zero sized space reservation associated with it.  With a
zero-sized space reservation, XFS allows a rename operation to continue
only when the directory has sufficient free space in its data/leaf/free
space blocks to hold the new entry.

When destination directory entry exists (i.e. target_ip != NULL), all
we need to do is change the inode number associated with the already
existing entry. Hence there is no need to perform an extent count
overflow check.

Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c |  3 +++
 fs/xfs/xfs_inode.c       | 44 +++++++++++++++++++++++++++++++++++++++-
 2 files changed, 46 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 6c8f17a0e247..8ebe5f13279c 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -5160,6 +5160,9 @@ xfs_bmap_del_extent_real(
 		 * until a future remove operation. Dabtree blocks would be
 		 * swapped with the last block in the leaf space and then the
 		 * new last block will be unmapped.
+		 *
+		 * The above logic also applies to the source directory entry of
+		 * a rename operation.
 		 */
 		error = xfs_iext_count_may_overflow(ip, whichfork, 1);
 		if (error) {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 4cc787cc4eee..f0a6d528cbc4 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3116,6 +3116,35 @@ xfs_rename(
 	/*
 	 * Check for expected errors before we dirty the transaction
 	 * so we can return an error without a transaction abort.
+	 *
+	 * Extent count overflow check:
+	 *
+	 * From the perspective of src_dp, a rename operation is essentially a
+	 * directory entry remove operation. Hence the only place where we check
+	 * for extent count overflow for src_dp is in
+	 * xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real() returns
+	 * -ENOSPC when it detects a possible extent count overflow and in
+	 * response, the higher layers of directory handling code do the
+	 * following:
+	 * 1. Data/Free blocks: XFS lets these blocks linger until a
+	 *    future remove operation removes them.
+	 * 2. Dabtree blocks: XFS swaps the blocks with the last block in the
+	 *    Leaf space and unmaps the last block.
+	 *
+	 * For target_dp, there are two cases depending on whether the
+	 * destination directory entry exists or not.
+	 *
+	 * When destination directory entry does not exist (i.e. target_ip ==
+	 * NULL), extent count overflow check is performed only when transaction
+	 * has a non-zero sized space reservation associated with it.  With a
+	 * zero-sized space reservation, XFS allows a rename operation to
+	 * continue only when the directory has sufficient free space in its
+	 * data/leaf/free space blocks to hold the new entry.
+	 *
+	 * When destination directory entry exists (i.e. target_ip != NULL), all
+	 * we need to do is change the inode number associated with the already
+	 * existing entry. Hence there is no need to perform an extent count
+	 * overflow check.
 	 */
 	if (target_ip == NULL) {
 		/*
@@ -3126,6 +3155,12 @@ xfs_rename(
 			error = xfs_dir_canenter(tp, target_dp, target_name);
 			if (error)
 				goto out_trans_cancel;
+		} else {
+			error = xfs_iext_count_may_overflow(target_dp,
+					XFS_DATA_FORK,
+					XFS_IEXT_DIR_MANIP_CNT(mp));
+			if (error)
+				goto out_trans_cancel;
 		}
 	} else {
 		/*
@@ -3283,9 +3318,16 @@ xfs_rename(
 	if (wip) {
 		error = xfs_dir_replace(tp, src_dp, src_name, wip->i_ino,
 					spaceres);
-	} else
+	} else {
+		/*
+		 * NOTE: We don't need to check for extent count overflow here
+		 * because the dir remove name code will leave the dir block in
+		 * place if the extent count would overflow.
+		 */
 		error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino,
 					   spaceres);
+	}
+
 	if (error)
 		goto out_trans_cancel;
 
-- 
2.29.2



* [PATCH V14 07/16] xfs: Check for extent overflow when adding/removing xattrs
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (5 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 06/16] xfs: Check for extent overflow when renaming " Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 08/16] xfs: Check for extent overflow when writing to unwritten extent Chandan Babu R
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Adding/removing an xattr can cause XFS_DA_NODE_MAXDEPTH extents to be
added, plus one extra extent for the dabtree in case a local attr is
large enough to cause a double split. Adding a remote xattr can also
cause the extent count to increase in proportion to the size of the
xattr's value.
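
Illustrative arithmetic only (the macro itself is added in the diff
below): XFS_DA_NODE_MAXDEPTH is 5 in the kernel sources, and rmt_blks is
the remote block count that the patch obtains from
xfs_attr3_rmt_blocks(); the values used here are just examples:

#include <stdio.h>

#define XFS_DA_NODE_MAXDEPTH		5
#define max(a, b)			((a) > (b) ? (a) : (b))
#define XFS_IEXT_ATTR_MANIP_CNT(rmt_blks) \
	(XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks))

int main(void)
{
	printf("%d\n", XFS_IEXT_ATTR_MANIP_CNT(0));	/* local attr:           6 */
	printf("%d\n", XFS_IEXT_ATTR_MANIP_CNT(4));	/* 4-block remote value: 9 */
	return 0;
}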

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_attr.c       | 13 +++++++++++++
 fs/xfs/libxfs/xfs_inode_fork.h | 10 ++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c
index fd8e6418a0d3..be51e7068dcd 100644
--- a/fs/xfs/libxfs/xfs_attr.c
+++ b/fs/xfs/libxfs/xfs_attr.c
@@ -396,6 +396,7 @@ xfs_attr_set(
 	struct xfs_trans_res	tres;
 	bool			rsvd = (args->attr_filter & XFS_ATTR_ROOT);
 	int			error, local;
+	int			rmt_blks = 0;
 	unsigned int		total;
 
 	if (XFS_FORCED_SHUTDOWN(dp->i_mount))
@@ -442,11 +443,15 @@ xfs_attr_set(
 		tres.tr_logcount = XFS_ATTRSET_LOG_COUNT;
 		tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
 		total = args->total;
+
+		if (!local)
+			rmt_blks = xfs_attr3_rmt_blocks(mp, args->valuelen);
 	} else {
 		XFS_STATS_INC(mp, xs_attr_remove);
 
 		tres = M_RES(mp)->tr_attrrm;
 		total = XFS_ATTRRM_SPACE_RES(mp);
+		rmt_blks = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
 	}
 
 	/*
@@ -460,6 +465,14 @@ xfs_attr_set(
 
 	xfs_ilock(dp, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(args->trans, dp, 0);
+
+	if (args->value || xfs_inode_hasattr(dp)) {
+		error = xfs_iext_count_may_overflow(dp, XFS_ATTR_FORK,
+				XFS_IEXT_ATTR_MANIP_CNT(rmt_blks));
+		if (error)
+			goto out_trans_cancel;
+	}
+
 	if (args->value) {
 		unsigned int	quota_flags = XFS_QMOPT_RES_REGBLKS;
 
diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index ea1a9dd8a763..8d89838e23f8 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -60,6 +60,16 @@ struct xfs_ifork {
 #define XFS_IEXT_DIR_MANIP_CNT(mp) \
 	((XFS_DA_NODE_MAXDEPTH + 1 + 1) * (mp)->m_dir_geo->fsbcount)
 
+/*
+ * Adding/removing an xattr can cause XFS_DA_NODE_MAXDEPTH extents to
+ * be added. One extra extent for dabtree in case a local attr is
+ * large enough to cause a double split.  It can also cause extent
+ * count to increase proportional to the size of a remote xattr's
+ * value.
+ */
+#define XFS_IEXT_ATTR_MANIP_CNT(rmt_blks) \
+	(XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks))
+
 /*
  * Fork handling.
  */
-- 
2.29.2



* [PATCH V14 08/16] xfs: Check for extent overflow when writing to unwritten extent
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (6 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 07/16] xfs: Check for extent overflow when adding/removing xattrs Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 09/16] xfs: Check for extent overflow when moving extent from cow to data fork Chandan Babu R
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

A write to a sub-interval of an existing unwritten extent causes
the original extent to be split into 3 extents
i.e. | Unwritten | Real | Unwritten |
Hence extent count can increase by 2.
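
A user-space sketch of the scenario described above (the file path
/mnt/test/f is hypothetical; the resulting extent layout can be
inspected with xfs_bmap -v after the program runs):

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	char buf[4096];
	int fd = open("/mnt/test/f", O_CREAT | O_RDWR, 0644);

	if (fd < 0)
		return 1;

	/* Preallocate one 1 MiB unwritten extent. */
	if (fallocate(fd, 0, 0, 1 << 20) < 0)
		return 1;

	/* Write one block in the middle of it and flush it to disk. After
	 * writeback the fork holds | Unwritten | Real | Unwritten |,
	 * i.e. two more extents than before. */
	memset(buf, 0xab, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 512 * 1024) != (ssize_t)sizeof(buf))
		return 1;
	fsync(fd);

	close(fd);
	return 0;
}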

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.h | 9 +++++++++
 fs/xfs/xfs_iomap.c             | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 8d89838e23f8..917e289ad962 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -70,6 +70,15 @@ struct xfs_ifork {
 #define XFS_IEXT_ATTR_MANIP_CNT(rmt_blks) \
 	(XFS_DA_NODE_MAXDEPTH + max(1, rmt_blks))
 
+/*
+ * A write to a sub-interval of an existing unwritten extent causes the original
+ * extent to be split into 3 extents
+ * i.e. | Unwritten | Real | Unwritten |
+ * Hence extent count can increase by 2.
+ */
+#define XFS_IEXT_WRITE_UNWRITTEN_CNT	(2)
+
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f53690febb22..5bf84622421d 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -566,6 +566,11 @@ xfs_iomap_write_unwritten(
 		if (error)
 			goto error_on_bmapi_transaction;
 
+		error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+				XFS_IEXT_WRITE_UNWRITTEN_CNT);
+		if (error)
+			goto error_on_bmapi_transaction;
+
 		/*
 		 * Modify the unwritten extent state of the buffer.
 		 */
-- 
2.29.2



* [PATCH V14 09/16] xfs: Check for extent overflow when moving extent from cow to data fork
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (7 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 08/16] xfs: Check for extent overflow when writing to unwritten extent Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 10/16] xfs: Check for extent overflow when remapping an extent Chandan Babu R
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Moving an extent to data fork can cause a sub-interval of an existing
extent to be unmapped. This will increase extent count by 1. Mapping in
the new extent can increase the extent count by 1 again i.e.
 | Old extent | New extent | Old extent |
Hence number of extents increases by 2.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.h | 9 +++++++++
 fs/xfs/xfs_reflink.c           | 5 +++++
 2 files changed, 14 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index 917e289ad962..c8f279edc5c1 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -79,6 +79,15 @@ struct xfs_ifork {
 #define XFS_IEXT_WRITE_UNWRITTEN_CNT	(2)
 
 
+/*
+ * Moving an extent to data fork can cause a sub-interval of an existing extent
+ * to be unmapped. This will increase extent count by 1. Mapping in the new
+ * extent can increase the extent count by 1 again i.e.
+ * | Old extent | New extent | Old extent |
+ * Hence number of extents increases by 2.
+ */
+#define XFS_IEXT_REFLINK_END_COW_CNT	(2)
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 6fa05fb78189..ca0ac1426d74 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -628,6 +628,11 @@ xfs_reflink_end_cow_extent(
 	xfs_ilock(ip, XFS_ILOCK_EXCL);
 	xfs_trans_ijoin(tp, ip, 0);
 
+	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK,
+			XFS_IEXT_REFLINK_END_COW_CNT);
+	if (error)
+		goto out_cancel;
+
 	/*
 	 * In case of racing, overlapping AIO writes no COW extents might be
 	 * left by the time I/O completes for the loser of the race.  In that
-- 
2.29.2



* [PATCH V14 10/16] xfs: Check for extent overflow when remapping an extent
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (8 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 09/16] xfs: Check for extent overflow when moving extent from cow to data fork Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 11/16] xfs: Check for extent overflow when swapping extents Chandan Babu R
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Remapping an extent involves unmapping the existing extent and mapping
in the new extent. When unmapping, an extent containing the entire unmap
range can be split into two extents,
i.e. | Old extent | hole | Old extent |
Hence extent count increases by 1.

Mapping in the new extent into the destination file can increase the
extent count by 1.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/xfs_reflink.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index ca0ac1426d74..e1c98dbf79e4 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1006,6 +1006,7 @@ xfs_reflink_remap_extent(
 	unsigned int		resblks;
 	bool			smap_real;
 	bool			dmap_written = xfs_bmap_is_written_extent(dmap);
+	int			iext_delta = 0;
 	int			nimaps;
 	int			error;
 
@@ -1099,6 +1100,16 @@ xfs_reflink_remap_extent(
 			goto out_cancel;
 	}
 
+	if (smap_real)
+		++iext_delta;
+
+	if (dmap_written)
+		++iext_delta;
+
+	error = xfs_iext_count_may_overflow(ip, XFS_DATA_FORK, iext_delta);
+	if (error)
+		goto out_cancel;
+
 	if (smap_real) {
 		/*
 		 * If the extent we're unmapping is backed by storage (written
-- 
2.29.2



* [PATCH V14 11/16] xfs: Check for extent overflow when swapping extents
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (9 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 10/16] xfs: Check for extent overflow when remapping an extent Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 12/16] xfs: Introduce error injection to reduce maximum inode fork extent count Chandan Babu R
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

Removing an initial range of source/donor file's extent and adding a new
extent (from donor/source file) in its place will cause extent count to
increase by 1.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_inode_fork.h |  7 +++++++
 fs/xfs/xfs_bmap_util.c         | 16 ++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
index c8f279edc5c1..9e2137cd7372 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.h
+++ b/fs/xfs/libxfs/xfs_inode_fork.h
@@ -88,6 +88,13 @@ struct xfs_ifork {
  */
 #define XFS_IEXT_REFLINK_END_COW_CNT	(2)
 
+/*
+ * Removing an initial range of source/donor file's extent and adding a new
+ * extent (from donor/source file) in its place will cause extent count to
+ * increase by 1.
+ */
+#define XFS_IEXT_SWAP_RMAP_CNT		(1)
+
 /*
  * Fork handling.
  */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 6ac7a6ac2658..f3f8c48ff5bf 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1399,6 +1399,22 @@ xfs_swap_extent_rmap(
 					irec.br_blockcount);
 			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
 
+			if (xfs_bmap_is_real_extent(&uirec)) {
+				error = xfs_iext_count_may_overflow(ip,
+						XFS_DATA_FORK,
+						XFS_IEXT_SWAP_RMAP_CNT);
+				if (error)
+					goto out;
+			}
+
+			if (xfs_bmap_is_real_extent(&irec)) {
+				error = xfs_iext_count_may_overflow(tip,
+						XFS_DATA_FORK,
+						XFS_IEXT_SWAP_RMAP_CNT);
+				if (error)
+					goto out;
+			}
+
 			/* Remove the mapping from the donor file. */
 			xfs_bmap_unmap_extent(tp, tip, &uirec);
 
-- 
2.29.2



* [PATCH V14 12/16] xfs: Introduce error injection to reduce maximum inode fork extent count
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (10 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 11/16] xfs: Check for extent overflow when swapping extents Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 13/16] xfs: Remove duplicate assert statement in xfs_bmap_btalloc() Chandan Babu R
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

This commit adds the XFS_ERRTAG_REDUCE_MAX_IEXTENTS error tag, which
enables userspace programs to test "Inode fork extent count overflow
detection" by reducing the maximum possible inode fork extent count
to 10.
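
A hypothetical user-space snippet showing how a test might arm the new
tag once this patch is applied; the path follows the existing
/sys/fs/xfs/<device>/errortag/ sysfs layout and "loop0" is only an
example device name:

#include <stdio.h>

int main(void)
{
	/* "loop0" stands in for the block device backing the test filesystem. */
	FILE *f = fopen("/sys/fs/xfs/loop0/errortag/reduce_max_iextents", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}

	/* Writing 1 triggers the tag on every check, i.e. the maximum inode
	 * fork extent count stays reduced to 10 until the tag is cleared. */
	fprintf(f, "1\n");
	return fclose(f) ? 1 : 0;
}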

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_errortag.h   | 4 +++-
 fs/xfs/libxfs/xfs_inode_fork.c | 4 ++++
 fs/xfs/xfs_error.c             | 3 +++
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 53b305dea381..1c56fcceeea6 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -56,7 +56,8 @@
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
 #define XFS_ERRTAG_BUF_IOERROR				35
-#define XFS_ERRTAG_MAX					36
+#define XFS_ERRTAG_REDUCE_MAX_IEXTENTS			36
+#define XFS_ERRTAG_MAX					37
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -97,5 +98,6 @@
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 #define XFS_RANDOM_BUF_IOERROR				XFS_RANDOM_DEFAULT
+#define XFS_RANDOM_REDUCE_MAX_IEXTENTS			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_inode_fork.c b/fs/xfs/libxfs/xfs_inode_fork.c
index 8d48716547e5..e080d7e07643 100644
--- a/fs/xfs/libxfs/xfs_inode_fork.c
+++ b/fs/xfs/libxfs/xfs_inode_fork.c
@@ -24,6 +24,7 @@
 #include "xfs_dir2_priv.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_types.h"
+#include "xfs_errortag.h"
 
 kmem_zone_t *xfs_ifork_zone;
 
@@ -745,6 +746,9 @@ xfs_iext_count_may_overflow(
 
 	max_exts = (whichfork == XFS_ATTR_FORK) ? MAXAEXTNUM : MAXEXTNUM;
 
+	if (XFS_TEST_ERROR(false, ip->i_mount, XFS_ERRTAG_REDUCE_MAX_IEXTENTS))
+		max_exts = 10;
+
 	nr_exts = ifp->if_nextents + nr_to_add;
 	if (nr_exts < ifp->if_nextents || nr_exts > max_exts)
 		return -EFBIG;
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 7f6e20899473..3780b118cc47 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -54,6 +54,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
 	XFS_RANDOM_IUNLINK_FALLBACK,
 	XFS_RANDOM_BUF_IOERROR,
+	XFS_RANDOM_REDUCE_MAX_IEXTENTS,
 };
 
 struct xfs_errortag_attr {
@@ -164,6 +165,7 @@ XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
+XFS_ERRORTAG_ATTR_RW(reduce_max_iextents,	XFS_ERRTAG_REDUCE_MAX_IEXTENTS);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -202,6 +204,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
+	XFS_ERRORTAG_ATTR_LIST(reduce_max_iextents),
 	NULL,
 };
 
-- 
2.29.2



* [PATCH V14 13/16] xfs: Remove duplicate assert statement in xfs_bmap_btalloc()
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (11 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 12/16] xfs: Introduce error injection to reduce maximum inode fork extent count Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 14/16] xfs: Compute bmap extent alignments in a separate function Chandan Babu R
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

The check for verifying if the allocated extent is from an AG whose
index is greater than or equal to that of tp->t_firstblock is already
done a couple of statements earlier in the same function. Hence this
commit removes the redundant assert statement.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8ebe5f13279c..0b15b1ff4bdd 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3699,7 +3699,6 @@ xfs_bmap_btalloc(
 		ap->blkno = args.fsbno;
 		if (ap->tp->t_firstblock == NULLFSBLOCK)
 			ap->tp->t_firstblock = args.fsbno;
-		ASSERT(nullfb || fb_agno <= args.agno);
 		ap->length = args.len;
 		/*
 		 * If the extent size hint is active, we tried to round the
-- 
2.29.2



* [PATCH V14 14/16] xfs: Compute bmap extent alignments in a separate function
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (12 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 13/16] xfs: Remove duplicate assert statement in xfs_bmap_btalloc() Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 15/16] xfs: Process allocated extent " Chandan Babu R
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

This commit moves over the code which computes stripe alignment and
extent size hint alignment into a separate function. Apart from
xfs_bmap_btalloc(), the new function will be used by another function
introduced in a future commit.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 89 +++++++++++++++++++++++-----------------
 1 file changed, 52 insertions(+), 37 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 0b15b1ff4bdd..8955a0a938d5 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3463,13 +3463,59 @@ xfs_bmap_btalloc_accounting(
 		args->len);
 }
 
+static int
+xfs_bmap_compute_alignments(
+	struct xfs_bmalloca	*ap,
+	struct xfs_alloc_arg	*args)
+{
+	struct xfs_mount	*mp = args->mp;
+	xfs_extlen_t		align = 0; /* minimum allocation alignment */
+	int			stripe_align = 0;
+	int			error;
+
+	/* stripe alignment for allocation is determined by mount parameters */
+	if (mp->m_swidth && (mp->m_flags & XFS_MOUNT_SWALLOC))
+		stripe_align = mp->m_swidth;
+	else if (mp->m_dalign)
+		stripe_align = mp->m_dalign;
+
+	if (ap->flags & XFS_BMAPI_COWFORK)
+		align = xfs_get_cowextsz_hint(ap->ip);
+	else if (ap->datatype & XFS_ALLOC_USERDATA)
+		align = xfs_get_extsz_hint(ap->ip);
+	if (align) {
+		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
+						align, 0, ap->eof, 0, ap->conv,
+						&ap->offset, &ap->length);
+		ASSERT(!error);
+		ASSERT(ap->length);
+	}
+
+	/* apply extent size hints if obtained earlier */
+	if (align) {
+		args->prod = align;
+		div_u64_rem(ap->offset, args->prod, &args->mod);
+		if (args->mod)
+			args->mod = args->prod - args->mod;
+	} else if (mp->m_sb.sb_blocksize >= PAGE_SIZE) {
+		args->prod = 1;
+		args->mod = 0;
+	} else {
+		args->prod = PAGE_SIZE >> mp->m_sb.sb_blocklog;
+		div_u64_rem(ap->offset, args->prod, &args->mod);
+		if (args->mod)
+			args->mod = args->prod - args->mod;
+	}
+
+	return stripe_align;
+}
+
 STATIC int
 xfs_bmap_btalloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
 {
 	xfs_mount_t	*mp;		/* mount point structure */
 	xfs_alloctype_t	atype = 0;	/* type for allocation routines */
-	xfs_extlen_t	align = 0;	/* minimum allocation alignment */
 	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
 	xfs_agnumber_t	ag;
 	xfs_alloc_arg_t	args;
@@ -3489,25 +3535,11 @@ xfs_bmap_btalloc(
 
 	mp = ap->ip->i_mount;
 
-	/* stripe alignment for allocation is determined by mount parameters */
-	stripe_align = 0;
-	if (mp->m_swidth && (mp->m_flags & XFS_MOUNT_SWALLOC))
-		stripe_align = mp->m_swidth;
-	else if (mp->m_dalign)
-		stripe_align = mp->m_dalign;
-
-	if (ap->flags & XFS_BMAPI_COWFORK)
-		align = xfs_get_cowextsz_hint(ap->ip);
-	else if (ap->datatype & XFS_ALLOC_USERDATA)
-		align = xfs_get_extsz_hint(ap->ip);
-	if (align) {
-		error = xfs_bmap_extsize_align(mp, &ap->got, &ap->prev,
-						align, 0, ap->eof, 0, ap->conv,
-						&ap->offset, &ap->length);
-		ASSERT(!error);
-		ASSERT(ap->length);
-	}
+	memset(&args, 0, sizeof(args));
+	args.tp = ap->tp;
+	args.mp = mp;
 
+	stripe_align = xfs_bmap_compute_alignments(ap, &args);
 
 	nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
 	fb_agno = nullfb ? NULLAGNUMBER : XFS_FSB_TO_AGNO(mp,
@@ -3538,9 +3570,6 @@ xfs_bmap_btalloc(
 	 * Normal allocation, done through xfs_alloc_vextent.
 	 */
 	tryagain = isaligned = 0;
-	memset(&args, 0, sizeof(args));
-	args.tp = ap->tp;
-	args.mp = mp;
 	args.fsbno = ap->blkno;
 	args.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
 
@@ -3571,21 +3600,7 @@ xfs_bmap_btalloc(
 		args.total = ap->total;
 		args.minlen = ap->minlen;
 	}
-	/* apply extent size hints if obtained earlier */
-	if (align) {
-		args.prod = align;
-		div_u64_rem(ap->offset, args.prod, &args.mod);
-		if (args.mod)
-			args.mod = args.prod - args.mod;
-	} else if (mp->m_sb.sb_blocksize >= PAGE_SIZE) {
-		args.prod = 1;
-		args.mod = 0;
-	} else {
-		args.prod = PAGE_SIZE >> mp->m_sb.sb_blocklog;
-		div_u64_rem(ap->offset, args.prod, &args.mod);
-		if (args.mod)
-			args.mod = args.prod - args.mod;
-	}
+
 	/*
 	 * If we are not low on available data blocks, and the underlying
 	 * logical volume manager is a stripe, and the file offset is zero then
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V14 15/16] xfs: Process allocated extent in a separate function
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (13 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 14/16] xfs: Compute bmap extent alignments in a separate function Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2021-01-10 16:07 ` [PATCH V14 16/16] xfs: Introduce error injection to allocate only minlen size extents for files Chandan Babu R
  2022-05-23 11:15 ` [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Amir Goldstein
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

This commit moves the code in xfs_bmap_btalloc() that is responsible
for processing an allocated extent into a new function. Apart from
xfs_bmap_btalloc(), the new function will also be invoked by another
function introduced in a future commit.

Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_bmap.c | 74 ++++++++++++++++++++++++----------------
 1 file changed, 45 insertions(+), 29 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 8955a0a938d5..bf53a0b1eff3 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3510,6 +3510,48 @@ xfs_bmap_compute_alignments(
 	return stripe_align;
 }
 
+static void
+xfs_bmap_process_allocated_extent(
+	struct xfs_bmalloca	*ap,
+	struct xfs_alloc_arg	*args,
+	xfs_fileoff_t		orig_offset,
+	xfs_extlen_t		orig_length)
+{
+	int			nullfb;
+
+	nullfb = ap->tp->t_firstblock == NULLFSBLOCK;
+
+	/*
+	 * check the allocation happened at the same or higher AG than
+	 * the first block that was allocated.
+	 */
+	ASSERT(nullfb ||
+		XFS_FSB_TO_AGNO(args->mp, ap->tp->t_firstblock) <=
+		XFS_FSB_TO_AGNO(args->mp, args->fsbno));
+
+	ap->blkno = args->fsbno;
+	if (nullfb)
+		ap->tp->t_firstblock = args->fsbno;
+	ap->length = args->len;
+	/*
+	 * If the extent size hint is active, we tried to round the
+	 * caller's allocation request offset down to extsz and the
+	 * length up to another extsz boundary.  If we found a free
+	 * extent we mapped it in starting at this new offset.  If the
+	 * newly mapped space isn't long enough to cover any of the
+	 * range of offsets that was originally requested, move the
+	 * mapping up so that we can fill as much of the caller's
+	 * original request as possible.  Free space is apparently
+	 * very fragmented so we're unlikely to be able to satisfy the
+	 * hints anyway.
+	 */
+	if (ap->length <= orig_length)
+		ap->offset = orig_offset;
+	else if (ap->offset + ap->length < orig_offset + orig_length)
+		ap->offset = orig_offset + orig_length - ap->length;
+	xfs_bmap_btalloc_accounting(ap, args);
+}
+
 STATIC int
 xfs_bmap_btalloc(
 	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
@@ -3702,36 +3744,10 @@ xfs_bmap_btalloc(
 			return error;
 		ap->tp->t_flags |= XFS_TRANS_LOWMODE;
 	}
+
 	if (args.fsbno != NULLFSBLOCK) {
-		/*
-		 * check the allocation happened at the same or higher AG than
-		 * the first block that was allocated.
-		 */
-		ASSERT(ap->tp->t_firstblock == NULLFSBLOCK ||
-		       XFS_FSB_TO_AGNO(mp, ap->tp->t_firstblock) <=
-		       XFS_FSB_TO_AGNO(mp, args.fsbno));
-
-		ap->blkno = args.fsbno;
-		if (ap->tp->t_firstblock == NULLFSBLOCK)
-			ap->tp->t_firstblock = args.fsbno;
-		ap->length = args.len;
-		/*
-		 * If the extent size hint is active, we tried to round the
-		 * caller's allocation request offset down to extsz and the
-		 * length up to another extsz boundary.  If we found a free
-		 * extent we mapped it in starting at this new offset.  If the
-		 * newly mapped space isn't long enough to cover any of the
-		 * range of offsets that was originally requested, move the
-		 * mapping up so that we can fill as much of the caller's
-		 * original request as possible.  Free space is apparently
-		 * very fragmented so we're unlikely to be able to satisfy the
-		 * hints anyway.
-		 */
-		if (ap->length <= orig_length)
-			ap->offset = orig_offset;
-		else if (ap->offset + ap->length < orig_offset + orig_length)
-			ap->offset = orig_offset + orig_length - ap->length;
-		xfs_bmap_btalloc_accounting(ap, &args);
+		xfs_bmap_process_allocated_extent(ap, &args, orig_offset,
+			orig_length);
 	} else {
 		ap->blkno = NULLFSBLOCK;
 		ap->length = 0;
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* [PATCH V14 16/16] xfs: Introduce error injection to allocate only minlen size extents for files
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (14 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 15/16] xfs: Process allocated extent " Chandan Babu R
@ 2021-01-10 16:07 ` Chandan Babu R
  2022-05-23 11:15 ` [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Amir Goldstein
  16 siblings, 0 replies; 31+ messages in thread
From: Chandan Babu R @ 2021-01-10 16:07 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Babu R, darrick.wong, djwong, hch, allison.henderson

This commit adds the XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag,
which lets userspace test programs force xfs_bmap_btalloc() to always
allocate minlen sized extents.

This is required for test programs which need a guarantee that minlen
extents allocated for a file do not get merged with their existing
neighbours in the inode's BMBT. The "inode fork extent overflow check"
for directories, xattrs and extension of realtime inodes needs this
since the file offset at which the extents are allocated cannot be
explicitly controlled from userspace.

One way to use this error tag is to:
1. Consume all of the free space by sequentially writing to a file.
2. Punch alternate blocks of the file. This causes the CNTBT to contain
   a sufficient number of one block sized extent records.
3. Inject the XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT error tag (see the
   sketch below).
After step 3, xfs_bmap_btalloc() will issue space allocation requests
for minlen sized extents only.

An ENOSPC error is returned to userspace when there aren't any "one
block sized" extents left in any of the AGs.

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>
---
 fs/xfs/libxfs/xfs_alloc.c    |  50 ++++++++++++++
 fs/xfs/libxfs/xfs_alloc.h    |   3 +
 fs/xfs/libxfs/xfs_bmap.c     | 124 ++++++++++++++++++++++++++++-------
 fs/xfs/libxfs/xfs_errortag.h |   4 +-
 fs/xfs/xfs_error.c           |   3 +
 5 files changed, 159 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_alloc.c b/fs/xfs/libxfs/xfs_alloc.c
index 7cb9f064ac64..0c623d3c1036 100644
--- a/fs/xfs/libxfs/xfs_alloc.c
+++ b/fs/xfs/libxfs/xfs_alloc.c
@@ -2474,6 +2474,47 @@ xfs_defer_agfl_block(
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_AGFL_FREE, &new->xefi_list);
 }
 
+#ifdef DEBUG
+/*
+ * Check if an AGF has a free extent record whose length is equal to
+ * args->minlen.
+ */
+STATIC int
+xfs_exact_minlen_extent_available(
+	struct xfs_alloc_arg	*args,
+	struct xfs_buf		*agbp,
+	int			*stat)
+{
+	struct xfs_btree_cur	*cnt_cur;
+	xfs_agblock_t		fbno;
+	xfs_extlen_t		flen;
+	int			error = 0;
+
+	cnt_cur = xfs_allocbt_init_cursor(args->mp, args->tp, agbp,
+			args->agno, XFS_BTNUM_CNT);
+	error = xfs_alloc_lookup_ge(cnt_cur, 0, args->minlen, stat);
+	if (error)
+		goto out;
+
+	if (*stat == 0) {
+		error = -EFSCORRUPTED;
+		goto out;
+	}
+
+	error = xfs_alloc_get_rec(cnt_cur, &fbno, &flen, stat);
+	if (error)
+		goto out;
+
+	if (*stat == 1 && flen != args->minlen)
+		*stat = 0;
+
+out:
+	xfs_btree_del_cursor(cnt_cur, error);
+
+	return error;
+}
+#endif
+
 /*
  * Decide whether to use this allocation group for this allocation.
  * If so, fix up the btree freelist's size.
@@ -2545,6 +2586,15 @@ xfs_alloc_fix_freelist(
 	if (!xfs_alloc_space_available(args, need, flags))
 		goto out_agbp_relse;
 
+#ifdef DEBUG
+	if (args->alloc_minlen_only) {
+		int stat;
+
+		error = xfs_exact_minlen_extent_available(args, agbp, &stat);
+		if (error || !stat)
+			goto out_agbp_relse;
+	}
+#endif
 	/*
 	 * Make the freelist shorter if it's too long.
 	 *
diff --git a/fs/xfs/libxfs/xfs_alloc.h b/fs/xfs/libxfs/xfs_alloc.h
index 6c22b12176b8..a4427c5775c2 100644
--- a/fs/xfs/libxfs/xfs_alloc.h
+++ b/fs/xfs/libxfs/xfs_alloc.h
@@ -75,6 +75,9 @@ typedef struct xfs_alloc_arg {
 	char		wasfromfl;	/* set if allocation is from freelist */
 	struct xfs_owner_info	oinfo;	/* owner of blocks being allocated */
 	enum xfs_ag_resv_type	resv;	/* block reservation to use */
+#ifdef DEBUG
+	bool		alloc_minlen_only; /* allocate exact minlen extent */
+#endif
 } xfs_alloc_arg_t;
 
 /*
diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index bf53a0b1eff3..2cd24bb06040 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3552,34 +3552,101 @@ xfs_bmap_process_allocated_extent(
 	xfs_bmap_btalloc_accounting(ap, args);
 }
 
-STATIC int
-xfs_bmap_btalloc(
-	struct xfs_bmalloca	*ap)	/* bmap alloc argument struct */
+#ifdef DEBUG
+static int
+xfs_bmap_exact_minlen_extent_alloc(
+	struct xfs_bmalloca	*ap)
 {
-	xfs_mount_t	*mp;		/* mount point structure */
-	xfs_alloctype_t	atype = 0;	/* type for allocation routines */
-	xfs_agnumber_t	fb_agno;	/* ag number of ap->firstblock */
-	xfs_agnumber_t	ag;
-	xfs_alloc_arg_t	args;
-	xfs_fileoff_t	orig_offset;
-	xfs_extlen_t	orig_length;
-	xfs_extlen_t	blen;
-	xfs_extlen_t	nextminlen = 0;
-	int		nullfb;		/* true if ap->firstblock isn't set */
-	int		isaligned;
-	int		tryagain;
-	int		error;
-	int		stripe_align;
+	struct xfs_mount	*mp = ap->ip->i_mount;
+	struct xfs_alloc_arg	args = { .tp = ap->tp, .mp = mp };
+	xfs_fileoff_t		orig_offset;
+	xfs_extlen_t		orig_length;
+	int			error;
 
 	ASSERT(ap->length);
+
+	if (ap->minlen != 1) {
+		ap->blkno = NULLFSBLOCK;
+		ap->length = 0;
+		return 0;
+	}
+
 	orig_offset = ap->offset;
 	orig_length = ap->length;
 
-	mp = ap->ip->i_mount;
+	args.alloc_minlen_only = 1;
 
-	memset(&args, 0, sizeof(args));
-	args.tp = ap->tp;
-	args.mp = mp;
+	xfs_bmap_compute_alignments(ap, &args);
+
+	if (ap->tp->t_firstblock == NULLFSBLOCK) {
+		/*
+		 * Unlike the longest extent available in an AG, we don't track
+		 * the length of an AG's shortest extent.
+		 * XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT is a debug only knob and
+		 * hence we can afford to start traversing from the 0th AG since
+		 * we need not be concerned about a drop in performance in
+		 * "debug only" code paths.
+		 */
+		ap->blkno = XFS_AGB_TO_FSB(mp, 0, 0);
+	} else {
+		ap->blkno = ap->tp->t_firstblock;
+	}
+
+	args.fsbno = ap->blkno;
+	args.oinfo = XFS_RMAP_OINFO_SKIP_UPDATE;
+	args.type = XFS_ALLOCTYPE_FIRST_AG;
+	args.total = args.minlen = args.maxlen = ap->minlen;
+
+	args.alignment = 1;
+	args.minalignslop = 0;
+
+	args.minleft = ap->minleft;
+	args.wasdel = ap->wasdel;
+	args.resv = XFS_AG_RESV_NONE;
+	args.datatype = ap->datatype;
+
+	error = xfs_alloc_vextent(&args);
+	if (error)
+		return error;
+
+	if (args.fsbno != NULLFSBLOCK) {
+		xfs_bmap_process_allocated_extent(ap, &args, orig_offset,
+			orig_length);
+	} else {
+		ap->blkno = NULLFSBLOCK;
+		ap->length = 0;
+	}
+
+	return 0;
+}
+#else
+
+#define xfs_bmap_exact_minlen_extent_alloc(bma) (-EFSCORRUPTED)
+
+#endif
+
+STATIC int
+xfs_bmap_btalloc(
+	struct xfs_bmalloca	*ap)
+{
+	struct xfs_mount	*mp = ap->ip->i_mount;
+	struct xfs_alloc_arg	args = { .tp = ap->tp, .mp = mp };
+	xfs_alloctype_t		atype = 0;
+	xfs_agnumber_t		fb_agno;	/* ag number of ap->firstblock */
+	xfs_agnumber_t		ag;
+	xfs_fileoff_t		orig_offset;
+	xfs_extlen_t		orig_length;
+	xfs_extlen_t		blen;
+	xfs_extlen_t		nextminlen = 0;
+	int			nullfb; /* true if ap->firstblock isn't set */
+	int			isaligned;
+	int			tryagain;
+	int			error;
+	int			stripe_align;
+
+	ASSERT(ap->length);
+	orig_offset = ap->offset;
+	orig_length = ap->length;
 
 	stripe_align = xfs_bmap_compute_alignments(ap, &args);
 
@@ -4113,6 +4180,10 @@ xfs_bmap_alloc_userdata(
 			return xfs_bmap_rtalloc(bma);
 	}
 
+	if (unlikely(XFS_TEST_ERROR(false, mp,
+			XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT)))
+		return xfs_bmap_exact_minlen_extent_alloc(bma);
+
 	return xfs_bmap_btalloc(bma);
 }
 
@@ -4149,10 +4220,15 @@ xfs_bmapi_allocate(
 	else
 		bma->minlen = 1;
 
-	if (bma->flags & XFS_BMAPI_METADATA)
-		error = xfs_bmap_btalloc(bma);
-	else
+	if (bma->flags & XFS_BMAPI_METADATA) {
+		if (unlikely(XFS_TEST_ERROR(false, mp,
+				XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT)))
+			error = xfs_bmap_exact_minlen_extent_alloc(bma);
+		else
+			error = xfs_bmap_btalloc(bma);
+	} else {
 		error = xfs_bmap_alloc_userdata(bma);
+	}
 	if (error || bma->blkno == NULLFSBLOCK)
 		return error;
 
diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 1c56fcceeea6..6ca9084b6934 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -57,7 +57,8 @@
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
 #define XFS_ERRTAG_BUF_IOERROR				35
 #define XFS_ERRTAG_REDUCE_MAX_IEXTENTS			36
-#define XFS_ERRTAG_MAX					37
+#define XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT		37
+#define XFS_ERRTAG_MAX					38
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -99,5 +100,6 @@
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 #define XFS_RANDOM_BUF_IOERROR				XFS_RANDOM_DEFAULT
 #define XFS_RANDOM_REDUCE_MAX_IEXTENTS			1
+#define XFS_RANDOM_BMAP_ALLOC_MINLEN_EXTENT		1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 3780b118cc47..185b4915b7bf 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -55,6 +55,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_IUNLINK_FALLBACK,
 	XFS_RANDOM_BUF_IOERROR,
 	XFS_RANDOM_REDUCE_MAX_IEXTENTS,
+	XFS_RANDOM_BMAP_ALLOC_MINLEN_EXTENT,
 };
 
 struct xfs_errortag_attr {
@@ -166,6 +167,7 @@ XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
 XFS_ERRORTAG_ATTR_RW(reduce_max_iextents,	XFS_ERRTAG_REDUCE_MAX_IEXTENTS);
+XFS_ERRORTAG_ATTR_RW(bmap_alloc_minlen_extent,	XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -205,6 +207,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
 	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
 	XFS_ERRORTAG_ATTR_LIST(reduce_max_iextents),
+	XFS_ERRORTAG_ATTR_LIST(bmap_alloc_minlen_extent),
 	NULL,
 };
 
-- 
2.29.2


^ permalink raw reply related	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries
  2021-01-10 16:07 ` [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries Chandan Babu R
@ 2021-01-12  1:34   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2021-01-12  1:34 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, darrick.wong, hch, allison.henderson

On Sun, Jan 10, 2021 at 09:37:08PM +0530, Chandan Babu R wrote:
> Directory entry addition can cause the following,
> 1. Data block can be added/removed.
>    A new extent can cause extent count to increase by 1.
> 2. Free disk block can be added/removed.
>    Same behaviour as described above for Data block.
> 3. Dabtree blocks.
>    XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these
>    can be new extents. Hence extent count can increase by
>    XFS_DA_NODE_MAXDEPTH.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>

OK, seems reasonably straight forward,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_inode_fork.h | 13 +++++++++++++
>  fs/xfs/xfs_inode.c             | 10 ++++++++++
>  fs/xfs/xfs_symlink.c           |  5 +++++
>  3 files changed, 28 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_inode_fork.h b/fs/xfs/libxfs/xfs_inode_fork.h
> index bcac769a7df6..ea1a9dd8a763 100644
> --- a/fs/xfs/libxfs/xfs_inode_fork.h
> +++ b/fs/xfs/libxfs/xfs_inode_fork.h
> @@ -47,6 +47,19 @@ struct xfs_ifork {
>   */
>  #define XFS_IEXT_PUNCH_HOLE_CNT		(1)
>  
> +/*
> + * Directory entry addition can cause the following,
> + * 1. Data block can be added/removed.
> + *    A new extent can cause extent count to increase by 1.
> + * 2. Free disk block can be added/removed.
> + *    Same behaviour as described above for Data block.
> + * 3. Dabtree blocks.
> + *    XFS_DA_NODE_MAXDEPTH blocks can be added. Each of these can be new
> + *    extents. Hence extent count can increase by XFS_DA_NODE_MAXDEPTH.
> + */
> +#define XFS_IEXT_DIR_MANIP_CNT(mp) \
> +	((XFS_DA_NODE_MAXDEPTH + 1 + 1) * (mp)->m_dir_geo->fsbcount)
> +
>  /*
>   * Fork handling.
>   */
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index b7352bc4c815..4cc787cc4eee 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1042,6 +1042,11 @@ xfs_create(
>  	if (error)
>  		goto out_trans_cancel;
>  
> +	error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK,
> +			XFS_IEXT_DIR_MANIP_CNT(mp));
> +	if (error)
> +		goto out_trans_cancel;
> +
>  	/*
>  	 * A newly created regular or special file just has one directory
>  	 * entry pointing to them, but a directory also the "." entry
> @@ -1258,6 +1263,11 @@ xfs_link(
>  	xfs_trans_ijoin(tp, sip, XFS_ILOCK_EXCL);
>  	xfs_trans_ijoin(tp, tdp, XFS_ILOCK_EXCL);
>  
> +	error = xfs_iext_count_may_overflow(tdp, XFS_DATA_FORK,
> +			XFS_IEXT_DIR_MANIP_CNT(mp));
> +	if (error)
> +		goto error_return;
> +
>  	/*
>  	 * If we are using project inheritance, we only allow hard link
>  	 * creation in our tree when the project IDs are the same; else
> diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
> index 1f43fd7f3209..0b8136a32484 100644
> --- a/fs/xfs/xfs_symlink.c
> +++ b/fs/xfs/xfs_symlink.c
> @@ -220,6 +220,11 @@ xfs_symlink(
>  	if (error)
>  		goto out_trans_cancel;
>  
> +	error = xfs_iext_count_may_overflow(dp, XFS_DATA_FORK,
> +			XFS_IEXT_DIR_MANIP_CNT(mp));
> +	if (error)
> +		goto out_trans_cancel;
> +
>  	/*
>  	 * Allocate an inode for the symlink.
>  	 */
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 06/16] xfs: Check for extent overflow when renaming dir entries
  2021-01-10 16:07 ` [PATCH V14 06/16] xfs: Check for extent overflow when renaming " Chandan Babu R
@ 2021-01-12  1:37   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2021-01-12  1:37 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, darrick.wong, hch, allison.henderson

On Sun, Jan 10, 2021 at 09:37:10PM +0530, Chandan Babu R wrote:
> A rename operation is essentially a directory entry remove operation
> from the perspective of parent directory (i.e. src_dp) of rename's
> source. Hence the only place where we check for extent count overflow
> for src_dp is in xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real()
> returns -ENOSPC when it detects a possible extent count overflow and in
> response, the higher layers of directory handling code do the following:
> 1. Data/Free blocks: XFS lets these blocks linger until a future remove
>    operation removes them.
> 2. Dabtree blocks: XFS swaps the blocks with the last block in the Leaf
>    space and unmaps the last block.
> 
> For target_dp, there are two cases depending on whether the destination
> directory entry exists or not.
> 
> When destination directory entry does not exist (i.e. target_ip ==
> NULL), extent count overflow check is performed only when transaction
> has a non-zero sized space reservation associated with it.  With a
> zero-sized space reservation, XFS allows a rename operation to continue
> only when the directory has sufficient free space in its data/leaf/free
> space blocks to hold the new entry.
> 
> When destination directory entry exists (i.e. target_ip != NULL), all
> we need to do is change the inode number associated with the already
> existing entry. Hence there is no need to perform an extent count
> overflow check.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>

Looks good,
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c |  3 +++
>  fs/xfs/xfs_inode.c       | 44 +++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 46 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 6c8f17a0e247..8ebe5f13279c 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5160,6 +5160,9 @@ xfs_bmap_del_extent_real(
>  		 * until a future remove operation. Dabtree blocks would be
>  		 * swapped with the last block in the leaf space and then the
>  		 * new last block will be unmapped.
> +		 *
> +		 * The above logic also applies to the source directory entry of
> +		 * a rename operation.
>  		 */
>  		error = xfs_iext_count_may_overflow(ip, whichfork, 1);
>  		if (error) {
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 4cc787cc4eee..f0a6d528cbc4 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -3116,6 +3116,35 @@ xfs_rename(
>  	/*
>  	 * Check for expected errors before we dirty the transaction
>  	 * so we can return an error without a transaction abort.
> +	 *
> +	 * Extent count overflow check:
> +	 *
> +	 * From the perspective of src_dp, a rename operation is essentially a
> +	 * directory entry remove operation. Hence the only place where we check
> +	 * for extent count overflow for src_dp is in
> +	 * xfs_bmap_del_extent_real(). xfs_bmap_del_extent_real() returns
> +	 * -ENOSPC when it detects a possible extent count overflow and in
> +	 * response, the higher layers of directory handling code do the
> +	 * following:
> +	 * 1. Data/Free blocks: XFS lets these blocks linger until a
> +	 *    future remove operation removes them.
> +	 * 2. Dabtree blocks: XFS swaps the blocks with the last block in the
> +	 *    Leaf space and unmaps the last block.
> +	 *
> +	 * For target_dp, there are two cases depending on whether the
> +	 * destination directory entry exists or not.
> +	 *
> +	 * When destination directory entry does not exist (i.e. target_ip ==
> +	 * NULL), extent count overflow check is performed only when transaction
> +	 * has a non-zero sized space reservation associated with it.  With a
> +	 * zero-sized space reservation, XFS allows a rename operation to
> +	 * continue only when the directory has sufficient free space in its
> +	 * data/leaf/free space blocks to hold the new entry.
> +	 *
> +	 * When destination directory entry exists (i.e. target_ip != NULL), all
> +	 * we need to do is change the inode number associated with the already
> +	 * existing entry. Hence there is no need to perform an extent count
> +	 * overflow check.
>  	 */
>  	if (target_ip == NULL) {
>  		/*
> @@ -3126,6 +3155,12 @@ xfs_rename(
>  			error = xfs_dir_canenter(tp, target_dp, target_name);
>  			if (error)
>  				goto out_trans_cancel;
> +		} else {
> +			error = xfs_iext_count_may_overflow(target_dp,
> +					XFS_DATA_FORK,
> +					XFS_IEXT_DIR_MANIP_CNT(mp));
> +			if (error)
> +				goto out_trans_cancel;
>  		}
>  	} else {
>  		/*
> @@ -3283,9 +3318,16 @@ xfs_rename(
>  	if (wip) {
>  		error = xfs_dir_replace(tp, src_dp, src_name, wip->i_ino,
>  					spaceres);
> -	} else
> +	} else {
> +		/*
> +		 * NOTE: We don't need to check for extent count overflow here
> +		 * because the dir remove name code will leave the dir block in
> +		 * place if the extent count would overflow.
> +		 */
>  		error = xfs_dir_removename(tp, src_dp, src_name, src_ip->i_ino,
>  					   spaceres);
> +	}
> +
>  	if (error)
>  		goto out_trans_cancel;
>  
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 05/16] xfs: Check for extent overflow when removing dir entries
  2021-01-10 16:07 ` [PATCH V14 05/16] xfs: Check for extent overflow when removing " Chandan Babu R
@ 2021-01-12  1:38   ` Darrick J. Wong
  0 siblings, 0 replies; 31+ messages in thread
From: Darrick J. Wong @ 2021-01-12  1:38 UTC (permalink / raw)
  To: Chandan Babu R; +Cc: linux-xfs, darrick.wong, hch, allison.henderson

On Sun, Jan 10, 2021 at 09:37:09PM +0530, Chandan Babu R wrote:
> Directory entry removal must always succeed; Hence XFS does the
> following during low disk space scenario:
> 1. Data/Free blocks linger until a future remove operation.
> 2. Dabtree blocks would be swapped with the last block in the leaf space
>    and then the new last block will be unmapped.
> 
> This facility is reused during low inode extent count scenario i.e. this
> commit causes xfs_bmap_del_extent_real() to return -ENOSPC error code so
> that the above mentioned behaviour is exercised causing no change to the
> directory's extent count.
> 
> Signed-off-by: Chandan Babu R <chandanrlinux@gmail.com>

Thanks for the minor tweaks since v12,

Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>

--D

> ---
>  fs/xfs/libxfs/xfs_bmap.c | 18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> index 32aeacf6f055..6c8f17a0e247 100644
> --- a/fs/xfs/libxfs/xfs_bmap.c
> +++ b/fs/xfs/libxfs/xfs_bmap.c
> @@ -5151,6 +5151,24 @@ xfs_bmap_del_extent_real(
>  		/*
>  		 * Deleting the middle of the extent.
>  		 */
> +
> +		/*
> +		 * For directories, -ENOSPC is returned since a directory entry
> +		 * remove operation must not fail due to low extent count
> +		 * availability. -ENOSPC will be handled by higher layers of XFS
> +		 * by letting the corresponding empty Data/Free blocks to linger
> +		 * until a future remove operation. Dabtree blocks would be
> +		 * swapped with the last block in the leaf space and then the
> +		 * new last block will be unmapped.
> +		 */
> +		error = xfs_iext_count_may_overflow(ip, whichfork, 1);
> +		if (error) {
> +			ASSERT(S_ISDIR(VFS_I(ip)->i_mode) &&
> +				whichfork == XFS_DATA_FORK);
> +			error = -ENOSPC;
> +			goto done;
> +		}
> +
>  		old = got;
>  
>  		got.br_blockcount = del->br_startoff - got.br_startoff;
> -- 
> 2.29.2
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
                   ` (15 preceding siblings ...)
  2021-01-10 16:07 ` [PATCH V14 16/16] xfs: Introduce error injection to allocate only minlen size extents for files Chandan Babu R
@ 2022-05-23 11:15 ` Amir Goldstein
  2022-05-23 15:50   ` Chandan Babu R
  2022-05-23 22:43   ` Dave Chinner
  16 siblings, 2 replies; 31+ messages in thread
From: Amir Goldstein @ 2022-05-23 11:15 UTC (permalink / raw)
  To: Chandan Babu R
  Cc: linux-xfs, Darrick J . Wong, Darrick J. Wong, Christoph Hellwig,
	Allison Henderson, Luis R. Rodriguez, Theodore Tso

On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
>
> XFS does not check for possible overflow of per-inode extent counter
> fields when adding extents to either data or attr fork.
>
> For e.g.
> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
>    then delete 50% of them in an alternating manner.
>
> 2. On a 4k block sized XFS filesystem instance, the above causes 98511
>    extents to be created in the attr fork of the inode.
>
>    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
>
> 3. The incore inode fork extent counter is a signed 32-bit
>    quantity. However, the on-disk extent counter is an unsigned 16-bit
>    quantity and hence cannot hold 98511 extents.
>
> 4. The following incorrect value is stored in the xattr extent counter,
>    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
>    core.naextents = -32561
>
> This patchset adds a new helper function
> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> per-inode data and xattr extent counters and invokes it before
> starting an fs operation (e.g. creating a new directory entry). With
> this patchset applied, XFS detects counter overflows and returns with
> an error rather than causing a silent corruption.
>
> The patchset has been tested by executing xfstests with the following
> mkfs.xfs options,
> 1. -m crc=0 -b size=1k
> 2. -m crc=0 -b size=4k
> 3. -m crc=0 -b size=512
> 4. -m rmapbt=1,reflink=1 -b size=1k
> 5. -m rmapbt=1,reflink=1 -b size=4k
>
> The patches can also be obtained from
> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
>
> I have two patches that define the newly introduced error injection
> tags in xfsprogs
> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
>
> I have also written tests
> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> for verifying the checks introduced in the kernel.
>

Hi Chandan and XFS folks,

As you may have heard, I am working on producing a series of
xfs patches for stable v5.10.y.

My patch selection is documented at [1].
I am in the process of testing the backport patches against the 5.10.y
baseline using Luis' kdevops [2] fstests runner.

The configurations that we are testing are:
1. -m rmapbt=0,reflink=1 -b size=4k (default)
2. -m crc=0 -b size=4k
3. -m crc=0 -b size=512
4. -m rmapbt=1,reflink=1 -b size=1k
5. -m rmapbt=1,reflink=1 -b size=4k

This patch set is the only largish series that I selected, because:
- It applies cleanly to 5.10.y
- I evaluated it as low risk and high value
- Chandan has written good regression tests

I intend to post the rest of the individual selected patches
for review in small batches after they pass the tests, but w.r.t this
patch set -

Does anyone object to including it in the stable kernel
after it passes the tests?

Thanks,
Amir.

[1] https://github.com/amir73il/b4/blob/xfs-5.10.y/xfs-5.10..5.17-fixes.rst
[2] https://github.com/linux-kdevops/kdevops

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-23 11:15 ` [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Amir Goldstein
@ 2022-05-23 15:50   ` Chandan Babu R
  2022-05-23 19:06     ` Amir Goldstein
  2022-05-23 22:43   ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Chandan Babu R @ 2022-05-23 15:50 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote:
> On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
>>
>> XFS does not check for possible overflow of per-inode extent counter
>> fields when adding extents to either data or attr fork.
>>
>> For e.g.
>> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
>>    then delete 50% of them in an alternating manner.
>>
>> 2. On a 4k block sized XFS filesystem instance, the above causes 98511
>>    extents to be created in the attr fork of the inode.
>>
>>    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
>>
>> 3. The incore inode fork extent counter is a signed 32-bit
>>    quantity. However, the on-disk extent counter is an unsigned 16-bit
>>    quantity and hence cannot hold 98511 extents.
>>
>> 4. The following incorrect value is stored in the xattr extent counter,
>>    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
>>    core.naextents = -32561
>>
>> This patchset adds a new helper function
>> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
>> per-inode data and xattr extent counters and invokes it before
>> starting an fs operation (e.g. creating a new directory entry). With
>> this patchset applied, XFS detects counter overflows and returns with
>> an error rather than causing a silent corruption.
>>
>> The patchset has been tested by executing xfstests with the following
>> mkfs.xfs options,
>> 1. -m crc=0 -b size=1k
>> 2. -m crc=0 -b size=4k
>> 3. -m crc=0 -b size=512
>> 4. -m rmapbt=1,reflink=1 -b size=1k
>> 5. -m rmapbt=1,reflink=1 -b size=4k
>>
>> The patches can also be obtained from
>> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
>>
>> I have two patches that define the newly introduced error injection
>> tags in xfsprogs
>> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
>>
>> I have also written tests
>> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
>> for verifying the checks introduced in the kernel.
>>
>
> Hi Chandan and XFS folks,
>
> As you may have heard, I am working on producing a series of
> xfs patches for stable v5.10.y.
>
> My patch selection is documented at [1].
> I am in the process of testing the backport patches against the 5.10.y
> baseline using Luis' kdevops [2] fstests runner.
>
> The configurations that we are testing are:
> 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> 2. -m crc=0 -b size=4k
> 3. -m crc=0 -b size=512
> 4. -m rmapbt=1,reflink=1 -b size=1k
> 5. -m rmapbt=1,reflink=1 -b size=4k
>
> This patch set is the only largish series that I selected, because:
> - It applies cleanly to 5.10.y
> - I evaluated it as low risk and high value
> - Chandan has written good regression tests
>
> I intend to post the rest of the individual selected patches
> for review in small batches after they pass the tests, but w.r.t this
> patch set -
>
> Does anyone object to including it in the stable kernel
> after it passes the tests?
>

Hi Amir,

The following three commits will have to be skipped from the series,

1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a
   xfs: Check for extent overflow when renaming dir entries

2. 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd
   xfs: Check for extent overflow when removing dir entries

3. f5d92749191402c50e32ac83dd9da3b910f5680f
   xfs: Check for extent overflow when adding dir entries

The maximum size of a directory data fork is ~96GiB. This is much smaller than
what can be accommodated by the existing data fork extent counter (i.e. 2^31
extents).
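
A rough worst-case estimate (assuming the smallest supported block size
of 512 bytes and every directory block being mapped by its own
single-block extent):

	96 GiB / 512 B = (96 * 2^30) / 2^9 = 201,326,592 extents
	201,326,592  <<  2^31 - 1 = 2,147,483,647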

Also the corresponding test (i.e. xfs/533) has been removed from
fstests. Please refer to
https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=9ae10c882550c48868e7c0baff889bb1a7c7c8e9

-- 
chandan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-23 15:50   ` Chandan Babu R
@ 2022-05-23 19:06     ` Amir Goldstein
  2022-05-25  5:49       ` Amir Goldstein
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2022-05-23 19:06 UTC (permalink / raw)
  To: Chandan Babu R
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Mon, May 23, 2022 at 7:17 PM Chandan Babu R <chandan.babu@oracle.com> wrote:
>
> On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote:
> > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> >>
> >> XFS does not check for possible overflow of per-inode extent counter
> >> fields when adding extents to either data or attr fork.
> >>
> >> For e.g.
> >> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> >>    then delete 50% of them in an alternating manner.
> >>
> >> 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> >>    extents to be created in the attr fork of the inode.
> >>
> >>    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> >>
> >> 3. The incore inode fork extent counter is a signed 32-bit
> >>    quantity. However, the on-disk extent counter is an unsigned 16-bit
> >>    quantity and hence cannot hold 98511 extents.
> >>
> >> 4. The following incorrect value is stored in the xattr extent counter,
> >>    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> >>    core.naextents = -32561
> >>
> >> This patchset adds a new helper function
> >> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> >> per-inode data and xattr extent counters and invokes it before
> >> starting an fs operation (e.g. creating a new directory entry). With
> >> this patchset applied, XFS detects counter overflows and returns with
> >> an error rather than causing a silent corruption.
> >>
> >> The patchset has been tested by executing xfstests with the following
> >> mkfs.xfs options,
> >> 1. -m crc=0 -b size=1k
> >> 2. -m crc=0 -b size=4k
> >> 3. -m crc=0 -b size=512
> >> 4. -m rmapbt=1,reflink=1 -b size=1k
> >> 5. -m rmapbt=1,reflink=1 -b size=4k
> >>
> >> The patches can also be obtained from
> >> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> >>
> >> I have two patches that define the newly introduced error injection
> >> tags in xfsprogs
> >> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> >>
> >> I have also written tests
> >> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> >> for verifying the checks introduced in the kernel.
> >>
> >
> > Hi Chandan and XFS folks,
> >
> > As you may have heard, I am working on producing a series of
> > xfs patches for stable v5.10.y.
> >
> > My patch selection is documented at [1].
> > I am in the process of testing the backport patches against the 5.10.y
> > baseline using Luis' kdevops [2] fstests runner.
> >
> > The configurations that we are testing are:
> > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > 2. -m crc=0 -b size=4k
> > 3. -m crc=0 -b size=512
> > 4. -m rmapbt=1,reflink=1 -b size=1k
> > 5. -m rmapbt=1,reflink=1 -b size=4k
> >
> > This patch set is the only largish series that I selected, because:
> > - It applies cleanly to 5.10.y
> > - I evaluated it as low risk and high value
> > - Chandan has written good regression tests
> >
> > I intend to post the rest of the individual selected patches
> > for review in small batches after they pass the tests, but w.r.t this
> > patch set -
> >
> > Does anyone object to including it in the stable kernel
> > after it passes the tests?
> >
>
> Hi Amir,
>
> The following three commits will have to be skipped from the series,
>
> 1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a
>    xfs: Check for extent overflow when renaming dir entries
>
> 2. 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd
>    xfs: Check for extent overflow when removing dir entries
>
> 3. f5d92749191402c50e32ac83dd9da3b910f5680f
>    xfs: Check for extent overflow when adding dir entries
>
> The maximum size of a directory data fork is ~96GiB. This is much smaller than
> what can be accommodated by the existing data fork extent counter (i.e. 2^31
> extents).
>

Thanks for this information!

I understand that the "fixes" are not needed, but the motto of the stable
tree maintainers is that taking harmless patches is preferred over
non-clean backports, and without those patches the rest of the series
does not apply cleanly.

So the question is: does it hurt to take those patches to the stable tree?

> Also the corresponding test (i.e. xfs/533) has been removed from
> fstests. Please refer to
> https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=9ae10c882550c48868e7c0baff889bb1a7c7c8e9
>

Well, the test does not fail, so it doesn't hurt either. Right?
In my test env, we will occasionally pull the latest fstests and then
the unneeded test will be removed.

Does that sound right?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-23 11:15 ` [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Amir Goldstein
  2022-05-23 15:50   ` Chandan Babu R
@ 2022-05-23 22:43   ` Dave Chinner
  2022-05-24  5:36     ` Amir Goldstein
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2022-05-23 22:43 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> >
> > XFS does not check for possible overflow of per-inode extent counter
> > fields when adding extents to either data or attr fork.
> >
> > For e.g.
> > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> >    then delete 50% of them in an alternating manner.
> >
> > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> >    extents to be created in the attr fork of the inode.
> >
> >    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> >
> > 3. The incore inode fork extent counter is a signed 32-bit
> >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> >    quantity and hence cannot hold 98511 extents.
> >
> > 4. The following incorrect value is stored in the xattr extent counter,
> >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> >    core.naextents = -32561
> >
> > This patchset adds a new helper function
> > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > per-inode data and xattr extent counters and invokes it before
> > starting an fs operation (e.g. creating a new directory entry). With
> > this patchset applied, XFS detects counter overflows and returns with
> > an error rather than causing a silent corruption.
> >
> > The patchset has been tested by executing xfstests with the following
> > mkfs.xfs options,
> > 1. -m crc=0 -b size=1k
> > 2. -m crc=0 -b size=4k
> > 3. -m crc=0 -b size=512
> > 4. -m rmapbt=1,reflink=1 -b size=1k
> > 5. -m rmapbt=1,reflink=1 -b size=4k
> >
> > The patches can also be obtained from
> > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> >
> > I have two patches that define the newly introduced error injection
> > tags in xfsprogs
> > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> >
> > I have also written tests
> > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > for verifying the checks introduced in the kernel.
> >
> 
> Hi Chandan and XFS folks,
> 
> As you may have heard, I am working on producing a series of
> xfs patches for stable v5.10.y.
> 
> My patch selection is documented at [1].
> I am in the process of testing the backport patches against the 5.10.y
> baseline using Luis' kdevops [2] fstests runner.
> 
> The configurations that we are testing are:
> 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> 2. -m crc=0 -b size=4k
> 3. -m crc=0 -b size=512
> 4. -m rmapbt=1,reflink=1 -b size=1k
> 5. -m rmapbt=1,reflink=1 -b size=4k
> 
> This patch set is the only largish series that I selected, because:
> - It applies cleanly to 5.10.y
> - I evaluated it as low risk and high value

What value does it provide LTS users?

This series adds almost no value to normal users - extent count
overflows are just something that doesn't happen in production
systems at this point in time. The largest data extent count I've
ever seen is still an order of magnitude of extents away from
overflowing (i.e. 400 million extents seen, 4 billion to overflow),
and nobody is using the attribute fork sufficiently hard to overflow
65536 extents (typically a couple of million xattrs per inode).

i.e. this series is groundwork for upcoming internal filesystem
functionality that requires much larger attribute forks (parent
pointers and fsverity merkle tree storage) to be supported, and
allow scope for much larger, massively fragmented VM image files
(beyond 16TB on 4kB block size fs for worst case
fragmentation/reflink). 

As a standalone patchset, this provides almost no real benefit to
users but adds a whole new set of "hard stop" error paths across
every operation that does inode data/attr extent allocation. i.e.
the scope of affected functionality is very wide, the benefit
to users is pretty much zero.

Hence I'm left wondering what criteria ranks this as a high value
change...

> - Chandan has written good regression tests
>
> I intend to post the rest of the individual selected patches
> for review in small batches after they pass the tests, but w.r.t this
> patch set -
> 
> Does anyone object to including it in the stable kernel
> after it passes the tests?

I prefer that the process doesn't result in taking random unnecessary
functionality into stable kernels. The part of the LTS process that
I've most disagreed with is the "backport random unnecessary
changes" part of the stable selection criteria. It doesn't matter if
it's selected by a bot or a human, the problems that causes are the
same.

Hence on those grounds, I'd say this isn't a stable backport
candidate at all...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-23 22:43   ` Dave Chinner
@ 2022-05-24  5:36     ` Amir Goldstein
  2022-05-24 16:05       ` Amir Goldstein
  2022-05-25  7:33       ` Dave Chinner
  0 siblings, 2 replies; 31+ messages in thread
From: Amir Goldstein @ 2022-05-24  5:36 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > >
> > > XFS does not check for possible overflow of per-inode extent counter
> > > fields when adding extents to either data or attr fork.
> > >
> > > For e.g.
> > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > >    then delete 50% of them in an alternating manner.
> > >
> > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > >    extents to be created in the attr fork of the inode.
> > >
> > >    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > >
> > > 3. The incore inode fork extent counter is a signed 32-bit
> > >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > >    quantity and hence cannot hold 98511 extents.
> > >
> > > 4. The following incorrect value is stored in the xattr extent counter,
> > >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > >    core.naextents = -32561
> > >
> > > This patchset adds a new helper function
> > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > per-inode data and xattr extent counters and invokes it before
> > > starting an fs operation (e.g. creating a new directory entry). With
> > > this patchset applied, XFS detects counter overflows and returns with
> > > an error rather than causing a silent corruption.
> > >
> > > The patchset has been tested by executing xfstests with the following
> > > mkfs.xfs options,
> > > 1. -m crc=0 -b size=1k
> > > 2. -m crc=0 -b size=4k
> > > 3. -m crc=0 -b size=512
> > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > >
> > > The patches can also be obtained from
> > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > >
> > > I have two patches that define the newly introduced error injection
> > > tags in xfsprogs
> > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > >
> > > I have also written tests
> > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > for verifying the checks introduced in the kernel.
> > >
> >
> > Hi Chandan and XFS folks,
> >
> > As you may have heard, I am working on producing a series of
> > xfs patches for stable v5.10.y.
> >
> > My patch selection is documented at [1].
> > I am in the process of testing the backport patches against the 5.10.y
> > baseline using Luis' kdevops [2] fstests runner.
> >
> > The configurations that we are testing are:
> > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > 2. -m crc=0 -b size=4k
> > 3. -m crc=0 -b size=512
> > 4. -m rmapbt=1,reflink=1 -b size=1k
> > 5. -m rmapbt=1,reflink=1 -b size=4k
> >
> > This patch set is the only largish series that I selected, because:
> > - It applies cleanly to 5.10.y
> > - I evaluated it as low risk and high value
>
> What value does it provide LTS users?
>

Cloud providers deploy a large number of VMs/containers
and they may use reflink. So I think this could be an issue.

> This series adds almost no value to normal users - extent count
> overflows are just something that doesn't happen in production
> systems at this point in time. The largest data extent count I've
> ever seen is still an order of magnitude of extents away from
> overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> and nobody is using the attribute fork sufficiently hard to overflow
> 65536 extents (typically a couple of million xattrs per inode).
>
> i.e. this series is ground work for upcoming internal filesystem
> functionality that require much larger attribute forks (parent
> pointers and fsverity merkle tree storage) to be supported, and
> allow scope for much larger, massively fragmented VM image files
> (beyond 16TB on 4kB block size fs for worst case
> fragmentation/reflink).

I am not sure I follow this argument.
Users can create large attributes, can they not?
And users can create massively fragmented/reflinked images, can they not?
If we have learned anything, it is that if users can do something (i.e. on
stable), users will do it, so it may still be worth protecting this workflow?

I argue that the reason you have not yet seen those constructs in the
wild is the time it takes until users format new xfs filesystems with a
mkfs that defaults to reflink enabled, and then use the latest userspace
tools that have started to do copy_file_range() or clone on their
filesystem, perhaps even without the user's knowledge, such as samba [1].

[1] https://gitlab.com/samba-team/samba/-/merge_requests/2044

>
> As a standalone patchset, this provides almost no real benefit to
> users but adds a whole new set of "hard stop" error paths across
> every operation that does inode data/attr extent allocation. i.e.
> the scope of affected functionality is very wide, the benefit
> to users is pretty much zero.
>
> Hence I'm left wondering what criteria ranks this as a high value
> change...
>

Given your inputs, I am not sure that the fix has high value, but I must
say I didn't fully understand your argument.
It sounded like
"We don't need the fix because we did not see the problem yet",
but I may have misunderstood you.

I am sure that you are aware of the fact that even though 5.10 is
almost 2 years old, it has only recently been deployed by some distros.

For example, Amazon AMI [2] and Google Cloud COS [3] images based
on the "new" 5.10 kernel were only released about half a year ago.

[2] https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-linux-2-ami-kernel-5-10/
[3] https://cloud.google.com/container-optimized-os/docs/release-notes/m93#cos-93-16623-39-6

I have not analysed the distro situation w.r.t xfsprogs, but here the
important factor is which version of xfsprogs was used to format the
user's filesystem, not which xfsprogs is installed on their system now.

> > - Chandan has written good regression tests
> >
> > I intend to post the rest of the individual selected patches
> > for review in small batches after they pass the tests, but w.r.t this
> > patch set -
> >
> > Does anyone object to including it in the stable kernel
> > after it passes the tests?
>
> I prefer that the process doesn't result in taking random unnecessary
> functionality into stable kernels. The part of the LTS process that
> I've most disagreed with is the "backport random unnecessary
> changes" part of the stable selection criteria. It doesn't matter if
> it's selected by a bot or a human, the problems that causes are the
> same.

I am in agreement with you.

If you actually look at my selections [4],
I think that you will find that they are very far from "random".
I have tried to make it VERY easy to review my selections by
listing the links to lore instead of the commit ids, and my selection
process is also documented in the git log.

TBH, *this* series was the one that I was most in doubt about,
which is one of the reasons I posted it first to the list.
I was pretty confident about my risk estimation, but not so much
about the value.

Also, I consider my post to this mailing list (without CC stable)
part of the process, and the inputs I got from you and from Chandan
are exactly what is missing in the regular stable tree process IMO, so
I appreciate your inputs very much.

>
> Hence on those grounds, I'd say this isn't a stable backport
> candidate at all...
>

If my arguments did not convince you, out goes this series!

I shall be posting more patches for consideration in the coming
weeks. I would appreciate your inputs on those as well.

You guys are welcome to review my selection [4] already.

Thanks!
Amir.

[4] https://github.com/amir73il/b4/blob/xfs-5.10.y/xfs-5.10..5.17-fixes.rst

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-24  5:36     ` Amir Goldstein
@ 2022-05-24 16:05       ` Amir Goldstein
  2022-05-25  8:21         ` Dave Chinner
  2022-05-25  7:33       ` Dave Chinner
  1 sibling, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2022-05-24 16:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > > >
> > > > XFS does not check for possible overflow of per-inode extent counter
> > > > fields when adding extents to either data or attr fork.
> > > >
> > > > For e.g.
> > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > > >    then delete 50% of them in an alternating manner.
> > > >
> > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > > >    extents to be created in the attr fork of the inode.
> > > >
> > > >    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > > >
> > > > 3. The incore inode fork extent counter is a signed 32-bit
> > > >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > > >    quantity and hence cannot hold 98511 extents.
> > > >
> > > > 4. The following incorrect value is stored in the xattr extent counter,
> > > >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > > >    core.naextents = -32561
> > > >
> > > > This patchset adds a new helper function
> > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > > per-inode data and xattr extent counters and invokes it before
> > > > starting an fs operation (e.g. creating a new directory entry). With
> > > > this patchset applied, XFS detects counter overflows and returns with
> > > > an error rather than causing a silent corruption.
> > > >
> > > > The patchset has been tested by executing xfstests with the following
> > > > mkfs.xfs options,
> > > > 1. -m crc=0 -b size=1k
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > The patches can also be obtained from
> > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > >
> > > > I have two patches that define the newly introduced error injection
> > > > tags in xfsprogs
> > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > > >
> > > > I have also written tests
> > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > for verifying the checks introduced in the kernel.
> > > >
> > >
> > > Hi Chandan and XFS folks,
> > >
> > > As you may have heard, I am working on producing a series of
> > > xfs patches for stable v5.10.y.
> > >
> > > My patch selection is documented at [1].
> > > I am in the process of testing the backport patches against the 5.10.y
> > > baseline using Luis' kdevops [2] fstests runner.
> > >
> > > The configurations that we are testing are:
> > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > 2. -m crc=0 -b size=4k
> > > 3. -m crc=0 -b size=512
> > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > >
> > > This patch set is the only largish series that I selected, because:
> > > - It applies cleanly to 5.10.y
> > > - I evaluated it as low risk and high value
> >
> > What value does it provide LTS users?
> >
>
> Cloud providers deploy a large number of VMs/containers
> and they may use reflink. So I think this could be an issue.
>
> > This series adds almost no value to normal users - extent count
> > overflows are just something that doesn't happen in production
> > systems at this point in time. The largest data extent count I've
> > ever seen is still an order of magnitude of extents away from
> > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > and nobody is using the attribute fork sufficiently hard to overflow
> > 65536 extents (typically a couple of million xattrs per inode).
> >
> > i.e. this series is ground work for upcoming internal filesystem
> > functionality that require much larger attribute forks (parent
> > pointers and fsverity merkle tree storage) to be supported, and
> > allow scope for much larger, massively fragmented VM image files
> > (beyond 16TB on 4kB block size fs for worst case
> > fragmentation/reflink).
>
> I am not sure I follow this argument.
> Users can create large attributes, can they not?
> And users can create massive fragmented/reflinked images, can they not?
> If we have learned anything, is that if users can do something (i.e. on stable),
> users will do that, so it may still be worth protecting this workflow?
>
> I argue that the reason that you did not see those constructs in the wild yet,
> is the time it takes until users format new xfs filesystems with mkfs
> that defaults
> to reflink enabled and then use latest userspace tools that started to do
> copy_file_range() or clone on their filesystem, perhaps even without the
> user's knowledge, such as samba [1].
>
> [1] https://gitlab.com/samba-team/samba/-/merge_requests/2044
>
> >
> > As a standalone patchset, this provides almost no real benefit to
> > users but adds a whole new set of "hard stop" error paths across
> > every operation that does inode data/attr extent allocation. i.e.
> > the scope of affected functionality is very wide, the benefit
> > to users is pretty much zero.
> >
> > Hence I'm left wondering what criteria ranks this as a high value
> > change...
> >
>
> Given your inputs, I am not sure that the fix has high value, but I must
> say I didn't fully understand your argument.
> It sounded like
> "We don't need the fix because we did not see the problem yet",
> but I may have misunderstood you.
>
> I am sure that you are aware of the fact that even though 5.10 is
> almost 2 y/o, it has only been deployed recently by some distros.
>
> For example, Amazon AMI [2] and Google Cloud COS [3] images based
> on the "new" 5.10 kernel were only released about half a year ago.
>
> [2] https://aws.amazon.com/about-aws/whats-new/2021/11/amazon-linux-2-ami-kernel-5-10/
> [3] https://cloud.google.com/container-optimized-os/docs/release-notes/m93#cos-93-16623-39-6
>
> I have not analysed the distro situation w.r.t xfsprogs, but here the
> important factor is which version of xfsprogs was used to format the
> user's filesystem, not which xfsprogs is installed on their system now.
>
> > > - Chandan has written good regression tests
> > >
> > > I intend to post the rest of the individual selected patches
> > > for review in small batches after they pass the tests, but w.r.t this
> > > patch set -
> > >
> > > Does anyone object to including it in the stable kernel
> > > after it passes the tests?
> >
> > I prefer that the process doesn't result in taking random unnecessary
> > functionality into stable kernels. The part of the LTS process that
> > I've most disagreed with is the "backport random unnecessary
> > changes" part of the stable selection criteria. It doesn't matter if
> > it's selected by a bot or a human, the problems that causes are the
> > same.
>
> I am in agreement with you.
>
> If you actually look at my selections [4]
> I think that you will find that they are very far from "random".
> I have tried to make it VERY easy to review my selections, by
> listing the links to lore instead of the commit ids, and my selection
> process is also documented in the git log.
>
> TBH, *this* series was the one that I was mostly in doubt about,
> which is one of the reasons I posted it first to the list.
> I was pretty confident about my risk estimation, but not so much
> about the value.
>
> Also, I am considering my post in this mailing list (without CC stable)
> part of the process, and the inputs I got from you and from Chandan
> are exactly what is missing in the regular stable tree process IMO, so
> I appreciate your inputs very much.
>
> >
> > Hence on those grounds, I'd say this isn't a stable backport
> > candidate at all...
> >
>

Allow me to rephrase that using a less hypothetical use case.

Our team is working on an out-of-band dedupe tool, much like
https://markfasheh.github.io/duperemove/duperemove.html
but for larger-scale filesystems, and the testing focus is on xfs.

In certain settings, such as containers, the tool does not control the
running kernel and *if* we require a new kernel, the newest we can
require in this setting is 5.10.y.

How would the tool know that it can safely create millions of dups
that may get fragmented?
One cannot expect a user space tool to check which kernel
it is running on; even asking which filesystem it is running on would
be an irregular pattern.

The tool just checks for clone/dedupe support in the underlying filesystem.
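
For concreteness, a minimal sketch of such a probe (the scratch file
names and the exact error policy are illustrative only, not our actual
tool code) could look like this:

#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>		/* FICLONE */

/*
 * Return 1 if the filesystem at 'dir' supports clone, 0 if it clearly
 * does not, -1 if the probe was inconclusive.
 */
static int fs_supports_clone(const char *dir)
{
	char src_path[PATH_MAX], dst_path[PATH_MAX];
	int src, dst, ret = -1;

	snprintf(src_path, sizeof(src_path), "%s/.clone_probe_src", dir);
	snprintf(dst_path, sizeof(dst_path), "%s/.clone_probe_dst", dir);

	src = open(src_path, O_CREAT | O_RDWR, 0600);
	dst = open(dst_path, O_CREAT | O_RDWR, 0600);
	if (src < 0 || dst < 0)
		goto out;

	/* give the clone some data to share */
	if (pwrite(src, "x", 1, 0) != 1)
		goto out;

	if (ioctl(dst, FICLONE, src) == 0)
		ret = 1;
	else if (errno == EOPNOTSUPP || errno == ENOTTY || errno == EXDEV)
		ret = 0;
out:
	if (src >= 0)
		close(src);
	if (dst >= 0)
		close(dst);
	unlink(src_path);
	unlink(dst_path);
	return ret;
}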

The way I see it, backporting these changes to LTS kernel is the
only way to move forward, unless you can tell me, and I did not
understand that from your response, why our tool is safe to use
on 5.10.y and why fragmentation cannot lead to hitting the maximum
extent count limit in kernel 5.10.y.

So with that information in mind, I have to ask again:

Does anyone *object* to including this series in the stable kernel
after it passes the tests?

Chandan and all,

Do you consider it *harmful* to apply the 3 commits about directory
extents that Chandan listed as "unneeded"?

Please do not regard this as a philosophical question.
Is there an actual known bug/regression from applying those 3 patches
to the 5.10.y kernel?

Because my fstests loop has now run the recommended xfs
configs more than 30 times and has not detected any regression
from the baseline LTS kernel so far.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-23 19:06     ` Amir Goldstein
@ 2022-05-25  5:49       ` Amir Goldstein
  0 siblings, 0 replies; 31+ messages in thread
From: Amir Goldstein @ 2022-05-25  5:49 UTC (permalink / raw)
  To: Chandan Babu R
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Mon, May 23, 2022 at 10:06 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Mon, May 23, 2022 at 7:17 PM Chandan Babu R <chandan.babu@oracle.com> wrote:
> >
> > On Mon, May 23, 2022 at 02:15:44 PM +0300, Amir Goldstein wrote:
> > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > >>
> > >> XFS does not check for possible overflow of per-inode extent counter
> > >> fields when adding extents to either data or attr fork.
> > >>
> > >> For e.g.
> > >> 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > >>    then delete 50% of them in an alternating manner.
> > >>
> > >> 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > >>    extents to be created in the attr fork of the inode.
> > >>
> > >>    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > >>
> > >> 3. The incore inode fork extent counter is a signed 32-bit
> > >>    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > >>    quantity and hence cannot hold 98511 extents.
> > >>
> > >> 4. The following incorrect value is stored in the xattr extent counter,
> > >>    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > >>    core.naextents = -32561
> > >>
> > >> This patchset adds a new helper function
> > >> (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > >> per-inode data and xattr extent counters and invokes it before
> > >> starting an fs operation (e.g. creating a new directory entry). With
> > >> this patchset applied, XFS detects counter overflows and returns with
> > >> an error rather than causing a silent corruption.
> > >>
> > >> The patchset has been tested by executing xfstests with the following
> > >> mkfs.xfs options,
> > >> 1. -m crc=0 -b size=1k
> > >> 2. -m crc=0 -b size=4k
> > >> 3. -m crc=0 -b size=512
> > >> 4. -m rmapbt=1,reflink=1 -b size=1k
> > >> 5. -m rmapbt=1,reflink=1 -b size=4k
> > >>
> > >> The patches can also be obtained from
> > >> https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > >>
> > >> I have two patches that define the newly introduced error injection
> > >> tags in xfsprogs
> > >> (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > >>
> > >> I have also written tests
> > >> (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > >> for verifying the checks introduced in the kernel.
> > >>
> > >
> > > Hi Chandan and XFS folks,
> > >
> > > As you may have heard, I am working on producing a series of
> > > xfs patches for stable v5.10.y.
> > >
> > > My patch selection is documented at [1].
> > > I am in the process of testing the backport patches against the 5.10.y
> > > baseline using Luis' kdevops [2] fstests runner.
> > >
> > > The configurations that we are testing are:
> > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > 2. -m crc=0 -b size=4k
> > > 3. -m crc=0 -b size=512
> > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > >
> > > This patch set is the only largish series that I selected, because:
> > > - It applies cleanly to 5.10.y
> > > - I evaluated it as low risk and high value
> > > - Chandan has written good regression tests
> > >
> > > I intend to post the rest of the individual selected patches
> > > for review in small batches after they pass the tests, but w.r.t this
> > > patch set -
> > >
> > > Does anyone object to including it in the stable kernel
> > > after it passes the tests?
> > >
> >
> > Hi Amir,
> >
> > The following three commits will have to be skipped from the series,
> >
> > 1. 02092a2f034fdeabab524ae39c2de86ba9ffa15a
> >    xfs: Check for extent overflow when renaming dir entries
> >
> > 2. 0dbc5cb1a91cc8c44b1c75429f5b9351837114fd
> >    xfs: Check for extent overflow when removing dir entries
> >
> > 3. f5d92749191402c50e32ac83dd9da3b910f5680f
> >    xfs: Check for extent overflow when adding dir entries
> >
> > The maximum size of a directory data fork is ~96GiB. This is much smaller than
> > what can be accommodated by the existing data fork extent counter (i.e. 2^31
> > extents).
> >
>
> Thanks for this information!
>
> I understand that the "fixes" are not needed, but the motto of the stable
> tree maintainers is that taking harmless patches is preferred over non
> clean backports and without those patches, the rest of the series does
> not apply cleanly.
>
> So the question is: does it hurt to take those patches to the stable tree?

All right, I've found the partial revert patch in for-next:
83a21c18441f xfs: Directory's data fork extent counter can never overflow

I can backport this patch to stable after it hits mainline (since this is not
an urgent fix, I would wait for v5.19.0) with the obvious omission of the
XFS_MAX_EXTCNT_*_FORK_LARGE constants.

But even then, unless we have a clear revert in mainline, it is better to
have the history in stable as it was in mainline.

Furthermore, stable, even more than mainline, should always prefer safety
over performance optimization, so sending the 3 patches already in mainline
to stable without the partial revert is better than sending no patches at all
and better than delaying the process.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-24  5:36     ` Amir Goldstein
  2022-05-24 16:05       ` Amir Goldstein
@ 2022-05-25  7:33       ` Dave Chinner
  2022-05-25  7:48         ` Amir Goldstein
  1 sibling, 1 reply; 31+ messages in thread
From: Dave Chinner @ 2022-05-25  7:33 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > > >
> > > > XFS does not check for possible overflow of per-inode extent counter
> > > > fields when adding extents to either data or attr fork.
> > > >
> > > > For e.g.
> > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > > >    then delete 50% of them in an alternating manner.
> > > >
> > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > > >    extents to be created in the attr fork of the inode.
> > > >
> > > >    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > > >
> > > > 3. The incore inode fork extent counter is a signed 32-bit
> > > >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > > >    quantity and hence cannot hold 98511 extents.
> > > >
> > > > 4. The following incorrect value is stored in the xattr extent counter,
> > > >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > > >    core.naextents = -32561
> > > >
> > > > This patchset adds a new helper function
> > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > > per-inode data and xattr extent counters and invokes it before
> > > > starting an fs operation (e.g. creating a new directory entry). With
> > > > this patchset applied, XFS detects counter overflows and returns with
> > > > an error rather than causing a silent corruption.
> > > >
> > > > The patchset has been tested by executing xfstests with the following
> > > > mkfs.xfs options,
> > > > 1. -m crc=0 -b size=1k
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > The patches can also be obtained from
> > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > >
> > > > I have two patches that define the newly introduced error injection
> > > > tags in xfsprogs
> > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > > >
> > > > I have also written tests
> > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > for verifying the checks introduced in the kernel.
> > > >
> > >
> > > Hi Chandan and XFS folks,
> > >
> > > As you may have heard, I am working on producing a series of
> > > xfs patches for stable v5.10.y.
> > >
> > > My patch selection is documented at [1].
> > > I am in the process of testing the backport patches against the 5.10.y
> > > baseline using Luis' kdevops [2] fstests runner.
> > >
> > > The configurations that we are testing are:
> > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > 2. -m crc=0 -b size=4k
> > > 3. -m crc=0 -b size=512
> > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > >
> > > This patch set is the only largish series that I selected, because:
> > > - It applies cleanly to 5.10.y
> > > - I evaluated it as low risk and high value
> >
> > What value does it provide LTS users?
> >
> 
> Cloud providers deploy a large number of VMs/containers
> and they may use reflink. So I think this could be an issue.

Cloud providers are not deploying multi-TB VM images on XFS without
also using some mechanism for avoiding worst-case fragmentation.
They know all about the problems that manifest when extent
counts get into the tens of millions, let alone billions....

e.g. first access to a file pulls the entire extent list into
memory, so for a file with 4 billion extents this will take hours to
pull into memory (single threaded, synchronous read IO of millions
of filesystem blocks) and consume >100GB of RAM for the
in-memory extent list. Having VM startup get delayed by hours and
put a massive load on the cloud storage infrastructure for that
entire length of time isn't desirable behaviour...
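
(A back-of-the-envelope sanity check, under my own assumptions: on-disk
bmbt records are 16 bytes, so ~4 billion extents is roughly 16 million
4k btree blocks to read synchronously on first access, and if the
in-memory extent tree costs somewhere around 25-30 bytes per record
once node overhead is included, that is on the order of 100GB of
unreclaimable RAM, which lines up with the figures above.)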

For multi-TB VM image deployment - especially with reflink on the
image file - extent size hints are needed to mitigate worst case
fragmentation.  Reflink copies can run at up to about 100,000
extents/s, so if you reflink a file with 4 billion extents in it,
not only do you need another 100GB RAM, you also need to wait
several hours for the reflink to run. And while that reflink is
running, nothing else has access to the data in that VM image: your VM
is *down* for *hours* while you snapshot it.

Typical mitigation is extent size hints in the MB ranges to reduce
worst case fragmentation by two orders of magnitude (i.e. limit to
tens of millions of extents, not billions) which brings snapshot
times down to a minute or two. 
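
For completeness, a minimal sketch of applying such a hint per file
through the generic FS_IOC_FSSETXATTR interface (the 16MB value is
purely illustrative):

#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct fsxattr, FS_IOC_FSGETXATTR, FS_XFLAG_EXTSIZE */

/*
 * Set an extent size hint on 'fd'. XFS may reject changing the hint on
 * a file that already has data extents, so it is simplest to set it
 * right after creating the image file. 'bytes' should be a multiple of
 * the filesystem block size.
 */
static int set_extsize_hint(int fd, unsigned int bytes)
{
	struct fsxattr fsx;

	if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx) < 0)
		return -1;

	fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;
	fsx.fsx_extsize = bytes;

	return ioctl(fd, FS_IOC_FSSETXATTR, &fsx);
}

/* e.g. set_extsize_hint(fd, 16 * 1024 * 1024), or from the shell:
 * xfs_io -c 'extsize 16m' /path/to/image, before populating it. */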

IOWs, it's obviously not practical to scale VM images out to
billions of extents, even though we support extent counts in the
billions.

> > This series adds almost no value to normal users - extent count
> > overflows are just something that doesn't happen in production
> > systems at this point in time. The largest data extent count I've
> > ever seen is still an order of magnitude of extents away from
> > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > and nobody is using the attribute fork sufficiently hard to overflow
> > 65536 extents (typically a couple of million xattrs per inode).
> >
> > i.e. this series is ground work for upcoming internal filesystem
> > functionality that require much larger attribute forks (parent
> > pointers and fsverity merkle tree storage) to be supported, and
> > allow scope for much larger, massively fragmented VM image files
> > (beyond 16TB on 4kB block size fs for worst case
> > fragmentation/reflink).
> 
> I am not sure I follow this argument.
> Users can create large attributes, can they not?

Sure. But *nobody does*, and there are good reasons we don't see
people doing this.

The reality is that apps don't use xattrs heavily because
filesystems are traditionally very bad at storing even moderate
numbers of xattrs. XFS is the exception to the rule. Hence nobody is
trying to use a few million xattrs per inode right now, and it's
unlikely anyone will unless they specifically target XFS.  In which
case, they are going to want the large extent count stuff that just
got merged into the for-next tree, and this whole discussion is
moot....

> And users can create massive fragmented/reflinked images, can they not?

Yes, and they will hit scalability problems long before they get
anywhere near 4 billion extents.

> If we have learned anything, is that if users can do something (i.e. on stable),
> users will do that, so it may still be worth protecting this workflow?

If I have learned anything, it's that huge extent counts are highly
impractical for most workloads for one reason or another. We are a
long way for enabling practical use of extent counts in the
billions. Demand paging the extent list is the bare minimum we need,
but then there's sheer scale of modifications reflink and unlink
need to make (billions of transactions to share/free billions of
individual extents) and there's no magic solution to that. 

> I argue that the reason that you did not see those constructs in the wild yet,
> is the time it takes until users format new xfs filesystems with mkfs

It really has nothing to do with filesystem formats and everything
to do with the *cost* of creating, accessing, indexing and managing
billions of extents.

Have you ever tried to create a file with 4 billion extents in it?
Even using fallocate to do it as fast as possible (no data IO!), I
ran out of RAM on my 128GB test machine after 6 days of doing
nothing but running fallocate() on a single inode. The kernel died a
horrible OOM killer death at around 2.5 billion extents because the
extent list cannot be reclaimed from memory while the inode is in
use and the kernel ran out of all other memory it could reclaim as
the extent list grew.
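
If anyone wants to reproduce a (much smaller) version of that
experiment, the usual trick is to preallocate every other block so that
each fallocate() call becomes its own unwritten extent; a rough sketch,
with the extent count and block size as illustrative parameters:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	const off_t blksz = 4096;		/* assume a 4k block filesystem */
	const long long nextents = 1000000;	/* nowhere near 4 billion */
	int fd;

	if (argc < 2)
		return 1;

	fd = open(argv[1], O_CREAT | O_RDWR, 0644);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	for (long long i = 0; i < nextents; i++) {
		/* leave a one-block hole so each allocation is a separate extent */
		if (fallocate(fd, 0, i * 2 * blksz, blksz) < 0) {
			perror("fallocate");
			return 1;
		}
	}
	return 0;
}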

The only way to fix that is to make the extent lists reclaimable
(i.e. demand paging of the in-memory extent list) and that's a big
chunk of work that isn't on anyone's radar right now.

> Given your inputs, I am not sure that the fix has high value, but I must
> say I didn't fully understand your argument.
> It sounded like
> "We don't need the fix because we did not see the problem yet",
> but I may have misunderstood you.

I hope you now realise that there are much bigger practical
scalability limitations with extent lists and reflink that will
manifest in production systems long before we get anywhere near
billions of extents per inode.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-25  7:33       ` Dave Chinner
@ 2022-05-25  7:48         ` Amir Goldstein
  2022-05-25  8:38           ` Dave Chinner
  0 siblings, 1 reply; 31+ messages in thread
From: Amir Goldstein @ 2022-05-25  7:48 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Wed, May 25, 2022 at 10:33 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote:
> > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > > > >
> > > > > XFS does not check for possible overflow of per-inode extent counter
> > > > > fields when adding extents to either data or attr fork.
> > > > >
> > > > > For e.g.
> > > > > 1. Insert 5 million xattrs (each having a value size of 255 bytes) and
> > > > >    then delete 50% of them in an alternating manner.
> > > > >
> > > > > 2. On a 4k block sized XFS filesystem instance, the above causes 98511
> > > > >    extents to be created in the attr fork of the inode.
> > > > >
> > > > >    xfsaild/loop0  2008 [003]  1475.127209: probe:xfs_inode_to_disk: (ffffffffa43fb6b0) if_nextents=98511 i_ino=131
> > > > >
> > > > > 3. The incore inode fork extent counter is a signed 32-bit
> > > > >    quantity. However, the on-disk extent counter is an unsigned 16-bit
> > > > >    quantity and hence cannot hold 98511 extents.
> > > > >
> > > > > 4. The following incorrect value is stored in the xattr extent counter,
> > > > >    # xfs_db -f -c 'inode 131' -c 'print core.naextents' /dev/loop0
> > > > >    core.naextents = -32561
> > > > >
> > > > > This patchset adds a new helper function
> > > > > (i.e. xfs_iext_count_may_overflow()) to check for overflow of the
> > > > > per-inode data and xattr extent counters and invokes it before
> > > > > starting an fs operation (e.g. creating a new directory entry). With
> > > > > this patchset applied, XFS detects counter overflows and returns with
> > > > > an error rather than causing a silent corruption.
> > > > >
> > > > > The patchset has been tested by executing xfstests with the following
> > > > > mkfs.xfs options,
> > > > > 1. -m crc=0 -b size=1k
> > > > > 2. -m crc=0 -b size=4k
> > > > > 3. -m crc=0 -b size=512
> > > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > > >
> > > > > The patches can also be obtained from
> > > > > https://github.com/chandanr/linux.git at branch xfs-reserve-extent-count-v14.
> > > > >
> > > > > I have two patches that define the newly introduced error injection
> > > > > tags in xfsprogs
> > > > > (https://lore.kernel.org/linux-xfs/20201104114900.172147-1-chandanrlinux@gmail.com/).
> > > > >
> > > > > I have also written tests
> > > > > (https://github.com/chandanr/xfstests/commits/extent-overflow-tests)
> > > > > for verifying the checks introduced in the kernel.
> > > > >
> > > >
> > > > Hi Chandan and XFS folks,
> > > >
> > > > As you may have heard, I am working on producing a series of
> > > > xfs patches for stable v5.10.y.
> > > >
> > > > My patch selection is documented at [1].
> > > > I am in the process of testing the backport patches against the 5.10.y
> > > > baseline using Luis' kdevops [2] fstests runner.
> > > >
> > > > The configurations that we are testing are:
> > > > 1. -m rmapbt=0,reflink=1 -b size=4k (default)
> > > > 2. -m crc=0 -b size=4k
> > > > 3. -m crc=0 -b size=512
> > > > 4. -m rmapbt=1,reflink=1 -b size=1k
> > > > 5. -m rmapbt=1,reflink=1 -b size=4k
> > > >
> > > > This patch set is the only largish series that I selected, because:
> > > > - It applies cleanly to 5.10.y
> > > > - I evaluated it as low risk and high value
> > >
> > > What value does it provide LTS users?
> > >
> >
> > Cloud providers deploy a large number of VMs/containers
> > and they may use reflink. So I think this could be an issue.
>
> Cloud providers are not deploying multi-TB VM images on XFS without
> also using some mechanism for avoiding worst-case fragmentation.
> They know all about the problems that manifest when extent
> counts get into the tens of millions, let alone billions....
>
> e.g. first access to a file pulls the entire extent list into
> memory, so for a file with 4 billion extents this will take hours to
> pull into memory (single threaded, synchronous read IO of millions
> of filesystem blocks) and consume >100GB of RAM for the
> in-memory extent list. Having VM startup get delayed by hours and
> put a massive load on the cloud storage infrastructure for that
> entire length of time isn't desirable behaviour...
>
> For multi-TB VM image deployment - especially with reflink on the
> image file - extent size hints are needed to mitigate worst case
> fragmentation.  Reflink copies can run at up to about 100,000
> extents/s, so if you reflink a file with 4 billion extents in it,
> not only do you need another 100GB RAM, you also need to wait
> several hours for the reflink to run. And while that reflink is
> running, nothing else has access to the data in that VM image: your VM
> is *down* for *hours* while you snapshot it.
>
> Typical mitigation is extent size hints in the MB ranges to reduce
> worst case fragmentation by two orders of magnitude (i.e. limit to
> tens of millions of extents, not billions) which brings snapshot
> times down to a minute or two.
>
> IOWs, it's obviously not practical to scale VM images out to
> billions of extents, even though we support extent counts in the
> billions.
>
> > > This series adds almost no value to normal users - extent count
> > > overflows are just something that doesn't happen in production
> > > systems at this point in time. The largest data extent count I've
> > > ever seen is still an order of magnitude of extents away from
> > > overflowing (i.e. 400 million extents seen, 4 billion to overflow),
> > > and nobody is using the attribute fork sufficiently hard to overflow
> > > 65536 extents (typically a couple of million xattrs per inode).
> > >
> > > i.e. this series is ground work for upcoming internal filesystem
> > > functionality that require much larger attribute forks (parent
> > > pointers and fsverity merkle tree storage) to be supported, and
> > > allow scope for much larger, massively fragmented VM image files
> > > (beyond 16TB on 4kB block size fs for worst case
> > > fragmentation/reflink).
> >
> > I am not sure I follow this argument.
> > Users can create large attributes, can they not?
>
> Sure. But *nobody does*, and there are good reasons we don't see
> people doing this.
>
> The reality is that apps don't use xattrs heavily because
> filesystems are traditionally very bad at storing even moderate
> numbers of xattrs. XFS is the exception to the rule. Hence nobody is
> trying to use a few million xattrs per inode right now, and it's
> unlikely anyone will unless they specifically target XFS.  In which
> case, they are going to want the large extent count stuff that just
> got merged into the for-next tree, and this whole discussion is
> moot....

With all the barriers to large extent counts that you mentioned,
I wonder how the large extent counters feature mitigates them,
but that is irrelevant to the question at hand.

>
> > And users can create massive fragmented/reflinked images, can they not?
>
> Yes, and they will hit scalability problems long before they get
> anywhere near 4 billion extents.
>
> > If we have learned anything, is that if users can do something (i.e. on stable),
> > users will do that, so it may still be worth protecting this workflow?
>
> If I have learned anything, it's that huge extent counts are highly
> impractical for most workloads for one reason or another. We are a
> long way for enabling practical use of extent counts in the
> billions. Demand paging the extent list is the bare minimum we need,
> but then there's sheer scale of modifications reflink and unlink
> need to make (billions of transactions to share/free billions of
> individual extents) and there's no magic solution to that.
>
> > I argue that the reason that you did not see those constructs in the wild yet,
> > is the time it takes until users format new xfs filesystems with mkfs
>
> It really has nothing to do with filesystem formats and everything
> to do with the *cost* of creating, accessing, indexing and managing
> billions of extents.
>
> Have you ever tried to create a file with 4 billion extents in it?
> Even using fallocate to do it as fast as possible (no data IO!), I
> ran out of RAM on my 128GB test machine after 6 days of doing
> nothing but running fallocate() on a single inode. The kernel died a
> horrible OOM killer death at around 2.5 billion extents because the
> extent list cannot be reclaimed from memory while the inode is in
> use and the kernel ran out of all other memory it could reclaim as
> the extent list grew.
>
> The only way to fix that is to make the extent lists reclaimable
> (i.e. demand paging of the in-memory extent list) and that's a big
> chunk of work that isn't on anyone's radar right now.
>
> > Given your inputs, I am not sure that the fix has high value, but I must
> > say I didn't fully understand your argument.
> > It sounded like
> > "We don't need the fix because we did not see the problem yet",
> > but I may have misunderstood you.
>
> I hope you now realise that there are much bigger practical
> scalability limitations with extent lists and reflink that will
> manifest in production systems long before we get anywhere near
> billions of extents per inode.
>

I do!
And I *really* appreciate the time that you took to explain it to me
(and to everyone).

I'm dropping this series from my xfs-5.10.y queue.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-24 16:05       ` Amir Goldstein
@ 2022-05-25  8:21         ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2022-05-25  8:21 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Tue, May 24, 2022 at 07:05:07PM +0300, Amir Goldstein wrote:
> On Tue, May 24, 2022 at 8:36 AM Amir Goldstein <amir73il@gmail.com> wrote:
> 
> Allow me to rephrase that using a less hypothetical use case.
> 
> Our team is working on an out-of-band dedupe tool, much like
> https://markfasheh.github.io/duperemove/duperemove.html
> but for larger-scale filesystems, and the testing focus is on xfs.

dedupe is nothing new. It's being done in production systems and has
been for a while now. e.g. Veeam has a production server back end
for their reflink/dedupe based backup software that is hosted on
XFS.

The only scalability issues we've seen with those systems managing
tens of TB of heavily cross-linked files so far have been limited to
how long unlink of those large files takes. Dedupe/reflink speeds up
ingest for backup farms, but it slows down removal/garbage
collection of backups that are no longer needed. The big
reflink/dedupe backup farms I've seen problems with are generally
dealing with extent counts per file in the tens of millions,
which is still very manageable.

Maybe we'll see more problems as data sets grow, but it's also
likely that the crosslinked data sets the applications build will
scale out (more base files) instead of up (larger base files). This
will mean they remain at the "tens of millions of extents per file"
level and won't stress the filesystem any more than they already do.

> In certain settings, such as containers, the tool does not control the
> running kernel and *if* we require a new kernel, the newest we can
> require in this setting is 5.10.y.

*If* you have a customer that creates a billion extents in a single
file, then you could consider backporting this. But until managing
billions of extents per file is an actual issue for production
filesystems, it's unnecessary to backport these changes.

> How would the tool know that it can safely create millions of dups
> that may get fragmented?

Millions of shared extents in a single file aren't a problem at all.
Millions of references to a single shared block aren't a problem at
all, either.

But there are limits to how much you can share a single block, and
those limits are *highly variable* because they are dependent on
free space being available to record references.  e.g. XFS can
share a single block a maximum of 2^32 -1 times. If a user turns on
rmapbt, that max share limit drops way down to however many
individual rmap records can be stored in the rmap btree before the
AG runs out of space. If the AGs are small and/or full of other data,
that could limit sharing of a single block to a few hundred
references.

IOWs, applications creating shared extents must expect the operation
to fail at any time, without warning. And dedupe applications need
to be able to index multiple replicas of the same block so that they
aren't limited to deduping that data to a single block that has
arbitrary limits on how many times it can be shared.
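
To make that concrete, here is a minimal sketch of the per-call status
handling a dedupe tool needs around FIDEDUPERANGE (the return-value
policy is illustrative only):

#include <errno.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/fs.h>	/* struct file_dedupe_range, FIDEDUPERANGE */

/*
 * Try to share 'len' bytes at offset 'off' of src_fd with the same
 * range of dst_fd. Returns 0 if the whole range is now shared, a
 * negative errno if the kernel refused (e.g. -ENOSPC when the
 * reference metadata can't be added), or 1 if the data differed or was
 * only partially deduped and the caller should try another replica.
 */
static int dedupe_one(int src_fd, int dst_fd, __u64 off, __u64 len)
{
	struct file_dedupe_range *r;
	int ret;

	r = calloc(1, sizeof(*r) + sizeof(struct file_dedupe_range_info));
	if (!r)
		return -ENOMEM;

	r->src_offset = off;
	r->src_length = len;
	r->dest_count = 1;
	r->info[0].dest_fd = dst_fd;
	r->info[0].dest_offset = off;

	if (ioctl(src_fd, FIDEDUPERANGE, r) < 0)
		ret = -errno;
	else if (r->info[0].status < 0)
		ret = r->info[0].status;	/* negative errno from the kernel */
	else if (r->info[0].status == FILE_DEDUPE_RANGE_SAME &&
		 r->info[0].bytes_deduped == len)
		ret = 0;
	else
		ret = 1;	/* FILE_DEDUPE_RANGE_DIFFERS or partial dedupe */

	free(r);
	return ret;
}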

> Does anyone *object* to including this series in the stable kernel
> after it passes the tests?

If you end up having a customer that hits a billion extents in a
single file, then you can backport these patches to the 5.10.y
series. But without any obvious production need for these patches,
they don't fit the criteria for stable backports...

Don't change what ain't broke.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow
  2022-05-25  7:48         ` Amir Goldstein
@ 2022-05-25  8:38           ` Dave Chinner
  0 siblings, 0 replies; 31+ messages in thread
From: Dave Chinner @ 2022-05-25  8:38 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Chandan Babu R, linux-xfs, Darrick J . Wong, Darrick J. Wong,
	Christoph Hellwig, Allison Henderson, Luis R. Rodriguez,
	Theodore Tso

On Wed, May 25, 2022 at 10:48:09AM +0300, Amir Goldstein wrote:
> On Wed, May 25, 2022 at 10:33 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Tue, May 24, 2022 at 08:36:50AM +0300, Amir Goldstein wrote:
> > > On Tue, May 24, 2022 at 1:43 AM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Mon, May 23, 2022 at 02:15:44PM +0300, Amir Goldstein wrote:
> > > > > On Sun, Jan 10, 2021 at 6:10 PM Chandan Babu R <chandanrlinux@gmail.com> wrote:
> > >
> > > I am not sure I follow this argument.
> > > Users can create large attributes, can they not?
> >
> > Sure. But *nobody does*, and there are good reasons we don't see
> > people doing this.
> >
> > The reality is that apps don't use xattrs heavily because
> > filesystems are traditionally very bad at storing even moderate
> > numbers of xattrs. XFS is the exception to the rule. Hence nobody is
> > trying to use a few million xattrs per inode right now, and it's
> > unlikely anyone will unless they specifically target XFS.  In which
> > case, they are going to want the large extent count stuff that just
> > got merged into the for-next tree, and this whole discussion is
> > moot....
> 
> With all the barriers to large extent counts that you mentioned,
> I wonder how the large extent counters feature mitigates them,
> but that is irrelevant to the question at hand.

They don't. That's the point I'm trying to make - these patches
don't actually fix any problems with large data fork extent counts -
they just allow them to get bigger.

As I said earlier - the primary driver for these changes is not
growing the number of data extents or reflink - it's growing the
amount of data we can store in the attribute fork. We need to grow
that from 2^16 extents to 2^32 extents because we want to be able to
store hundreds of millions of xattrs per file for internal
filesystem purposes.

Extending the data fork to 2^48 extents at the same time just makes
sense from an on-disk format perspective, not because the current
code can scale effectively to 2^32 extents, but because we're
already changing all that code to support a different attr fork
extent size. We will probably need >2^32 extents in the next decade,
so we're making the change now while we are touching the code....

There are future mods planned that will make large extent counts
bearable, but we don't have any idea how to solve problems like
making reflink go from O(n) to O(log n) to make reflink of
billion-extent files an everyday occurrence....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 31+ messages in thread

end of thread, other threads:[~2022-05-25  8:38 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-01-10 16:07 [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 01/16] xfs: Add helper for checking per-inode extent count overflow Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 02/16] xfs: Check for extent overflow when trivally adding a new extent Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 03/16] xfs: Check for extent overflow when punching a hole Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 04/16] xfs: Check for extent overflow when adding dir entries Chandan Babu R
2021-01-12  1:34   ` Darrick J. Wong
2021-01-10 16:07 ` [PATCH V14 05/16] xfs: Check for extent overflow when removing " Chandan Babu R
2021-01-12  1:38   ` Darrick J. Wong
2021-01-10 16:07 ` [PATCH V14 06/16] xfs: Check for extent overflow when renaming " Chandan Babu R
2021-01-12  1:37   ` Darrick J. Wong
2021-01-10 16:07 ` [PATCH V14 07/16] xfs: Check for extent overflow when adding/removing xattrs Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 08/16] xfs: Check for extent overflow when writing to unwritten extent Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 09/16] xfs: Check for extent overflow when moving extent from cow to data fork Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 10/16] xfs: Check for extent overflow when remapping an extent Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 11/16] xfs: Check for extent overflow when swapping extents Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 12/16] xfs: Introduce error injection to reduce maximum inode fork extent count Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 13/16] xfs: Remove duplicate assert statement in xfs_bmap_btalloc() Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 14/16] xfs: Compute bmap extent alignments in a separate function Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 15/16] xfs: Process allocated extent " Chandan Babu R
2021-01-10 16:07 ` [PATCH V14 16/16] xfs: Introduce error injection to allocate only minlen size extents for files Chandan Babu R
2022-05-23 11:15 ` [PATCH V14 00/16] Bail out if transaction can cause extent count to overflow Amir Goldstein
2022-05-23 15:50   ` Chandan Babu R
2022-05-23 19:06     ` Amir Goldstein
2022-05-25  5:49       ` Amir Goldstein
2022-05-23 22:43   ` Dave Chinner
2022-05-24  5:36     ` Amir Goldstein
2022-05-24 16:05       ` Amir Goldstein
2022-05-25  8:21         ` Dave Chinner
2022-05-25  7:33       ` Dave Chinner
2022-05-25  7:48         ` Amir Goldstein
2022-05-25  8:38           ` Dave Chinner
