linux-fsdevel.vger.kernel.org archive mirror
* [PATCH RFC 00/18] xfs: atomic file updates
@ 2020-04-29  2:44 Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 01/18] xfs: clean up the error handling in xfs_swap_extent_rmap Darrick J. Wong
                   ` (18 more replies)
  0 siblings, 19 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

Hi all,

This series creates a new log incompat feature and log intent items to
track high level progress of swapping ranges of two files and finish
interrupted work if the system goes down.  It then adds a new
FISWAPRANGE ioctl so that userspace can access the atomic extent
swapping feature.  With this feature, user programs will be able to
update files atomically by opening an O_TMPFILE, reflinking the source
file to it, making whatever updates they want to make, and then
atomically swapping the changed bits back to the source file.  It even
has an optional ability to detect a changed source file and reject the
update.

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
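
For reference, the traditional sequence described above can be sketched
roughly as follows.  This is only an illustration and not taken from the
series: replace_file_atomically() and write_all() are hypothetical
helpers, the temporary name is simply the target name with ".tmp"
appended, and error handling is kept to a minimum.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes.  Returns 0 or -1. */
int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace dir/name with len bytes from buf using the
 * write-new-file, fsync, rename, fsync-directory sequence.
 * Returns 0 on success or -1 on failure.
 */
int replace_file_atomically(const char *dir, const char *name,
			    const char *buf, size_t len)
{
	char path[4096], tmp[4096];
	int fd, dirfd, n, ret;

	assert(dir && name && buf);

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	n = snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	/* 1. Write the new contents to a new file in the same directory. */
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* 2. fsync the new file so its data is durable before the rename. */
	if (write_all(fd, buf, len) || fsync(fd)) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	if (close(fd)) {
		unlink(tmp);
		return -1;
	}

	/* 3. Rename the new file on top of the old filename. */
	if (rename(tmp, path)) {
		unlink(tmp);
		return -1;
	}

	/* 4. fsync the directory so the rename itself is durable. */
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;
	ret = fsync(dirfd);
	close(dirfd);
	return ret;
}
```

Even this short version has several places to slip up: skipping either
fsync, or creating the temporary file on a different filesystem, quietly
breaks the guarantee.  A real implementation would also want a unique
temporary name rather than a fixed ".tmp" suffix.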

With atomic file updates, this is no longer necessary.  Programmers
create an O_TMPFILE, optionally FICLONE the file contents into the
temporary file, make whatever changes they want to the tempfile, and
FISWAPRANGE the contents from the tempfile into the regular file.  The
interface can optionally check the original file's [cm]time to reject
the swap operation if the file has been modified in the meantime.  There are no
fsyncs to take care of; no directory operations at all; and the fs will
take care of finishing the swap operation if the system goes down in the
middle of the swap.  Sample code can be found in the corresponding
changes to xfs_io to exercise the use case mentioned above.
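
A rough userspace sketch of that sequence follows.  It is not the xfs_io
code: struct file_swap_range, the FILE_SWAP_RANGE_* flags, and the
FISWAPRANGE number are copied by hand from patch 03 of this RFC (they
are not in any released kernel header, so the ioctl fails on kernels
without this series), and swap_request_init(), atomic_update(), and the
modify callback are hypothetical names made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <linux/fs.h>		/* FICLONE, __s64 and friends */

/* Local copy of the proposed UAPI from patch 03 (assumption: unchanged). */
struct file_swap_range {
	__s64	file1_fd;
	__s64	file1_offset;
	__s64	file2_offset;
	__s64	length;
	__u64	flags;
	__s64	file2_ino;
	__s64	file2_mtime;
	__s64	file2_ctime;
	__s32	file2_mtime_nsec;
	__s32	file2_ctime_nsec;
	__u64	pad[6];		/* must be zeroes */
};
#define FILE_SWAP_RANGE_FILE2_FRESH	(1 << 1)
#define FILE_SWAP_RANGE_TO_EOF		(1 << 3)
#ifndef FISWAPRANGE
#define FISWAPRANGE	_IOWR('X', 129, struct file_swap_range)
#endif

/* Build a whole-file swap request that is rejected if file2 has changed. */
void swap_request_init(struct file_swap_range *req, int tmp_fd,
		       const struct stat *snap)
{
	memset(req, 0, sizeof(*req));	/* also zeroes pad[] */
	req->file1_fd = tmp_fd;
	req->flags = FILE_SWAP_RANGE_TO_EOF | FILE_SWAP_RANGE_FILE2_FRESH;
	req->file2_ino = snap->st_ino;
	req->file2_mtime = snap->st_mtim.tv_sec;
	req->file2_mtime_nsec = snap->st_mtim.tv_nsec;
	req->file2_ctime = snap->st_ctim.tv_sec;
	req->file2_ctime_nsec = snap->st_ctim.tv_nsec;
}

/*
 * Update path by editing a private clone and swapping it back.  dir must be
 * a directory on the same mount as path.  Returns 0 or -1.
 */
int atomic_update(const char *path, const char *dir, int (*modify)(int fd))
{
	struct file_swap_range req;
	struct stat snap;
	int fd, tmp_fd, ret = -1;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	if (fstat(fd, &snap))		/* snapshot for the freshness check */
		goto out_fd;

	tmp_fd = open(dir, O_TMPFILE | O_RDWR, 0600);
	if (tmp_fd < 0)
		goto out_fd;
	if (ioctl(tmp_fd, FICLONE, fd))	/* optional: start from old contents */
		goto out_tmp;
	if (modify(tmp_fd))		/* make the changes in the tempfile */
		goto out_tmp;

	/* The ioctl is issued against file2, the file being updated. */
	swap_request_init(&req, tmp_fd, &snap);
	ret = ioctl(fd, FISWAPRANGE, &req);
out_tmp:
	close(tmp_fd);
out_fd:
	close(fd);
	return ret;
}
```

If the freshness check trips, the ioctl is expected to fail with EBUSY
(per the flag description in patch 03), and the caller can take a new
snapshot and retry.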

Note that this function is /not/ the O_DIRECT atomic file writes concept
that has been floating around for years.  This is constructed entirely
in software, which means that there are no limitations other than the
regular filesystem limits.

As a side note, there's an extra motivation behind the kernel
functionality: online repair of file-based metadata.  The atomic file
swap is implemented as an atomic inode fork swap, which means that we
can implement online reconstruction of extended attributes and
directories by building a new one in another inode and atomically
swapping the contents.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates


* [PATCH 01/18] xfs: clean up the error handling in xfs_swap_extent_rmap
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Clean up the error handling and make sure we actually bail out if
there's something not right with either file's fork mappings or we
couldn't clear all the COW extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   33 ++++++++++++++++++++++++---------
 1 file changed, 24 insertions(+), 9 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index cfd6e64661ba..746bb0c8271c 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1393,8 +1393,16 @@ xfs_swap_extent_rmap(
 				&nimaps, 0);
 		if (error)
 			goto out;
-		ASSERT(nimaps == 1);
-		ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
+		if (nimaps != 1 || tirec.br_startblock == DELAYSTARTBLOCK) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * since the donor file should have been flushed by the
+			 * caller.
+			 */
+			ASSERT(0);
+			error = -EINVAL;
+			goto out;
+		}
 
 		trace_xfs_swap_extent_rmap_remap(tip, &tirec);
 		ilen = tirec.br_blockcount;
@@ -1411,8 +1419,17 @@ xfs_swap_extent_rmap(
 					&nimaps, 0);
 			if (error)
 				goto out;
-			ASSERT(nimaps == 1);
-			ASSERT(tirec.br_startoff == irec.br_startoff);
+			if (nimaps != 1 ||
+			    tirec.br_startoff != irec.br_startoff) {
+				/*
+				 * We should never get no mapping or a mapping
+				 * for another offset, but bail out if that
+				 * ever happens.
+				 */
+				ASSERT(0);
+				error = -EFSCORRUPTED;
+				goto out;
+			}
 			trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
 
 			/* Trim the extent. */
@@ -1451,11 +1468,9 @@ xfs_swap_extent_rmap(
 		offset_fsb += ilen;
 	}
 
-	tip->i_d.di_flags2 = tip_flags2;
-	return 0;
-
 out:
-	trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
+	if (error)
+		trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
 	tip->i_d.di_flags2 = tip_flags2;
 	return error;
 }
@@ -1657,7 +1672,7 @@ xfs_swap_extents(
 	if (xfs_inode_has_cow_data(tip)) {
 		error = xfs_reflink_cancel_cow_range(tip, 0, NULLFILEOFF, true);
 		if (error)
-			return error;
+			goto out_unlock;
 	}
 
 	/*



* [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 01/18] xfs: clean up the error handling in xfs_swap_extent_rmap Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-05-01 22:54   ` Allison Collins
  2020-04-29  2:44 ` [PATCH 03/18] vfs: introduce new file extent swap ioctl Darrick J. Wong
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Fix the return value of xfs_reflink_remap_prep so that its calling
conventions match the rest of xfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |    2 +-
 fs/xfs/xfs_reflink.c |    6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 994fd3d59872..1759fbcbcd46 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1029,7 +1029,7 @@ xfs_file_remap_range(
 	/* Prepare and then clone file data. */
 	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
 			&len, remap_flags);
-	if (ret < 0 || len == 0)
+	if (ret || len == 0)
 		return ret;
 
 	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index d8c8b299cb1f..5e978d1f169d 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1375,7 +1375,7 @@ xfs_reflink_remap_prep(
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
 	bool			same_inode = (inode_in == inode_out);
-	ssize_t			ret;
+	int			ret;
 
 	/* Lock both files against IO */
 	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
@@ -1399,7 +1399,7 @@ xfs_reflink_remap_prep(
 
 	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
 			len, remap_flags);
-	if (ret < 0 || *len == 0)
+	if (ret || *len == 0)
 		goto out_unlock;
 
 	/* Attach dquots to dest inode before changing block map */
@@ -1434,7 +1434,7 @@ xfs_reflink_remap_prep(
 	if (ret)
 		goto out_unlock;
 
-	return 1;
+	return 0;
 out_unlock:
 	xfs_reflink_remap_unlock(file_in, file_out);
 	return ret;



* [PATCH 03/18] vfs: introduce new file extent swap ioctl
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 01/18] xfs: clean up the error handling in xfs_swap_extent_rmap Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 04/18] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
                   ` (15 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Introduce a new ioctl to handle swapping extents between two files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/ioctl.c              |   32 ++++++++
 fs/read_write.c         |  188 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h  |    1 
 include/linux/fs.h      |   15 ++++
 include/uapi/linux/fs.h |   55 ++++++++++++++
 mm/filemap.c            |   77 +++++++++++++++++++
 6 files changed, 367 insertions(+), 1 deletion(-)


diff --git a/fs/ioctl.c b/fs/ioctl.c
index 282d45be6f45..f564e6f2fad5 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -268,6 +268,35 @@ static long ioctl_file_clone_range(struct file *file,
 				args.src_length, args.dest_offset);
 }
 
+static long ioctl_file_swap_range(struct file *file2,
+				  struct file_swap_range __user *argp)
+{
+	struct file_swap_range args;
+	struct fd file1;
+	int ret;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+
+	ret = -EXDEV;
+	if (file1.file->f_path.mnt != file2->f_path.mnt)
+		goto fdput;
+
+	ret = vfs_swap_file_range(file1.file, file2, &args);
+	if (ret)
+		goto fdput;
+
+	if (copy_to_user(argp, &args, sizeof(args)))
+		ret = -EFAULT;
+fdput:
+	fdput(file1);
+	return ret;
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -730,6 +759,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FIDEDUPERANGE:
 		return ioctl_file_dedupe_range(filp, argp);
 
+	case FISWAPRANGE:
+		return ioctl_file_swap_range(filp, argp);
+
 	case FIONREAD:
 		if (!S_ISREG(inode->i_mode))
 			return vfs_ioctl(filp, cmd, arg);
diff --git a/fs/read_write.c b/fs/read_write.c
index bbfa9b12b15e..2b5116f129de 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -2081,6 +2081,92 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 }
 EXPORT_SYMBOL(generic_remap_file_range_prep);
 
+/*
+ * Check that the two inodes are eligible for range swapping, the ranges make
+ * sense, and then flush all dirty data.  Caller must ensure that the inodes
+ * have been locked against any other modifications.
+ */
+int generic_swap_file_range_prep(struct file *file1, struct file *file2,
+				 struct file_swap_range *fsr)
+{
+	struct inode *inode1 = file_inode(file1);
+	struct inode *inode2 = file_inode(file2);
+	u64 blkmask = i_blocksize(inode1) - 1;
+	bool same_inode = (inode1 == inode2);
+	int ret;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode2))
+		return -EPERM;
+
+	if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
+		return -ETXTBSY;
+
+	/* Don't swap dirs, pipes, sockets... */
+	if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
+		return -EINVAL;
+
+	/* Ranges cannot start after EOF. */
+	if (fsr->file1_offset > i_size_read(inode1) ||
+	    fsr->file2_offset > i_size_read(inode2))
+		return -EINVAL;
+
+	/*
+	 * If the caller said to swap to EOF, we set the length of the request
+	 * large enough to cover everything to the end of both files.
+	 */
+	if (fsr->flags & FILE_SWAP_RANGE_TO_EOF)
+		fsr->length = max_t(int64_t,
+				    i_size_read(inode1) - fsr->file1_offset,
+				    i_size_read(inode2) - fsr->file2_offset);
+
+	/* Zero length swapext exits immediately. */
+	if (fsr->length == 0)
+		return 0;
+
+	/* Check that we don't violate system file offset limits. */
+	ret = generic_swap_file_range_checks(file1, file2, fsr);
+	if (ret)
+		return ret;
+
+	/*
+	 * Ensure that we don't swap a partial EOF block into the middle of
+	 * another file.
+	 */
+	if (fsr->length & blkmask) {
+		loff_t new_length = fsr->length;
+
+		if (fsr->file2_offset + new_length < i_size_read(inode2))
+			new_length &= ~blkmask;
+
+		if (fsr->file1_offset + new_length < i_size_read(inode1))
+			new_length &= ~blkmask;
+
+		if (new_length != fsr->length)
+			return -EINVAL;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode1);
+	if (!same_inode)
+		inode_dio_wait(inode2);
+
+	ret = filemap_write_and_wait_range(inode1->i_mapping, fsr->file1_offset,
+					   fsr->file1_offset + fsr->length - 1);
+	if (ret)
+		return ret;
+
+	ret = filemap_write_and_wait_range(inode2->i_mapping, fsr->file2_offset,
+					   fsr->file2_offset + fsr->length - 1);
+	if (ret)
+		return ret;
+
+	return 0;
+}
+EXPORT_SYMBOL(generic_swap_file_range_prep);
+
 loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 			   struct file *file_out, loff_t pos_out,
 			   loff_t len, unsigned int remap_flags)
@@ -2278,3 +2364,105 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
+
+/*
+ * Check that both files' metadata agree with the snapshot that we took for
+ * the range swap request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+int generic_swap_file_range_check_fresh(struct inode *inode1,
+					struct inode *inode2,
+					const struct file_swap_range *fsr)
+{
+	/* Check that the offset/length values cover all of both files */
+	if ((fsr->flags & FILE_SWAP_RANGE_FULL_FILES) &&
+	    (fsr->file1_offset != 0 ||
+	     fsr->file2_offset != 0 ||
+	     fsr->length != i_size_read(inode1) ||
+	     fsr->length != i_size_read(inode2)))
+		return -EDOM;
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if ((fsr->flags & FILE_SWAP_RANGE_FILE2_FRESH) &&
+	    (fsr->file2_ino        != inode2->i_ino ||
+	     fsr->file2_ctime      != inode2->i_ctime.tv_sec  ||
+	     fsr->file2_ctime_nsec != inode2->i_ctime.tv_nsec ||
+	     fsr->file2_mtime      != inode2->i_mtime.tv_sec  ||
+	     fsr->file2_mtime_nsec != inode2->i_mtime.tv_nsec))
+		return -EBUSY;
+
+	return 0;
+}
+EXPORT_SYMBOL(generic_swap_file_range_check_fresh);
+
+static inline int swap_range_verify_area(struct file *file, loff_t pos,
+					 struct file_swap_range *fsr)
+{
+	int64_t len = fsr->length;
+
+	if (fsr->flags & FILE_SWAP_RANGE_TO_EOF)
+		len = min_t(int64_t, len, i_size_read(file_inode(file)) - pos);
+	return remap_verify_area(file, pos, len, true);
+}
+
+int do_swap_file_range(struct file *file1, struct file *file2,
+		       struct file_swap_range *fsr)
+{
+	int ret;
+
+	if ((fsr->flags & ~FILE_SWAP_RANGE_ALL_FLAGS) ||
+	    memchr_inv(&fsr->pad, 0, sizeof(fsr->pad)))
+		return -EINVAL;
+
+	if ((fsr->flags & FILE_SWAP_RANGE_FULL_FILES) &&
+	    (fsr->flags & FILE_SWAP_RANGE_TO_EOF))
+		return -EINVAL;
+
+	/*
+	 * FISWAPRANGE ioctl enforces that src and dest files are on the same
+	 * mount. Practically, they only need to be on the same file system.
+	 */
+	if (file_inode(file1)->i_sb != file_inode(file2)->i_sb)
+		return -EXDEV;
+
+	ret = generic_file_rw_checks(file1, file2);
+	if (ret < 0)
+		return ret;
+
+	if (!file2->f_op->swap_file_range)
+		return -EOPNOTSUPP;
+
+	ret = swap_range_verify_area(file1, fsr->file1_offset, fsr);
+	if (ret)
+		return ret;
+
+	ret = swap_range_verify_area(file2, fsr->file2_offset, fsr);
+	if (ret)
+		return ret;
+
+	ret = file2->f_op->swap_file_range(file1, file2, fsr);
+	if (ret)
+		return ret;
+
+	file_modified(file1);
+	file_modified(file2);
+	fsnotify_modify(file1);
+	fsnotify_modify(file2);
+	return ret;
+}
+EXPORT_SYMBOL(do_swap_file_range);
+
+int vfs_swap_file_range(struct file *file1, struct file *file2,
+			struct file_swap_range *fsr)
+{
+	int ret;
+
+	file_start_write(file2);
+	ret = do_swap_file_range(file1, file2, fsr);
+	file_end_write(file2);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_swap_file_range);
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 18054120074e..c5b75082b9db 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -844,6 +844,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+/*	FISWAPRANGE ---------------- hoisted 129	 */
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4f6f59b4f22a..63acc11d0804 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1862,6 +1862,8 @@ struct file_operations {
 	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
+	int (*swap_file_range)(struct file *file_in, struct file *file_out,
+			       struct file_swap_range *fsr);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
@@ -1931,6 +1933,8 @@ extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
 					 unsigned int remap_flags);
+extern int generic_swap_file_range_prep(struct file *file1, struct file *file2,
+					struct file_swap_range *fsr);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
@@ -1942,7 +1946,13 @@ extern int vfs_dedupe_file_range(struct file *file,
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 					struct file *dst_file, loff_t dst_pos,
 					loff_t len, unsigned int remap_flags);
-
+extern int do_swap_file_range(struct file *file1, struct file *file2,
+			      struct file_swap_range *fsr);
+extern int vfs_swap_file_range(struct file *file1, struct file *file2,
+			       struct file_swap_range *fsr);
+extern int generic_swap_file_range_check_fresh(struct inode *inode1,
+					struct inode *inode2,
+					const struct file_swap_range *fsr);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
@@ -3120,6 +3130,9 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
 extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
 				struct file *file_out, loff_t pos_out,
 				loff_t *count, unsigned int remap_flags);
+extern int generic_swap_file_range_checks(struct file *file1,
+					  struct file *file2,
+					  const struct file_swap_range *fsr);
 extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
 extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 379a612f8f1d..a74b49b02e75 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -93,6 +93,60 @@ struct file_dedupe_range {
 	struct file_dedupe_range_info info[0];
 };
 
+/*
+ * Swap part of file1 with part of the file that this ioctl is being
+ * called against (which we'll call file2).  Filesystems must be able to
+ * complete the operation even if the system goes down.
+ */
+struct file_swap_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__s64		length;		/* bytes to swap */
+
+	__u64		flags;		/* see FILE_SWAP_RANGE_* below */
+
+	/* file2 metadata for optional freshness checks */
+	__s64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad[6];		/* must be zeroes */
+};
+
+/*
+ * Atomic swap operations are not required.  This relaxes the requirement that
+ * the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_SWAP_RANGE_NONATOMIC	(1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_SWAP_RANGE_FILE2_FRESH	(1 << 1)
+
+/*
+ * Check that the file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length.  Returns -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_SWAP_RANGE_FULL_FILES	(1 << 2)
+
+/*
+ * Swap file data all the way to the ends of both files, and then swap the file
+ * sizes.  This flag can be used to replace a file's contents with a different
+ * amount of data.  length will be ignored.
+ */
+#define FILE_SWAP_RANGE_TO_EOF		(1 << 3)
+
+#define FILE_SWAP_RANGE_ALL_FLAGS	(FILE_SWAP_RANGE_NONATOMIC | \
+					 FILE_SWAP_RANGE_FILE2_FRESH | \
+					 FILE_SWAP_RANGE_FULL_FILES | \
+					 FILE_SWAP_RANGE_TO_EOF)
+
 /* And dynamically-tunable limits and defaults: */
 struct files_stat_struct {
 	unsigned long nr_files;		/* read only */
@@ -198,6 +252,7 @@ struct fsxattr {
 #define FICLONE		_IOW(0x94, 9, int)
 #define FICLONERANGE	_IOW(0x94, 13, struct file_clone_range)
 #define FIDEDUPERANGE	_IOWR(0x94, 54, struct file_dedupe_range)
+#define FISWAPRANGE	_IOWR('X', 129, struct file_swap_range)
 
 #define FSLABEL_MAX 256	/* Max chars for the interface; each fs may differ */
 
diff --git a/mm/filemap.c b/mm/filemap.c
index 23a051a7ef0f..e21b63654767 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3035,6 +3035,83 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
+/* Performs necessary checks before doing a range swap. */
+int generic_swap_file_range_checks(struct file *file1, struct file *file2,
+				   const struct file_swap_range *fsr)
+{
+	struct inode *inode1 = file1->f_mapping->host;
+	struct inode *inode2 = file2->f_mapping->host;
+	int64_t test_len;
+	uint64_t blen;
+	loff_t size1, size2;
+	loff_t bs = inode2->i_sb->s_blocksize;
+	int ret;
+
+	if (fsr->length < 0)
+		return -EINVAL;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(fsr->file1_offset, bs) ||
+	    !IS_ALIGNED(fsr->file2_offset, bs))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (fsr->file1_offset + fsr->length < fsr->file1_offset ||
+	    fsr->file2_offset + fsr->length < fsr->file2_offset)
+		return -EINVAL;
+
+	size1 = i_size_read(inode1);
+	size2 = i_size_read(inode2);
+
+	/*
+	 * Swapext requires both ranges to be within EOF, unless we're swapping
+	 * to EOF.  generic_swap_file_range_prep already checked that both
+	 * fsr->file1_offset and fsr->file2_offset are within EOF.
+	 */
+	if (!(fsr->flags & FILE_SWAP_RANGE_TO_EOF) &&
+	    (fsr->file1_offset + fsr->length > size1 ||
+	     fsr->file2_offset + fsr->length > size2))
+		return -EINVAL;
+
+	/*
+	 * Make sure we don't hit any file size limits.  If we hit any size
+	 * limits such that test_len was adjusted, we abort the whole
+	 * operation.
+	 */
+	test_len = fsr->length;
+	ret = generic_write_check_limits(file2, fsr->file2_offset, &test_len);
+	if (ret)
+		return ret;
+	ret = generic_write_check_limits(file1, fsr->file1_offset, &test_len);
+	if (ret)
+		return ret;
+	if (test_len != fsr->length)
+		return -EINVAL;
+
+	/*
+	 * If the user wanted us to swap to the infile's EOF, round up to the
+	 * next block boundary for this check.  Do the same for the outfile.
+	 *
+	 * Otherwise, reject the range length if it's not block aligned.  We
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (fsr->file1_offset + fsr->length == size1)
+		blen = ALIGN(size1, bs) - fsr->file1_offset;
+	else if (fsr->file2_offset + fsr->length == size2)
+		blen = ALIGN(size2, bs) - fsr->file2_offset;
+	else if (!IS_ALIGNED(fsr->length, bs))
+		return -EINVAL;
+	else
+		blen = fsr->length;
+
+	/* Don't allow overlapped swapping within the same file. */
+	if (inode1 == inode2 &&
+	    fsr->file2_offset + blen > fsr->file1_offset &&
+	    fsr->file1_offset + blen > fsr->file2_offset)
+		return -EINVAL;
+
+	return 0;
+}
 
 /*
  * Performs common checks before doing a file copy/clone


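
The alignment, EOF, and overlap rules in generic_swap_file_range_checks()
from the patch above can be modeled in userspace to see which requests
pass.  The sketch below is a simplified model, not the kernel code:
swap_range_ok() and align_up() are hypothetical names, the
FILE_SWAP_RANGE_TO_EOF case and the generic_write_check_limits() calls
are left out, and the wrap check is written without relying on signed
overflow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of bs; bs must be a power of two. */
uint64_t align_up(uint64_t x, uint64_t bs)
{
	return (x + bs - 1) & ~(bs - 1);
}

/*
 * Model of the range checks for a swap that does not use the TO_EOF flag.
 * Offsets, sizes, and len are in bytes; bs is the fs block size.
 */
bool swap_range_ok(int64_t off1, int64_t size1, int64_t off2, int64_t size2,
		   int64_t len, int64_t bs, bool same_file)
{
	uint64_t blen;

	if (len < 0 || off1 < 0 || off2 < 0)
		return false;

	/* The start of both ranges must be aligned to an fs block. */
	if ((off1 | off2) & (bs - 1))
		return false;

	/* Ensure offsets don't wrap. */
	if (len > INT64_MAX - off1 || len > INT64_MAX - off2)
		return false;

	/* Both ranges must end at or before EOF. */
	if (off1 + len > size1 || off2 + len > size2)
		return false;

	/*
	 * A range that ends exactly at EOF may have an unaligned length; it
	 * is rounded up to a block boundary for the overlap test.  Any other
	 * range must have a block-aligned length.
	 */
	if (off1 + len == size1)
		blen = align_up((uint64_t)size1, (uint64_t)bs) - (uint64_t)off1;
	else if (off2 + len == size2)
		blen = align_up((uint64_t)size2, (uint64_t)bs) - (uint64_t)off2;
	else if (len & (bs - 1))
		return false;
	else
		blen = (uint64_t)len;

	/* Don't allow overlapped swapping within the same file. */
	if (same_file &&
	    (uint64_t)off2 + blen > (uint64_t)off1 &&
	    (uint64_t)off1 + blen > (uint64_t)off2)
		return false;

	return true;
}
```

For example, with 4096-byte blocks, swapping bytes 0-4095 with bytes
4096-8191 of the same 16384-byte file passes, while swapping 8192 bytes
between offsets 0 and 4096 of that file is rejected as overlapping.
Note that generic_swap_file_range_prep() applies a further check on
partial EOF blocks that this model does not cover.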

* [PATCH 04/18] xfs: support deferred bmap updates on the attr fork
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (2 preceding siblings ...)
  2020-04-29  2:44 ` [PATCH 03/18] vfs: introduce new file extent swap ioctl Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 05/18] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

The deferred bmap update log item has always supported the attr fork, so
plumb this in so that higher layers can access this.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |   42 ++++++++++++++++--------------------------
 fs/xfs/libxfs/xfs_bmap.h |    4 ++--
 fs/xfs/xfs_bmap_item.c   |    2 +-
 fs/xfs/xfs_bmap_util.c   |    8 ++++----
 fs/xfs/xfs_reflink.c     |    4 ++--
 5 files changed, 25 insertions(+), 35 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 33dbae784463..2752df4f4e69 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6238,17 +6238,8 @@ xfs_bmap_split_extent(
 	return error;
 }
 
-/* Deferred mapping is only for real extents in the data fork. */
-static bool
-xfs_bmap_is_update_needed(
-	struct xfs_bmbt_irec	*bmap)
-{
-	return  bmap->br_startblock != HOLESTARTBLOCK &&
-		bmap->br_startblock != DELAYSTARTBLOCK;
-}
-
 /* Record a bmap intent. */
-static int
+static void
 __xfs_bmap_add(
 	struct xfs_trans		*tp,
 	enum xfs_bmap_intent_type	type,
@@ -6258,6 +6249,11 @@ __xfs_bmap_add(
 {
 	struct xfs_bmap_intent		*bi;
 
+	if ((whichfork != XFS_DATA_FORK && whichfork != XFS_ATTR_FORK) ||
+	    bmap->br_startblock == HOLESTARTBLOCK ||
+	    bmap->br_startblock == DELAYSTARTBLOCK)
+		return;
+
 	trace_xfs_bmap_defer(tp->t_mountp,
 			XFS_FSB_TO_AGNO(tp->t_mountp, bmap->br_startblock),
 			type,
@@ -6275,7 +6271,6 @@ __xfs_bmap_add(
 	bi->bi_bmap = *bmap;
 
 	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_BMAP, &bi->bi_list);
-	return 0;
 }
 
 /* Map an extent into a file. */
@@ -6283,12 +6278,10 @@ void
 xfs_bmap_map_extent(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
+	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_bmap_is_update_needed(PREV))
-		return;
-
-	__xfs_bmap_add(tp, XFS_BMAP_MAP, ip, XFS_DATA_FORK, PREV);
+	__xfs_bmap_add(tp, XFS_BMAP_MAP, ip, whichfork, PREV);
 }
 
 /* Unmap an extent out of a file. */
@@ -6296,12 +6289,10 @@ void
 xfs_bmap_unmap_extent(
 	struct xfs_trans	*tp,
 	struct xfs_inode	*ip,
+	int			whichfork,
 	struct xfs_bmbt_irec	*PREV)
 {
-	if (!xfs_bmap_is_update_needed(PREV))
-		return;
-
-	__xfs_bmap_add(tp, XFS_BMAP_UNMAP, ip, XFS_DATA_FORK, PREV);
+	__xfs_bmap_add(tp, XFS_BMAP_UNMAP, ip, whichfork, PREV);
 }
 
 /*
@@ -6320,6 +6311,10 @@ xfs_bmap_finish_one(
 	xfs_exntst_t			state)
 {
 	int				error = 0;
+	int				flags = 0;
+
+	if (whichfork == XFS_ATTR_FORK)
+		flags |= XFS_BMAPI_ATTRFORK;
 
 	ASSERT(tp->t_firstblock == NULLFSBLOCK);
 
@@ -6328,11 +6323,6 @@ xfs_bmap_finish_one(
 			XFS_FSB_TO_AGBNO(tp->t_mountp, startblock),
 			ip->i_ino, whichfork, startoff, *blockcount, state);
 
-	if (WARN_ON_ONCE(whichfork != XFS_DATA_FORK)) {
-		xfs_bmap_mark_sick(ip, whichfork);
-		return -EFSCORRUPTED;
-	}
-
 	if (XFS_TEST_ERROR(false, tp->t_mountp,
 			XFS_ERRTAG_BMAP_FINISH_ONE))
 		return -EIO;
@@ -6340,12 +6330,12 @@ xfs_bmap_finish_one(
 	switch (type) {
 	case XFS_BMAP_MAP:
 		error = xfs_bmapi_remap(tp, ip, startoff, *blockcount,
-				startblock, 0);
+				startblock, flags);
 		*blockcount = 0;
 		break;
 	case XFS_BMAP_UNMAP:
 		error = __xfs_bunmapi(tp, ip, startoff, blockcount,
-				XFS_BMAPI_REMAP, 1);
+				flags | XFS_BMAPI_REMAP, 1);
 		break;
 	default:
 		ASSERT(0);
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index bbd8ccdecffa..3367df499ac8 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -266,9 +266,9 @@ int	xfs_bmap_finish_one(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_fileoff_t startoff, xfs_fsblock_t startblock,
 		xfs_filblks_t *blockcount, xfs_exntst_t state);
 void	xfs_bmap_map_extent(struct xfs_trans *tp, struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap);
+		int whichfork, struct xfs_bmbt_irec *imap);
 void	xfs_bmap_unmap_extent(struct xfs_trans *tp, struct xfs_inode *ip,
-		struct xfs_bmbt_irec *imap);
+		int whichfork, struct xfs_bmbt_irec *imap);
 
 static inline int xfs_bmap_fork_to_state(int whichfork)
 {
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 267351fbea67..7ad803a06634 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -581,7 +581,7 @@ xfs_bui_recover(
 		irec.br_blockcount = count;
 		irec.br_startoff = bmap->me_startoff;
 		irec.br_state = state;
-		xfs_bmap_unmap_extent(tp, ip, &irec);
+		xfs_bmap_unmap_extent(tp, ip, whichfork, &irec);
 	}
 
 	set_bit(XFS_BUI_RECOVERED, &buip->bui_flags);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 746bb0c8271c..070f657241a1 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1440,16 +1440,16 @@ xfs_swap_extent_rmap(
 			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
 
 			/* Remove the mapping from the donor file. */
-			xfs_bmap_unmap_extent(tp, tip, &uirec);
+			xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
 
 			/* Remove the mapping from the source file. */
-			xfs_bmap_unmap_extent(tp, ip, &irec);
+			xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
 
 			/* Map the donor file's blocks into the source file. */
-			xfs_bmap_map_extent(tp, ip, &uirec);
+			xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
 
 			/* Map the source file's blocks into the donor file. */
-			xfs_bmap_map_extent(tp, tip, &irec);
+			xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
 
 			error = xfs_defer_finish(tpp);
 			tp = *tpp;
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index 5e978d1f169d..f206f6637daf 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -706,7 +706,7 @@ xfs_reflink_end_cow_extent(
 	xfs_refcount_free_cow_extent(tp, del.br_startblock, del.br_blockcount);
 
 	/* Map the new blocks into the data fork. */
-	xfs_bmap_map_extent(tp, ip, &del);
+	xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &del);
 
 	/* Charge this new data fork mapping to the on-disk quota. */
 	xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_DELBCOUNT,
@@ -1125,7 +1125,7 @@ xfs_reflink_remap_extent(
 		xfs_refcount_increase_extent(tp, &uirec);
 
 		/* Map the new blocks into the data fork. */
-		xfs_bmap_map_extent(tp, ip, &uirec);
+		xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
 
 		/* Update quota accounting. */
 		xfs_trans_mod_dquot_byino(tp, ip, XFS_TRANS_DQ_BCOUNT,



* [PATCH 05/18] xfs: xfs_bmap_finish_one should map unwritten extents properly
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (3 preceding siblings ...)
  2020-04-29  2:44 ` [PATCH 04/18] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-04-29  2:44 ` [PATCH 06/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

The deferred bmap work state and the log item can transmit unwritten
state, so the XFS_BMAP_MAP handler must map in extents with that
unwritten state.
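
As a rough userspace illustration of the rule above (not the kernel code
itself): toy_ext_state, TOY_BMAPI_PREALLOC, and toy_map_flags are
hypothetical stand-ins for br_state, XFS_BMAPI_PREALLOC, and the flag
computation in the XFS_BMAP_MAP case.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's extent states. */
enum toy_ext_state { TOY_EXT_NORM, TOY_EXT_UNWRITTEN };

/* Hypothetical stand-in for XFS_BMAPI_PREALLOC. */
#define TOY_BMAPI_PREALLOC	(1u << 0)

/*
 * Compute the flags for a remap request.  An extent that was logged as
 * unwritten has to be mapped back in as unwritten, so the preallocation
 * flag is added to whatever flags the caller already passed.
 */
unsigned int
toy_map_flags(enum toy_ext_state state, unsigned int flags)
{
	if (state == TOY_EXT_UNWRITTEN)
		flags |= TOY_BMAPI_PREALLOC;
	return flags;
}
```

Without the extra flag, an unwritten extent would presumably come back as a
written one, which is the situation this patch is meant to avoid.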

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_bmap.c |    2 ++
 1 file changed, 2 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 2752df4f4e69..81e03461312b 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -6329,6 +6329,8 @@ xfs_bmap_finish_one(
 
 	switch (type) {
 	case XFS_BMAP_MAP:
+		if (state == XFS_EXT_UNWRITTEN)
+			flags |= XFS_BMAPI_PREALLOC;
 		error = xfs_bmapi_remap(tp, ip, startoff, *blockcount,
 				startblock, flags);
 		*blockcount = 0;



* [PATCH 06/18] xfs: create a log incompat flag for atomic extent swapping
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (4 preceding siblings ...)
  2020-04-29  2:44 ` [PATCH 05/18] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
@ 2020-04-29  2:44 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 07/18] xfs: allow deferred ops items to put themselves at the end of the pending queue Darrick J. Wong
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:44 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Create a log incompat flag so that we only attempt to process swap
extent log items if the filesystem supports it.
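
A minimal sketch of how such a feature mask is usually tested, assuming a
32-bit feature word.  The TOY_* names and both helpers are hypothetical and
only mirror the shape of the XFS_SB_FEAT_INCOMPAT_LOG_* macros in the diff
below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the log incompat bit added by this patch. */
#define TOY_LOG_FEAT_ATOMIC_SWAP	(1u << 0)
#define TOY_LOG_FEAT_ALL		(TOY_LOG_FEAT_ATOMIC_SWAP)
#define TOY_LOG_FEAT_UNKNOWN		(~TOY_LOG_FEAT_ALL)

/* True if the feature word advertises the atomic swap log feature. */
bool
toy_has_atomic_swap(uint32_t log_incompat)
{
	return (log_incompat & TOY_LOG_FEAT_ATOMIC_SWAP) != 0;
}

/* True if any bit is set that this implementation does not understand. */
bool
toy_has_unknown_log_feature(uint32_t log_incompat)
{
	return (log_incompat & TOY_LOG_FEAT_UNKNOWN) != 0;
}
```

Adding the new bit to the "all known features" mask is what moves it out of
the "unknown" mask, so an implementation that knows the bit no longer treats
it as unrecognized.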

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_format.h |   11 ++++++++++-
 fs/xfs/libxfs/xfs_fs.h     |    1 +
 fs/xfs/libxfs/xfs_sb.c     |    2 ++
 fs/xfs/xfs_super.c         |    4 ++++
 4 files changed, 17 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 34babf402e14..63ed62a92c9c 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -500,7 +500,10 @@ xfs_sb_has_incompat_feature(
 	return (sbp->sb_features_incompat & feature) != 0;
 }
 
-#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
+#define XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP (1 << 0)
+#define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
+		(XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP)
+
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(
@@ -614,6 +617,12 @@ static inline bool xfs_sb_version_hasrtrmapbt(struct xfs_sb *sbp)
 	       xfs_sb_version_hasrmapbt(sbp);
 }
 
+static inline bool xfs_sb_version_hasatomicswap(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_log_incompat & XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index c5b75082b9db..d278ca5731e4 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -251,6 +251,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_RMAPBT	(1 << 19) /* reverse mapping btree */
 #define XFS_FSOP_GEOM_FLAGS_REFLINK	(1 << 20) /* files can share blocks */
 #define XFS_FSOP_GEOM_FLAGS_BIGTIME	(1 << 21) /* 64-bit nsec timestamps */
+#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1 << 22) /* atomic swapext */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index c50f589824f0..16094fd1a75e 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1205,6 +1205,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_REFLINK;
 	if (xfs_sb_version_hasbigtime(sbp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_BIGTIME;
+	if (xfs_sb_version_hasatomicswap(sbp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
 	if (xfs_sb_version_hassector(sbp))
 		geo->logsectsize = sbp->sb_logsectsize;
 	else
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index a17e824c6084..42d82c9d2a1d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -1647,6 +1647,10 @@ xfs_fc_fill_super(
 		xfs_warn(mp,
  "EXPERIMENTAL inode btree counters feature in use. Use at your own risk!");
 
+	if (xfs_sb_version_hasatomicswap(&mp->m_sb))
+		xfs_warn(mp,
+ "EXPERIMENTAL atomic file range swap feature in use. Use at your own risk!");
+
 	error = xfs_mountfs(mp);
 	if (error)
 		goto out_filestream_unmount;



* [PATCH 07/18] xfs: allow deferred ops items to put themselves at the end of the pending queue
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (5 preceding siblings ...)
  2020-04-29  2:44 ` [PATCH 06/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 08/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Allow individual deferred op ->finish_item functions to decide that they
want to yield to all other deferred ops that might need processing.
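
The queueing behaviour can be sketched in userspace roughly as follows.
This is a simplified model, not the kernel implementation: toy_item,
toy_queue, and the toy_* functions are hypothetical, and a counter stands
in for "->finish_item returned -EMULTIHOP".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ITEMS	8

/* Hypothetical work item: yields `yields_left` times before completing. */
struct toy_item {
	int	id;
	int	yields_left;
};

/* Small array-backed FIFO standing in for a list_head queue. */
struct toy_queue {
	struct toy_item	*items[TOY_MAX_ITEMS];
	size_t		count;
};

void
toy_push(struct toy_queue *q, struct toy_item *it)
{
	assert(q->count < TOY_MAX_ITEMS);
	q->items[q->count++] = it;
}

struct toy_item *
toy_pop(struct toy_queue *q)
{
	struct toy_item	*it = q->items[0];
	size_t		i;

	for (i = 1; i < q->count; i++)
		q->items[i - 1] = q->items[i];
	q->count--;
	return it;
}

/* Returns nonzero if the item wants to go to the end of the line. */
int
toy_finish_item(struct toy_item *it)
{
	if (it->yields_left > 0) {
		it->yields_left--;
		return 1;
	}
	return 0;
}

/*
 * Drain the pending queue.  Items that yield are parked on a separate
 * end-of-line queue and only spliced back once pending is empty.  The ids
 * of completed items are recorded in completion order; the return value
 * is the number of completed items.
 */
size_t
toy_defer_finish(struct toy_queue *pending, int *done_ids)
{
	struct toy_queue	endofline = { .count = 0 };
	size_t			ndone = 0;

	for (;;) {
		while (pending->count > 0) {
			struct toy_item	*it = toy_pop(pending);

			if (toy_finish_item(it))
				toy_push(&endofline, it);
			else
				done_ids[ndone++] = it->id;
		}
		if (endofline.count == 0)
			break;
		/* Splice the parked items back and go around again. */
		while (endofline.count > 0)
			toy_push(pending, toy_pop(&endofline));
	}
	return ndone;
}
```

With three items queued in the order 1, 2, 3, where item 1 yields once and
item 3 yields twice, this model completes them in the order 2, 1, 3: the
item that never yields is not held up behind the ones that do.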

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_defer.c |   29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 1cab95cef399..f53e3ce858eb 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -69,10 +69,10 @@
  *   - For each work item attached to the log intent item,
  *     * Perform the described action.
  *     * Attach the work item to the log done item.
- *     * If the result of doing the work was -EAGAIN, ->finish work
- *       wants a new transaction.  See the "Requesting a Fresh
- *       Transaction while Finishing Deferred Work" section below for
- *       details.
+ *     * If the result of doing the work was -EAGAIN or -EMULTIHOP,
+ *       ->finish work wants a new transaction.  See the "Requesting a
+ *       Fresh Transaction while Finishing Deferred Work" section below
+ *       for details.
  *
  * The key here is that we must log an intent item for all pending
  * work items every time we roll the transaction, and that we must log
@@ -108,6 +108,13 @@
  * required that ->finish_item must be careful to leave enough
  * transaction reservation to fit the new log intent item.
  *
+ * If ->finish_item returns -EMULTIHOP, defer_finish will log the new
+ * intent item with the remaining work items but it will move the
+ * xfs_defer_pending item to a separate queue.  The separate queue
+ * will be put back into the pending list at the very end of processing
+ * after all other pending items (including ones that were created as
+ * part of finishing other items) have been processed.
+ *
  * This is an example of remapping the extent (E, E+B) into file X at
  * offset A and dealing with the extent (C, C+B) already being mapped
  * there:
@@ -365,12 +372,14 @@ xfs_defer_finish_noroll(
 	int				error = 0;
 	const struct xfs_defer_op_type	*ops;
 	LIST_HEAD(dop_pending);
+	LIST_HEAD(dop_endofline);
 
 	ASSERT((*tp)->t_flags & XFS_TRANS_PERM_LOG_RES);
 
 	trace_xfs_defer_finish(*tp, _RET_IP_);
 
 	/* Until we run out of pending work to finish... */
+again:
 	while (!list_empty(&dop_pending) || !list_empty(&(*tp)->t_dfops)) {
 		/* log intents and pull in intake items */
 		xfs_defer_create_intents(*tp);
@@ -398,7 +407,7 @@ xfs_defer_finish_noroll(
 			dfp->dfp_count--;
 			error = ops->finish_item(*tp, li, dfp->dfp_done,
 					&state);
-			if (error == -EAGAIN) {
+			if (error == -EAGAIN || error == -EMULTIHOP) {
 				/*
 				 * Caller wants a fresh transaction;
 				 * put the work item back on the list
@@ -418,7 +427,7 @@ xfs_defer_finish_noroll(
 				goto out;
 			}
 		}
-		if (error == -EAGAIN) {
+		if (error == -EAGAIN || error == -EMULTIHOP) {
 			/*
 			 * Caller wants a fresh transaction, so log a
 			 * new log intent item to replace the old one
@@ -431,6 +440,8 @@ xfs_defer_finish_noroll(
 			dfp->dfp_done = NULL;
 			list_for_each(li, &dfp->dfp_work)
 				ops->log_item(*tp, dfp->dfp_intent, li);
+			if (error == -EMULTIHOP)
+				list_move_tail(&dfp->dfp_list, &dop_endofline);
 		} else {
 			/* Done with the dfp, free it. */
 			list_del(&dfp->dfp_list);
@@ -441,8 +452,14 @@ xfs_defer_finish_noroll(
 			ops->finish_cleanup(*tp, state, error);
 	}
 
+	if (!list_empty(&dop_endofline)) {
+		list_splice_tail_init(&dop_endofline, &dop_pending);
+		goto again;
+	}
+
 out:
 	if (error) {
+		list_splice_tail_init(&dop_endofline, &dop_pending);
 		xfs_defer_trans_abort(*tp, &dop_pending);
 		xfs_force_shutdown((*tp)->t_mountp, SHUTDOWN_CORRUPT_INCORE);
 		trace_xfs_defer_finish_error(*tp, error);



* [PATCH 08/18] xfs: introduce a swap-extent log intent item
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (6 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 07/18] xfs: allow deferred ops items to put themselves at the end of the pending queue Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 09/18] xfs: create deferred log items for extent swapping Darrick J. Wong
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Introduce a new intent log item to handle swapping extents.
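
The intent item is reference counted the same way the comments in the diff
below describe: it starts with two references, one dropped when the log is
done with the item and one dropped on behalf of the done item.  A minimal
userspace sketch of that lifecycle, using C11 atomics and hypothetical
toy_* names (the `freed` flag only exists so the demo can observe the
result):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the in-core intent item. */
struct toy_intent {
	atomic_int	refcount;
	bool		freed;	/* set instead of really freeing */
};

/* Start with two references: one for the log, one for the done item. */
void
toy_intent_init(struct toy_intent *ip)
{
	atomic_init(&ip->refcount, 2);
	ip->freed = false;
}

/* Drop one reference; only the last caller "frees" the intent. */
void
toy_intent_release(struct toy_intent *ip)
{
	/* atomic_fetch_sub returns the value held before the subtraction. */
	if (atomic_fetch_sub(&ip->refcount, 1) == 1)
		ip->freed = true;
}
```

Because either reference may be the last one dropped, neither the log side
nor the done-item side needs to know which of them runs second.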

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_log_format.h  |   55 ++++++
 fs/xfs/libxfs/xfs_log_recover.h |    1 
 fs/xfs/xfs_log.c                |    2 
 fs/xfs/xfs_log_recover.c        |    6 +
 fs/xfs/xfs_super.c              |   17 ++
 fs/xfs/xfs_swapext_item.c       |  365 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_swapext_item.h       |   67 +++++++
 8 files changed, 512 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_swapext_item.c
 create mode 100644 fs/xfs/xfs_swapext_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 2bd822c784cb..27b4bd5c8ffe 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -109,6 +109,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_inode_item.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
+				   xfs_swapext_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
 				   xfs_trans_buf.o
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 382b7cd6ba82..ceb67213df64 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -117,7 +117,9 @@ struct xfs_unmount_log_format {
 #define XLOG_REG_TYPE_CUD_FORMAT	24
 #define XLOG_REG_TYPE_BUI_FORMAT	25
 #define XLOG_REG_TYPE_BUD_FORMAT	26
-#define XLOG_REG_TYPE_MAX		26
+#define XLOG_REG_TYPE_SXI_FORMAT	27
+#define XLOG_REG_TYPE_SXD_FORMAT	28
+#define XLOG_REG_TYPE_MAX		28
 
 /*
  * Flags to log operation header
@@ -240,6 +242,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_CUD		0x1243
 #define	XFS_LI_BUI		0x1244	/* bmbt update intent */
 #define	XFS_LI_BUD		0x1245
+#define	XFS_LI_SXI		0x1246
+#define	XFS_LI_SXD		0x1247
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -255,7 +259,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
 	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
 	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
-	{ XFS_LI_BUD,		"XFS_LI_BUD" }
+	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
+	{ XFS_LI_SXI,		"XFS_LI_SXI" }, \
+	{ XFS_LI_SXD,		"XFS_LI_SXD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -786,6 +792,51 @@ struct xfs_bud_log_format {
 	uint64_t		bud_bui_id;	/* id of corresponding bui */
 };
 
+/*
+ * SXI/SXD (extent swapping) log format definitions
+ */
+
+struct xfs_swap_extent {
+	uint64_t		se_inode1;
+	uint64_t		se_inode2;
+	uint64_t		se_startoff1;
+	uint64_t		se_startoff2;
+	uint64_t		se_blockcount;
+	uint64_t		se_flags;
+	int64_t			se_isize1;
+	int64_t			se_isize2;
+};
+
+/* Swap extents between extended attribute forks. */
+#define XFS_SWAP_EXTENT_ATTR_FORK	(1ULL << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_SWAP_EXTENT_SET_SIZES	(1ULL << 1)
+
+#define XFS_SWAP_EXTENT_FLAGS		(XFS_SWAP_EXTENT_ATTR_FORK | \
+					 XFS_SWAP_EXTENT_SET_SIZES)
+
+/* This is the structure used to lay out an sxi log item in the log. */
+struct xfs_sxi_log_format {
+	uint16_t		sxi_type;	/* sxi log item type */
+	uint16_t		sxi_size;	/* size of this item */
+	uint32_t		__pad;		/* must be zero */
+	uint64_t		sxi_id;		/* sxi identifier */
+	struct xfs_swap_extent	sxi_extent;	/* extent to swap */
+};
+
+/*
+ * This is the structure used to lay out an sxd log item in the
+ * log.  It has no variable-length payload; it records only the
+ * id of the sxi item that it completes.
+ */
+struct xfs_sxd_log_format {
+	uint16_t		sxd_type;	/* sxd log item type */
+	uint16_t		sxd_size;	/* size of this item */
+	uint32_t		__pad;
+	uint64_t		sxd_sxi_id;	/* id of corresponding sxi */
+};
+
 /*
  * Dquot Log format definitions.
  *
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index b36ccaa5465b..c9cd6775f50c 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -169,6 +169,7 @@ extern const struct xlog_recover_intent_type xlog_recover_extfree_type;
 extern const struct xlog_recover_intent_type xlog_recover_rmap_type;
 extern const struct xlog_recover_intent_type xlog_recover_refcount_type;
 extern const struct xlog_recover_intent_type xlog_recover_bmap_type;
+extern const struct xlog_recover_intent_type xlog_recover_swapext_type;
 
 typedef bool (*xlog_recover_release_intent_fn)(struct xlog *log,
 		struct xfs_log_item *item, uint64_t intent_id);
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00fda2e8e738..f589157059d2 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1975,6 +1975,8 @@ xlog_print_tic_res(
 	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
 	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
 	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
+	    REG_TYPE_STR(SXI_FORMAT, "sxi_format"),
+	    REG_TYPE_STR(SXD_FORMAT, "sxd_format"),
 	};
 	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
 #undef REG_TYPE_STR
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 1c836dcf3e3e..4f990a45291b 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1851,6 +1851,9 @@ xlog_intent_for_type(
 	case XFS_LI_BUI:
 	case XFS_LI_BUD:
 		return &xlog_recover_bmap_type;
+	case XFS_LI_SXI:
+	case XFS_LI_SXD:
+		return &xlog_recover_swapext_type;
 	default:
 		return NULL;
 	}
@@ -1865,6 +1868,7 @@ xlog_is_intent_done_item(
 	case XFS_LI_RUD:
 	case XFS_LI_CUD:
 	case XFS_LI_BUD:
+	case XFS_LI_SXD:
 		return true;
 	default:
 		return false;
@@ -1917,6 +1921,8 @@ xlog_item_for_type(
 	case XFS_LI_CUD:
 	case XFS_LI_BUI:
 	case XFS_LI_BUD:
+	case XFS_LI_SXI:
+	case XFS_LI_SXD:
 		return &xlog_intent_item_type;
 	case XFS_LI_INODE:
 		return &xlog_inode_item_type;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 42d82c9d2a1d..206db91d113f 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -35,6 +35,7 @@
 #include "xfs_refcount_item.h"
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
+#include "xfs_swapext_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -2075,8 +2076,24 @@ xfs_init_zones(void)
 	if (!xfs_bui_zone)
 		goto out_destroy_bud_zone;
 
+	xfs_sxd_zone = kmem_cache_create("xfs_sxd_item",
+					 sizeof(struct xfs_sxd_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxd_zone)
+		goto out_destroy_bui_zone;
+
+	xfs_sxi_zone = kmem_cache_create("xfs_sxi_item",
+					 sizeof(struct xfs_sxi_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxi_zone)
+		goto out_destroy_sxd_zone;
+
 	return 0;
 
+ out_destroy_sxd_zone:
+	kmem_cache_destroy(xfs_sxd_zone);
+ out_destroy_bui_zone:
+	kmem_cache_destroy(xfs_bui_zone);
  out_destroy_bud_zone:
 	kmem_cache_destroy(xfs_bud_zone);
  out_destroy_cui_zone:
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
new file mode 100644
index 000000000000..63ba43e5c3bb
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.c
@@ -0,0 +1,365 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2020 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_shared.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_swapext_item.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_log_priv.h"
+#include "xfs_log_recover.h"
+
+kmem_zone_t	*xfs_sxi_zone;
+kmem_zone_t	*xfs_sxd_zone;
+
+static inline struct xfs_sxi_log_item *SXI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxi_log_item, sxi_item);
+}
+
+STATIC void
+xfs_sxi_item_free(
+	struct xfs_sxi_log_item	*ilip)
+{
+	kmem_cache_free(xfs_sxi_zone, ilip);
+}
+
+/*
+ * Freeing the SXI requires that we remove it from the AIL if it has already
+ * been placed there. However, the SXI may not yet have been placed in the AIL
+ * when called by xfs_sxi_release() from SXD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the SXI.
+ */
+STATIC void
+xfs_sxi_release(
+	struct xfs_sxi_log_item	*ilip)
+{
+	ASSERT(atomic_read(&ilip->sxi_refcount) > 0);
+	if (atomic_dec_and_test(&ilip->sxi_refcount)) {
+		xfs_trans_ail_remove(&ilip->sxi_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_sxi_item_free(ilip);
+	}
+}
+
+
+STATIC void
+xfs_sxi_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxi_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given sxi log item. We use only 1 iovec, and we point that
+ * at the sxi_log_format structure embedded in the sxi item.
+ * The format structure has a fixed size, so there are no
+ * extent slots to check here.
+ */
+STATIC void
+xfs_sxi_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxi_log_item	*ilip = SXI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	ilip->sxi_format.sxi_type = XFS_LI_SXI;
+	ilip->sxi_format.sxi_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXI_FORMAT, &ilip->sxi_format,
+			sizeof(struct xfs_sxi_log_format));
+}
+
+/*
+ * The unpin operation is the last place an SXI is manipulated in the log. It is
+ * either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the SXI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the SXI to either construct
+ * and commit the SXD or drop the SXD's reference in the event of error. Simply
+ * drop the log's SXI reference now that the log is done with it.
+ */
+STATIC void
+xfs_sxi_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_sxi_log_item	*ilip = SXI_ITEM(lip);
+
+	xfs_sxi_release(ilip);
+}
+
+/*
+ * The SXI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an SXD isn't going to be
+ * constructed and thus we free the SXI here directly.
+ */
+STATIC void
+xfs_sxi_item_release(
+	struct xfs_log_item	*lip)
+{
+	xfs_sxi_release(SXI_ITEM(lip));
+}
+
+static const struct xfs_item_ops xfs_sxi_item_ops = {
+	.iop_size	= xfs_sxi_item_size,
+	.iop_format	= xfs_sxi_item_format,
+	.iop_unpin	= xfs_sxi_item_unpin,
+	.iop_release	= xfs_sxi_item_release,
+};
+
+/*
+ * Allocate and initialize an sxi item.
+ */
+STATIC struct xfs_sxi_log_item *
+xfs_sxi_init(
+	struct xfs_mount		*mp)
+
+{
+	struct xfs_sxi_log_item		*ilip;
+
+	ilip = kmem_zone_zalloc(xfs_sxi_zone, 0);
+
+	xfs_log_item_init(mp, &ilip->sxi_item, XFS_LI_SXI, &xfs_sxi_item_ops);
+	ilip->sxi_format.sxi_id = (uintptr_t)(void *)ilip;
+	atomic_set(&ilip->sxi_refcount, 2);
+
+	return ilip;
+}
+
+static inline struct xfs_sxd_log_item *SXD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxd_log_item, sxd_item);
+}
+
+STATIC void
+xfs_sxd_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxd_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the
+ * given sxd log item. We use only 1 iovec, and we point that
+ * at the sxd_log_format structure embedded in the sxd item.
+ * The format structure has a fixed size, so there are no
+ * extent slots to check here.
+ */
+STATIC void
+xfs_sxd_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxd_log_item	*dlip = SXD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	dlip->sxd_format.sxd_type = XFS_LI_SXD;
+	dlip->sxd_format.sxd_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXD_FORMAT, &dlip->sxd_format,
+			sizeof(struct xfs_sxd_log_format));
+}
+
+/*
+ * The SXD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the SXI and free the
+ * SXD.
+ */
+STATIC void
+xfs_sxd_item_release(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_sxd_log_item	*dlip = SXD_ITEM(lip);
+
+	xfs_sxi_release(dlip->sxd_intent_log_item);
+	kmem_cache_free(xfs_sxd_zone, dlip);
+}
+
+static const struct xfs_item_ops xfs_sxd_item_ops = {
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.iop_size	= xfs_sxd_item_size,
+	.iop_format	= xfs_sxd_item_format,
+	.iop_release	= xfs_sxd_item_release,
+};
+
+/*
+ * Process a swapext update intent item that was recovered from the log.
+ * We need to update some inode's bmbt.
+ */
+STATIC int
+xfs_sxi_recover(
+	struct xfs_mount		*mp,
+	struct xfs_defer_freezer	**dffp,
+	struct xfs_sxi_log_item		*ilip)
+{
+	return -EFSCORRUPTED;
+}
+
+/*
+ * Copy an SXI format buffer from the given buf, and into the destination
+ * SXI format structure.  The SXI/SXD items were designed not to need any
+ * special alignment handling.
+ */
+static int
+xfs_sxi_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_sxi_log_format	*dst_sxi_fmt)
+{
+	struct xfs_sxi_log_format	*src_sxi_fmt;
+	size_t				len;
+
+	src_sxi_fmt = buf->i_addr;
+	len = sizeof(struct xfs_sxi_log_format);
+
+	if (buf->i_len == len) {
+		memcpy(dst_sxi_fmt, src_sxi_fmt, len);
+		return 0;
+	}
+	XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, NULL);
+	return -EFSCORRUPTED;
+}
+
+/*
+ * This routine is called to create an in-core extent swapext update
+ * item from the sxi format structure which was logged on disk.
+ * It allocates an in-core sxi, copies the extents from the format
+ * structure into it, and adds the sxi to the AIL with the given
+ * LSN.
+ */
+STATIC int
+xlog_recover_sxi(
+	struct xlog			*log,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_sxi_log_item		*ilip;
+	struct xfs_sxi_log_format	*sxi_formatp;
+
+	sxi_formatp = item->ri_buf[0].i_addr;
+
+	if (sxi_formatp->__pad != 0) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+	ilip = xfs_sxi_init(mp);
+	error = xfs_sxi_copy_format(&item->ri_buf[0], &ilip->sxi_format);
+	if (error) {
+		xfs_sxi_item_free(ilip);
+		return error;
+	}
+	xlog_recover_insert_ail(log, &ilip->sxi_item, lsn);
+	xfs_sxi_release(ilip);
+	return 0;
+}
+
+STATIC bool
+xlog_release_sxi(
+	struct xlog		*log,
+	struct xfs_log_item	*lip,
+	uint64_t		intent_id)
+{
+	struct xfs_sxi_log_item	*ilip = SXI_ITEM(lip);
+	struct xfs_ail		*ailp = log->l_ailp;
+
+	if (ilip->sxi_format.sxi_id == intent_id) {
+		/*
+		 * Drop the SXD reference to the SXI. This
+		 * removes the SXI from the AIL and frees it.
+		 */
+		spin_unlock(&ailp->ail_lock);
+		xfs_sxi_release(ilip);
+		spin_lock(&ailp->ail_lock);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * This routine is called when an SXD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding SXI if it
+ * was still in the log. To do this it searches the AIL for the SXI with an id
+ * equal to that in the SXD format structure. If we find it we drop the SXD
+ * reference, which removes the SXI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_sxd(
+	struct xlog			*log,
+	struct xlog_recover_item	*item)
+{
+	struct xfs_sxd_log_format	*sxd_formatp;
+
+	sxd_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_sxd_log_format)) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xlog_recover_release_intent(log, XFS_LI_SXI, sxd_formatp->sxd_sxi_id,
+			 xlog_release_sxi);
+	return 0;
+}
+
+/* Recover the SXI if necessary. */
+STATIC int
+xlog_recover_process_sxi(
+	struct xlog			*log,
+	struct xfs_defer_freezer	**dffp,
+	struct xfs_log_item		*lip)
+{
+	struct xfs_ail			*ailp = log->l_ailp;
+	struct xfs_sxi_log_item		*ilip = SXI_ITEM(lip);
+	int				error;
+
+	/*
+	 * Skip SXIs that we've already processed.
+	 */
+	if (test_bit(XFS_SXI_RECOVERED, &ilip->sxi_flags))
+		return 0;
+
+	spin_unlock(&ailp->ail_lock);
+	error = xfs_sxi_recover(log->l_mp, dffp, ilip);
+	spin_lock(&ailp->ail_lock);
+
+	return error;
+}
+
+/* Release the SXI since we're cancelling everything. */
+STATIC void
+xlog_recover_cancel_sxi(
+	struct xfs_log_item		*lip)
+{
+	xfs_sxi_release(SXI_ITEM(lip));
+}
+
+const struct xlog_recover_intent_type xlog_recover_swapext_type = {
+	.recover_intent		= xlog_recover_sxi,
+	.recover_done		= xlog_recover_sxd,
+	.process_intent		= xlog_recover_process_sxi,
+	.cancel_intent		= xlog_recover_cancel_sxi,
+};
diff --git a/fs/xfs/xfs_swapext_item.h b/fs/xfs/xfs_swapext_item.h
new file mode 100644
index 000000000000..63e2c15d117d
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.h
@@ -0,0 +1,67 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2020 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef	__XFS_SWAPEXT_ITEM_H__
+#define	__XFS_SWAPEXT_ITEM_H__
+
+/*
+ * The extent swapping intent item helps us perform atomic extent swaps between
+ * two inode forks.  It does this by tracking the range of logical offsets that
+ * still need to be swapped, and relogs as progress happens.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * rest of the extent swaps.
+ */
+
+/* kernel only SXI/SXD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_SXI_MAX_FAST_EXTENTS	1
+
+/*
+ * Define SXI flag bits. Manipulated by set/clear/test_bit operators.
+ */
+#define	XFS_SXI_RECOVERED		1
+
+/*
+ * This is the "swapext update intent" log item.  It is used to log the fact
+ * that we are swapping extents between two files.  It is used in conjunction
+ * with the "swapext update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_sxi_log_item {
+	struct xfs_log_item		sxi_item;
+	atomic_t			sxi_refcount;
+	unsigned long			sxi_flags;
+	struct xfs_sxi_log_format	sxi_format;
+};
+
+/*
+ * This is the "swapext update done" log item.  It is used to log the fact that
+ * some extent swapping mentioned in an earlier sxi item has been performed.
+ */
+struct xfs_sxd_log_item {
+	struct xfs_log_item		sxd_item;
+	struct xfs_sxi_log_item		*sxd_intent_log_item;
+	struct xfs_sxd_log_format	sxd_format;
+};
+
+extern struct kmem_zone	*xfs_sxi_zone;
+extern struct kmem_zone	*xfs_sxd_zone;
+
+#endif	/* __XFS_SWAPEXT_ITEM_H__ */



* [PATCH 09/18] xfs: create deferred log items for extent swapping
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (7 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 08/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 10/18] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Now that we've created the skeleton of a log intent item to track and
restart extent swap operations, add the upper level logic to commit
intent items and turn them into concrete work recorded in the log.  We
use the deferred item "multihop" feature that was introduced a few
patches ago to constrain the number of active swap operations to one per
thread.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/Makefile             |    1 
 fs/xfs/libxfs/xfs_bmap.h    |   13 +
 fs/xfs/libxfs/xfs_defer.c   |    1 
 fs/xfs/libxfs/xfs_defer.h   |    2 
 fs/xfs/libxfs/xfs_swapext.c |  430 +++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h |   57 ++++++
 fs/xfs/xfs_swapext_item.c   |  336 ++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.c          |    1 
 fs/xfs/xfs_trace.h          |   49 +++++
 9 files changed, 885 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.c
 create mode 100644 fs/xfs/libxfs/xfs_swapext.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 27b4bd5c8ffe..6f8d8f2f8a8c 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -51,6 +51,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
+				   xfs_swapext.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_inode.o \
 				   xfs_trans_resv.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 3367df499ac8..215ce1b8c736 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -158,6 +158,13 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
 	{ BMAP_ATTRFORK,	"ATTR" }, \
 	{ BMAP_COWFORK,		"COW" }
 
+/* Return true if the extent is an allocated extent, written or not. */
+static inline bool xfs_bmap_is_mapped_extent(struct xfs_bmbt_irec *irec)
+{
+	return irec->br_startblock != HOLESTARTBLOCK &&
+		irec->br_startblock != DELAYSTARTBLOCK &&
+		!isnullstartblock(irec->br_startblock);
+}
 
 /*
  * Return true if the extent is a real, allocated extent, or false if it is  a
@@ -165,10 +172,8 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
  */
 static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
 {
-	return irec->br_state != XFS_EXT_UNWRITTEN &&
-		irec->br_startblock != HOLESTARTBLOCK &&
-		irec->br_startblock != DELAYSTARTBLOCK &&
-		!isnullstartblock(irec->br_startblock);
+	return xfs_bmap_is_mapped_extent(irec) &&
+		irec->br_state != XFS_EXT_UNWRITTEN;
 }
 
 /*
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index f53e3ce858eb..00bd0e478829 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -184,6 +184,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
 	[XFS_DEFER_OPS_TYPE_RMAP]	= &xfs_rmap_update_defer_type,
 	[XFS_DEFER_OPS_TYPE_FREE]	= &xfs_extent_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_AGFL_FREE]	= &xfs_agfl_free_defer_type,
+	[XFS_DEFER_OPS_TYPE_SWAPEXT]	= &xfs_swapext_defer_type,
 };
 
 /*
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index e64b577a9b95..226db6e5a1b0 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -18,6 +18,7 @@ enum xfs_defer_ops_type {
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_AGFL_FREE,
+	XFS_DEFER_OPS_TYPE_SWAPEXT,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
@@ -65,6 +66,7 @@ extern const struct xfs_defer_op_type xfs_refcount_update_defer_type;
 extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
+extern const struct xfs_defer_op_type xfs_swapext_defer_type;
 
 /*
  * Deferred operation freezer.  This structure enables a dfops user to detach
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
new file mode 100644
index 000000000000..2eff48453070
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -0,0 +1,430 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2020 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_quota.h"
+#include "xfs_swapext.h"
+#include "xfs_trace.h"
+
+/* Information to help us reset reflink flag / CoW fork state after a swap. */
+
+/* Are we swapping the data fork? */
+#define XFS_SX_REFLINK_DATAFORK		(1U << 0)
+
+/* Can we swap the flags? */
+#define XFS_SX_REFLINK_SWAPFLAGS	(1U << 1)
+
+/* Previous state of the two inodes' reflink flags. */
+#define XFS_SX_REFLINK_IP1_REFLINK	(1U << 2)
+#define XFS_SX_REFLINK_IP2_REFLINK	(1U << 3)
+
+
+/*
+ * Prepare both inodes' reflink state for an extent swap, and return our
+ * findings so that xfs_swapext_reflink_finish can deal with the aftermath.
+ */
+unsigned int
+xfs_swapext_reflink_prep(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2,
+	int			whichfork,
+	xfs_fileoff_t		startoff1,
+	xfs_fileoff_t		startoff2,
+	xfs_filblks_t		blockcount)
+{
+	struct xfs_mount	*mp = ip1->i_mount;
+	unsigned int		rs = 0;
+
+	if (whichfork != XFS_DATA_FORK)
+		return 0;
+
+	/*
+	 * If either file has shared blocks and we're swapping data forks, we
+	 * must flag the other file as having shared blocks so that we get the
+	 * shared-block rmap functions if we need to fix up the rmaps.  The
+	 * flags will be switched for real by xfs_swapext_reflink_finish.
+	 */
+	if (xfs_is_reflink_inode(ip1))
+		rs |= XFS_SX_REFLINK_IP1_REFLINK;
+	if (xfs_is_reflink_inode(ip2))
+		rs |= XFS_SX_REFLINK_IP2_REFLINK;
+
+	if (rs & XFS_SX_REFLINK_IP1_REFLINK)
+		ip2->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+	if (rs & XFS_SX_REFLINK_IP2_REFLINK)
+		ip1->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+	/*
+	 * If either file had the reflink flag set before; and the two files'
+	 * reflink state was different; and we're swapping the entirety of both
+	 * files, then we can exchange the reflink flags at the end.
+	 * Otherwise, we propagate the reflink flag from either file to the
+	 * other file.
+	 *
+	 * Note that we've only set the _REFLINK flags of the reflink state, so
+	 * we can cheat and use hweight32 for the reflink flag test.
+	 *
+	 */
+	if (hweight32(rs) == 1 && startoff1 == 0 && startoff2 == 0 &&
+	    blockcount == XFS_B_TO_FSB(mp, ip1->i_d.di_size) &&
+	    blockcount == XFS_B_TO_FSB(mp, ip2->i_d.di_size))
+		rs |= XFS_SX_REFLINK_SWAPFLAGS;
+
+	rs |= XFS_SX_REFLINK_DATAFORK;
+	return rs;
+}
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them.  If there's a CoW fork and it
+ * has extents in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low on space.
+ */
+static inline void
+xfs_swapext_ensure_cowfork(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*cfork;
+
+	if (xfs_is_reflink_inode(ip))
+		xfs_ifork_init_cow(ip);
+
+	cfork = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	if (!cfork)
+		return;
+	if (cfork->if_bytes > 0)
+		xfs_inode_set_cowblocks_tag(ip);
+	else
+		xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/*
+ * Set both inodes' ondisk reflink flags to their final state and ensure that
+ * the incore state is ready to go.
+ */
+void
+xfs_swapext_reflink_finish(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2,
+	unsigned int		rs)
+{
+	if (!(rs & XFS_SX_REFLINK_DATAFORK))
+		return;
+
+	if (rs & XFS_SX_REFLINK_SWAPFLAGS) {
+		/* Exchange the reflink inode flags and log them. */
+		ip1->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		if (rs & XFS_SX_REFLINK_IP2_REFLINK)
+			ip1->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+		ip2->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		if (rs & XFS_SX_REFLINK_IP1_REFLINK)
+			ip2->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+		xfs_trans_log_inode(tp, ip1, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, ip2, XFS_ILOG_CORE);
+	}
+
+	xfs_swapext_ensure_cowfork(ip1);
+	xfs_swapext_ensure_cowfork(ip2);
+}
+
+/* Schedule an atomic extent swap. */
+static inline void
+xfs_swapext_schedule(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	trace_xfs_swapext_defer(tp->t_mountp, sxi);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->si_list);
+}
+
+/* Reschedule an atomic extent swap on behalf of log recovery. */
+void
+xfs_swapext_reschedule(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_swapext_intent	*new_sxi;
+
+	new_sxi = kmem_alloc(sizeof(struct xfs_swapext_intent), KM_NOFS);
+	memcpy(new_sxi, sxi, sizeof(*new_sxi));
+	INIT_LIST_HEAD(&new_sxi->si_list);
+
+	xfs_swapext_schedule(tp, new_sxi);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never map extents
+ * into the file past EOF.  This is crucial so that log recovery won't get
+ * confused by the sudden appearance of post-eof extents.
+ */
+STATIC void
+xfs_swapext_update_size(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fsize_t		new_isize)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_fsize_t		len;
+
+	if (new_isize < 0)
+		return;
+
+	len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+		  new_isize);
+
+	if (len <= ip->i_d.di_size)
+		return;
+
+	trace_xfs_swapext_update_inode_size(ip, len);
+
+	ip->i_d.di_size = len;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Do we have more work to do to finish this operation? */
+bool
+xfs_swapext_has_more_work(
+	struct xfs_swapext_intent	*sxi)
+{
+	return sxi->si_blockcount > 0;
+}
+
+/* Finish one extent swap, possibly log more. */
+int
+xfs_swapext_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	int				whichfork;
+	int				nimaps;
+	int				bmap_flags;
+	int				error;
+
+	whichfork = (sxi->si_flags & XFS_SWAP_EXTENT_ATTR_FORK) ?
+			XFS_ATTR_FORK : XFS_DATA_FORK;
+	bmap_flags = xfs_bmapi_aflag(whichfork);
+
+	while (sxi->si_blockcount > 0) {
+		int64_t		ip1_delta = 0, ip2_delta = 0;
+
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->si_ip1, sxi->si_startoff1,
+				sxi->si_blockcount, &irec1, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec1.br_startblock == DELAYSTARTBLOCK ||
+		    irec1.br_startoff != sxi->si_startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->si_ip2, sxi->si_startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec2.br_startblock == DELAYSTARTBLOCK ||
+		    irec2.br_startoff != sxi->si_startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		trace_xfs_swapext_extent1(sxi->si_ip1, &irec1);
+		trace_xfs_swapext_extent2(sxi->si_ip2, &irec2);
+
+		/*
+		 * Two extents mapped to the same physical block must not have
+		 * different states; that's filesystem corruption.  Move on to
+		 * the next extent if they're both holes or both the same
+		 * physical extent.
+		 */
+		if (irec1.br_startblock == irec2.br_startblock) {
+			if (irec1.br_state != irec2.br_state)
+				return -EFSCORRUPTED;
+
+			sxi->si_startoff1 += irec1.br_blockcount;
+			sxi->si_startoff2 += irec1.br_blockcount;
+			sxi->si_blockcount -= irec1.br_blockcount;
+			continue;
+		}
+
+		/* Update quota accounting. */
+		if (xfs_bmap_is_mapped_extent(&irec1)) {
+			ip1_delta -= irec1.br_blockcount;
+			ip2_delta += irec1.br_blockcount;
+		}
+		if (xfs_bmap_is_mapped_extent(&irec2)) {
+			ip1_delta += irec2.br_blockcount;
+			ip2_delta -= irec2.br_blockcount;
+		}
+
+		if (ip1_delta)
+			xfs_trans_mod_dquot_byino(tp, sxi->si_ip1,
+					XFS_TRANS_DQ_BCOUNT, ip1_delta);
+		if (ip2_delta)
+			xfs_trans_mod_dquot_byino(tp, sxi->si_ip2,
+					XFS_TRANS_DQ_BCOUNT, ip2_delta);
+
+		/* Remove both mappings. */
+		xfs_bmap_unmap_extent(tp, sxi->si_ip1, whichfork, &irec1);
+		xfs_bmap_unmap_extent(tp, sxi->si_ip2, whichfork, &irec2);
+
+		/*
+		 * Re-add both mappings.  We swap the file offsets between the
+		 * two maps and add the opposite map, which has the effect of
+		 * filling the logical offsets we just unmapped, but with
+		 * the physical mapping information swapped.
+		 */
+		swap(irec1.br_startoff, irec2.br_startoff);
+		xfs_bmap_map_extent(tp, sxi->si_ip1, whichfork, &irec2);
+		xfs_bmap_map_extent(tp, sxi->si_ip2, whichfork, &irec1);
+
+		/* Make sure we're not mapping extents past EOF. */
+		if (whichfork == XFS_DATA_FORK) {
+			xfs_swapext_update_size(tp, sxi->si_ip1, &irec2,
+					sxi->si_isize1);
+			xfs_swapext_update_size(tp, sxi->si_ip2, &irec1,
+					sxi->si_isize2);
+		}
+
+		/*
+		 * Advance our cursor and exit.  The caller (either defer ops
+		 * or log recovery) will log the SXD item, and if
+		 * sxi->si_blockcount is nonzero, it will log a new SXI item
+		 * for the remainder and call us back.
+		 */
+		sxi->si_startoff1 += irec1.br_blockcount;
+		sxi->si_startoff2 += irec1.br_blockcount;
+		sxi->si_blockcount -= irec1.br_blockcount;
+		break;
+	}
+
+	/*
+	 * If we've reached the end of the remap operation and the caller
+	 * wanted us to exchange the sizes, do that now.
+	 */
+	if (sxi->si_blockcount == 0 &&
+	    (sxi->si_flags & XFS_SWAP_EXTENT_SET_SIZES)) {
+		sxi->si_ip1->i_d.di_size = sxi->si_isize1;
+		sxi->si_ip2->i_d.di_size = sxi->si_isize2;
+		xfs_trans_log_inode(tp, sxi->si_ip1, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, sxi->si_ip2, XFS_ILOG_CORE);
+	}
+
+	if (xfs_swapext_has_more_work(sxi))
+		trace_xfs_swapext_defer(tp->t_mountp, sxi);
+	return 0;
+}
+
+static void
+xfs_swapext_init_intent(
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	int				whichfork,
+	xfs_fileoff_t			startoff1,
+	xfs_fileoff_t			startoff2,
+	xfs_filblks_t			blockcount,
+	unsigned int			flags)
+{
+	INIT_LIST_HEAD(&sxi->si_list);
+	sxi->si_flags = 0;
+	if (whichfork == XFS_ATTR_FORK)
+		sxi->si_flags |= XFS_SWAP_EXTENT_ATTR_FORK;
+	sxi->si_isize1 = sxi->si_isize2 = -1;
+	if (whichfork == XFS_DATA_FORK && (flags & XFS_SWAPEXT_SET_SIZES)) {
+		sxi->si_flags |= XFS_SWAP_EXTENT_SET_SIZES;
+		sxi->si_isize1 = ip2->i_d.di_size;
+		sxi->si_isize2 = ip1->i_d.di_size;
+	}
+	sxi->si_ip1 = ip1;
+	sxi->si_ip2 = ip2;
+	sxi->si_startoff1 = startoff1;
+	sxi->si_startoff2 = startoff2;
+	sxi->si_blockcount = blockcount;
+}
+
+/*
+ * Atomically swap a range of extents from one inode to another.
+ *
+ * The caller must ensure that both inodes are joined to the transaction and
+ * ILOCKd; they will still be joined to the transaction at exit.
+ */
+int
+xfs_swapext_atomic(
+	struct xfs_trans		**tpp,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	int				whichfork,
+	xfs_fileoff_t			startoff1,
+	xfs_fileoff_t			startoff2,
+	xfs_filblks_t			blockcount,
+	unsigned int			flags)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			state;
+	int				error;
+
+	ASSERT(xfs_isilocked(ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(ip2, XFS_ILOCK_EXCL));
+	ASSERT(whichfork != XFS_COW_FORK);
+	ASSERT(whichfork == XFS_DATA_FORK || !(flags & XFS_SWAPEXT_SET_SIZES));
+
+	state = xfs_swapext_reflink_prep(ip1, ip2, whichfork, startoff1,
+			startoff2, blockcount);
+
+	sxi = kmem_alloc(sizeof(struct xfs_swapext_intent), KM_NOFS);
+	xfs_swapext_init_intent(sxi, ip1, ip2, whichfork, startoff1, startoff2,
+			blockcount, flags);
+	xfs_swapext_schedule(*tpp, sxi);
+
+	error = xfs_defer_finish(tpp);
+	if (error)
+		return error;
+
+	xfs_swapext_reflink_finish(*tpp, ip1, ip2, state);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
new file mode 100644
index 000000000000..af1893f37d39
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -0,0 +1,57 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2020 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <darrick.wong@oracle.com>
+ */
+#ifndef __XFS_SWAPEXT_H_
+#define __XFS_SWAPEXT_H_ 1
+
+/*
+ * In-core information about an extent swap request between ranges of two
+ * inodes.
+ */
+struct xfs_swapext_intent {
+	/* List of other incore deferred work. */
+	struct list_head	si_list;
+
+	/* The two inodes we're swapping. */
+	union {
+		struct xfs_inode *si_ip1;
+		xfs_ino_t	si_ino1;
+	};
+	union {
+		struct xfs_inode *si_ip2;
+		xfs_ino_t	si_ino2;
+	};
+
+	/* File offset range information. */
+	xfs_fileoff_t		si_startoff1;
+	xfs_fileoff_t		si_startoff2;
+	xfs_filblks_t		si_blockcount;
+	uint64_t		si_flags;
+
+	/* Set these file sizes after the operation, unless negative. */
+	xfs_fsize_t		si_isize1;
+	xfs_fsize_t		si_isize2;
+};
+
+bool xfs_swapext_has_more_work(struct xfs_swapext_intent *sxi);
+
+unsigned int xfs_swapext_reflink_prep(struct xfs_inode *ip1,
+		struct xfs_inode *ip2, int whichfork, xfs_fileoff_t startoff1,
+		xfs_fileoff_t startoff2, xfs_filblks_t blockcount);
+void xfs_swapext_reflink_finish(struct xfs_trans *tp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2, unsigned int reflink_state);
+
+void xfs_swapext_reschedule(struct xfs_trans *tpp,
+		const struct xfs_swapext_intent *sxi_state);
+int xfs_swapext_finish_one(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi_state);
+
+#define XFS_SWAPEXT_SET_SIZES		(1U << 0)
+int xfs_swapext_atomic(struct xfs_trans **tpp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2, int whichfork, xfs_fileoff_t startoff1,
+		xfs_fileoff_t startoff2, xfs_filblks_t blockcount,
+		unsigned int flags);
+
+#endif /* __XFS_SWAPEXT_H_ */
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
index 63ba43e5c3bb..fadd522c6841 100644
--- a/fs/xfs/xfs_swapext_item.c
+++ b/fs/xfs/xfs_swapext_item.c
@@ -16,9 +16,11 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_swapext_item.h"
+#include "xfs_swapext.h"
 #include "xfs_log.h"
 #include "xfs_bmap.h"
 #include "xfs_icache.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
@@ -205,6 +207,240 @@ static const struct xfs_item_ops xfs_sxd_item_ops = {
 	.iop_release	= xfs_sxd_item_release,
 };
 
+static struct xfs_sxd_log_item *
+xfs_trans_get_sxd(
+	struct xfs_trans		*tp,
+	struct xfs_sxi_log_item		*ilip)
+{
+	struct xfs_sxd_log_item		*dlip;
+
+	dlip = kmem_zone_zalloc(xfs_sxd_zone, 0);
+	xfs_log_item_init(tp->t_mountp, &dlip->sxd_item, XFS_LI_SXD,
+			  &xfs_sxd_item_ops);
+	dlip->sxd_intent_log_item = ilip;
+	dlip->sxd_format.sxd_sxi_id = ilip->sxi_format.sxi_id;
+
+	xfs_trans_add_item(tp, &dlip->sxd_item);
+	return dlip;
+}
+
+/*
+ * Finish a swapext update and log it to the SXD. Note that the
+ * transaction is marked dirty regardless of whether the swapext update
+ * succeeds or fails to support the SXI/SXD lifecycle rules.
+ */
+static int
+xfs_trans_log_finish_swapext_update(
+	struct xfs_trans		*tp,
+	struct xfs_sxd_log_item		*dlip,
+	struct xfs_swapext_intent	*sxi)
+{
+	int				error;
+
+	error = xfs_swapext_finish_one(tp, sxi);
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the SXI and frees the SXD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &dlip->sxd_item.li_flags);
+
+	return error;
+}
+
+/* Sort swapext intents by inode. */
+static int
+xfs_swapext_diff_items(
+	void				*priv,
+	struct list_head		*a,
+	struct list_head		*b)
+{
+	struct xfs_swapext_intent	*sa;
+	struct xfs_swapext_intent	*sb;
+
+	sa = container_of(a, struct xfs_swapext_intent, si_list);
+	sb = container_of(b, struct xfs_swapext_intent, si_list);
+	return sa->si_ip1->i_ino - sb->si_ip1->i_ino;
+}
+
+/* Get an SXI. */
+STATIC void *
+xfs_swapext_create_intent(
+	struct xfs_trans		*tp,
+	unsigned int			count)
+{
+	struct xfs_sxi_log_item		*ilip;
+
+	ASSERT(count == XFS_SXI_MAX_FAST_EXTENTS);
+	ASSERT(tp != NULL);
+
+	ilip = xfs_sxi_init(tp->t_mountp);
+	ASSERT(ilip != NULL);
+
+	/*
+	 * Get a log_item_desc to point at the new item.
+	 */
+	xfs_trans_add_item(tp, &ilip->sxi_item);
+	return ilip;
+}
+
+/* Log swapext updates in the intent item. */
+STATIC void
+xfs_swapext_log_item(
+	struct xfs_trans		*tp,
+	void				*intent,
+	struct list_head		*item)
+{
+	struct xfs_sxi_log_item		*ilip = intent;
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_swap_extent		*se;
+
+	ASSERT(!test_bit(XFS_LI_DIRTY, &ilip->sxi_item.li_flags));
+
+	sxi = container_of(item, struct xfs_swapext_intent, si_list);
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &ilip->sxi_item.li_flags);
+
+	se = &ilip->sxi_format.sxi_extent;
+	se->se_inode1 = sxi->si_ip1->i_ino;
+	se->se_inode2 = sxi->si_ip2->i_ino;
+	se->se_startoff1 = sxi->si_startoff1;
+	se->se_startoff2 = sxi->si_startoff2;
+	se->se_blockcount = sxi->si_blockcount;
+	se->se_isize1 = sxi->si_isize1;
+	se->se_isize2 = sxi->si_isize2;
+	se->se_flags = sxi->si_flags;
+}
+
+/* Get an SXD so we can process all the deferred swapext updates. */
+STATIC void *
+xfs_swapext_create_done(
+	struct xfs_trans		*tp,
+	void				*intent,
+	unsigned int			count)
+{
+	return xfs_trans_get_sxd(tp, intent);
+}
+
+/* Process a deferred swapext update. */
+STATIC int
+xfs_swapext_finish_item(
+	struct xfs_trans		*tp,
+	struct list_head		*item,
+	void				*done_item,
+	void				**state)
+{
+	struct xfs_swapext_intent	*sxi;
+	int				error;
+
+	sxi = container_of(item, struct xfs_swapext_intent, si_list);
+
+	/*
+	 * Swap one more extent between the two files.  If there's still more
+	 * work to do, we want to requeue ourselves after all other pending
+	 * deferred operations have finished.  This includes all of the dfops
+	 * that we queued directly as well as any new ones created in the
+	 * process of finishing the others.  Doing so prevents us from queuing
+	 * a large number of SXI log items in kernel memory, which in turn
+	 * prevents us from pinning the tail of the log (while logging those
+	 * new SXI items) until the first SXI items can be processed.
+	 */
+	error = xfs_trans_log_finish_swapext_update(tp, done_item, sxi);
+	if (!error && xfs_swapext_has_more_work(sxi))
+		return -EMULTIHOP;
+
+	kmem_free(sxi);
+	return error;
+}
+
+/* Abort all pending SXIs. */
+STATIC void
+xfs_swapext_abort_intent(
+	void				*intent)
+{
+	xfs_sxi_release(intent);
+}
+
+/* Cancel a deferred swapext update. */
+STATIC void
+xfs_swapext_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+
+	sxi = container_of(item, struct xfs_swapext_intent, si_list);
+	kmem_free(sxi);
+}
+
+/* Prepare a deferred swapext item for freezing by detaching the inodes. */
+STATIC int
+xfs_swapext_freeze_item(
+	struct xfs_defer_freezer	*freezer,
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_inode		*ip;
+	int				error;
+
+	sxi = container_of(item, struct xfs_swapext_intent, si_list);
+
+	ip = sxi->si_ip1;
+	error = xfs_defer_freezer_ijoin(freezer, ip);
+	if (error)
+		return error;
+	sxi->si_ino1 = ip->i_ino;
+
+	ip = sxi->si_ip2;
+	error = xfs_defer_freezer_ijoin(freezer, ip);
+	if (error)
+		return error;
+	sxi->si_ino2 = ip->i_ino;
+
+	return 0;
+}
+
+/* Thaw a deferred swapext item by reattaching the inodes. */
+STATIC int
+xfs_swapext_thaw_item(
+	struct xfs_defer_freezer	*freezer,
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+	struct xfs_inode		*ip;
+
+	sxi = container_of(item, struct xfs_swapext_intent, si_list);
+
+	ip = xfs_defer_freezer_igrab(freezer, sxi->si_ino1);
+	if (!ip)
+		return -EFSCORRUPTED;
+	sxi->si_ip1 = ip;
+
+	ip = xfs_defer_freezer_igrab(freezer, sxi->si_ino2);
+	if (!ip)
+		return -EFSCORRUPTED;
+	sxi->si_ip2 = ip;
+
+	return 0;
+}
+
+const struct xfs_defer_op_type xfs_swapext_defer_type = {
+	.max_items	= XFS_SXI_MAX_FAST_EXTENTS,
+	.diff_items	= xfs_swapext_diff_items,
+	.create_intent	= xfs_swapext_create_intent,
+	.abort_intent	= xfs_swapext_abort_intent,
+	.log_item	= xfs_swapext_log_item,
+	.create_done	= xfs_swapext_create_done,
+	.finish_item	= xfs_swapext_finish_item,
+	.cancel_item	= xfs_swapext_cancel_item,
+	.freeze_item	= xfs_swapext_freeze_item,
+	.thaw_item	= xfs_swapext_thaw_item,
+};
+
 /*
  * Process a swapext update intent item that was recovered from the log.
  * We need to update some inode's bmbt.
@@ -215,7 +451,105 @@ xfs_sxi_recover(
 	struct xfs_defer_freezer	**dffp,
 	struct xfs_sxi_log_item		*ilip)
 {
-	return -EFSCORRUPTED;
+	struct xfs_swapext_intent	sxi;
+	struct xfs_swap_extent		*se;
+	struct xfs_sxd_log_item		*dlip;
+	struct xfs_trans		*tp;
+	int				error = 0;
+
+	ASSERT(!test_bit(XFS_SXI_RECOVERED, &ilip->sxi_flags));
+
+	/*
+	 * First check the validity of the extent described by the
+	 * SXI.  If anything is bad, then toss the SXI.
+	 */
+	se = &ilip->sxi_format.sxi_extent;
+	if (se->se_blockcount == 0 ||
+	    ilip->sxi_format.__pad != 0 ||
+	    !xfs_verify_ino(mp, se->se_inode1) ||
+	    !xfs_verify_ino(mp, se->se_inode2) ||
+	    (se->se_flags & ~XFS_SWAP_EXTENT_FLAGS) ||
+	    ((se->se_flags & XFS_SWAP_EXTENT_SET_SIZES) &&
+	     (se->se_isize1 < 0 || se->se_isize2 < 0))) {
+		/*
+		 * This will pull the SXI from the AIL and
+		 * free the memory associated with it.
+		 */
+		set_bit(XFS_SXI_RECOVERED, &ilip->sxi_flags);
+		xfs_sxi_release(ilip);
+		return -EFSCORRUPTED;
+	}
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
+	if (error)
+		return error;
+
+	dlip = xfs_trans_get_sxd(tp, ilip);
+	memset(&sxi, 0, sizeof(sxi));
+	INIT_LIST_HEAD(&sxi.si_list);
+
+	/* Grab both inodes and lock them. */
+	error = xfs_iget(mp, tp, se->se_inode1, 0, 0, &sxi.si_ip1);
+	if (error)
+		goto out_fail;
+	error = xfs_iget(mp, tp, se->se_inode2, 0, 0, &sxi.si_ip2);
+	if (error)
+		goto out_fail;
+
+	xfs_lock_two_inodes(sxi.si_ip1, XFS_ILOCK_EXCL,
+			    sxi.si_ip2, XFS_ILOCK_EXCL);
+	xfs_trans_ijoin(tp, sxi.si_ip1, 0);
+	xfs_trans_ijoin(tp, sxi.si_ip2, 0);
+
+	/*
+	 * Set IRECOVERY to prevent trimming of post-eof extents and freeing of
+	 * unlinked inodes until we're totally done processing files.
+	 */
+	if (VFS_I(sxi.si_ip1)->i_nlink == 0)
+		xfs_iflags_set(sxi.si_ip1, XFS_IRECOVERY);
+	if (VFS_I(sxi.si_ip2)->i_nlink == 0)
+		xfs_iflags_set(sxi.si_ip2, XFS_IRECOVERY);
+
+	/*
+	 * Construct the rest of our in-core swapext intent state so that we
+	 * can call the deferred operation functions to continue the work.
+	 */
+	sxi.si_flags = se->se_flags;
+	sxi.si_startoff1 = se->se_startoff1;
+	sxi.si_startoff2 = se->se_startoff2;
+	sxi.si_blockcount = se->se_blockcount;
+	sxi.si_isize1 = se->se_isize1;
+	sxi.si_isize2 = se->se_isize2;
+	error = xfs_trans_log_finish_swapext_update(tp, dlip, &sxi);
+	if (error)
+		goto out_fail;
+
+	/*
+	 * If there's more extent swapping to be done, we have to schedule that
+	 * as a separate deferred operation to be run after we've finished
+	 * replaying all of the intents we recovered from the log.
+	 */
+	if (xfs_swapext_has_more_work(&sxi))
+		xfs_swapext_reschedule(tp, &sxi);
+
+	set_bit(XFS_SXI_RECOVERED, &ilip->sxi_flags);
+	error = xlog_recover_trans_commit(tp, dffp);
+	goto out_rele;
+
+out_fail:
+	xfs_trans_cancel(tp);
+out_rele:
+	if (sxi.si_ip2) {
+		xfs_iunlock(sxi.si_ip2, XFS_ILOCK_EXCL);
+		xfs_irele(sxi.si_ip2);
+	}
+	if (sxi.si_ip1) {
+		xfs_iunlock(sxi.si_ip1, XFS_ILOCK_EXCL);
+		xfs_irele(sxi.si_ip1);
+	}
+	return error;
+
 }
 
 /*
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..f8cceacfb51d 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -30,6 +30,7 @@
 #include "xfs_fsmap.h"
 #include "xfs_btree_staging.h"
 #include "xfs_icache.h"
+#include "xfs_swapext.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 721e14f5c98b..af9c7bcb7a8a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -37,6 +37,7 @@ struct xfs_trans_res;
 struct xfs_inobt_rec_incore;
 union xfs_btree_ptr;
 struct xfs_eofblocks;
+struct xfs_swapext_intent;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -3207,6 +3208,9 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
+DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
 
 /* fsmap traces */
 DECLARE_EVENT_CLASS(xfs_fsmap_class,
@@ -3836,6 +3840,51 @@ DEFINE_NAMESPACE_EVENT(xfs_imeta_dir_created);
 DEFINE_NAMESPACE_EVENT(xfs_imeta_dir_unlinked);
 DEFINE_NAMESPACE_EVENT(xfs_imeta_dir_zap);
 
+#define XFS_SWAPEXT_FLAGS \
+	{ XFS_SWAP_EXTENT_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_SWAP_EXTENT_SET_SIZES,		"SETSIZES" }
+
+TRACE_EVENT(xfs_swapext_defer,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi),
+	TP_ARGS(mp, sxi),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(uint64_t, flags)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_fsize_t, isize1)
+		__field(xfs_fsize_t, isize2)
+		__field(xfs_fsize_t, new_isize1)
+		__field(xfs_fsize_t, new_isize2)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino1 = sxi->si_ip1->i_ino;
+		__entry->ino2 = sxi->si_ip2->i_ino;
+		__entry->flags = sxi->si_flags;
+		__entry->startoff1 = sxi->si_startoff1;
+		__entry->startoff2 = sxi->si_startoff2;
+		__entry->blockcount = sxi->si_blockcount;
+		__entry->isize1 = sxi->si_ip1->i_d.di_size;
+		__entry->isize2 = sxi->si_ip2->i_d.di_size;
+		__entry->new_isize1 = sxi->si_isize1;
+		__entry->new_isize2 = sxi->si_isize2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx isize1 %lld ino2 0x%llx isize2 %lld flags (%s) startoff1 %llu startoff2 %llu blockcount %llu newisize1 %lld newisize2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->isize1,
+		  __entry->ino2, __entry->isize2,
+		  __print_flags(__entry->flags, "|", XFS_SWAPEXT_FLAGS),
+		  __entry->startoff1,
+		  __entry->startoff2,
+		  __entry->blockcount,
+		  __entry->new_isize1, __entry->new_isize2)
+
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 10/18] xfs: refactor locking and unlocking two inodes against userspace IO
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (8 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 09/18] xfs: create deferred log items for extent swapping Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 11/18] xfs: add a ->swap_file_range handler Darrick J. Wong
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the two functions that we use to lock and unlock two inodes to
block userspace from initiating IO against a file, whether via system
calls or mmap activity.  Move them to xfs_inode.c since this functionality
won't be specific to reflink for much longer.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_file.c    |    2 +
 fs/xfs/xfs_inode.c   |   93 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h   |    3 ++
 fs/xfs/xfs_reflink.c |   85 +---------------------------------------------
 fs/xfs/xfs_reflink.h |    2 -
 5 files changed, 99 insertions(+), 86 deletions(-)
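
For readers unfamiliar with the ordering rule that
xfs_iolock_two_inodes_and_break_layout follows (lowest pointer value
goes first, and the same-inode case takes the lock only once), here is a
single-threaded userspace model.  It is a sketch under loud assumptions:
struct toy_inode, toy_lock_seq and the toy_* functions are invented, the
"lock" is a plain flag that asserts instead of blocking, and the
break_layout() retry loop is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Invented stand-in for an inode with a single exclusive lock. */
struct toy_inode {
	bool	locked;
	int	lock_seq;	/* order in which this lock was taken */
};

int toy_lock_seq;		/* bumped on every acquisition */

void toy_lock(struct toy_inode *ip)
{
	assert(!ip->locked);	/* a real lock would block here instead */
	ip->locked = true;
	ip->lock_seq = ++toy_lock_seq;
}

void toy_unlock(struct toy_inode *ip)
{
	assert(ip->locked);
	ip->locked = false;
}

/* Take both locks, lowest address first; lock only once if a == b. */
void toy_lock_two(struct toy_inode *a, struct toy_inode *b)
{
	if ((uintptr_t)a > (uintptr_t)b) {
		struct toy_inode	*tmp = a;

		a = b;
		b = tmp;
	}

	toy_lock(a);
	if (a != b)
		toy_lock(b);
}

/* Drop both locks; unlock order does not matter for deadlock avoidance. */
void toy_unlock_two(struct toy_inode *a, struct toy_inode *b)
{
	toy_unlock(b);
	if (a != b)
		toy_unlock(a);
}
```

Because every caller sorts by address before locking, two threads that
pass the same pair in opposite argument order still acquire the locks in
the same sequence, which is what rules out an ABBA deadlock.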


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1759fbcbcd46..9bce98323ca6 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1059,7 +1059,7 @@ xfs_file_remap_range(
 	if (mp->m_flags & XFS_MOUNT_WSYNC)
 		xfs_log_force_inode(dest);
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_iunlock_two_io(src, dest);
 	if (ret)
 		trace_xfs_reflink_remap_range_error(dest, ret, _RET_IP_);
 	return remapped > 0 ? remapped : ret;
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index a0db7f47826f..080c8838fba5 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3112,3 +3112,96 @@ xfs_is_always_cow_inode(
 	return ip->i_mount->m_always_cow &&
 		xfs_sb_version_hasreflink(&ip->i_mount->m_sb);
 }
+
+/*
+ * Grab the exclusive iolock for a data copy from src to dest, making sure to
+ * abide by vfs locking order (lowest pointer value goes first) and breaking the
+ * layout leases before proceeding.  The loop is needed because we cannot call
+ * the blocking break_layout() with the iolocks held, and therefore have to
+ * back out both locks.
+ */
+static int
+xfs_iolock_two_inodes_and_break_layout(
+	struct inode		*src,
+	struct inode		*dest)
+{
+	int			error;
+
+	if (src > dest)
+		swap(src, dest);
+
+retry:
+	/* Wait to break both inodes' layouts before we start locking. */
+	error = break_layout(src, true);
+	if (error)
+		return error;
+	if (src != dest) {
+		error = break_layout(dest, true);
+		if (error)
+			return error;
+	}
+
+	/* Lock one inode and make sure nobody got in and leased it. */
+	inode_lock(src);
+	error = break_layout(src, false);
+	if (error) {
+		inode_unlock(src);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	if (src == dest)
+		return 0;
+
+	/* Lock the other inode and make sure nobody got in and leased it. */
+	inode_lock_nested(dest, I_MUTEX_NONDIR2);
+	error = break_layout(dest, false);
+	if (error) {
+		inode_unlock(src);
+		inode_unlock(dest);
+		if (error == -EWOULDBLOCK)
+			goto retry;
+		return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Lock two files so that userspace cannot initiate I/O via file syscalls or
+ * mmap activity.
+ */
+int
+xfs_ilock_two_io(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	int			ret;
+
+	ret = xfs_iolock_two_inodes_and_break_layout(VFS_I(ip1), VFS_I(ip2));
+	if (ret)
+		return ret;
+	if (ip1 == ip2)
+		xfs_ilock(ip1, XFS_MMAPLOCK_EXCL);
+	else
+		xfs_lock_two_inodes(ip1, XFS_MMAPLOCK_EXCL,
+				    ip2, XFS_MMAPLOCK_EXCL);
+	return 0;
+}
+
+/* Unlock both files to allow IO and mmap activity. */
+void
+xfs_iunlock_two_io(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	bool			same_inode = (ip1 == ip2);
+
+	xfs_iunlock(ip2, XFS_MMAPLOCK_EXCL);
+	if (!same_inode)
+		xfs_iunlock(ip1, XFS_MMAPLOCK_EXCL);
+	inode_unlock(VFS_I(ip2));
+	if (!same_inode)
+		inode_unlock(VFS_I(ip1));
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index df5021cf5d0f..d8cb7bed4dd9 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -509,4 +509,7 @@ void xfs_inode_inactivation_cleanup(struct xfs_inode *ip);
 
 void xfs_end_io(struct work_struct *work);
 
+int xfs_ilock_two_io(struct xfs_inode *ip1, struct xfs_inode *ip2);
+void xfs_iunlock_two_io(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
 #endif	/* __XFS_INODE_H__ */
diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
index f206f6637daf..566a3dee2815 100644
--- a/fs/xfs/xfs_reflink.c
+++ b/fs/xfs/xfs_reflink.c
@@ -1237,81 +1237,6 @@ xfs_reflink_remap_blocks(
 	return error;
 }
 
-/*
- * Grab the exclusive iolock for a data copy from src to dest, making sure to
- * abide vfs locking order (lowest pointer value goes first) and breaking the
- * layout leases before proceeding.  The loop is needed because we cannot call
- * the blocking break_layout() with the iolocks held, and therefore have to
- * back out both locks.
- */
-static int
-xfs_iolock_two_inodes_and_break_layout(
-	struct inode		*src,
-	struct inode		*dest)
-{
-	int			error;
-
-	if (src > dest)
-		swap(src, dest);
-
-retry:
-	/* Wait to break both inodes' layouts before we start locking. */
-	error = break_layout(src, true);
-	if (error)
-		return error;
-	if (src != dest) {
-		error = break_layout(dest, true);
-		if (error)
-			return error;
-	}
-
-	/* Lock one inode and make sure nobody got in and leased it. */
-	inode_lock(src);
-	error = break_layout(src, false);
-	if (error) {
-		inode_unlock(src);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	if (src == dest)
-		return 0;
-
-	/* Lock the other inode and make sure nobody got in and leased it. */
-	inode_lock_nested(dest, I_MUTEX_NONDIR2);
-	error = break_layout(dest, false);
-	if (error) {
-		inode_unlock(src);
-		inode_unlock(dest);
-		if (error == -EWOULDBLOCK)
-			goto retry;
-		return error;
-	}
-
-	return 0;
-}
-
-/* Unlock both inodes after they've been prepped for a range clone. */
-void
-xfs_reflink_remap_unlock(
-	struct file		*file_in,
-	struct file		*file_out)
-{
-	struct inode		*inode_in = file_inode(file_in);
-	struct xfs_inode	*src = XFS_I(inode_in);
-	struct inode		*inode_out = file_inode(file_out);
-	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
-
-	xfs_iunlock(dest, XFS_MMAPLOCK_EXCL);
-	if (!same_inode)
-		xfs_iunlock(src, XFS_MMAPLOCK_EXCL);
-	inode_unlock(inode_out);
-	if (!same_inode)
-		inode_unlock(inode_in);
-}
-
 /*
  * If we're reflinking to a point past the destination file's EOF, we must
  * zero any speculative post-EOF preallocations that sit between the old EOF
@@ -1374,18 +1299,12 @@ xfs_reflink_remap_prep(
 	struct xfs_inode	*src = XFS_I(inode_in);
 	struct inode		*inode_out = file_inode(file_out);
 	struct xfs_inode	*dest = XFS_I(inode_out);
-	bool			same_inode = (inode_in == inode_out);
 	int			ret;
 
 	/* Lock both files against IO */
-	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
+	ret = xfs_ilock_two_io(src, dest);
 	if (ret)
 		return ret;
-	if (same_inode)
-		xfs_ilock(src, XFS_MMAPLOCK_EXCL);
-	else
-		xfs_lock_two_inodes(src, XFS_MMAPLOCK_EXCL, dest,
-				XFS_MMAPLOCK_EXCL);
 
 	/* Check file eligibility and prepare for block sharing. */
 	ret = -EINVAL;
@@ -1436,7 +1355,7 @@ xfs_reflink_remap_prep(
 
 	return 0;
 out_unlock:
-	xfs_reflink_remap_unlock(file_in, file_out);
+	xfs_iunlock_two_io(src, dest);
 	return ret;
 }
 
diff --git a/fs/xfs/xfs_reflink.h b/fs/xfs/xfs_reflink.h
index 0879d2e71e11..8ddf1300a982 100644
--- a/fs/xfs/xfs_reflink.h
+++ b/fs/xfs/xfs_reflink.h
@@ -50,7 +50,5 @@ extern int xfs_reflink_remap_blocks(struct xfs_inode *src, loff_t pos_in,
 		loff_t *remapped);
 extern int xfs_reflink_update_dest(struct xfs_inode *dest, xfs_off_t newlen,
 		xfs_extlen_t cowextsize, unsigned int remap_flags);
-extern void xfs_reflink_remap_unlock(struct file *file_in,
-		struct file *file_out);
 
 #endif /* __XFS_REFLINK_H */


^ permalink raw reply related	[flat|nested] 22+ messages in thread
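
For readers following along, the lock ordering and retry loop that patch
10 moves into xfs_inode.c can be modelled in userspace.  What follows is
only a sketch under stated assumptions: demo_inode, demo_break_layout,
demo_lock_two and demo_unlock_two are hypothetical names, a pthread mutex
stands in for the VFS inode lock, and a boolean stands in for a layout
lease.  It is not the kernel code.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for an inode; not a kernel type. */
struct demo_inode {
	pthread_mutex_t	lock;	/* plays the role of the VFS inode lock */
	bool		leased;	/* plays the role of a layout lease */
};

/* Toy break_layout(): only a caller that may wait can break the lease. */
static int demo_break_layout(struct demo_inode *ip, bool wait)
{
	if (!ip->leased)
		return 0;
	if (!wait)
		return -EWOULDBLOCK;
	ip->leased = false;
	return 0;
}

/* Lock both inodes, lowest address first, retrying if a lease reappears. */
static int demo_lock_two(struct demo_inode *a, struct demo_inode *b)
{
	int error;

	if ((uintptr_t)a > (uintptr_t)b) {
		struct demo_inode *tmp = a;

		a = b;
		b = tmp;
	}

retry:
	/* Break leases with no locks held, because this step may block. */
	error = demo_break_layout(a, true);
	if (error)
		return error;
	if (a != b) {
		error = demo_break_layout(b, true);
		if (error)
			return error;
	}

	/* Lock the first inode and recheck without blocking. */
	pthread_mutex_lock(&a->lock);
	error = demo_break_layout(a, false);
	if (error) {
		pthread_mutex_unlock(&a->lock);
		if (error == -EWOULDBLOCK)
			goto retry;
		return error;
	}
	if (a == b)
		return 0;

	/* Lock the second inode and recheck; back out both on failure. */
	pthread_mutex_lock(&b->lock);
	error = demo_break_layout(b, false);
	if (error) {
		pthread_mutex_unlock(&b->lock);
		pthread_mutex_unlock(&a->lock);
		if (error == -EWOULDBLOCK)
			goto retry;
		return error;
	}
	return 0;
}

/* Release both locks, taking care not to unlock the same inode twice. */
static void demo_unlock_two(struct demo_inode *a, struct demo_inode *b)
{
	pthread_mutex_unlock(&b->lock);
	if (a != b)
		pthread_mutex_unlock(&a->lock);
}
```

Sorting by address before locking is what lets two callers that pass the
same pair in opposite order agree on who goes first; the same-inode case
has to be special-cased on both the lock and unlock side.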

* [PATCH 11/18] xfs: add a ->swap_file_range handler
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (9 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 10/18] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 12/18] xfs: add error injection to test swapext recovery Darrick J. Wong
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Add a function to handle range swap requests from the vfs.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  340 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_bmap_util.h |    4 +
 fs/xfs/xfs_file.c      |   39 ++++++
 fs/xfs/xfs_trace.h     |    4 +
 4 files changed, 387 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 070f657241a1..a8bd2627d76e 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -29,6 +29,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 #include "xfs_sb.h"
+#include "xfs_swapext.h"
 
 /* Kernel only BMAP related definitions and functions */
 
@@ -1841,3 +1842,342 @@ xfs_swap_extents(
 	xfs_trans_cancel(tp);
 	goto out_unlock;
 }
+
+/* Prepare two files to have their data swapped. */
+int
+xfs_swap_range_prep(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_swap_range	*fsr)
+{
+	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
+	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
+	int			ret;
+
+	/* Verify both files are either realtime or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+		return -EINVAL;
+
+	ret = generic_swap_file_range_prep(file1, file2, fsr);
+	if (ret)
+		return ret;
+
+	/* Attach dquots to both inodes before changing block maps. */
+	ret = xfs_qm_dqattach(ip2);
+	if (ret)
+		return ret;
+	ret = xfs_qm_dqattach(ip1);
+	if (ret)
+		return ret;
+
+	/* Flush the relevant ranges of both files. */
+	ret = xfs_flush_unmap_range(ip2, fsr->file2_offset, fsr->length);
+	if (ret)
+		return ret;
+	return xfs_flush_unmap_range(ip1, fsr->file1_offset, fsr->length);
+}
+
+/*
+ * Compute the number of blocks and extents mapped to part of a file, and the
+ * worst case estimate of the number of bmbt blocks required to store those
+ * mappings.
+ */
+STATIC int
+xfs_bmap_count_range_blocks(
+	struct xfs_inode	*ip,
+	int			whichfork,
+	xfs_fileoff_t		startoff,
+	xfs_filblks_t		blockcount,
+	xfs_filblks_t		*nr_mapped_blocks)
+{
+	struct xfs_bmbt_irec	irec;
+	xfs_filblks_t		nr_blocks = 0;
+	xfs_extnum_t		extents = 0;
+	int			bmapi_flags = xfs_bmapi_aflag(whichfork);
+	int			nimaps;
+	int			error;
+
+	*nr_mapped_blocks = 0;
+
+	/* Count all the extents that map to allocated space. */
+	while (blockcount > 0) {
+		nimaps = 1;
+		error = xfs_bmapi_read(ip, startoff, blockcount, &irec,
+				&nimaps, bmapi_flags);
+		if (error)
+			return error;
+		if (nimaps != 1)
+			return -EINVAL;
+		if (xfs_bmap_is_mapped_extent(&irec)) {
+			nr_blocks += irec.br_blockcount;
+			extents++;
+		}
+		startoff += irec.br_blockcount;
+		blockcount -= irec.br_blockcount;
+	}
+
+	/* Add in the number of bmbt splits that could happen. */
+	nr_blocks += XFS_NEXTENTADD_SPACE_RES(ip->i_mount, nr_blocks,
+			whichfork);
+	*nr_mapped_blocks = nr_blocks;
+
+	return 0;
+}
+
+/*
+ * Compute the number of blocks we need to reserve to handle a log-assisted
+ * extent swap operation.
+ */
+static inline unsigned int
+xfs_swap_range_calc_resblks(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2,
+	int			whichfork,
+	xfs_filblks_t		blockcount)
+{
+	struct xfs_mount	*mp = ip1->i_mount;
+	xfs_extnum_t		ip1_nr = XFS_IFORK_NEXTENTS(ip1, whichfork);
+	xfs_extnum_t		ip2_nr = XFS_IFORK_NEXTENTS(ip2, whichfork);
+	unsigned int		resblks;
+
+	/*
+	 * Each file range cannot have more extents than there are blocks in
+	 * that range.
+	 */
+	ip1_nr = min_t(xfs_filblks_t, ip1_nr, blockcount);
+	ip2_nr = min_t(xfs_filblks_t, ip2_nr, blockcount);
+
+	/*
+	 * Conceptually this shouldn't affect the shape of either bmbt, but
+	 * since we atomically move extents one by one, we reserve enough space
+	 * to rebuild both trees.
+	 */
+	resblks =  XFS_SWAP_RMAP_SPACE_RES(mp, ip1_nr, whichfork);
+	resblks += XFS_SWAP_RMAP_SPACE_RES(mp, ip2_nr, whichfork);
+
+	/*
+	 * Handle the corner case where either inode might straddle the btree
+	 * format boundary. If so, the inode could bounce between btree <->
+	 * extent format on unmap -> remap cycles, freeing and allocating a
+	 * bmapbt block each time.
+	 */
+	if (ip1_nr == (XFS_IFORK_MAXEXT(ip1, whichfork) + 1))
+		resblks += XFS_IFORK_MAXEXT(ip1, whichfork);
+	if (ip2_nr == (XFS_IFORK_MAXEXT(ip2, whichfork) + 1))
+		resblks += XFS_IFORK_MAXEXT(ip2, whichfork);
+
+	return resblks;
+}
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same.
+ */
+STATIC int
+xfs_swap_range_prep_quota(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2,
+	int			whichfork,
+	xfs_fileoff_t		startoff1,
+	xfs_fileoff_t		startoff2,
+	xfs_filblks_t		blockcount)
+{
+	struct xfs_mount	*mp = ip1->i_mount;
+	xfs_filblks_t		ip1_mapped, ip2_mapped;
+	int			error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!(mp->m_qflags & XFS_ALL_QUOTA_ENFD) || ip1 == ip2)
+		return 0;
+
+	if (ip1->i_udquot == ip2->i_udquot &&
+	    ip1->i_gdquot == ip2->i_gdquot &&
+	    ip1->i_pdquot == ip2->i_pdquot)
+		return 0;
+
+	/* Figure out how many blocks we'll move out of each file. */
+	error = xfs_bmap_count_range_blocks(ip1, whichfork, startoff1,
+			blockcount, &ip1_mapped);
+	if (error)
+		return error;
+	error = xfs_bmap_count_range_blocks(ip2, whichfork, startoff2,
+			blockcount, &ip2_mapped);
+	if (error)
+		return error;
+
+	/*
+	 * For each file, compute the net gain in the number of blocks that
+	 * will be mapped into that file and reserve that much quota.  The
+	 * quota counts must be able to absorb at least that much space.
+	 */
+	if (ip2_mapped > ip1_mapped) {
+		error = xfs_trans_reserve_quota_nblks(tp, ip1,
+				ip2_mapped - ip1_mapped, 0,
+				XFS_QMOPT_RES_REGBLKS);
+		if (error)
+			return error;
+	}
+
+	if (ip1_mapped > ip2_mapped) {
+		error = xfs_trans_reserve_quota_nblks(tp, ip2,
+				ip1_mapped - ip2_mapped, 0,
+				XFS_QMOPT_RES_REGBLKS);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * For each file, forcibly reserve the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.
+	 * We must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	error = xfs_trans_reserve_quota_nblks(tp, ip1, ip1_mapped, 0,
+			XFS_QMOPT_FORCE_RES | XFS_QMOPT_RES_REGBLKS);
+	if (error)
+		return error;
+
+	return xfs_trans_reserve_quota_nblks(tp, ip2, ip2_mapped, 0,
+			XFS_QMOPT_FORCE_RES | XFS_QMOPT_RES_REGBLKS);
+}
+
+/* Swap parts of two files. */
+int
+xfs_swap_range(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2,
+	const struct file_swap_range *fsr)
+{
+	struct xfs_mount	*mp = ip1->i_mount;
+	struct xfs_trans	*tp;
+	xfs_fileoff_t		startoff1;
+	xfs_fileoff_t		startoff2;
+	xfs_filblks_t		blockcount = XFS_B_TO_FSB(mp, fsr->length);
+	unsigned int		resblks;
+	unsigned int		sxflags = 0;
+	int			error;
+
+	if (!xfs_sb_version_hasatomicswap(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	startoff1 = XFS_B_TO_FSBT(mp, fsr->file1_offset);
+	startoff2 = XFS_B_TO_FSBT(mp, fsr->file2_offset);
+
+	/*
+	 * Cancel CoW fork preallocations for the ranges of both files.  The
+	 * prep function should have flushed all the dirty data, so the only
+	 * extents remaining should be speculative.
+	 */
+	if (xfs_inode_has_cow_data(ip1)) {
+		error = xfs_reflink_cancel_cow_range(ip1, fsr->file1_offset,
+				fsr->length, true);
+		if (error)
+			return error;
+	}
+
+	if (xfs_inode_has_cow_data(ip2)) {
+		error = xfs_reflink_cancel_cow_range(ip2, fsr->file2_offset,
+				fsr->length, true);
+		if (error)
+			return error;
+	}
+
+	resblks = xfs_swap_range_calc_resblks(ip1, ip2, XFS_DATA_FORK,
+			blockcount);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
+	if (error)
+		return error;
+
+	/*
+	 * Lock and join the inodes to the transaction.  The ijoin lock flags
+	 * are zero, so we must drop the ILOCKs ourselves at out_unlock.
+	 */
+	if (ip1 != ip2) {
+		xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL, ip2, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip1, 0);
+		xfs_trans_ijoin(tp, ip2, 0);
+	} else {
+		xfs_ilock(ip1, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, ip1, 0);
+	}
+
+	trace_xfs_swap_extent_before(ip2, 0);
+	trace_xfs_swap_extent_before(ip1, 1);
+
+	/*
+	 * Do all of the input checking that can only be done once we hold
+	 * both ILOCKs.
+	 */
+	error = generic_swap_file_range_check_fresh(VFS_I(ip1), VFS_I(ip2),
+			fsr);
+	if (error)
+		goto out_trans_cancel;
+
+	if (XFS_IFORK_FORMAT(ip1, XFS_DATA_FORK) == XFS_DINODE_FMT_LOCAL ||
+	    XFS_IFORK_FORMAT(ip2, XFS_DATA_FORK) == XFS_DINODE_FMT_LOCAL) {
+		error = -EINVAL;
+		goto out_trans_cancel;
+	}
+
+	/*
+	 * Reserve ourselves some quota if any of them are in enforcing mode.
+	 * In theory we only need enough to satisfy the change in the number
+	 * of blocks between the two ranges being remapped.
+	 */
+	error = xfs_swap_range_prep_quota(tp, ip1, ip2, XFS_DATA_FORK,
+			startoff1, startoff2, blockcount);
+	if (error)
+		goto out_trans_cancel;
+
+	/* Perform the file range swap. */
+	if (fsr->flags & FILE_SWAP_RANGE_TO_EOF)
+		sxflags |= XFS_SWAPEXT_SET_SIZES;
+
+	error = xfs_swapext_atomic(&tp, ip1, ip2, XFS_DATA_FORK, startoff1,
+			startoff2, blockcount, sxflags);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * If the caller wanted us to swap two complete files of unequal
+	 * length, swap the incore sizes now.  This should be safe because we
+	 * flushed both files' page caches and moved all the post-eof extents,
+	 * so there should not be anything to zero.
+	 */
+	if (fsr->flags & FILE_SWAP_RANGE_TO_EOF) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(ip2));
+		i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+		i_size_write(VFS_I(ip1), temp);
+	}
+
+	/*
+	 * If this is a synchronous mount, make sure that the
+	 * transaction goes to disk before returning to the user.
+	 */
+	if (mp->m_flags & XFS_MOUNT_WSYNC)
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+
+	trace_xfs_swap_extent_after(ip2, 0);
+	trace_xfs_swap_extent_after(ip1, 1);
+
+out_unlock:
+	xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+	if (ip1 != ip2)
+		xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
+
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 9f993168b55b..d3444a63bbd7 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -68,6 +68,10 @@ int	xfs_free_eofblocks(struct xfs_inode *ip);
 
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
+int	xfs_swap_range_prep(struct file *file1, struct file *file2,
+			    struct file_swap_range *fsr);
+int	xfs_swap_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
+		       const struct file_swap_range *fsr);
 
 xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
 
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 9bce98323ca6..d446c16cfc30 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1065,6 +1065,44 @@ xfs_file_remap_range(
 	return remapped > 0 ? remapped : ret;
 }
 
+STATIC int
+xfs_file_swap_range(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_swap_range	*fsr)
+{
+	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
+	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
+	struct xfs_mount	*mp = ip1->i_mount;
+	int			ret;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Lock both files against IO */
+	ret = xfs_ilock_two_io(ip1, ip2);
+	if (ret)
+		return ret;
+
+	/* Prepare and then swap file data. */
+	ret = xfs_swap_range_prep(file1, file2, fsr);
+	if (ret)
+		goto out_unlock;
+
+	trace_xfs_file_swap_range(ip1, fsr->file1_offset, fsr->length, ip2,
+			fsr->file2_offset);
+
+	ret = xfs_swap_range(ip1, ip2, fsr);
+	if (ret)
+		goto out_unlock;
+
+out_unlock:
+	xfs_iunlock_two_io(ip1, ip2);
+	if (ret)
+		trace_xfs_file_swap_range_error(ip2, ret, _RET_IP_);
+	return ret;
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1307,6 +1345,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
+	.swap_file_range = xfs_file_swap_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index af9c7bcb7a8a..7917203e56d4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3208,6 +3208,10 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+
+/* swapext tracepoints */
+DEFINE_DOUBLE_IO_EVENT(xfs_file_swap_range);
+DEFINE_INODE_ERROR_EVENT(xfs_file_swap_range_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);


^ permalink raw reply related	[flat|nested] 22+ messages in thread
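
The net-gain half of xfs_swap_range_prep_quota in patch 11 comes down to
a small piece of arithmetic: whichever file ends up with more mapped
blocks than it started with must have quota headroom for the difference.
A minimal sketch of just that step, assuming hypothetical names
(demo_quota_res, demo_net_gain) and plain integers in place of the kernel
types; the forced gross reservations that follow in the patch are not
modelled here.

```c
#include <stdint.h>

/* Hypothetical result type: extra blocks each file must absorb. */
struct demo_quota_res {
	uint64_t	file1_net;
	uint64_t	file2_net;
};

/*
 * Given the number of mapped blocks in each range before the swap,
 * compute the net gain for each file.  At most one side is nonzero.
 */
static struct demo_quota_res
demo_net_gain(uint64_t ip1_mapped, uint64_t ip2_mapped)
{
	struct demo_quota_res res = { 0, 0 };

	/* File 1 receives file 2's blocks and gives up its own. */
	if (ip2_mapped > ip1_mapped)
		res.file1_net = ip2_mapped - ip1_mapped;

	/* File 2 receives file 1's blocks and gives up its own. */
	if (ip1_mapped > ip2_mapped)
		res.file2_net = ip1_mapped - ip2_mapped;

	return res;
}
```

For example, swapping a range with 10 mapped blocks in file 1 against a
range with 25 mapped blocks in file 2 asks file 1's dquots to absorb 15
more blocks and asks nothing of file 2.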

* [PATCH 12/18] xfs: add error injection to test swapext recovery
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (10 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 11/18] xfs: add a ->swap_file_range handler Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Add an errortag so that we can test recovery of swapext log items.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/libxfs/xfs_swapext.c  |    5 +++++
 fs/xfs/xfs_error.c           |    3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 79e6c4fb1d8a..e99683558ccc 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -55,7 +55,8 @@
 #define XFS_ERRTAG_FORCE_SCRUB_REPAIR			32
 #define XFS_ERRTAG_FORCE_SUMMARY_RECALC			33
 #define XFS_ERRTAG_IUNLINK_FALLBACK			34
-#define XFS_ERRTAG_MAX					35
+#define XFS_ERRTAG_SWAPEXT_FINISH_ONE			35
+#define XFS_ERRTAG_MAX					36
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -94,6 +95,7 @@
 #define XFS_RANDOM_BUF_LRU_REF				2
 #define XFS_RANDOM_FORCE_SCRUB_REPAIR			1
 #define XFS_RANDOM_FORCE_SUMMARY_RECALC			1
+#define XFS_RANDOM_SWAPEXT_FINISH_ONE			1
 #define XFS_RANDOM_IUNLINK_FALLBACK			(XFS_RANDOM_DEFAULT/10)
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 2eff48453070..6597c613fa3e 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -18,6 +18,8 @@
 #include "xfs_quota.h"
 #include "xfs_swapext.h"
 #include "xfs_trace.h"
+#include "xfs_errortag.h"
+#include "xfs_error.h"
 
 /* Information to help us reset reflink flag / CoW fork state after a swap. */
 
@@ -354,6 +356,9 @@ xfs_swapext_finish_one(
 		xfs_trans_log_inode(tp, sxi->si_ip2, XFS_ILOG_CORE);
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_SWAPEXT_FINISH_ONE))
+		return -EIO;
+
 	if (xfs_swapext_has_more_work(sxi))
 		trace_xfs_swapext_defer(tp->t_mountp, sxi);
 	return 0;
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index a21e9cc6516a..d818497afa2c 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -53,6 +53,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_FORCE_SCRUB_REPAIR,
 	XFS_RANDOM_FORCE_SUMMARY_RECALC,
 	XFS_RANDOM_IUNLINK_FALLBACK,
+	XFS_RANDOM_SWAPEXT_FINISH_ONE,
 };
 
 struct xfs_errortag_attr {
@@ -162,6 +163,7 @@ XFS_ERRORTAG_ATTR_RW(buf_lru_ref,	XFS_ERRTAG_BUF_LRU_REF);
 XFS_ERRORTAG_ATTR_RW(force_repair,	XFS_ERRTAG_FORCE_SCRUB_REPAIR);
 XFS_ERRORTAG_ATTR_RW(bad_summary,	XFS_ERRTAG_FORCE_SUMMARY_RECALC);
 XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
+XFS_ERRORTAG_ATTR_RW(swapext_finish_one, XFS_ERRTAG_SWAPEXT_FINISH_ONE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -199,6 +201,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(force_repair),
 	XFS_ERRORTAG_ATTR_LIST(bad_summary),
 	XFS_ERRORTAG_ATTR_LIST(iunlink_fallback),
+	XFS_ERRORTAG_ATTR_LIST(swapext_finish_one),
 	NULL,
 };
 


^ permalink raw reply related	[flat|nested] 22+ messages in thread
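
As a rough illustration of what the errortag in patch 12 does, here is a
userspace model of the "random factor" check and of a step function that
fails after doing its work.  Everything named demo_* is hypothetical.
The errortag.h comment says a factor of 1 means always and 2 means half
the time; treating a factor of zero as "disabled" is an assumption of
this sketch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

#define DEMO_ERRTAG_SWAPEXT_FINISH_ONE	0
#define DEMO_ERRTAG_MAX			1

/* Hypothetical mount structure holding one random factor per tag. */
struct demo_mount {
	unsigned int	errortag[DEMO_ERRTAG_MAX];
};

/* Fire roughly once per 'factor' calls; zero disables the tag. */
static bool demo_test_error(const struct demo_mount *mp, unsigned int tag)
{
	unsigned int	factor = mp->errortag[tag];

	if (factor == 0)
		return false;
	return (unsigned int)rand() % factor == 0;
}

/*
 * Finish one unit of work, then maybe inject a failure.  Because the
 * check runs after the work is done, the caller is left with a partly
 * completed operation, which is the state recovery has to handle.
 */
static int demo_finish_one(const struct demo_mount *mp,
			   unsigned int *blocks_left)
{
	if (*blocks_left > 0)
		(*blocks_left)--;

	if (demo_test_error(mp, DEMO_ERRTAG_SWAPEXT_FINISH_ONE))
		return -EIO;
	return 0;
}
```

With the factor set to 1 the first step always makes progress and then
returns -EIO, so a test can shut down and remount to check that recovery
finishes the remaining work.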

* [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (11 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 12/18] xfs: add error injection to test swapext recovery Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 14/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

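If the filesystem does not support atomic extent swapping and userspace
permits non-atomic swap operations, fall back to the older code paths
(deferred bmap operations or the fork-swapping code) to implement the
same functionality.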

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index a8bd2627d76e..72aebf7ed42d 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -2063,9 +2063,6 @@ xfs_swap_range(
 	unsigned int		sxflags = 0;
 	int			error;
 
-	if (!xfs_sb_version_hasatomicswap(&mp->m_sb))
-		return -EOPNOTSUPP;
-
 	startoff1 = XFS_B_TO_FSBT(mp, fsr->file1_offset);
 	startoff2 = XFS_B_TO_FSBT(mp, fsr->file2_offset);
 
@@ -2135,12 +2132,45 @@ xfs_swap_range(
 	if (error)
 		goto out_trans_cancel;
 
-	/* Perform the file range swap. */
 	if (fsr->flags & FILE_SWAP_RANGE_TO_EOF)
 		sxflags |= XFS_SWAPEXT_SET_SIZES;
 
-	error = xfs_swapext_atomic(&tp, ip1, ip2, XFS_DATA_FORK, startoff1,
-			startoff2, blockcount, sxflags);
+	/* Perform the file range swap... */
+	if (xfs_sb_version_hasatomicswap(&mp->m_sb)) {
+		/* ...by using the atomic swap, since it's available. */
+		error = xfs_swapext_atomic(&tp, ip1, ip2, XFS_DATA_FORK,
+				startoff1, startoff2, blockcount, sxflags);
+	} else if ((fsr->flags & FILE_SWAP_RANGE_NONATOMIC) &&
+		   (xfs_sb_version_hasreflink(&mp->m_sb) ||
+		    xfs_sb_version_hasrmapbt(&mp->m_sb))) {
+		/*
+		 * ...by using deferred bmap operations, which are only
+		 * supported if userspace is ok with a non-atomic swap
+		 * (e.g. xfs_fsr) and the log supports deferred bmap.
+		 */
+		error = xfs_swapext_deferred_bmap(&tp, ip1, ip2, XFS_DATA_FORK,
+				startoff1, startoff2, blockcount, sxflags);
+	} else if ((fsr->flags & FILE_SWAP_RANGE_NONATOMIC) &&
+		   !(fsr->flags & FILE_SWAP_RANGE_TO_EOF) &&
+		   fsr->file1_offset == 0 && fsr->file2_offset == 0 &&
+		   fsr->length == ip1->i_d.di_size &&
+		   fsr->length == ip2->i_d.di_size) {
+		/*
+		 * ...by using the old bmap owner change code, if we're doing
+		 * a full file swap and we're ok with non-atomic mode.
+		 */
+		error = xfs_swap_extents_check_format(ip2, ip1);
+		if (error) {
+			xfs_notice(mp,
+		"%s: inode 0x%llx format is incompatible for exchanging.",
+					__func__, ip2->i_ino);
+			goto out_trans_cancel;
+		}
+		error = xfs_swap_extent_forks(&tp, ip2, ip1);
+	} else {
+		/* ...or not at all, because we cannot do it. */
+		error = -EOPNOTSUPP;
+	}
 	if (error)
 		goto out_trans_cancel;
 


^ permalink raw reply related	[flat|nested] 22+ messages in thread
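
The fallback chain that patch 13 adds to xfs_swap_range can be read as a
pure decision function.  The sketch below restates it in userspace C
under stated assumptions: the demo_* types and names are hypothetical,
the feature bits and flags are reduced to booleans, and the on-disk file
sizes are passed in by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

enum demo_swap_method {
	DEMO_SWAP_ATOMIC,		/* log-assisted atomic swap */
	DEMO_SWAP_DEFERRED_BMAP,	/* per-extent deferred bmap items */
	DEMO_SWAP_FORKS,		/* old whole-file fork swap */
	DEMO_SWAP_UNSUPPORTED,		/* maps to -EOPNOTSUPP */
};

/* Hypothetical summary of the superblock feature checks. */
struct demo_fs_features {
	bool	atomicswap;
	bool	reflink;
	bool	rmapbt;
};

/* Hypothetical summary of the request and the two file sizes. */
struct demo_swap_req {
	bool		nonatomic_ok;	/* FILE_SWAP_RANGE_NONATOMIC set */
	bool		to_eof;		/* FILE_SWAP_RANGE_TO_EOF set */
	uint64_t	file1_offset;
	uint64_t	file2_offset;
	uint64_t	length;
	uint64_t	file1_size;
	uint64_t	file2_size;
};

static enum demo_swap_method
demo_pick_method(const struct demo_fs_features *fs,
		 const struct demo_swap_req *req)
{
	/* Prefer the atomic swap whenever the filesystem supports it. */
	if (fs->atomicswap)
		return DEMO_SWAP_ATOMIC;

	/* Everything below is only allowed if the caller opted in. */
	if (!req->nonatomic_ok)
		return DEMO_SWAP_UNSUPPORTED;

	/* Deferred bmap items need a log that understands them. */
	if (fs->reflink || fs->rmapbt)
		return DEMO_SWAP_DEFERRED_BMAP;

	/* The fork swap only handles two complete, equal-length files. */
	if (!req->to_eof &&
	    req->file1_offset == 0 && req->file2_offset == 0 &&
	    req->length == req->file1_size &&
	    req->length == req->file2_size)
		return DEMO_SWAP_FORKS;

	return DEMO_SWAP_UNSUPPORTED;
}
```

Note that a caller who does not set the non-atomic flag gets either the
atomic swap or an error, never a silent downgrade.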

* [PATCH 14/18] xfs: port xfs_swap_extents_rmap to our new code
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (12 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 15/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

The inner loop of xfs_swap_extent_rmap does the same work as
xfs_swapext_finish_one, so adapt it to use that.  Doing so has the side
benefit that the older code path no longer wastes its time remapping
shared extents.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/libxfs/xfs_swapext.c |   46 +++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h |    5 ++
 fs/xfs/xfs_bmap_util.c      |  136 +++----------------------------------------
 fs/xfs/xfs_trace.h          |    5 --
 4 files changed, 60 insertions(+), 132 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 6597c613fa3e..64083d48fb7d 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -433,3 +433,49 @@ xfs_swapext_atomic(
 	xfs_swapext_reflink_finish(*tpp, ip1, ip2, state);
 	return 0;
 }
+
+/*
+ * Swap a range of extents from one inode to another, non-atomically.
+ *
+ * Use deferred bmap log items to swap a range of extents from one inode with
+ * another.  Overall extent swap progress is /not/ tracked through the log,
+ * which means that while log recovery can finish remapping a single extent,
+ * it cannot finish the entire operation.
+ */
+int
+xfs_swapext_deferred_bmap(
+	struct xfs_trans		**tpp,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	int				whichfork,
+	xfs_fileoff_t			startoff1,
+	xfs_fileoff_t			startoff2,
+	xfs_filblks_t			blockcount,
+	unsigned int			flags)
+{
+	struct xfs_swapext_intent	sxi;
+	unsigned int			state;
+	int				error;
+
+	ASSERT(xfs_isilocked(ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(ip2, XFS_ILOCK_EXCL));
+	ASSERT(whichfork != XFS_COW_FORK);
+
+	state = xfs_swapext_reflink_prep(ip1, ip2, whichfork, startoff1,
+			startoff2, blockcount);
+
+	xfs_swapext_init_intent(&sxi, ip1, ip2, whichfork, startoff1, startoff2,
+			blockcount, flags);
+
+	while (sxi.si_blockcount > 0) {
+		error = xfs_swapext_finish_one(*tpp, &sxi);
+		if (error)
+			return error;
+		error = xfs_defer_finish(tpp);
+		if (error)
+			return error;
+	}
+
+	xfs_swapext_reflink_finish(*tpp, ip1, ip2, state);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index af1893f37d39..f4146f55a4c9 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -54,4 +54,9 @@ int xfs_swapext_atomic(struct xfs_trans **tpp, struct xfs_inode *ip1,
 		xfs_fileoff_t startoff2, xfs_filblks_t blockcount,
 		unsigned int flags);
 
+int xfs_swapext_deferred_bmap(struct xfs_trans **tpp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2, int whichfork, xfs_fileoff_t startoff1,
+		xfs_fileoff_t startoff2, xfs_filblks_t blockcount,
+		unsigned int flags);
+
 #endif /* __XFS_SWAPEXT_H_ */
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 72aebf7ed42d..d1351f0176a3 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1351,131 +1351,6 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
-/*
- * Move extents from one file to another, when rmap is enabled.
- */
-STATIC int
-xfs_swap_extent_rmap(
-	struct xfs_trans		**tpp,
-	struct xfs_inode		*ip,
-	struct xfs_inode		*tip)
-{
-	struct xfs_trans		*tp = *tpp;
-	struct xfs_bmbt_irec		irec;
-	struct xfs_bmbt_irec		uirec;
-	struct xfs_bmbt_irec		tirec;
-	xfs_fileoff_t			offset_fsb;
-	xfs_fileoff_t			end_fsb;
-	xfs_filblks_t			count_fsb;
-	int				error;
-	xfs_filblks_t			ilen;
-	xfs_filblks_t			rlen;
-	int				nimaps;
-	uint64_t			tip_flags2;
-
-	/*
-	 * If the source file has shared blocks, we must flag the donor
-	 * file as having shared blocks so that we get the shared-block
-	 * rmap functions when we go to fix up the rmaps.  The flags
-	 * will be switch for reals later.
-	 */
-	tip_flags2 = tip->i_d.di_flags2;
-	if (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
-		tip->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
-
-	offset_fsb = 0;
-	end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
-	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
-
-	while (count_fsb) {
-		/* Read extent from the donor file */
-		nimaps = 1;
-		error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
-				&nimaps, 0);
-		if (error)
-			goto out;
-		if (nimaps != 1 || tirec.br_startblock == DELAYSTARTBLOCK) {
-			/*
-			 * We should never get no mapping or a delalloc extent
-			 * since the donor file should have been flushed by the
-			 * caller.
-			 */
-			ASSERT(0);
-			error = -EINVAL;
-			goto out;
-		}
-
-		trace_xfs_swap_extent_rmap_remap(tip, &tirec);
-		ilen = tirec.br_blockcount;
-
-		/* Unmap the old blocks in the source file. */
-		while (tirec.br_blockcount) {
-			ASSERT(tp->t_firstblock == NULLFSBLOCK);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
-
-			/* Read extent from the source file */
-			nimaps = 1;
-			error = xfs_bmapi_read(ip, tirec.br_startoff,
-					tirec.br_blockcount, &irec,
-					&nimaps, 0);
-			if (error)
-				goto out;
-			if (nimaps != 1 ||
-			    tirec.br_startoff != irec.br_startoff) {
-				/*
-				 * We should never get no mapping or a mapping
-				 * for another offset, but bail out if that
-				 * ever does.
-				 */
-				ASSERT(0);
-				error = -EFSCORRUPTED;
-				goto out;
-			}
-			trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
-
-			/* Trim the extent. */
-			uirec = tirec;
-			uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
-					tirec.br_blockcount,
-					irec.br_blockcount);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
-
-			/* Remove the mapping from the donor file. */
-			xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
-
-			/* Remove the mapping from the source file. */
-			xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
-
-			/* Map the donor file's blocks into the source file. */
-			xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
-
-			/* Map the source file's blocks into the donor file. */
-			xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
-
-			error = xfs_defer_finish(tpp);
-			tp = *tpp;
-			if (error)
-				goto out;
-
-			tirec.br_startoff += rlen;
-			if (tirec.br_startblock != HOLESTARTBLOCK &&
-			    tirec.br_startblock != DELAYSTARTBLOCK)
-				tirec.br_startblock += rlen;
-			tirec.br_blockcount -= rlen;
-		}
-
-		/* Roll on... */
-		count_fsb -= ilen;
-		offset_fsb += ilen;
-	}
-
-out:
-	if (error)
-		trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
-	tip->i_d.di_flags2 = tip_flags2;
-	return error;
-}
-
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
@@ -1765,15 +1640,20 @@ xfs_swap_extents(
 	target_log_flags = XFS_ILOG_CORE;
 
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
-		error = xfs_swap_extent_rmap(&tp, ip, tip);
+		error = xfs_swapext_deferred_bmap(&tp, ip, tip, XFS_DATA_FORK,
+				0, 0, XFS_B_TO_FSB(ip->i_mount,
+						   i_size_read(VFS_I(ip))), 0);
 	else
 		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
 				&target_log_flags);
-	if (error)
+	if (error) {
+		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
+	}
 
 	/* Do we have to swap reflink flags? */
-	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
+	if (!xfs_sb_version_hasrmapbt(&mp->m_sb) &&
+	    (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
 	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
 		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
 		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7917203e56d4..306cf86c353d 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3204,14 +3204,11 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 
-/* rmap swapext tracepoints */
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
-DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
 
 /* swapext tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_file_swap_range);
 DEFINE_INODE_ERROR_EVENT(xfs_file_swap_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);



* [PATCH 15/18] xfs: consolidate all of the xfs_swap_extent_forks code
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (13 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 14/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:45 ` [PATCH 16/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Consolidate the bmbt owner change scan code in xfs_swap_extent_forks,
since it's not needed for the deferred bmap log item swapext
implementation.

The goal is to package up all three implementations into functions that
have the same preconditions and leave the system in the same state.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  211 +++++++++++++++++++++++-------------------------
 1 file changed, 103 insertions(+), 108 deletions(-)
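
Note (not part of the patch): the retry loop in xfs_swap_change_owner()
can be illustrated in userspace.  The sketch below is only a model under
stated assumptions: struct fake_trans, struct fake_scan,
fake_change_owner() and fake_roll() are hypothetical stand-ins, not
kernel interfaces, and the "reservation" is reduced to a simple credit
counter.  It shows the loop shape only: retry on -EAGAIN after
replenishing the reservation, stop on success or on any other error.

```c
#include <errno.h>

/* Hypothetical stand-in for a transaction with a limited log reservation. */
struct fake_trans {
	int	credits;	/* buffers we may still log in this transaction */
	int	rolls;		/* how many times the transaction was rolled */
};

/* Hypothetical scan state: how many blocks still need a new owner. */
struct fake_scan {
	int	blocks_left;
};

/*
 * Re-own as many blocks as the reservation allows.  Returns 0 when every
 * block is done, or -EAGAIN when the reservation ran out first.
 */
int fake_change_owner(struct fake_trans *tp, struct fake_scan *scan)
{
	while (scan->blocks_left > 0) {
		if (tp->credits == 0)
			return -EAGAIN;
		tp->credits--;
		scan->blocks_left--;
	}
	return 0;
}

/* Replenish the reservation; this models what rolling a transaction buys. */
int fake_roll(struct fake_trans *tp, int credits_per_roll)
{
	if (credits_per_roll <= 0)
		return -ENOSPC;	/* fatal: no progress is possible */
	tp->credits = credits_per_roll;
	tp->rolls++;
	return 0;
}

/* Same loop shape as xfs_swap_change_owner(): retry only on -EAGAIN. */
int change_owner_until_done(struct fake_trans *tp, struct fake_scan *scan,
			    int credits_per_roll)
{
	int	error;

	do {
		error = fake_change_owner(tp, scan);
		/* success or fatal error */
		if (error != -EAGAIN)
			break;

		error = fake_roll(tp, credits_per_roll);
		if (error)
			break;
	} while (1);

	return error;
}
```

With two credits per roll and seven blocks to re-own, the model needs
three rolls before the scan reports success.  The model resumes where it
stopped, which is a simplification of the kernel scan.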


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index d1351f0176a3..1767f1586c46 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1351,19 +1351,61 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tmpip)
+{
+	int			error;
+	struct xfs_trans	*tp = *tpp;
+
+	do {
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+					      NULL);
+		/* success or fatal error */
+		if (error != -EAGAIN)
+			break;
+
+		error = xfs_trans_roll(tpp);
+		if (error)
+			break;
+		tp = *tpp;
+
+		/*
+		 * Redirty both inodes so they can relog and keep the log tail
+		 * moving forward.
+		 */
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_ijoin(tp, tmpip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+	} while (true);
+
+	return error;
+}
+
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
-	struct xfs_trans	*tp,
+	struct xfs_trans	**tpp,
 	struct xfs_inode	*ip,
-	struct xfs_inode	*tip,
-	int			*src_log_flags,
-	int			*target_log_flags)
+	struct xfs_inode	*tip)
 {
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
 	uint64_t		tmp;
+	int			src_log_flags = XFS_ILOG_CORE;
+	int			target_log_flags = XFS_ILOG_CORE;
 	int			error;
 
 	/*
@@ -1371,14 +1413,14 @@ xfs_swap_extent_forks(
 	 */
 	if ( ((XFS_IFORK_Q(ip) != 0) && (ip->i_d.di_anextents > 0)) &&
 	     (ip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
-		error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
 				&aforkblks);
 		if (error)
 			return error;
 	}
 	if ( ((XFS_IFORK_Q(tip) != 0) && (tip->i_d.di_anextents > 0)) &&
 	     (tip->i_d.di_aformat != XFS_DINODE_FMT_LOCAL)) {
-		error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
 				&taforkblks);
 		if (error)
 			return error;
@@ -1393,9 +1435,9 @@ xfs_swap_extent_forks(
 	 */
 	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
 		if (ip->i_d.di_format == XFS_DINODE_FMT_BTREE)
-			(*target_log_flags) |= XFS_ILOG_DOWNER;
+			target_log_flags |= XFS_ILOG_DOWNER;
 		if (tip->i_d.di_format == XFS_DINODE_FMT_BTREE)
-			(*src_log_flags) |= XFS_ILOG_DOWNER;
+			src_log_flags |= XFS_ILOG_DOWNER;
 	}
 
 	/*
@@ -1428,69 +1470,77 @@ xfs_swap_extent_forks(
 
 	switch (ip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*src_log_flags) |= XFS_ILOG_DEXT;
+		src_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (*src_log_flags & XFS_ILOG_DOWNER));
-		(*src_log_flags) |= XFS_ILOG_DBROOT;
+		       (src_log_flags & XFS_ILOG_DOWNER));
+		src_log_flags |= XFS_ILOG_DBROOT;
 		break;
 	}
 
 	switch (tip->i_d.di_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*target_log_flags) |= XFS_ILOG_DEXT;
+		target_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		(*target_log_flags) |= XFS_ILOG_DBROOT;
+		target_log_flags |= XFS_ILOG_DBROOT;
 		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (*target_log_flags & XFS_ILOG_DOWNER));
+		       (target_log_flags & XFS_ILOG_DOWNER));
 		break;
 	}
 
-	return 0;
-}
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
+		uint64_t	f;
 
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
-	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tmpip)
-{
-	int			error;
-	struct xfs_trans	*tp = *tpp;
+		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
+	}
 
-	do {
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
-					      NULL);
-		/* success or fatal error */
-		if (error != -EAGAIN)
-			break;
+	/* Swap the cow forks. */
+	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb)) {
+		ASSERT(ip->i_cformat == XFS_DINODE_FMT_EXTENTS);
+		ASSERT(tip->i_cformat == XFS_DINODE_FMT_EXTENTS);
 
-		error = xfs_trans_roll(tpp);
-		if (error)
-			break;
-		tp = *tpp;
+		swap(ip->i_cnextents, tip->i_cnextents);
+		swap(ip->i_cowfp, tip->i_cowfp);
 
-		/*
-		 * Redirty both inodes so they can relog and keep the log tail
-		 * moving forward.
-		 */
-		xfs_trans_ijoin(tp, ip, 0);
-		xfs_trans_ijoin(tp, tmpip, 0);
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
-	} while (true);
+		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(ip);
+		else
+			xfs_inode_clear_cowblocks_tag(ip);
+		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(tip);
+		else
+			xfs_inode_clear_cowblocks_tag(tip);
+	}
 
-	return error;
+	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
+	xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+	/*
+	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+	 * have inode number owner values in the bmbt blocks that still refer to
+	 * the old inode. Scan each bmbt to fix up the owner values with the
+	 * inode number of the current inode.
+	 */
+	if (src_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, ip, tip);
+		if (error)
+			return error;
+	}
+	if (target_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, tip, ip);
+		if (error)
+			return error;
+	}
+
+	return 0;
 }
 
 int
@@ -1502,10 +1552,8 @@ xfs_swap_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			src_log_flags, target_log_flags;
 	int			error = 0;
 	int			lock_flags;
-	uint64_t		f;
 	int			resblks = 0;
 
 	/*
@@ -1636,70 +1684,17 @@ xfs_swap_extents(
 	 * recovery is going to see the fork as owned by the swapped inode,
 	 * not the pre-swapped inodes.
 	 */
-	src_log_flags = XFS_ILOG_CORE;
-	target_log_flags = XFS_ILOG_CORE;
-
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		error = xfs_swapext_deferred_bmap(&tp, ip, tip, XFS_DATA_FORK,
 				0, 0, XFS_B_TO_FSB(ip->i_mount,
 						   i_size_read(VFS_I(ip))), 0);
 	else
-		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
-				&target_log_flags);
+		error = xfs_swap_extent_forks(&tp, ip, tip);
 	if (error) {
 		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
 	}
 
-	/* Do we have to swap reflink flags? */
-	if (!xfs_sb_version_hasrmapbt(&mp->m_sb) &&
-	    (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
-		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-		ASSERT(ip->i_cformat == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(tip->i_cformat == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cnextents, tip->i_cnextents);
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
-
-	xfs_trans_log_inode(tp, ip,  src_log_flags);
-	xfs_trans_log_inode(tp, tip, target_log_flags);
-
-	/*
-	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
-	 * have inode number owner values in the bmbt blocks that still refer to
-	 * the old inode. Scan each bmbt to fix up the owner values with the
-	 * inode number of the current inode.
-	 */
-	if (src_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, ip, tip);
-		if (error)
-			goto out_trans_cancel;
-	}
-	if (target_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, tip, ip);
-		if (error)
-			goto out_trans_cancel;
-	}
-
 	/*
 	 * If this is a synchronous mount, make sure that the
 	 * transaction goes to disk before returning to the user.



* [PATCH 16/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (14 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 15/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
@ 2020-04-29  2:45 ` Darrick J. Wong
  2020-04-29  2:46 ` [PATCH 17/18] xfs: remove old swap extents implementation Darrick J. Wong
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:45 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Refactor the old data fork swap function to use the new reflink flag
helpers to propagate reflink flags between the two files.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   34 +++++-----------------------------
 1 file changed, 5 insertions(+), 29 deletions(-)
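
Note (not part of the patch): the open-coded block that this patch
deletes exchanges a single flag bit between two flag words.  The bodies
of xfs_swapext_reflink_prep() and xfs_swapext_reflink_finish() are not
shown in this message, so the sketch below only restates the deleted
logic in standalone form.  DEMO_FLAG_REFLINK and swap_flag_bit() are
hypothetical names chosen for illustration.

```c
#include <stdint.h>

/* Hypothetical flag bit, standing in for XFS_DIFLAG2_REFLINK. */
#define DEMO_FLAG_REFLINK	(UINT64_C(1) << 1)

/*
 * Exchange one flag bit between two flag words, leaving all other bits
 * untouched.  This mirrors the open-coded block removed by the patch.
 */
void swap_flag_bit(uint64_t *a, uint64_t *b, uint64_t mask)
{
	uint64_t	f;

	/* Nothing to do if both words agree on the bit. */
	if (!((*a & mask) ^ (*b & mask)))
		return;

	f = *a & mask;
	*a &= ~mask;
	*a |= *b & mask;
	*b &= ~mask;
	*b |= f;
}
```

For example, starting from a = DEMO_FLAG_REFLINK | 0x1 and b = 0x8, the
call leaves a = 0x1 and b = 0x8 | DEMO_FLAG_REFLINK.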


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 1767f1586c46..639b42b1d568 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1404,10 +1404,14 @@ xfs_swap_extent_forks(
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
 	uint64_t		tmp;
+	unsigned int		state;
 	int			src_log_flags = XFS_ILOG_CORE;
 	int			target_log_flags = XFS_ILOG_CORE;
 	int			error;
 
+	state = xfs_swapext_reflink_prep(ip, tip, XFS_DATA_FORK, 0, 0,
+			XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip))));
+
 	/*
 	 * Count the number of extended attribute blocks
 	 */
@@ -1490,35 +1494,7 @@ xfs_swap_extent_forks(
 		break;
 	}
 
-	/* Do we have to swap reflink flags? */
-	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
-		uint64_t	f;
-
-		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb)) {
-		ASSERT(ip->i_cformat == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(tip->i_cformat == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cnextents, tip->i_cnextents);
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
+	xfs_swapext_reflink_finish(*tpp, ip, tip, state);
 
 	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
 	xfs_trans_log_inode(*tpp, tip, target_log_flags);



* [PATCH 17/18] xfs: remove old swap extents implementation
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (15 preceding siblings ...)
  2020-04-29  2:45 ` [PATCH 16/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
@ 2020-04-29  2:46 ` Darrick J. Wong
  2020-04-29  2:46 ` [PATCH 18/18] xfs: fix quota accounting in the old fork swap code Darrick J. Wong
  2020-05-01 19:46 ` [PATCH RFC 00/18] xfs: atomic file updates Jann Horn
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:46 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

Reimplement XFS_IOC_SWAPEXT as a thin wrapper around the new swap range
code (vfs_swap_file_range) and remove the old xfs_swap_extents
implementation.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |  193 ------------------------------------------------
 fs/xfs/xfs_bmap_util.h |    2 
 fs/xfs/xfs_ioctl.c     |  108 ++++++++-------------------
 3 files changed, 32 insertions(+), 271 deletions(-)
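
Note (not part of the patch): the removed xfs_swap_extents() rejected
the swap with -EBUSY when the target's ctime or mtime no longer matched
the values the caller had sampled.  The new wrapper forwards those
sampled values in the file2_* fields together with
FILE_SWAP_RANGE_FILE2_FRESH; how vfs_swap_file_range() evaluates them is
not shown in this message.  The sketch below restates only the removed
comparison.  struct demo_time, struct demo_stamp and demo_check_fresh()
are hypothetical names, not kernel types.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical timestamp; not the kernel's struct timespec64. */
struct demo_time {
	int64_t	sec;
	int32_t	nsec;
};

/* Hypothetical snapshot of a file's change and modify times. */
struct demo_stamp {
	struct demo_time	ctime;
	struct demo_time	mtime;
};

int demo_time_equal(const struct demo_time *a, const struct demo_time *b)
{
	return a->sec == b->sec && a->nsec == b->nsec;
}

/*
 * Return 0 if the file still carries the timestamps the caller sampled,
 * or -EBUSY if either one moved, as the removed xfs_swap_extents() did.
 */
int demo_check_fresh(const struct demo_stamp *sampled,
		     const struct demo_stamp *now)
{
	if (!demo_time_equal(&sampled->ctime, &now->ctime) ||
	    !demo_time_equal(&sampled->mtime, &now->mtime))
		return -EBUSY;
	return 0;
}
```

A difference of a single nanosecond in either timestamp is enough to
make the check fail.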


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 639b42b1d568..df373107e782 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1334,23 +1334,6 @@ xfs_swap_extents_check_format(
 	return 0;
 }
 
-static int
-xfs_swap_extent_flush(
-	struct xfs_inode	*ip)
-{
-	int	error;
-
-	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
-	if (error)
-		return error;
-	truncate_pagecache_range(VFS_I(ip), 0, -1);
-
-	/* Verify O_DIRECT for ftmp */
-	if (VFS_I(ip)->i_mapping->nrpages)
-		return -EINVAL;
-	return 0;
-}
-
 /*
  * Fix up the owners of the bmbt blocks to refer to the current inode. The
  * change owner scan attempts to order all modified buffers in the current
@@ -1519,181 +1502,6 @@ xfs_swap_extent_forks(
 	return 0;
 }
 
-int
-xfs_swap_extents(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip,	/* tmp inode */
-	struct xfs_swapext	*sxp)
-{
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
-	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			error = 0;
-	int			lock_flags;
-	int			resblks = 0;
-
-	/*
-	 * Lock the inodes against other IO, page faults and truncate to
-	 * begin with.  Then we can ensure the inodes are flushed and have no
-	 * page cache safely. Once we have done this we can take the ilocks and
-	 * do the rest of the checks.
-	 */
-	lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	lock_flags = XFS_MMAPLOCK_EXCL;
-	xfs_lock_two_inodes(ip, XFS_MMAPLOCK_EXCL, tip, XFS_MMAPLOCK_EXCL);
-
-	/* Verify that both files have the same format */
-	if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	/* Verify both files are either real-time or non-realtime */
-	if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_qm_dqattach(tip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_swap_extent_flush(ip);
-	if (error)
-		goto out_unlock;
-	error = xfs_swap_extent_flush(tip);
-	if (error)
-		goto out_unlock;
-
-	if (xfs_inode_has_cow_data(tip)) {
-		error = xfs_reflink_cancel_cow_range(tip, 0, NULLFILEOFF, true);
-		if (error)
-			goto out_unlock;
-	}
-
-	/*
-	 * Extent "swapping" with rmap requires a permanent reservation and
-	 * a block reservation because it's really just a remap operation
-	 * performed with log redo items!
-	 */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-		int		w	= XFS_DATA_FORK;
-		uint32_t	ipnext	= XFS_IFORK_NEXTENTS(ip, w);
-		uint32_t	tipnext	= XFS_IFORK_NEXTENTS(tip, w);
-
-		/*
-		 * Conceptually this shouldn't affect the shape of either bmbt,
-		 * but since we atomically move extents one by one, we reserve
-		 * enough space to rebuild both trees.
-		 */
-		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
-		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
-
-		/*
-		 * Handle the corner case where either inode might straddle the
-		 * btree format boundary. If so, the inode could bounce between
-		 * btree <-> extent format on unmap -> remap cycles, freeing and
-		 * allocating a bmapbt block each time.
-		 */
-		if (ipnext == (XFS_IFORK_MAXEXT(ip, w) + 1))
-			resblks += XFS_IFORK_MAXEXT(ip, w);
-		if (tipnext == (XFS_IFORK_MAXEXT(tip, w) + 1))
-			resblks += XFS_IFORK_MAXEXT(tip, w);
-	}
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, 0, &tp);
-	if (error)
-		goto out_unlock;
-
-	/*
-	 * Lock and join the inodes to the tansaction so that transaction commit
-	 * or cancel will unlock the inodes from this point onwards.
-	 */
-	xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
-	lock_flags |= XFS_ILOCK_EXCL;
-	xfs_trans_ijoin(tp, ip, 0);
-	xfs_trans_ijoin(tp, tip, 0);
-
-
-	/* Verify all data are being swapped */
-	if (sxp->sx_offset != 0 ||
-	    sxp->sx_length != ip->i_d.di_size ||
-	    sxp->sx_length != tip->i_d.di_size) {
-		error = -EFAULT;
-		goto out_trans_cancel;
-	}
-
-	trace_xfs_swap_extent_before(ip, 0);
-	trace_xfs_swap_extent_before(tip, 1);
-
-	/* check inode formats now that data is flushed */
-	error = xfs_swap_extents_check_format(ip, tip);
-	if (error) {
-		xfs_notice(mp,
-		    "%s: inode 0x%llx format is incompatible for exchanging.",
-				__func__, ip->i_ino);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Compare the current change & modify times with that
-	 * passed in.  If they differ, we abort this swap.
-	 * This is the mechanism used to ensure the calling
-	 * process that the file was not changed out from
-	 * under it.
-	 */
-	if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
-	    (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
-	    (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
-	    (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
-		error = -EBUSY;
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Note the trickiness in setting the log flags - we set the owner log
-	 * flag on the opposite inode (i.e. the inode we are setting the new
-	 * owner to be) because once we swap the forks and log that, log
-	 * recovery is going to see the fork as owned by the swapped inode,
-	 * not the pre-swapped inodes.
-	 */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
-		error = xfs_swapext_deferred_bmap(&tp, ip, tip, XFS_DATA_FORK,
-				0, 0, XFS_B_TO_FSB(ip->i_mount,
-						   i_size_read(VFS_I(ip))), 0);
-	else
-		error = xfs_swap_extent_forks(&tp, ip, tip);
-	if (error) {
-		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * If this is a synchronous mount, make sure that the
-	 * transaction goes to disk before returning to the user.
-	 */
-	if (mp->m_flags & XFS_MOUNT_WSYNC)
-		xfs_trans_set_sync(tp);
-
-	error = xfs_trans_commit(tp);
-
-	trace_xfs_swap_extent_after(ip, 0);
-	trace_xfs_swap_extent_after(tip, 1);
-
-out_unlock:
-	xfs_iunlock(ip, lock_flags);
-	xfs_iunlock(tip, lock_flags);
-	unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	return error;
-
-out_trans_cancel:
-	xfs_trans_cancel(tp);
-	goto out_unlock;
-}
-
 /* Prepare two files to have their data swapped. */
 int
 xfs_swap_range_prep(
@@ -2061,4 +1869,3 @@ xfs_swap_range(
 	xfs_trans_cancel(tp);
 	goto out_unlock;
 }
-
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index d3444a63bbd7..e0712c274dd2 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,8 +66,6 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 
-int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
-			 struct xfs_swapext *sx);
 int	xfs_swap_range_prep(struct file *file1, struct file *file2,
 			    struct file_swap_range *fsr);
 int	xfs_swap_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 274423ba3bb5..f93de4f7a944 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1864,81 +1864,47 @@ xfs_ioc_scrub_metadata(
 
 int
 xfs_ioc_swapext(
-	xfs_swapext_t	*sxp)
+	struct xfs_swapext	__user *arg)
 {
-	xfs_inode_t     *ip, *tip;
-	struct fd	f, tmp;
-	int		error = 0;
+	struct xfs_swapext	sx;
+	struct file_swap_range	fsr = { 0 };
+	struct fd		fd2, fd1;
+	int			error = 0;
 
-	/* Pull information for the target fd */
-	f = fdget((int)sxp->sx_fdtarget);
-	if (!f.file) {
-		error = -EINVAL;
-		goto out;
-	}
+	if (copy_from_user(&sx, arg, sizeof(struct xfs_swapext)))
+		return -EFAULT;
 
-	if (!(f.file->f_mode & FMODE_WRITE) ||
-	    !(f.file->f_mode & FMODE_READ) ||
-	    (f.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_file;
-	}
+	fd2 = fdget((int)sx.sx_fdtarget);
+	if (!fd2.file)
+		return -EINVAL;
 
-	tmp = fdget((int)sxp->sx_fdtmp);
-	if (!tmp.file) {
+	fd1 = fdget((int)sx.sx_fdtmp);
+	if (!fd1.file) {
 		error = -EINVAL;
-		goto out_put_file;
+		goto dest_fdput;
 	}
 
-	if (!(tmp.file->f_mode & FMODE_WRITE) ||
-	    !(tmp.file->f_mode & FMODE_READ) ||
-	    (tmp.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_tmp_file;
-	}
+	fsr.file1_fd = sx.sx_fdtmp;
+	fsr.length = sx.sx_length;
+	fsr.flags = FILE_SWAP_RANGE_NONATOMIC | FILE_SWAP_RANGE_FILE2_FRESH |
+		    FILE_SWAP_RANGE_FULL_FILES;
+	fsr.file2_ino = sx.sx_stat.bs_ino;
+	fsr.file2_mtime = sx.sx_stat.bs_mtime.tv_sec;
+	fsr.file2_ctime = sx.sx_stat.bs_ctime.tv_sec;
+	fsr.file2_mtime_nsec = sx.sx_stat.bs_mtime.tv_nsec;
+	fsr.file2_ctime_nsec = sx.sx_stat.bs_ctime.tv_nsec;
 
-	if (IS_SWAPFILE(file_inode(f.file)) ||
-	    IS_SWAPFILE(file_inode(tmp.file))) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
+	error = vfs_swap_file_range(fd1.file, fd2.file, &fsr);
 
 	/*
-	 * We need to ensure that the fds passed in point to XFS inodes
-	 * before we cast and access them as XFS structures as we have no
-	 * control over what the user passes us here.
+	 * The old implementation returned EFAULT if the swap range was not
+	 * the entirety of both files.
 	 */
-	if (f.file->f_op != &xfs_file_operations ||
-	    tmp.file->f_op != &xfs_file_operations) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	ip = XFS_I(file_inode(f.file));
-	tip = XFS_I(file_inode(tmp.file));
-
-	if (ip->i_mount != tip->i_mount) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (ip->i_ino == tip->i_ino) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-		error = -EIO;
-		goto out_put_tmp_file;
-	}
-
-	error = xfs_swap_extents(ip, tip, sxp);
-
- out_put_tmp_file:
-	fdput(tmp);
- out_put_file:
-	fdput(f);
- out:
+	if (error == -EDOM)
+		error = -EFAULT;
+	fdput(fd1);
+dest_fdput:
+	fdput(fd2);
 	return error;
 }
 
@@ -2183,18 +2149,8 @@ xfs_file_ioctl(
 	case XFS_IOC_ATTRMULTI_BY_HANDLE:
 		return xfs_attrmulti_by_handle(filp, arg);
 
-	case XFS_IOC_SWAPEXT: {
-		struct xfs_swapext	sxp;
-
-		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
-			return -EFAULT;
-		error = mnt_want_write_file(filp);
-		if (error)
-			return error;
-		error = xfs_ioc_swapext(&sxp);
-		mnt_drop_write_file(filp);
-		return error;
-	}
+	case XFS_IOC_SWAPEXT:
+		return xfs_ioc_swapext(arg);
 
 	case XFS_IOC_FSCOUNTS: {
 		xfs_fsop_counts_t out;



* [PATCH 18/18] xfs: fix quota accounting in the old fork swap code
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (16 preceding siblings ...)
  2020-04-29  2:46 ` [PATCH 17/18] xfs: remove old swap extents implementation Darrick J. Wong
@ 2020-04-29  2:46 ` Darrick J. Wong
  2020-05-01 19:46 ` [PATCH RFC 00/18] xfs: atomic file updates Jann Horn
  18 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-04-29  2:46 UTC (permalink / raw)
  To: darrick.wong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <darrick.wong@oracle.com>

The old fork swapping code doesn't update quota block counts when it
swaps data forks.  Fix it by adjusting each inode's dquot block count by
the change in that inode's block count.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 fs/xfs/xfs_bmap_util.c |   10 ++++++++++
 1 file changed, 10 insertions(+)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index df373107e782..de6d1747a3fa 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1386,6 +1386,7 @@ xfs_swap_extent_forks(
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
+	int64_t			temp_blks;
 	uint64_t		tmp;
 	unsigned int		state;
 	int			src_log_flags = XFS_ILOG_CORE;
@@ -1432,6 +1433,15 @@ xfs_swap_extent_forks(
 	 */
 	swap(ip->i_df, tip->i_df);
 
+	/* Update quota accounting. */
+	temp_blks = tip->i_d.di_nblocks - taforkblks + aforkblks;
+	xfs_trans_mod_dquot_byino(*tpp, ip, XFS_TRANS_DQ_BCOUNT,
+			temp_blks - ip->i_d.di_nblocks);
+
+	temp_blks = ip->i_d.di_nblocks + taforkblks - aforkblks;
+	xfs_trans_mod_dquot_byino(*tpp, tip, XFS_TRANS_DQ_BCOUNT,
+			temp_blks - tip->i_d.di_nblocks);
+
 	/*
 	 * Fix the on-disk inode values
 	 */



* Re: [PATCH RFC 00/18] xfs: atomic file updates
  2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (17 preceding siblings ...)
  2020-04-29  2:46 ` [PATCH 18/18] xfs: fix quota accounting in the old fork swap code Darrick J. Wong
@ 2020-05-01 19:46 ` Jann Horn
  2020-05-01 20:11   ` Darrick J. Wong
  18 siblings, 1 reply; 22+ messages in thread
From: Jann Horn @ 2020-05-01 19:46 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, Linux API

On Wed, Apr 29, 2020 at 4:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> This series creates a new log incompat feature and log intent items to
> track high level progress of swapping ranges of two files and finish
> interrupted work if the system goes down.  It then adds a new
> FISWAPRANGE ioctl so that userspace can access the atomic extent
> swapping feature.  With this feature, user programs will be able to
> update files atomically by opening an O_TMPFILE, reflinking the source
> file to it, making whatever updates they want to make, and then
> atomically swap the changed bits back to the source file.  It even has
> an optional ability to detect a changed source file and reject the
> update.
>
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory, fsync the new
> file, rename the new file on top of the old filename, and then fsync the
> directory.  People get it wrong all the time, and $fs hacks abound.
>
> With atomic file updates, this is no longer necessary.  Programmers
> create an O_TMPFILE, optionally FICLONE the file contents into the
> temporary file, make whatever changes they want to the tempfile, and
> FISWAPRANGE the contents from the tempfile into the regular file.

That also requires the *readers* to be atomic though, right? Since now
the updates are visible to readers instantly, instead of only on the
next open()? If you used this to update /etc/passwd while someone else
is in the middle of reading it with a sequence of read() calls, there
would be fireworks...
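
To make the contrast concrete, here is a minimal userspace sketch of the
rename-based scheme described in the cover letter, plus a reader that holds
a descriptor open across the update.  It does not use FISWAPRANGE at all.
The helper names (write_all, replace_by_rename, reader_view_across_rename)
and the file name swapdemo.txt are made up for illustration.  The point it
shows is that with rename the old descriptor keeps reading the old inode and
only a fresh open() sees the new contents.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write all of buf to fd, retrying on short writes.  Returns 0 or -1. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0)
			return -1;
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Hypothetical helper: replace dir/name the traditional way.  Write a temp
 * file in the same directory, fsync it, rename it over the target, then
 * fsync the directory.  Returns 0 on success, -1 on failure.
 */
static int replace_by_rename(const char *dir, const char *name,
			     const char *data, size_t len)
{
	char tmp[4096], dst[4096];
	int fd, dfd, ret;

	assert(dir && name && data);
	snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
	snprintf(dst, sizeof(dst), "%s/%s", dir, name);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	if (write_all(fd, data, len) || fsync(fd)) {
		close(fd);
		unlink(tmp);
		return -1;
	}
	close(fd);
	if (rename(tmp, dst)) {
		unlink(tmp);
		return -1;
	}
	dfd = open(dir, O_RDONLY);
	if (dfd < 0)
		return -1;
	ret = fsync(dfd);
	if (ret && errno == EINVAL)
		ret = 0;	/* this fs has no directory fsync */
	close(dfd);
	return ret ? -1 : 0;
}

/*
 * A reader opens the file, the writer replaces it by rename, and then both
 * the old descriptor and a fresh open() read the first byte.
 * Returns 0 on success, -1 on error.
 */
static int reader_view_across_rename(const char *dir, char *old_fd_byte,
				     char *new_open_byte)
{
	char path[4096];
	int old_fd, new_fd, ret = -1;

	snprintf(path, sizeof(path), "%s/swapdemo.txt", dir);
	if (replace_by_rename(dir, "swapdemo.txt", "old\n", 4))
		return -1;
	old_fd = open(path, O_RDONLY);		/* reader starts here */
	if (old_fd < 0)
		return -1;
	if (replace_by_rename(dir, "swapdemo.txt", "new\n", 4))
		goto out;
	new_fd = open(path, O_RDONLY);		/* the "next open()" */
	if (new_fd < 0)
		goto out;
	if (read(old_fd, old_fd_byte, 1) == 1 &&
	    read(new_fd, new_open_byte, 1) == 1)
		ret = 0;
	close(new_fd);
out:
	close(old_fd);
	unlink(path);
	return ret;
}
```

Calling reader_view_across_rename() on a writable directory should leave
the old descriptor looking at the "old" data and the fresh open looking at
the "new" data.  An in-place swap changes what the already-open descriptor
sees, which is the case I am worried about.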

I guess maybe the new API could also be wired up to ext4's
EXT4_IOC_MOVE_EXT somehow, provided that the caller specifies
FILE_SWAP_RANGE_NONATOMIC?


* Re: [PATCH RFC 00/18] xfs: atomic file updates
  2020-05-01 19:46 ` [PATCH RFC 00/18] xfs: atomic file updates Jann Horn
@ 2020-05-01 20:11   ` Darrick J. Wong
  0 siblings, 0 replies; 22+ messages in thread
From: Darrick J. Wong @ 2020-05-01 20:11 UTC (permalink / raw)
  To: Jann Horn; +Cc: linux-xfs, linux-fsdevel, Linux API

On Fri, May 01, 2020 at 09:46:07PM +0200, Jann Horn wrote:
> On Wed, Apr 29, 2020 at 4:46 AM Darrick J. Wong <darrick.wong@oracle.com> wrote:
> > This series creates a new log incompat feature and log intent items to
> > track high level progress of swapping ranges of two files and finish
> > interrupted work if the system goes down.  It then adds a new
> > FISWAPRANGE ioctl so that userspace can access the atomic extent
> > swapping feature.  With this feature, user programs will be able to
> > update files atomically by opening an O_TMPFILE, reflinking the source
> > file to it, making whatever updates they want to make, and then
> > atomically swap the changed bits back to the source file.  It even has
> > an optional ability to detect a changed source file and reject the
> > update.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> >
> > With atomic file updates, this is no longer necessary.  Programmers
> > create an O_TMPFILE, optionally FICLONE the file contents into the
> > temporary file, make whatever changes they want to the tempfile, and
> > FISWAPRANGE the contents from the tempfile into the regular file.
> 
> That also requires the *readers* to be atomic though, right? Since now
> the updates are visible to readers instantly, instead of only on the
> next open()? If you used this to update /etc/passwd while someone else
> is in the middle of reading it with a sequence of read() calls, there
> would be fireworks...

Right.  In XFS, we guarantee read atomicity by grabbing i_rwsem and
the xfs mmap lock, breaking any layout leases, draining the directios,
and then flushing and invalidating the page cache.  Once that
preparation step is done, we do the actual extent swap.

> I guess maybe the new API could also be wired up to ext4's
> EXT4_IOC_MOVE_EXT somehow, provided that the caller specifies
> FILE_SWAP_RANGE_NONATOMIC?

Sort of.  ext4's MOVE_EXT also swaps the file contents, one
buffer_head at a time, so you'd have to turn that off since this API
assumes that the caller has already set each file's contents.

Ted has theorized that so long as the extent map size is less than 1/4
of the journal then it would be possible to do atomic swaps in ext4
without adding all the logical log item bits that were a prerequisite
for the xfs implementation.

--D


* Re: [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions
  2020-04-29  2:44 ` [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
@ 2020-05-01 22:54   ` Allison Collins
  0 siblings, 0 replies; 22+ messages in thread
From: Allison Collins @ 2020-05-01 22:54 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 4/28/20 7:44 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
> 
> Fix the return value of xfs_reflink_remap_prep so that its calling
> conventions match the rest of xfs.
> 
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
Looks fine to me:
Reviewed-by: Allison Collins <allison.henderson@oracle.com>

> ---
>   fs/xfs/xfs_file.c    |    2 +-
>   fs/xfs/xfs_reflink.c |    6 +++---
>   2 files changed, 4 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 994fd3d59872..1759fbcbcd46 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -1029,7 +1029,7 @@ xfs_file_remap_range(
>   	/* Prepare and then clone file data. */
>   	ret = xfs_reflink_remap_prep(file_in, pos_in, file_out, pos_out,
>   			&len, remap_flags);
> -	if (ret < 0 || len == 0)
> +	if (ret || len == 0)
>   		return ret;
>   
>   	trace_xfs_reflink_remap_range(src, pos_in, len, dest, pos_out);
> diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> index d8c8b299cb1f..5e978d1f169d 100644
> --- a/fs/xfs/xfs_reflink.c
> +++ b/fs/xfs/xfs_reflink.c
> @@ -1375,7 +1375,7 @@ xfs_reflink_remap_prep(
>   	struct inode		*inode_out = file_inode(file_out);
>   	struct xfs_inode	*dest = XFS_I(inode_out);
>   	bool			same_inode = (inode_in == inode_out);
> -	ssize_t			ret;
> +	int			ret;
>   
>   	/* Lock both files against IO */
>   	ret = xfs_iolock_two_inodes_and_break_layout(inode_in, inode_out);
> @@ -1399,7 +1399,7 @@ xfs_reflink_remap_prep(
>   
>   	ret = generic_remap_file_range_prep(file_in, pos_in, file_out, pos_out,
>   			len, remap_flags);
> -	if (ret < 0 || *len == 0)
> +	if (ret || *len == 0)
>   		goto out_unlock;
>   
>   	/* Attach dquots to dest inode before changing block map */
> @@ -1434,7 +1434,7 @@ xfs_reflink_remap_prep(
>   	if (ret)
>   		goto out_unlock;
>   
> -	return 1;
> +	return 0;
>   out_unlock:
>   	xfs_reflink_remap_unlock(file_in, file_out);
>   	return ret;
> 
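
For anyone following along, here is a small userspace sketch of the calling
convention the patch moves to: zero on success, negative errno on failure,
with a trimmed length of zero meaning there is nothing left to do.  The
functions demo_remap_prep and demo_remap_range are hypothetical stand-ins,
not the XFS code, and the trimming rule is only an assumption to make the
example do something.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical prep step.  Returns 0 on success (possibly trimming *len,
 * even down to 0) or a negative errno on failure.  It never returns a
 * positive value, so callers can test "if (ret)".
 */
static int demo_remap_prep(long long pos, long long file_size, long long *len)
{
	assert(len != NULL);
	if (pos < 0 || *len < 0)
		return -EINVAL;
	if (pos >= file_size) {
		*len = 0;		/* nothing to remap past EOF */
		return 0;
	}
	if (pos + *len > file_size)
		*len = file_size - pos;	/* trim the request to EOF */
	return 0;
}

/*
 * Hypothetical caller.  With the 0-or-negative convention, one check
 * covers both "prep failed" and "nothing to do", and ret is already the
 * right value to return in either case.
 */
static int demo_remap_range(long long pos, long long file_size, long long len,
			    long long *done)
{
	int ret;

	*done = 0;
	ret = demo_remap_prep(pos, file_size, &len);
	if (ret || len == 0)
		return ret;
	*done = len;			/* stand-in for the actual remap */
	return 0;
}
```

If the prep step still returned 1 on success, the "if (ret || len == 0)"
check would bail out of every call and hand the 1 back to the caller, which
is why the return value and both checks change together in this patch.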


Thread overview: 22+ messages
2020-04-29  2:44 [PATCH RFC 00/18] xfs: atomic file updates Darrick J. Wong
2020-04-29  2:44 ` [PATCH 01/18] xfs: clean up the error handling in xfs_swap_extent_rmap Darrick J. Wong
2020-04-29  2:44 ` [PATCH 02/18] xfs: fix xfs_reflink_remap_prep calling conventions Darrick J. Wong
2020-05-01 22:54   ` Allison Collins
2020-04-29  2:44 ` [PATCH 03/18] vfs: introduce new file extent swap ioctl Darrick J. Wong
2020-04-29  2:44 ` [PATCH 04/18] xfs: support deferred bmap updates on the attr fork Darrick J. Wong
2020-04-29  2:44 ` [PATCH 05/18] xfs: xfs_bmap_finish_one should map unwritten extents properly Darrick J. Wong
2020-04-29  2:44 ` [PATCH 06/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
2020-04-29  2:45 ` [PATCH 07/18] xfs: allow deferred ops items to put themselves at the end of the pending queue Darrick J. Wong
2020-04-29  2:45 ` [PATCH 08/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
2020-04-29  2:45 ` [PATCH 09/18] xfs: create deferred log items for extent swapping Darrick J. Wong
2020-04-29  2:45 ` [PATCH 10/18] xfs: refactor locking and unlocking two inodes against userspace IO Darrick J. Wong
2020-04-29  2:45 ` [PATCH 11/18] xfs: add a ->swap_file_range handler Darrick J. Wong
2020-04-29  2:45 ` [PATCH 12/18] xfs: add error injection to test swapext recovery Darrick J. Wong
2020-04-29  2:45 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
2020-04-29  2:45 ` [PATCH 14/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
2020-04-29  2:45 ` [PATCH 15/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
2020-04-29  2:45 ` [PATCH 16/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
2020-04-29  2:46 ` [PATCH 17/18] xfs: remove old swap extents implementation Darrick J. Wong
2020-04-29  2:46 ` [PATCH 18/18] xfs: fix quota accounting in the old fork swap code Darrick J. Wong
2020-05-01 19:46 ` [PATCH RFC 00/18] xfs: atomic file updates Jann Horn
2020-05-01 20:11   ` Darrick J. Wong
