linux-api.vger.kernel.org archive mirror
* [PATCHSET RFC v3 00/18] xfs: atomic file updates
@ 2021-04-01  1:08 Darrick J. Wong
  2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
                   ` (18 more replies)
  0 siblings, 19 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:08 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

Hi all,

This series creates a new FIEXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Callers can arrange for the update to be rejected if the
original file has been changed.
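
As a rough illustration of the freshness check described above, the
sketch below fills in a request from a stat snapshot of the original
file taken before staging begins.  It mirrors the proposed struct
file_xchg_range and flag bits locally, since <linux/fiexchange.h> is
only proposed by this series; the names xchg_req, XCHG_* and
xchg_req_snapshot are hypothetical, and the ioctl call itself is left
out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/* Local mirror of the proposed struct file_xchg_range (an assumption). */
struct xchg_req {
	int64_t		file1_fd;
	int64_t		file1_offset;
	int64_t		file2_offset;
	int64_t		length;
	uint64_t	flags;
	int64_t		file2_ino;
	int64_t		file2_mtime;
	int64_t		file2_ctime;
	int32_t		file2_mtime_nsec;
	int32_t		file2_ctime_nsec;
	uint64_t	pad[6];
};

/* Flag bits as proposed in this RFC; the XCHG_ names are hypothetical. */
#define XCHG_FILE2_FRESH	(1ULL << 1)
#define XCHG_TO_EOF		(1ULL << 3)
#define XCHG_FSYNC		(1ULL << 4)
#define XCHG_COMMIT		(XCHG_FILE2_FRESH | XCHG_FSYNC)

/*
 * Prepare a "commit everything from offset 0 to EOF" request.  temp_fd is
 * the staging file (file1); orig is a stat snapshot of the original file
 * (file2) taken before the updates were staged.
 */
void xchg_req_snapshot(struct xchg_req *req, int temp_fd,
		       const struct stat *orig)
{
	memset(req, 0, sizeof(*req));	/* pad must be zero */
	req->file1_fd = temp_fd;
	req->flags = XCHG_COMMIT | XCHG_TO_EOF;
	req->file2_ino = (int64_t)orig->st_ino;
	req->file2_mtime = (int64_t)orig->st_mtim.tv_sec;
	req->file2_mtime_nsec = (int32_t)orig->st_mtim.tv_nsec;
	req->file2_ctime = (int64_t)orig->st_ctim.tv_sec;
	req->file2_ctime_nsec = (int32_t)orig->st_ctim.tv_nsec;
}
```

A caller would fstat() the original file first, stage its updates in
the temp file, and then pass the prepared request to the ioctl on the
original file's descriptor.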

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
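
For comparison, a minimal sketch of that traditional sequence might
look like the following.  The function name replace_file_rename and the
".tmp" suffix are illustrative assumptions, and a production version
would need a unique temporary name and more careful error reporting.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
	while (len > 0) {
		ssize_t n = write(fd, buf, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		buf += n;
		len -= (size_t)n;
	}
	return 0;
}

/* Replace dir/name with buf via write-new, fsync, rename, fsync-dir. */
int replace_file_rename(const char *dir, const char *name,
			const char *buf, size_t len)
{
	char tmp[256];
	int dirfd, fd, n, ret = -1;

	n = snprintf(tmp, sizeof(tmp), "%s.tmp", name);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;

	/* 1. Write the new contents to a new file in the same directory. */
	fd = openat(dirfd, tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		goto out_dir;
	if (write_all(fd, buf, len) < 0)
		goto out_tmp;

	/* 2. fsync the new file so its data is stable before the rename. */
	if (fsync(fd) < 0)
		goto out_tmp;

	/* 3. Rename the new file on top of the old filename. */
	if (renameat(dirfd, tmp, dirfd, name) < 0)
		goto out_tmp;

	/* 4. fsync the directory so the rename itself is persisted. */
	ret = fsync(dirfd);
	close(fd);
	goto out_dir;

out_tmp:
	close(fd);
	unlinkat(dirfd, tmp, 0);
out_dir:
	close(dirfd);
	return ret;
}
```

Note that this replaces the whole file and gives it a new inode, which
is part of what the range exchange is meant to avoid.
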
Here is the proposed manual page:

IOCTL-FIEXCHANGE_RANGE(2) Linux Programmer's Manual IOCTL-FIEXCHANGE_RANGE(2)

NAME
       ioctl_fiexchange_range  - exchange the contents of parts of two
       files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <linux/fiexchange.h>

       int    ioctl(int     file2_fd,     FIEXCHANGE_RANGE,     struct
       file_xchg_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The exchange parameters are conveyed in a structure of the fol‐
       lowing form:

           struct file_xchg_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
               __s64    length;

               __u64    flags;

               __s64    file2_ino;
               __s64    file2_mtime;
               __s64    file2_ctime;
               __s32    file2_mtime_nsec;
               __s32    file2_ctime_nsec;

               __u64    pad[6];
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The file descriptor file2_fd and the  fields  file2_offset  and
       length define the second range of bytes to be exchanged.

       Both files must be from the same filesystem mount.  If the  two
       file  descriptors represent the same file, the byte ranges must
       not overlap.  Most  disk-based  filesystems  require  that  the
       starts  of  both ranges must be aligned to the file block size.
       If this is the case, the ends of the ranges  must  also  be  so
       aligned unless the FILE_XCHG_RANGE_TO_EOF flag is set.
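
A caller could pre-check these rules in userspace before issuing the
ioctl.  The sketch below is one conservative reading of the paragraph
above; the helper names are hypothetical, and blocksize is assumed to
come from something like st_blksize or a filesystem-specific query.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Both starting offsets must be block aligned.  The length must be block
 * aligned as well unless the caller intends to set the TO_EOF flag.
 */
bool xchg_range_is_aligned(int64_t off1, int64_t off2, int64_t len,
			   uint32_t blocksize, bool to_eof)
{
	if (off1 < 0 || off2 < 0 || len < 0 || blocksize == 0)
		return false;
	if (off1 % blocksize || off2 % blocksize)
		return false;
	if (!to_eof && len % blocksize)
		return false;
	return true;
}

/* When both descriptors refer to the same file, ranges must not overlap. */
bool xchg_ranges_overlap(int64_t off1, int64_t off2, int64_t len)
{
	return off2 + len > off1 && off1 + len > off2;
}
```

The kernel remains the authority here; a userspace check like this
only helps to report likely EINVAL cases earlier.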

       The field flags controls the behavior of the exchange operation.

           FILE_XCHG_RANGE_FILE2_FRESH
                  Check  the  freshness  of file2_fd after locking the
                  file but before exchanging the contents.   The  sup‐
                  plied  file2_ino field must match file2's inode num‐
                  ber, and the supplied file2_mtime, file2_mtime_nsec,
                  file2_ctime,  and file2_ctime_nsec fields must match
                  the modification time and change time of file2.   If
                  they do not match, EBUSY will be returned.

           FILE_XCHG_RANGE_TO_EOF
                  Ignore  the length parameter.  All bytes in file1_fd
                  from file1_offset to EOF are moved to file2_fd,  and
                  file2's  size is set to (file2_offset+(file1_length-
                  file1_offset)).  Meanwhile, all bytes in file2  from
                  file2_offset  to  EOF are moved to file1 and file1's
                  size   is   set   to    (file1_offset+(file2_length-
                  file2_offset)).   This option is not compatible with
                  FILE_XCHG_RANGE_FULL_FILES.

           FILE_XCHG_RANGE_FSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           FILE_XCHG_RANGE_SKIP_FILE1_HOLES
                  Skip  sub-ranges  of  file1_fd that are known not to
                  contain data.  This facility can be used  to  imple‐
                  ment  atomic scatter-gather writes of any complexity
                  for software-defined storage targets.

           FILE_XCHG_RANGE_DRY_RUN
                  Check the parameters and the feasibility of the  op‐
                  eration, but do not change anything.

           FILE_XCHG_RANGE_COMMIT
                  This      flag      is      a     combination     of
                  FILE_XCHG_RANGE_FILE2_FRESH |  FILE_XCHG_RANGE_FSYNC
                  and  can  be  used  to commit changes to file2_fd to
                  persistent storage if and  only  if  file2  has  not
                  changed.

           FILE_XCHG_RANGE_FULL_FILES
                  Require that file1_offset and file2_offset are zero,
                  and that the length field  matches  the  lengths  of
                  both  files.   If  not, EDOM will be returned.  This
                  option      is       not       compatible       with
                  FILE_XCHG_RANGE_TO_EOF.

           FILE_XCHG_RANGE_NONATOMIC
                  This  flag  relaxes the requirement that readers see
                  only the old contents or the new contents  in  their
                  entirety.   If  the system fails before all modified
                  in-core data and metadata updates are  persisted  to
                  disk,  the contents of both file ranges after recov‐
                  ery are not defined and may be a mix of both.

                  Do not use this flag unless  the  contents  of  both
                  ranges  are  known  to be identical and there are no
                  other writers.
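
The size arithmetic given for FILE_XCHG_RANGE_TO_EOF above can be
worked through with a small helper.  This is only a sketch of the two
formulas as written; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct xchg_sizes {
	int64_t file1_size;	/* file1's size after the exchange */
	int64_t file2_size;	/* file2's size after the exchange */
};

/* Apply the TO_EOF size formulas to the pre-exchange file sizes. */
struct xchg_sizes xchg_to_eof_sizes(int64_t file1_size, int64_t file1_offset,
				    int64_t file2_size, int64_t file2_offset)
{
	struct xchg_sizes s;

	/* file2 receives file1's tail: file2_offset + (file1 tail length) */
	s.file2_size = file2_offset + (file1_size - file1_offset);
	/* file1 receives file2's tail: file1_offset + (file2 tail length) */
	s.file1_size = file1_offset + (file2_size - file2_offset);
	return s;
}
```

For example, with a 10000-byte file1 exchanged from offset 4096 and a
20000-byte file2 exchanged from offset 8192, file2 ends up 14096 bytes
long and file1 ends up 15904 bytes long.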

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EBUSY  The inode number and timestamps supplied  do  not  match
              file2_fd  and  FILE_XCHG_RANGE_FILE2_FRESH  was  set  in
              flags.

       EDOM   The ranges do not cover the entirety of both files,  and
              FILE_XCHG_RANGE_FULL_FILES was set in flags.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free  space  in  the  filesystem  to
              exchange the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd is immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.
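
One way an application might react to these errors is sketched below.
The grouping into "restage", "fall back" and "fail" is an assumption
about sensible caller policy, not something this manual page
prescribes, and the enum and function names are hypothetical.  ENOTTY
is included because that is what ioctl(2) generally returns for a
request the kernel does not recognize.

```c
#include <assert.h>
#include <errno.h>

enum xchg_action {
	XCHG_DONE,	/* exchange succeeded */
	XCHG_RESTAGE,	/* file2 changed; redo the staging and retry */
	XCHG_FALLBACK,	/* exchange unavailable; use another update path */
	XCHG_FAIL,	/* report the error to the caller */
};

/* Map an errno value from the ioctl (0 on success) to a caller action. */
enum xchg_action xchg_classify(int err)
{
	switch (err) {
	case 0:
		return XCHG_DONE;
	case EBUSY:
		/* FILE2_FRESH was set and the freshness check failed. */
		return XCHG_RESTAGE;
	case EOPNOTSUPP:
	case EXDEV:
	case ENOTTY:
		return XCHG_FALLBACK;
	default:
		return XCHG_FAIL;
	}
}
```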

CONFORMING TO
       This API is Linux-specific.

USE CASES
       Three use cases are imagined for this system call.

       The  first  is a filesystem defragmenter, which copies the con‐
       tents of a file into another file and wishes  to  exchange  the
       space  mappings  of  the  two files, provided that the original
       file has not changed.  The flags NONATOMIC and FILE2_FRESH  are
       recommended for this application.

       The  second is a data storage program that wants to commit non-
       contiguous updates to a file atomically.  This can be  done  by
       creating a temporary file, calling FICLONE(2) to share the con‐
       tents, and staging the updates into the temporary file.  Either
       of  the  FULL_FILES or TO_EOF flags is recommended, along with
       FSYNC.  Depending on  the  application's  locking  design,  the
       flags FILE2_FRESH or COMMIT may be applicable here.  The tempo‐
       rary file can be deleted or punched out afterwards.

       The third is a software-defined storage host (e.g. a disk juke‐
       box)  which  implements an atomic scatter-gather write command.
       Provided the exported disk's logical  block  size  matches  the
       file's  allocation  unit  size,  this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       Use this call with the SKIP_FILE1_HOLES flag to exchange only the
       blocks involved in the write command.  The  use  of  the  FSYNC
       flag is recommended here.  The temporary file should be deleted
       or punched out completely before being reused to stage  another
       write.

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

Linux                         2021-04-01     IOCTL-FIEXCHANGE_RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic file writes concept
that has also been floating around for years.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file swap is
implemented as an atomic inode fork swap, which means that we can
implement online reconstruction of extended attributes and directories
by building a new one in another inode and atomically swapping the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic extent swapping.  This enables repair functions to construct a
clean copy of a directory, xattr information, realtime bitmaps, and
realtime summary information in a temporary inode.  If this completes
successfully, the new contents can be swapped atomically into the inode
being repaired.  This is essential to avoid making corruption problems
worse if the system goes down in the middle of running repair.

If you're going to start using this mess, you probably ought to just
pull from my git trees, which are linked below.

This is an extraordinary way to destroy everything.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
 Documentation/filesystems/vfs.rst |   16 +
 fs/ioctl.c                        |   42 ++
 fs/remap_range.c                  |  283 ++++++++++
 fs/xfs/Makefile                   |    3 
 fs/xfs/libxfs/xfs_bmap.h          |    4 
 fs/xfs/libxfs/xfs_defer.c         |   49 +-
 fs/xfs/libxfs/xfs_defer.h         |   11 
 fs/xfs/libxfs/xfs_errortag.h      |    4 
 fs/xfs/libxfs/xfs_format.h        |   37 +
 fs/xfs/libxfs/xfs_fs.h            |    2 
 fs/xfs/libxfs/xfs_log_format.h    |   63 ++
 fs/xfs/libxfs/xfs_log_recover.h   |    4 
 fs/xfs/libxfs/xfs_sb.c            |    2 
 fs/xfs/libxfs/xfs_shared.h        |    6 
 fs/xfs/libxfs/xfs_swapext.c       | 1030 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h       |   89 +++
 fs/xfs/xfs_bmap_item.c            |   13 
 fs/xfs/xfs_bmap_util.c            |  611 ----------------------
 fs/xfs/xfs_bmap_util.h            |    3 
 fs/xfs/xfs_error.c                |    3 
 fs/xfs/xfs_extfree_item.c         |    2 
 fs/xfs/xfs_file.c                 |   49 ++
 fs/xfs/xfs_inode.c                |   13 
 fs/xfs/xfs_inode.h                |    1 
 fs/xfs/xfs_ioctl.c                |  102 +---
 fs/xfs/xfs_ioctl.h                |    4 
 fs/xfs/xfs_ioctl32.c              |    8 
 fs/xfs/xfs_log.c                  |   65 ++
 fs/xfs/xfs_log.h                  |    3 
 fs/xfs/xfs_log_priv.h             |    3 
 fs/xfs/xfs_log_recover.c          |   57 ++
 fs/xfs/xfs_mount.c                |  110 ++++
 fs/xfs/xfs_mount.h                |    2 
 fs/xfs/xfs_refcount_item.c        |    2 
 fs/xfs/xfs_rmap_item.c            |    2 
 fs/xfs/xfs_super.c                |   17 +
 fs/xfs/xfs_swapext_item.c         |  649 +++++++++++++++++++++++
 fs/xfs/xfs_swapext_item.h         |   61 ++
 fs/xfs/xfs_trace.c                |    1 
 fs/xfs/xfs_trace.h                |  196 +++++++
 fs/xfs/xfs_trans.c                |   14 -
 fs/xfs/xfs_xchgrange.c            |  772 ++++++++++++++++++++++++++++
 fs/xfs/xfs_xchgrange.h            |   30 +
 include/linux/fs.h                |   14 -
 include/uapi/linux/fiexchange.h   |  101 ++++
 45 files changed, 3807 insertions(+), 746 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.c
 create mode 100644 fs/xfs/libxfs/xfs_swapext.h
 create mode 100644 fs/xfs/xfs_swapext_item.c
 create mode 100644 fs/xfs/xfs_swapext_item.h
 create mode 100644 fs/xfs/xfs_xchgrange.c
 create mode 100644 fs/xfs/xfs_xchgrange.h
 create mode 100644 include/uapi/linux/fiexchange.h



* [PATCH 01/18] vfs: introduce new file range exchange ioctl
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
@ 2021-04-01  1:08 ` Darrick J. Wong
  2021-04-01  1:44   ` Al Viro
  2021-04-01  3:32   ` Amir Goldstein
  2021-04-01  1:08 ` [PATCH 02/18] xfs: support two inodes in the defer capture structure Darrick J. Wong
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:08 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new ioctl to handle swapping ranges of bytes between files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 Documentation/filesystems/vfs.rst |   16 ++
 fs/ioctl.c                        |   42 +++++
 fs/remap_range.c                  |  283 +++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h            |    1 
 include/linux/fs.h                |   14 ++
 include/uapi/linux/fiexchange.h   |  101 +++++++++++++
 6 files changed, 456 insertions(+), 1 deletion(-)
 create mode 100644 include/uapi/linux/fiexchange.h


diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 2049bbf5e388..9f16b260bc7e 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -1006,6 +1006,8 @@ This describes how the VFS can manipulate an open file.  As of kernel
 		loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
 					   struct file *file_out, loff_t pos_out,
 					   loff_t len, unsigned int remap_flags);
+                int (*xchg_file_range)(struct file *file1, struct file *file2,
+                                       struct file_xchg_range *fxr);
 		int (*fadvise)(struct file *, loff_t, loff_t, int);
 	};
 
@@ -1124,6 +1126,20 @@ otherwise noted.
 	ok with the implementation shortening the request length to
 	satisfy alignment or EOF requirements (or any other reason).
 
+``xchg_file_range``
+	called by the ioctl(2) system call for FIEXCHANGE_RANGE to exchange the
+	contents of two file ranges.  An implementation should exchange
+	fxr.length bytes starting at fxr.file1_offset in file1 with the same
+	number of bytes starting at fxr.file2_offset in file2.  Refer to
+	the fiexchange.h file for more information.  Implementations must call
+	generic_xchg_file_range_prep to prepare the two files prior to taking
+	locks; they must call generic_xchg_file_range_check_fresh once the
+	inode is locked to abort the call if file2 has changed; and they must
+	update the inode change and mod times of both files as part of the
+	metadata update.  The timestamp updates must be done atomically as part
+	of the data exchange operation to ensure correctness of the freshness
+	check.
+
 ``fadvise``
 	possibly called by the fadvise64() system call.
 
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 4e6cc0a7d69c..a1c64fdfd2f2 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -260,6 +260,45 @@ static long ioctl_file_clone_range(struct file *file,
 				args.src_length, args.dest_offset);
 }
 
+static long ioctl_file_xchg_range(struct file *file2,
+				  struct file_xchg_range __user *argp)
+{
+	struct file_xchg_range args;
+	struct fd file1;
+	__u64 old_flags;
+	int ret;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+
+	ret = -EXDEV;
+	if (file1.file->f_path.mnt != file2->f_path.mnt)
+		goto fdput;
+
+	old_flags = args.flags;
+
+	ret = vfs_xchg_file_range(file1.file, file2, &args);
+	if (ret)
+		goto fdput;
+
+	/*
+	 * The VFS will set RANGE_FSYNC on its own if the file or inode require
+	 * synchronous writes.  Don't leak this back to userspace.
+	 */
+	args.flags &= ~FILE_XCHG_RANGE_FSYNC;
+	args.flags |= (old_flags & FILE_XCHG_RANGE_FSYNC);
+
+	if (copy_to_user(argp, &args, sizeof(args)))
+		ret = -EFAULT;
+fdput:
+	fdput(file1);
+	return ret;
+}
+
 #ifdef CONFIG_BLOCK
 
 static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -720,6 +759,9 @@ static int do_vfs_ioctl(struct file *filp, unsigned int fd,
 	case FIDEDUPERANGE:
 		return ioctl_file_dedupe_range(filp, argp);
 
+	case FIEXCHANGE_RANGE:
+		return ioctl_file_xchg_range(filp, argp);
+
 	case FIONREAD:
 		if (!S_ISREG(inode->i_mode))
 			return vfs_ioctl(filp, cmd, arg);
diff --git a/fs/remap_range.c b/fs/remap_range.c
index e4a5fdd7ad7b..1a0bbd73106e 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -580,3 +580,286 @@ int vfs_dedupe_file_range(struct file *file, struct file_dedupe_range *same)
 	return ret;
 }
 EXPORT_SYMBOL(vfs_dedupe_file_range);
+
+/* Performs necessary checks before doing a range exchange. */
+static int generic_xchg_file_range_checks(struct file *file1,
+					  struct file *file2,
+					  const struct file_xchg_range *fxr,
+					  unsigned int blocksize)
+{
+	struct inode *inode1 = file1->f_mapping->host;
+	struct inode *inode2 = file2->f_mapping->host;
+	int64_t test_len;
+	uint64_t blen;
+	loff_t size1, size2;
+	int ret;
+
+	if (fxr->length < 0)
+		return -EINVAL;
+
+	/* The start of both ranges must be aligned to an fs block. */
+	if (!IS_ALIGNED(fxr->file1_offset, blocksize) ||
+	    !IS_ALIGNED(fxr->file2_offset, blocksize))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (fxr->file1_offset + fxr->length < fxr->file1_offset ||
+	    fxr->file2_offset + fxr->length < fxr->file2_offset)
+		return -EINVAL;
+
+	size1 = i_size_read(inode1);
+	size2 = i_size_read(inode2);
+
+	/*
+	 * We require both ranges to be within EOF, unless we're exchanging
+	 * to EOF.  generic_xchg_file_range_prep already checked that both
+	 * fxr->file1_offset and fxr->file2_offset are within EOF.
+	 */
+	if (!(fxr->flags & FILE_XCHG_RANGE_TO_EOF) &&
+	    (fxr->file1_offset + fxr->length > size1 ||
+	     fxr->file2_offset + fxr->length > size2))
+		return -EINVAL;
+
+	/*
+	 * Make sure we don't hit any file size limits.  If we hit any size
+	 * limits such that test_len was adjusted, we abort the whole
+	 * operation.
+	 */
+	test_len = fxr->length;
+	ret = generic_write_check_limits(file2, fxr->file2_offset, &test_len);
+	if (ret)
+		return ret;
+	ret = generic_write_check_limits(file1, fxr->file1_offset, &test_len);
+	if (ret)
+		return ret;
+	if (test_len != fxr->length)
+		return -EINVAL;
+
+	/*
+	 * If the user wanted us to exchange up to the infile's EOF, round up
+	 * to the next block boundary for this check.  Do the same for the
+	 * outfile.
+	 *
+	 * Otherwise, reject the range length if it's not block aligned.  We
+	 * already confirmed the starting offsets' block alignment.
+	 */
+	if (fxr->file1_offset + fxr->length == size1)
+		blen = ALIGN(size1, blocksize) - fxr->file1_offset;
+	else if (fxr->file2_offset + fxr->length == size2)
+		blen = ALIGN(size2, blocksize) - fxr->file2_offset;
+	else if (!IS_ALIGNED(fxr->length, blocksize))
+		return -EINVAL;
+	else
+		blen = fxr->length;
+
+	/* Don't allow overlapped exchanges within the same file. */
+	if (inode1 == inode2 &&
+	    fxr->file2_offset + blen > fxr->file1_offset &&
+	    fxr->file1_offset + blen > fxr->file2_offset)
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * Check that the two inodes are eligible for range exchanges, the ranges make
+ * sense, and then flush all dirty data.  Caller must ensure that the inodes
+ * have been locked against any other modifications.
+ */
+int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+				 struct file_xchg_range *fxr,
+				 unsigned int blocksize)
+{
+	struct inode *inode1 = file_inode(file1);
+	struct inode *inode2 = file_inode(file2);
+	u64 blkmask = blocksize - 1;
+	bool same_inode = (inode1 == inode2);
+	int ret;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
+		return -EPERM;
+	if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
+		return -ETXTBSY;
+
+	/* Don't exchange dirs, pipes, sockets... */
+	if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
+		return -EINVAL;
+
+	/* Ranges cannot start after EOF. */
+	if (fxr->file1_offset > i_size_read(inode1) ||
+	    fxr->file2_offset > i_size_read(inode2))
+		return -EINVAL;
+
+	/*
+	 * If the caller said to exchange to EOF, we set the length of the
+	 * request large enough to cover everything to the end of both files.
+	 */
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		fxr->length = max_t(int64_t,
+				    i_size_read(inode1) - fxr->file1_offset,
+				    i_size_read(inode2) - fxr->file2_offset);
+
+	/* Zero length exchange exits immediately. */
+	if (fxr->length == 0)
+		return 0;
+
+	/* Check that we don't violate system file offset limits. */
+	ret = generic_xchg_file_range_checks(file1, file2, fxr, blocksize);
+	if (ret)
+		return ret;
+
+	/*
+	 * Ensure that we don't exchange a partial EOF block into the middle of
+	 * another file.
+	 */
+	if (fxr->length & blkmask) {
+		loff_t new_length = fxr->length;
+
+		if (fxr->file2_offset + new_length < i_size_read(inode2))
+			new_length &= ~blkmask;
+
+		if (fxr->file1_offset + new_length < i_size_read(inode1))
+			new_length &= ~blkmask;
+
+		if (new_length != fxr->length)
+			return -EINVAL;
+	}
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode1);
+	if (!same_inode)
+		inode_dio_wait(inode2);
+
+	ret = filemap_write_and_wait_range(inode1->i_mapping, fxr->file1_offset,
+					   fxr->file1_offset + fxr->length - 1);
+	if (ret)
+		return ret;
+
+	ret = filemap_write_and_wait_range(inode2->i_mapping, fxr->file2_offset,
+					   fxr->file2_offset + fxr->length - 1);
+	if (ret)
+		return ret;
+
+	/*
+	 * If the files or inodes involved require synchronous writes, amend
+	 * the request to force the filesystem to flush all data and metadata
+	 * to disk after the operation completes.
+	 */
+	if (((file1->f_flags | file2->f_flags) & (__O_SYNC | O_DSYNC)) ||
+	    IS_SYNC(file_inode(file1)) || IS_SYNC(file_inode(file2)))
+		fxr->flags |= FILE_XCHG_RANGE_FSYNC;
+
+	/* Remove privilege bits from both files. */
+	ret = file_remove_privs(file1);
+	if (ret)
+		return ret;
+	return file_remove_privs(file2);
+}
+EXPORT_SYMBOL(generic_xchg_file_range_prep);
+
+/*
+ * Check that both files' metadata agree with the snapshot that we took for
+ * the range exchange request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+int generic_xchg_file_range_check_fresh(struct inode *inode1,
+					struct inode *inode2,
+					const struct file_xchg_range *fxr)
+{
+	/* Check that the offset/length values cover all of both files */
+	if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    (fxr->file1_offset != 0 ||
+	     fxr->file2_offset != 0 ||
+	     fxr->length != i_size_read(inode1) ||
+	     fxr->length != i_size_read(inode2)))
+		return -EDOM;
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if ((fxr->flags & FILE_XCHG_RANGE_FILE2_FRESH) &&
+	    (fxr->file2_ino        != inode2->i_ino ||
+	     fxr->file2_ctime      != inode2->i_ctime.tv_sec  ||
+	     fxr->file2_ctime_nsec != inode2->i_ctime.tv_nsec ||
+	     fxr->file2_mtime      != inode2->i_mtime.tv_sec  ||
+	     fxr->file2_mtime_nsec != inode2->i_mtime.tv_nsec))
+		return -EBUSY;
+
+	return 0;
+}
+EXPORT_SYMBOL(generic_xchg_file_range_check_fresh);
+
+static inline int xchg_range_verify_area(struct file *file, loff_t pos,
+					 struct file_xchg_range *fxr)
+{
+	int64_t len = fxr->length;
+
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		len = min_t(int64_t, len, i_size_read(file_inode(file)) - pos);
+	return remap_verify_area(file, pos, len, true);
+}
+
+int do_xchg_file_range(struct file *file1, struct file *file2,
+		       struct file_xchg_range *fxr)
+{
+	int ret;
+
+	if ((fxr->flags & ~FILE_XCHG_RANGE_ALL_FLAGS) ||
+	    memchr_inv(&fxr->pad, 0, sizeof(fxr->pad)))
+		return -EINVAL;
+
+	if ((fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    (fxr->flags & FILE_XCHG_RANGE_TO_EOF))
+		return -EINVAL;
+
+	/*
+	 * The ioctl enforces that src and dest files are on the same mount.
+	 * Practically, they only need to be on the same file system.
+	 */
+	if (file_inode(file1)->i_sb != file_inode(file2)->i_sb)
+		return -EXDEV;
+
+	ret = generic_file_rw_checks(file1, file2);
+	if (ret < 0)
+		return ret;
+
+	ret = generic_file_rw_checks(file2, file1);
+	if (ret < 0)
+		return ret;
+
+	if (!file2->f_op->xchg_file_range)
+		return -EOPNOTSUPP;
+
+	ret = xchg_range_verify_area(file1, fxr->file1_offset, fxr);
+	if (ret)
+		return ret;
+
+	ret = xchg_range_verify_area(file2, fxr->file2_offset, fxr);
+	if (ret)
+		return ret;
+
+	ret = file2->f_op->xchg_file_range(file1, file2, fxr);
+	if (ret)
+		return ret;
+
+	fsnotify_modify(file1);
+	fsnotify_modify(file2);
+	return 0;
+}
+EXPORT_SYMBOL(do_xchg_file_range);
+
+int vfs_xchg_file_range(struct file *file1, struct file *file2,
+			struct file_xchg_range *fxr)
+{
+	int ret;
+
+	file_start_write(file2);
+	ret = do_xchg_file_range(file1, file2, fxr);
+	file_end_write(file2);
+
+	return ret;
+}
+EXPORT_SYMBOL(vfs_xchg_file_range);
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 75cdf2685c0d..e7e1e3051739 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -848,6 +848,7 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+/*	FIEXCHANGE_RANGE ----------- hoisted 129	 */
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ec8f3ddf4a6a..a38209fdf200 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -44,6 +44,7 @@
 
 #include <asm/byteorder.h>
 #include <uapi/linux/fs.h>
+#include <uapi/linux/fiexchange.h>
 
 struct backing_dev_info;
 struct bdi_writeback;
@@ -1924,6 +1925,8 @@ struct file_operations {
 	loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
 				   struct file *file_out, loff_t pos_out,
 				   loff_t len, unsigned int remap_flags);
+	int (*xchg_file_range)(struct file *file1, struct file *file2,
+			       struct file_xchg_range *fsr);
 	int (*fadvise)(struct file *, loff_t, loff_t, int);
 } __randomize_layout;
 
@@ -1993,6 +1996,9 @@ extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 					 struct file *file_out, loff_t pos_out,
 					 loff_t *count,
 					 unsigned int remap_flags);
+extern int generic_xchg_file_range_prep(struct file *file1, struct file *file2,
+					struct file_xchg_range *fsr,
+					unsigned int blocksize);
 extern loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
 				  struct file *file_out, loff_t pos_out,
 				  loff_t len, unsigned int remap_flags);
@@ -2004,7 +2010,13 @@ extern int vfs_dedupe_file_range(struct file *file,
 extern loff_t vfs_dedupe_file_range_one(struct file *src_file, loff_t src_pos,
 					struct file *dst_file, loff_t dst_pos,
 					loff_t len, unsigned int remap_flags);
-
+extern int do_xchg_file_range(struct file *file1, struct file *file2,
+			      struct file_xchg_range *fsr);
+extern int vfs_xchg_file_range(struct file *file1, struct file *file2,
+			       struct file_xchg_range *fsr);
+extern int generic_xchg_file_range_check_fresh(struct inode *inode1,
+					struct inode *inode2,
+					const struct file_xchg_range *fsr);
 
 struct super_operations {
    	struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fiexchange.h b/include/uapi/linux/fiexchange.h
new file mode 100644
index 000000000000..17372590371a
--- /dev/null
+++ b/include/uapi/linux/fiexchange.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */
+/*
+ * FIEXCHANGE ioctl definitions, to facilitate exchanging parts of files.
+ *
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ *
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef _LINUX_FIEXCHANGE_H
+#define _LINUX_FIEXCHANGE_H
+
+#include <linux/types.h>
+
+/*
+ * Exchange part of file1 with part of the file that this ioctl is being
+ * called against (which we'll call file2).  Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct file_xchg_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__s64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see FILE_XCHG_RANGE_* below */
+
+	/* file2 metadata for optional freshness checks */
+	__s64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad[6];		/* must be zeroes */
+};
+
+/*
+ * Atomic exchange operations are not required.  This relaxes the requirement
+ * that the filesystem must be able to complete the operation after a crash.
+ */
+#define FILE_XCHG_RANGE_NONATOMIC	(1 << 0)
+
+/*
+ * Check file2's inode number, mtime, and ctime against the values
+ * provided, and return -EBUSY if there isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
+
+/*
+ * Check that file1's length is equal to file1_offset + length, and that
+ * file2's length is equal to file2_offset + length.  Returns -EDOM if there
+ * isn't an exact match.
+ */
+#define FILE_XCHG_RANGE_FULL_FILES	(1 << 2)
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes.  This flag can be used to replace a file's contents with a
+ * different amount of data.  length will be ignored.
+ */
+#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define FILE_XCHG_RANGE_DRY_RUN		(1 << 5)
+
+/*
+ * Do not exchange any part of the range where file1's mapping is a hole.  This
+ * can be used to emulate scatter-gather atomic writes with a temp file.
+ */
+#define FILE_XCHG_RANGE_SKIP_FILE1_HOLES (1 << 6)
+
+/*
+ * Commit the contents of file1 into file2 if file2 has the same inode number,
+ * mtime, and ctime as the arguments provided to the call.  The old contents of
+ * file2 will be moved to file1.
+ *
+ * With this flag, all committed information can be retrieved even if the
+ * system crashes or is rebooted.  This includes writing through or flushing a
+ * disk cache if present.  The call blocks until the device reports that the
+ * commit is complete.
+ *
+ * This flag should not be combined with NONATOMIC.  It can be combined with
+ * SKIP_FILE1_HOLES.
+ */
+#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FSYNC)
+
+#define FILE_XCHG_RANGE_ALL_FLAGS	(FILE_XCHG_RANGE_NONATOMIC | \
+					 FILE_XCHG_RANGE_FILE2_FRESH | \
+					 FILE_XCHG_RANGE_FULL_FILES | \
+					 FILE_XCHG_RANGE_TO_EOF | \
+					 FILE_XCHG_RANGE_FSYNC | \
+					 FILE_XCHG_RANGE_DRY_RUN | \
+					 FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+
+#define FIEXCHANGE_RANGE	_IOWR('X', 129, struct file_xchg_range)
+
+#endif /* _LINUX_FIEXCHANGE_H */

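For illustration only, a userspace caller might fill in the request for a
whole-file commit roughly as follows.  This is a minimal sketch and not part
of the patch: the struct layout and flag values are copied from the header
above (with fixed-width stand-ins for __s64 and friends) because no installed
header provides them yet, and xchg_prep_commit is a hypothetical helper name.
The stat data is assumed to have been sampled from file2 when staging began.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

/* Local copy of the proposed uapi struct; int64_t stands in for __s64. */
struct file_xchg_range {
	int64_t		file1_fd;
	int64_t		file1_offset;	/* file1 offset, bytes */
	int64_t		file2_offset;	/* file2 offset, bytes */
	int64_t		length;		/* bytes to exchange */
	uint64_t	flags;
	int64_t		file2_ino;
	int64_t		file2_mtime;
	int64_t		file2_ctime;
	int32_t		file2_mtime_nsec;
	int32_t		file2_ctime_nsec;
	uint64_t	pad[6];		/* must be zeroes */
};

/* Flag values copied from the proposed header. */
#define FILE_XCHG_RANGE_FILE2_FRESH	(1 << 1)
#define FILE_XCHG_RANGE_TO_EOF		(1 << 3)
#define FILE_XCHG_RANGE_FSYNC		(1 << 4)
#define FILE_XCHG_RANGE_COMMIT		(FILE_XCHG_RANGE_FILE2_FRESH | \
					 FILE_XCHG_RANGE_FSYNC)

/*
 * Hypothetical helper: prepare a request that replaces all of file2 with
 * the contents of file1, but only if file2 still matches file2_stat.
 */
void
xchg_prep_commit(
	struct file_xchg_range	*req,
	int			file1_fd,
	const struct stat	*file2_stat)
{
	assert(req != NULL && file2_stat != NULL);

	/* Zeroing everything also satisfies the "pad must be zeroes" rule. */
	memset(req, 0, sizeof(*req));

	req->file1_fd = file1_fd;
	/* Offsets stay 0; length is ignored because TO_EOF is set. */
	req->flags = FILE_XCHG_RANGE_COMMIT | FILE_XCHG_RANGE_TO_EOF;

	/* Freshness data that the FILE2_FRESH check compares against. */
	req->file2_ino = (int64_t)file2_stat->st_ino;
	req->file2_mtime = (int64_t)file2_stat->st_mtim.tv_sec;
	req->file2_mtime_nsec = (int32_t)file2_stat->st_mtim.tv_nsec;
	req->file2_ctime = (int64_t)file2_stat->st_ctim.tv_sec;
	req->file2_ctime_nsec = (int32_t)file2_stat->st_ctim.tv_nsec;
}
```

The caller would then issue the FIEXCHANGE_RANGE ioctl against file2's
descriptor with this request.  Going by the flag description above, an EBUSY
failure would mean that file2 changed after it was sampled.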


* [PATCH 02/18] xfs: support two inodes in the defer capture structure
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
  2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
@ 2021-04-01  1:08 ` Darrick J. Wong
  2021-04-02 23:20   ` Allison Henderson
  2021-04-01  1:09 ` [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags Darrick J. Wong
                   ` (16 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:08 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Make it so that xfs_defer_ops_capture_and_commit can capture two inodes.
This will be needed by the atomic extent swap log item so that it can
recover an operation involving two inodes.
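
The reference handling in the hunks below can be modelled outside the kernel
as a rough sketch.  The toy_* names are hypothetical stand-ins for
struct xfs_inode and struct xfs_defer_capture, and a plain counter stands in
for ihold/xfs_irele.  The point is that passing the same inode twice takes
only one reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an inode with a reference count. */
struct toy_inode {
	int			refcount;
};

/* Toy stand-in for the capture structure's two inode slots. */
struct toy_capture {
	struct toy_inode	*ip1;
	struct toy_inode	*ip2;
};

/* Take one reference per distinct inode and remember it in the capture. */
void
toy_capture_inodes(
	struct toy_capture	*dfc,
	struct toy_inode	*ip1,
	struct toy_inode	*ip2)
{
	/* Same rule as the patch: no second inode without a first one. */
	assert(ip2 == NULL || ip1 != NULL);

	dfc->ip1 = NULL;
	dfc->ip2 = NULL;

	if (ip1) {
		ip1->refcount++;
		dfc->ip1 = ip1;
	}
	/* Skip the second slot if both arguments name the same inode. */
	if (ip2 && ip2 != ip1) {
		ip2->refcount++;
		dfc->ip2 = ip2;
	}
}

/* Drop whatever references the capture holds. */
void
toy_capture_release(
	struct toy_capture	*dfc)
{
	if (dfc->ip1)
		dfc->ip1->refcount--;
	if (dfc->ip2)
		dfc->ip2->refcount--;
	dfc->ip1 = NULL;
	dfc->ip2 = NULL;
}
```

Leaving the second slot empty when both arguments match is what lets the
release path drop each slot unconditionally without dropping a reference
twice.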

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_defer.c  |   48 ++++++++++++++++++++++++++++++--------------
 fs/xfs/libxfs/xfs_defer.h  |    9 ++++++--
 fs/xfs/xfs_bmap_item.c     |    2 +-
 fs/xfs/xfs_extfree_item.c  |    2 +-
 fs/xfs/xfs_log_recover.c   |   14 ++++++++-----
 fs/xfs/xfs_refcount_item.c |    2 +-
 fs/xfs/xfs_rmap_item.c     |    2 +-
 7 files changed, 52 insertions(+), 27 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index eff4a127188e..a7d1357687d0 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -628,7 +628,8 @@ xfs_defer_move(
 static struct xfs_defer_capture *
 xfs_defer_ops_capture(
 	struct xfs_trans		*tp,
-	struct xfs_inode		*capture_ip)
+	struct xfs_inode		*capture_ip1,
+	struct xfs_inode		*capture_ip2)
 {
 	struct xfs_defer_capture	*dfc;
 
@@ -658,9 +659,13 @@ xfs_defer_ops_capture(
 	 * Grab an extra reference to this inode and attach it to the capture
 	 * structure.
 	 */
-	if (capture_ip) {
-		ihold(VFS_I(capture_ip));
-		dfc->dfc_capture_ip = capture_ip;
+	if (capture_ip1) {
+		ihold(VFS_I(capture_ip1));
+		dfc->dfc_capture_ip1 = capture_ip1;
+	}
+	if (capture_ip2 && capture_ip2 != capture_ip1) {
+		ihold(VFS_I(capture_ip2));
+		dfc->dfc_capture_ip2 = capture_ip2;
 	}
 
 	return dfc;
@@ -673,8 +678,10 @@ xfs_defer_ops_release(
 	struct xfs_defer_capture	*dfc)
 {
 	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
-	if (dfc->dfc_capture_ip)
-		xfs_irele(dfc->dfc_capture_ip);
+	if (dfc->dfc_capture_ip1)
+		xfs_irele(dfc->dfc_capture_ip1);
+	if (dfc->dfc_capture_ip2)
+		xfs_irele(dfc->dfc_capture_ip2);
 	kmem_free(dfc);
 }
 
@@ -684,22 +691,26 @@ xfs_defer_ops_release(
  * of the deferred ops operate on an inode, the caller must pass in that inode
  * so that the reference can be transferred to the capture structure.  The
  * caller must hold ILOCK_EXCL on the inode, and must unlock it before calling
- * xfs_defer_ops_continue.
+ * xfs_defer_ops_continue.  Do not pass a null capture_ip1 and a non-null
+ * capture_ip2.
  */
 int
 xfs_defer_ops_capture_and_commit(
 	struct xfs_trans		*tp,
-	struct xfs_inode		*capture_ip,
+	struct xfs_inode		*capture_ip1,
+	struct xfs_inode		*capture_ip2,
 	struct list_head		*capture_list)
 {
 	struct xfs_mount		*mp = tp->t_mountp;
 	struct xfs_defer_capture	*dfc;
 	int				error;
 
-	ASSERT(!capture_ip || xfs_isilocked(capture_ip, XFS_ILOCK_EXCL));
+	ASSERT(!capture_ip1 || xfs_isilocked(capture_ip1, XFS_ILOCK_EXCL));
+	ASSERT(!capture_ip2 || xfs_isilocked(capture_ip2, XFS_ILOCK_EXCL));
+	ASSERT(capture_ip2 == NULL || capture_ip1 != NULL);
 
 	/* If we don't capture anything, commit transaction and exit. */
-	dfc = xfs_defer_ops_capture(tp, capture_ip);
+	dfc = xfs_defer_ops_capture(tp, capture_ip1, capture_ip2);
 	if (!dfc)
 		return xfs_trans_commit(tp);
 
@@ -724,17 +735,24 @@ void
 xfs_defer_ops_continue(
 	struct xfs_defer_capture	*dfc,
 	struct xfs_trans		*tp,
-	struct xfs_inode		**captured_ipp)
+	struct xfs_inode		**captured_ipp1,
+	struct xfs_inode		**captured_ipp2)
 {
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
 
 	/* Lock and join the captured inode to the new transaction. */
-	if (dfc->dfc_capture_ip) {
-		xfs_ilock(dfc->dfc_capture_ip, XFS_ILOCK_EXCL);
-		xfs_trans_ijoin(tp, dfc->dfc_capture_ip, 0);
+	if (dfc->dfc_capture_ip1 && dfc->dfc_capture_ip2) {
+		xfs_lock_two_inodes(dfc->dfc_capture_ip1, XFS_ILOCK_EXCL,
+				    dfc->dfc_capture_ip2, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dfc->dfc_capture_ip1, 0);
+		xfs_trans_ijoin(tp, dfc->dfc_capture_ip2, 0);
+	} else if (dfc->dfc_capture_ip1) {
+		xfs_ilock(dfc->dfc_capture_ip1, XFS_ILOCK_EXCL);
+		xfs_trans_ijoin(tp, dfc->dfc_capture_ip1, 0);
 	}
-	*captured_ipp = dfc->dfc_capture_ip;
+	*captured_ipp1 = dfc->dfc_capture_ip1;
+	*captured_ipp2 = dfc->dfc_capture_ip2;
 
 	/* Move captured dfops chain and state to the transaction. */
 	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 05472f71fffe..f5e3ca17aa26 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -87,7 +87,8 @@ struct xfs_defer_capture {
 	 * An inode reference that must be maintained to complete the deferred
 	 * work.
 	 */
-	struct xfs_inode	*dfc_capture_ip;
+	struct xfs_inode	*dfc_capture_ip1;
+	struct xfs_inode	*dfc_capture_ip2;
 };
 
 /*
@@ -95,9 +96,11 @@ struct xfs_defer_capture {
  * This doesn't normally happen except log recovery.
  */
 int xfs_defer_ops_capture_and_commit(struct xfs_trans *tp,
-		struct xfs_inode *capture_ip, struct list_head *capture_list);
+		struct xfs_inode *capture_ip1, struct xfs_inode *capture_ip2,
+		struct list_head *capture_list);
 void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp,
-		struct xfs_inode **captured_ipp);
+		struct xfs_inode **captured_ipp1,
+		struct xfs_inode **captured_ipp2);
 void xfs_defer_ops_release(struct xfs_mount *mp, struct xfs_defer_capture *d);
 
 #endif /* __XFS_DEFER_H__ */
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 895a56b16029..bba73ddd0585 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -551,7 +551,7 @@ xfs_bui_item_recover(
 	 * Commit transaction, which frees the transaction and saves the inode
 	 * for later replay activities.
 	 */
-	error = xfs_defer_ops_capture_and_commit(tp, ip, capture_list);
+	error = xfs_defer_ops_capture_and_commit(tp, ip, NULL, capture_list);
 	if (error)
 		goto err_unlock;
 
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index c767918c0c3f..ebfc7de8083e 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -632,7 +632,7 @@ xfs_efi_item_recover(
 
 	}
 
-	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
+	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
 
 abort_error:
 	xfs_trans_cancel(tp);
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index b227a6ad9f5d..ce1a7928eb2d 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2439,7 +2439,7 @@ xlog_finish_defer_ops(
 {
 	struct xfs_defer_capture *dfc, *next;
 	struct xfs_trans	*tp;
-	struct xfs_inode	*ip;
+	struct xfs_inode	*ip1, *ip2;
 	int			error = 0;
 
 	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
@@ -2465,12 +2465,16 @@ xlog_finish_defer_ops(
 		 * from recovering a single intent item.
 		 */
 		list_del_init(&dfc->dfc_list);
-		xfs_defer_ops_continue(dfc, tp, &ip);
+		xfs_defer_ops_continue(dfc, tp, &ip1, &ip2);
 
 		error = xfs_trans_commit(tp);
-		if (ip) {
-			xfs_iunlock(ip, XFS_ILOCK_EXCL);
-			xfs_irele(ip);
+		if (ip1) {
+			xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+			xfs_irele(ip1);
+		}
+		if (ip2) {
+			xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+			xfs_irele(ip2);
 		}
 		if (error)
 			return error;
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index 07ebccbbf4df..427d8259a36d 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -554,7 +554,7 @@ xfs_cui_item_recover(
 	}
 
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
-	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
+	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
 
 abort_error:
 	xfs_refcount_finish_one_cleanup(tp, rcur, error);
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 49cebd68b672..deb852a3c5f6 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -584,7 +584,7 @@ xfs_rui_item_recover(
 	}
 
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);
-	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
+	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
 
 abort_error:
 	xfs_rmap_finish_one_cleanup(tp, rcur, error);



* [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
  2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
  2021-04-01  1:08 ` [PATCH 02/18] xfs: support two inodes in the defer capture structure Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-02 23:20   ` Allison Henderson
  2021-04-01  1:09 ` [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle Darrick J. Wong
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Log incompat feature flags in the superblock exist for one purpose: to
protect the contents of a dirty log from replay on a kernel that isn't
prepared to handle those dirty contents.  This means that they can be
cleared if (a) we know the log is clean and (b) we know that there
aren't any other threads in the system that might be setting or relying
upon a log incompat flag.

Therefore, clear the log incompat flags when we've finished recovering
the log, when we're unmounting cleanly, remounting read-only, or
freezing; and provide a function so that subsequent patches can start
using this.
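
As a rough userspace model of the rule above (not the kernel code), the flag
word behaves like a bitmask that may only be wiped when the caller can vouch
for both conditions.  All toy_* names and the single feature bit are
hypothetical.  The real functions also lock the superblock buffer and write
it out, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit and the mask of all known bits. */
#define TOY_LOG_INCOMPAT_FEAT_A	(1U << 0)
#define TOY_LOG_INCOMPAT_ALL	(TOY_LOG_INCOMPAT_FEAT_A)

struct toy_sb {
	uint32_t	features_log_incompat;
};

/* Set exactly one known log incompat feature bit. */
void
toy_add_incompat_log_feature(
	struct toy_sb	*sb,
	uint32_t	feature)
{
	/* Exactly one bit, and it must be a bit we know about. */
	assert(feature != 0 && (feature & (feature - 1)) == 0);
	assert(!(feature & ~TOY_LOG_INCOMPAT_ALL));

	sb->features_log_incompat |= feature;
}

/*
 * Clear all log incompat bits, but only if (a) the log is clean and
 * (b) nobody is setting or relying on a flag.  Returns true if the
 * superblock changed and therefore needs to be written back.
 */
bool
toy_clear_incompat_log_features(
	struct toy_sb	*sb,
	bool		log_is_clean,
	unsigned int	active_users)
{
	if (!log_is_clean || active_users > 0)
		return false;
	if (!(sb->features_log_incompat & TOY_LOG_INCOMPAT_ALL))
		return false;

	sb->features_log_incompat &= ~TOY_LOG_INCOMPAT_ALL;
	return true;
}
```

Returning "dirty" rather than writing the superblock directly mirrors how the
callers in this patch decide for themselves whether to call xfs_sync_sb.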

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |   15 ++++++
 fs/xfs/xfs_log.c           |   14 ++++++
 fs/xfs/xfs_log_recover.c   |   16 ++++++
 fs/xfs/xfs_mount.c         |  110 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_mount.h         |    2 +
 5 files changed, 157 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 9620795a6e08..7e9c964772c9 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -495,6 +495,21 @@ xfs_sb_has_incompat_log_feature(
 	return (sbp->sb_features_log_incompat & feature) != 0;
 }
 
+static inline void
+xfs_sb_remove_incompat_log_features(
+	struct xfs_sb	*sbp)
+{
+	sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
+}
+
+static inline void
+xfs_sb_add_incompat_log_features(
+	struct xfs_sb	*sbp,
+	unsigned int	features)
+{
+	sbp->sb_features_log_incompat |= features;
+}
+
 /*
  * V5 superblock specific feature checks
  */
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 06041834daa3..cf73bc9f4d18 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -945,6 +945,20 @@ int
 xfs_log_quiesce(
 	struct xfs_mount	*mp)
 {
+	/*
+	 * Clear log incompat features since we're quiescing the log.  Report
+	 * failures, though it's not fatal to have a higher log feature
+	 * protection level than the log contents actually require.
+	 */
+	if (xfs_clear_incompat_log_features(mp)) {
+		int error;
+
+		error = xfs_sync_sb(mp, false);
+		if (error)
+			xfs_warn(mp,
+	"Failed to clear log incompat features on quiesce");
+	}
+
 	cancel_delayed_work_sync(&mp->m_log->l_work);
 	xfs_log_force(mp, XFS_LOG_SYNC);
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index ce1a7928eb2d..fdba9b55822e 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -3480,6 +3480,22 @@ xlog_recover_finish(
 		 */
 		xfs_log_force(log->l_mp, XFS_LOG_SYNC);
 
+		/*
+		 * Now that we've recovered the log and all the intents, we can
+		 * clear the log incompat feature bits in the superblock
+		 * because there's no longer anything to protect.  We rely on
+		 * the AIL push to write out the updated superblock after
+		 * everything else.
+		 */
+		if (xfs_clear_incompat_log_features(log->l_mp)) {
+			error = xfs_sync_sb(log->l_mp, false);
+			if (error < 0) {
+				xfs_alert(log->l_mp,
+	"Failed to clear log incompat features on recovery");
+				return error;
+			}
+		}
+
 		xlog_recover_process_iunlinks(log);
 
 		xlog_recover_check_summary(log);
diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
index b7e653180d22..f16036e1986b 100644
--- a/fs/xfs/xfs_mount.c
+++ b/fs/xfs/xfs_mount.c
@@ -1333,6 +1333,116 @@ xfs_force_summary_recalc(
 	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
 }
 
+/*
+ * Enable a log incompat feature flag in the primary superblock.  The caller
+ * cannot have any other transactions in progress.
+ */
+int
+xfs_add_incompat_log_feature(
+	struct xfs_mount	*mp,
+	uint32_t		feature)
+{
+	struct xfs_dsb		*dsb;
+	int			error;
+
+	ASSERT(hweight32(feature) == 1);
+	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
+
+	/*
+	 * Force the log to disk and kick the background AIL thread to reduce
+	 * the chances that the bwrite will stall waiting for the AIL to unpin
+	 * the primary superblock buffer.  This isn't a data integrity
+	 * operation, so we don't need a synchronous push.
+	 */
+	error = xfs_log_force(mp, XFS_LOG_SYNC);
+	if (error)
+		return error;
+	xfs_ail_push_all(mp->m_ail);
+
+	/*
+	 * Lock the primary superblock buffer to serialize all callers that
+	 * are trying to set feature bits.
+	 */
+	xfs_buf_lock(mp->m_sb_bp);
+	xfs_buf_hold(mp->m_sb_bp);
+
+	if (XFS_FORCED_SHUTDOWN(mp)) {
+		error = -EIO;
+		goto rele;
+	}
+
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
+		goto rele;
+
+	/*
+	 * Write the primary superblock to disk immediately, because we need
+	 * the log_incompat bit to be set in the primary super now to protect
+	 * the log items that we're going to commit later.
+	 */
+	dsb = mp->m_sb_bp->b_addr;
+	xfs_sb_to_disk(dsb, &mp->m_sb);
+	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
+	error = xfs_bwrite(mp->m_sb_bp);
+	if (error)
+		goto shutdown;
+
+	/*
+	 * Add the feature bits to the incore superblock before we unlock the
+	 * buffer.
+	 */
+	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
+	xfs_buf_relse(mp->m_sb_bp);
+
+	/* Log the superblock to disk. */
+	return xfs_sync_sb(mp, false);
+shutdown:
+	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
+rele:
+	xfs_buf_relse(mp->m_sb_bp);
+	return error;
+}
+
+/*
+ * Clear all the log incompat flags from the superblock.
+ *
+ * The caller cannot be in a transaction, must ensure that the log does not
+ * contain any log items protected by any log incompat bit, and must ensure
+ * that there are no other threads that depend on the state of the log incompat
+ * feature flags in the primary super.
+ *
+ * Returns true if the superblock is dirty.
+ */
+bool
+xfs_clear_incompat_log_features(
+	struct xfs_mount	*mp)
+{
+	bool			ret = false;
+
+	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
+	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
+	    XFS_FORCED_SHUTDOWN(mp))
+		return false;
+
+	/*
+	 * Update the incore superblock.  We synchronize on the primary super
+	 * buffer lock to be consistent with the add function, though at least
+	 * in theory this shouldn't be necessary.
+	 */
+	xfs_buf_lock(mp->m_sb_bp);
+	xfs_buf_hold(mp->m_sb_bp);
+
+	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
+		xfs_info(mp, "Clearing log incompat feature flags.");
+		xfs_sb_remove_incompat_log_features(&mp->m_sb);
+		ret = true;
+	}
+
+	xfs_buf_relse(mp->m_sb_bp);
+	return ret;
+}
+
 /*
  * Update the in-core delayed block counter.
  *
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 63d0dc1b798d..eb45684b186a 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -453,6 +453,8 @@ int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
 struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
 		int error_class, int error);
 void xfs_force_summary_recalc(struct xfs_mount *mp);
+int xfs_add_incompat_log_feature(struct xfs_mount *mp, uint32_t feature);
+bool xfs_clear_incompat_log_features(struct xfs_mount *mp);
 void xfs_mod_delalloc(struct xfs_mount *mp, int64_t delta);
 
 void xfs_hook_init(struct xfs_hook_chain *chain);



* [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (2 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-02 23:20   ` Allison Henderson
  2021-04-01  1:09 ` [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

When there are no ongoing transactions and the log contents have been
checkpointed back into the filesystem, the log performs 'covering',
which is to say that it logs a dummy transaction to record the fact that
the tail has caught up with the head.  This is a good time to clear log
incompat feature flags, because they are flags that are temporarily set
to limit the range of kernels that can replay a dirty log.

Since it's possible that some other higher level thread is about to
start logging items protected by a log incompat flag, we create a rwsem
so that upper level threads can coordinate this with the log.  It would
probably be more performant to use a percpu rwsem, but the ability to
/try/ taking the write lock during covering is critical, and percpu
rwsems do not provide that.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_shared.h |    6 +++++
 fs/xfs/xfs_log.c           |   49 ++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_log.h           |    3 +++
 fs/xfs/xfs_log_priv.h      |    3 +++
 fs/xfs/xfs_trans.c         |   14 +++++++++----
 5 files changed, 71 insertions(+), 4 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
index 8c61a461bf7b..c7c9a0cebb04 100644
--- a/fs/xfs/libxfs/xfs_shared.h
+++ b/fs/xfs/libxfs/xfs_shared.h
@@ -62,6 +62,12 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
 #define	XFS_TRANS_SB_DIRTY	0x02	/* superblock is modified */
 #define	XFS_TRANS_PERM_LOG_RES	0x04	/* xact took a permanent log res */
 #define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
+/*
+ * This transaction uses a log incompat feature, which means that we must tell
+ * the log that we've finished using it at the transaction commit or cancel.
+ * Callers must call xlog_use_incompat_feat before setting this flag.
+ */
+#define XFS_TRANS_LOG_INCOMPAT	0x10
 #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
 #define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
 #define XFS_TRANS_RES_FDBLKS	0x80	/* reserve newly freed blocks */
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index cf73bc9f4d18..cb72be62da3e 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -1335,6 +1335,32 @@ xfs_log_work_queue(
 				msecs_to_jiffies(xfs_syncd_centisecs * 10));
 }
 
+/*
+ * Clear the log incompat flags if we have the opportunity.
+ *
+ * This only happens if we're about to log the second dummy transaction as part
+ * of covering the log and we can get the log incompat feature usage lock.
+ */
+static inline void
+xlog_clear_incompat(
+	struct xlog		*log)
+{
+	struct xfs_mount	*mp = log->l_mp;
+
+	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
+				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
+		return;
+
+	if (log->l_covered_state != XLOG_STATE_COVER_DONE2)
+		return;
+
+	if (!down_write_trylock(&log->l_incompat_users))
+		return;
+
+	xfs_clear_incompat_log_features(mp);
+	up_write(&log->l_incompat_users);
+}
+
 /*
  * Every sync period we need to unpin all items in the AIL and push them to
  * disk. If there is nothing dirty, then we might need to cover the log to
@@ -1361,6 +1387,7 @@ xfs_log_worker(
 		 * synchronously log the superblock instead to ensure the
 		 * superblock is immediately unpinned and can be written back.
 		 */
+		xlog_clear_incompat(log);
 		xfs_sync_sb(mp, true);
 	} else
 		xfs_log_force(mp, 0);
@@ -1443,6 +1470,8 @@ xlog_alloc_log(
 	}
 	log->l_sectBBsize = 1 << log2_size;
 
+	init_rwsem(&log->l_incompat_users);
+
 	xlog_get_iclog_buffer_size(mp, log);
 
 	spin_lock_init(&log->l_icloglock);
@@ -3933,3 +3962,23 @@ xfs_log_in_recovery(
 
 	return log->l_flags & XLOG_ACTIVE_RECOVERY;
 }
+
+/*
+ * Notify the log that we're about to start using a feature that is protected
+ * by a log incompat feature flag.  This will prevent log covering from
+ * clearing those flags.
+ */
+void
+xlog_use_incompat_feat(
+	struct xlog		*log)
+{
+	down_read(&log->l_incompat_users);
+}
+
+/* Notify the log that we've finished using log incompat features. */
+void
+xlog_drop_incompat_feat(
+	struct xlog		*log)
+{
+	up_read(&log->l_incompat_users);
+}
diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
index 044e02cb8921..8b7d0a56cbf1 100644
--- a/fs/xfs/xfs_log.h
+++ b/fs/xfs/xfs_log.h
@@ -145,4 +145,7 @@ bool	xfs_log_in_recovery(struct xfs_mount *);
 
 xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
 
+void xlog_use_incompat_feat(struct xlog *log);
+void xlog_drop_incompat_feat(struct xlog *log);
+
 #endif	/* __XFS_LOG_H__ */
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index 1c6fdbf3d506..75702c4fa69c 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -436,6 +436,9 @@ struct xlog {
 #endif
 	/* log recovery lsn tracking (for buffer submission */
 	xfs_lsn_t		l_recovery_lsn;
+
+	/* Users of log incompat features should take a read lock. */
+	struct rw_semaphore	l_incompat_users;
 };
 
 #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index eb2d8e2e5db6..e548d53c2091 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -71,6 +71,9 @@ xfs_trans_free(
 	xfs_extent_busy_sort(&tp->t_busy);
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
+	if (tp->t_flags & XFS_TRANS_LOG_INCOMPAT)
+		xlog_drop_incompat_feat(tp->t_mountp->m_log);
+
 	trace_xfs_trans_free(tp, _RET_IP_);
 	xfs_trans_clear_context(tp);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
@@ -110,10 +113,13 @@ xfs_trans_dup(
 	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
 	ASSERT(tp->t_ticket != NULL);
 
-	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
-		       (tp->t_flags & XFS_TRANS_RESERVE) |
-		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
-		       (tp->t_flags & XFS_TRANS_RES_FDBLKS);
+	ntp->t_flags = tp->t_flags & (XFS_TRANS_PERM_LOG_RES |
+				      XFS_TRANS_RESERVE |
+				      XFS_TRANS_NO_WRITECOUNT |
+				      XFS_TRANS_RES_FDBLKS |
+				      XFS_TRANS_LOG_INCOMPAT);
+	/* Give our LOG_INCOMPAT reference to the new transaction. */
+	tp->t_flags &= ~XFS_TRANS_LOG_INCOMPAT;
 	/* We gave our writer reference to the new transaction */
 	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
 	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);



* [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (3 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-02 23:21   ` Allison Henderson
  2021-04-01  1:09 ` [PATCH 06/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Create a log incompat flag so that we only attempt to process swap
extent log items if the filesystem supports it.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |   20 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_fs.h     |    1 +
 fs/xfs/libxfs/xfs_sb.c     |    2 ++
 3 files changed, 23 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 7e9c964772c9..e81a7b12a0e3 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -485,6 +485,7 @@ xfs_sb_has_incompat_feature(
 	return (sbp->sb_features_incompat & feature) != 0;
 }
 
+#define XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP (1 << 0)
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
@@ -607,6 +608,25 @@ static inline bool xfs_sb_version_needsrepair(struct xfs_sb *sbp)
 		(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
 }
 
+/*
+ * Decide if this filesystem can use log-assisted ("atomic") extent swapping.
+ * The atomic swap log intent items depend on the block mapping log intent
+ * items introduced with reflink and rmap.  Realtime is not supported yet.
+ */
+static inline bool xfs_sb_version_canatomicswap(struct xfs_sb *sbp)
+{
+	return (xfs_sb_version_hasreflink(sbp) ||
+		xfs_sb_version_hasrmapbt(sbp)) &&
+		!xfs_sb_version_hasrealtime(sbp);
+}
+
+static inline bool xfs_sb_version_hasatomicswap(struct xfs_sb *sbp)
+{
+	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
+		(sbp->sb_features_log_incompat &
+		 XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP);
+}
+
 /*
  * end of superblock version macros
  */
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index e7e1e3051739..08bfce39407e 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -252,6 +252,7 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_REFLINK	(1 << 20) /* files can share blocks */
 #define XFS_FSOP_GEOM_FLAGS_BIGTIME	(1 << 21) /* 64-bit nsec timestamps */
 #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
+#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1 << 23) /* atomic swapext */
 
 /*
  * Minimum and maximum sizes need for growth checks.
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index 6adfe759190c..52791fe33a6e 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -1140,6 +1140,8 @@ xfs_fs_geometry(
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_BIGTIME;
 	if (xfs_sb_version_hasinobtcounts(sbp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_INOBTCNT;
+	if (xfs_sb_version_canatomicswap(sbp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
 	if (xfs_sb_version_hassector(sbp))
 		geo->logsectsize = sbp->sb_logsectsize;
 	else



* [PATCH 06/18] xfs: introduce a swap-extent log intent item
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (4 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-05 23:08   ` Allison Henderson
  2021-04-01  1:09 ` [PATCH 07/18] xfs: create deferred log items for extent swapping Darrick J. Wong
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new intent log item to handle swapping extents.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_log_format.h  |   59 +++++++
 fs/xfs/libxfs/xfs_log_recover.h |    2 
 fs/xfs/xfs_log.c                |    2 
 fs/xfs/xfs_log_recover.c        |    2 
 fs/xfs/xfs_super.c              |   17 ++
 fs/xfs/xfs_swapext_item.c       |  328 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_swapext_item.h       |   61 +++++++
 8 files changed, 470 insertions(+), 2 deletions(-)
 create mode 100644 fs/xfs/xfs_swapext_item.c
 create mode 100644 fs/xfs/xfs_swapext_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index dac3bec1a695..a7cc6f496ad0 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -107,6 +107,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_inode_item_recover.o \
 				   xfs_refcount_item.o \
 				   xfs_rmap_item.o \
+				   xfs_swapext_item.o \
 				   xfs_log_recover.o \
 				   xfs_trans_ail.o \
 				   xfs_trans_buf.o
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 6107dac4bd6b..52ca6d72de6a 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -117,7 +117,9 @@ struct xfs_unmount_log_format {
 #define XLOG_REG_TYPE_CUD_FORMAT	24
 #define XLOG_REG_TYPE_BUI_FORMAT	25
 #define XLOG_REG_TYPE_BUD_FORMAT	26
-#define XLOG_REG_TYPE_MAX		26
+#define XLOG_REG_TYPE_SXI_FORMAT	27
+#define XLOG_REG_TYPE_SXD_FORMAT	28
+#define XLOG_REG_TYPE_MAX		28
 
 /*
  * Flags to log operation header
@@ -240,6 +242,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_CUD		0x1243
 #define	XFS_LI_BUI		0x1244	/* bmbt update intent */
 #define	XFS_LI_BUD		0x1245
+#define	XFS_LI_SXI		0x1246
+#define	XFS_LI_SXD		0x1247
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -255,7 +259,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
 	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
 	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
-	{ XFS_LI_BUD,		"XFS_LI_BUD" }
+	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
+	{ XFS_LI_SXI,		"XFS_LI_SXI" }, \
+	{ XFS_LI_SXD,		"XFS_LI_SXD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -784,6 +790,55 @@ struct xfs_bud_log_format {
 	uint64_t		bud_bui_id;	/* id of corresponding bui */
 };
 
+/*
+ * SXI/SXD (extent swapping) log format definitions
+ */
+
+struct xfs_swap_extent {
+	uint64_t		sx_inode1;
+	uint64_t		sx_inode2;
+	uint64_t		sx_startoff1;
+	uint64_t		sx_startoff2;
+	uint64_t		sx_blockcount;
+	uint64_t		sx_flags;
+	int64_t			sx_isize1;
+	int64_t			sx_isize2;
+};
+
+/* Swap extents between extended attribute forks. */
+#define XFS_SWAP_EXTENT_ATTR_FORK	(1ULL << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_SWAP_EXTENT_SET_SIZES	(1ULL << 1)
+
+/* Do not swap any part of the range where file1's mapping is a hole. */
+#define XFS_SWAP_EXTENT_SKIP_FILE1_HOLES (1ULL << 2)
+
+#define XFS_SWAP_EXTENT_FLAGS		(XFS_SWAP_EXTENT_ATTR_FORK | \
+					 XFS_SWAP_EXTENT_SET_SIZES | \
+					 XFS_SWAP_EXTENT_SKIP_FILE1_HOLES)
+
+/* This is the structure used to lay out an sxi log item in the log. */
+struct xfs_sxi_log_format {
+	uint16_t		sxi_type;	/* sxi log item type */
+	uint16_t		sxi_size;	/* size of this item */
+	uint32_t		__pad;		/* must be zero */
+	uint64_t		sxi_id;		/* sxi identifier */
+	struct xfs_swap_extent	sxi_extent;	/* extent to swap */
+};
+
+/*
+ * This is the structure used to lay out an sxd log item in the
+ * log.  It carries no extent array; it only records the id of the
+ * sxi item that it completes.
+ */
+struct xfs_sxd_log_format {
+	uint16_t		sxd_type;	/* sxd log item type */
+	uint16_t		sxd_size;	/* size of this item */
+	uint32_t		__pad;
+	uint64_t		sxd_sxi_id;	/* id of corresponding sxi */
+};
+
 /*
  * Dquot Log format definitions.
  *
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 3cca2bfe714c..dcc11a8c438a 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -72,6 +72,8 @@ extern const struct xlog_recover_item_ops xlog_rui_item_ops;
 extern const struct xlog_recover_item_ops xlog_rud_item_ops;
 extern const struct xlog_recover_item_ops xlog_cui_item_ops;
 extern const struct xlog_recover_item_ops xlog_cud_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxi_item_ops;
+extern const struct xlog_recover_item_ops xlog_sxd_item_ops;
 
 /*
  * Macros, structures, prototypes for internal log manager use.
diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index cb72be62da3e..34213fce3eed 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2113,6 +2113,8 @@ xlog_print_tic_res(
 	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
 	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
 	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
+	    REG_TYPE_STR(SXI_FORMAT, "sxi_format"),
+	    REG_TYPE_STR(SXD_FORMAT, "sxd_format"),
 	};
 	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
 #undef REG_TYPE_STR
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index fdba9b55822e..107bb222d79f 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1775,6 +1775,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
 	&xlog_cud_item_ops,
 	&xlog_bui_item_ops,
 	&xlog_bud_item_ops,
+	&xlog_sxi_item_ops,
+	&xlog_sxd_item_ops,
 };
 
 static const struct xlog_recover_item_ops *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 175dc7acaca8..85ced8cc6070 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -36,6 +36,7 @@
 #include "xfs_bmap_item.h"
 #include "xfs_reflink.h"
 #include "xfs_pwork.h"
+#include "xfs_swapext_item.h"
 
 #include <linux/magic.h>
 #include <linux/fs_context.h>
@@ -2121,8 +2122,24 @@ xfs_init_zones(void)
 	if (!xfs_bui_zone)
 		goto out_destroy_bud_zone;
 
+	xfs_sxd_zone = kmem_cache_create("xfs_sxd_item",
+					 sizeof(struct xfs_sxd_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxd_zone)
+		goto out_destroy_bui_zone;
+
+	xfs_sxi_zone = kmem_cache_create("xfs_sxi_item",
+					 sizeof(struct xfs_sxi_log_item),
+					 0, 0, NULL);
+	if (!xfs_sxi_zone)
+		goto out_destroy_sxd_zone;
+
 	return 0;
 
+ out_destroy_sxd_zone:
+	kmem_cache_destroy(xfs_sxd_zone);
+ out_destroy_bui_zone:
+	kmem_cache_destroy(xfs_bui_zone);
  out_destroy_bud_zone:
 	kmem_cache_destroy(xfs_bud_zone);
  out_destroy_cui_zone:
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
new file mode 100644
index 000000000000..83913e9fd4d4
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.c
@@ -0,0 +1,328 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_shared.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_swapext_item.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_log_priv.h"
+#include "xfs_log_recover.h"
+
+kmem_zone_t	*xfs_sxi_zone;
+kmem_zone_t	*xfs_sxd_zone;
+
+static const struct xfs_item_ops xfs_sxi_item_ops;
+
+static inline struct xfs_sxi_log_item *SXI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxi_log_item, sxi_item);
+}
+
+STATIC void
+xfs_sxi_item_free(
+	struct xfs_sxi_log_item	*sxi_lip)
+{
+	kmem_cache_free(xfs_sxi_zone, sxi_lip);
+}
+
+/*
+ * Freeing the SXI requires that we remove it from the AIL if it has already
+ * been placed there. However, the SXI may not yet have been placed in the AIL
+ * when called by xfs_sxi_release() from SXD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the SXI.
+ */
+STATIC void
+xfs_sxi_release(
+	struct xfs_sxi_log_item	*sxi_lip)
+{
+	ASSERT(atomic_read(&sxi_lip->sxi_refcount) > 0);
+	if (atomic_dec_and_test(&sxi_lip->sxi_refcount)) {
+		xfs_trans_ail_delete(&sxi_lip->sxi_item, SHUTDOWN_LOG_IO_ERROR);
+		xfs_sxi_item_free(sxi_lip);
+	}
+}
+
+
+STATIC void
+xfs_sxi_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxi_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxi log
+ * item. We use only 1 iovec, and we point that at the sxi_log_format structure
+ * embedded in the sxi item.
+ */
+STATIC void
+xfs_sxi_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	sxi_lip->sxi_format.sxi_type = XFS_LI_SXI;
+	sxi_lip->sxi_format.sxi_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXI_FORMAT,
+			&sxi_lip->sxi_format,
+			sizeof(struct xfs_sxi_log_format));
+}
+
+/*
+ * The unpin operation is the last place an SXI is manipulated in the log. It
+ * is either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the SXI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the SXI to either construct
+ * and commit the SXD or drop the SXD's reference in the event of error. Simply
+ * drop the log's SXI reference now that the log is done with it.
+ */
+STATIC void
+xfs_sxi_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
+
+	xfs_sxi_release(sxi_lip);
+}
+
+/*
+ * The SXI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an SXD isn't going to be
+ * constructed and thus we free the SXI here directly.
+ */
+STATIC void
+xfs_sxi_item_release(
+	struct xfs_log_item	*lip)
+{
+	xfs_sxi_release(SXI_ITEM(lip));
+}
+
+/* Allocate and initialize an sxi item. */
+STATIC struct xfs_sxi_log_item *
+xfs_sxi_init(
+	struct xfs_mount		*mp)
+
+{
+	struct xfs_sxi_log_item		*sxi_lip;
+
+	sxi_lip = kmem_cache_zalloc(xfs_sxi_zone, GFP_KERNEL | __GFP_NOFAIL);
+
+	xfs_log_item_init(mp, &sxi_lip->sxi_item, XFS_LI_SXI, &xfs_sxi_item_ops);
+	sxi_lip->sxi_format.sxi_id = (uintptr_t)(void *)sxi_lip;
+	atomic_set(&sxi_lip->sxi_refcount, 2);
+
+	return sxi_lip;
+}
+
+static inline struct xfs_sxd_log_item *SXD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_sxd_log_item, sxd_item);
+}
+
+STATIC void
+xfs_sxd_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_sxd_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given sxd log
+ * item. We use only 1 iovec, and we point that at the sxd_log_format structure
+ * embedded in the sxd item.
+ */
+STATIC void
+xfs_sxd_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	sxd_lip->sxd_format.sxd_type = XFS_LI_SXD;
+	sxd_lip->sxd_format.sxd_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXD_FORMAT, &sxd_lip->sxd_format,
+			sizeof(struct xfs_sxd_log_format));
+}
+
+/*
+ * The SXD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the SXI and free the
+ * SXD.
+ */
+STATIC void
+xfs_sxd_item_release(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
+
+	xfs_sxi_release(sxd_lip->sxd_intent_log_item);
+	kmem_cache_free(xfs_sxd_zone, sxd_lip);
+}
+
+static const struct xfs_item_ops xfs_sxd_item_ops = {
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
+	.iop_size	= xfs_sxd_item_size,
+	.iop_format	= xfs_sxd_item_format,
+	.iop_release	= xfs_sxd_item_release,
+};
+
+/* Process a swapext update intent item that was recovered from the log. */
+STATIC int
+xfs_sxi_item_recover(
+	struct xfs_log_item		*lip,
+	struct list_head		*capture_list)
+{
+	return -EFSCORRUPTED;
+}
+
+STATIC bool
+xfs_sxi_item_match(
+	struct xfs_log_item	*lip,
+	uint64_t		intent_id)
+{
+	return SXI_ITEM(lip)->sxi_format.sxi_id == intent_id;
+}
+
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_sxi_item_relog(
+	struct xfs_log_item		*intent,
+	struct xfs_trans		*tp)
+{
+	ASSERT(0);
+	return NULL;
+}
+
+static const struct xfs_item_ops xfs_sxi_item_ops = {
+	.iop_size	= xfs_sxi_item_size,
+	.iop_format	= xfs_sxi_item_format,
+	.iop_unpin	= xfs_sxi_item_unpin,
+	.iop_release	= xfs_sxi_item_release,
+	.iop_recover	= xfs_sxi_item_recover,
+	.iop_match	= xfs_sxi_item_match,
+	.iop_relog	= xfs_sxi_item_relog,
+};
+
+/*
+ * Copy an SXI format buffer from the given buf, and into the destination SXI
+ * format structure.  The SXI/SXD items were designed not to need any special
+ * alignment handling.
+ */
+static int
+xfs_sxi_copy_format(
+	struct xfs_log_iovec		*buf,
+	struct xfs_sxi_log_format	*dst_sxi_fmt)
+{
+	struct xfs_sxi_log_format	*src_sxi_fmt;
+	size_t				len;
+
+	src_sxi_fmt = buf->i_addr;
+	len = sizeof(struct xfs_sxi_log_format);
+
+	if (buf->i_len == len) {
+		memcpy(dst_sxi_fmt, src_sxi_fmt, len);
+		return 0;
+	}
+	XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, NULL);
+	return -EFSCORRUPTED;
+}
+
+/*
+ * This routine is called to create an in-core extent swapext update item from
+ * the sxi format structure which was logged on disk.  It allocates an in-core
+ * sxi, copies the extents from the format structure into it, and adds the sxi
+ * to the AIL with the given LSN.
+ */
+STATIC int
+xlog_recover_sxi_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	int				error;
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_sxi_log_item		*sxi_lip;
+	struct xfs_sxi_log_format	*sxi_formatp;
+
+	sxi_formatp = item->ri_buf[0].i_addr;
+
+	if (sxi_formatp->__pad != 0) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+	sxi_lip = xfs_sxi_init(mp);
+	error = xfs_sxi_copy_format(&item->ri_buf[0], &sxi_lip->sxi_format);
+	if (error) {
+		xfs_sxi_item_free(sxi_lip);
+		return error;
+	}
+	xfs_trans_ail_insert(log->l_ailp, &sxi_lip->sxi_item, lsn);
+	xfs_sxi_release(sxi_lip);
+	return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxi_item_ops = {
+	.item_type		= XFS_LI_SXI,
+	.commit_pass2		= xlog_recover_sxi_commit_pass2,
+};
+
+/*
+ * This routine is called when an SXD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding SXI if it
+ * was still in the log. To do this it searches the AIL for the SXI with an id
+ * equal to that in the SXD format structure. If we find it we drop the SXD
+ * reference, which removes the SXI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_sxd_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	struct xfs_sxd_log_format	*sxd_formatp;
+
+	sxd_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_sxd_log_format)) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xlog_recover_release_intent(log, XFS_LI_SXI, sxd_formatp->sxd_sxi_id);
+	return 0;
+}
+
+const struct xlog_recover_item_ops xlog_sxd_item_ops = {
+	.item_type		= XFS_LI_SXD,
+	.commit_pass2		= xlog_recover_sxd_commit_pass2,
+};
diff --git a/fs/xfs/xfs_swapext_item.h b/fs/xfs/xfs_swapext_item.h
new file mode 100644
index 000000000000..7caeccdcaa81
--- /dev/null
+++ b/fs/xfs/xfs_swapext_item.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef	__XFS_SWAPEXT_ITEM_H__
+#define	__XFS_SWAPEXT_ITEM_H__
+
+/*
+ * The extent swapping intent item helps us perform atomic extent swaps between
+ * two inode forks.  It does this by tracking the range of logical offsets that
+ * still need to be swapped, and relogs as progress happens.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * rest of the extent swaps.
+ */
+
+/* kernel only SXI/SXD definitions */
+
+struct xfs_mount;
+struct kmem_zone;
+
+/*
+ * Max number of extents in fast allocation path.
+ */
+#define	XFS_SXI_MAX_FAST_EXTENTS	1
+
+/*
+ * This is the "swapext update intent" log item.  It is used to log the fact
+ * that we are swapping extents between two files.  It is used in conjunction
+ * with the "swapext update done" log item described below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_sxi_log_item {
+	struct xfs_log_item		sxi_item;
+	atomic_t			sxi_refcount;
+	struct xfs_sxi_log_format	sxi_format;
+};
+
+/*
+ * This is the "swapext update done" log item.  It is used to log the fact that
+ * some extent swapping mentioned in an earlier sxi item has been performed.
+ */
+struct xfs_sxd_log_item {
+	struct xfs_log_item		sxd_item;
+	struct xfs_sxi_log_item		*sxd_intent_log_item;
+	struct xfs_sxd_log_format	sxd_format;
+};
+
+extern struct kmem_zone	*xfs_sxi_zone;
+extern struct kmem_zone	*xfs_sxd_zone;
+
+#endif	/* __XFS_SWAPEXT_ITEM_H__ */

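Not part of the patch: a small userspace model of the SXI reference
counting described in the comments above, assuming C11 <stdatomic.h>.
The intent item starts with two references, one held by the log (dropped
at unpin time) and one held on behalf of the SXD (dropped when the done
item is released, or by whoever committed the SXI on error).  Only the
last release frees the item.  The names struct sxi_model,
sxi_model_init() and sxi_model_release() are hypothetical and stand in
for xfs_sxi_init() and xfs_sxi_release().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct xfs_sxi_log_item. */
struct sxi_model {
	atomic_int	refcount;
	bool		freed;	/* models AIL removal plus kmem_cache_free */
};

/* Start with two references: one for the log, one for the SXD. */
static void sxi_model_init(struct sxi_model *sxi)
{
	atomic_init(&sxi->refcount, 2);
	sxi->freed = false;
}

/*
 * Drop one reference.  Returns true if this call dropped the last
 * reference and therefore "freed" the item.
 */
static bool sxi_model_release(struct sxi_model *sxi)
{
	assert(atomic_load(&sxi->refcount) > 0);

	/* atomic_fetch_sub returns the value before the decrement. */
	if (atomic_fetch_sub(&sxi->refcount, 1) == 1) {
		sxi->freed = true;
		return true;
	}
	return false;
}
```

In this model the order of the two releases does not matter, which is
the point of the refcount: the SXD may be processed before or after the
SXI is unpinned, and the item is torn down exactly once either way.
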


* [PATCH 07/18] xfs: create deferred log items for extent swapping
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (5 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 06/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 08/18] xfs: add a ->xchg_file_range handler Darrick J. Wong
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Now that we've created the skeleton of a log intent item to track and
restart extent swap operations, add the upper level logic to commit
intent items and turn them into concrete work recorded in the log.  We
use the deferred item "multihop" feature that was introduced a few
patches ago to constrain the number of active swap operations to one per
thread.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
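Not part of the patch: a minimal userspace sketch of the core step in
xfs_swapext_finish_one(), assuming plain 64-bit integers in place of the
kernel's xfs_fileoff_t and xfs_fsblock_t types.  Both mappings are
trimmed to the shorter length, then the file offsets are exchanged so
that each file's logical range ends up pointing at the other file's
physical blocks.  struct mapping and swap_step() are hypothetical names
that stand in for struct xfs_bmbt_irec and the unmap/map sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct xfs_bmbt_irec. */
struct mapping {
	uint64_t	startoff;	/* logical offset in the file */
	uint64_t	startblock;	/* physical block */
	uint64_t	blockcount;	/* length in blocks */
};

/*
 * Trim both mappings to the shorter length and exchange their file
 * offsets.  On return, *m2 describes what should be mapped into file 1
 * and *m1 describes what should be mapped into file 2.  Returns the
 * number of blocks swapped so the caller can advance its cursor.
 */
static uint64_t swap_step(struct mapping *m1, struct mapping *m2)
{
	uint64_t	len;
	uint64_t	tmp;

	assert(m1 && m2);

	/* Only as many blocks as the smaller of the two mappings. */
	len = m1->blockcount < m2->blockcount ?
			m1->blockcount : m2->blockcount;
	m1->blockcount = len;
	m2->blockcount = len;

	/* Exchange logical offsets; physical blocks stay with each record. */
	tmp = m1->startoff;
	m1->startoff = m2->startoff;
	m2->startoff = tmp;

	return len;
}
```

A caller modelling the loop would add the returned length to both start
offsets and subtract it from the remaining block count, stopping once
the count reaches zero.
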
 fs/xfs/Makefile                 |    2 
 fs/xfs/libxfs/xfs_bmap.h        |    4 
 fs/xfs/libxfs/xfs_defer.c       |    1 
 fs/xfs/libxfs/xfs_defer.h       |    2 
 fs/xfs/libxfs/xfs_log_recover.h |    2 
 fs/xfs/libxfs/xfs_swapext.c     |  878 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_swapext.h     |   84 ++++
 fs/xfs/xfs_bmap_item.c          |   11 
 fs/xfs/xfs_bmap_util.c          |    1 
 fs/xfs/xfs_log_recover.c        |   25 +
 fs/xfs/xfs_swapext_item.c       |  327 ++++++++++++++-
 fs/xfs/xfs_trace.c              |    1 
 fs/xfs/xfs_trace.h              |  185 ++++++++
 fs/xfs/xfs_xchgrange.c          |   66 +++
 fs/xfs/xfs_xchgrange.h          |   19 +
 15 files changed, 1593 insertions(+), 15 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_swapext.c
 create mode 100644 fs/xfs/libxfs/xfs_swapext.h
 create mode 100644 fs/xfs/xfs_xchgrange.c
 create mode 100644 fs/xfs/xfs_xchgrange.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index a7cc6f496ad0..f356869d8fd9 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -46,6 +46,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_refcount.o \
 				   xfs_refcount_btree.o \
 				   xfs_sb.o \
+				   xfs_swapext.o \
 				   xfs_symlink_remote.o \
 				   xfs_trans_inode.o \
 				   xfs_trans_resv.o \
@@ -92,6 +93,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_sysfs.o \
 				   xfs_trans.o \
 				   xfs_xattr.o \
+				   xfs_xchgrange.o \
 				   kmem.o
 
 # low-level transaction/log code
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index 30ce3ba24259..bdf725ded307 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -159,7 +159,7 @@ static inline int xfs_bmapi_whichfork(int bmapi_flags)
 	{ BMAP_COWFORK,		"COW" }
 
 /* Return true if the extent is an allocated extent, written or not. */
-static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec)
 {
 	return irec->br_startblock != HOLESTARTBLOCK &&
 		irec->br_startblock != DELAYSTARTBLOCK &&
@@ -170,7 +170,7 @@ static inline bool xfs_bmap_is_real_extent(struct xfs_bmbt_irec *irec)
  * Return true if the extent is a real, allocated extent, or false if it is  a
  * delayed allocation, and unwritten extent or a hole.
  */
-static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
 {
 	return xfs_bmap_is_real_extent(irec) &&
 	       irec->br_state != XFS_EXT_UNWRITTEN;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index a7d1357687d0..927e7245d7ec 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -178,6 +178,7 @@ static const struct xfs_defer_op_type *defer_op_types[] = {
 	[XFS_DEFER_OPS_TYPE_RMAP]	= &xfs_rmap_update_defer_type,
 	[XFS_DEFER_OPS_TYPE_FREE]	= &xfs_extent_free_defer_type,
 	[XFS_DEFER_OPS_TYPE_AGFL_FREE]	= &xfs_agfl_free_defer_type,
+	[XFS_DEFER_OPS_TYPE_SWAPEXT]	= &xfs_swapext_defer_type,
 };
 
 static void
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index f5e3ca17aa26..99ff9feb0d9b 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -19,6 +19,7 @@ enum xfs_defer_ops_type {
 	XFS_DEFER_OPS_TYPE_RMAP,
 	XFS_DEFER_OPS_TYPE_FREE,
 	XFS_DEFER_OPS_TYPE_AGFL_FREE,
+	XFS_DEFER_OPS_TYPE_SWAPEXT,
 	XFS_DEFER_OPS_TYPE_MAX,
 };
 
@@ -63,6 +64,7 @@ extern const struct xfs_defer_op_type xfs_refcount_update_defer_type;
 extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
+extern const struct xfs_defer_op_type xfs_swapext_defer_type;
 
 /*
  * This structure enables a dfops user to detach the chain of deferred
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index dcc11a8c438a..cb8f17074bef 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -124,6 +124,8 @@ void xlog_buf_readahead(struct xlog *log, xfs_daddr_t blkno, uint len,
 		const struct xfs_buf_ops *ops);
 bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len);
 
+int xlog_recover_iget(struct xfs_mount *mp, xfs_ino_t ino,
+		struct xfs_inode **ipp);
 void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type,
 		uint64_t intent_id);
 
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
new file mode 100644
index 000000000000..9fb67cbd018f
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -0,0 +1,878 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_quota.h"
+#include "xfs_swapext.h"
+#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+
+/* Information to help us reset reflink flag / CoW fork state after a swap. */
+
+/* Are we swapping the data fork? */
+#define XFS_SX_REFLINK_DATAFORK		(1U << 0)
+
+/* Can we swap the flags? */
+#define XFS_SX_REFLINK_SWAPFLAGS	(1U << 1)
+
+/* Previous state of the two inodes' reflink flags. */
+#define XFS_SX_REFLINK_IP1_REFLINK	(1U << 2)
+#define XFS_SX_REFLINK_IP2_REFLINK	(1U << 3)
+
+
+/*
+ * Prepare both inodes' reflink state for an extent swap, and return our
+ * findings so that xfs_swapext_reflink_finish can deal with the aftermath.
+ */
+unsigned int
+xfs_swapext_reflink_prep(
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+	unsigned int			rs = 0;
+
+	if (req->whichfork != XFS_DATA_FORK)
+		return 0;
+
+	/*
+	 * If either file has shared blocks and we're swapping data forks, we
+	 * must flag the other file as having shared blocks so that we get the
+	 * shared-block rmap functions if we need to fix up the rmaps.  The
+	 * flags will be switched for real by xfs_swapext_reflink_finish.
+	 */
+	if (xfs_is_reflink_inode(req->ip1))
+		rs |= XFS_SX_REFLINK_IP1_REFLINK;
+	if (xfs_is_reflink_inode(req->ip2))
+		rs |= XFS_SX_REFLINK_IP2_REFLINK;
+
+	if (rs & XFS_SX_REFLINK_IP1_REFLINK)
+		req->ip2->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+	if (rs & XFS_SX_REFLINK_IP2_REFLINK)
+		req->ip1->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+	/*
+	 * If either file had the reflink flag set before; and the two files'
+	 * reflink state was different; and we're swapping the entirety of both
+	 * files, then we can exchange the reflink flags at the end.
+	 * Otherwise, we propagate the reflink flag from either file to the
+	 * other file.
+	 *
+	 * Note that we've only set the _REFLINK flags of the reflink state, so
+	 * we can cheat and use hweight32 for the reflink flag test.
+	 *
+	 */
+	if (hweight32(rs) == 1 && req->startoff1 == 0 && req->startoff2 == 0 &&
+	    req->blockcount == XFS_B_TO_FSB(mp, req->ip1->i_d.di_size) &&
+	    req->blockcount == XFS_B_TO_FSB(mp, req->ip2->i_d.di_size))
+		rs |= XFS_SX_REFLINK_SWAPFLAGS;
+
+	rs |= XFS_SX_REFLINK_DATAFORK;
+	return rs;
+}
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them.  If there's a CoW fork and it
+ * has extents in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low on space.
+ */
+static inline void
+xfs_swapext_ensure_cowfork(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*cfork;
+
+	if (xfs_is_reflink_inode(ip))
+		xfs_ifork_init_cow(ip);
+
+	cfork = XFS_IFORK_PTR(ip, XFS_COW_FORK);
+	if (!cfork)
+		return;
+	if (cfork->if_bytes > 0)
+		xfs_inode_set_cowblocks_tag(ip);
+	else
+		xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/*
+ * Set both inodes' ondisk reflink flags to their final state and ensure that
+ * the incore state is ready to go.
+ */
+void
+xfs_swapext_reflink_finish(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_req	*req,
+	unsigned int			rs)
+{
+	if (!(rs & XFS_SX_REFLINK_DATAFORK))
+		return;
+
+	if (rs & XFS_SX_REFLINK_SWAPFLAGS) {
+		/* Exchange the reflink inode flags and log them. */
+		req->ip1->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		if (rs & XFS_SX_REFLINK_IP2_REFLINK)
+			req->ip1->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+		req->ip2->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		if (rs & XFS_SX_REFLINK_IP1_REFLINK)
+			req->ip2->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
+
+		xfs_trans_log_inode(tp, req->ip1, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, req->ip2, XFS_ILOG_CORE);
+	}
+
+	xfs_swapext_ensure_cowfork(req->ip1);
+	xfs_swapext_ensure_cowfork(req->ip2);
+}
+
+/* Schedule an atomic extent swap. */
+static inline void
+xfs_swapext_schedule(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	trace_xfs_swapext_defer(tp->t_mountp, sxi);
+	xfs_defer_add(tp, XFS_DEFER_OPS_TYPE_SWAPEXT, &sxi->sxi_list);
+}
+
+/* Reschedule an atomic extent swap on behalf of log recovery. */
+void
+xfs_swapext_reschedule(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_swapext_intent	*new_sxi;
+
+	new_sxi = kmem_alloc(sizeof(struct xfs_swapext_intent), KM_NOFS);
+	memcpy(new_sxi, sxi, sizeof(*new_sxi));
+	INIT_LIST_HEAD(&new_sxi->sxi_list);
+
+	xfs_swapext_schedule(tp, new_sxi);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never map extents
+ * into the file past EOF.  This is crucial so that log recovery won't get
+ * confused by the sudden appearance of post-eof extents.
+ */
+STATIC void
+xfs_swapext_update_size(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fsize_t		new_isize)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_fsize_t		len;
+
+	if (new_isize < 0)
+		return;
+
+	len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+		  new_isize);
+
+	if (len <= ip->i_d.di_size)
+		return;
+
+	trace_xfs_swapext_update_inode_size(ip, len);
+
+	ip->i_d.di_size = len;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Do we have more work to do to finish this operation? */
+bool
+xfs_swapext_has_more_work(
+	struct xfs_swapext_intent	*sxi)
+{
+	return sxi->sxi_blockcount > 0;
+}
+
+/* Check all extents to make sure we can actually swap them. */
+int
+xfs_swapext_check_extents(
+	struct xfs_mount		*mp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_ifork		*ifp1, *ifp2;
+
+	/* No fork? */
+	ifp1 = XFS_IFORK_PTR(req->ip1, req->whichfork);
+	ifp2 = XFS_IFORK_PTR(req->ip2, req->whichfork);
+	if (!ifp1 || !ifp2)
+		return -EINVAL;
+
+	/* We don't know how to swap local format forks. */
+	if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/* We don't support realtime data forks yet. */
+	if (!XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+	if (req->whichfork == XFS_ATTR_FORK)
+		return 0;
+	return -EINVAL;
+}
+
+#ifdef CONFIG_XFS_QUOTA
+/* Log the actual updates to the quota accounting. */
+static inline void
+xfs_swapext_update_quota(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int64_t				ip1_delta = 0, ip2_delta = 0;
+	unsigned int			qflag;
+
+	qflag = XFS_IS_REALTIME_INODE(sxi->sxi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
+						      XFS_TRANS_DQ_BCOUNT;
+
+	if (xfs_bmap_is_real_extent(irec1)) {
+		ip1_delta -= irec1->br_blockcount;
+		ip2_delta += irec1->br_blockcount;
+	}
+
+	if (xfs_bmap_is_real_extent(irec2)) {
+		ip1_delta += irec2->br_blockcount;
+		ip2_delta -= irec2->br_blockcount;
+	}
+
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip1, qflag, ip1_delta);
+	xfs_trans_mod_dquot_byino(tp, sxi->sxi_ip2, qflag, ip2_delta);
+}
+#else
+# define xfs_swapext_update_quota(tp, sxi, irec1, irec2)	((void)0)
+#endif
+
+/* Finish one extent swap, possibly log more. */
+int
+xfs_swapext_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	int				whichfork;
+	int				nimaps;
+	int				bmap_flags;
+	int				error;
+
+	whichfork = (sxi->sxi_flags & XFS_SWAP_EXTENT_ATTR_FORK) ?
+			XFS_ATTR_FORK : XFS_DATA_FORK;
+	bmap_flags = xfs_bmapi_aflag(whichfork);
+
+	while (sxi->sxi_blockcount > 0) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip1, sxi->sxi_startoff1,
+				sxi->sxi_blockcount, &irec1, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec1.br_startblock == DELAYSTARTBLOCK ||
+		    irec1.br_startoff != sxi->sxi_startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * If the caller told us to ignore sparse areas of file1, jump
+		 * ahead to the next region.
+		 */
+		if ((sxi->sxi_flags & XFS_SWAP_EXTENT_SKIP_FILE1_HOLES) &&
+		    irec1.br_startblock == HOLESTARTBLOCK) {
+			trace_xfs_swapext_extent1(sxi->sxi_ip1, &irec1);
+
+			sxi->sxi_startoff1 += irec1.br_blockcount;
+			sxi->sxi_startoff2 += irec1.br_blockcount;
+			sxi->sxi_blockcount -= irec1.br_blockcount;
+			continue;
+		}
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(sxi->sxi_ip2, sxi->sxi_startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec2.br_startblock == DELAYSTARTBLOCK ||
+		    irec2.br_startoff != sxi->sxi_startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		trace_xfs_swapext_extent1(sxi->sxi_ip1, &irec1);
+		trace_xfs_swapext_extent2(sxi->sxi_ip2, &irec2);
+
+		/*
+		 * Two extents mapped to the same physical block must not have
+		 * different states; that's filesystem corruption.  Move on to
+		 * the next extent if they're both holes or both the same
+		 * physical extent.
+		 */
+		if (irec1.br_startblock == irec2.br_startblock) {
+			if (irec1.br_state != irec2.br_state)
+				return -EFSCORRUPTED;
+
+			sxi->sxi_startoff1 += irec1.br_blockcount;
+			sxi->sxi_startoff2 += irec1.br_blockcount;
+			sxi->sxi_blockcount -= irec1.br_blockcount;
+			continue;
+		}
+
+		xfs_swapext_update_quota(tp, sxi, &irec1, &irec2);
+
+		/* Remove both mappings. */
+		xfs_bmap_unmap_extent(tp, sxi->sxi_ip1, whichfork, &irec1);
+		xfs_bmap_unmap_extent(tp, sxi->sxi_ip2, whichfork, &irec2);
+
+		/*
+		 * Re-add both mappings.  We swap the file offsets between the
+		 * two maps and add the opposite map, which has the effect of
+		 * filling the logical offsets we just unmapped, but with
+		 * the physical mapping information swapped.
+		 */
+		swap(irec1.br_startoff, irec2.br_startoff);
+		xfs_bmap_map_extent(tp, sxi->sxi_ip1, whichfork, &irec2);
+		xfs_bmap_map_extent(tp, sxi->sxi_ip2, whichfork, &irec1);
+
+		/* Make sure we're not mapping extents past EOF. */
+		if (whichfork == XFS_DATA_FORK) {
+			xfs_swapext_update_size(tp, sxi->sxi_ip1, &irec2,
+					sxi->sxi_isize1);
+			xfs_swapext_update_size(tp, sxi->sxi_ip2, &irec1,
+					sxi->sxi_isize2);
+		}
+
+		/*
+		 * Advance our cursor and exit.  The caller (either defer ops
+		 * or log recovery) will log the SXD item, and if
+		 * sxi_blockcount is still nonzero, it will log a new SXI item
+		 * for the remainder and call us back.
+		 */
+		sxi->sxi_startoff1 += irec1.br_blockcount;
+		sxi->sxi_startoff2 += irec1.br_blockcount;
+		sxi->sxi_blockcount -= irec1.br_blockcount;
+		break;
+	}
+
+	/*
+	 * If the caller asked us to exchange the file sizes and we're done
+	 * moving extents, update the ondisk file sizes now.
+	 */
+	if (sxi->sxi_blockcount == 0 &&
+	    (sxi->sxi_flags & XFS_SWAP_EXTENT_SET_SIZES)) {
+		sxi->sxi_ip1->i_d.di_size = sxi->sxi_isize1;
+		sxi->sxi_ip2->i_d.di_size = sxi->sxi_isize2;
+
+		xfs_trans_log_inode(tp, sxi->sxi_ip1, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
+	}
+
+	if (xfs_swapext_has_more_work(sxi))
+		trace_xfs_swapext_defer(tp->t_mountp, sxi);
+
+	return 0;
+}
+
+/* Estimate the bmbt and rmapbt overhead required to exchange extents. */
+static int
+xfs_swapext_estimate_overhead(
+	const struct xfs_swapext_req	*req,
+	struct xfs_swapext_res		*res)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+	unsigned int			bmbt_overhead;
+
+	/*
+	 * Compute the number of bmbt blocks we should reserve for each file.
+	 *
+	 * Conceptually this shouldn't affect the shape of either bmbt, but
+	 * since we atomically move extents one by one, we reserve enough space
+	 * to handle a bmbt split for each remap operation (t1).
+	 *
+	 * However, we must be careful to handle a corner case where the
+	 * repeated unmap and map activities could result in ping-ponging of
+	 * the btree shape.  This behavior can come from one of two sources:
+	 *
+	 * An inode's extent list could have just enough records to straddle
+	 * the btree format boundary. If so, the inode could bounce between
+	 * btree <-> extent format on unmap -> remap cycles, freeing and
+	 * allocating a bmapbt block each time.
+	 *
+	 * The same thing can happen if we have just enough records in a block
+	 * to bounce between one and two leaf blocks. If there aren't enough
+	 * sibling blocks to absorb or donate some records, we end up reshaping
+	 * the tree with every remap operation.  This doesn't seem to happen if
+	 * we have more than four bmbt leaf blocks, so we'll make that the
+	 * lower bound on the pingponging (t2).
+	 *
+	 * Therefore, we use XFS_TRANS_RES_FDBLKS so that freed bmbt blocks
+	 * are accounted back to the transaction block reservation.
+	 */
+	bmbt_overhead = XFS_NEXTENTADD_SPACE_RES(mp, res->nr_exchanges,
+						 req->whichfork);
+	res->ip1_bcount += bmbt_overhead;
+	res->ip2_bcount += bmbt_overhead;
+	res->resblks += 2 * bmbt_overhead;
+
+	/* Apply similar logic to rmapbt reservations. */
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		unsigned int	rmapbt_overhead;
+
+		if (!XFS_IS_REALTIME_INODE(req->ip1))
+			rmapbt_overhead = XFS_NRMAPADD_SPACE_RES(mp,
+							res->nr_exchanges);
+		else
+			rmapbt_overhead = 0;
+		res->resblks += 2 * rmapbt_overhead;
+	}
+
+	trace_xfs_swapext_estimate(req, res);
+
+	if (res->resblks > UINT_MAX)
+		return -ENOSPC;
+	return 0;
+}
+
+/* Decide if we can merge two real extents. */
+static inline bool
+can_merge(
+	const struct xfs_bmbt_irec	*b1,
+	const struct xfs_bmbt_irec	*b2)
+{
+	/* Zero length means uninitialized. */
+	if (b1->br_blockcount == 0 || b2->br_blockcount == 0)
+		return false;
+
+	/* We don't merge holes. */
+	if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
+		return false;
+
+	if (b1->br_startoff   + b1->br_blockcount == b2->br_startoff &&
+	    b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
+	    b1->br_state			  == b2->br_state &&
+	    b1->br_blockcount + b2->br_blockcount <= MAXEXTLEN)
+		return true;
+
+	return false;
+}
+
+#define CLEFT_CONTIG	0x01
+#define CRIGHT_CONTIG	0x02
+#define CHOLE		0x04
+#define CBOTH_CONTIG	(CLEFT_CONTIG | CRIGHT_CONTIG)
+
+#define NLEFT_CONTIG	0x10
+#define NRIGHT_CONTIG	0x20
+#define NHOLE		0x40
+#define NBOTH_CONTIG	(NLEFT_CONTIG | NRIGHT_CONTIG)
+
+/* Estimate the effect of a single swap on extent count. */
+static inline int
+delta_nextents_step(
+	struct xfs_mount		*mp,
+	const struct xfs_bmbt_irec	*left,
+	const struct xfs_bmbt_irec	*curr,
+	const struct xfs_bmbt_irec	*new,
+	const struct xfs_bmbt_irec	*right)
+{
+	bool				lhole, rhole, chole, nhole;
+	unsigned int			state = 0;
+	int				ret = 0;
+
+	lhole = left->br_blockcount == 0 ||
+		left->br_startblock == HOLESTARTBLOCK;
+	rhole = right->br_blockcount == 0 ||
+		right->br_startblock == HOLESTARTBLOCK;
+	chole = curr->br_startblock == HOLESTARTBLOCK;
+	nhole = new->br_startblock == HOLESTARTBLOCK;
+
+	if (chole)
+		state |= CHOLE;
+	if (!lhole && !chole && can_merge(left, curr))
+		state |= CLEFT_CONTIG;
+	if (!rhole && !chole && can_merge(curr, right))
+		state |= CRIGHT_CONTIG;
+	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
+	    left->br_blockcount + curr->br_blockcount +
+					right->br_blockcount > MAXEXTLEN)
+		state &= ~CRIGHT_CONTIG;
+
+	if (nhole)
+		state |= NHOLE;
+	if (!lhole && !nhole && can_merge(left, new))
+		state |= NLEFT_CONTIG;
+	if (!rhole && !nhole && can_merge(new, right))
+		state |= NRIGHT_CONTIG;
+	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
+	    left->br_blockcount + new->br_blockcount +
+					right->br_blockcount > MAXEXTLEN)
+		state &= ~NRIGHT_CONTIG;
+
+	switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
+	case CLEFT_CONTIG | CRIGHT_CONTIG:
+		/*
+		 * left/curr/right are the same extent, so deleting curr causes
+		 * 2 new extents to be created.
+		 */
+		ret += 2;
+		break;
+	case 0:
+		/*
+		 * curr is not contiguous with any extent, so we remove curr
+		 * completely
+		 */
+		ret--;
+		break;
+	case CHOLE:
+		/* hole, do nothing */
+		break;
+	case CLEFT_CONTIG:
+	case CRIGHT_CONTIG:
+		/* trim either left or right, no change */
+		break;
+	}
+
+	switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
+	case NLEFT_CONTIG | NRIGHT_CONTIG:
+		/*
+		 * left/new/right will become the same extent, so adding
+		 * new causes the deletion of right.
+		 */
+		ret--;
+		break;
+	case 0:
+		/* new is not contiguous with any extent */
+		ret++;
+		break;
+	case NHOLE:
+		/* hole, do nothing. */
+		break;
+	case NLEFT_CONTIG:
+	case NRIGHT_CONTIG:
+		/* new is absorbed into left or right, no change */
+		break;
+	}
+
+	trace_xfs_swapext_delta_nextents_step(mp, left, curr, new, right, ret,
+			state);
+	return ret;
+}
+
+/* Make sure we don't overflow the extent counters. */
+static inline int
+check_delta_nextents(
+	const struct xfs_swapext_req	*req,
+	struct xfs_inode		*ip,
+	int64_t				delta)
+{
+	ASSERT(delta < INT_MAX);
+	ASSERT(delta > INT_MIN);
+
+	if (delta < 0)
+		return 0;
+
+	return xfs_iext_count_may_overflow(ip, req->whichfork, delta);
+}
+
+/* Find the next extent after irec. */
+static inline int
+get_next_ext(
+	struct xfs_inode		*ip,
+	unsigned int			bmap_flags,
+	const struct xfs_bmbt_irec	*irec,
+	struct xfs_bmbt_irec		*nrec)
+{
+	xfs_fileoff_t			off;
+	xfs_filblks_t			blockcount;
+	int				nimaps = 1;
+	int				error;
+
+	off = irec->br_startoff + irec->br_blockcount;
+	blockcount = XFS_MAX_FILEOFF - off;
+	error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
+	if (error)
+		return error;
+	if (nrec->br_startblock == DELAYSTARTBLOCK ||
+	    nrec->br_startoff != off) {
+		/*
+		 * If we don't get the extent we want, return a zero-length
+		 * mapping, which our estimator function will pretend is a hole.
+		 */
+		nrec->br_blockcount = 0;
+	}
+
+	return 0;
+}
+
+/*
+ * Estimate the number of exchange operations and the number of file blocks
+ * in each file that will be affected by the exchange operation.
+ */
+int
+xfs_swapext_estimate(
+	const struct xfs_swapext_req	*req,
+	struct xfs_swapext_res		*res)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	struct xfs_bmbt_irec		lrec1 = { }, lrec2 = { };
+	struct xfs_bmbt_irec		rrec1, rrec2;
+	xfs_fileoff_t			startoff1 = req->startoff1;
+	xfs_fileoff_t			startoff2 = req->startoff2;
+	xfs_filblks_t			blockcount = req->blockcount;
+	xfs_filblks_t			ip1_blocks = 0, ip2_blocks = 0;
+	int64_t				d_nexts1, d_nexts2;
+	int				bmap_flags;
+	int				nimaps;
+	int				error;
+
+	bmap_flags = xfs_bmapi_aflag(req->whichfork);
+	memset(res, 0, sizeof(struct xfs_swapext_res));
+
+	/*
+	 * To guard against the possibility of overflowing the extent counters,
+	 * we have to estimate an upper bound on the potential increase in that
+	 * counter.  We can split the extent at each end of the range, and for
+	 * each step of the swap we can split the extent that we're working on
+	 * if the extents do not align.
+	 */
+	d_nexts1 = d_nexts2 = 3;
+
+	while (blockcount > 0) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip1, startoff1, blockcount,
+				&irec1, &nimaps, bmap_flags);
+		if (error)
+			return error;
+		if (irec1.br_startblock == DELAYSTARTBLOCK ||
+		    irec1.br_startoff != startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * If the caller told us to ignore sparse areas of file1, jump
+		 * ahead to the next region.
+		 */
+		if ((req->flags & XFS_SWAPEXT_SKIP_FILE1_HOLES) &&
+		    irec1.br_startblock == HOLESTARTBLOCK) {
+			memcpy(&lrec1, &irec1, sizeof(struct xfs_bmbt_irec));
+			lrec1.br_blockcount = 0;
+			goto advance;
+		}
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip2, startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (irec2.br_startblock == DELAYSTARTBLOCK ||
+		    irec2.br_startoff != startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc extent
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		/*
+		 * Two extents mapped to the same physical block must not have
+		 * different states; that's filesystem corruption.  Move on to
+		 * the next extent if they're both holes or both the same
+		 * physical extent.
+		 */
+		if (irec1.br_startblock == irec2.br_startblock) {
+			if (irec1.br_state != irec2.br_state)
+				return -EFSCORRUPTED;
+			memcpy(&lrec1, &irec1, sizeof(struct xfs_bmbt_irec));
+			memcpy(&lrec2, &irec2, sizeof(struct xfs_bmbt_irec));
+			goto advance;
+		}
+
+		/* Update accounting. */
+		if (xfs_bmap_is_real_extent(&irec1))
+			ip1_blocks += irec1.br_blockcount;
+		if (xfs_bmap_is_real_extent(&irec2))
+			ip2_blocks += irec2.br_blockcount;
+		res->nr_exchanges++;
+
+		/* Read the next extent from both files */
+		error = get_next_ext(req->ip1, bmap_flags, &irec1, &rrec1);
+		if (error)
+			return error;
+		error = get_next_ext(req->ip2, bmap_flags, &irec2, &rrec2);
+		if (error)
+			return error;
+
+		d_nexts1 += delta_nextents_step(req->ip1->i_mount,
+				&lrec1, &irec1, &irec2, &rrec1);
+		d_nexts2 += delta_nextents_step(req->ip1->i_mount,
+				&lrec2, &irec2, &irec1, &rrec2);
+
+		/* Now pretend we swapped the extents. */
+		if (can_merge(&lrec2, &irec1))
+			lrec2.br_blockcount += irec1.br_blockcount;
+		else
+			memcpy(&lrec2, &irec1, sizeof(struct xfs_bmbt_irec));
+		if (can_merge(&lrec1, &irec2))
+			lrec1.br_blockcount += irec2.br_blockcount;
+		else
+			memcpy(&lrec1, &irec2, sizeof(struct xfs_bmbt_irec));
+
+advance:
+		/* Advance our cursor and move on. */
+		startoff1 += irec1.br_blockcount;
+		startoff2 += irec1.br_blockcount;
+		blockcount -= irec1.br_blockcount;
+	}
+
+	/* Account for the blocks that are being exchanged. */
+	if (XFS_IS_REALTIME_INODE(req->ip1) &&
+	    req->whichfork == XFS_DATA_FORK) {
+		res->ip1_rtbcount = ip1_blocks;
+		res->ip2_rtbcount = ip2_blocks;
+	} else {
+		res->ip1_bcount = ip1_blocks;
+		res->ip2_bcount = ip2_blocks;
+	}
+
+	/*
+	 * Make sure that both forks have enough slack left in their extent
+	 * counters that the swap operation will not overflow.
+	 */
+	trace_xfs_swapext_delta_nextents(req, d_nexts1, d_nexts2);
+	if (req->ip1 == req->ip2) {
+		error = check_delta_nextents(req, req->ip1,
+				d_nexts1 + d_nexts2);
+	} else {
+		error = check_delta_nextents(req, req->ip1, d_nexts1);
+		if (error)
+			return error;
+		error = check_delta_nextents(req, req->ip2, d_nexts2);
+	}
+	if (error)
+		return error;
+
+	return xfs_swapext_estimate_overhead(req, res);
+}
+
+static void
+xfs_swapext_init_intent(
+	struct xfs_swapext_intent	*sxi,
+	const struct xfs_swapext_req	*req)
+{
+	INIT_LIST_HEAD(&sxi->sxi_list);
+	sxi->sxi_flags = 0;
+	if (req->whichfork == XFS_ATTR_FORK)
+		sxi->sxi_flags |= XFS_SWAP_EXTENT_ATTR_FORK;
+	sxi->sxi_isize1 = sxi->sxi_isize2 = -1;
+	if (req->whichfork == XFS_DATA_FORK &&
+	    (req->flags & XFS_SWAPEXT_SET_SIZES)) {
+		sxi->sxi_flags |= XFS_SWAP_EXTENT_SET_SIZES;
+		sxi->sxi_isize1 = req->ip2->i_d.di_size;
+		sxi->sxi_isize2 = req->ip1->i_d.di_size;
+	}
+	if (req->flags & XFS_SWAPEXT_SKIP_FILE1_HOLES)
+		sxi->sxi_flags |= XFS_SWAP_EXTENT_SKIP_FILE1_HOLES;
+	sxi->sxi_ip1 = req->ip1;
+	sxi->sxi_ip2 = req->ip2;
+	sxi->sxi_startoff1 = req->startoff1;
+	sxi->sxi_startoff2 = req->startoff2;
+	sxi->sxi_blockcount = req->blockcount;
+}
+
+/*
+ * Swap a range of extents from one inode to another.  If the atomic swap
+ * feature is enabled, then the operation can be resumed even if the system
+ * goes down.
+ *
+ * The caller must ensure that both inodes are joined to the transaction and
+ * ILOCKed; they will still be joined to the transaction at exit.
+ */
+int
+xfs_swapext(
+	struct xfs_trans		**tpp,
+	const struct xfs_swapext_req	*req)
+{
+	struct xfs_swapext_intent	*sxi;
+	unsigned int			reflink_state;
+	int				error;
+
+	ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
+	ASSERT(req->whichfork != XFS_COW_FORK);
+	if (req->flags & XFS_SWAPEXT_SET_SIZES)
+		ASSERT(req->whichfork == XFS_DATA_FORK);
+
+	reflink_state = xfs_swapext_reflink_prep(req);
+
+	sxi = kmem_alloc(sizeof(struct xfs_swapext_intent), KM_NOFS);
+	xfs_swapext_init_intent(sxi, req);
+	xfs_swapext_schedule(*tpp, sxi);
+
+	error = xfs_defer_finish(tpp);
+	if (error)
+		return error;
+
+	xfs_swapext_reflink_finish(*tpp, req, reflink_state);
+	return 0;
+}
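
For readers following along, the per-extent step in xfs_swapext_finish_one()
can be modelled in userspace roughly as below.  This is only a sketch under
stated assumptions: struct mapping, struct swap_cursor and swap_step() are
hypothetical stand-ins for struct xfs_bmbt_irec, the cursor fields of
struct xfs_swapext_intent and the loop body, and none of the logging, quota
or rmap work is represented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct xfs_bmbt_irec. */
struct mapping {
	uint64_t	startoff;	/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical block number */
	uint64_t	blockcount;	/* length, in blocks */
};

/* Hypothetical stand-in for the cursor fields of the swapext intent. */
struct swap_cursor {
	uint64_t	startoff1;
	uint64_t	startoff2;
	uint64_t	blockcount;
};

/*
 * Model one exchange step.  Both mappings are trimmed to the shorter of the
 * two lengths, the logical offsets are swapped so that each physical extent
 * lands in the other file, and the cursor is advanced past the swapped range.
 * On return, *m2 describes what file 1 maps and *m1 what file 2 maps.
 */
void swap_step(struct swap_cursor *cur, struct mapping *m1, struct mapping *m2)
{
	uint64_t	len;
	uint64_t	tmp;

	len = m1->blockcount < m2->blockcount ? m1->blockcount : m2->blockcount;

	/* Assumption: the mappings were read within the remaining range. */
	assert(len <= cur->blockcount);

	m1->blockcount = len;
	m2->blockcount = len;

	tmp = m1->startoff;
	m1->startoff = m2->startoff;
	m2->startoff = tmp;

	cur->startoff1 += len;
	cur->startoff2 += len;
	cur->blockcount -= len;
}
```

A nonzero cur->blockcount after the call corresponds to the case where the
caller logs a new SXI item for the remainder and calls back in.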
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
new file mode 100644
index 000000000000..e63f4a5556c1
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -0,0 +1,84 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_SWAPEXT_H_
+#define __XFS_SWAPEXT_H_ 1
+
+/*
+ * In-core information about an extent swap request between ranges of two
+ * inodes.
+ */
+struct xfs_swapext_intent {
+	/* List of other incore deferred work. */
+	struct list_head	sxi_list;
+
+	/* The two inodes we're swapping. */
+	union {
+		struct xfs_inode *sxi_ip1;
+		xfs_ino_t	sxi_ino1;
+	};
+	union {
+		struct xfs_inode *sxi_ip2;
+		xfs_ino_t	sxi_ino2;
+	};
+
+	/* File offset range information. */
+	xfs_fileoff_t		sxi_startoff1;
+	xfs_fileoff_t		sxi_startoff2;
+	xfs_filblks_t		sxi_blockcount;
+	uint64_t		sxi_flags;
+
+	/* Set these file sizes after the operation, unless negative. */
+	xfs_fsize_t		sxi_isize1;
+	xfs_fsize_t		sxi_isize2;
+};
+
+/* Set the sizes of both files after the operation. */
+#define XFS_SWAPEXT_SET_SIZES		(1U << 0)
+
+/* Do not swap any part of the range where file1's mapping is a hole. */
+#define XFS_SWAPEXT_SKIP_FILE1_HOLES	(1U << 1)
+
+/* Parameters for a swapext request. */
+struct xfs_swapext_req {
+	struct xfs_inode	*ip1;
+	struct xfs_inode	*ip2;
+	xfs_fileoff_t		startoff1;
+	xfs_fileoff_t		startoff2;
+	xfs_filblks_t		blockcount;
+	int			whichfork;
+	unsigned int		flags;
+};
+
+/* Estimated resource requirements for a swapext operation. */
+struct xfs_swapext_res {
+	xfs_filblks_t		ip1_bcount;
+	xfs_filblks_t		ip2_bcount;
+	xfs_filblks_t		ip1_rtbcount;
+	xfs_filblks_t		ip2_rtbcount;
+	unsigned long long	resblks;
+	unsigned int		nr_exchanges;
+};
+
+bool xfs_swapext_has_more_work(struct xfs_swapext_intent *sxi);
+
+unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
+void xfs_swapext_reflink_finish(struct xfs_trans *tp,
+		const struct xfs_swapext_req *req, unsigned int reflink_state);
+
+int xfs_swapext_estimate(const struct xfs_swapext_req *req,
+		struct xfs_swapext_res *res);
+
+void xfs_swapext_reschedule(struct xfs_trans *tpp,
+		const struct xfs_swapext_intent *sxi_state);
+int xfs_swapext_finish_one(struct xfs_trans *tp,
+		struct xfs_swapext_intent *sxi_state);
+
+int xfs_swapext_check_extents(struct xfs_mount *mp,
+		const struct xfs_swapext_req *req);
+
+int xfs_swapext(struct xfs_trans **tpp, const struct xfs_swapext_req *req);
+
+#endif /* __XFS_SWAPEXT_H_ */
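
The extent count estimate behind xfs_swapext_estimate() depends on whether
two neighbouring mappings would merge into one record.  A minimal userspace
sketch of that test follows.  The names model_irec, model_can_merge,
MODEL_HOLE and MODEL_MAXEXTLEN are hypothetical; the length limit assumes
the 21-bit extent length field, and the real can_merge() also rejects
delalloc mappings through xfs_bmap_is_real_extent().

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: mirrors a 21-bit on-disk extent length limit. */
#define MODEL_MAXEXTLEN	((uint64_t)((1U << 21) - 1))

/* Hypothetical sentinel marking a hole in this model. */
#define MODEL_HOLE	UINT64_MAX

/* Hypothetical userspace stand-in for struct xfs_bmbt_irec. */
struct model_irec {
	uint64_t	startoff;	/* logical file offset, in blocks */
	uint64_t	startblock;	/* physical block, or MODEL_HOLE */
	uint64_t	blockcount;	/* length, in blocks */
	bool		unwritten;	/* extent state */
};

/*
 * Two mappings merge only if both are real, they are adjacent both logically
 * and physically, they share the same state, and the combined length still
 * fits in a single extent record.
 */
bool model_can_merge(const struct model_irec *a, const struct model_irec *b)
{
	/* A zero length means the mapping was never filled in. */
	if (a->blockcount == 0 || b->blockcount == 0)
		return false;

	/* Holes never merge. */
	if (a->startblock == MODEL_HOLE || b->startblock == MODEL_HOLE)
		return false;

	return a->startoff + a->blockcount == b->startoff &&
	       a->startblock + a->blockcount == b->startblock &&
	       a->unwritten == b->unwritten &&
	       a->blockcount + b->blockcount <= MODEL_MAXEXTLEN;
}
```

When the predicate is true for a swapped-in mapping and its neighbour, the
estimator treats the new mapping as absorbed rather than as a new record.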
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index bba73ddd0585..33725b761a22 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -24,7 +24,6 @@
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
-#include "xfs_quota.h"
 
 kmem_zone_t	*xfs_bui_zone;
 kmem_zone_t	*xfs_bud_zone;
@@ -494,18 +493,10 @@ xfs_bui_item_recover(
 			XFS_ATTR_FORK : XFS_DATA_FORK;
 	bui_type = bmap->me_flags & XFS_BMAP_EXTENT_TYPE_MASK;
 
-	/* Grab the inode. */
-	error = xfs_iget(mp, NULL, bmap->me_owner, 0, 0, &ip);
+	error = xlog_recover_iget(mp, bmap->me_owner, &ip);
 	if (error)
 		return error;
 
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		goto err_rele;
-
-	if (VFS_I(ip)->i_nlink == 0)
-		xfs_iflags_set(ip, XFS_IRECOVERY);
-
 	/* Allocate transaction and do the work. */
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 269a4cb34bba..87fde8c875a2 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -29,6 +29,7 @@
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
 #include "xfs_sb.h"
+#include "xfs_swapext.h"
 
 /* Kernel only BMAP related definitions and functions */
 
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 107bb222d79f..54705f6864db 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -25,6 +25,7 @@
 #include "xfs_icache.h"
 #include "xfs_error.h"
 #include "xfs_buf_item.h"
+#include "xfs_quota.h"
 
 #define BLK_AVG(blk1, blk2)	((blk1+blk2) >> 1)
 
@@ -1755,6 +1756,30 @@ xlog_recover_release_intent(
 	spin_unlock(&ailp->ail_lock);
 }
 
+int
+xlog_recover_iget(
+	struct xfs_mount	*mp,
+	xfs_ino_t		ino,
+	struct xfs_inode	**ipp)
+{
+	int			error;
+
+	error = xfs_iget(mp, NULL, ino, 0, 0, ipp);
+	if (error)
+		return error;
+
+	error = xfs_qm_dqattach(*ipp);
+	if (error) {
+		xfs_irele(*ipp);
+		return error;
+	}
+
+	if (VFS_I(*ipp)->i_nlink == 0)
+		xfs_iflags_set(*ipp, XFS_IRECOVERY);
+
+	return 0;
+}
+
 /******************************************************************************
  *
  *		Log recover routines
diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
index 83913e9fd4d4..4dba3879bec8 100644
--- a/fs/xfs/xfs_swapext_item.c
+++ b/fs/xfs/xfs_swapext_item.c
@@ -16,13 +16,16 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_swapext_item.h"
+#include "xfs_swapext.h"
 #include "xfs_log.h"
 #include "xfs_bmap.h"
 #include "xfs_icache.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_xchgrange.h"
 
 kmem_zone_t	*xfs_sxi_zone;
 kmem_zone_t	*xfs_sxd_zone;
@@ -195,13 +198,318 @@ static const struct xfs_item_ops xfs_sxd_item_ops = {
 	.iop_release	= xfs_sxd_item_release,
 };
 
+static struct xfs_sxd_log_item *
+xfs_trans_get_sxd(
+	struct xfs_trans		*tp,
+	struct xfs_sxi_log_item		*sxi_lip)
+{
+	struct xfs_sxd_log_item		*sxd_lip;
+
+	sxd_lip = kmem_cache_zalloc(xfs_sxd_zone, GFP_KERNEL | __GFP_NOFAIL);
+	xfs_log_item_init(tp->t_mountp, &sxd_lip->sxd_item, XFS_LI_SXD,
+			  &xfs_sxd_item_ops);
+	sxd_lip->sxd_intent_log_item = sxi_lip;
+	sxd_lip->sxd_format.sxd_sxi_id = sxi_lip->sxi_format.sxi_id;
+
+	xfs_trans_add_item(tp, &sxd_lip->sxd_item);
+	return sxd_lip;
+}
+
+/*
+ * Finish a swapext update and log it to the SXD. Note that the transaction is
+ * marked dirty regardless of whether the swapext update succeeds or fails to
+ * support the SXI/SXD lifecycle rules.
+ */
+static int
+xfs_swapext_finish_update(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct xfs_swapext_intent	*sxi)
+{
+	int				error;
+
+	error = xfs_swapext_finish_one(tp, sxi);
+
+	/*
+	 * Mark the transaction dirty, even on error. This ensures the
+	 * transaction is aborted, which:
+	 *
+	 * 1.) releases the SXI and frees the SXD
+	 * 2.) shuts down the filesystem
+	 */
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	if (done)
+		set_bit(XFS_LI_DIRTY, &done->li_flags);
+
+	return error;
+}
+
+/* Log swapext updates in the intent item. */
+STATIC void
+xfs_swapext_log_item(
+	struct xfs_trans		*tp,
+	struct xfs_sxi_log_item		*sxi_lip,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_swap_extent		*sx;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+
+	sx = &sxi_lip->sxi_format.sxi_extent;
+	sx->sx_inode1 = sxi->sxi_ip1->i_ino;
+	sx->sx_inode2 = sxi->sxi_ip2->i_ino;
+	sx->sx_startoff1 = sxi->sxi_startoff1;
+	sx->sx_startoff2 = sxi->sxi_startoff2;
+	sx->sx_blockcount = sxi->sxi_blockcount;
+	sx->sx_isize1 = sxi->sxi_isize1;
+	sx->sx_isize2 = sxi->sxi_isize2;
+	sx->sx_flags = sxi->sxi_flags;
+}
+
+STATIC struct xfs_log_item *
+xfs_swapext_create_intent(
+	struct xfs_trans		*tp,
+	struct list_head		*items,
+	unsigned int			count,
+	bool				sort)
+{
+	struct xfs_sxi_log_item		*sxi_lip = xfs_sxi_init(tp->t_mountp);
+	struct xfs_swapext_intent	*sxi;
+
+	ASSERT(count == XFS_SXI_MAX_FAST_EXTENTS);
+
+	/*
+	 * We use the same defer ops control machinery to perform extent swaps
+	 * even if we lack the machinery to track the operation status through
+	 * log items.
+	 */
+	if (!xfs_sb_version_hasatomicswap(&tp->t_mountp->m_sb))
+		return NULL;
+
+	xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+	list_for_each_entry(sxi, items, sxi_list)
+		xfs_swapext_log_item(tp, sxi_lip, sxi);
+	return &sxi_lip->sxi_item;
+}
+
+STATIC struct xfs_log_item *
+xfs_swapext_create_done(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	unsigned int			count)
+{
+	if (intent == NULL)
+		return NULL;
+	return &xfs_trans_get_sxd(tp, SXI_ITEM(intent))->sxd_item;
+}
+
+/* Process a deferred swapext update. */
+STATIC int
+xfs_swapext_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct list_head		*item,
+	struct xfs_btree_cur		**state)
+{
+	struct xfs_swapext_intent	*sxi;
+	int				error;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+
+	/*
+	 * Swap one more extent between the two files.  If there's still more
+	 * work to do, we want to requeue ourselves after all other pending
+	 * deferred operations have finished.  This includes all of the dfops
+	 * that we queued directly as well as any new ones created in the
+	 * process of finishing the others.  Doing so prevents us from queuing
+	 * a large number of SXI log items in kernel memory, which in turn
+	 * prevents us from pinning the tail of the log (while logging those
+	 * new SXI items) until the first SXI items can be processed.
+	 */
+	error = xfs_swapext_finish_update(tp, done, sxi);
+	if (!error && xfs_swapext_has_more_work(sxi))
+		return -EAGAIN;
+
+	kmem_free(sxi);
+	return error;
+}
+
+/* Abort all pending SXIs. */
+STATIC void
+xfs_swapext_abort_intent(
+	struct xfs_log_item		*intent)
+{
+	xfs_sxi_release(SXI_ITEM(intent));
+}
+
+/* Cancel a deferred swapext update. */
+STATIC void
+xfs_swapext_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_swapext_intent	*sxi;
+
+	sxi = container_of(item, struct xfs_swapext_intent, sxi_list);
+	kmem_free(sxi);
+}
+
+const struct xfs_defer_op_type xfs_swapext_defer_type = {
+	.max_items	= XFS_SXI_MAX_FAST_EXTENTS,
+	.create_intent	= xfs_swapext_create_intent,
+	.abort_intent	= xfs_swapext_abort_intent,
+	.create_done	= xfs_swapext_create_done,
+	.finish_item	= xfs_swapext_finish_item,
+	.cancel_item	= xfs_swapext_cancel_item,
+};
+
+static int
+xfs_sxi_item_recover_estimate(
+	struct xfs_swapext_intent	*sxi,
+	struct xfs_swapext_res		*res)
+{
+	struct xfs_swapext_req		req = {
+		.ip1			= sxi->sxi_ip1,
+		.ip2			= sxi->sxi_ip2,
+		.startoff1		= sxi->sxi_startoff1,
+		.startoff2		= sxi->sxi_startoff2,
+		.blockcount		= sxi->sxi_blockcount,
+		.whichfork		= XFS_DATA_FORK,
+	};
+
+	if (sxi->sxi_flags & XFS_SWAP_EXTENT_ATTR_FORK)
+		req.whichfork = XFS_ATTR_FORK;
+	if (sxi->sxi_flags & XFS_SWAP_EXTENT_SET_SIZES)
+		req.flags |= XFS_SWAPEXT_SET_SIZES;
+	if (sxi->sxi_flags & XFS_SWAP_EXTENT_SKIP_FILE1_HOLES)
+		req.flags |= XFS_SWAPEXT_SKIP_FILE1_HOLES;
+
+	return xfs_xchg_range_estimate(&req, res);
+}
+
+/* Is this recovered SXI ok? */
+static inline bool
+xfs_sxi_validate(
+	struct xfs_mount		*mp,
+	struct xfs_sxi_log_item		*sxi_lip)
+{
+	struct xfs_swap_extent		*sx = &sxi_lip->sxi_format.sxi_extent;
+
+	if (!xfs_sb_version_hasatomicswap(&mp->m_sb))
+		return false;
+
+	if (sxi_lip->sxi_format.__pad != 0)
+		return false;
+
+	if (sx->sx_flags & ~XFS_SWAP_EXTENT_FLAGS)
+		return false;
+
+	if (!xfs_verify_ino(mp, sx->sx_inode1) ||
+	    !xfs_verify_ino(mp, sx->sx_inode2))
+		return false;
+
+	if ((sx->sx_flags & XFS_SWAP_EXTENT_SET_SIZES) &&
+	     (sx->sx_isize1 < 0 || sx->sx_isize2 < 0))
+		return false;
+
+	if (!xfs_verify_fileext(mp, sx->sx_startoff1, sx->sx_blockcount))
+		return false;
+
+	return xfs_verify_fileext(mp, sx->sx_startoff2, sx->sx_blockcount);
+}
+
 /* Process a swapext update intent item that was recovered from the log. */
 STATIC int
 xfs_sxi_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
-	return -EFSCORRUPTED;
+	struct xfs_swapext_intent	sxi;
+	struct xfs_swapext_res		res;
+	struct xfs_sxi_log_item		*sxi_lip = SXI_ITEM(lip);
+	struct xfs_mount		*mp = lip->li_mountp;
+	struct xfs_swap_extent		*sx = &sxi_lip->sxi_format.sxi_extent;
+	struct xfs_sxd_log_item		*sxd_lip = NULL;
+	struct xfs_trans		*tp;
+	bool				more_work;
+	int				error = 0;
+
+	if (!xfs_sxi_validate(mp, sxi_lip)) {
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				&sxi_lip->sxi_format,
+				sizeof(sxi_lip->sxi_format));
+		return -EFSCORRUPTED;
+	}
+
+	memset(&sxi, 0, sizeof(sxi));
+	INIT_LIST_HEAD(&sxi.sxi_list);
+
+	/*
+	 * Grab both inodes and set IRECOVERY to prevent trimming of post-eof
+	 * extents and freeing of unlinked inodes until we're totally done
+	 * processing files.
+	 */
+	error = xlog_recover_iget(mp, sx->sx_inode1, &sxi.sxi_ip1);
+	if (error)
+		return error;
+	error = xlog_recover_iget(mp, sx->sx_inode2, &sxi.sxi_ip2);
+	if (error)
+		goto err_rele1;
+
+	/*
+	 * Construct the rest of our in-core swapext intent state so that we
+	 * can allocate all the resources we need to continue the swap work.
+	 */
+	sxi.sxi_flags = sx->sx_flags;
+	sxi.sxi_startoff1 = sx->sx_startoff1;
+	sxi.sxi_startoff2 = sx->sx_startoff2;
+	sxi.sxi_blockcount = sx->sx_blockcount;
+	sxi.sxi_isize1 = sx->sx_isize1;
+	sxi.sxi_isize2 = sx->sx_isize2;
+	error = xfs_sxi_item_recover_estimate(&sxi, &res);
+	if (error)
+		goto err_rele2;
+
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, res.resblks, 0,
+			0, &tp);
+	if (error)
+		goto err_rele2;
+
+	sxd_lip = xfs_trans_get_sxd(tp, sxi_lip);
+
+	xfs_xchg_range_ilock(tp, sxi.sxi_ip1, sxi.sxi_ip2);
+
+	error = xfs_swapext_finish_update(tp, &sxd_lip->sxd_item, &sxi);
+	if (error)
+		goto err_cancel;
+
+	/*
+	 * If there's more extent swapping to be done, we have to schedule that
+	 * as a separate deferred operation to be run after we've finished
+	 * replaying all of the intents we recovered from the log.
+	 */
+	more_work = xfs_swapext_has_more_work(&sxi);
+	if (more_work)
+		xfs_swapext_reschedule(tp, &sxi);
+
+	/*
+	 * Commit transaction, which frees the transaction and saves the inodes
+	 * for later replay activities.
+	 */
+	error = xfs_defer_ops_capture_and_commit(tp, sxi.sxi_ip1, sxi.sxi_ip2,
+			capture_list);
+	goto err_unlock;
+
+err_cancel:
+	xfs_trans_cancel(tp);
+err_unlock:
+	xfs_xchg_range_iunlock(sxi.sxi_ip1, sxi.sxi_ip2);
+err_rele2:
+	if (sxi.sxi_ip2 != sxi.sxi_ip1)
+		xfs_irele(sxi.sxi_ip2);
+err_rele1:
+	xfs_irele(sxi.sxi_ip1);
+	return error;
 }
 
 STATIC bool
@@ -218,8 +526,21 @@ xfs_sxi_item_relog(
 	struct xfs_log_item		*intent,
 	struct xfs_trans		*tp)
 {
-	ASSERT(0);
-	return NULL;
+	struct xfs_sxd_log_item		*sxd_lip;
+	struct xfs_sxi_log_item		*sxi_lip;
+	struct xfs_swap_extent		*sx;
+
+	sx = &SXI_ITEM(intent)->sxi_format.sxi_extent;
+
+	tp->t_flags |= XFS_TRANS_DIRTY;
+	sxd_lip = xfs_trans_get_sxd(tp, SXI_ITEM(intent));
+	set_bit(XFS_LI_DIRTY, &sxd_lip->sxd_item.li_flags);
+
+	sxi_lip = xfs_sxi_init(tp->t_mountp);
+	memcpy(&sxi_lip->sxi_format.sxi_extent, sx, sizeof(*sx));
+	xfs_trans_add_item(tp, &sxi_lip->sxi_item);
+	set_bit(XFS_LI_DIRTY, &sxi_lip->sxi_item.li_flags);
+	return &sxi_lip->sxi_item;
 }
 
 static const struct xfs_item_ops xfs_sxi_item_ops = {
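
The requeue behaviour of xfs_swapext_finish_item() (swap one extent, then
return -EAGAIN while work remains) can be illustrated with a small userspace
model.  This is a sketch only: model_intent, model_finish_item and
model_run_deferred are hypothetical names, every extent is assumed to have
the same length, and "more work" is assumed to reduce to a nonzero block
count, which may not match everything xfs_swapext_has_more_work() checks.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the remaining work tracked by an intent. */
struct model_intent {
	uint64_t	blockcount;	/* blocks still to exchange */
	uint64_t	extent_len;	/* assumed uniform extent length */
};

/* Swap one extent's worth of blocks; ask to be requeued if work remains. */
int model_finish_item(struct model_intent *it)
{
	uint64_t	len;

	/* A zero-length extent would never make progress. */
	assert(it->extent_len > 0);

	len = it->extent_len < it->blockcount ? it->extent_len : it->blockcount;
	it->blockcount -= len;
	return it->blockcount > 0 ? -EAGAIN : 0;
}

/*
 * Stand-in for the defer ops driver: keep calling the finish function until
 * it stops asking to be requeued, and report how many steps that took.
 */
unsigned int model_run_deferred(struct model_intent *it)
{
	unsigned int	steps = 0;
	int		error;

	do {
		error = model_finish_item(it);
		steps++;
	} while (error == -EAGAIN);

	return steps;
}
```

Each step in the model corresponds to one transaction roll in which the old
SXI is retired by an SXD and, if needed, a fresh SXI is logged for the rest.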
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9b8d703dc9fd..f8cceacfb51d 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -30,6 +30,7 @@
 #include "xfs_fsmap.h"
 #include "xfs_btree_staging.h"
 #include "xfs_icache.h"
+#include "xfs_swapext.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index e5cc6f2a4fa8..dc9cc3c67e58 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -38,6 +38,9 @@ struct xfs_inobt_rec_incore;
 union xfs_btree_ptr;
 struct xfs_dqtrx;
 struct xfs_eofblocks;
+struct xfs_swapext_intent;
+struct xfs_swapext_req;
+struct xfs_swapext_res;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -3316,6 +3319,9 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
+DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
+DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
 
 /* fsmap traces */
 DECLARE_EVENT_CLASS(xfs_fsmap_class,
@@ -3945,6 +3951,185 @@ DEFINE_EOFBLOCKS_EVENT(xfs_ioc_free_eofblocks);
 DEFINE_EOFBLOCKS_EVENT(xfs_blockgc_free_space);
 DEFINE_EOFBLOCKS_EVENT(xfs_inodegc_free_space);
 
+#define XFS_SWAPEXT_STRINGS \
+	{ XFS_SWAPEXT_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAPEXT_SKIP_FILE1_HOLES,		"SKIP_FILE1_HOLES" }
+
+TRACE_EVENT(xfs_swapext_estimate,
+	TP_PROTO(const struct xfs_swapext_req *req,
+		 const struct xfs_swapext_res *res),
+	TP_ARGS(req, res),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(int, whichfork)
+		__field(unsigned int, flags)
+		__field(xfs_filblks_t, ip1_bcount)
+		__field(xfs_filblks_t, ip2_bcount)
+		__field(xfs_filblks_t, ip1_rtbcount)
+		__field(xfs_filblks_t, ip2_rtbcount)
+		__field(unsigned long long, resblks)
+		__field(unsigned int, nr_exchanges)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->startoff1 = req->startoff1;
+		__entry->startoff2 = req->startoff2;
+		__entry->blockcount = req->blockcount;
+		__entry->whichfork = req->whichfork;
+		__entry->flags = req->flags;
+		__entry->ip1_bcount = res->ip1_bcount;
+		__entry->ip2_bcount = res->ip2_bcount;
+		__entry->ip1_rtbcount = res->ip1_rtbcount;
+		__entry->ip2_rtbcount = res->ip2_rtbcount;
+		__entry->resblks = res->resblks;
+		__entry->nr_exchanges = res->nr_exchanges;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx startoff1 %llu ino2 0x%llx startoff2 %llu blockcount %llu flags (%s) %sfork bcount1 %llu rtbcount1 %llu bcount2 %llu rtbcount2 %llu resblks %llu nr_exchanges %u",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags(__entry->flags, "|", XFS_SWAPEXT_STRINGS),
+		  __entry->whichfork == XFS_ATTR_FORK ? "attr" : "data",
+		  __entry->ip1_bcount,
+		  __entry->ip1_rtbcount,
+		  __entry->ip2_bcount,
+		  __entry->ip2_rtbcount,
+		  __entry->resblks,
+		  __entry->nr_exchanges)
+);
+
+#define XFS_SWAP_EXTENT_STRINGS \
+	{ XFS_SWAP_EXTENT_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_SWAP_EXTENT_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_SWAP_EXTENT_SKIP_FILE1_HOLES,	"SKIP_FILE1_HOLES" }
+
+TRACE_EVENT(xfs_swapext_defer,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi),
+	TP_ARGS(mp, sxi),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(uint64_t, flags)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_fsize_t, isize1)
+		__field(xfs_fsize_t, isize2)
+		__field(xfs_fsize_t, new_isize1)
+		__field(xfs_fsize_t, new_isize2)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino1 = sxi->sxi_ip1->i_ino;
+		__entry->ino2 = sxi->sxi_ip2->i_ino;
+		__entry->flags = sxi->sxi_flags;
+		__entry->startoff1 = sxi->sxi_startoff1;
+		__entry->startoff2 = sxi->sxi_startoff2;
+		__entry->blockcount = sxi->sxi_blockcount;
+		__entry->isize1 = sxi->sxi_ip1->i_d.di_size;
+		__entry->isize2 = sxi->sxi_ip2->i_d.di_size;
+		__entry->new_isize1 = sxi->sxi_isize1;
+		__entry->new_isize2 = sxi->sxi_isize2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx startoff1 %llu ino2 0x%llx startoff2 %llu blockcount %llu flags (%s) isize1 %lld newisize1 %lld isize2 %lld newisize2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags(__entry->flags, "|", XFS_SWAP_EXTENT_STRINGS),
+		  __entry->isize1, __entry->new_isize1,
+		  __entry->isize2, __entry->new_isize2)
+);
+
+TRACE_EVENT(xfs_swapext_delta_nextents_step,
+	TP_PROTO(struct xfs_mount *mp,
+		 const struct xfs_bmbt_irec *left,
+		 const struct xfs_bmbt_irec *curr,
+		 const struct xfs_bmbt_irec *new,
+		 const struct xfs_bmbt_irec *right,
+		 int delta, unsigned int state),
+	TP_ARGS(mp, left, curr, new, right, delta, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_fileoff_t, loff)
+		__field(xfs_fsblock_t, lstart)
+		__field(xfs_filblks_t, lcount)
+		__field(xfs_fileoff_t, coff)
+		__field(xfs_fsblock_t, cstart)
+		__field(xfs_filblks_t, ccount)
+		__field(xfs_fileoff_t, noff)
+		__field(xfs_fsblock_t, nstart)
+		__field(xfs_filblks_t, ncount)
+		__field(xfs_fileoff_t, roff)
+		__field(xfs_fsblock_t, rstart)
+		__field(xfs_filblks_t, rcount)
+		__field(int, delta)
+		__field(unsigned int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->loff = left->br_startoff;
+		__entry->lstart = left->br_startblock;
+		__entry->lcount = left->br_blockcount;
+		__entry->coff = curr->br_startoff;
+		__entry->cstart = curr->br_startblock;
+		__entry->ccount = curr->br_blockcount;
+		__entry->noff = new->br_startoff;
+		__entry->nstart = new->br_startblock;
+		__entry->ncount = new->br_blockcount;
+		__entry->roff = right->br_startoff;
+		__entry->rstart = right->br_startblock;
+		__entry->rcount = right->br_blockcount;
+		__entry->delta = delta;
+		__entry->state = state;
+	),
+	TP_printk("dev %d:%d left %llu:%lld:%llu; curr %llu:%lld:%llu <- new %llu:%lld:%llu; right %llu:%lld:%llu delta %d state x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		__entry->loff, __entry->lstart, __entry->lcount,
+		__entry->coff, __entry->cstart, __entry->ccount,
+		__entry->noff, __entry->nstart, __entry->ncount,
+		__entry->roff, __entry->rstart, __entry->rcount,
+		__entry->delta, __entry->state)
+);
+
+TRACE_EVENT(xfs_swapext_delta_nextents,
+	TP_PROTO(const struct xfs_swapext_req *req, int64_t d_nexts1,
+		 int64_t d_nexts2),
+	TP_ARGS(req, d_nexts1, d_nexts2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_extnum_t, nexts1)
+		__field(xfs_extnum_t, nexts2)
+		__field(int64_t, d_nexts1)
+		__field(int64_t, d_nexts2)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->nexts1 = XFS_IFORK_PTR(req->ip1, req->whichfork)->if_nextents;
+		__entry->nexts2 = XFS_IFORK_PTR(req->ip2, req->whichfork)->if_nextents;
+		__entry->d_nexts1 = d_nexts1;
+		__entry->d_nexts2 = d_nexts2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx nexts %u ino2 0x%llx nexts %u delta1 %lld delta2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->nexts1,
+		  __entry->ino2, __entry->nexts2,
+		  __entry->d_nexts1, __entry->d_nexts2)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
new file mode 100644
index 000000000000..5e7098d5838e
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.c
@@ -0,0 +1,66 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_swapext.h"
+#include "xfs_xchgrange.h"
+
+/* Lock (and optionally join) two inodes for a file range exchange. */
+void
+xfs_xchg_range_ilock(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip1 != ip2)
+		xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
+				    ip2, XFS_ILOCK_EXCL);
+	else
+		xfs_ilock(ip1, XFS_ILOCK_EXCL);
+	if (tp) {
+		xfs_trans_ijoin(tp, ip1, 0);
+		if (ip2 != ip1)
+			xfs_trans_ijoin(tp, ip2, 0);
+	}
+
+}
+
+/* Unlock two inodes after a file range exchange operation. */
+void
+xfs_xchg_range_iunlock(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip2 != ip1)
+		xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Estimate the resource requirements to exchange file contents between the two
+ * files.  The caller is required to hold the IOLOCK and the MMAPLOCK and to
+ * have flushed both inodes' pagecache and active direct-ios.
+ */
+int
+xfs_xchg_range_estimate(
+	const struct xfs_swapext_req	*req,
+	struct xfs_swapext_res		*res)
+{
+	int				error;
+
+	xfs_xchg_range_ilock(NULL, req->ip1, req->ip2);
+	error = xfs_swapext_estimate(req, res);
+	xfs_xchg_range_iunlock(req->ip1, req->ip2);
+	return error;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
new file mode 100644
index 000000000000..ddda2bfb6f4b
--- /dev/null
+++ b/fs/xfs/xfs_xchgrange.h
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (C) 2021 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_XCHGRANGE_H__
+#define __XFS_XCHGRANGE_H__
+
+struct xfs_swapext_req;
+struct xfs_swapext_res;
+
+void xfs_xchg_range_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2);
+void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
+int xfs_xchg_range_estimate(const struct xfs_swapext_req *req,
+		struct xfs_swapext_res *res);
+
+#endif /* __XFS_XCHGRANGE_H__ */



* [PATCH 08/18] xfs: add a ->xchg_file_range handler
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (6 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 07/18] xfs: create deferred log items for extent swapping Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 09/18] xfs: add error injection to test swapext recovery Darrick J. Wong
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add a function to handle file range exchange requests from the vfs.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
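A minimal userspace sketch of the allocation-unit arithmetic that
xfs_xchg_range_prep() and xfs_xchg_range() depend on in the diff below.
The sketch_* helpers are hypothetical stand-ins for
xfs_inode_alloc_unitsize(), is_power_of_2() and round_up(); block size
and rt extent size are passed in as plain integers here instead of being
read from the mount, so treat this as an illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's is_power_of_2(). */
static bool sketch_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Hypothetical stand-in for round_up().  The mask trick is only valid
 * when @unit is a power of two, which is why non-power-of-2 allocation
 * units have to be rejected before rounding.
 */
static uint64_t sketch_round_up(uint64_t len, uint64_t unit)
{
	assert(sketch_is_power_of_2(unit));
	return (len + unit - 1) & ~(unit - 1);
}

/*
 * Allocation unit in bytes: one fs block for regular files, or one rt
 * extent (rextsize_blocks fs blocks) for realtime files.
 */
static uint64_t sketch_alloc_unitsize(uint32_t blocksize,
				      uint32_t rextsize_blocks,
				      bool realtime)
{
	return (uint64_t)blocksize * (realtime ? rextsize_blocks : 1);
}
```

With 4096-byte blocks and a 3-block rt extent the unit is 12288 bytes,
which is not a power of two; that is the case the prep function turns
away with -EOPNOTSUPP unless FULL_FILES is set.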
 fs/xfs/xfs_file.c      |   49 ++++++
 fs/xfs/xfs_inode.c     |   13 ++
 fs/xfs/xfs_inode.h     |    1 
 fs/xfs/xfs_trace.h     |    4 +
 fs/xfs/xfs_xchgrange.c |  379 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_xchgrange.h |   11 +
 6 files changed, 457 insertions(+)
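A rough model of the first (net gain) step of
xfs_xchg_range_reserve_quota() in the diff below, assuming the block
counts behave as plain unsigned integers.  struct sketch_res and
struct sketch_qneed are hypothetical; only the ddelta/rdelta arithmetic
is taken from the patch.  The second, forced reservation of the gross
counts is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of the four block counts in xfs_swapext_res. */
struct sketch_res {
	uint64_t ip1_bcount;
	uint64_t ip2_bcount;
	uint64_t ip1_rtbcount;
	uint64_t ip2_rtbcount;
};

/* Hypothetical result: blocks to reserve against each inode's dquots. */
struct sketch_qneed {
	uint64_t ip1_blks;
	uint64_t ip1_rtblks;
	uint64_t ip2_blks;
	uint64_t ip2_rtblks;
};

static uint64_t sketch_gain(int64_t delta)
{
	return delta > 0 ? (uint64_t)delta : 0;
}

/*
 * A positive delta is a net gain for file1, a negative delta is a net
 * gain for file2.  Only the gaining side needs a quota reservation.
 */
static struct sketch_qneed sketch_net_quota(const struct sketch_res *res)
{
	struct sketch_qneed need;
	int64_t ddelta, rdelta;

	/* Keep the signed subtraction below well defined. */
	assert(res->ip1_bcount <= INT64_MAX && res->ip2_bcount <= INT64_MAX);
	assert(res->ip1_rtbcount <= INT64_MAX &&
	       res->ip2_rtbcount <= INT64_MAX);

	ddelta = (int64_t)res->ip2_bcount - (int64_t)res->ip1_bcount;
	rdelta = (int64_t)res->ip2_rtbcount - (int64_t)res->ip1_rtbcount;

	need.ip1_blks = sketch_gain(ddelta);
	need.ip1_rtblks = sketch_gain(rdelta);
	need.ip2_blks = sketch_gain(-ddelta);
	need.ip2_rtblks = sketch_gain(-rdelta);
	return need;
}
```

Note that the data and realtime deltas can have opposite signs, so both
files may need a reservation in the same call.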


diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index a007ca0711d9..84a29d01c896 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -24,6 +24,7 @@
 #include "xfs_pnfs.h"
 #include "xfs_iomap.h"
 #include "xfs_reflink.h"
+#include "xfs_xchgrange.h"
 
 #include <linux/falloc.h>
 #include <linux/backing-dev.h>
@@ -1178,6 +1179,53 @@ xfs_file_remap_range(
 	return remapped > 0 ? remapped : ret;
 }
 
+STATIC int
+xfs_file_xchg_range(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_xchg_range	*fxr)
+{
+	struct inode		*inode1 = file_inode(file1);
+	struct inode		*inode2 = file_inode(file2);
+	struct xfs_inode	*ip1 = XFS_I(inode1);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+	struct xfs_mount	*mp = ip1->i_mount;
+	unsigned int		priv_flags = 0;
+	int			ret;
+
+	if (XFS_FORCED_SHUTDOWN(mp))
+		return -EIO;
+
+	/* Update cmtime if the fd/inode don't forbid it. */
+	if (likely(!(file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1)))
+		priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME1;
+	if (likely(!(file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2)))
+		priv_flags |= XFS_XCHG_RANGE_UPD_CMTIME2;
+
+	/* Lock both files against IO */
+	ret = xfs_ilock2_io_mmap(ip1, ip2);
+	if (ret)
+		return ret;
+
+	/* Prepare and then exchange file contents. */
+	ret = xfs_xchg_range_prep(file1, file2, fxr);
+	if (ret)
+		goto out_unlock;
+
+	trace_xfs_file_xchg_range(ip1, fxr->file1_offset, fxr->length, ip2,
+			fxr->file2_offset);
+
+	ret = xfs_xchg_range(ip1, ip2, fxr, priv_flags);
+	if (ret)
+		goto out_unlock;
+
+out_unlock:
+	xfs_iunlock2_io_mmap(ip1, ip2);
+	if (ret)
+		trace_xfs_file_xchg_range_error(ip2, ret, _RET_IP_);
+	return ret;
+}
+
 STATIC int
 xfs_file_open(
 	struct inode	*inode,
@@ -1443,6 +1491,7 @@ const struct file_operations xfs_file_operations = {
 	.fallocate	= xfs_file_fallocate,
 	.fadvise	= xfs_file_fadvise,
 	.remap_file_range = xfs_file_remap_range,
+	.xchg_file_range = xfs_file_xchg_range,
 };
 
 const struct file_operations xfs_dir_file_operations = {
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 85287f764f4a..59706de3a9d0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -3838,3 +3838,16 @@ xfs_inode_count_blocks(
 	xfs_bmap_count_leaves(ifp, rblocks);
 	*dblocks = ip->i_d.di_nblocks - *rblocks;
 }
+
+/* Returns the size of the fundamental allocation unit for a file, in bytes. */
+unsigned int
+xfs_inode_alloc_unitsize(
+	struct xfs_inode	*ip)
+{
+	unsigned int		blocks = 1;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		blocks = ip->i_mount->m_sb.sb_rextsize;
+
+	return XFS_FSB_TO_B(ip->i_mount, blocks);
+}
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 1eebd5d03d01..81c7c695fb92 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -500,6 +500,7 @@ void xfs_end_io(struct work_struct *work);
 
 int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
 void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2);
+unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip);
 
 void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip,
 		xfs_filblks_t *dblocks, xfs_filblks_t *rblocks);
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index dc9cc3c67e58..f4e739e81594 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3319,6 +3319,10 @@ DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
 DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
 DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
+
+/* swapext tracepoints */
+DEFINE_DOUBLE_IO_EVENT(xfs_file_xchg_range);
+DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 5e7098d5838e..877ef9f3eb64 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -13,8 +13,15 @@
 #include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_util.h"
+#include "xfs_reflink.h"
+#include "xfs_trace.h"
 #include "xfs_swapext.h"
 #include "xfs_xchgrange.h"
+#include "xfs_sb.h"
+#include "xfs_icache.h"
+#include "xfs_log.h"
 
 /* Lock (and optionally join) two inodes for a file range exchange. */
 void
@@ -64,3 +71,375 @@ xfs_xchg_range_estimate(
 	xfs_xchg_range_iunlock(req->ip1, req->ip2);
 	return error;
 }
+
+/* Prepare two files to have their data exchanged. */
+int
+xfs_xchg_range_prep(
+	struct file		*file1,
+	struct file		*file2,
+	struct file_xchg_range	*fxr)
+{
+	struct xfs_inode	*ip1 = XFS_I(file_inode(file1));
+	struct xfs_inode	*ip2 = XFS_I(file_inode(file2));
+	int			ret;
+
+	/* Verify both files are either realtime or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+		return -EINVAL;
+
+	/*
+	 * The alignment checks in the VFS helpers cannot deal with allocation
+	 * units that are not powers of 2.  This can happen with the realtime
+	 * volume if the extent size is set.  Note that alignment checks are
+	 * skipped if FULL_FILES is set.
+	 */
+	if (!(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+	    !is_power_of_2(xfs_inode_alloc_unitsize(ip2)))
+		return -EOPNOTSUPP;
+
+	ret = generic_xchg_file_range_prep(file1, file2, fxr,
+			xfs_inode_alloc_unitsize(ip2));
+	if (ret)
+		return ret;
+
+	/* Attach dquots to both inodes before changing block maps. */
+	ret = xfs_qm_dqattach(ip2);
+	if (ret)
+		return ret;
+	ret = xfs_qm_dqattach(ip1);
+	if (ret)
+		return ret;
+
+	/* Flush the relevant ranges of both files. */
+	ret = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
+	if (ret)
+		return ret;
+	return xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
+}
+
+#define QRETRY_IP1	(0x1)
+#define QRETRY_IP2	(0x2)
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same.  The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xfs_xchg_range_reserve_quota(
+	struct xfs_trans		*tp,
+	const struct xfs_swapext_req	*req,
+	const struct xfs_swapext_res	*res,
+	unsigned int			*qretry)
+{
+	int64_t				ddelta, rdelta;
+	int				ip1_error = 0;
+	int				error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+	    (req->ip1->i_udquot == req->ip2->i_udquot &&
+	     req->ip1->i_gdquot == req->ip2->i_gdquot &&
+	     req->ip1->i_pdquot == req->ip2->i_pdquot))
+		return 0;
+
+	*qretry = 0;
+
+	/*
+	 * For each file, compute the net gain in the number of regular blocks
+	 * that will be mapped into that file and reserve that much quota.  The
+	 * quota counts must be able to absorb at least that much space.
+	 */
+	ddelta = res->ip2_bcount - res->ip1_bcount;
+	rdelta = res->ip2_rtbcount - res->ip1_rtbcount;
+	if (ddelta > 0 || rdelta > 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+				ddelta > 0 ? ddelta : 0,
+				rdelta > 0 ? rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC) {
+			/*
+			 * Save this error and see what happens if we try to
+			 * reserve quota for ip2.  Then report both.
+			 */
+			*qretry |= QRETRY_IP1;
+			ip1_error = error;
+			error = 0;
+		}
+		if (error)
+			return error;
+	}
+	if (ddelta < 0 || rdelta < 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
+				ddelta < 0 ? -ddelta : 0,
+				rdelta < 0 ? -rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC)
+			*qretry |= QRETRY_IP2;
+		if (error)
+			return error;
+	}
+	if (ip1_error)
+		return ip1_error;
+
+	/*
+	 * For each file, forcibly reserve the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.
+	 * We must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	error = xfs_trans_reserve_quota_nblks(tp, req->ip1, res->ip1_bcount,
+			res->ip1_rtbcount, true);
+	if (error)
+		return error;
+
+	return xfs_trans_reserve_quota_nblks(tp, req->ip2, res->ip2_bcount,
+			res->ip2_rtbcount, true);
+}
+
+/*
+ * Get permission to use log-assisted atomic exchange of file extents.
+ *
+ * Callers must not be running any transactions, and they must release the
+ * permission either (1) by calling xlog_drop_incompat_feat when they're done,
+ * or (2) by setting XFS_TRANS_LOG_INCOMPAT on a transaction.
+ */
+STATIC int
+xfs_swapext_enable_log_assist(
+	struct xfs_mount	*mp,
+	bool			force,
+	bool			*enabled)
+{
+	int			error = 0;
+
+	/*
+	 * Protect ourselves from an idle log clearing the atomic swapext
+	 * log incompat feature bit.
+	 */
+	xlog_use_incompat_feat(mp->m_log);
+	*enabled = true;
+
+	/* Already enabled?  We're good to go. */
+	if (xfs_sb_version_hasatomicswap(&mp->m_sb))
+		return 0;
+
+	/*
+	 * If the caller doesn't /require/ log-assisted swapping, drop the
+	 * feature protection and exit.  They'll just have to use something
+	 * else.
+	 */
+	if (!force)
+		goto err;
+
+	/*
+	 * Caller requires log-assisted swapping but the fs feature set isn't
+	 * rich enough.  We have to bail out here.
+	 */
+	if (!xfs_sb_version_canatomicswap(&mp->m_sb)) {
+		error = -EOPNOTSUPP;
+		goto err;
+	}
+
+	/* Enable log-assisted extent swapping. */
+	xfs_warn(mp,
+ "EXPERIMENTAL atomic file range swap feature added. Use at your own risk!");
+	error = xfs_add_incompat_log_feature(mp,
+			XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP);
+	if (error)
+		goto err;
+	return 0;
+err:
+	xlog_drop_incompat_feat(mp->m_log);
+	*enabled = false;
+	return error;
+}
+
+/* Exchange the contents of two files. */
+int
+xfs_xchg_range(
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	const struct file_xchg_range	*fxr,
+	unsigned int			private_flags)
+{
+	struct xfs_swapext_req		req = {
+		.ip1			= ip1,
+		.ip2			= ip2,
+		.whichfork		= XFS_DATA_FORK,
+	};
+	struct xfs_swapext_res		res;
+	struct xfs_mount		*mp = ip1->i_mount;
+	struct xfs_trans		*tp;
+	loff_t				req_len;
+	unsigned int			qretry;
+	bool				retried = false;
+	bool				use_atomic = false;
+	int				error;
+
+	/* We don't support whole-fork swapping yet. */
+	if (!xfs_sb_version_canatomicswap(&mp->m_sb))
+		return -EOPNOTSUPP;
+
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
+		req.flags |= XFS_SWAPEXT_SET_SIZES;
+	if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
+		req.flags |= XFS_SWAPEXT_SKIP_FILE1_HOLES;
+
+	req.startoff1 = XFS_B_TO_FSBT(mp, fxr->file1_offset);
+	req.startoff2 = XFS_B_TO_FSBT(mp, fxr->file2_offset);
+
+	/*
+	 * Round the request length up to the nearest fundamental unit of
+	 * allocation.  The prep function already checked that the request
+	 * offsets and length in @fxr are safe to round up.
+	 */
+	req_len = round_up(fxr->length, xfs_inode_alloc_unitsize(ip2));
+	req.blockcount = XFS_B_TO_FSB(mp, req_len);
+
+	/*
+	 * Cancel CoW fork preallocations for the ranges of both files.  The
+	 * prep function should have flushed all the dirty data, so the only
+	 * extents remaining should be speculative.
+	 */
+	if (xfs_inode_has_cow_data(ip1)) {
+		error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	if (xfs_inode_has_cow_data(ip2)) {
+		error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	error = xfs_xchg_range_estimate(&req, &res);
+	if (error)
+		return error;
+
+	error = xfs_swapext_enable_log_assist(mp,
+			!(fxr->flags & FILE_XCHG_RANGE_NONATOMIC),
+			&use_atomic);
+	if (error)
+		return error;
+
+retry:
+	/* Allocate the transaction, lock the inodes, and join them. */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, res.resblks, 0,
+			XFS_TRANS_RES_FDBLKS, &tp);
+	if (error)
+		goto out_unlock_feat;
+
+	xfs_xchg_range_ilock(tp, ip1, ip2);
+
+	trace_xfs_swap_extent_before(ip2, 0);
+	trace_xfs_swap_extent_before(ip1, 1);
+
+	/*
+	 * Do all of the inputs checking that we can only do once we've taken
+	 * both ILOCKs.
+	 */
+	error = generic_xchg_file_range_check_fresh(VFS_I(ip1), VFS_I(ip2),
+			fxr);
+	if (error)
+		goto out_trans_cancel;
+
+	error = xfs_swapext_check_extents(mp, &req);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * Reserve ourselves some quota if any of them are in enforcing mode.
+	 * In theory we only need enough to satisfy the change in the number
+	 * of blocks between the two ranges being remapped.
+	 */
+	error = xfs_xchg_range_reserve_quota(tp, &req, &res, &qretry);
+	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
+		xfs_trans_cancel(tp);
+		xfs_xchg_range_iunlock(ip1, ip2);
+		if (qretry & QRETRY_IP1)
+			xfs_blockgc_free_quota(ip1, 0);
+		if (qretry & QRETRY_IP2)
+			xfs_blockgc_free_quota(ip2, 0);
+		retried = true;
+		goto retry;
+	}
+	if (error)
+		goto out_trans_cancel;
+
+	/* If we got this far on a dry run, all parameters are ok. */
+	if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
+		goto out_trans_cancel;
+
+	/*
+	 * If we got permission to use the atomic extent swap feature, put the
+	 * transaction in charge of releasing that permission.
+	 */
+	if (use_atomic) {
+		tp->t_flags |= XFS_TRANS_LOG_INCOMPAT;
+		use_atomic = false;
+	}
+
+	/* Update the mtime and ctime of both files. */
+	if (private_flags & XFS_XCHG_RANGE_UPD_CMTIME1)
+		xfs_trans_ichgtime(tp, ip1,
+				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+	if (private_flags & XFS_XCHG_RANGE_UPD_CMTIME2)
+		xfs_trans_ichgtime(tp, ip2,
+				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+	/* Exchange the file contents by swapping the block mappings. */
+	error = xfs_swapext(&tp, &req);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * If the caller wanted us to exchange the contents of two complete
+	 * files of unequal length, exchange the incore sizes now.  This should
+	 * be safe because we flushed both files' page caches and moved all the
+	 * post-eof extents, so there should not be anything to zero.
+	 */
+	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(ip2));
+		i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+		i_size_write(VFS_I(ip1), temp);
+	}
+
+	/* Relog the inodes to keep transactions moving forward. */
+	xfs_trans_log_inode(tp, ip1, XFS_ILOG_CORE);
+	xfs_trans_log_inode(tp, ip2, XFS_ILOG_CORE);
+
+	/*
+	 * Force the log to persist metadata updates if the caller or the
+	 * administrator requires this.  The VFS prep function already flushed
+	 * the relevant parts of the page cache.
+	 */
+	if ((mp->m_flags & XFS_MOUNT_WSYNC) ||
+	    (fxr->flags & FILE_XCHG_RANGE_FSYNC))
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+
+	trace_xfs_swap_extent_after(ip2, 0);
+	trace_xfs_swap_extent_after(ip1, 1);
+
+out_unlock:
+	xfs_xchg_range_iunlock(ip1, ip2);
+out_unlock_feat:
+	if (use_atomic)
+		xlog_drop_incompat_feat(mp->m_log);
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
diff --git a/fs/xfs/xfs_xchgrange.h b/fs/xfs/xfs_xchgrange.h
index ddda2bfb6f4b..cca297034689 100644
--- a/fs/xfs/xfs_xchgrange.h
+++ b/fs/xfs/xfs_xchgrange.h
@@ -15,5 +15,16 @@ void xfs_xchg_range_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
 
 int xfs_xchg_range_estimate(const struct xfs_swapext_req *req,
 		struct xfs_swapext_res *res);
+int xfs_xchg_range_prep(struct file *file1, struct file *file2,
+		struct file_xchg_range *fxr);
+
+/* Update ip1's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME1	(1 << 0)
+
+/* Update ip2's change and mod time. */
+#define XFS_XCHG_RANGE_UPD_CMTIME2	(1 << 1)
+
+int xfs_xchg_range(struct xfs_inode *ip1, struct xfs_inode *ip2,
+		const struct file_xchg_range *fxr, unsigned int private_flags);
 
 #endif /* __XFS_XCHGRANGE_H__ */
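
The decision flow in xfs_swapext_enable_log_assist() above can be
summarised with a small userspace model.  struct sketch_fs and its
fields are hypothetical: has_atomicswap stands in for
xfs_sb_version_hasatomicswap(), can_atomicswap for
xfs_sb_version_canatomicswap(), and the users counter for the
xlog_use_incompat_feat()/xlog_drop_incompat_feat() pair.  Setting the
feature bit is assumed to succeed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the state the real function consults. */
struct sketch_fs {
	bool has_atomicswap;	/* log incompat bit already set */
	bool can_atomicswap;	/* feature set allows setting it */
	int users;		/* holders of the incompat protection */
};

static int sketch_enable_log_assist(struct sketch_fs *fs, bool force,
				    bool *enabled)
{
	int error = 0;

	/* Take the protection first so an idle log cannot clear the bit. */
	fs->users++;
	*enabled = true;

	/* Already enabled: keep the protection and return. */
	if (fs->has_atomicswap)
		return 0;

	/* Caller can live without log assistance: back out quietly. */
	if (!force)
		goto drop;

	/* Caller insists, but the fs cannot support the feature. */
	if (!fs->can_atomicswap) {
		error = -EOPNOTSUPP;
		goto drop;
	}

	/* Models a successful xfs_add_incompat_log_feature() call. */
	fs->has_atomicswap = true;
	return 0;
drop:
	assert(fs->users > 0);
	fs->users--;
	*enabled = false;
	return error;
}
```

The point of the model is the pairing: every return with *enabled set
leaves one protection held, which the caller must later hand to the
transaction or drop explicitly.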



* [PATCH 09/18] xfs: add error injection to test swapext recovery
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (7 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 08/18] xfs: add a ->xchg_file_range handler Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 10/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add an errortag so that we can test recovery of swapext log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
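A small model of how an error tag and its random factor relate, as a
reading aid for the diff below.  It assumes the usual errortag
behaviour: a per-mount table is indexed by tag number, a stored factor
of zero disables the tag, and a factor of N fires about once in N
calls.  All sketch_* names and the tag numbering are hypothetical, and
the random draw is replaced by an explicit @roll argument so the result
is deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, shortened tag space; the real list is much longer. */
enum {
	SKETCH_ERRTAG_NOERROR = 0,
	SKETCH_ERRTAG_OTHER = 1,
	SKETCH_ERRTAG_SWAPEXT_FINISH_ONE = 2,
	SKETCH_ERRTAG_MAX = 3,
};

/* Default factor for the new tag: 1 means "fail every time". */
#define SKETCH_RANDOM_SWAPEXT_FINISH_ONE	1

/* The table is indexed by tag; the stored value is the random factor. */
struct sketch_errtags {
	unsigned int randfactor[SKETCH_ERRTAG_MAX];
};

/* Return true if an error should be injected for @tag on this call. */
static bool sketch_test_error(const struct sketch_errtags *e,
			      unsigned int tag, unsigned int roll)
{
	unsigned int factor;

	assert(tag < SKETCH_ERRTAG_MAX);
	factor = e->randfactor[tag];
	if (factor == 0)
		return false;		/* tag not armed */
	return roll % factor == 0;
}
```

Because the default factor for the swapext tag is 1, arming it makes
every pass through xfs_swapext_finish_one() return -EIO, which is what
a recovery test wants.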
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/libxfs/xfs_swapext.c  |    5 +++++
 fs/xfs/xfs_error.c           |    3 +++
 3 files changed, 11 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 6ca9084b6934..52a69bf29570 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -58,7 +58,8 @@
 #define XFS_ERRTAG_BUF_IOERROR				35
 #define XFS_ERRTAG_REDUCE_MAX_IEXTENTS			36
 #define XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT		37
-#define XFS_ERRTAG_MAX					38
+#define XFS_ERRTAG_SWAPEXT_FINISH_ONE			38
+#define XFS_ERRTAG_MAX					39
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -101,5 +102,6 @@
 #define XFS_RANDOM_BUF_IOERROR				XFS_RANDOM_DEFAULT
 #define XFS_RANDOM_REDUCE_MAX_IEXTENTS			1
 #define XFS_RANDOM_BMAP_ALLOC_MINLEN_EXTENT		1
+#define XFS_RANDOM_SWAPEXT_FINISH_ONE			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 9fb67cbd018f..082680635146 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -20,6 +20,8 @@
 #include "xfs_trace.h"
 #include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
+#include "xfs_errortag.h"
+#include "xfs_error.h"
 
 /* Information to help us reset reflink flag / CoW fork state after a swap. */
 
@@ -407,6 +409,9 @@ xfs_swapext_finish_one(
 		xfs_trans_log_inode(tp, sxi->sxi_ip2, XFS_ILOG_CORE);
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_SWAPEXT_FINISH_ONE))
+		return -EIO;
+
 	if (xfs_swapext_has_more_work(sxi))
 		trace_xfs_swapext_defer(tp->t_mountp, sxi);
 
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 185b4915b7bf..9b6c38d0671b 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -56,6 +56,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_BUF_IOERROR,
 	XFS_RANDOM_REDUCE_MAX_IEXTENTS,
 	XFS_RANDOM_BMAP_ALLOC_MINLEN_EXTENT,
+	XFS_RANDOM_SWAPEXT_FINISH_ONE,
 };
 
 struct xfs_errortag_attr {
@@ -168,6 +169,7 @@ XFS_ERRORTAG_ATTR_RW(iunlink_fallback,	XFS_ERRTAG_IUNLINK_FALLBACK);
 XFS_ERRORTAG_ATTR_RW(buf_ioerror,	XFS_ERRTAG_BUF_IOERROR);
 XFS_ERRORTAG_ATTR_RW(reduce_max_iextents,	XFS_ERRTAG_REDUCE_MAX_IEXTENTS);
 XFS_ERRORTAG_ATTR_RW(bmap_alloc_minlen_extent,	XFS_ERRTAG_BMAP_ALLOC_MINLEN_EXTENT);
+XFS_ERRORTAG_ATTR_RW(swapext_finish_one, XFS_ERRTAG_SWAPEXT_FINISH_ONE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -208,6 +210,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(buf_ioerror),
 	XFS_ERRORTAG_ATTR_LIST(reduce_max_iextents),
 	XFS_ERRORTAG_ATTR_LIST(bmap_alloc_minlen_extent),
+	XFS_ERRORTAG_ATTR_LIST(swapext_finish_one),
 	NULL,
 };
 



* [PATCH 10/18] xfs: port xfs_swap_extents_rmap to our new code
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (8 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 09/18] xfs: add error injection to test swapext recovery Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 11/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The inner loop of xfs_swap_extent_rmap does the same work as
xfs_swapext_finish_one, so adapt it to use that.  Doing so has the side
benefit that the older code path no longer wastes its time remapping
shared extents.

This forms the basis of the non-atomic swaprange implementation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  143 ++++--------------------------------------------
 fs/xfs/xfs_trace.h     |    5 --
 2 files changed, 14 insertions(+), 134 deletions(-)
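
For readers following along, the piece-by-piece walk that the removed
loop performs can be sketched in userspace as below.  This is only an
illustration under stated assumptions: struct mapping is a simplified
stand-in for struct xfs_bmbt_irec, HOLE stands in for HOLESTARTBLOCK,
and the shared-extent test is a guess at how "no longer remapping
shared extents" might be decided; xfs_swapext_finish_one itself is not
shown in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct xfs_bmbt_irec; not the kernel type. */
struct mapping {
	uint64_t startoff;	/* file offset, in blocks */
	uint64_t startblock;	/* physical block, or HOLE */
	uint64_t blockcount;	/* length, in blocks */
};

/* Hypothetical marker for an unmapped range. */
#define HOLE UINT64_MAX

/*
 * Both mappings must start at the same file offset.  The piece that can
 * be exchanged in one step is limited by the shorter of the two.
 */
uint64_t piece_len(const struct mapping *a, const struct mapping *b)
{
	assert(a->startoff == b->startoff);
	return a->blockcount < b->blockcount ? a->blockcount : b->blockcount;
}

/*
 * Assumption: if both files already point at the same physical blocks
 * (a shared extent) or both have a hole, exchanging them changes nothing.
 */
int piece_needs_swap(const struct mapping *a, const struct mapping *b)
{
	return a->startblock != b->startblock;
}

/* Step past the piece that was just handled; holes have no start block. */
void advance(struct mapping *m, uint64_t len)
{
	m->startoff += len;
	if (m->startblock != HOLE)
		m->startblock += len;
	m->blockcount -= len;
}
```

A caller would loop: compute piece_len(), exchange the piece if
piece_needs_swap() says so, then advance() both mappings until the
range is exhausted.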


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 87fde8c875a2..2881583bb957 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1358,132 +1358,6 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
-/*
- * Move extents from one file to another, when rmap is enabled.
- */
-STATIC int
-xfs_swap_extent_rmap(
-	struct xfs_trans		**tpp,
-	struct xfs_inode		*ip,
-	struct xfs_inode		*tip)
-{
-	struct xfs_trans		*tp = *tpp;
-	struct xfs_bmbt_irec		irec;
-	struct xfs_bmbt_irec		uirec;
-	struct xfs_bmbt_irec		tirec;
-	xfs_fileoff_t			offset_fsb;
-	xfs_fileoff_t			end_fsb;
-	xfs_filblks_t			count_fsb;
-	int				error;
-	xfs_filblks_t			ilen;
-	xfs_filblks_t			rlen;
-	int				nimaps;
-	uint64_t			tip_flags2;
-
-	/*
-	 * If the source file has shared blocks, we must flag the donor
-	 * file as having shared blocks so that we get the shared-block
-	 * rmap functions when we go to fix up the rmaps.  The flags
-	 * will be switch for reals later.
-	 */
-	tip_flags2 = tip->i_d.di_flags2;
-	if (ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)
-		tip->i_d.di_flags2 |= XFS_DIFLAG2_REFLINK;
-
-	offset_fsb = 0;
-	end_fsb = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
-	count_fsb = (xfs_filblks_t)(end_fsb - offset_fsb);
-
-	while (count_fsb) {
-		/* Read extent from the donor file */
-		nimaps = 1;
-		error = xfs_bmapi_read(tip, offset_fsb, count_fsb, &tirec,
-				&nimaps, 0);
-		if (error)
-			goto out;
-		ASSERT(nimaps == 1);
-		ASSERT(tirec.br_startblock != DELAYSTARTBLOCK);
-
-		trace_xfs_swap_extent_rmap_remap(tip, &tirec);
-		ilen = tirec.br_blockcount;
-
-		/* Unmap the old blocks in the source file. */
-		while (tirec.br_blockcount) {
-			ASSERT(tp->t_firstblock == NULLFSBLOCK);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &tirec);
-
-			/* Read extent from the source file */
-			nimaps = 1;
-			error = xfs_bmapi_read(ip, tirec.br_startoff,
-					tirec.br_blockcount, &irec,
-					&nimaps, 0);
-			if (error)
-				goto out;
-			ASSERT(nimaps == 1);
-			ASSERT(tirec.br_startoff == irec.br_startoff);
-			trace_xfs_swap_extent_rmap_remap_piece(ip, &irec);
-
-			/* Trim the extent. */
-			uirec = tirec;
-			uirec.br_blockcount = rlen = min_t(xfs_filblks_t,
-					tirec.br_blockcount,
-					irec.br_blockcount);
-			trace_xfs_swap_extent_rmap_remap_piece(tip, &uirec);
-
-			if (xfs_bmap_is_real_extent(&uirec)) {
-				error = xfs_iext_count_may_overflow(ip,
-						XFS_DATA_FORK,
-						XFS_IEXT_SWAP_RMAP_CNT);
-				if (error)
-					goto out;
-			}
-
-			if (xfs_bmap_is_real_extent(&irec)) {
-				error = xfs_iext_count_may_overflow(tip,
-						XFS_DATA_FORK,
-						XFS_IEXT_SWAP_RMAP_CNT);
-				if (error)
-					goto out;
-			}
-
-			/* Remove the mapping from the donor file. */
-			xfs_bmap_unmap_extent(tp, tip, XFS_DATA_FORK, &uirec);
-
-			/* Remove the mapping from the source file. */
-			xfs_bmap_unmap_extent(tp, ip, XFS_DATA_FORK, &irec);
-
-			/* Map the donor file's blocks into the source file. */
-			xfs_bmap_map_extent(tp, ip, XFS_DATA_FORK, &uirec);
-
-			/* Map the source file's blocks into the donor file. */
-			xfs_bmap_map_extent(tp, tip, XFS_DATA_FORK, &irec);
-
-			error = xfs_defer_finish(tpp);
-			tp = *tpp;
-			if (error)
-				goto out;
-
-			tirec.br_startoff += rlen;
-			if (tirec.br_startblock != HOLESTARTBLOCK &&
-			    tirec.br_startblock != DELAYSTARTBLOCK)
-				tirec.br_startblock += rlen;
-			tirec.br_blockcount -= rlen;
-		}
-
-		/* Roll on... */
-		count_fsb -= ilen;
-		offset_fsb += ilen;
-	}
-
-	tip->i_d.di_flags2 = tip_flags2;
-	return 0;
-
-out:
-	trace_xfs_swap_extent_rmap_error(ip, error, _RET_IP_);
-	tip->i_d.di_flags2 = tip_flags2;
-	return error;
-}
-
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
@@ -1769,13 +1643,22 @@ xfs_swap_extents(
 	src_log_flags = XFS_ILOG_CORE;
 	target_log_flags = XFS_ILOG_CORE;
 
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
-		error = xfs_swap_extent_rmap(&tp, ip, tip);
-	else
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		struct xfs_swapext_req	req = {
+			.ip1		= ip,
+			.ip2		= tip,
+			.whichfork	= XFS_DATA_FORK,
+			.blockcount	= XFS_B_TO_FSB(ip->i_mount,
+						       i_size_read(VFS_I(ip))),
+		};
+		error = xfs_swapext(&tp, &req);
+	} else
 		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
 				&target_log_flags);
-	if (error)
+	if (error) {
+		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
+	}
 
 	/* Do we have to swap reflink flags? */
 	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f4e739e81594..f2db023986a4 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3315,14 +3315,11 @@ DEFINE_INODE_ERROR_EVENT(xfs_reflink_end_cow_error);
 
 DEFINE_INODE_IREC_EVENT(xfs_reflink_cancel_cow);
 
-/* rmap swapext tracepoints */
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap);
-DEFINE_INODE_IREC_EVENT(xfs_swap_extent_rmap_remap_piece);
-DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_rmap_error);
 
 /* swapext tracepoints */
 DEFINE_DOUBLE_IO_EVENT(xfs_file_xchg_range);
 DEFINE_INODE_ERROR_EVENT(xfs_file_xchg_range_error);
+DEFINE_INODE_ERROR_EVENT(xfs_swap_extent_error);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent1);
 DEFINE_INODE_IREC_EVENT(xfs_swapext_extent2);
 DEFINE_ITRUNC_EVENT(xfs_swapext_update_inode_size);



* [PATCH 11/18] xfs: consolidate all of the xfs_swap_extent_forks code
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (9 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 10/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 12/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Consolidate the bmbt owner change scan code into xfs_swap_extent_forks,
since it's not needed for the deferred bmap log item swapext
implementation.

The goal is to package up all three implementations into functions that
have the same preconditions and leave the system in the same state.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  220 ++++++++++++++++++++++++------------------------
 1 file changed, 108 insertions(+), 112 deletions(-)
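
The control flow of xfs_swap_change_owner (retry the scan whenever it
returns -EAGAIN, rolling the transaction in between) can be modelled
outside the kernel roughly as follows.  This is a sketch, not the
kernel code: retry_until_done, step_fn and the demo_* helpers are
made-up names, and the callbacks only stand in for
xfs_bmbt_change_owner and xfs_trans_roll.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical callback type: returns 0 or a negative errno. */
typedef int (*step_fn)(void *ctx);

/*
 * Call scan() until it returns anything other than -EAGAIN.  After each
 * -EAGAIN, call roll() to replenish resources; a roll failure is fatal.
 */
int retry_until_done(step_fn scan, step_fn roll, void *ctx)
{
	int error;

	for (;;) {
		error = scan(ctx);
		if (error != -EAGAIN)
			break;		/* success or fatal error */

		error = roll(ctx);
		if (error)
			break;
	}
	return error;
}

/* Demo state: how many more times the scan should ask for a retry. */
struct demo_ctx {
	int eagain_left;
	int rolls;
};

int demo_scan(void *p)
{
	struct demo_ctx *c = p;

	if (c->eagain_left > 0) {
		c->eagain_left--;
		return -EAGAIN;
	}
	return 0;
}

int demo_roll(void *p)
{
	struct demo_ctx *c = p;

	c->rolls++;
	return 0;
}

/* A roll that always fails, to show the error path. */
int demo_roll_fail(void *p)
{
	(void)p;
	return -EIO;
}
```

In the real function the roll step is also where both inodes are
rejoined to the new transaction and relogged; the sketch leaves that
out.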


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 2881583bb957..bff8725082e8 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1358,19 +1358,61 @@ xfs_swap_extent_flush(
 	return 0;
 }
 
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+static int
+xfs_swap_change_owner(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tmpip)
+{
+	int			error;
+	struct xfs_trans	*tp = *tpp;
+
+	do {
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+					      NULL);
+		/* success or fatal error */
+		if (error != -EAGAIN)
+			break;
+
+		error = xfs_trans_roll(tpp);
+		if (error)
+			break;
+		tp = *tpp;
+
+		/*
+		 * Redirty both inodes so they can relog and keep the log tail
+		 * moving forward.
+		 */
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_ijoin(tp, tmpip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+	} while (true);
+
+	return error;
+}
+
 /* Swap the extents of two files by swapping data forks. */
 STATIC int
 xfs_swap_extent_forks(
-	struct xfs_trans	*tp,
+	struct xfs_trans	**tpp,
 	struct xfs_inode	*ip,
-	struct xfs_inode	*tip,
-	int			*src_log_flags,
-	int			*target_log_flags)
+	struct xfs_inode	*tip)
 {
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
 	uint64_t		tmp;
+	int			src_log_flags = XFS_ILOG_CORE;
+	int			target_log_flags = XFS_ILOG_CORE;
 	int			error;
 
 	/*
@@ -1378,14 +1420,14 @@ xfs_swap_extent_forks(
 	 */
 	if (XFS_IFORK_Q(ip) && ip->i_afp->if_nextents > 0 &&
 	    ip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(tp, ip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
 				&aforkblks);
 		if (error)
 			return error;
 	}
 	if (XFS_IFORK_Q(tip) && tip->i_afp->if_nextents > 0 &&
 	    tip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(tp, tip, XFS_ATTR_FORK, &junk,
+		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
 				&taforkblks);
 		if (error)
 			return error;
@@ -1400,9 +1442,9 @@ xfs_swap_extent_forks(
 	 */
 	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
 		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			(*target_log_flags) |= XFS_ILOG_DOWNER;
+			target_log_flags |= XFS_ILOG_DOWNER;
 		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			(*src_log_flags) |= XFS_ILOG_DOWNER;
+			src_log_flags |= XFS_ILOG_DOWNER;
 	}
 
 	/*
@@ -1432,71 +1474,80 @@ xfs_swap_extent_forks(
 
 	switch (ip->i_df.if_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*src_log_flags) |= XFS_ILOG_DEXT;
+		src_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
 		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (*src_log_flags & XFS_ILOG_DOWNER));
-		(*src_log_flags) |= XFS_ILOG_DBROOT;
+		       (src_log_flags & XFS_ILOG_DOWNER));
+		src_log_flags |= XFS_ILOG_DBROOT;
 		break;
 	}
 
 	switch (tip->i_df.if_format) {
 	case XFS_DINODE_FMT_EXTENTS:
-		(*target_log_flags) |= XFS_ILOG_DEXT;
+		target_log_flags |= XFS_ILOG_DEXT;
 		break;
 	case XFS_DINODE_FMT_BTREE:
-		(*target_log_flags) |= XFS_ILOG_DBROOT;
+		target_log_flags |= XFS_ILOG_DBROOT;
 		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (*target_log_flags & XFS_ILOG_DOWNER));
+		       (target_log_flags & XFS_ILOG_DOWNER));
 		break;
 	}
 
+	/* Do we have to swap reflink flags? */
+	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
+	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
+		uint64_t	f;
+
+		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
+		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
+	}
+
+	/* Swap the cow forks. */
+	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb)) {
+		ASSERT(!ip->i_cowfp ||
+		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+		ASSERT(!tip->i_cowfp ||
+		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
+
+		swap(ip->i_cowfp, tip->i_cowfp);
+
+		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(ip);
+		else
+			xfs_inode_clear_cowblocks_tag(ip);
+		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
+			xfs_inode_set_cowblocks_tag(tip);
+		else
+			xfs_inode_clear_cowblocks_tag(tip);
+	}
+
+	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
+	xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+	/*
+	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+	 * have inode number owner values in the bmbt blocks that still refer to
+	 * the old inode. Scan each bmbt to fix up the owner values with the
+	 * inode number of the current inode.
+	 */
+	if (src_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, ip, tip);
+		if (error)
+			return error;
+	}
+	if (target_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, tip, ip);
+		if (error)
+			return error;
+	}
+
 	return 0;
 }
 
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
-	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tmpip)
-{
-	int			error;
-	struct xfs_trans	*tp = *tpp;
-
-	do {
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
-					      NULL);
-		/* success or fatal error */
-		if (error != -EAGAIN)
-			break;
-
-		error = xfs_trans_roll(tpp);
-		if (error)
-			break;
-		tp = *tpp;
-
-		/*
-		 * Redirty both inodes so they can relog and keep the log tail
-		 * moving forward.
-		 */
-		xfs_trans_ijoin(tp, ip, 0);
-		xfs_trans_ijoin(tp, tmpip, 0);
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
-	} while (true);
-
-	return error;
-}
-
 int
 xfs_swap_extents(
 	struct xfs_inode	*ip,	/* target inode */
@@ -1506,10 +1557,8 @@ xfs_swap_extents(
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			src_log_flags, target_log_flags;
 	int			error = 0;
 	int			lock_flags;
-	uint64_t		f;
 	int			resblks = 0;
 	unsigned int		flags = 0;
 
@@ -1640,9 +1689,6 @@ xfs_swap_extents(
 	 * recovery is going to see the fork as owned by the swapped inode,
 	 * not the pre-swapped inodes.
 	 */
-	src_log_flags = XFS_ILOG_CORE;
-	target_log_flags = XFS_ILOG_CORE;
-
 	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
 		struct xfs_swapext_req	req = {
 			.ip1		= ip,
@@ -1653,62 +1699,12 @@ xfs_swap_extents(
 		};
 		error = xfs_swapext(&tp, &req);
 	} else
-		error = xfs_swap_extent_forks(tp, ip, tip, &src_log_flags,
-				&target_log_flags);
+		error = xfs_swap_extent_forks(&tp, ip, tip);
 	if (error) {
 		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;
 	}
 
-	/* Do we have to swap reflink flags? */
-	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
-		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_sb_version_hasreflink(&mp->m_sb)) {
-		ASSERT(!ip->i_cowfp ||
-		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(!tip->i_cowfp ||
-		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
-
-	xfs_trans_log_inode(tp, ip,  src_log_flags);
-	xfs_trans_log_inode(tp, tip, target_log_flags);
-
-	/*
-	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
-	 * have inode number owner values in the bmbt blocks that still refer to
-	 * the old inode. Scan each bmbt to fix up the owner values with the
-	 * inode number of the current inode.
-	 */
-	if (src_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, ip, tip);
-		if (error)
-			goto out_trans_cancel;
-	}
-	if (target_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(&tp, tip, ip);
-		if (error)
-			goto out_trans_cancel;
-	}
-
 	/*
 	 * If this is a synchronous mount, make sure that the
 	 * transaction goes to disk before returning to the user.



* [PATCH 12/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (10 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 11/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:09 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Refactor the old data fork swap function to use the new reflink flag
helpers to propagate reflink flags between the two files.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |   58 +++++++++++++-----------------------------------
 1 file changed, 16 insertions(+), 42 deletions(-)
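
As a reminder of what the open-coded block being removed here does,
the exchange of a single flag bit between two flag words can be
written as a small standalone function.  This mirrors the removed
lines only; it makes no claim about how xfs_swapext_reflink_prep and
xfs_swapext_reflink_finish are implemented, and DEMO_REFLINK is a
hypothetical bit value, not XFS_DIFLAG2_REFLINK.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bit used only for this illustration. */
#define DEMO_REFLINK	(1ULL << 1)

/*
 * Exchange the bits selected by mask between *a and *b, leaving all
 * other bits untouched.  Nothing changes if both words agree on them.
 */
void demo_swap_flag(uint64_t *a, uint64_t *b, uint64_t mask)
{
	uint64_t f;

	if (!((*a & mask) ^ (*b & mask)))
		return;

	f = *a & mask;		/* remember a's bits */
	*a &= ~mask;
	*a |= *b & mask;	/* a takes b's bits */
	*b &= ~mask;
	*b |= f;		/* b takes a's old bits */
}
```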


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index bff8725082e8..7cd6a6d5fb00 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1404,17 +1404,21 @@ xfs_swap_change_owner(
 STATIC int
 xfs_swap_extent_forks(
 	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tip)
+	struct xfs_swapext_req	*req)
 {
+	struct xfs_inode	*ip = req->ip1;
+	struct xfs_inode	*tip = req->ip2;
 	xfs_filblks_t		aforkblks = 0;
 	xfs_filblks_t		taforkblks = 0;
 	xfs_extnum_t		junk;
 	uint64_t		tmp;
+	unsigned int		reflink_state;
 	int			src_log_flags = XFS_ILOG_CORE;
 	int			target_log_flags = XFS_ILOG_CORE;
 	int			error;
 
+	reflink_state = xfs_swapext_reflink_prep(req);
+
 	/*
 	 * Count the number of extended attribute blocks
 	 */
@@ -1494,36 +1498,7 @@ xfs_swap_extent_forks(
 		break;
 	}
 
-	/* Do we have to swap reflink flags? */
-	if ((ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK) ^
-	    (tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK)) {
-		uint64_t	f;
-
-		f = ip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		ip->i_d.di_flags2 |= tip->i_d.di_flags2 & XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 &= ~XFS_DIFLAG2_REFLINK;
-		tip->i_d.di_flags2 |= f & XFS_DIFLAG2_REFLINK;
-	}
-
-	/* Swap the cow forks. */
-	if (xfs_sb_version_hasreflink(&ip->i_mount->m_sb)) {
-		ASSERT(!ip->i_cowfp ||
-		       ip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-		ASSERT(!tip->i_cowfp ||
-		       tip->i_cowfp->if_format == XFS_DINODE_FMT_EXTENTS);
-
-		swap(ip->i_cowfp, tip->i_cowfp);
-
-		if (ip->i_cowfp && ip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(ip);
-		else
-			xfs_inode_clear_cowblocks_tag(ip);
-		if (tip->i_cowfp && tip->i_cowfp->if_bytes)
-			xfs_inode_set_cowblocks_tag(tip);
-		else
-			xfs_inode_clear_cowblocks_tag(tip);
-	}
+	xfs_swapext_reflink_finish(*tpp, req, reflink_state);
 
 	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
 	xfs_trans_log_inode(*tpp, tip, target_log_flags);
@@ -1554,6 +1529,11 @@ xfs_swap_extents(
 	struct xfs_inode	*tip,	/* tmp inode */
 	struct xfs_swapext	*sxp)
 {
+	struct xfs_swapext_req	req = {
+		.ip1		= ip,
+		.ip2		= tip,
+		.whichfork	= XFS_DATA_FORK,
+	};
 	struct xfs_mount	*mp = ip->i_mount;
 	struct xfs_trans	*tp;
 	struct xfs_bstat	*sbp = &sxp->sx_stat;
@@ -1689,17 +1669,11 @@ xfs_swap_extents(
 	 * recovery is going to see the fork as owned by the swapped inode,
 	 * not the pre-swapped inodes.
 	 */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-		struct xfs_swapext_req	req = {
-			.ip1		= ip,
-			.ip2		= tip,
-			.whichfork	= XFS_DATA_FORK,
-			.blockcount	= XFS_B_TO_FSB(ip->i_mount,
-						       i_size_read(VFS_I(ip))),
-		};
+	req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
+	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
 		error = xfs_swapext(&tp, &req);
-	} else
-		error = xfs_swap_extent_forks(&tp, ip, tip);
+	else
+		error = xfs_swap_extent_forks(&tp, &req);
 	if (error) {
 		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
 		goto out_trans_cancel;



* [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (11 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 12/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
@ 2021-04-01  1:09 ` Darrick J. Wong
  2021-04-01  1:10 ` [PATCH 14/18] xfs: remove old swap extents implementation Darrick J. Wong
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:09 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

If userspace permits non-atomic swap operations, use the older code
paths to implement the same functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |    4 +--
 fs/xfs/xfs_bmap_util.h |    4 +++
 fs/xfs/xfs_xchgrange.c |   66 ++++++++++++++++++++++++++++++++++++++++++++----
 3 files changed, 66 insertions(+), 8 deletions(-)
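
The strategy selection added to xfs_xchg_range below can be read as a
pure decision function.  The following userspace sketch restates it
under assumptions: the DEMO_* flag values and the demo_fs/demo_req
structures are invented for illustration, and any check that rejects
an atomic request on a filesystem without log assistance is assumed to
have happened before this point.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag values; the real FILE_XCHG_RANGE_* bits may differ. */
#define DEMO_NONATOMIC	(1u << 0)
#define DEMO_FULL_FILES	(1u << 1)
#define DEMO_TO_EOF	(1u << 2)

enum demo_strategy {
	DEMO_SWAPEXT	= 1,	/* deferred bmap intent items */
	DEMO_FORKSWAP	= 2,	/* exchange whole data forks */
};

struct demo_fs {
	bool use_atomic;	/* log-assisted atomic swap was granted */
	bool has_reflink;
	bool has_rmapbt;
};

struct demo_req {
	unsigned int flags;
	uint64_t off1, off2, length;
	uint64_t size1, size2;	/* current sizes of the two files */
};

/* Return a strategy, or -EOPNOTSUPP if neither code path applies. */
int demo_pick_strategy(const struct demo_fs *fs, const struct demo_req *r)
{
	if (fs->use_atomic || fs->has_reflink || fs->has_rmapbt)
		return DEMO_SWAPEXT;

	/* Fork swapping only handles a non-atomic swap of two whole files. */
	if ((r->flags & DEMO_NONATOMIC) &&
	    (r->flags & DEMO_FULL_FILES) &&
	    !(r->flags & DEMO_TO_EOF) &&
	    r->off1 == 0 && r->off2 == 0 &&
	    r->length == r->size1 && r->length == r->size2)
		return DEMO_FORKSWAP;

	return -EOPNOTSUPP;
}
```

The ordering matters: the fork swap is only a fallback for filesystems
that have neither rmap nor reflink, which is the case a defrag tool
doing a full-file swap is expected to hit.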


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 7cd6a6d5fb00..94f1d0d685fe 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1259,7 +1259,7 @@ xfs_insert_file_space(
  * reject and log the attempt. basically we are putting the responsibility on
  * userspace to get this right.
  */
-static int
+int
 xfs_swap_extents_check_format(
 	struct xfs_inode	*ip,	/* target inode */
 	struct xfs_inode	*tip)	/* tmp inode */
@@ -1401,7 +1401,7 @@ xfs_swap_change_owner(
 }
 
 /* Swap the extents of two files by swapping data forks. */
-STATIC int
+int
 xfs_swap_extent_forks(
 	struct xfs_trans	**tpp,
 	struct xfs_swapext_req	*req)
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index 9f993168b55b..de3173e64f47 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -69,6 +69,10 @@ int	xfs_free_eofblocks(struct xfs_inode *ip);
 int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
 			 struct xfs_swapext *sx);
 
+struct xfs_swapext_req;
+int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
+int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
+
 xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
 
 xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index 877ef9f3eb64..ef74965198c6 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -259,6 +259,26 @@ xfs_swapext_enable_log_assist(
 	return error;
 }
 
+/* Decide if we can use the old data fork exchange code. */
+static inline bool
+xfs_xchg_use_forkswap(
+	const struct file_xchg_range	*fxr,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2)
+{
+	return	(fxr->flags & FILE_XCHG_RANGE_NONATOMIC) &&
+		(fxr->flags & FILE_XCHG_RANGE_FULL_FILES) &&
+		!(fxr->flags & FILE_XCHG_RANGE_TO_EOF) &&
+		fxr->file1_offset == 0 && fxr->file2_offset == 0 &&
+		fxr->length == ip1->i_d.di_size &&
+		fxr->length == ip2->i_d.di_size;
+}
+
+enum xchg_strategy {
+	SWAPEXT		= 1,	/* xfs_swapext() */
+	FORKSWAP	= 2,	/* exchange forks */
+};
+
 /* Exchange the contents of two files. */
 int
 xfs_xchg_range(
@@ -279,12 +299,9 @@ xfs_xchg_range(
 	unsigned int			qretry;
 	bool				retried = false;
 	bool				use_atomic = false;
+	enum xchg_strategy		strategy;
 	int				error;
 
-	/* We don't support whole-fork swapping yet. */
-	if (!xfs_sb_version_canatomicswap(&mp->m_sb))
-		return -EOPNOTSUPP;
-
 	if (fxr->flags & FILE_XCHG_RANGE_TO_EOF)
 		req.flags |= XFS_SWAPEXT_SET_SIZES;
 	if (fxr->flags & FILE_XCHG_RANGE_SKIP_FILE1_HOLES)
@@ -374,6 +391,41 @@ xfs_xchg_range(
 	if (error)
 		goto out_trans_cancel;
 
+	if (use_atomic || xfs_sb_version_hasreflink(&mp->m_sb) ||
+	    xfs_sb_version_hasrmapbt(&mp->m_sb)) {
+		/*
+		 * xfs_swapext() uses deferred bmap log intent items to swap
+		 * extents between file forks.  If the atomic log swap feature
+		 * is enabled, it will also use swapext log intent items to
+		 * restart the operation in case of failure.
+		 *
+		 * This means that we can use it if we previously obtained
+		 * permission from the log to use log-assisted atomic extent
+		 * swapping; or if the fs supports rmap or reflink and the
+		 * user said NONATOMIC.
+		 */
+		strategy = SWAPEXT;
+	} else if (xfs_xchg_use_forkswap(fxr, ip1, ip2)) {
+		/*
+		 * Exchange the file contents by using the old bmap fork
+		 * exchange code, if we're a defrag tool doing a full file
+		 * swap.
+		 */
+		strategy = FORKSWAP;
+
+		error = xfs_swap_extents_check_format(ip2, ip1);
+		if (error) {
+			xfs_notice(mp,
+		"%s: inode 0x%llx format is incompatible for exchanging.",
+					__func__, ip2->i_ino);
+			goto out_trans_cancel;
+		}
+	} else {
+		/* We cannot exchange the file contents. */
+		error = -EOPNOTSUPP;
+		goto out_trans_cancel;
+	}
+
 	/* If we got this far on a dry run, all parameters are ok. */
 	if (fxr->flags & FILE_XCHG_RANGE_DRY_RUN)
 		goto out_trans_cancel;
@@ -395,8 +447,10 @@ xfs_xchg_range(
 		xfs_trans_ichgtime(tp, ip2,
 				XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
 
-	/* Exchange the file contents by swapping the block mappings. */
-	error = xfs_swapext(&tp, &req);
+	if (strategy == SWAPEXT)
+		error = xfs_swapext(&tp, &req);
+	else
+		error = xfs_swap_extent_forks(&tp, &req);
 	if (error)
 		goto out_trans_cancel;
 



* [PATCH 14/18] xfs: remove old swap extents implementation
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (12 preceding siblings ...)
  2021-04-01  1:09 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
@ 2021-04-01  1:10 ` Darrick J. Wong
  2021-04-01  1:10 ` [PATCH 15/18] xfs: condense extended attributes after an atomic swap Darrick J. Wong
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Migrate the old XFS_IOC_SWAPEXT implementation to use our shiny new one.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_bmap_util.c |  463 ------------------------------------------------
 fs/xfs/xfs_bmap_util.h |    7 -
 fs/xfs/xfs_ioctl.c     |  102 +++--------
 fs/xfs/xfs_ioctl.h     |    4 
 fs/xfs/xfs_ioctl32.c   |    8 -
 fs/xfs/xfs_xchgrange.c |  273 ++++++++++++++++++++++++++++
 6 files changed, 306 insertions(+), 551 deletions(-)


diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c
index 94f1d0d685fe..44f5c3ce02dd 100644
--- a/fs/xfs/xfs_bmap_util.c
+++ b/fs/xfs/xfs_bmap_util.c
@@ -1238,466 +1238,3 @@ xfs_insert_file_space(
 	xfs_iunlock(ip, XFS_ILOCK_EXCL);
 	return error;
 }
-
-/*
- * We need to check that the format of the data fork in the temporary inode is
- * valid for the target inode before doing the swap. This is not a problem with
- * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
- * data fork depending on the space the attribute fork is taking so we can get
- * invalid formats on the target inode.
- *
- * E.g. target has space for 7 extents in extent format, temp inode only has
- * space for 6.  If we defragment down to 7 extents, then the tmp format is a
- * btree, but when swapped it needs to be in extent format. Hence we can't just
- * blindly swap data forks on attr2 filesystems.
- *
- * Note that we check the swap in both directions so that we don't end up with
- * a corrupt temporary inode, either.
- *
- * Note that fixing the way xfs_fsr sets up the attribute fork in the source
- * inode will prevent this situation from occurring, so all we do here is
- * reject and log the attempt. basically we are putting the responsibility on
- * userspace to get this right.
- */
-int
-xfs_swap_extents_check_format(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip)	/* tmp inode */
-{
-	struct xfs_ifork	*ifp = &ip->i_df;
-	struct xfs_ifork	*tifp = &tip->i_df;
-
-	/* User/group/project quota ids must match if quotas are enforced. */
-	if (XFS_IS_QUOTA_ON(ip->i_mount) &&
-	    (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
-	     !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
-	     ip->i_d.di_projid != tip->i_d.di_projid))
-		return -EINVAL;
-
-	/* Should never get a local format */
-	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
-	    tifp->if_format == XFS_DINODE_FMT_LOCAL)
-		return -EINVAL;
-
-	/*
-	 * if the target inode has less extents that then temporary inode then
-	 * why did userspace call us?
-	 */
-	if (ifp->if_nextents < tifp->if_nextents)
-		return -EINVAL;
-
-	/*
-	 * If we have to use the (expensive) rmap swap method, we can
-	 * handle any number of extents and any format.
-	 */
-	if (xfs_sb_version_hasrmapbt(&ip->i_mount->m_sb))
-		return 0;
-
-	/*
-	 * if the target inode is in extent form and the temp inode is in btree
-	 * form then we will end up with the target inode in the wrong format
-	 * as we already know there are less extents in the temp inode.
-	 */
-	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    tifp->if_format == XFS_DINODE_FMT_BTREE)
-		return -EINVAL;
-
-	/* Check temp in extent form to max in target */
-	if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
-		return -EINVAL;
-
-	/* Check target in extent form to max in temp */
-	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
-	    ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
-		return -EINVAL;
-
-	/*
-	 * If we are in a btree format, check that the temp root block will fit
-	 * in the target and that it has enough extents to be in btree format
-	 * in the target.
-	 *
-	 * Note that we have to be careful to allow btree->extent conversions
-	 * (a common defrag case) which will occur when the temp inode is in
-	 * extent format...
-	 */
-	if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
-		if (XFS_IFORK_Q(ip) &&
-		    XFS_BMAP_BMDR_SPACE(tifp->if_broot) > XFS_IFORK_BOFF(ip))
-			return -EINVAL;
-		if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
-			return -EINVAL;
-	}
-
-	/* Reciprocal target->temp btree format checks */
-	if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
-		if (XFS_IFORK_Q(tip) &&
-		    XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > XFS_IFORK_BOFF(tip))
-			return -EINVAL;
-		if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
-			return -EINVAL;
-	}
-
-	return 0;
-}
-
-static int
-xfs_swap_extent_flush(
-	struct xfs_inode	*ip)
-{
-	int	error;
-
-	error = filemap_write_and_wait(VFS_I(ip)->i_mapping);
-	if (error)
-		return error;
-	truncate_pagecache_range(VFS_I(ip), 0, -1);
-
-	/* Verify O_DIRECT for ftmp */
-	if (VFS_I(ip)->i_mapping->nrpages)
-		return -EINVAL;
-	return 0;
-}
-
-/*
- * Fix up the owners of the bmbt blocks to refer to the current inode. The
- * change owner scan attempts to order all modified buffers in the current
- * transaction. In the event of ordered buffer failure, the offending buffer is
- * physically logged as a fallback and the scan returns -EAGAIN. We must roll
- * the transaction in this case to replenish the fallback log reservation and
- * restart the scan. This process repeats until the scan completes.
- */
-static int
-xfs_swap_change_owner(
-	struct xfs_trans	**tpp,
-	struct xfs_inode	*ip,
-	struct xfs_inode	*tmpip)
-{
-	int			error;
-	struct xfs_trans	*tp = *tpp;
-
-	do {
-		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
-					      NULL);
-		/* success or fatal error */
-		if (error != -EAGAIN)
-			break;
-
-		error = xfs_trans_roll(tpp);
-		if (error)
-			break;
-		tp = *tpp;
-
-		/*
-		 * Redirty both inodes so they can relog and keep the log tail
-		 * moving forward.
-		 */
-		xfs_trans_ijoin(tp, ip, 0);
-		xfs_trans_ijoin(tp, tmpip, 0);
-		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
-	} while (true);
-
-	return error;
-}
-
-/* Swap the extents of two files by swapping data forks. */
-int
-xfs_swap_extent_forks(
-	struct xfs_trans	**tpp,
-	struct xfs_swapext_req	*req)
-{
-	struct xfs_inode	*ip = req->ip1;
-	struct xfs_inode	*tip = req->ip2;
-	xfs_filblks_t		aforkblks = 0;
-	xfs_filblks_t		taforkblks = 0;
-	xfs_extnum_t		junk;
-	uint64_t		tmp;
-	unsigned int		reflink_state;
-	int			src_log_flags = XFS_ILOG_CORE;
-	int			target_log_flags = XFS_ILOG_CORE;
-	int			error;
-
-	reflink_state = xfs_swapext_reflink_prep(req);
-
-	/*
-	 * Count the number of extended attribute blocks
-	 */
-	if (XFS_IFORK_Q(ip) && ip->i_afp->if_nextents > 0 &&
-	    ip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
-				&aforkblks);
-		if (error)
-			return error;
-	}
-	if (XFS_IFORK_Q(tip) && tip->i_afp->if_nextents > 0 &&
-	    tip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
-		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
-				&taforkblks);
-		if (error)
-			return error;
-	}
-
-	/*
-	 * Btree format (v3) inodes have the inode number stamped in the bmbt
-	 * block headers. We can't start changing the bmbt blocks until the
-	 * inode owner change is logged so recovery does the right thing in the
-	 * event of a crash. Set the owner change log flags now and leave the
-	 * bmbt scan as the last step.
-	 */
-	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
-		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			target_log_flags |= XFS_ILOG_DOWNER;
-		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
-			src_log_flags |= XFS_ILOG_DOWNER;
-	}
-
-	/*
-	 * Swap the data forks of the inodes
-	 */
-	swap(ip->i_df, tip->i_df);
-
-	/*
-	 * Fix the on-disk inode values
-	 */
-	tmp = (uint64_t)ip->i_d.di_nblocks;
-	ip->i_d.di_nblocks = tip->i_d.di_nblocks - taforkblks + aforkblks;
-	tip->i_d.di_nblocks = tmp + taforkblks - aforkblks;
-
-	/*
-	 * The extents in the source inode could still contain speculative
-	 * preallocation beyond EOF (e.g. the file is open but not modified
-	 * while defrag is in progress). In that case, we need to copy over the
-	 * number of delalloc blocks the data fork in the source inode is
-	 * tracking beyond EOF so that when the fork is truncated away when the
-	 * temporary inode is unlinked we don't underrun the i_delayed_blks
-	 * counter on that inode.
-	 */
-	ASSERT(tip->i_delayed_blks == 0);
-	tip->i_delayed_blks = ip->i_delayed_blks;
-	ip->i_delayed_blks = 0;
-
-	switch (ip->i_df.if_format) {
-	case XFS_DINODE_FMT_EXTENTS:
-		src_log_flags |= XFS_ILOG_DEXT;
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (src_log_flags & XFS_ILOG_DOWNER));
-		src_log_flags |= XFS_ILOG_DBROOT;
-		break;
-	}
-
-	switch (tip->i_df.if_format) {
-	case XFS_DINODE_FMT_EXTENTS:
-		target_log_flags |= XFS_ILOG_DEXT;
-		break;
-	case XFS_DINODE_FMT_BTREE:
-		target_log_flags |= XFS_ILOG_DBROOT;
-		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
-		       (target_log_flags & XFS_ILOG_DOWNER));
-		break;
-	}
-
-	xfs_swapext_reflink_finish(*tpp, req, reflink_state);
-
-	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
-	xfs_trans_log_inode(*tpp, tip, target_log_flags);
-
-	/*
-	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
-	 * have inode number owner values in the bmbt blocks that still refer to
-	 * the old inode. Scan each bmbt to fix up the owner values with the
-	 * inode number of the current inode.
-	 */
-	if (src_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(tpp, ip, tip);
-		if (error)
-			return error;
-	}
-	if (target_log_flags & XFS_ILOG_DOWNER) {
-		error = xfs_swap_change_owner(tpp, tip, ip);
-		if (error)
-			return error;
-	}
-
-	return 0;
-}
-
-int
-xfs_swap_extents(
-	struct xfs_inode	*ip,	/* target inode */
-	struct xfs_inode	*tip,	/* tmp inode */
-	struct xfs_swapext	*sxp)
-{
-	struct xfs_swapext_req	req = {
-		.ip1		= ip,
-		.ip2		= tip,
-		.whichfork	= XFS_DATA_FORK,
-	};
-	struct xfs_mount	*mp = ip->i_mount;
-	struct xfs_trans	*tp;
-	struct xfs_bstat	*sbp = &sxp->sx_stat;
-	int			error = 0;
-	int			lock_flags;
-	int			resblks = 0;
-	unsigned int		flags = 0;
-
-	/*
-	 * Lock the inodes against other IO, page faults and truncate to
-	 * begin with.  Then we can ensure the inodes are flushed and have no
-	 * page cache safely. Once we have done this we can take the ilocks and
-	 * do the rest of the checks.
-	 */
-	lock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	lock_flags = XFS_MMAPLOCK_EXCL;
-	xfs_lock_two_inodes(ip, XFS_MMAPLOCK_EXCL, tip, XFS_MMAPLOCK_EXCL);
-
-	/* Verify that both files have the same format */
-	if ((VFS_I(ip)->i_mode & S_IFMT) != (VFS_I(tip)->i_mode & S_IFMT)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	/* Verify both files are either real-time or non-realtime */
-	if (XFS_IS_REALTIME_INODE(ip) != XFS_IS_REALTIME_INODE(tip)) {
-		error = -EINVAL;
-		goto out_unlock;
-	}
-
-	error = xfs_qm_dqattach(ip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_qm_dqattach(tip);
-	if (error)
-		goto out_unlock;
-
-	error = xfs_swap_extent_flush(ip);
-	if (error)
-		goto out_unlock;
-	error = xfs_swap_extent_flush(tip);
-	if (error)
-		goto out_unlock;
-
-	if (xfs_inode_has_cow_data(tip)) {
-		error = xfs_reflink_cancel_cow_range(tip, 0, NULLFILEOFF, true);
-		if (error)
-			goto out_unlock;
-	}
-
-	/*
-	 * Extent "swapping" with rmap requires a permanent reservation and
-	 * a block reservation because it's really just a remap operation
-	 * performed with log redo items!
-	 */
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb)) {
-		int		w = XFS_DATA_FORK;
-		uint32_t	ipnext = ip->i_df.if_nextents;
-		uint32_t	tipnext	= tip->i_df.if_nextents;
-
-		/*
-		 * Conceptually this shouldn't affect the shape of either bmbt,
-		 * but since we atomically move extents one by one, we reserve
-		 * enough space to rebuild both trees.
-		 */
-		resblks = XFS_SWAP_RMAP_SPACE_RES(mp, ipnext, w);
-		resblks +=  XFS_SWAP_RMAP_SPACE_RES(mp, tipnext, w);
-
-		/*
-		 * If either inode straddles a bmapbt block allocation boundary,
-		 * the rmapbt algorithm triggers repeated allocs and frees as
-		 * extents are remapped. This can exhaust the block reservation
-		 * prematurely and cause shutdown. Return freed blocks to the
-		 * transaction reservation to counter this behavior.
-		 */
-		flags |= XFS_TRANS_RES_FDBLKS;
-	}
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, resblks, 0, flags,
-				&tp);
-	if (error)
-		goto out_unlock;
-
-	/*
-	 * Lock and join the inodes to the tansaction so that transaction commit
-	 * or cancel will unlock the inodes from this point onwards.
-	 */
-	xfs_lock_two_inodes(ip, XFS_ILOCK_EXCL, tip, XFS_ILOCK_EXCL);
-	lock_flags |= XFS_ILOCK_EXCL;
-	xfs_trans_ijoin(tp, ip, 0);
-	xfs_trans_ijoin(tp, tip, 0);
-
-
-	/* Verify all data are being swapped */
-	if (sxp->sx_offset != 0 ||
-	    sxp->sx_length != ip->i_d.di_size ||
-	    sxp->sx_length != tip->i_d.di_size) {
-		error = -EFAULT;
-		goto out_trans_cancel;
-	}
-
-	trace_xfs_swap_extent_before(ip, 0);
-	trace_xfs_swap_extent_before(tip, 1);
-
-	/* check inode formats now that data is flushed */
-	error = xfs_swap_extents_check_format(ip, tip);
-	if (error) {
-		xfs_notice(mp,
-		    "%s: inode 0x%llx format is incompatible for exchanging.",
-				__func__, ip->i_ino);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Compare the current change & modify times with that
-	 * passed in.  If they differ, we abort this swap.
-	 * This is the mechanism used to ensure the calling
-	 * process that the file was not changed out from
-	 * under it.
-	 */
-	if ((sbp->bs_ctime.tv_sec != VFS_I(ip)->i_ctime.tv_sec) ||
-	    (sbp->bs_ctime.tv_nsec != VFS_I(ip)->i_ctime.tv_nsec) ||
-	    (sbp->bs_mtime.tv_sec != VFS_I(ip)->i_mtime.tv_sec) ||
-	    (sbp->bs_mtime.tv_nsec != VFS_I(ip)->i_mtime.tv_nsec)) {
-		error = -EBUSY;
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * Note the trickiness in setting the log flags - we set the owner log
-	 * flag on the opposite inode (i.e. the inode we are setting the new
-	 * owner to be) because once we swap the forks and log that, log
-	 * recovery is going to see the fork as owned by the swapped inode,
-	 * not the pre-swapped inodes.
-	 */
-	req.blockcount = XFS_B_TO_FSB(ip->i_mount, i_size_read(VFS_I(ip)));
-	if (xfs_sb_version_hasrmapbt(&mp->m_sb))
-		error = xfs_swapext(&tp, &req);
-	else
-		error = xfs_swap_extent_forks(&tp, &req);
-	if (error) {
-		trace_xfs_swap_extent_error(ip, error, _THIS_IP_);
-		goto out_trans_cancel;
-	}
-
-	/*
-	 * If this is a synchronous mount, make sure that the
-	 * transaction goes to disk before returning to the user.
-	 */
-	if (mp->m_flags & XFS_MOUNT_WSYNC)
-		xfs_trans_set_sync(tp);
-
-	error = xfs_trans_commit(tp);
-
-	trace_xfs_swap_extent_after(ip, 0);
-	trace_xfs_swap_extent_after(tip, 1);
-
-out_unlock:
-	xfs_iunlock(ip, lock_flags);
-	xfs_iunlock(tip, lock_flags);
-	unlock_two_nondirectories(VFS_I(ip), VFS_I(tip));
-	return error;
-
-out_trans_cancel:
-	xfs_trans_cancel(tp);
-	goto out_unlock;
-}
diff --git a/fs/xfs/xfs_bmap_util.h b/fs/xfs/xfs_bmap_util.h
index de3173e64f47..cebdd492fa85 100644
--- a/fs/xfs/xfs_bmap_util.h
+++ b/fs/xfs/xfs_bmap_util.h
@@ -66,13 +66,6 @@ int	xfs_insert_file_space(struct xfs_inode *, xfs_off_t offset,
 bool	xfs_can_free_eofblocks(struct xfs_inode *ip, bool force);
 int	xfs_free_eofblocks(struct xfs_inode *ip);
 
-int	xfs_swap_extents(struct xfs_inode *ip, struct xfs_inode *tip,
-			 struct xfs_swapext *sx);
-
-struct xfs_swapext_req;
-int xfs_swap_extent_forks(struct xfs_trans **tpp, struct xfs_swapext_req *req);
-int xfs_swap_extents_check_format(struct xfs_inode *ip, struct xfs_inode *tip);
-
 xfs_daddr_t xfs_fsb_to_db(struct xfs_inode *ip, xfs_fsblock_t fsb);
 
 xfs_extnum_t xfs_bmap_count_leaves(struct xfs_ifork *ifp, xfs_filblks_t *count);
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ac3192a433f9..1d808a68b50f 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -1870,81 +1870,43 @@ xfs_ioc_scrub_metadata(
 
 int
 xfs_ioc_swapext(
-	xfs_swapext_t	*sxp)
+	struct xfs_swapext	*sxp)
 {
-	xfs_inode_t     *ip, *tip;
-	struct fd	f, tmp;
-	int		error = 0;
+	struct file_xchg_range	fxr = { 0 };
+	struct fd		fd2, fd1;
+	int			error = 0;
 
-	/* Pull information for the target fd */
-	f = fdget((int)sxp->sx_fdtarget);
-	if (!f.file) {
-		error = -EINVAL;
-		goto out;
-	}
-
-	if (!(f.file->f_mode & FMODE_WRITE) ||
-	    !(f.file->f_mode & FMODE_READ) ||
-	    (f.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_file;
-	}
+	fd2 = fdget((int)sxp->sx_fdtarget);
+	if (!fd2.file)
+		return -EINVAL;
 
-	tmp = fdget((int)sxp->sx_fdtmp);
-	if (!tmp.file) {
+	fd1 = fdget((int)sxp->sx_fdtmp);
+	if (!fd1.file) {
 		error = -EINVAL;
-		goto out_put_file;
+		goto dest_fdput;
 	}
 
-	if (!(tmp.file->f_mode & FMODE_WRITE) ||
-	    !(tmp.file->f_mode & FMODE_READ) ||
-	    (tmp.file->f_flags & O_APPEND)) {
-		error = -EBADF;
-		goto out_put_tmp_file;
-	}
+	fxr.file1_fd = sxp->sx_fdtmp;
+	fxr.length = sxp->sx_length;
+	fxr.flags = FILE_XCHG_RANGE_NONATOMIC | FILE_XCHG_RANGE_FILE2_FRESH |
+		    FILE_XCHG_RANGE_FULL_FILES;
+	fxr.file2_ino = sxp->sx_stat.bs_ino;
+	fxr.file2_mtime = sxp->sx_stat.bs_mtime.tv_sec;
+	fxr.file2_ctime = sxp->sx_stat.bs_ctime.tv_sec;
+	fxr.file2_mtime_nsec = sxp->sx_stat.bs_mtime.tv_nsec;
+	fxr.file2_ctime_nsec = sxp->sx_stat.bs_ctime.tv_nsec;
 
-	if (IS_SWAPFILE(file_inode(f.file)) ||
-	    IS_SWAPFILE(file_inode(tmp.file))) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
+	error = vfs_xchg_file_range(fd1.file, fd2.file, &fxr);
 
 	/*
-	 * We need to ensure that the fds passed in point to XFS inodes
-	 * before we cast and access them as XFS structures as we have no
-	 * control over what the user passes us here.
+	 * The old implementation returned EFAULT if the swap range was not
+	 * the entirety of both files.
 	 */
-	if (f.file->f_op != &xfs_file_operations ||
-	    tmp.file->f_op != &xfs_file_operations) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	ip = XFS_I(file_inode(f.file));
-	tip = XFS_I(file_inode(tmp.file));
-
-	if (ip->i_mount != tip->i_mount) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (ip->i_ino == tip->i_ino) {
-		error = -EINVAL;
-		goto out_put_tmp_file;
-	}
-
-	if (XFS_FORCED_SHUTDOWN(ip->i_mount)) {
-		error = -EIO;
-		goto out_put_tmp_file;
-	}
-
-	error = xfs_swap_extents(ip, tip, sxp);
-
- out_put_tmp_file:
-	fdput(tmp);
- out_put_file:
-	fdput(f);
- out:
+	if (error == -EDOM)
+		error = -EFAULT;
+	fdput(fd1);
+dest_fdput:
+	fdput(fd2);
 	return error;
 }
 
@@ -2197,14 +2159,10 @@ xfs_file_ioctl(
 	case XFS_IOC_SWAPEXT: {
 		struct xfs_swapext	sxp;
 
-		if (copy_from_user(&sxp, arg, sizeof(xfs_swapext_t)))
+		if (copy_from_user(&sxp, arg, sizeof(struct xfs_swapext)))
 			return -EFAULT;
-		error = mnt_want_write_file(filp);
-		if (error)
-			return error;
-		error = xfs_ioc_swapext(&sxp);
-		mnt_drop_write_file(filp);
-		return error;
+
+		return xfs_ioc_swapext(&sxp);
 	}
 
 	case XFS_IOC_FSCOUNTS: {
diff --git a/fs/xfs/xfs_ioctl.h b/fs/xfs/xfs_ioctl.h
index bab6a5a92407..98c4ff127b0d 100644
--- a/fs/xfs/xfs_ioctl.h
+++ b/fs/xfs/xfs_ioctl.h
@@ -16,9 +16,7 @@ xfs_ioc_space(
 	struct file		*filp,
 	xfs_flock64_t		*bf);
 
-int
-xfs_ioc_swapext(
-	xfs_swapext_t	*sxp);
+int xfs_ioc_swapext(struct xfs_swapext *sxp);
 
 extern int
 xfs_find_handle(
diff --git a/fs/xfs/xfs_ioctl32.c b/fs/xfs/xfs_ioctl32.c
index 33c09ec8e6c0..63186e0063ad 100644
--- a/fs/xfs/xfs_ioctl32.c
+++ b/fs/xfs/xfs_ioctl32.c
@@ -498,12 +498,8 @@ xfs_file_compat_ioctl(
 				   offsetof(struct xfs_swapext, sx_stat)) ||
 		    xfs_ioctl32_bstat_copyin(&sxp.sx_stat, &sxu->sx_stat))
 			return -EFAULT;
-		error = mnt_want_write_file(filp);
-		if (error)
-			return error;
-		error = xfs_ioc_swapext(&sxp);
-		mnt_drop_write_file(filp);
-		return error;
+
+		return xfs_ioc_swapext(&sxp);
 	}
 	case XFS_IOC_FSBULKSTAT_32:
 	case XFS_IOC_FSBULKSTAT_SINGLE_32:
diff --git a/fs/xfs/xfs_xchgrange.c b/fs/xfs/xfs_xchgrange.c
index ef74965198c6..395ab886ebe5 100644
--- a/fs/xfs/xfs_xchgrange.c
+++ b/fs/xfs/xfs_xchgrange.c
@@ -2,6 +2,11 @@
 /*
  * Copyright (C) 2021 Oracle.  All Rights Reserved.
  * Author: Darrick J. Wong <djwong@kernel.org>
+ *
+ * The xfs_swap_extent_* functions are:
+ * Copyright (c) 2000-2006 Silicon Graphics, Inc.
+ * Copyright (c) 2012 Red Hat, Inc.
+ * All Rights Reserved.
  */
 #include "xfs.h"
 #include "xfs_fs.h"
@@ -15,6 +20,7 @@
 #include "xfs_trans.h"
 #include "xfs_quota.h"
 #include "xfs_bmap_util.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_reflink.h"
 #include "xfs_trace.h"
 #include "xfs_swapext.h"
@@ -72,6 +78,273 @@ xfs_xchg_range_estimate(
 	return error;
 }
 
+/*
+ * We need to check that the format of the data fork in the temporary inode is
+ * valid for the target inode before doing the swap. This is not a problem with
+ * attr1 because of the fixed fork offset, but attr2 has a dynamically sized
+ * data fork depending on the space the attribute fork is taking so we can get
+ * invalid formats on the target inode.
+ *
+ * E.g. target has space for 7 extents in extent format, temp inode only has
+ * space for 6.  If we defragment down to 7 extents, then the tmp format is a
+ * btree, but when swapped it needs to be in extent format. Hence we can't just
+ * blindly swap data forks on attr2 filesystems.
+ *
+ * Note that we check the swap in both directions so that we don't end up with
+ * a corrupt temporary inode, either.
+ *
+ * Note that fixing the way xfs_fsr sets up the attribute fork in the source
+ * inode will prevent this situation from occurring, so all we do here is
+ * reject and log the attempt. Basically we are putting the responsibility on
+ * userspace to get this right.
+ */
+STATIC int
+xfs_swap_extents_check_format(
+	struct xfs_inode	*ip,	/* target inode */
+	struct xfs_inode	*tip)	/* tmp inode */
+{
+	struct xfs_ifork	*ifp = &ip->i_df;
+	struct xfs_ifork	*tifp = &tip->i_df;
+
+	/* User/group/project quota ids must match if quotas are enforced. */
+	if (XFS_IS_QUOTA_ON(ip->i_mount) &&
+	    (!uid_eq(VFS_I(ip)->i_uid, VFS_I(tip)->i_uid) ||
+	     !gid_eq(VFS_I(ip)->i_gid, VFS_I(tip)->i_gid) ||
+	     ip->i_d.di_projid != tip->i_d.di_projid))
+		return -EINVAL;
+
+	/* Should never get a local format */
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+	    tifp->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/*
+	 * if the target inode has fewer extents than the temporary inode then
+	 * why did userspace call us?
+	 */
+	if (ifp->if_nextents < tifp->if_nextents)
+		return -EINVAL;
+
+	/*
+	 * If we have to use the (expensive) rmap swap method, we can
+	 * handle any number of extents and any format.
+	 */
+	if (xfs_sb_version_hasrmapbt(&ip->i_mount->m_sb))
+		return 0;
+
+	/*
+	 * if the target inode is in extent form and the temp inode is in btree
+	 * form then we will end up with the target inode in the wrong format
+	 * as we already know there are fewer extents in the temp inode.
+	 */
+	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    tifp->if_format == XFS_DINODE_FMT_BTREE)
+		return -EINVAL;
+
+	/* Check temp in extent form to max in target */
+	if (tifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    tifp->if_nextents > XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+		return -EINVAL;
+
+	/* Check target in extent form to max in temp */
+	if (ifp->if_format == XFS_DINODE_FMT_EXTENTS &&
+	    ifp->if_nextents > XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+		return -EINVAL;
+
+	/*
+	 * If we are in a btree format, check that the temp root block will fit
+	 * in the target and that it has enough extents to be in btree format
+	 * in the target.
+	 *
+	 * Note that we have to be careful to allow btree->extent conversions
+	 * (a common defrag case) which will occur when the temp inode is in
+	 * extent format...
+	 */
+	if (tifp->if_format == XFS_DINODE_FMT_BTREE) {
+		if (XFS_IFORK_Q(ip) &&
+		    XFS_BMAP_BMDR_SPACE(tifp->if_broot) > XFS_IFORK_BOFF(ip))
+			return -EINVAL;
+		if (tifp->if_nextents <= XFS_IFORK_MAXEXT(ip, XFS_DATA_FORK))
+			return -EINVAL;
+	}
+
+	/* Reciprocal target->temp btree format checks */
+	if (ifp->if_format == XFS_DINODE_FMT_BTREE) {
+		if (XFS_IFORK_Q(tip) &&
+		    XFS_BMAP_BMDR_SPACE(ip->i_df.if_broot) > XFS_IFORK_BOFF(tip))
+			return -EINVAL;
+		if (ifp->if_nextents <= XFS_IFORK_MAXEXT(tip, XFS_DATA_FORK))
+			return -EINVAL;
+	}
+
+	return 0;
+}
+
+/*
+ * Fix up the owners of the bmbt blocks to refer to the current inode. The
+ * change owner scan attempts to order all modified buffers in the current
+ * transaction. In the event of ordered buffer failure, the offending buffer is
+ * physically logged as a fallback and the scan returns -EAGAIN. We must roll
+ * the transaction in this case to replenish the fallback log reservation and
+ * restart the scan. This process repeats until the scan completes.
+ */
+STATIC int
+xfs_swap_change_owner(
+	struct xfs_trans	**tpp,
+	struct xfs_inode	*ip,
+	struct xfs_inode	*tmpip)
+{
+	int			error;
+	struct xfs_trans	*tp = *tpp;
+
+	do {
+		error = xfs_bmbt_change_owner(tp, ip, XFS_DATA_FORK, ip->i_ino,
+					      NULL);
+		/* success or fatal error */
+		if (error != -EAGAIN)
+			break;
+
+		error = xfs_trans_roll(tpp);
+		if (error)
+			break;
+		tp = *tpp;
+
+		/*
+		 * Redirty both inodes so they can relog and keep the log tail
+		 * moving forward.
+		 */
+		xfs_trans_ijoin(tp, ip, 0);
+		xfs_trans_ijoin(tp, tmpip, 0);
+		xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+		xfs_trans_log_inode(tp, tmpip, XFS_ILOG_CORE);
+	} while (true);
+
+	return error;
+}
+
+/* Swap the extents of two files by swapping data forks. */
+STATIC int
+xfs_swap_extent_forks(
+	struct xfs_trans	**tpp,
+	struct xfs_swapext_req	*req)
+{
+	struct xfs_inode	*ip = req->ip2; /* target inode */
+	struct xfs_inode	*tip = req->ip1; /* tmp inode */
+	xfs_filblks_t		aforkblks = 0;
+	xfs_filblks_t		taforkblks = 0;
+	xfs_extnum_t		junk;
+	uint64_t		tmp;
+	unsigned int		reflink_state;
+	int			src_log_flags = XFS_ILOG_CORE;
+	int			target_log_flags = XFS_ILOG_CORE;
+	int			error;
+
+	reflink_state = xfs_swapext_reflink_prep(req);
+
+	/*
+	 * Count the number of extended attribute blocks
+	 */
+	if (XFS_IFORK_Q(ip) && ip->i_afp->if_nextents > 0 &&
+	    ip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
+		error = xfs_bmap_count_blocks(*tpp, ip, XFS_ATTR_FORK, &junk,
+				&aforkblks);
+		if (error)
+			return error;
+	}
+	if (XFS_IFORK_Q(tip) && tip->i_afp->if_nextents > 0 &&
+	    tip->i_afp->if_format != XFS_DINODE_FMT_LOCAL) {
+		error = xfs_bmap_count_blocks(*tpp, tip, XFS_ATTR_FORK, &junk,
+				&taforkblks);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * Btree format (v3) inodes have the inode number stamped in the bmbt
+	 * block headers. We can't start changing the bmbt blocks until the
+	 * inode owner change is logged so recovery does the right thing in the
+	 * event of a crash. Set the owner change log flags now and leave the
+	 * bmbt scan as the last step.
+	 */
+	if (xfs_sb_version_has_v3inode(&ip->i_mount->m_sb)) {
+		if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+			target_log_flags |= XFS_ILOG_DOWNER;
+		if (tip->i_df.if_format == XFS_DINODE_FMT_BTREE)
+			src_log_flags |= XFS_ILOG_DOWNER;
+	}
+
+	/*
+	 * Swap the data forks of the inodes
+	 */
+	swap(ip->i_df, tip->i_df);
+
+	/*
+	 * Fix the on-disk inode values
+	 */
+	tmp = (uint64_t)ip->i_d.di_nblocks;
+	ip->i_d.di_nblocks = tip->i_d.di_nblocks - taforkblks + aforkblks;
+	tip->i_d.di_nblocks = tmp + taforkblks - aforkblks;
+
+	/*
+	 * The extents in the source inode could still contain speculative
+	 * preallocation beyond EOF (e.g. the file is open but not modified
+	 * while defrag is in progress). In that case, we need to copy over the
+	 * number of delalloc blocks the data fork in the source inode is
+	 * tracking beyond EOF so that when the fork is truncated away when the
+	 * temporary inode is unlinked we don't underrun the i_delayed_blks
+	 * counter on that inode.
+	 */
+	ASSERT(tip->i_delayed_blks == 0);
+	tip->i_delayed_blks = ip->i_delayed_blks;
+	ip->i_delayed_blks = 0;
+
+	switch (ip->i_df.if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		src_log_flags |= XFS_ILOG_DEXT;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
+		       (src_log_flags & XFS_ILOG_DOWNER));
+		src_log_flags |= XFS_ILOG_DBROOT;
+		break;
+	}
+
+	switch (tip->i_df.if_format) {
+	case XFS_DINODE_FMT_EXTENTS:
+		target_log_flags |= XFS_ILOG_DEXT;
+		break;
+	case XFS_DINODE_FMT_BTREE:
+		target_log_flags |= XFS_ILOG_DBROOT;
+		ASSERT(!xfs_sb_version_has_v3inode(&ip->i_mount->m_sb) ||
+		       (target_log_flags & XFS_ILOG_DOWNER));
+		break;
+	}
+
+	xfs_swapext_reflink_finish(*tpp, req, reflink_state);
+
+	xfs_trans_log_inode(*tpp, ip,  src_log_flags);
+	xfs_trans_log_inode(*tpp, tip, target_log_flags);
+
+	/*
+	 * The extent forks have been swapped, but crc=1,rmapbt=0 filesystems
+	 * have inode number owner values in the bmbt blocks that still refer to
+	 * the old inode. Scan each bmbt to fix up the owner values with the
+	 * inode number of the current inode.
+	 */
+	if (src_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, ip, tip);
+		if (error)
+			return error;
+	}
+	if (target_log_flags & XFS_ILOG_DOWNER) {
+		error = xfs_swap_change_owner(tpp, tip, ip);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
 /* Prepare two files to have their data exchanged. */
 int
 xfs_xchg_range_prep(
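
A note on the di_nblocks fix-up in xfs_swap_extent_forks above: di_nblocks
counts data fork and attr fork blocks together, but only the data forks
trade places, so each inode should end up with the other inode's data
blocks plus its own attr blocks.  The following is a minimal userspace
sketch of that arithmetic only.  struct model_inode and
model_swap_nblocks are hypothetical names invented for illustration and
are not kernel types or functions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the two counters involved; not a kernel type. */
struct model_inode {
	uint64_t	nblocks;	/* data fork + attr fork blocks */
	uint64_t	attr_blocks;	/* blocks owned by the attr fork */
};

/*
 * Mirror the fix-up done after swap(ip->i_df, tip->i_df): each inode takes
 * the other's total, minus the other's attr blocks, plus its own.
 */
static void
model_swap_nblocks(
	struct model_inode	*ip,	/* target inode */
	struct model_inode	*tip)	/* tmp inode */
{
	uint64_t		tmp = ip->nblocks;

	/* A total can never be smaller than its attr fork share. */
	assert(ip->nblocks >= ip->attr_blocks);
	assert(tip->nblocks >= tip->attr_blocks);

	ip->nblocks = tip->nblocks - tip->attr_blocks + ip->attr_blocks;
	tip->nblocks = tmp + tip->attr_blocks - ip->attr_blocks;
}
```

For a target with 100 data and 3 attr blocks and a temp file with 40 data
and 1 attr block, the target ends at 43 blocks and the temp file at 101.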



* [PATCH 15/18] xfs: condense extended attributes after an atomic swap
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (13 preceding siblings ...)
  2021-04-01  1:10 ` [PATCH 14/18] xfs: remove old swap extents implementation Darrick J. Wong
@ 2021-04-01  1:10 ` Darrick J. Wong
  2021-04-01  1:10 ` [PATCH 16/18] xfs: condense directories " Darrick J. Wong
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add a new swapext flag that enables us to perform post-swap processing
on file2 once we're done swapping the extent maps.  If we were swapping
the extended attributes, we want to be able to convert file2's attr fork
from block to inline format.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and swap the attr forks when ready.
If one file is in extents format and the other is inline, we will have to
promote both to extents format to perform the swap.  After the swap, we
can try to condense the fixed file's attr fork back down to inline
format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
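
The flag handling described in the commit message can be modelled in
userspace roughly as follows.  This is a sketch under stated assumptions,
not the kernel code: struct model_intent and the model_* helpers are
hypothetical, each step pretends to swap exactly one block, the "condense"
step is reduced to a counter, and the value chosen for MODEL_ATTR_FORK is
an assumption (only the INO2_SHORTFORM bit position is taken from this
patch).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_ATTR_FORK		(1ULL << 0)	/* assumed bit position */
#define MODEL_INO2_SHORTFORM	(1ULL << 3)	/* as in this patch */
#define MODEL_POST_PROCESSING	(MODEL_INO2_SHORTFORM)

/* Hypothetical stand-in for the swapext intent; not a kernel type. */
struct model_intent {
	uint64_t	blockcount;	/* blocks left to swap */
	uint64_t	flags;
	unsigned int	condense_calls;	/* times we tried to condense file2 */
};

/* More work remains while blocks are left or post-processing is pending. */
static bool
model_has_more_work(
	const struct model_intent	*sxi)
{
	return sxi->blockcount > 0 ||
		(sxi->flags & MODEL_POST_PROCESSING) != 0;
}

/* One step: swap a block, or run the pending post-processing once. */
static void
model_finish_one(
	struct model_intent	*sxi)
{
	if (sxi->blockcount == 0) {
		if (sxi->flags & MODEL_INO2_SHORTFORM) {
			/* Only attr forks are condensed in this patch. */
			if (sxi->flags & MODEL_ATTR_FORK)
				sxi->condense_calls++;
			/* Clear the flag so the operation can terminate. */
			sxi->flags &= ~MODEL_INO2_SHORTFORM;
		}
		return;
	}
	sxi->blockcount--;
}

/* Drive the intent to completion and report how many steps it took. */
static unsigned int
model_run(
	struct model_intent	*sxi)
{
	unsigned int		steps = 0;

	while (model_has_more_work(sxi)) {
		model_finish_one(sxi);
		steps++;
	}
	return steps;
}
```

In this model the post-processing flag costs exactly one extra step after
the last block is swapped, and clearing the flag is what lets
model_has_more_work() finally return false.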
 fs/xfs/libxfs/xfs_log_format.h |    6 ++++
 fs/xfs/libxfs/xfs_swapext.c    |   56 ++++++++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_swapext.h    |    5 ++++
 fs/xfs/xfs_trace.h             |    6 +++-
 4 files changed, 67 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 52ca6d72de6a..9cca7db4c663 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -814,9 +814,13 @@ struct xfs_swap_extent {
 /* Do not swap any part of the range where file1's mapping is a hole. */
 #define XFS_SWAP_EXTENT_SKIP_FILE1_HOLES (1ULL << 2)
 
+/* Try to convert inode2 from block to short format at the end, if possible. */
+#define XFS_SWAP_EXTENT_INO2_SHORTFORM	(1ULL << 3)
+
 #define XFS_SWAP_EXTENT_FLAGS		(XFS_SWAP_EXTENT_ATTR_FORK | \
 					 XFS_SWAP_EXTENT_SET_SIZES | \
-					 XFS_SWAP_EXTENT_SKIP_FILE1_HOLES)
+					 XFS_SWAP_EXTENT_SKIP_FILE1_HOLES | \
+					 XFS_SWAP_EXTENT_INO2_SHORTFORM)
 
 /* This is the structure used to lay out an sxi log item in the log. */
 struct xfs_sxi_log_format {
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 082680635146..964af61c9e5d 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -22,6 +22,9 @@
 #include "xfs_trans_space.h"
 #include "xfs_errortag.h"
 #include "xfs_error.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr_leaf.h"
 
 /* Information to help us reset reflink flag / CoW fork state after a swap. */
 
@@ -196,12 +199,46 @@ xfs_swapext_update_size(
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 }
 
+/* Convert inode2's leaf attr fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_attr_to_shortform2(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_attr_geo,
+		.whichfork	= XFS_ATTR_FORK,
+		.trans		= tp,
+	};
+	struct xfs_buf		*bp;
+	int			forkoff;
+	int			error;
+
+	if (!xfs_bmap_one_block(sxi->sxi_ip2, XFS_ATTR_FORK))
+		return 0;
+
+	error = xfs_attr3_leaf_read(tp, sxi->sxi_ip2, 0, &bp);
+	if (error)
+		return error;
+
+	forkoff = xfs_attr_shortform_allfit(bp, sxi->sxi_ip2);
+	if (forkoff == 0)
+		return 0;
+
+	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
+}
+
+/* Mask of all flags that require post-processing of file2. */
+#define XFS_SWAP_EXTENT_POST_PROCESSING (XFS_SWAP_EXTENT_INO2_SHORTFORM)
+
 /* Do we have more work to do to finish this operation? */
 bool
 xfs_swapext_has_more_work(
 	struct xfs_swapext_intent	*sxi)
 {
-	return sxi->sxi_blockcount > 0;
+	return sxi->sxi_blockcount > 0 ||
+		(sxi->sxi_flags & XFS_SWAP_EXTENT_POST_PROCESSING);
 }
 
 /* Check all extents to make sure we can actually swap them. */
@@ -273,12 +310,23 @@ xfs_swapext_finish_one(
 	int				whichfork;
 	int				nimaps;
 	int				bmap_flags;
-	int				error;
+	int				error = 0;
 
 	whichfork = (sxi->sxi_flags & XFS_SWAP_EXTENT_ATTR_FORK) ?
 			XFS_ATTR_FORK : XFS_DATA_FORK;
 	bmap_flags = xfs_bmapi_aflag(whichfork);
 
+	/* Do any post-processing work that requires a transaction roll. */
+	if (sxi->sxi_blockcount == 0) {
+		if (sxi->sxi_flags & XFS_SWAP_EXTENT_INO2_SHORTFORM) {
+			if (sxi->sxi_flags & XFS_SWAP_EXTENT_ATTR_FORK)
+				error = xfs_swapext_attr_to_shortform2(tp, sxi);
+			sxi->sxi_flags &= ~XFS_SWAP_EXTENT_INO2_SHORTFORM;
+			return error;
+		}
+		return 0;
+	}
+
 	while (sxi->sxi_blockcount > 0) {
 		/* Read extent from the first file */
 		nimaps = 1;
@@ -419,7 +467,7 @@ xfs_swapext_finish_one(
 }
 
 /* Estimate the bmbt and rmapbt overhead required to exchange extents. */
-static int
+int
 xfs_swapext_estimate_overhead(
 	const struct xfs_swapext_req	*req,
 	struct xfs_swapext_res		*res)
@@ -838,6 +886,8 @@ xfs_swapext_init_intent(
 	}
 	if (req->flags & XFS_SWAPEXT_SKIP_FILE1_HOLES)
 		sxi->sxi_flags |= XFS_SWAP_EXTENT_SKIP_FILE1_HOLES;
+	if (req->flags & XFS_SWAPEXT_INO2_SHORTFORM)
+		sxi->sxi_flags |= XFS_SWAP_EXTENT_INO2_SHORTFORM;
 	sxi->sxi_ip1 = req->ip1;
 	sxi->sxi_ip2 = req->ip2;
 	sxi->sxi_startoff1 = req->startoff1;
diff --git a/fs/xfs/libxfs/xfs_swapext.h b/fs/xfs/libxfs/xfs_swapext.h
index e63f4a5556c1..68842d62ec82 100644
--- a/fs/xfs/libxfs/xfs_swapext.h
+++ b/fs/xfs/libxfs/xfs_swapext.h
@@ -41,6 +41,9 @@ struct xfs_swapext_intent {
 /* Do not swap any part of the range where file1's mapping is a hole. */
 #define XFS_SWAPEXT_SKIP_FILE1_HOLES	(1U << 1)
 
+/* Try to convert inode2's fork to local format, if possible. */
+#define XFS_SWAPEXT_INO2_SHORTFORM	(1U << 2)
+
 /* Parameters for a swapext request. */
 struct xfs_swapext_req {
 	struct xfs_inode	*ip1;
@@ -68,6 +71,8 @@ unsigned int xfs_swapext_reflink_prep(const struct xfs_swapext_req *req);
 void xfs_swapext_reflink_finish(struct xfs_trans *tp,
 		const struct xfs_swapext_req *req, unsigned int reflink_state);
 
+int xfs_swapext_estimate_overhead(const struct xfs_swapext_req *req,
+		struct xfs_swapext_res *res);
 int xfs_swapext_estimate(const struct xfs_swapext_req *req,
 		struct xfs_swapext_res *res);
 
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index f2db023986a4..915f3856a04a 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -3954,7 +3954,8 @@ DEFINE_EOFBLOCKS_EVENT(xfs_inodegc_free_space);
 
 #define XFS_SWAPEXT_STRINGS \
 	{ XFS_SWAPEXT_SET_SIZES,		"SETSIZES" }, \
-	{ XFS_SWAPEXT_SKIP_FILE1_HOLES,		"SKIP_FILE1_HOLES" }
+	{ XFS_SWAPEXT_SKIP_FILE1_HOLES,		"SKIP_FILE1_HOLES" }, \
+	{ XFS_SWAPEXT_INO2_SHORTFORM,		"INO2_SHORTFORM" }
 
 TRACE_EVENT(xfs_swapext_estimate,
 	TP_PROTO(const struct xfs_swapext_req *req,
@@ -4010,7 +4011,8 @@ TRACE_EVENT(xfs_swapext_estimate,
 #define XFS_SWAP_EXTENT_STRINGS \
 	{ XFS_SWAP_EXTENT_ATTR_FORK,		"ATTRFORK" }, \
 	{ XFS_SWAP_EXTENT_SET_SIZES,		"SETSIZES" }, \
-	{ XFS_SWAP_EXTENT_SKIP_FILE1_HOLES,	"SKIP_FILE1_HOLES" }
+	{ XFS_SWAP_EXTENT_SKIP_FILE1_HOLES,	"SKIP_FILE1_HOLES" }, \
+	{ XFS_SWAP_EXTENT_INO2_SHORTFORM,	"INO2_SHORTFORM" }
 
 TRACE_EVENT(xfs_swapext_defer,
 	TP_PROTO(struct xfs_mount *mp, const struct xfs_swapext_intent *sxi),



* [PATCH 16/18] xfs: condense directories after an atomic swap
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (14 preceding siblings ...)
  2021-04-01  1:10 ` [PATCH 15/18] xfs: condense extended attributes after an atomic swap Darrick J. Wong
@ 2021-04-01  1:10 ` Darrick J. Wong
  2021-04-01  1:10 ` [PATCH 17/18] xfs: make atomic extent swapping support realtime files Darrick J. Wong
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new swapext flag that enables us to perform
post-swap processing on file2 once we're done swapping the extent maps.
Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and swap the data forks
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the swap.
After the swap, we can try to condense the fixed directory down to
inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_swapext.c |   34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 964af61c9e5d..41042ee05e40 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -25,6 +25,7 @@
 #include "xfs_da_format.h"
 #include "xfs_da_btree.h"
 #include "xfs_attr_leaf.h"
+#include "xfs_dir2_priv.h"
 
 /* Information to help us reset reflink flag / CoW fork state after a swap. */
 
@@ -232,6 +233,37 @@ xfs_swapext_attr_to_shortform2(
 /* Mask of all flags that require post-processing of file2. */
 #define XFS_SWAP_EXTENT_POST_PROCESSING (XFS_SWAP_EXTENT_INO2_SHORTFORM)
 
+/* Convert inode2's block dir fork back to shortform, if possible. */
+STATIC int
+xfs_swapext_dir_to_shortform2(
+	struct xfs_trans		*tp,
+	struct xfs_swapext_intent	*sxi)
+{
+	struct xfs_da_args	args = {
+		.dp		= sxi->sxi_ip2,
+		.geo		= tp->t_mountp->m_dir_geo,
+		.whichfork	= XFS_DATA_FORK,
+		.trans		= tp,
+	};
+	struct xfs_dir2_sf_hdr	sfh;
+	struct xfs_buf		*bp;
+	int			size;
+	int			error;
+
+	if (!xfs_bmap_one_block(sxi->sxi_ip2, XFS_DATA_FORK))
+		return 0;
+
+	error = xfs_dir3_block_read(tp, sxi->sxi_ip2, &bp);
+	if (error)
+		return error;
+
+	size = xfs_dir2_block_sfsize(sxi->sxi_ip2, bp->b_addr, &sfh);
+	if (size > XFS_IFORK_DSIZE(sxi->sxi_ip2))
+		return 0;
+
+	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
+}
+
 /* Do we have more work to do to finish this operation? */
 bool
 xfs_swapext_has_more_work(
@@ -321,6 +353,8 @@ xfs_swapext_finish_one(
 		if (sxi->sxi_flags & XFS_SWAP_EXTENT_INO2_SHORTFORM) {
 			if (sxi->sxi_flags & XFS_SWAP_EXTENT_ATTR_FORK)
 				error = xfs_swapext_attr_to_shortform2(tp, sxi);
+			else if (S_ISDIR(VFS_I(sxi->sxi_ip2)->i_mode))
+				error = xfs_swapext_dir_to_shortform2(tp, sxi);
 			sxi->sxi_flags &= ~XFS_SWAP_EXTENT_INO2_SHORTFORM;
 			return error;
 		}



* [PATCH 17/18] xfs: make atomic extent swapping support realtime files
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (15 preceding siblings ...)
  2021-04-01  1:10 ` [PATCH 16/18] xfs: condense directories " Darrick J. Wong
@ 2021-04-01  1:10 ` Darrick J. Wong
  2021-04-01  1:10 ` [PATCH 18/18] xfs: enable atomic swapext feature Darrick J. Wong
  2021-04-01  3:56 ` [PATCHSET RFC v3 00/18] xfs: atomic file updates Amir Goldstein
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Now that bmap items support the realtime device, we can add the
necessary pieces to the atomic extent swapping code to support such
things.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h  |    7 ++--
 fs/xfs/libxfs/xfs_swapext.c |   67 ++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 68 insertions(+), 6 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index e81a7b12a0e3..15d967414500 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -611,13 +611,12 @@ static inline bool xfs_sb_version_needsrepair(struct xfs_sb *sbp)
 /*
  * Decide if this filesystem can use log-assisted ("atomic") extent swapping.
  * The atomic swap log intent items depend on the block mapping log intent
- * items introduced with reflink and rmap.  Realtime is not supported yet.
+ * items introduced with reflink and rmap.
  */
 static inline bool xfs_sb_version_canatomicswap(struct xfs_sb *sbp)
 {
-	return (xfs_sb_version_hasreflink(sbp) ||
-		xfs_sb_version_hasrmapbt(sbp)) &&
-		!xfs_sb_version_hasrealtime(sbp);
+	return  xfs_sb_version_hasreflink(sbp) ||
+		xfs_sb_version_hasrmapbt(sbp);
 }
 
 static inline bool xfs_sb_version_hasatomicswap(struct xfs_sb *sbp)
diff --git a/fs/xfs/libxfs/xfs_swapext.c b/fs/xfs/libxfs/xfs_swapext.c
index 41042ee05e40..995b59d86d79 100644
--- a/fs/xfs/libxfs/xfs_swapext.c
+++ b/fs/xfs/libxfs/xfs_swapext.c
@@ -279,7 +279,14 @@ xfs_swapext_check_extents(
 	struct xfs_mount		*mp,
 	const struct xfs_swapext_req	*req)
 {
+	struct xfs_bmbt_irec		irec1, irec2;
 	struct xfs_ifork		*ifp1, *ifp2;
+	xfs_fileoff_t			startoff1 = req->startoff1;
+	xfs_fileoff_t			startoff2 = req->startoff2;
+	xfs_filblks_t			blockcount = req->blockcount;
+	uint32_t			mod;
+	int				nimaps;
+	int				error;
 
 	/* No fork? */
 	ifp1 = XFS_IFORK_PTR(req->ip1, req->whichfork);
@@ -292,12 +299,68 @@ xfs_swapext_check_extents(
 	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
 		return -EINVAL;
 
-	/* We don't support realtime data forks yet. */
+	/*
+	 * There may be partially written rt extents lurking in the ranges to
+	 * be swapped.  If we support atomic swapext log intent items, we can
+	 * guarantee that operations will always finish and never leave an rt
+	 * extent partially mapped to two files, and can move on.  If we don't
+	 * have that coordination, we have to scan both ranges to ensure that
+	 * there are no partially written extents.
+	 */
 	if (!XFS_IS_REALTIME_INODE(req->ip1))
 		return 0;
 	if (req->whichfork == XFS_ATTR_FORK)
 		return 0;
-	return -EINVAL;
+	if (xfs_sb_version_hasatomicswap(&mp->m_sb))
+		return 0;
+	if (mp->m_sb.sb_rextsize == 1)
+		return 0;
+
+	while (blockcount > 0) {
+		/* Read extent from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip1, startoff1, blockcount,
+				&irec1, &nimaps, 0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/* Read extent from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(req->ip2, startoff2,
+				irec1.br_blockcount, &irec2, &nimaps,
+				0);
+		if (error)
+			return error;
+		ASSERT(nimaps == 1);
+
+		/*
+		 * We can only swap as many blocks as the smaller of the two
+		 * extent maps.
+		 */
+		irec1.br_blockcount = min(irec1.br_blockcount,
+					  irec2.br_blockcount);
+
+		/*
+		 * Both mappings must be aligned to the realtime extent size
+		 * if either mapping comes from the realtime volume.
+		 */
+		div_u64_rem(irec1.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod)
+			return -EINVAL;
+		div_u64_rem(irec2.br_startoff, mp->m_sb.sb_rextsize, &mod);
+		if (mod)
+			return -EINVAL;
+		div_u64_rem(irec1.br_blockcount, mp->m_sb.sb_rextsize, &mod);
+		if (mod)
+			return -EINVAL;
+
+		startoff1 += irec1.br_blockcount;
+		startoff2 += irec1.br_blockcount;
+		blockcount -= irec1.br_blockcount;
+	}
+
+	return 0;
 }
 
 #ifdef CONFIG_XFS_QUOTA
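
The alignment rule enforced by the loop above can be modelled outside the kernel. This is a minimal sketch, not the patch's code: struct mapping and rt_mappings_aligned() are hypothetical stand-ins for struct xfs_bmbt_irec and the three div_u64_rem() checks, and all values are in filesystem blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_bmbt_irec: one mapping, in blocks. */
struct mapping {
	uint64_t	startoff;	/* file offset of the mapping */
	uint64_t	blockcount;	/* length of the mapping */
};

/*
 * Return true if both mappings start on a realtime extent boundary and the
 * swappable length (the shorter of the two mappings) is a whole number of
 * realtime extents.  rextsize is the realtime extent size in blocks.
 */
static bool
rt_mappings_aligned(
	struct mapping	m1,
	struct mapping	m2,
	uint32_t	rextsize)
{
	uint64_t	len;

	assert(rextsize > 0);

	/* Only as many blocks as the smaller mapping can be swapped. */
	len = m1.blockcount < m2.blockcount ? m1.blockcount : m2.blockcount;

	return m1.startoff % rextsize == 0 &&
	       m2.startoff % rextsize == 0 &&
	       len % rextsize == 0;
}
```

With a realtime extent size of 4 blocks, mappings at offsets 0 and 16 of length 8 pass the check, while a mapping starting at offset 2, or a swappable length of 6 blocks, would correspond to the -EINVAL returns in the patch.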



* [PATCH 18/18] xfs: enable atomic swapext feature
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (16 preceding siblings ...)
  2021-04-01  1:10 ` [PATCH 17/18] xfs: make atomic extent swapping support realtime files Darrick J. Wong
@ 2021-04-01  1:10 ` Darrick J. Wong
  2021-04-01  3:56 ` [PATCHSET RFC v3 00/18] xfs: atomic file updates Amir Goldstein
  18 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01  1:10 UTC (permalink / raw)
  To: djwong; +Cc: linux-xfs, linux-fsdevel, linux-api

From: Darrick J. Wong <djwong@kernel.org>

Add the atomic swapext feature to the set of features that we will
permit.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 15d967414500..c6e3316dd861 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -486,7 +486,8 @@ xfs_sb_has_incompat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP (1 << 0)
-#define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
+#define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
+		(XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(
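
The effect of widening XFS_SB_FEAT_INCOMPAT_LOG_ALL can be shown with a small user-space model of the mask arithmetic. This is only a sketch: has_unknown_bits() and the DEMO_ names are hypothetical, and the sketch assumes the kernel treats any bit outside the ALL mask as unknown, as the XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN definition suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical copies of the masks before and after this patch. */
#define DEMO_LOG_ATOMIC_SWAP	(1u << 0)
#define DEMO_LOG_ALL_BEFORE	0u
#define DEMO_LOG_ALL_AFTER	(DEMO_LOG_ATOMIC_SWAP)

/* Return true if features contains any bit that is not in the known mask. */
static bool
has_unknown_bits(
	uint32_t	features,
	uint32_t	known)
{
	/* The known mask must not be all ones, or nothing could be unknown. */
	assert(known != UINT32_MAX);

	return (features & ~known) != 0;
}
```

Under this model, a superblock with the atomic swap bit set counts as having an unknown feature against the old mask and as fully known against the new one, which is what "permitting" the feature amounts to.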



* Re: [PATCH 01/18] vfs: introduce new file range exchange ioctl
  2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
@ 2021-04-01  1:44   ` Al Viro
  2021-04-01 21:18     ` Darrick J. Wong
  2021-04-01  3:32   ` Amir Goldstein
  1 sibling, 1 reply; 30+ messages in thread
From: Al Viro @ 2021-04-01  1:44 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api

On Wed, Mar 31, 2021 at 06:08:52PM -0700, Darrick J. Wong wrote:

> +	ret = vfs_xchg_file_range(file1.file, file2, &args);
> +	if (ret)
> +		goto fdput;
> +
> +	/*
> +	 * The VFS will set RANGE_FSYNC on its own if the file or inode require
> +	 * synchronous writes.  Don't leak this back to userspace.
> +	 */
> +	args.flags &= ~FILE_XCHG_RANGE_FSYNC;
> +	args.flags |= (old_flags & FILE_XCHG_RANGE_FSYNC);
> +
> +	if (copy_to_user(argp, &args, sizeof(args)))
> +		ret = -EFAULT;

Erm...  How is userland supposed to figure out whether that EFAULT
came before or after the operation?  Which of the fields are outputs,
anyway?

> +	/* Don't touch certain kinds of inodes */
> +	if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
> +		return -EPERM;

Append-only should get the same treatment (and IMO if you have
O_APPEND on either file, you should get a failure as well).


* Re: [PATCH 01/18] vfs: introduce new file range exchange ioctl
  2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
  2021-04-01  1:44   ` Al Viro
@ 2021-04-01  3:32   ` Amir Goldstein
  2021-04-02  0:37     ` Darrick J. Wong
  1 sibling, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2021-04-01  3:32 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, Linux API

On Thu, Apr 1, 2021 at 4:13 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> Introduce a new ioctl to handle swapping ranges of bytes between files.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  Documentation/filesystems/vfs.rst |   16 ++
>  fs/ioctl.c                        |   42 +++++
>  fs/remap_range.c                  |  283 +++++++++++++++++++++++++++++++++++++
>  fs/xfs/libxfs/xfs_fs.h            |    1
>  include/linux/fs.h                |   14 ++
>  include/uapi/linux/fiexchange.h   |  101 +++++++++++++
>  6 files changed, 456 insertions(+), 1 deletion(-)
>  create mode 100644 include/uapi/linux/fiexchange.h
>
>
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 2049bbf5e388..9f16b260bc7e 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -1006,6 +1006,8 @@ This describes how the VFS can manipulate an open file.  As of kernel
>                 loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
>                                            struct file *file_out, loff_t pos_out,
>                                            loff_t len, unsigned int remap_flags);
> +                int (*xchg_file_range)(struct file *file1, struct file *file2,
> +                                       struct file_xchg_range *fxr);

An obvious question: why is the xchg_file_range op not using the
unified remap_file_range() method with REMAP_XCHG_ flags?
Surely replacing the remap_flags arg with struct file_remap_range.

I went to look for reasons and I didn't find them.
Can you share your reasons for that?

Thanks,
Amir.


* Re: [PATCHSET RFC v3 00/18] xfs: atomic file updates
  2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
                   ` (17 preceding siblings ...)
  2021-04-01  1:10 ` [PATCH 18/18] xfs: enable atomic swapext feature Darrick J. Wong
@ 2021-04-01  3:56 ` Amir Goldstein
  2021-04-02  0:22   ` Darrick J. Wong
  18 siblings, 1 reply; 30+ messages in thread
From: Amir Goldstein @ 2021-04-01  3:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, Linux API

On Thu, Apr 1, 2021 at 4:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically.  This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes.  A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file.  If the updates are aligned with the file block
> size, a new (since v2) flag provides for exchanging only the written
> areas.  Callers can arrange for the update to be rejected if the
> original file has been changed.
>
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory, fsync the new
> file, rename the new file on top of the old filename, and then fsync the
> directory.  People get it wrong all the time, and $fs hacks abound.
> Here is the proposed manual page:
>

I like the idea of modernizing FIEXCHANGE_RANGE very much and
I think that the improved implementation and new(?) flags will be very
useful just the way you designed them, but maybe something to consider...

Taking a step back and ignoring the existing xfs ioctl, all the use cases
that you listed actually want MOVE_RANGE not exchange range.
No listed use case does anything with the old data except dump it in the
trash bin. Right?

I do realize that implementing atomic extent exchange was easier back
when that ioctl was implemented for xfs and ext4 and I realize that
deferring inode unlink was much simpler to implement than deferred
extent freeing, but seeing how punch hole and dedupe range already
need to deal with freeing target inode extents, it is not obvious to me that
atomically freeing the target inode extents instead of exchange is a bad idea
(given the appropriate opt-in flags).

Is there a good reason for keeping the "freeing old blocks with unlink"
strategy the only option?

Thanks,
Amir.


* Re: [PATCH 01/18] vfs: introduce new file range exchange ioctl
  2021-04-01  1:44   ` Al Viro
@ 2021-04-01 21:18     ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-01 21:18 UTC (permalink / raw)
  To: Al Viro; +Cc: linux-xfs, linux-fsdevel, linux-api

On Thu, Apr 01, 2021 at 01:44:10AM +0000, Al Viro wrote:
> On Wed, Mar 31, 2021 at 06:08:52PM -0700, Darrick J. Wong wrote:
> 
> > +	ret = vfs_xchg_file_range(file1.file, file2, &args);
> > +	if (ret)
> > +		goto fdput;
> > +
> > +	/*
> > +	 * The VFS will set RANGE_FSYNC on its own if the file or inode require
> > +	 * synchronous writes.  Don't leak this back to userspace.
> > +	 */
> > +	args.flags &= ~FILE_XCHG_RANGE_FSYNC;
> > +	args.flags |= (old_flags & FILE_XCHG_RANGE_FSYNC);
> > +
> > +	if (copy_to_user(argp, &args, sizeof(args)))
> > +		ret = -EFAULT;
> 
> Erm...  How is userland supposed to figure out whether that EFAULT
> came before or after the operation?  Which of the fields are outputs,
> anyway?

Come to think of it, none of the fields are outputs, so this whole block
can go away.  Thanks for noticing that. :)

> > +	/* Don't touch certain kinds of inodes */
> > +	if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
> > +		return -EPERM;
> 
> Append-only should get the same treatment (and IMO if you have

Assuming you meant IS_APPEND, I thought we only checked that at open
time, as part of requiring O_APPEND?

> O_APPEND on either file, you should get a failure as well).

generic_rw_checks (which is called by do_xchg_file_range) will send back
-EBADF if the file descriptors are O_APPEND.

--D
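
A caller that wants to avoid the -EBADF case described above can test its descriptors before issuing the ioctl. The following is a user-space sketch only: check_not_append() is a hypothetical helper that mirrors the described behaviour with fcntl(F_GETFL), and it is not part of the proposed interface.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>

/*
 * Hypothetical pre-check: return 0 if fd was opened without O_APPEND,
 * -EBADF if it was opened with O_APPEND, or -errno if the flags cannot
 * be queried at all.
 */
static int
check_not_append(
	int	fd)
{
	int	flags;

	flags = fcntl(fd, F_GETFL);
	if (flags < 0)
		return -errno;

	/* An O_APPEND descriptor would be rejected with -EBADF. */
	if (flags & O_APPEND)
		return -EBADF;

	/* The return value is either zero or a negative errno. */
	assert(!(flags & O_APPEND));
	return 0;
}
```

This only covers the open flags of the file descriptor. The append-only inode attribute that Al Viro raised (IS_APPEND) is a separate property and is not visible through F_GETFL.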


* Re: [PATCHSET RFC v3 00/18] xfs: atomic file updates
  2021-04-01  3:56 ` [PATCHSET RFC v3 00/18] xfs: atomic file updates Amir Goldstein
@ 2021-04-02  0:22   ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-02  0:22 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, linux-fsdevel, Linux API

On Thu, Apr 01, 2021 at 06:56:20AM +0300, Amir Goldstein wrote:
> On Thu, Apr 1, 2021 at 4:14 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.  If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas.  Callers can arrange for the update to be rejected if the
> > original file has been changed.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> > Here is the proposed manual page:
> >
> 
> I like the idea of modernizing FIEXCHANGE_RANGE very much and
> I think that the improved implementation and new(?) flags will be very
> useful just the way you designed them, but maybe something to consider...
> 
> Taking a step back and ignoring the existing xfs ioctl, all the use cases
> that you listed actually want MOVE_RANGE not exchange range.
> No listed use case does anything with the old data except dump it in the
> trash bin. Right?

The three listed in the manpage don't do anything with the blocks.

However, there is use case #4: online filesystem repair, where we want to
be able to construct a new metadata file/directory/xattr tree, exchange
the new contents with the old, and still have the old contents attached
to the file so that we can (very carefully) tear down the internal
buffer caches and other.  For /that/ use case, we require truncation to
be a separate step.

> I do realize that implementing atomic extent exchange was easier back
> when that ioctl was implemented for xfs and ext4 and I realize that
> deferring inode unlink was much simpler to implement than deferred
> extent freeing, but seeing how punch hole and dedupe range already
> need to deal with freeing target inode extents, it is not obvious to me that
> atomically freeing the target inode extents instead of exchange is a bad idea
> (given the appropriate opt-in flags).
> 
> Is there a good reason for keeping the "freeing old blocks with unlink"
> strategy the only option?

Making userspace take the extra step of deciding what to do with the
tempfile (and when!) after the operation reduces the amount of work that
has to be done in the hot path, since we know that the only work we need
to do is switch the mappings (and the reverse mappings).

If this became a move operation where we drop the file2 blocks, it would
be necessary to traverse the refcount btree to see if the blocks are
shared, update the refcount btree, and possibly update the free space
btrees as well.  The current design permits us to skip all that, which
is all the more useful if the operation is synchronous.

Consider also that inactivation of inodes will soon become a background
operation in XFS, which means that userspace soon won't even have to
wait for that part.

--D

> 
> Thanks,
> Amir.
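
For comparison, the rename-based sequence that the cover letter says people get wrong (write a new file in the same directory, fsync it, rename it over the old name, fsync the directory) looks roughly like this in user space. It is a sketch under assumptions: replace_file_atomically() and the ".name.tmp" naming are hypothetical, it replaces whole files rather than ranges, and it does not retry on EINTR.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Replace dir/name with len bytes from buf.  Returns 0 or -1. */
static int
replace_file_atomically(
	const char	*dir,
	const char	*name,
	const void	*buf,
	size_t		len)
{
	char		tmp[4096], dst[4096];
	const char	*p = buf;
	ssize_t		n;
	int		fd, dirfd, ret;

	assert(dir != NULL && name != NULL && buf != NULL);

	ret = snprintf(tmp, sizeof(tmp), "%s/.%s.tmp", dir, name);
	if (ret < 0 || (size_t)ret >= sizeof(tmp))
		return -1;
	ret = snprintf(dst, sizeof(dst), "%s/%s", dir, name);
	if (ret < 0 || (size_t)ret >= sizeof(dst))
		return -1;

	/* Step 1: write the new contents to a temp file in the same dir. */
	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;
	while (len > 0) {
		n = write(fd, p, len);
		if (n < 0)
			goto out_unlink;
		p += n;
		len -= (size_t)n;
	}

	/* Step 2: make the new contents durable before exposing them. */
	if (fsync(fd) < 0)
		goto out_unlink;
	ret = close(fd);
	fd = -1;
	if (ret < 0)
		goto out_unlink;

	/* Step 3: atomically switch the name over to the new file. */
	if (rename(tmp, dst) < 0)
		goto out_unlink;

	/* Step 4: make the rename itself durable. */
	dirfd = open(dir, O_RDONLY);
	if (dirfd < 0)
		return -1;
	ret = fsync(dirfd);
	close(dirfd);
	return ret < 0 ? -1 : 0;

out_unlink:
	if (fd >= 0)
		close(fd);
	unlink(tmp);
	return -1;
}
```

Each of the four steps is a place where an application can go wrong, and the whole file must be rewritten even for a small change. That is the overhead the exchange-range approach aims to remove.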


* Re: [PATCH 01/18] vfs: introduce new file range exchange ioctl
  2021-04-01  3:32   ` Amir Goldstein
@ 2021-04-02  0:37     ` Darrick J. Wong
  0 siblings, 0 replies; 30+ messages in thread
From: Darrick J. Wong @ 2021-04-02  0:37 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-xfs, linux-fsdevel, Linux API

On Thu, Apr 01, 2021 at 06:32:02AM +0300, Amir Goldstein wrote:
> On Thu, Apr 1, 2021 at 4:13 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > Introduce a new ioctl to handle swapping ranges of bytes between files.
> >
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  Documentation/filesystems/vfs.rst |   16 ++
> >  fs/ioctl.c                        |   42 +++++
> >  fs/remap_range.c                  |  283 +++++++++++++++++++++++++++++++++++++
> >  fs/xfs/libxfs/xfs_fs.h            |    1
> >  include/linux/fs.h                |   14 ++
> >  include/uapi/linux/fiexchange.h   |  101 +++++++++++++
> >  6 files changed, 456 insertions(+), 1 deletion(-)
> >  create mode 100644 include/uapi/linux/fiexchange.h
> >
> >
> > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> > index 2049bbf5e388..9f16b260bc7e 100644
> > --- a/Documentation/filesystems/vfs.rst
> > +++ b/Documentation/filesystems/vfs.rst
> > @@ -1006,6 +1006,8 @@ This describes how the VFS can manipulate an open file.  As of kernel
> >                 loff_t (*remap_file_range)(struct file *file_in, loff_t pos_in,
> >                                            struct file *file_out, loff_t pos_out,
> >                                            loff_t len, unsigned int remap_flags);
> > +                int (*xchg_file_range)(struct file *file1, struct file *file2,
> > +                                       struct file_xchg_range *fxr);
> 
> An obvious question: why is the xchg_file_range op not using the
> unified remap_file_range() method with REMAP_XCHG_ flags?
> Surely replacing the remap_flags arg with struct file_remap_range.
> 
> I went to look for reasons and I didn't find them.
> Can you share your reasons for that?

Code simplicity.  The file2 freshness parameters don't apply to clone or
dedupe, and the current set of remap flags don't apply to exchange.  I'd
have to hunt down all the ->remap_range implementations and modify them
to error out on REMAP_FILE_EXCHANGE.  Multiplexing flags in this manner
would also require additional remap_flags interpretation code to
safeguard against callers who mix up which flags go with what piece of
functionality.

IOW: it's not hard to do, but not something I want to do for an RFC
because the goal here is to gauge interest in having a userspace
interface at all.  Until I get to that point, tangling up the code
diverts my time towards rebasing and dealing with merge conflicts, at
the cost of time I can spend concentrating on making the algorithms
right.

--D

> Thanks,
> Amir.


* Re: [PATCH 02/18] xfs: support two inodes in the defer capture structure
  2021-04-01  1:08 ` [PATCH 02/18] xfs: support two inodes in the defer capture structure Darrick J. Wong
@ 2021-04-02 23:20   ` Allison Henderson
  0 siblings, 0 replies; 30+ messages in thread
From: Allison Henderson @ 2021-04-02 23:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 3/31/21 6:08 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Make it so that xfs_defer_ops_capture_and_commit can capture two inodes.
> This will be needed by the atomic extent swap log item so that it can
> recover an operation involving two inodes.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>   fs/xfs/libxfs/xfs_defer.c  |   48 ++++++++++++++++++++++++++++++--------------
>   fs/xfs/libxfs/xfs_defer.h  |    9 ++++++--
>   fs/xfs/xfs_bmap_item.c     |    2 +-
>   fs/xfs/xfs_extfree_item.c  |    2 +-
>   fs/xfs/xfs_log_recover.c   |   14 ++++++++-----
>   fs/xfs/xfs_refcount_item.c |    2 +-
>   fs/xfs/xfs_rmap_item.c     |    2 +-
>   7 files changed, 52 insertions(+), 27 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
> index eff4a127188e..a7d1357687d0 100644
> --- a/fs/xfs/libxfs/xfs_defer.c
> +++ b/fs/xfs/libxfs/xfs_defer.c
> @@ -628,7 +628,8 @@ xfs_defer_move(
>   static struct xfs_defer_capture *
>   xfs_defer_ops_capture(
>   	struct xfs_trans		*tp,
> -	struct xfs_inode		*capture_ip)
> +	struct xfs_inode		*capture_ip1,
> +	struct xfs_inode		*capture_ip2)
>   {
>   	struct xfs_defer_capture	*dfc;
>   
> @@ -658,9 +659,13 @@ xfs_defer_ops_capture(
>   	 * Grab an extra reference to this inode and attach it to the capture
>   	 * structure.
>   	 */
> -	if (capture_ip) {
> -		ihold(VFS_I(capture_ip));
> -		dfc->dfc_capture_ip = capture_ip;
> +	if (capture_ip1) {
> +		ihold(VFS_I(capture_ip1));
> +		dfc->dfc_capture_ip1 = capture_ip1;
> +	}
> +	if (capture_ip2 && capture_ip2 != capture_ip1) {
> +		ihold(VFS_I(capture_ip2));
> +		dfc->dfc_capture_ip2 = capture_ip2;
>   	}
>   
>   	return dfc;
> @@ -673,8 +678,10 @@ xfs_defer_ops_release(
>   	struct xfs_defer_capture	*dfc)
>   {
>   	xfs_defer_cancel_list(mp, &dfc->dfc_dfops);
> -	if (dfc->dfc_capture_ip)
> -		xfs_irele(dfc->dfc_capture_ip);
> +	if (dfc->dfc_capture_ip1)
> +		xfs_irele(dfc->dfc_capture_ip1);
> +	if (dfc->dfc_capture_ip2)
> +		xfs_irele(dfc->dfc_capture_ip2);
>   	kmem_free(dfc);
>   }
>   
> @@ -684,22 +691,26 @@ xfs_defer_ops_release(
>    * of the deferred ops operate on an inode, the caller must pass in that inode
>    * so that the reference can be transferred to the capture structure.  The
>    * caller must hold ILOCK_EXCL on the inode, and must unlock it before calling
> - * xfs_defer_ops_continue.
> + * xfs_defer_ops_continue.  Do not pass a null capture_ip1 and a non-null
> + * capture_ip2.
>    */
>   int
>   xfs_defer_ops_capture_and_commit(
>   	struct xfs_trans		*tp,
> -	struct xfs_inode		*capture_ip,
> +	struct xfs_inode		*capture_ip1,
> +	struct xfs_inode		*capture_ip2,
>   	struct list_head		*capture_list)
>   {
>   	struct xfs_mount		*mp = tp->t_mountp;
>   	struct xfs_defer_capture	*dfc;
>   	int				error;
>   
> -	ASSERT(!capture_ip || xfs_isilocked(capture_ip, XFS_ILOCK_EXCL));
> +	ASSERT(!capture_ip1 || xfs_isilocked(capture_ip1, XFS_ILOCK_EXCL));
> +	ASSERT(!capture_ip2 || xfs_isilocked(capture_ip2, XFS_ILOCK_EXCL));
> +	ASSERT(capture_ip2 == NULL || capture_ip1 != NULL);
>   
>   	/* If we don't capture anything, commit transaction and exit. */
> -	dfc = xfs_defer_ops_capture(tp, capture_ip);
> +	dfc = xfs_defer_ops_capture(tp, capture_ip1, capture_ip2);
>   	if (!dfc)
>   		return xfs_trans_commit(tp);
>   
> @@ -724,17 +735,24 @@ void
>   xfs_defer_ops_continue(
>   	struct xfs_defer_capture	*dfc,
>   	struct xfs_trans		*tp,
> -	struct xfs_inode		**captured_ipp)
> +	struct xfs_inode		**captured_ipp1,
> +	struct xfs_inode		**captured_ipp2)
>   {
>   	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
>   	ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY));
>   
>   	/* Lock and join the captured inode to the new transaction. */
> -	if (dfc->dfc_capture_ip) {
> -		xfs_ilock(dfc->dfc_capture_ip, XFS_ILOCK_EXCL);
> -		xfs_trans_ijoin(tp, dfc->dfc_capture_ip, 0);
> +	if (dfc->dfc_capture_ip1 && dfc->dfc_capture_ip2) {
> +		xfs_lock_two_inodes(dfc->dfc_capture_ip1, XFS_ILOCK_EXCL,
> +				    dfc->dfc_capture_ip2, XFS_ILOCK_EXCL);
> +		xfs_trans_ijoin(tp, dfc->dfc_capture_ip1, 0);
> +		xfs_trans_ijoin(tp, dfc->dfc_capture_ip2, 0);
> +	} else if (dfc->dfc_capture_ip1) {
> +		xfs_ilock(dfc->dfc_capture_ip1, XFS_ILOCK_EXCL);
> +		xfs_trans_ijoin(tp, dfc->dfc_capture_ip1, 0);
>   	}
> -	*captured_ipp = dfc->dfc_capture_ip;
> +	*captured_ipp1 = dfc->dfc_capture_ip1;
> +	*captured_ipp2 = dfc->dfc_capture_ip2;
>   
>   	/* Move captured dfops chain and state to the transaction. */
>   	list_splice_init(&dfc->dfc_dfops, &tp->t_dfops);
> diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
> index 05472f71fffe..f5e3ca17aa26 100644
> --- a/fs/xfs/libxfs/xfs_defer.h
> +++ b/fs/xfs/libxfs/xfs_defer.h
> @@ -87,7 +87,8 @@ struct xfs_defer_capture {
>   	 * An inode reference that must be maintained to complete the deferred
>   	 * work.
>   	 */
> -	struct xfs_inode	*dfc_capture_ip;
> +	struct xfs_inode	*dfc_capture_ip1;
> +	struct xfs_inode	*dfc_capture_ip2;
>   };
>   
>   /*
> @@ -95,9 +96,11 @@ struct xfs_defer_capture {
>    * This doesn't normally happen except log recovery.
>    */
>   int xfs_defer_ops_capture_and_commit(struct xfs_trans *tp,
> -		struct xfs_inode *capture_ip, struct list_head *capture_list);
> +		struct xfs_inode *capture_ip1, struct xfs_inode *capture_ip2,
> +		struct list_head *capture_list);
>   void xfs_defer_ops_continue(struct xfs_defer_capture *d, struct xfs_trans *tp,
> -		struct xfs_inode **captured_ipp);
> +		struct xfs_inode **captured_ipp1,
> +		struct xfs_inode **captured_ipp2);
>   void xfs_defer_ops_release(struct xfs_mount *mp, struct xfs_defer_capture *d);
>   
>   #endif /* __XFS_DEFER_H__ */
> diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
> index 895a56b16029..bba73ddd0585 100644
> --- a/fs/xfs/xfs_bmap_item.c
> +++ b/fs/xfs/xfs_bmap_item.c
> @@ -551,7 +551,7 @@ xfs_bui_item_recover(
>   	 * Commit transaction, which frees the transaction and saves the inode
>   	 * for later replay activities.
>   	 */
> -	error = xfs_defer_ops_capture_and_commit(tp, ip, capture_list);
> +	error = xfs_defer_ops_capture_and_commit(tp, ip, NULL, capture_list);
>   	if (error)
>   		goto err_unlock;
>   
> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
> index c767918c0c3f..ebfc7de8083e 100644
> --- a/fs/xfs/xfs_extfree_item.c
> +++ b/fs/xfs/xfs_extfree_item.c
> @@ -632,7 +632,7 @@ xfs_efi_item_recover(
>   
>   	}
>   
> -	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
> +	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
>   
>   abort_error:
>   	xfs_trans_cancel(tp);
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index b227a6ad9f5d..ce1a7928eb2d 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -2439,7 +2439,7 @@ xlog_finish_defer_ops(
>   {
>   	struct xfs_defer_capture *dfc, *next;
>   	struct xfs_trans	*tp;
> -	struct xfs_inode	*ip;
> +	struct xfs_inode	*ip1, *ip2;
>   	int			error = 0;
>   
>   	list_for_each_entry_safe(dfc, next, capture_list, dfc_list) {
> @@ -2465,12 +2465,16 @@ xlog_finish_defer_ops(
>   		 * from recovering a single intent item.
>   		 */
>   		list_del_init(&dfc->dfc_list);
> -		xfs_defer_ops_continue(dfc, tp, &ip);
> +		xfs_defer_ops_continue(dfc, tp, &ip1, &ip2);
>   
>   		error = xfs_trans_commit(tp);
> -		if (ip) {
> -			xfs_iunlock(ip, XFS_ILOCK_EXCL);
> -			xfs_irele(ip);
> +		if (ip1) {
> +			xfs_iunlock(ip1, XFS_ILOCK_EXCL);
> +			xfs_irele(ip1);
> +		}
> +		if (ip2) {
> +			xfs_iunlock(ip2, XFS_ILOCK_EXCL);
> +			xfs_irele(ip2);
>   		}
>   		if (error)
>   			return error;
> diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
> index 07ebccbbf4df..427d8259a36d 100644
> --- a/fs/xfs/xfs_refcount_item.c
> +++ b/fs/xfs/xfs_refcount_item.c
> @@ -554,7 +554,7 @@ xfs_cui_item_recover(
>   	}
>   
>   	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> -	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
> +	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
>   
>   abort_error:
>   	xfs_refcount_finish_one_cleanup(tp, rcur, error);
> diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
> index 49cebd68b672..deb852a3c5f6 100644
> --- a/fs/xfs/xfs_rmap_item.c
> +++ b/fs/xfs/xfs_rmap_item.c
> @@ -584,7 +584,7 @@ xfs_rui_item_recover(
>   	}
>   
>   	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> -	return xfs_defer_ops_capture_and_commit(tp, NULL, capture_list);
> +	return xfs_defer_ops_capture_and_commit(tp, NULL, NULL, capture_list);
>   
>   abort_error:
>   	xfs_rmap_finish_one_cleanup(tp, rcur, error);
> 


* Re: [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags
  2021-04-01  1:09 ` [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags Darrick J. Wong
@ 2021-04-02 23:20   ` Allison Henderson
  0 siblings, 0 replies; 30+ messages in thread
From: Allison Henderson @ 2021-04-02 23:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 3/31/21 6:09 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Log incompat feature flags in the superblock exist for one purpose: to
> protect the contents of a dirty log from replay on a kernel that isn't
> prepared to handle those dirty contents.  This means that they can be
> cleared if (a) we know the log is clean and (b) we know that there
> aren't any other threads in the system that might be setting or relying
> upon a log incompat flag.
> 
> Therefore, clear the log incompat flags when we've finished recovering
> the log, when we're unmounting cleanly, remounting read-only, or
> freezing; and provide a function so that subsequent patches can start
> using this.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Ok, seems reasonable
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
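
For anyone following along, the rule in the commit message (flags may be cleared only when the log is clean and nobody else is setting or relying on them) can be modelled in a few lines of userspace C. This is only a sketch under stated assumptions: struct sb_model, its log_is_clean and active_users fields, and both helper names are invented for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the incore superblock; not the kernel struct. */
struct sb_model {
	uint32_t	features_log_incompat;
	bool		log_is_clean;	/* (a) nothing left to replay */
	unsigned int	active_users;	/* (b) threads relying on a flag */
};

/* Set exactly one feature bit, like the hweight32() == 1 assertion. */
static inline void
sb_model_add_feature(struct sb_model *sb, uint32_t feature)
{
	assert(feature != 0 && (feature & (feature - 1)) == 0);
	sb->features_log_incompat |= feature;
}

/*
 * Clear every flag, but only when both conditions from the commit
 * message hold.  Returns true if the model superblock was dirtied.
 */
static inline bool
sb_model_clear_features(struct sb_model *sb)
{
	if (!sb->features_log_incompat)
		return false;
	if (!sb->log_is_clean || sb->active_users > 0)
		return false;
	sb->features_log_incompat = 0;
	return true;
}
```

The boolean return mirrors the "returns true if the superblock is dirty" contract of the clear function in the patch, so a caller knows whether it still has to sync the superblock.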

> ---
>   fs/xfs/libxfs/xfs_format.h |   15 ++++++
>   fs/xfs/xfs_log.c           |   14 ++++++
>   fs/xfs/xfs_log_recover.c   |   16 ++++++
>   fs/xfs/xfs_mount.c         |  110 ++++++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_mount.h         |    2 +
>   5 files changed, 157 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 9620795a6e08..7e9c964772c9 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -495,6 +495,21 @@ xfs_sb_has_incompat_log_feature(
>   	return (sbp->sb_features_log_incompat & feature) != 0;
>   }
>   
> +static inline void
> +xfs_sb_remove_incompat_log_features(
> +	struct xfs_sb	*sbp)
> +{
> +	sbp->sb_features_log_incompat &= ~XFS_SB_FEAT_INCOMPAT_LOG_ALL;
> +}
> +
> +static inline void
> +xfs_sb_add_incompat_log_features(
> +	struct xfs_sb	*sbp,
> +	unsigned int	features)
> +{
> +	sbp->sb_features_log_incompat |= features;
> +}
> +
>   /*
>    * V5 superblock specific feature checks
>    */
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 06041834daa3..cf73bc9f4d18 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -945,6 +945,20 @@ int
>   xfs_log_quiesce(
>   	struct xfs_mount	*mp)
>   {
> +	/*
> +	 * Clear log incompat features since we're quiescing the log.  Report
> +	 * failures, though it's not fatal to have a higher log feature
> +	 * protection level than the log contents actually require.
> +	 */
> +	if (xfs_clear_incompat_log_features(mp)) {
> +		int error;
> +
> +		error = xfs_sync_sb(mp, false);
> +		if (error)
> +			xfs_warn(mp,
> +	"Failed to clear log incompat features on quiesce");
> +	}
> +
>   	cancel_delayed_work_sync(&mp->m_log->l_work);
>   	xfs_log_force(mp, XFS_LOG_SYNC);
>   
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index ce1a7928eb2d..fdba9b55822e 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -3480,6 +3480,22 @@ xlog_recover_finish(
>   		 */
>   		xfs_log_force(log->l_mp, XFS_LOG_SYNC);
>   
> +		/*
> +		 * Now that we've recovered the log and all the intents, we can
> +		 * clear the log incompat feature bits in the superblock
> +		 * because there's no longer anything to protect.  We rely on
> +		 * the AIL push to write out the updated superblock after
> +		 * everything else.
> +		 */
> +		if (xfs_clear_incompat_log_features(log->l_mp)) {
> +			error = xfs_sync_sb(log->l_mp, false);
> +			if (error < 0) {
> +				xfs_alert(log->l_mp,
> +	"Failed to clear log incompat features on recovery");
> +				return error;
> +			}
> +		}
> +
>   		xlog_recover_process_iunlinks(log);
>   
>   		xlog_recover_check_summary(log);
> diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c
> index b7e653180d22..f16036e1986b 100644
> --- a/fs/xfs/xfs_mount.c
> +++ b/fs/xfs/xfs_mount.c
> @@ -1333,6 +1333,116 @@ xfs_force_summary_recalc(
>   	xfs_fs_mark_checked(mp, XFS_SICK_FS_COUNTERS);
>   }
>   
> +/*
> + * Enable a log incompat feature flag in the primary superblock.  The caller
> + * cannot have any other transactions in progress.
> + */
> +int
> +xfs_add_incompat_log_feature(
> +	struct xfs_mount	*mp,
> +	uint32_t		feature)
> +{
> +	struct xfs_dsb		*dsb;
> +	int			error;
> +
> +	ASSERT(hweight32(feature) == 1);
> +	ASSERT(!(feature & XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN));
> +
> +	/*
> +	 * Force the log to disk and kick the background AIL thread to reduce
> +	 * the chances that the bwrite will stall waiting for the AIL to unpin
> +	 * the primary superblock buffer.  This isn't a data integrity
> +	 * operation, so we don't need a synchronous push.
> +	 */
> +	error = xfs_log_force(mp, XFS_LOG_SYNC);
> +	if (error)
> +		return error;
> +	xfs_ail_push_all(mp->m_ail);
> +
> +	/*
> +	 * Lock the primary superblock buffer to serialize all callers that
> +	 * are trying to set feature bits.
> +	 */
> +	xfs_buf_lock(mp->m_sb_bp);
> +	xfs_buf_hold(mp->m_sb_bp);
> +
> +	if (XFS_FORCED_SHUTDOWN(mp)) {
> +		error = -EIO;
> +		goto rele;
> +	}
> +
> +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb, feature))
> +		goto rele;
> +
> +	/*
> +	 * Write the primary superblock to disk immediately, because we need
> +	 * the log_incompat bit to be set in the primary super now to protect
> +	 * the log items that we're going to commit later.
> +	 */
> +	dsb = mp->m_sb_bp->b_addr;
> +	xfs_sb_to_disk(dsb, &mp->m_sb);
> +	dsb->sb_features_log_incompat |= cpu_to_be32(feature);
> +	error = xfs_bwrite(mp->m_sb_bp);
> +	if (error)
> +		goto shutdown;
> +
> +	/*
> +	 * Add the feature bits to the incore superblock before we unlock the
> +	 * buffer.
> +	 */
> +	xfs_sb_add_incompat_log_features(&mp->m_sb, feature);
> +	xfs_buf_relse(mp->m_sb_bp);
> +
> +	/* Log the superblock to disk. */
> +	return xfs_sync_sb(mp, false);
> +shutdown:
> +	xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> +rele:
> +	xfs_buf_relse(mp->m_sb_bp);
> +	return error;
> +}
> +
> +/*
> + * Clear all the log incompat flags from the superblock.
> + *
> + * The caller cannot be in a transaction, must ensure that the log does not
> + * contain any log items protected by any log incompat bit, and must ensure
> + * that there are no other threads that depend on the state of the log incompat
> + * feature flags in the primary super.
> + *
> + * Returns true if the superblock is dirty.
> + */
> +bool
> +xfs_clear_incompat_log_features(
> +	struct xfs_mount	*mp)
> +{
> +	bool			ret = false;
> +
> +	if (!xfs_sb_version_hascrc(&mp->m_sb) ||
> +	    !xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +				XFS_SB_FEAT_INCOMPAT_LOG_ALL) ||
> +	    XFS_FORCED_SHUTDOWN(mp))
> +		return false;
> +
> +	/*
> +	 * Update the incore superblock.  We synchronize on the primary super
> +	 * buffer lock to be consistent with the add function, though at least
> +	 * in theory this shouldn't be necessary.
> +	 */
> +	xfs_buf_lock(mp->m_sb_bp);
> +	xfs_buf_hold(mp->m_sb_bp);
> +
> +	if (xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +				XFS_SB_FEAT_INCOMPAT_LOG_ALL)) {
> +		xfs_info(mp, "Clearing log incompat feature flags.");
> +		xfs_sb_remove_incompat_log_features(&mp->m_sb);
> +		ret = true;
> +	}
> +
> +	xfs_buf_relse(mp->m_sb_bp);
> +	return ret;
> +}
> +
>   /*
>    * Update the in-core delayed block counter.
>    *
> diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
> index 63d0dc1b798d..eb45684b186a 100644
> --- a/fs/xfs/xfs_mount.h
> +++ b/fs/xfs/xfs_mount.h
> @@ -453,6 +453,8 @@ int	xfs_zero_extent(struct xfs_inode *ip, xfs_fsblock_t start_fsb,
>   struct xfs_error_cfg * xfs_error_get_cfg(struct xfs_mount *mp,
>   		int error_class, int error);
>   void xfs_force_summary_recalc(struct xfs_mount *mp);
> +int xfs_add_incompat_log_feature(struct xfs_mount *mp, uint32_t feature);
> +bool xfs_clear_incompat_log_features(struct xfs_mount *mp);
>   void xfs_mod_delalloc(struct xfs_mount *mp, int64_t delta);
>   
>   void xfs_hook_init(struct xfs_hook_chain *chain);
> 


* Re: [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle
  2021-04-01  1:09 ` [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle Darrick J. Wong
@ 2021-04-02 23:20   ` Allison Henderson
  0 siblings, 0 replies; 30+ messages in thread
From: Allison Henderson @ 2021-04-02 23:20 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 3/31/21 6:09 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> When there are no ongoing transactions and the log contents have been
> checkpointed back into the filesystem, the log performs 'covering',
> which is to say that it logs a dummy transaction to record the fact that
> the tail has caught up with the head.  This is a good time to clear log
> incompat feature flags, because they are flags that are temporarily set
> to limit the range of kernels that can replay a dirty log.
> 
> Since it's possible that some other higher level thread is about to
> start logging items protected by a log incompat flag, we create a rwsem
> so that upper level threads can coordinate this with the log.  It would
> probably be more performant to use a percpu rwsem, but the ability to
> /try/ taking the write lock during covering is critical, and percpu
> rwsems do not provide that.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
ok, makes sense
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/libxfs/xfs_shared.h |    6 +++++
>   fs/xfs/xfs_log.c           |   49 ++++++++++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_log.h           |    3 +++
>   fs/xfs/xfs_log_priv.h      |    3 +++
>   fs/xfs/xfs_trans.c         |   14 +++++++++----
>   5 files changed, 71 insertions(+), 4 deletions(-)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_shared.h b/fs/xfs/libxfs/xfs_shared.h
> index 8c61a461bf7b..c7c9a0cebb04 100644
> --- a/fs/xfs/libxfs/xfs_shared.h
> +++ b/fs/xfs/libxfs/xfs_shared.h
> @@ -62,6 +62,12 @@ void	xfs_log_get_max_trans_res(struct xfs_mount *mp,
>   #define	XFS_TRANS_SB_DIRTY	0x02	/* superblock is modified */
>   #define	XFS_TRANS_PERM_LOG_RES	0x04	/* xact took a permanent log res */
>   #define	XFS_TRANS_SYNC		0x08	/* make commit synchronous */
> +/*
> + * This transaction uses a log incompat feature, which means that we must tell
> + * the log that we've finished using it at the transaction commit or cancel.
> + * Callers must call xlog_use_incompat_feat before setting this flag.
> + */
> +#define XFS_TRANS_LOG_INCOMPAT	0x10
>   #define XFS_TRANS_RESERVE	0x20    /* OK to use reserved data blocks */
>   #define XFS_TRANS_NO_WRITECOUNT 0x40	/* do not elevate SB writecount */
>   #define XFS_TRANS_RES_FDBLKS	0x80	/* reserve newly freed blocks */
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index cf73bc9f4d18..cb72be62da3e 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -1335,6 +1335,32 @@ xfs_log_work_queue(
>   				msecs_to_jiffies(xfs_syncd_centisecs * 10));
>   }
>   
> +/*
> + * Clear the log incompat flags if we have the opportunity.
> + *
> + * This only happens if we're about to log the second dummy transaction as part
> + * of covering the log and we can get the log incompat feature usage lock.
> + */
> +static inline void
> +xlog_clear_incompat(
> +	struct xlog		*log)
> +{
> +	struct xfs_mount	*mp = log->l_mp;
> +
> +	if (!xfs_sb_has_incompat_log_feature(&mp->m_sb,
> +				XFS_SB_FEAT_INCOMPAT_LOG_ALL))
> +		return;
> +
> +	if (log->l_covered_state != XLOG_STATE_COVER_DONE2)
> +		return;
> +
> +	if (!down_write_trylock(&log->l_incompat_users))
> +		return;
> +
> +	xfs_clear_incompat_log_features(mp);
> +	up_write(&log->l_incompat_users);
> +}
> +
>   /*
>    * Every sync period we need to unpin all items in the AIL and push them to
>    * disk. If there is nothing dirty, then we might need to cover the log to
> @@ -1361,6 +1387,7 @@ xfs_log_worker(
>   		 * synchronously log the superblock instead to ensure the
>   		 * superblock is immediately unpinned and can be written back.
>   		 */
> +		xlog_clear_incompat(log);
>   		xfs_sync_sb(mp, true);
>   	} else
>   		xfs_log_force(mp, 0);
> @@ -1443,6 +1470,8 @@ xlog_alloc_log(
>   	}
>   	log->l_sectBBsize = 1 << log2_size;
>   
> +	init_rwsem(&log->l_incompat_users);
> +
>   	xlog_get_iclog_buffer_size(mp, log);
>   
>   	spin_lock_init(&log->l_icloglock);
> @@ -3933,3 +3962,23 @@ xfs_log_in_recovery(
>   
>   	return log->l_flags & XLOG_ACTIVE_RECOVERY;
>   }
> +
> +/*
> + * Notify the log that we're about to start using a feature that is protected
> + * by a log incompat feature flag.  This will prevent log covering from
> + * clearing those flags.
> + */
> +void
> +xlog_use_incompat_feat(
> +	struct xlog		*log)
> +{
> +	down_read(&log->l_incompat_users);
> +}
> +
> +/* Notify the log that we've finished using log incompat features. */
> +void
> +xlog_drop_incompat_feat(
> +	struct xlog		*log)
> +{
> +	up_read(&log->l_incompat_users);
> +}
> diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h
> index 044e02cb8921..8b7d0a56cbf1 100644
> --- a/fs/xfs/xfs_log.h
> +++ b/fs/xfs/xfs_log.h
> @@ -145,4 +145,7 @@ bool	xfs_log_in_recovery(struct xfs_mount *);
>   
>   xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes);
>   
> +void xlog_use_incompat_feat(struct xlog *log);
> +void xlog_drop_incompat_feat(struct xlog *log);
> +
>   #endif	/* __XFS_LOG_H__ */
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index 1c6fdbf3d506..75702c4fa69c 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -436,6 +436,9 @@ struct xlog {
>   #endif
>   	/* log recovery lsn tracking (for buffer submission */
>   	xfs_lsn_t		l_recovery_lsn;
> +
> +	/* Users of log incompat features should take a read lock. */
> +	struct rw_semaphore	l_incompat_users;
>   };
>   
>   #define XLOG_BUF_CANCEL_BUCKET(log, blkno) \
> diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
> index eb2d8e2e5db6..e548d53c2091 100644
> --- a/fs/xfs/xfs_trans.c
> +++ b/fs/xfs/xfs_trans.c
> @@ -71,6 +71,9 @@ xfs_trans_free(
>   	xfs_extent_busy_sort(&tp->t_busy);
>   	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
>   
> +	if (tp->t_flags & XFS_TRANS_LOG_INCOMPAT)
> +		xlog_drop_incompat_feat(tp->t_mountp->m_log);
> +
>   	trace_xfs_trans_free(tp, _RET_IP_);
>   	xfs_trans_clear_context(tp);
>   	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
> @@ -110,10 +113,13 @@ xfs_trans_dup(
>   	ASSERT(tp->t_flags & XFS_TRANS_PERM_LOG_RES);
>   	ASSERT(tp->t_ticket != NULL);
>   
> -	ntp->t_flags = XFS_TRANS_PERM_LOG_RES |
> -		       (tp->t_flags & XFS_TRANS_RESERVE) |
> -		       (tp->t_flags & XFS_TRANS_NO_WRITECOUNT) |
> -		       (tp->t_flags & XFS_TRANS_RES_FDBLKS);
> +	ntp->t_flags = tp->t_flags & (XFS_TRANS_PERM_LOG_RES |
> +				      XFS_TRANS_RESERVE |
> +				      XFS_TRANS_NO_WRITECOUNT |
> +				      XFS_TRANS_RES_FDBLKS |
> +				      XFS_TRANS_LOG_INCOMPAT);
> +	/* Give our LOG_INCOMPAT reference to the new transaction. */
> +	tp->t_flags &= ~XFS_TRANS_LOG_INCOMPAT;
>   	/* We gave our writer reference to the new transaction */
>   	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
>   	ntp->t_ticket = xfs_log_ticket_get(tp->t_ticket);
> 
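
The xfs_trans_dup() hunk hands the LOG_INCOMPAT reference from the old transaction to the new one, so that exactly one of the two drops the read lock when it is freed. A minimal sketch of that handoff follows, assuming a stripped-down struct trans_model (hypothetical) that carries only the flags word; the flag values are copied from the quoted xfs_shared.h hunk.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values as they appear in the quoted xfs_shared.h hunk. */
#define XFS_TRANS_PERM_LOG_RES	0x04
#define XFS_TRANS_LOG_INCOMPAT	0x10
#define XFS_TRANS_RESERVE	0x20
#define XFS_TRANS_NO_WRITECOUNT	0x40
#define XFS_TRANS_RES_FDBLKS	0x80

/* Hypothetical transaction model: flags only. */
struct trans_model {
	uint32_t	t_flags;
};

/* Copy the inheritable flags to ntp and move the incompat reference. */
static inline void
trans_model_dup(struct trans_model *tp, struct trans_model *ntp)
{
	ntp->t_flags = tp->t_flags & (XFS_TRANS_PERM_LOG_RES |
				      XFS_TRANS_RESERVE |
				      XFS_TRANS_NO_WRITECOUNT |
				      XFS_TRANS_RES_FDBLKS |
				      XFS_TRANS_LOG_INCOMPAT);
	/* The old transaction no longer owns the incompat reference. */
	tp->t_flags &= ~XFS_TRANS_LOG_INCOMPAT;
	/* Nor the writer reference. */
	tp->t_flags |= XFS_TRANS_NO_WRITECOUNT;
}

/* How many read-lock drops would freeing this transaction perform? */
static inline int
trans_model_free_drops(const struct trans_model *tp)
{
	return (tp->t_flags & XFS_TRANS_LOG_INCOMPAT) ? 1 : 0;
}
```

Summing trans_model_free_drops() over the old and new transaction after a dup gives one, which is the invariant the hunk is preserving.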


* Re: [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping
  2021-04-01  1:09 ` [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
@ 2021-04-02 23:21   ` Allison Henderson
  0 siblings, 0 replies; 30+ messages in thread
From: Allison Henderson @ 2021-04-02 23:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 3/31/21 6:09 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Create a log incompat flag so that we only attempt to process swap
> extent log items if the filesystem supports it.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
looks ok
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
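
The patch separates two questions: whether the filesystem *can* use atomic swapping (a static property reported through the geometry flags) and whether the log incompat bit is *currently* set. A small model of that split, assuming a hypothetical struct feat_model in place of struct xfs_sb:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP in the patch. */
#define MODEL_LOG_ATOMIC_SWAP	(1u << 0)

/* Hypothetical feature summary; not the kernel's struct xfs_sb. */
struct feat_model {
	bool		v5;
	bool		reflink;
	bool		rmapbt;
	bool		realtime;
	uint32_t	log_incompat;
};

/* Static capability: needs reflink or rmapbt, and no realtime volume. */
static inline bool
feat_model_canatomicswap(const struct feat_model *f)
{
	return (f->reflink || f->rmapbt) && !f->realtime;
}

/* Dynamic state: is the log incompat bit set right now? */
static inline bool
feat_model_hasatomicswap(const struct feat_model *f)
{
	return f->v5 && (f->log_incompat & MODEL_LOG_ATOMIC_SWAP) != 0;
}
```

Under this model a filesystem can report the capability while the bit itself is clear, which is the expected state whenever the log is idle.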

> ---
>   fs/xfs/libxfs/xfs_format.h |   20 ++++++++++++++++++++
>   fs/xfs/libxfs/xfs_fs.h     |    1 +
>   fs/xfs/libxfs/xfs_sb.c     |    2 ++
>   3 files changed, 23 insertions(+)
> 
> 
> diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
> index 7e9c964772c9..e81a7b12a0e3 100644
> --- a/fs/xfs/libxfs/xfs_format.h
> +++ b/fs/xfs/libxfs/xfs_format.h
> @@ -485,6 +485,7 @@ xfs_sb_has_incompat_feature(
>   	return (sbp->sb_features_incompat & feature) != 0;
>   }
>   
> +#define XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP (1 << 0)
>   #define XFS_SB_FEAT_INCOMPAT_LOG_ALL 0
>   #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
>   static inline bool
> @@ -607,6 +608,25 @@ static inline bool xfs_sb_version_needsrepair(struct xfs_sb *sbp)
>   		(sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR);
>   }
>   
> +/*
> + * Decide if this filesystem can use log-assisted ("atomic") extent swapping.
> + * The atomic swap log intent items depend on the block mapping log intent
> + * items introduced with reflink and rmap.  Realtime is not supported yet.
> + */
> +static inline bool xfs_sb_version_canatomicswap(struct xfs_sb *sbp)
> +{
> +	return (xfs_sb_version_hasreflink(sbp) ||
> +		xfs_sb_version_hasrmapbt(sbp)) &&
> +		!xfs_sb_version_hasrealtime(sbp);
> +}
> +
> +static inline bool xfs_sb_version_hasatomicswap(struct xfs_sb *sbp)
> +{
> +	return XFS_SB_VERSION_NUM(sbp) == XFS_SB_VERSION_5 &&
> +		(sbp->sb_features_log_incompat &
> +		 XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP);
> +}
> +
>   /*
>    * end of superblock version macros
>    */
> diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
> index e7e1e3051739..08bfce39407e 100644
> --- a/fs/xfs/libxfs/xfs_fs.h
> +++ b/fs/xfs/libxfs/xfs_fs.h
> @@ -252,6 +252,7 @@ typedef struct xfs_fsop_resblks {
>   #define XFS_FSOP_GEOM_FLAGS_REFLINK	(1 << 20) /* files can share blocks */
>   #define XFS_FSOP_GEOM_FLAGS_BIGTIME	(1 << 21) /* 64-bit nsec timestamps */
>   #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
> +#define XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP	(1 << 23) /* atomic swapext */
>   
>   /*
>    * Minimum and maximum sizes need for growth checks.
> diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
> index 6adfe759190c..52791fe33a6e 100644
> --- a/fs/xfs/libxfs/xfs_sb.c
> +++ b/fs/xfs/libxfs/xfs_sb.c
> @@ -1140,6 +1140,8 @@ xfs_fs_geometry(
>   		geo->flags |= XFS_FSOP_GEOM_FLAGS_BIGTIME;
>   	if (xfs_sb_version_hasinobtcounts(sbp))
>   		geo->flags |= XFS_FSOP_GEOM_FLAGS_INOBTCNT;
> +	if (xfs_sb_version_canatomicswap(sbp))
> +		geo->flags |= XFS_FSOP_GEOM_FLAGS_ATOMIC_SWAP;
>   	if (xfs_sb_version_hassector(sbp))
>   		geo->logsectsize = sbp->sb_logsectsize;
>   	else
> 


* Re: [PATCH 06/18] xfs: introduce a swap-extent log intent item
  2021-04-01  1:09 ` [PATCH 06/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
@ 2021-04-05 23:08   ` Allison Henderson
  0 siblings, 0 replies; 30+ messages in thread
From: Allison Henderson @ 2021-04-05 23:08 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, linux-fsdevel, linux-api



On 3/31/21 6:09 PM, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Introduce a new intent log item to handle swapping extents.
> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Looks ok to me.  Seems reasonably similar to existing log items.
Reviewed-by: Allison Henderson <allison.henderson@oracle.com>

> ---
>   fs/xfs/Makefile                 |    1
>   fs/xfs/libxfs/xfs_log_format.h  |   59 +++++++
>   fs/xfs/libxfs/xfs_log_recover.h |    2
>   fs/xfs/xfs_log.c                |    2
>   fs/xfs/xfs_log_recover.c        |    2
>   fs/xfs/xfs_super.c              |   17 ++
>   fs/xfs/xfs_swapext_item.c       |  328 +++++++++++++++++++++++++++++++++++++++
>   fs/xfs/xfs_swapext_item.h       |   61 +++++++
>   8 files changed, 470 insertions(+), 2 deletions(-)
>   create mode 100644 fs/xfs/xfs_swapext_item.c
>   create mode 100644 fs/xfs/xfs_swapext_item.h
> 
> 
> diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
> index dac3bec1a695..a7cc6f496ad0 100644
> --- a/fs/xfs/Makefile
> +++ b/fs/xfs/Makefile
> @@ -107,6 +107,7 @@ xfs-y				+= xfs_log.o \
>   				   xfs_inode_item_recover.o \
>   				   xfs_refcount_item.o \
>   				   xfs_rmap_item.o \
> +				   xfs_swapext_item.o \
>   				   xfs_log_recover.o \
>   				   xfs_trans_ail.o \
>   				   xfs_trans_buf.o
> diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
> index 6107dac4bd6b..52ca6d72de6a 100644
> --- a/fs/xfs/libxfs/xfs_log_format.h
> +++ b/fs/xfs/libxfs/xfs_log_format.h
> @@ -117,7 +117,9 @@ struct xfs_unmount_log_format {
>   #define XLOG_REG_TYPE_CUD_FORMAT	24
>   #define XLOG_REG_TYPE_BUI_FORMAT	25
>   #define XLOG_REG_TYPE_BUD_FORMAT	26
> -#define XLOG_REG_TYPE_MAX		26
> +#define XLOG_REG_TYPE_SXI_FORMAT	27
> +#define XLOG_REG_TYPE_SXD_FORMAT	28
> +#define XLOG_REG_TYPE_MAX		28
>   
>   /*
>    * Flags to log operation header
> @@ -240,6 +242,8 @@ typedef struct xfs_trans_header {
>   #define	XFS_LI_CUD		0x1243
>   #define	XFS_LI_BUI		0x1244	/* bmbt update intent */
>   #define	XFS_LI_BUD		0x1245
> +#define	XFS_LI_SXI		0x1246
> +#define	XFS_LI_SXD		0x1247
>   
>   #define XFS_LI_TYPE_DESC \
>   	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
> @@ -255,7 +259,9 @@ typedef struct xfs_trans_header {
>   	{ XFS_LI_CUI,		"XFS_LI_CUI" }, \
>   	{ XFS_LI_CUD,		"XFS_LI_CUD" }, \
>   	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
> -	{ XFS_LI_BUD,		"XFS_LI_BUD" }
> +	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
> +	{ XFS_LI_SXI,		"XFS_LI_SXI" }, \
> +	{ XFS_LI_SXD,		"XFS_LI_SXD" }
>   
>   /*
>    * Inode Log Item Format definitions.
> @@ -784,6 +790,55 @@ struct xfs_bud_log_format {
>   	uint64_t		bud_bui_id;	/* id of corresponding bui */
>   };
>   
> +/*
> + * SXI/SXD (extent swapping) log format definitions
> + */
> +
> +struct xfs_swap_extent {
> +	uint64_t		sx_inode1;
> +	uint64_t		sx_inode2;
> +	uint64_t		sx_startoff1;
> +	uint64_t		sx_startoff2;
> +	uint64_t		sx_blockcount;
> +	uint64_t		sx_flags;
> +	int64_t			sx_isize1;
> +	int64_t			sx_isize2;
> +};
> +
> +/* Swap extents between extended attribute forks. */
> +#define XFS_SWAP_EXTENT_ATTR_FORK	(1ULL << 0)
> +
> +/* Set the file sizes when finished. */
> +#define XFS_SWAP_EXTENT_SET_SIZES	(1ULL << 1)
> +
> +/* Do not swap any part of the range where file1's mapping is a hole. */
> +#define XFS_SWAP_EXTENT_SKIP_FILE1_HOLES (1ULL << 2)
> +
> +#define XFS_SWAP_EXTENT_FLAGS		(XFS_SWAP_EXTENT_ATTR_FORK | \
> +					 XFS_SWAP_EXTENT_SET_SIZES | \
> +					 XFS_SWAP_EXTENT_SKIP_FILE1_HOLES)
> +
> +/* This is the structure used to lay out an sxi log item in the log. */
> +struct xfs_sxi_log_format {
> +	uint16_t		sxi_type;	/* sxi log item type */
> +	uint16_t		sxi_size;	/* size of this item */
> +	uint32_t		__pad;		/* must be zero */
> +	uint64_t		sxi_id;		/* sxi identifier */
> +	struct xfs_swap_extent	sxi_extent;	/* extent to swap */
> +};
> +
> +/*
> + * This is the structure used to lay out an sxd log item in the
> + * log.  It carries no extent array; it only identifies the sxi
> + * that it completes, via sxd_sxi_id.
> + */
> +struct xfs_sxd_log_format {
> +	uint16_t		sxd_type;	/* sxd log item type */
> +	uint16_t		sxd_size;	/* size of this item */
> +	uint32_t		__pad;
> +	uint64_t		sxd_sxi_id;	/* id of corresponding sxi */
> +};
> +
>   /*
>    * Dquot Log format definitions.
>    *
> diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
> index 3cca2bfe714c..dcc11a8c438a 100644
> --- a/fs/xfs/libxfs/xfs_log_recover.h
> +++ b/fs/xfs/libxfs/xfs_log_recover.h
> @@ -72,6 +72,8 @@ extern const struct xlog_recover_item_ops xlog_rui_item_ops;
>   extern const struct xlog_recover_item_ops xlog_rud_item_ops;
>   extern const struct xlog_recover_item_ops xlog_cui_item_ops;
>   extern const struct xlog_recover_item_ops xlog_cud_item_ops;
> +extern const struct xlog_recover_item_ops xlog_sxi_item_ops;
> +extern const struct xlog_recover_item_ops xlog_sxd_item_ops;
>   
>   /*
>    * Macros, structures, prototypes for internal log manager use.
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index cb72be62da3e..34213fce3eed 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2113,6 +2113,8 @@ xlog_print_tic_res(
>   	    REG_TYPE_STR(CUD_FORMAT, "cud_format"),
>   	    REG_TYPE_STR(BUI_FORMAT, "bui_format"),
>   	    REG_TYPE_STR(BUD_FORMAT, "bud_format"),
> +	    REG_TYPE_STR(SXI_FORMAT, "sxi_format"),
> +	    REG_TYPE_STR(SXD_FORMAT, "sxd_format"),
>   	};
>   	BUILD_BUG_ON(ARRAY_SIZE(res_type_str) != XLOG_REG_TYPE_MAX + 1);
>   #undef REG_TYPE_STR
> diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
> index fdba9b55822e..107bb222d79f 100644
> --- a/fs/xfs/xfs_log_recover.c
> +++ b/fs/xfs/xfs_log_recover.c
> @@ -1775,6 +1775,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
>   	&xlog_cud_item_ops,
>   	&xlog_bui_item_ops,
>   	&xlog_bud_item_ops,
> +	&xlog_sxi_item_ops,
> +	&xlog_sxd_item_ops,
>   };
>   
>   static const struct xlog_recover_item_ops *
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 175dc7acaca8..85ced8cc6070 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -36,6 +36,7 @@
>   #include "xfs_bmap_item.h"
>   #include "xfs_reflink.h"
>   #include "xfs_pwork.h"
> +#include "xfs_swapext_item.h"
>   
>   #include <linux/magic.h>
>   #include <linux/fs_context.h>
> @@ -2121,8 +2122,24 @@ xfs_init_zones(void)
>   	if (!xfs_bui_zone)
>   		goto out_destroy_bud_zone;
>   
> +	xfs_sxd_zone = kmem_cache_create("xfs_sxd_item",
> +					 sizeof(struct xfs_sxd_log_item),
> +					 0, 0, NULL);
> +	if (!xfs_sxd_zone)
> +		goto out_destroy_bui_zone;
> +
> +	xfs_sxi_zone = kmem_cache_create("xfs_sxi_item",
> +					 sizeof(struct xfs_sxi_log_item),
> +					 0, 0, NULL);
> +	if (!xfs_sxi_zone)
> +		goto out_destroy_sxd_zone;
> +
>   	return 0;
>   
> + out_destroy_sxd_zone:
> +	kmem_cache_destroy(xfs_sxd_zone);
> + out_destroy_bui_zone:
> +	kmem_cache_destroy(xfs_bui_zone);
>    out_destroy_bud_zone:
>   	kmem_cache_destroy(xfs_bud_zone);
>    out_destroy_cui_zone:
> diff --git a/fs/xfs/xfs_swapext_item.c b/fs/xfs/xfs_swapext_item.c
> new file mode 100644
> index 000000000000..83913e9fd4d4
> --- /dev/null
> +++ b/fs/xfs/xfs_swapext_item.c
> @@ -0,0 +1,328 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2021 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#include "xfs.h"
> +#include "xfs_fs.h"
> +#include "xfs_format.h"
> +#include "xfs_log_format.h"
> +#include "xfs_trans_resv.h"
> +#include "xfs_bit.h"
> +#include "xfs_shared.h"
> +#include "xfs_mount.h"
> +#include "xfs_defer.h"
> +#include "xfs_inode.h"
> +#include "xfs_trans.h"
> +#include "xfs_trans_priv.h"
> +#include "xfs_swapext_item.h"
> +#include "xfs_log.h"
> +#include "xfs_bmap.h"
> +#include "xfs_icache.h"
> +#include "xfs_trans_space.h"
> +#include "xfs_error.h"
> +#include "xfs_log_priv.h"
> +#include "xfs_log_recover.h"
> +
> +kmem_zone_t	*xfs_sxi_zone;
> +kmem_zone_t	*xfs_sxd_zone;
> +
> +static const struct xfs_item_ops xfs_sxi_item_ops;
> +
> +static inline struct xfs_sxi_log_item *SXI_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_sxi_log_item, sxi_item);
> +}
> +
> +STATIC void
> +xfs_sxi_item_free(
> +	struct xfs_sxi_log_item	*sxi_lip)
> +{
> +	kmem_cache_free(xfs_sxi_zone, sxi_lip);
> +}
> +
> +/*
> + * Freeing the SXI requires that we remove it from the AIL if it has already
> + * been placed there. However, the SXI may not yet have been placed in the AIL
> + * when called by xfs_sxi_release() from SXD processing due to the ordering of
> + * committed vs unpin operations in bulk insert operations. Hence the reference
> + * count to ensure only the last caller frees the SXI.
> + */
> +STATIC void
> +xfs_sxi_release(
> +	struct xfs_sxi_log_item	*sxi_lip)
> +{
> +	ASSERT(atomic_read(&sxi_lip->sxi_refcount) > 0);
> +	if (atomic_dec_and_test(&sxi_lip->sxi_refcount)) {
> +		xfs_trans_ail_delete(&sxi_lip->sxi_item, SHUTDOWN_LOG_IO_ERROR);
> +		xfs_sxi_item_free(sxi_lip);
> +	}
> +}
> +
> +
> +STATIC void
> +xfs_sxi_item_size(
> +	struct xfs_log_item	*lip,
> +	int			*nvecs,
> +	int			*nbytes)
> +{
> +	*nvecs += 1;
> +	*nbytes += sizeof(struct xfs_sxi_log_format);
> +}
> +
> +/*
> + * This is called to fill in the vector of log iovecs for the given sxi log
> + * item. We use only 1 iovec, and we point that at the sxi_log_format structure
> + * embedded in the sxi item.
> + */
> +STATIC void
> +xfs_sxi_item_format(
> +	struct xfs_log_item	*lip,
> +	struct xfs_log_vec	*lv)
> +{
> +	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
> +	struct xfs_log_iovec	*vecp = NULL;
> +
> +	sxi_lip->sxi_format.sxi_type = XFS_LI_SXI;
> +	sxi_lip->sxi_format.sxi_size = 1;
> +
> +	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXI_FORMAT,
> +			&sxi_lip->sxi_format,
> +			sizeof(struct xfs_sxi_log_format));
> +}
> +
> +/*
> + * The unpin operation is the last place an SXI is manipulated in the log. It
> + * is either inserted in the AIL or aborted in the event of a log I/O error. In
> + * either case, the SXI transaction has been successfully committed to make it
> + * this far. Therefore, we expect whoever committed the SXI to either construct
> + * and commit the SXD or drop the SXD's reference in the event of error. Simply
> + * drop the log's SXI reference now that the log is done with it.
> + */
> +STATIC void
> +xfs_sxi_item_unpin(
> +	struct xfs_log_item	*lip,
> +	int			remove)
> +{
> +	struct xfs_sxi_log_item	*sxi_lip = SXI_ITEM(lip);
> +
> +	xfs_sxi_release(sxi_lip);
> +}
> +
> +/*
> + * The SXI has been either committed or, if the transaction was cancelled,
> + * aborted. If the transaction was cancelled, an SXD isn't going to be
> + * constructed and thus we free the SXI here directly.
> + */
> +STATIC void
> +xfs_sxi_item_release(
> +	struct xfs_log_item	*lip)
> +{
> +	xfs_sxi_release(SXI_ITEM(lip));
> +}
> +
> +/* Allocate and initialize an sxi item. */
> +STATIC struct xfs_sxi_log_item *
> +xfs_sxi_init(
> +	struct xfs_mount		*mp)
> +
> +{
> +	struct xfs_sxi_log_item		*sxi_lip;
> +
> +	sxi_lip = kmem_cache_zalloc(xfs_sxi_zone, GFP_KERNEL | __GFP_NOFAIL);
> +
> +	xfs_log_item_init(mp, &sxi_lip->sxi_item, XFS_LI_SXI, &xfs_sxi_item_ops);
> +	sxi_lip->sxi_format.sxi_id = (uintptr_t)(void *)sxi_lip;
> +	atomic_set(&sxi_lip->sxi_refcount, 2);
> +
> +	return sxi_lip;
> +}
> +
> +static inline struct xfs_sxd_log_item *SXD_ITEM(struct xfs_log_item *lip)
> +{
> +	return container_of(lip, struct xfs_sxd_log_item, sxd_item);
> +}
> +
> +STATIC void
> +xfs_sxd_item_size(
> +	struct xfs_log_item	*lip,
> +	int			*nvecs,
> +	int			*nbytes)
> +{
> +	*nvecs += 1;
> +	*nbytes += sizeof(struct xfs_sxd_log_format);
> +}
> +
> +/*
> + * This is called to fill in the vector of log iovecs for the given sxd log
> + * item. We use only 1 iovec, and we point that at the sxd_log_format structure
> + * embedded in the sxd item.
> + */
> +STATIC void
> +xfs_sxd_item_format(
> +	struct xfs_log_item	*lip,
> +	struct xfs_log_vec	*lv)
> +{
> +	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
> +	struct xfs_log_iovec	*vecp = NULL;
> +
> +	sxd_lip->sxd_format.sxd_type = XFS_LI_SXD;
> +	sxd_lip->sxd_format.sxd_size = 1;
> +
> +	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_SXD_FORMAT, &sxd_lip->sxd_format,
> +			sizeof(struct xfs_sxd_log_format));
> +}
> +
> +/*
> + * The SXD is either committed or, if the transaction is cancelled, aborted.
> + * If the transaction is cancelled, drop our reference to the SXI and free the
> + * SXD.
> + */
> +STATIC void
> +xfs_sxd_item_release(
> +	struct xfs_log_item	*lip)
> +{
> +	struct xfs_sxd_log_item	*sxd_lip = SXD_ITEM(lip);
> +
> +	xfs_sxi_release(sxd_lip->sxd_intent_log_item);
> +	kmem_cache_free(xfs_sxd_zone, sxd_lip);
> +}
> +
> +static const struct xfs_item_ops xfs_sxd_item_ops = {
> +	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED,
> +	.iop_size	= xfs_sxd_item_size,
> +	.iop_format	= xfs_sxd_item_format,
> +	.iop_release	= xfs_sxd_item_release,
> +};
> +
> +/* Process a swapext update intent item that was recovered from the log. */
> +STATIC int
> +xfs_sxi_item_recover(
> +	struct xfs_log_item		*lip,
> +	struct list_head		*capture_list)
> +{
> +	return -EFSCORRUPTED;
> +}
> +
> +STATIC bool
> +xfs_sxi_item_match(
> +	struct xfs_log_item	*lip,
> +	uint64_t		intent_id)
> +{
> +	return SXI_ITEM(lip)->sxi_format.sxi_id == intent_id;
> +}
> +
> +/* Relog an intent item to push the log tail forward. */
> +static struct xfs_log_item *
> +xfs_sxi_item_relog(
> +	struct xfs_log_item		*intent,
> +	struct xfs_trans		*tp)
> +{
> +	ASSERT(0);
> +	return NULL;
> +}
> +
> +static const struct xfs_item_ops xfs_sxi_item_ops = {
> +	.iop_size	= xfs_sxi_item_size,
> +	.iop_format	= xfs_sxi_item_format,
> +	.iop_unpin	= xfs_sxi_item_unpin,
> +	.iop_release	= xfs_sxi_item_release,
> +	.iop_recover	= xfs_sxi_item_recover,
> +	.iop_match	= xfs_sxi_item_match,
> +	.iop_relog	= xfs_sxi_item_relog,
> +};
> +
> +/*
> + * Copy an SXI format buffer from the given buf, and into the destination SXI
> + * format structure.  The SXI/SXD items were designed not to need any special
> + * alignment handling.
> + */
> +static int
> +xfs_sxi_copy_format(
> +	struct xfs_log_iovec		*buf,
> +	struct xfs_sxi_log_format	*dst_sxi_fmt)
> +{
> +	struct xfs_sxi_log_format	*src_sxi_fmt;
> +	size_t				len;
> +
> +	src_sxi_fmt = buf->i_addr;
> +	len = sizeof(struct xfs_sxi_log_format);
> +
> +	if (buf->i_len == len) {
> +		memcpy(dst_sxi_fmt, src_sxi_fmt, len);
> +		return 0;
> +	}
> +	XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, NULL);
> +	return -EFSCORRUPTED;
> +}
> +
> +/*
> + * This routine is called to create an in-core swapext update intent item from
> + * the sxi format structure which was logged on disk.  It allocates an in-core
> + * sxi, copies the format structure into it, and adds the sxi to the AIL with
> + * the given LSN.
> + */
> +STATIC int
> +xlog_recover_sxi_commit_pass2(
> +	struct xlog			*log,
> +	struct list_head		*buffer_list,
> +	struct xlog_recover_item	*item,
> +	xfs_lsn_t			lsn)
> +{
> +	int				error;
> +	struct xfs_mount		*mp = log->l_mp;
> +	struct xfs_sxi_log_item		*sxi_lip;
> +	struct xfs_sxi_log_format	*sxi_formatp;
> +
> +	sxi_formatp = item->ri_buf[0].i_addr;
> +
> +	if (sxi_formatp->__pad != 0) {
> +		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
> +		return -EFSCORRUPTED;
> +	}
> +	sxi_lip = xfs_sxi_init(mp);
> +	error = xfs_sxi_copy_format(&item->ri_buf[0], &sxi_lip->sxi_format);
> +	if (error) {
> +		xfs_sxi_item_free(sxi_lip);
> +		return error;
> +	}
> +	xfs_trans_ail_insert(log->l_ailp, &sxi_lip->sxi_item, lsn);
> +	xfs_sxi_release(sxi_lip);
> +	return 0;
> +}
> +
> +const struct xlog_recover_item_ops xlog_sxi_item_ops = {
> +	.item_type		= XFS_LI_SXI,
> +	.commit_pass2		= xlog_recover_sxi_commit_pass2,
> +};
> +
> +/*
> + * This routine is called when an SXD format structure is found in a committed
> + * transaction in the log. Its purpose is to cancel the corresponding SXI if it
> + * was still in the log. To do this it searches the AIL for the SXI with an id
> + * equal to that in the SXD format structure. If we find it we drop the SXD
> + * reference, which removes the SXI from the AIL and frees it.
> + */
> +STATIC int
> +xlog_recover_sxd_commit_pass2(
> +	struct xlog			*log,
> +	struct list_head		*buffer_list,
> +	struct xlog_recover_item	*item,
> +	xfs_lsn_t			lsn)
> +{
> +	struct xfs_sxd_log_format	*sxd_formatp;
> +
> +	sxd_formatp = item->ri_buf[0].i_addr;
> +	if (item->ri_buf[0].i_len != sizeof(struct xfs_sxd_log_format)) {
> +		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
> +		return -EFSCORRUPTED;
> +	}
> +
> +	xlog_recover_release_intent(log, XFS_LI_SXI, sxd_formatp->sxd_sxi_id);
> +	return 0;
> +}
> +
> +const struct xlog_recover_item_ops xlog_sxd_item_ops = {
> +	.item_type		= XFS_LI_SXD,
> +	.commit_pass2		= xlog_recover_sxd_commit_pass2,
> +};
> diff --git a/fs/xfs/xfs_swapext_item.h b/fs/xfs/xfs_swapext_item.h
> new file mode 100644
> index 000000000000..7caeccdcaa81
> --- /dev/null
> +++ b/fs/xfs/xfs_swapext_item.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0-or-later */
> +/*
> + * Copyright (C) 2021 Oracle.  All Rights Reserved.
> + * Author: Darrick J. Wong <djwong@kernel.org>
> + */
> +#ifndef	__XFS_SWAPEXT_ITEM_H__
> +#define	__XFS_SWAPEXT_ITEM_H__
> +
> +/*
> + * The extent swapping intent item helps us perform atomic extent swaps between
> + * two inode forks.  It does this by tracking the range of logical offsets that
> + * still need to be swapped, and relogs as progress happens.
> + *
> + * *I items should be recorded in the *first* of a series of rolled
> + * transactions, and the *D items should be recorded in the same transaction
> + * that records the associated bmbt updates.
> + *
> + * Should the system crash after the commit of the first transaction but
> + * before the commit of the final transaction in a series, log recovery will
> + * use the redo information recorded by the intent items to replay the
> + * rest of the extent swaps.
> + */
> +
> +/* kernel only SXI/SXD definitions */
> +
> +struct xfs_mount;
> +struct kmem_zone;
> +
> +/*
> + * Max number of extents in fast allocation path.
> + */
> +#define	XFS_SXI_MAX_FAST_EXTENTS	1
> +
> +/*
> + * This is the "swapext update intent" log item.  It is used to log the fact
> + * that we are swapping extents between two files.  It is used in conjunction
> + * with the "swapext update done" log item described below.
> + *
> + * These log items follow the same rules as struct xfs_efi_log_item; see the
> + * comments about that structure (in xfs_extfree_item.h) for more details.
> + */
> +struct xfs_sxi_log_item {
> +	struct xfs_log_item		sxi_item;
> +	atomic_t			sxi_refcount;
> +	struct xfs_sxi_log_format	sxi_format;
> +};
> +
> +/*
> + * This is the "swapext update done" log item.  It is used to log the fact that
> + * some extent swapping mentioned in an earlier sxi item has been performed.
> + */
> +struct xfs_sxd_log_item {
> +	struct xfs_log_item		sxd_item;
> +	struct xfs_sxi_log_item		*sxd_intent_log_item;
> +	struct xfs_sxd_log_format	sxd_format;
> +};
> +
> +extern struct kmem_zone	*xfs_sxi_zone;
> +extern struct kmem_zone	*xfs_sxd_zone;
> +
> +#endif	/* __XFS_SWAPEXT_ITEM_H__ */
> 
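
For readers following the reference counting in the quoted patch: the SXI
starts with two references (one held by the log, one for the eventual SXD),
and whichever side drops the last one frees the item.  Below is a minimal
userspace sketch of that lifecycle, assuming C11 atomics in place of the
kernel's atomic_t.  The names intent_item, intent_init and intent_release are
hypothetical stand-ins for illustration, not kernel APIs, and the "freed" flag
exists only so the effect can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct xfs_sxi_log_item. */
struct intent_item {
	atomic_int	refcount;
	bool		*freed;	/* demo only: set when the item is freed */
};

/* Like xfs_sxi_init(): one reference for the log, one for the done item. */
static struct intent_item *intent_init(bool *freed)
{
	struct intent_item *ip = calloc(1, sizeof(*ip));

	if (!ip)
		return NULL;
	atomic_init(&ip->refcount, 2);
	ip->freed = freed;
	*freed = false;
	return ip;
}

/*
 * Like xfs_sxi_release(): drop one reference; only the last caller frees
 * the item.  Returns true if this call freed it.
 */
static bool intent_release(struct intent_item *ip)
{
	assert(atomic_load(&ip->refcount) > 0);

	/* atomic_fetch_sub() returns the value held before the decrement. */
	if (atomic_fetch_sub(&ip->refcount, 1) == 1) {
		*ip->freed = true;
		free(ip);
		return true;
	}
	return false;
}
```

In this model the first intent_release() call corresponds to the log dropping
its reference at unpin time, and the second to the SXD being released; the
order does not matter, only that the second call is the one that frees.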

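
The recovery side of the quoted patch can be sketched in the same spirit: an
intent recovered from the log is length-checked and remembered, and a later
done item cancels the intent whose id matches.  This is a loose userspace
model, assuming a small fixed array in place of the AIL; the pending_intents
structure and the recover_intent, recover_done and pending_count helpers are
hypothetical names, and the -1 return merely stands in for -EFSCORRUPTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_PENDING	8

/* Hypothetical stand-in for struct xfs_sxi_log_format. */
struct intent_format {
	uint64_t	id;
	uint64_t	payload;
};

/* Hypothetical stand-in for the AIL: intents still awaiting a done item. */
struct pending_intents {
	struct intent_format	items[MAX_PENDING];
	bool			used[MAX_PENDING];
};

/* Validate the logged buffer length, then remember the intent. */
static int recover_intent(struct pending_intents *p, const void *buf,
			  size_t len)
{
	struct intent_format fmt;

	assert(p != NULL);
	if (len != sizeof(fmt))
		return -1;	/* wrong size: treat the log as corrupt */
	memcpy(&fmt, buf, sizeof(fmt));

	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (!p->used[i]) {
			p->items[i] = fmt;
			p->used[i] = true;
			return 0;
		}
	}
	return -1;		/* demo only: table full */
}

/* A done item cancels the pending intent with the same id, if any. */
static bool recover_done(struct pending_intents *p, uint64_t intent_id)
{
	for (size_t i = 0; i < MAX_PENDING; i++) {
		if (p->used[i] && p->items[i].id == intent_id) {
			p->used[i] = false;
			return true;
		}
	}
	return false;
}

/* Intents left over here are the ones recovery would have to replay. */
static size_t pending_count(const struct pending_intents *p)
{
	size_t n = 0;

	for (size_t i = 0; i < MAX_PENDING; i++)
		n += p->used[i];
	return n;
}
```

Unlike this sketch, the quoted xlog_recover_sxd_commit_pass2() returns 0
whether or not a matching intent is found; the boolean return here is only
there to make the matching visible.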

end of thread, other threads:[~2021-04-05 23:09 UTC | newest]

Thread overview: 30+ messages
2021-04-01  1:08 [PATCHSET RFC v3 00/18] xfs: atomic file updates Darrick J. Wong
2021-04-01  1:08 ` [PATCH 01/18] vfs: introduce new file range exchange ioctl Darrick J. Wong
2021-04-01  1:44   ` Al Viro
2021-04-01 21:18     ` Darrick J. Wong
2021-04-01  3:32   ` Amir Goldstein
2021-04-02  0:37     ` Darrick J. Wong
2021-04-01  1:08 ` [PATCH 02/18] xfs: support two inodes in the defer capture structure Darrick J. Wong
2021-04-02 23:20   ` Allison Henderson
2021-04-01  1:09 ` [PATCH 03/18] xfs: allow setting and clearing of log incompat feature flags Darrick J. Wong
2021-04-02 23:20   ` Allison Henderson
2021-04-01  1:09 ` [PATCH 04/18] xfs: clear log incompat feature bits when the log is idle Darrick J. Wong
2021-04-02 23:20   ` Allison Henderson
2021-04-01  1:09 ` [PATCH 05/18] xfs: create a log incompat flag for atomic extent swapping Darrick J. Wong
2021-04-02 23:21   ` Allison Henderson
2021-04-01  1:09 ` [PATCH 06/18] xfs: introduce a swap-extent log intent item Darrick J. Wong
2021-04-05 23:08   ` Allison Henderson
2021-04-01  1:09 ` [PATCH 07/18] xfs: create deferred log items for extent swapping Darrick J. Wong
2021-04-01  1:09 ` [PATCH 08/18] xfs: add a ->xchg_file_range handler Darrick J. Wong
2021-04-01  1:09 ` [PATCH 09/18] xfs: add error injection to test swapext recovery Darrick J. Wong
2021-04-01  1:09 ` [PATCH 10/18] xfs: port xfs_swap_extents_rmap to our new code Darrick J. Wong
2021-04-01  1:09 ` [PATCH 11/18] xfs: consolidate all of the xfs_swap_extent_forks code Darrick J. Wong
2021-04-01  1:09 ` [PATCH 12/18] xfs: refactor reflink flag handling in xfs_swap_extent_forks Darrick J. Wong
2021-04-01  1:09 ` [PATCH 13/18] xfs: allow xfs_swap_range to use older extent swap algorithms Darrick J. Wong
2021-04-01  1:10 ` [PATCH 14/18] xfs: remove old swap extents implementation Darrick J. Wong
2021-04-01  1:10 ` [PATCH 15/18] xfs: condense extended attributes after an atomic swap Darrick J. Wong
2021-04-01  1:10 ` [PATCH 16/18] xfs: condense directories " Darrick J. Wong
2021-04-01  1:10 ` [PATCH 17/18] xfs: make atomic extent swapping support realtime files Darrick J. Wong
2021-04-01  1:10 ` [PATCH 18/18] xfs: enable atomic swapext feature Darrick J. Wong
2021-04-01  3:56 ` [PATCHSET RFC v3 00/18] xfs: atomic file updates Amir Goldstein
2021-04-02  0:22   ` Darrick J. Wong
