All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
@ 2022-06-27 22:11 James Yonan
  2022-06-28  9:46 ` Amir Goldstein
  2022-06-28 17:34 ` Al Viro
  0 siblings, 2 replies; 24+ messages in thread
From: James Yonan @ 2022-06-27 22:11 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: James Yonan

RENAME_NEWER is a new userspace-visible flag for renameat2(), and
stands alongside existing flags such as RENAME_NOREPLACE,
RENAME_EXCHANGE, and RENAME_WHITEOUT.

RENAME_NEWER is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename will
only succeed if the source file is newer than the target (i.e. source
mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER does a plain rename like RENAME_NOREPLACE.

RENAME_NEWER is very useful in distributed systems that mirror a
directory structure, or use a directory as a key/value store, and need
to guarantee that files will only be overwritten by newer files, and
that all updates are atomic.

While this patch may appear large at first glance, most of the changes
deal with renameat2() flags validation, and the core logic is only
5 lines in the do_renameat2() function in fs/namei.c:

	if ((flags & RENAME_NEWER)
	    && d_is_positive(new_dentry)
	    && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
				  &d_backing_inode(new_dentry)->i_mtime) <= 0)
		goto exit5;

It's pretty cool in a way that a new atomic file operation can even be
implemented in just 5 lines of code, and it's thanks to the existing
locking infrastructure around file rename/move that these operations
become almost trivial.  Unfortunately, every fs must approve a new
renameat2() flag, so it bloats the patch a bit.

So one question to ask is could this functionality be implemented
in userspace without adding a new renameat2() flag?  I think you
could attempt it with iterative RENAME_EXCHANGE, but it's hackish,
inefficient, and not atomic, because races could cause temporary
mtime backtracks.  How about using file locking?  Probably not,
because the problem we want to solve is maintaining file/directory
atomicity for readers by creating files out-of-directory, setting
their mtime, and atomically moving them into place.  The strategy
to lock such an operation really requires more complex locking methods
than are generally exposed to userspace.  And if you are using inotify
on the directory to notify readers of changes, it certainly makes
sense to reduce unnecessary churn by preventing a move operation
based on the mtime check.

While some people might question the utility of adding features to
filesystems to make them more like databases, there is real value
in the performance, atomicity, consistent VFS interface, multi-thread
safety, and async-notify capabilities of modern filesystems that
starts to blur the line, and actually make filesystem-based key-value
stores a win for many applications.

Like RENAME_NOREPLACE, the RENAME_NEWER implementation lives in
the VFS, however the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
For this reason, my general approach with flags is to accept
RENAME_NEWER wherever RENAME_NOREPLACE is also accepted, since
RENAME_NEWER is simply a conditional variant of RENAME_NOREPLACE.

I noticed only one fs driver (cifs) that treated RENAME_NOREPLACE
in a non-generic way, because no-replace is the natural behavior
for CIFS, and it therefore considers RENAME_NOREPLACE as a hint that
no replacement can occur.  Aside from this special case, it seems
safe to assume that any fs that supports RENAME_NOREPLACE should
also be able to support RENAME_NEWER out of the box.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER.  Build and run with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Signed-off-by: James Yonan <james@openvpn.net>
---
 Documentation/filesystems/vfs.rst             |   5 +
 fs/affs/namei.c                               |   2 +-
 fs/bfs/dir.c                                  |   2 +-
 fs/btrfs/inode.c                              |   2 +-
 fs/cifs/inode.c                               |   2 +-
 fs/exfat/namei.c                              |   7 +-
 fs/ext2/namei.c                               |   2 +-
 fs/ext4/namei.c                               |   2 +-
 fs/f2fs/namei.c                               |   2 +-
 fs/fat/namei_msdos.c                          |   2 +-
 fs/fat/namei_vfat.c                           |   2 +-
 fs/fuse/dir.c                                 |   2 +-
 fs/gfs2/inode.c                               |   2 +-
 fs/hfs/dir.c                                  |   2 +-
 fs/hfsplus/dir.c                              |   2 +-
 fs/hostfs/hostfs_kern.c                       |   2 +-
 fs/hpfs/namei.c                               |   2 +-
 fs/jffs2/dir.c                                |   2 +-
 fs/jfs/namei.c                                |   2 +-
 fs/libfs.c                                    |   2 +-
 fs/minix/namei.c                              |   2 +-
 fs/namei.c                                    |  10 +-
 fs/nilfs2/namei.c                             |   2 +-
 fs/ntfs3/namei.c                              |   2 +-
 fs/omfs/dir.c                                 |   2 +-
 fs/overlayfs/dir.c                            |   4 +-
 fs/reiserfs/namei.c                           |   2 +-
 fs/sysv/namei.c                               |   2 +-
 fs/ubifs/dir.c                                |   2 +-
 fs/udf/namei.c                                |   2 +-
 fs/ufs/namei.c                                |   2 +-
 fs/xfs/xfs_iops.c                             |   2 +-
 include/uapi/linux/fs.h                       |   1 +
 mm/shmem.c                                    |   2 +-
 tools/include/uapi/linux/fs.h                 |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/renameat2/.gitignore  |   1 +
 tools/testing/selftests/renameat2/Makefile    |  10 ++
 .../selftests/renameat2/renameat2_tests.c     | 142 ++++++++++++++++++
 39 files changed, 204 insertions(+), 36 deletions(-)
 create mode 100644 tools/testing/selftests/renameat2/.gitignore
 create mode 100644 tools/testing/selftests/renameat2/Makefile
 create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..8c5c773c426c 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,6 +515,11 @@ otherwise noted.
 	(2) RENAME_EXCHANGE: exchange source and target.  Both must
 	exist; this is checked by the VFS.  Unlike plain rename, source
 	and target may be of different type.
+	(3) RENAME_NEWER: this flag is similar to RENAME_NOREPLACE,
+	and indicates that if the target of the rename exists, the
+	rename should only succeed if the source file is newer than
+	the target (i.e. source mtime > target mtime).  Otherwise, the
+	rename should fail with -EEXIST instead of replacing the target.
 
 ``get_link``
 	called by the VFS to follow a symbolic link to the inode it
diff --git a/fs/affs/namei.c b/fs/affs/namei.c
index bcab18956b4f..e52f1851e1b9 100644
--- a/fs/affs/namei.c
+++ b/fs/affs/namei.c
@@ -508,7 +508,7 @@ int affs_rename2(struct user_namespace *mnt_userns, struct inode *old_dir,
 		 struct dentry *new_dentry, unsigned int flags)
 {
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER))
 		return -EINVAL;
 
 	pr_debug("%s(old=%lu,\"%pd\" to new=%lu,\"%pd\")\n", __func__,
diff --git a/fs/bfs/dir.c b/fs/bfs/dir.c
index 34d4f68f786b..d161bfb37050 100644
--- a/fs/bfs/dir.c
+++ b/fs/bfs/dir.c
@@ -209,7 +209,7 @@ static int bfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct bfs_sb_info *info;
 	int error = -ENOENT;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_bh = new_bh = NULL;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05e0c4a5affd..1e3d8c3f3c50 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9549,7 +9549,7 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
 			 struct dentry *old_dentry, struct inode *new_dir,
 			 struct dentry *new_dentry, unsigned int flags)
 {
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/cifs/inode.c b/fs/cifs/inode.c
index 81da81e18553..79b2e03b9a6f 100644
--- a/fs/cifs/inode.c
+++ b/fs/cifs/inode.c
@@ -2087,7 +2087,7 @@ cifs_rename2(struct user_namespace *mnt_userns, struct inode *source_dir,
 	int rc, tmprc;
 	int retry_count = 0;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	cifs_sb = CIFS_SB(source_dir->i_sb);
diff --git a/fs/exfat/namei.c b/fs/exfat/namei.c
index c6eaf7e9ea74..fe74c1115f8d 100644
--- a/fs/exfat/namei.c
+++ b/fs/exfat/namei.c
@@ -1322,10 +1322,11 @@ static int exfat_rename(struct user_namespace *mnt_userns,
 
 	/*
 	 * The VFS already checks for existence, so for local filesystems
-	 * the RENAME_NOREPLACE implementation is equivalent to plain rename.
-	 * Don't support any other flags
+	 * the RENAME_NOREPLACE implementation (and conditional variants
+	 * such as RENAME_NEWER) is equivalent to plain rename.
+	 * Don't support any other flags.
 	 */
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	mutex_lock(&EXFAT_SB(sb)->s_lock);
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 5f6b7560eb3f..c5adbf2d770f 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
 	struct ext2_dir_entry_2 * old_de;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	err = dquot_initialize(old_dir);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index db4ba99d1ceb..a5788c56a21d 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -4128,7 +4128,7 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/f2fs/namei.c b/fs/f2fs/namei.c
index bf00d5057abb..fa52b8a5b346 100644
--- a/fs/f2fs/namei.c
+++ b/fs/f2fs/namei.c
@@ -1306,7 +1306,7 @@ static int f2fs_rename2(struct user_namespace *mnt_userns,
 {
 	int err;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index efba301d68ae..ad7ae428ca93 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -603,7 +603,7 @@ static int msdos_rename(struct user_namespace *mnt_userns,
 	unsigned char old_msdos_name[MSDOS_NAME], new_msdos_name[MSDOS_NAME];
 	int err, is_hid;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	mutex_lock(&MSDOS_SB(sb)->s_lock);
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index c573314806cf..7f251ea267c3 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -902,7 +902,7 @@ static int vfat_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	int err, is_dir, update_dotdot, corrupt = 0;
 	struct super_block *sb = old_dir->i_sb;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_sinfo.bh = sinfo.bh = dotdot_bh = NULL;
diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index 74303d6e987b..7576414315ae 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -980,7 +980,7 @@ static int fuse_rename2(struct user_namespace *mnt_userns, struct inode *olddir,
 	if (fuse_is_bad(olddir))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	if (flags) {
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index c8ec876f33ea..ee2c33700290 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1753,7 +1753,7 @@ static int gfs2_rename2(struct user_namespace *mnt_userns, struct inode *odir,
 			struct dentry *odentry, struct inode *ndir,
 			struct dentry *ndentry, unsigned int flags)
 {
-	flags &= ~RENAME_NOREPLACE;
+	flags &= ~(RENAME_NOREPLACE | RENAME_NEWER);
 
 	if (flags & ~RENAME_EXCHANGE)
 		return -EINVAL;
diff --git a/fs/hfs/dir.c b/fs/hfs/dir.c
index 527f6e46cbe8..0665fa46fc2b 100644
--- a/fs/hfs/dir.c
+++ b/fs/hfs/dir.c
@@ -286,7 +286,7 @@ static int hfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 {
 	int res;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	/* Unlink destination if it already exists */
diff --git a/fs/hfsplus/dir.c b/fs/hfsplus/dir.c
index 84714bbccc12..edc53792a80b 100644
--- a/fs/hfsplus/dir.c
+++ b/fs/hfsplus/dir.c
@@ -536,7 +536,7 @@ static int hfsplus_rename(struct user_namespace *mnt_userns,
 {
 	int res;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	/* Unlink destination if it already exists */
diff --git a/fs/hostfs/hostfs_kern.c b/fs/hostfs/hostfs_kern.c
index cc1bc6f93a01..39c2a30c64b1 100644
--- a/fs/hostfs/hostfs_kern.c
+++ b/fs/hostfs/hostfs_kern.c
@@ -742,7 +742,7 @@ static int hostfs_rename2(struct user_namespace *mnt_userns,
 	char *old_name, *new_name;
 	int err;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_name = dentry_name(old_dentry);
diff --git a/fs/hpfs/namei.c b/fs/hpfs/namei.c
index 15fc63276caa..3c1e9b1e671e 100644
--- a/fs/hpfs/namei.c
+++ b/fs/hpfs/namei.c
@@ -531,7 +531,7 @@ static int hpfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct fnode *fnode;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	if ((err = hpfs_chk_name(new_name, &new_len))) return err;
diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index c0aabbcbfd58..f5e98837c51e 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -773,7 +773,7 @@ static int jffs2_rename (struct user_namespace *mnt_userns,
 	uint8_t type;
 	uint32_t now;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	/* The VFS will check for us and prevent trying to rename a
diff --git a/fs/jfs/namei.c b/fs/jfs/namei.c
index 9db4f5789c0e..75b5df0ffd98 100644
--- a/fs/jfs/namei.c
+++ b/fs/jfs/namei.c
@@ -1080,7 +1080,7 @@ static int jfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	s64 new_size = 0;
 	int commit_flag;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	jfs_info("jfs_rename: %pd %pd", old_dentry, new_dentry);
diff --git a/fs/libfs.c b/fs/libfs.c
index 31b0ddf01c31..6430e0c4ae1e 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -479,7 +479,7 @@ int simple_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = d_is_dir(old_dentry);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/minix/namei.c b/fs/minix/namei.c
index 937fa5fae2b8..b130da0ab3c0 100644
--- a/fs/minix/namei.c
+++ b/fs/minix/namei.c
@@ -197,7 +197,7 @@ static int minix_rename(struct user_namespace *mnt_userns,
 	struct minix_dir_entry * old_de;
 	int err = -ENOENT;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_de = minix_find_entry(old_dentry, &old_page);
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..11df238f1875 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -40,6 +40,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/time64.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -4769,10 +4770,10 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	bool should_retry = false;
 	int error = -EINVAL;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		goto put_names;
 
-	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
+	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT | RENAME_NEWER)) &&
 	    (flags & RENAME_EXCHANGE))
 		goto put_names;
 
@@ -4825,6 +4826,11 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -EEXIST;
 	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
 		goto exit5;
+	if ((flags & RENAME_NEWER)
+	    && d_is_positive(new_dentry)
+	    && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
+				  &d_backing_inode(new_dentry)->i_mtime) <= 0)
+		goto exit5;
 	if (flags & RENAME_EXCHANGE) {
 		error = -ENOENT;
 		if (d_is_negative(new_dentry))
diff --git a/fs/nilfs2/namei.c b/fs/nilfs2/namei.c
index 23899e0ae850..7b1b53cb5bc8 100644
--- a/fs/nilfs2/namei.c
+++ b/fs/nilfs2/namei.c
@@ -354,7 +354,7 @@ static int nilfs_rename(struct user_namespace *mnt_userns,
 	struct nilfs_transaction_info ti;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	err = nilfs_transaction_begin(old_dir->i_sb, &ti, 1);
diff --git a/fs/ntfs3/namei.c b/fs/ntfs3/namei.c
index bc741213ad84..2fc3091114c0 100644
--- a/fs/ntfs3/namei.c
+++ b/fs/ntfs3/namei.c
@@ -252,7 +252,7 @@ static int ntfs_rename(struct user_namespace *mnt_userns, struct inode *dir,
 		      1024);
 	static_assert(PATH_MAX >= 4 * 1024);
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	is_same = dentry->d_name.len == new_dentry->d_name.len &&
diff --git a/fs/omfs/dir.c b/fs/omfs/dir.c
index c219f91f44e9..42ecaaa318c9 100644
--- a/fs/omfs/dir.c
+++ b/fs/omfs/dir.c
@@ -378,7 +378,7 @@ static int omfs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct inode *old_inode = d_inode(old_dentry);
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	if (new_inode) {
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index 6b03457f72bb..1e38ad37ceec 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -1101,10 +1101,10 @@ static int ovl_rename(struct user_namespace *mnt_userns, struct inode *olddir,
 	LIST_HEAD(list);
 
 	err = -EINVAL;
-	if (flags & ~(RENAME_EXCHANGE | RENAME_NOREPLACE))
+	if (flags & ~(RENAME_EXCHANGE | RENAME_NOREPLACE | RENAME_NEWER))
 		goto out;
 
-	flags &= ~RENAME_NOREPLACE;
+	flags &= ~(RENAME_NOREPLACE | RENAME_NEWER);
 
 	/* Don't copy up directory trees */
 	err = -EXDEV;
diff --git a/fs/reiserfs/namei.c b/fs/reiserfs/namei.c
index 3d7a35d6a18b..7c1a19cecc40 100644
--- a/fs/reiserfs/namei.c
+++ b/fs/reiserfs/namei.c
@@ -1325,7 +1325,7 @@ static int reiserfs_rename(struct user_namespace *mnt_userns,
 	unsigned long savelink = 1;
 	struct timespec64 ctime;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	/*
diff --git a/fs/sysv/namei.c b/fs/sysv/namei.c
index b2e6abc06a2d..fb3504f98e93 100644
--- a/fs/sysv/namei.c
+++ b/fs/sysv/namei.c
@@ -201,7 +201,7 @@ static int sysv_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct sysv_dir_entry * old_de;
 	int err = -ENOENT;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_de = sysv_find_entry(old_dentry, &old_page);
diff --git a/fs/ubifs/dir.c b/fs/ubifs/dir.c
index 86151889548e..923117c4290b 100644
--- a/fs/ubifs/dir.c
+++ b/fs/ubifs/dir.c
@@ -1610,7 +1610,7 @@ static int ubifs_rename(struct user_namespace *mnt_userns,
 	int err;
 	struct ubifs_info *c = old_dir->i_sb->s_fs_info;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_WHITEOUT | RENAME_EXCHANGE))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_WHITEOUT | RENAME_EXCHANGE | RENAME_NEWER))
 		return -EINVAL;
 
 	ubifs_assert(c, inode_is_locked(old_dir));
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index b3d5f97f16cd..abe60354aa7c 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1087,7 +1087,7 @@ static int udf_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct kernel_lb_addr tloc;
 	struct udf_inode_info *old_iinfo = UDF_I(old_inode);
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	ofi = udf_find_entry(old_dir, &old_dentry->d_name, &ofibh, &ocfi);
diff --git a/fs/ufs/namei.c b/fs/ufs/namei.c
index 29d5a0e0c8f0..a55519376066 100644
--- a/fs/ufs/namei.c
+++ b/fs/ufs/namei.c
@@ -255,7 +255,7 @@ static int ufs_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct ufs_dir_entry *old_de;
 	int err = -ENOENT;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER))
 		return -EINVAL;
 
 	old_de = ufs_find_entry(old_dir, &old_dentry->d_name, &old_page);
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 29f5b8b8aca6..2cafa7abc4f3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,7 +457,7 @@ xfs_vn_rename(
 	struct xfs_name	oname;
 	struct xfs_name	nname;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	/* if we are exchanging files, we need to set i_mode of both files */
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..b848054b8942 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER		(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..52f3e4b74625 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3009,7 +3009,7 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = S_ISDIR(inode->i_mode);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index bdf7b404b3e7..b848054b8942 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER		(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index de11992dc577..34226dfbca7a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -54,6 +54,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += renameat2
 TARGETS += resctrl
 TARGETS += rlimits
 TARGETS += rseq
diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
new file mode 100644
index 000000000000..79bbdf497559
--- /dev/null
+++ b/tools/testing/selftests/renameat2/.gitignore
@@ -0,0 +1 @@
+renameat2_tests
diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
new file mode 100644
index 000000000000..c39f5e281a5d
--- /dev/null
+++ b/tools/testing/selftests/renameat2/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS = -g -Wall -O2
+CFLAGS += $(KHDR_INCLUDES)
+
+TEST_GEN_PROGS := renameat2_tests
+
+include ../lib.mk
+
+$(OUTPUT)/renameat2_tests: renameat2_tests.c
diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
new file mode 100644
index 000000000000..843432f07179
--- /dev/null
+++ b/tools/testing/selftests/renameat2/renameat2_tests.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include "../kselftest_harness.h"
+
+#include <fcntl.h>
+#include <time.h>
+#include <sys/stat.h>
+
+/* requires a kernel that implements renameat2() RENAME_NEWER flag */
+#ifndef RENAME_NEWER
+#define RENAME_NEWER (1 << 3)
+#endif
+
+static int create_file_with_timestamp(const char *filename,
+				      const time_t tv_sec,
+				      const long tv_nsec,
+				      struct stat *s)
+{
+	int fd;
+	struct timespec times[2];
+
+	fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777);
+	if (fd < 0)
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (futimens(fd, times) != 0) {
+		close(fd);
+		return errno;
+	}
+	if (fstat(fd, s)) {
+		close(fd);
+		return errno;
+	}
+	if (close(fd) != 0)
+		return errno;
+	return 0;
+}
+
+static int do_rename_newer(const char *source_file, const char *target_file)
+{
+	if (renameat2(AT_FDCWD, source_file, AT_FDCWD, target_file, RENAME_NEWER))
+		return errno;
+	return 0;
+}
+
+static int verify_inode(const char *filename, const struct stat *orig_stat)
+{
+	struct stat s;
+
+	if (stat(filename, &s))
+		return errno;
+	if (orig_stat->st_ino != s.st_ino)
+		return ENOENT;
+	return 0;
+}
+
+static int verify_exist(const char *filename)
+{
+	int fd;
+
+	fd = open(filename, O_RDONLY);
+	if (fd < 0)
+		return errno;
+	if (close(fd) != 0)
+		return errno;
+	return 0;
+}
+
+TEST(rename_newer)
+{
+	char dirname[] = "/tmp/ksft-renameat2-rename-newer.XXXXXX";
+	const time_t now = time(NULL);
+	struct stat stat_a, stat_b, stat_c;
+	int eno;
+
+	/* make the top-level directory */
+	if (!mkdtemp(dirname))
+		ksft_exit_fail_msg("rename_newer: failed to create tmpdir\n");
+
+	/* cd to top-level directory */
+	if (chdir(dirname))
+		ksft_exit_fail_msg("rename_newer: failed to cd to tmpdir\n");
+
+	/* create files */
+	eno = create_file_with_timestamp("a", now, 700000, &stat_a);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: failed to create file 'a', errno=%d\n", eno);
+	eno = create_file_with_timestamp("b", now+1, 500000, &stat_b);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: failed to create file 'b', errno=%d\n", eno);
+	eno = create_file_with_timestamp("c", now+1, 500000, &stat_c);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: failed to create file 'c', errno=%d\n", eno);
+
+	/* rename b -> e -- should succeed because e doesn't exist */
+	eno = do_rename_newer("b", "e");
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER feature)\n", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'e' after rename 'b' -> 'e', errno=%d\n", eno);
+	eno = verify_exist("b");
+	if (eno != ENOENT)
+		ksft_exit_fail_msg("rename_newer: strangely, 'b' still exists after rename 'b' -> 'e', errno=%d\n", eno);
+
+	/* rename c -> e -- should fail because c and e have the same timestamp */
+	eno = do_rename_newer("c", "e");
+	if (eno != EEXIST)
+		ksft_exit_fail_msg("rename_newer: rename 'c' -> 'e' should have failed because 'c' and 'e' have the same timestamp, errno=%d\n", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'c' after attempted rename 'c' -> 'e' didn't occur, errno=%d\n", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'e' after attempted rename 'c' -> 'e' didn't occur, errno=%d\n", eno);
+
+	/* rename a -> c -- should fail because c is newer than a */
+	eno = do_rename_newer("a", "c");
+	if (eno != EEXIST)
+		ksft_exit_fail_msg("rename_newer: rename 'a' -> 'c' should have failed because 'c' is newer, errno=%d\n", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'a' after attempted rename 'a' -> 'c' didn't occur, errno=%d\n", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'c' after attempted rename 'a' -> 'c' didn't occur, errno=%d\n", eno);
+
+	/* rename c -> a -- should succeed because c is newer than a */
+	eno = do_rename_newer("c", "a");
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d\n", eno);
+	eno = verify_inode("a", &stat_c);
+	if (eno)
+		ksft_exit_fail_msg("rename_newer: could not verify inode of 'a' after rename 'c' -> 'a', errno=%d\n", eno);
+	eno = verify_exist("c");
+	if (eno != ENOENT)
+		ksft_exit_fail_msg("rename_newer: strangely, 'c' still exists after rename 'c' -> 'a', errno=%d\n", eno);
+}
+
+TEST_HARNESS_MAIN
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-27 22:11 [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace James Yonan
@ 2022-06-28  9:46 ` Amir Goldstein
  2022-06-28 21:56   ` James Yonan
  2022-06-28 17:34 ` Al Viro
  1 sibling, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2022-06-28  9:46 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel, Linux API

[+linux-api]

On Tue, Jun 28, 2022 at 1:58 AM James Yonan <james@openvpn.net> wrote:
>
> RENAME_NEWER is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags such as RENAME_NOREPLACE,
> RENAME_EXCHANGE, and RENAME_WHITEOUT.
>
> RENAME_NEWER is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename will
> only succeed if the source file is newer than the target (i.e. source
> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
> instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER does a plain rename like RENAME_NOREPLACE.
>
> RENAME_NEWER is very useful in distributed systems that mirror a
> directory structure, or use a directory as a key/value store, and need
> to guarantee that files will only be overwritten by newer files, and
> that all updates are atomic.

This feature sounds very cool.
For adding a new API it is always useful if you bring forward a userland
tool (rsync?) that intend to use it, preferably with a POC patch.
A concrete prospective user is always better than a hypothetical one.
Some people hold the opinion that only new APIs with real prospective
users should be merged.

>
> While this patch may appear large at first glance, most of the changes
> deal with renameat2() flags validation, and the core logic is only
> 5 lines in the do_renameat2() function in fs/namei.c:
>
>         if ((flags & RENAME_NEWER)
>             && d_is_positive(new_dentry)
>             && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
>                                   &d_backing_inode(new_dentry)->i_mtime) <= 0)
>                 goto exit5;
>

I have a few questions:
- Why mtime?
- Why not ctime?
- Shouldn't a feature like that protect from overwriting metadata changes
  to the destination file?

In any case, I would be much more comfortable with comparing ctime
because when it comes to user settable times, what if rsync *wants*
to update the destination file's mtime to an earlier time that was set in
the rsync source?

If comparing ctime does not fit your use case and you can convince
the community that comparing mtime is a justified use case, we would
need to use a flag name to reflect that like RENAME_NEWER_MTIME
so we don't block future use case of RENAME_NEWER_CTIME.
I hope that we can agree that ctime is enough and that mtime will not
make sense so we can settle with RENAME_NEWER that means ctime.

> It's pretty cool in a way that a new atomic file operation can even be
> implemented in just 5 lines of code, and it's thanks to the existing
> locking infrastructure around file rename/move that these operations
> become almost trivial.  Unfortunately, every fs must approve a new
> renameat2() flag, so it bloats the patch a bit.
>
> So one question to ask is could this functionality be implemented
> in userspace without adding a new renameat2() flag?  I think you
> could attempt it with iterative RENAME_EXCHANGE, but it's hackish,
> inefficient, and not atomic, because races could cause temporary
> mtime backtracks.  How about using file locking?  Probably not,
> because the problem we want to solve is maintaining file/directory
> atomicity for readers by creating files out-of-directory, setting
> their mtime, and atomically moving them into place.  The strategy
> to lock such an operation really requires more complex locking methods
> than are generally exposed to userspace.  And if you are using inotify
> on the directory to notify readers of changes, it certainly makes
> sense to reduce unnecessary churn by preventing a move operation
> based on the mtime check.
>
> While some people might question the utility of adding features to
> filesystems to make them more like databases, there is real value
> in the performance, atomicity, consistent VFS interface, multi-thread
> safety, and async-notify capabilities of modern filesystems that
> starts to blur the line, and actually make filesystem-based key-value
> stores a win for many applications.
>
> Like RENAME_NOREPLACE, the RENAME_NEWER implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> For this reason, my general approach with flags is to accept
> RENAME_NEWER wherever RENAME_NOREPLACE is also accepted, since
> RENAME_NEWER is simply a conditional variant of RENAME_NOREPLACE.

You are not taking into account that mtime may be cached attribute that is not
uptodate in network filesystems (fuse too) so behavior can be a bit
unpredictable,
unless filesystem code compares also the cache coherency of the attributes
on both files and even then, without extending the network protocol this is
questionable behavior for client side.
So I think your filter of which filesystems to enable is way too wide.

How many of them did you test, I'll take a wild guess that not all of them.

Please do not enable RENAME_NEWER is any filesystem that you did
not test or that was not tested by some other fs developer using the
tests that you write.

>
> I noticed only one fs driver (cifs) that treated RENAME_NOREPLACE
> in a non-generic way, because no-replace is the natural behavior
> for CIFS, and it therefore considers RENAME_NOREPLACE as a hint that
> no replacement can occur.  Aside from this special case, it seems
> safe to assume that any fs that supports RENAME_NOREPLACE should
> also be able to support RENAME_NEWER out of the box.
>
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests), so I created one, though
> at the moment it only exercises RENAME_NEWER.  Build and run with:
>
>   make -C tools/testing/selftests TARGETS=renameat2 run_tests

Please make sure to also add test coverage in fstests and/or LTP.
See fstest generic/024 for RENAME_NOREPLACE and LTP test
renameat201 as examples.
renameat201 can and should be converted to newlib and run with
.all_filesystems = 1.
This is the easiest and fastest way for you to verify your patch
on the common fs that are installed on your system.
LTP maintainers can help you with that.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-27 22:11 [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace James Yonan
  2022-06-28  9:46 ` Amir Goldstein
@ 2022-06-28 17:34 ` Al Viro
  2022-06-28 18:34   ` Amir Goldstein
  1 sibling, 1 reply; 24+ messages in thread
From: Al Viro @ 2022-06-28 17:34 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel

On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:

> 	    && d_is_positive(new_dentry)
> 	    && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> 				  &d_backing_inode(new_dentry)->i_mtime) <= 0)
> 		goto exit5;
> 
> It's pretty cool in a way that a new atomic file operation can even be
> implemented in just 5 lines of code, and it's thanks to the existing
> locking infrastructure around file rename/move that these operations
> become almost trivial.  Unfortunately, every fs must approve a new
> renameat2() flag, so it bloats the patch a bit.

How is it atomic and what's to stabilize ->i_mtime in that test?
Confused...

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-28 17:34 ` Al Viro
@ 2022-06-28 18:34   ` Amir Goldstein
  2022-06-28 23:19     ` James Yonan
  0 siblings, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2022-06-28 18:34 UTC (permalink / raw)
  To: Al Viro; +Cc: James Yonan, linux-fsdevel

On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
>
> >           && d_is_positive(new_dentry)
> >           && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> >                                 &d_backing_inode(new_dentry)->i_mtime) <= 0)
> >               goto exit5;
> >
> > It's pretty cool in a way that a new atomic file operation can even be
> > implemented in just 5 lines of code, and it's thanks to the existing
> > locking infrastructure around file rename/move that these operations
> > become almost trivial.  Unfortunately, every fs must approve a new
> > renameat2() flag, so it bloats the patch a bit.
>
> How is it atomic and what's to stabilize ->i_mtime in that test?
> Confused...

Good point.
RENAME_EXCHANGE_WITH_NEWER would have been better
in that regard.

And you'd have to check in vfs_rename() after lock_two_nondirectories()

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-28  9:46 ` Amir Goldstein
@ 2022-06-28 21:56   ` James Yonan
  2022-06-29  5:15     ` Amir Goldstein
  0 siblings, 1 reply; 24+ messages in thread
From: James Yonan @ 2022-06-28 21:56 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, Linux API

On 6/28/22 03:46, Amir Goldstein wrote:
> [+linux-api]
>
> On Tue, Jun 28, 2022 at 1:58 AM James Yonan <james@openvpn.net> wrote:
>> RENAME_NEWER is a new userspace-visible flag for renameat2(), and
>> stands alongside existing flags such as RENAME_NOREPLACE,
>> RENAME_EXCHANGE, and RENAME_WHITEOUT.
>>
>> RENAME_NEWER is a conditional variation on RENAME_NOREPLACE, and
>> indicates that if the target of the rename exists, the rename will
>> only succeed if the source file is newer than the target (i.e. source
>> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
>> instead of replacing the target.  When the target doesn't exist,
>> RENAME_NEWER does a plain rename like RENAME_NOREPLACE.
>>
>> RENAME_NEWER is very useful in distributed systems that mirror a
>> directory structure, or use a directory as a key/value store, and need
>> to guarantee that files will only be overwritten by newer files, and
>> that all updates are atomic.
> This feature sounds very cool.
> For adding a new API it is always useful if you bring forward a userland
> tool (rsync?) that intend to use it, preferably with a POC patch.
> A concrete prospective user is always better than a hypothetical one.
> Some people hold the opinion that only new APIs with real prospective
> users should be merged.

Not sure that rsync would be the canonical user, though it might be a 
reasonable POC.  The problem that we are solving is essentially 
near-real-time directory mirroring or replication of one source 
directory to many target directories on follower nodes.  Many writers, 
many readers, filesystem-based, strong guarantees of eventual 
convergence, infinitely scalable.  You have a source directory that 
could be an AWS S3 bucket.  You have potentially thousands of follower 
nodes that want to replicate the source directory on a local 
filesystem.  You have messages flying around the network (Kafka, AWS 
SQS, etc.) representing file updates.  These messages might be reordered 
or duplicated but each contains the file content and a unique 
nanosecond-scale timestamp.  Because the file update throughput can be 
in the thousands of files per second, you might have multiple threads on 
each node, receiving updates, and moving them into the target 
directory.  The only way to guarantee that the state of the mirror 
target directory on all nodes converges to the state of the source 
directory is to have the last-step move operation be atomic and 
conditional to the modification time (so that an earlier version of the 
file doesn't overwrite a later version).

So I understand that there needs to be a strong case to extend the Linux 
API, and I think the argument is that conditional atomic file operations 
enable entirely new classes of applications.  The fact that this can be 
facilitated by essentially 5 lines of kernel code is remarkable.

>
>> While this patch may appear large at first glance, most of the changes
>> deal with renameat2() flags validation, and the core logic is only
>> 5 lines in the do_renameat2() function in fs/namei.c:
>>
>>          if ((flags & RENAME_NEWER)
>>              && d_is_positive(new_dentry)
>>              && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
>>                                    &d_backing_inode(new_dentry)->i_mtime) <= 0)
>>                  goto exit5;
>>
> I have a few questions:
> - Why mtime?
> - Why not ctime?
> - Shouldn't a feature like that protect from overwriting metadata changes
>    to the destination file?
>
> In any case, I would be much more comfortable with comparing ctime
> because when it comes to user settable times, what if rsync *wants*
> to update the destination file's mtime to an earlier time that was set in
> the rsync source?
>
> If comparing ctime does not fit your use case and you can convince
> the community that comparing mtime is a justified use case, we would
> need to use a flag name to reflect that like RENAME_NEWER_MTIME
> so we don't block future use case of RENAME_NEWER_CTIME.
> I hope that we can agree that ctime is enough and that mtime will not
> make sense so we can settle with RENAME_NEWER that means ctime.

So I actually think that mtime is the better timestamp to use because 
ctime is modified by the rename operation itself, while mtime measures 
the last modification time of the file content, which is what we care about.

>
>> It's pretty cool in a way that a new atomic file operation can even be
>> implemented in just 5 lines of code, and it's thanks to the existing
>> locking infrastructure around file rename/move that these operations
>> become almost trivial.  Unfortunately, every fs must approve a new
>> renameat2() flag, so it bloats the patch a bit.
>>
>> So one question to ask is could this functionality be implemented
>> in userspace without adding a new renameat2() flag?  I think you
>> could attempt it with iterative RENAME_EXCHANGE, but it's hackish,
>> inefficient, and not atomic, because races could cause temporary
>> mtime backtracks.  How about using file locking?  Probably not,
>> because the problem we want to solve is maintaining file/directory
>> atomicity for readers by creating files out-of-directory, setting
>> their mtime, and atomically moving them into place.  The strategy
>> to lock such an operation really requires more complex locking methods
>> than are generally exposed to userspace.  And if you are using inotify
>> on the directory to notify readers of changes, it certainly makes
>> sense to reduce unnecessary churn by preventing a move operation
>> based on the mtime check.
>>
>> While some people might question the utility of adding features to
>> filesystems to make them more like databases, there is real value
>> in the performance, atomicity, consistent VFS interface, multi-thread
>> safety, and async-notify capabilities of modern filesystems that
>> starts to blur the line, and actually make filesystem-based key-value
>> stores a win for many applications.
>>
>> Like RENAME_NOREPLACE, the RENAME_NEWER implementation lives in
>> the VFS, however the individual fs implementations do strict flags
>> checking and will return -EINVAL for any flag they don't recognize.
>> For this reason, my general approach with flags is to accept
>> RENAME_NEWER wherever RENAME_NOREPLACE is also accepted, since
>> RENAME_NEWER is simply a conditional variant of RENAME_NOREPLACE.
> You are not taking into account that mtime may be cached attribute that is not
> uptodate in network filesystems (fuse too) so behavior can be a bit
> unpredictable,
> unless filesystem code compares also the cache coherency of the attributes
> on both files and even then, without extending the network protocol this is
> questionable behavior for client side.
> So I think your filter of which filesystems to enable is way too wide.

So I'm new to the filesystem code, but my reading of do_renameat2() in 
fs/namei.c seems to indicate that if d_is_positive(dentry) is true, it's 
safe to access the inode struct via d_backing_inode(dentry) and 
dereference mtime.  But what you're saying is that the mtime value 
cached in the inode struct might not be up-to-date for network filesystems?

>
> How many of them did you test, I'll take a wild guess that not all of them.
>
> Please do not enable RENAME_NEWER is any filesystem that you did
> not test or that was not tested by some other fs developer using the
> tests that you write.

So I'm mostly interested in implementing this on local filesystems, 
because the application layer has already done the heavy lifting on the 
networking side so that the filesystem layer can be local, fast, and 
atomic.  So yes, I haven't tested this yet on networked filesystems.  
But I'm thinking that because all functionality is implemented at the 
VFS layer, it should be portable to any fs that also supports 
RENAME_NOREPLACE, with the caveat that it depends on the ability of the 
VFS to get a current and accurate mtime attribute inside the critical 
section between lock_rename() and unlock_rename().

>
>> I noticed only one fs driver (cifs) that treated RENAME_NOREPLACE
>> in a non-generic way, because no-replace is the natural behavior
>> for CIFS, and it therefore considers RENAME_NOREPLACE as a hint that
>> no replacement can occur.  Aside from this special case, it seems
>> safe to assume that any fs that supports RENAME_NOREPLACE should
>> also be able to support RENAME_NEWER out of the box.
>>
>> I did not notice a general self-test for renameat2() at the VFS
>> layer (outside of fs-specific tests), so I created one, though
>> at the moment it only exercises RENAME_NEWER.  Build and run with:
>>
>>    make -C tools/testing/selftests TARGETS=renameat2 run_tests
> Please make sure to also add test coverage in fstests and/or LTP.
> See fstest generic/024 for RENAME_NOREPLACE and LTP test
> renameat201 as examples.
> renameat201 can and should be converted to newlib and run with
> .all_filesystems = 1.
> This is the easiest and fastest way for you to verify your patch
> on the common fs that are installed on your system.
> LTP maintainers can help you with that.

Thanks, that's good to know.  And thanks for taking the time to write up 
such a detailed response.

James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-28 18:34   ` Amir Goldstein
@ 2022-06-28 23:19     ` James Yonan
  2022-06-29  1:43       ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: James Yonan @ 2022-06-28 23:19 UTC (permalink / raw)
  To: Amir Goldstein, Al Viro; +Cc: linux-fsdevel

On 6/28/22 12:34, Amir Goldstein wrote:
> On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>> On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
>>
>>>            && d_is_positive(new_dentry)
>>>            && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
>>>                                  &d_backing_inode(new_dentry)->i_mtime) <= 0)
>>>                goto exit5;
>>>
>>> It's pretty cool in a way that a new atomic file operation can even be
>>> implemented in just 5 lines of code, and it's thanks to the existing
>>> locking infrastructure around file rename/move that these operations
>>> become almost trivial.  Unfortunately, every fs must approve a new
>>> renameat2() flag, so it bloats the patch a bit.
>> How is it atomic and what's to stabilize ->i_mtime in that test?
>> Confused...
> Good point.
> RENAME_EXCHANGE_WITH_NEWER would have been better
> in that regard.
>
> And you'd have to check in vfs_rename() after lock_two_nondirectories()

So I mean atomic in the sense that you are comparing the old and new 
mtimes inside the lock_rename/unlock_rename critical section in 
do_renameat2(), so the basic guarantees of rename still hold, i.e. that 
readers see an atomic transition from old to new files, or no transition 
(where mtime comparison results in -EEXIST return).  I understand that 
it doesn't guarantee i_mtime stability, but the application layer may 
not need that guarantee. In our case, mtime is immutable after local 
file creation and before do_renameat2() is used to move the file into place.

Re: RENAME_EXCHANGE_WITH_NEWER, that's an interesting idea.  You could 
actually implement it with minor changes in the patch, by simply 
combining RENAME_EXCHANGE|RENAME_NEWER.  Because fundamentally, all 
RENAME_NEWER does is compare mtimes and possibly return early with 
-EEXIST.  If the early return is not taken, then it becomes a plain 
rename or RENAME_EXCHANGE if that flag is also specified.

James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-28 23:19     ` James Yonan
@ 2022-06-29  1:43       ` Dave Chinner
  2022-06-29  2:07         ` NeilBrown
  0 siblings, 1 reply; 24+ messages in thread
From: Dave Chinner @ 2022-06-29  1:43 UTC (permalink / raw)
  To: James Yonan; +Cc: Amir Goldstein, Al Viro, linux-fsdevel

On Tue, Jun 28, 2022 at 05:19:12PM -0600, James Yonan wrote:
> On 6/28/22 12:34, Amir Goldstein wrote:
> > On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
> > > 
> > > >            && d_is_positive(new_dentry)
> > > >            && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> > > >                                  &d_backing_inode(new_dentry)->i_mtime) <= 0)
> > > >                goto exit5;
> > > > 
> > > > It's pretty cool in a way that a new atomic file operation can even be
> > > > implemented in just 5 lines of code, and it's thanks to the existing
> > > > locking infrastructure around file rename/move that these operations
> > > > become almost trivial.  Unfortunately, every fs must approve a new
> > > > renameat2() flag, so it bloats the patch a bit.
> > > How is it atomic and what's to stabilize ->i_mtime in that test?
> > > Confused...
> > Good point.
> > RENAME_EXCHANGE_WITH_NEWER would have been better
> > in that regard.
> > 
> > And you'd have to check in vfs_rename() after lock_two_nondirectories()
> 
> So I mean atomic in the sense that you are comparing the old and new mtimes
> inside the lock_rename/unlock_rename critical section in do_renameat2(), so

mtime is not stable during rename, even with the inode locked. e.g. a
write page fault occurring concurrently with rename will change
mtime, and so which inode is "newer" can change during the rename
syscall...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-29  1:43       ` Dave Chinner
@ 2022-06-29  2:07         ` NeilBrown
  2022-06-29  2:35           ` Dave Chinner
  0 siblings, 1 reply; 24+ messages in thread
From: NeilBrown @ 2022-06-29  2:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: James Yonan, Amir Goldstein, Al Viro, linux-fsdevel

On Wed, 29 Jun 2022, Dave Chinner wrote:
> On Tue, Jun 28, 2022 at 05:19:12PM -0600, James Yonan wrote:
> > On 6/28/22 12:34, Amir Goldstein wrote:
> > > On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
> > > > 
> > > > >            && d_is_positive(new_dentry)
> > > > >            && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> > > > >                                  &d_backing_inode(new_dentry)->i_mtime) <= 0)
> > > > >                goto exit5;
> > > > > 
> > > > > It's pretty cool in a way that a new atomic file operation can even be
> > > > > implemented in just 5 lines of code, and it's thanks to the existing
> > > > > locking infrastructure around file rename/move that these operations
> > > > > become almost trivial.  Unfortunately, every fs must approve a new
> > > > > renameat2() flag, so it bloats the patch a bit.
> > > > How is it atomic and what's to stabilize ->i_mtime in that test?
> > > > Confused...
> > > Good point.
> > > RENAME_EXCHANGE_WITH_NEWER would have been better
> > > in that regard.
> > > 
> > > And you'd have to check in vfs_rename() after lock_two_nondirectories()
> > 
> > So I mean atomic in the sense that you are comparing the old and new mtimes
> > inside the lock_rename/unlock_rename critical section in do_renameat2(), so
> 
> mtime is not stable during rename, even with the inode locked. e.g. a
> write page fault occurring concurrently with rename will change
> mtime, and so which inode is "newer" can change during the rename
> syscall...

I don't think that is really important for the proposed use case.
In any case where you might be using this new rename flag, the target
file wouldn't be open for write, so the mtime wouldn't change.
The atomicity is really wanted to make sure the file at the destination
name is still the one that was expected (I think).

So I think it would be reasonable for the rename to fail if the target
file (or even "either file") is open for write.  Can that change while
the inode is locked?

It would be nice if renameat2 took a third fd so we could say "only
rename if <this> is the target file", but it doesn't.

NeilBrown

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-29  2:07         ` NeilBrown
@ 2022-06-29  2:35           ` Dave Chinner
  2022-06-29  2:49             ` NeilBrown
  2022-06-30 16:39             ` James Yonan
  0 siblings, 2 replies; 24+ messages in thread
From: Dave Chinner @ 2022-06-29  2:35 UTC (permalink / raw)
  To: NeilBrown; +Cc: James Yonan, Amir Goldstein, Al Viro, linux-fsdevel

On Wed, Jun 29, 2022 at 12:07:04PM +1000, NeilBrown wrote:
> On Wed, 29 Jun 2022, Dave Chinner wrote:
> > On Tue, Jun 28, 2022 at 05:19:12PM -0600, James Yonan wrote:
> > > On 6/28/22 12:34, Amir Goldstein wrote:
> > > > On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
> > > > > 
> > > > > >            && d_is_positive(new_dentry)
> > > > > >            && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> > > > > >                                  &d_backing_inode(new_dentry)->i_mtime) <= 0)
> > > > > >                goto exit5;
> > > > > > 
> > > > > > It's pretty cool in a way that a new atomic file operation can even be
> > > > > > implemented in just 5 lines of code, and it's thanks to the existing
> > > > > > locking infrastructure around file rename/move that these operations
> > > > > > become almost trivial.  Unfortunately, every fs must approve a new
> > > > > > renameat2() flag, so it bloats the patch a bit.
> > > > > How is it atomic and what's to stabilize ->i_mtime in that test?
> > > > > Confused...
> > > > Good point.
> > > > RENAME_EXCHANGE_WITH_NEWER would have been better
> > > > in that regard.
> > > > 
> > > > And you'd have to check in vfs_rename() after lock_two_nondirectories()
> > > 
> > > So I mean atomic in the sense that you are comparing the old and new mtimes
> > > inside the lock_rename/unlock_rename critical section in do_renameat2(), so
> > 
> > mtime is not stable during rename, even with the inode locked. e.g. a
> > write page fault occurring concurrently with rename will change
> > mtime, and so which inode is "newer" can change during the rename
> > syscall...
> 
> I don't think that is really important for the proposed use case.

Sure, but that's not the point. How do you explain it the API
semantics to an app developer that might want to use this
functionality? RENAME_EXCHANGE_WITH_NEWER would be atomic in the
sense you either get the old or new file at the destination, but
it's not atomic in the sense that it is serialised against all other
potential modification operations against either the source or
destination. Hence the "if newer" comparison is not part of the
"atomic rename" operation that is supposedly being performed...

I'm also sceptical of the use of mtime - we can't rely on mtime to
determine the newer file accurately on all filesystems. e.g. Some
fileystems only have second granularity in their timestamps, so
there's a big window where "newer" cannot actually be determined by
timestamp comparisons.

/me is having flashbacks to the bad old days of NFS using inode
timestamps for change ordering and cache consistency....

> In any case where you might be using this new rename flag, the target
> file wouldn't be open for write, so the mtime wouldn't change.
> The atomicity is really wanted to make sure the file at the destination
> name is still the one that was expected (I think).

How would you document this, and how would the application be
expected to handle such a "someone else has this open for write"
error? There's nothing the app can do about the cause of the
failure, so how is it expected to handle such an error?

I'm not opposed to adding functionality like this, I'm just pointing
out problems that I see arising from the insufficiently
constrained/specified behaviour of the proposed functionality.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-29  2:35           ` Dave Chinner
@ 2022-06-29  2:49             ` NeilBrown
  2022-06-30 16:39             ` James Yonan
  1 sibling, 0 replies; 24+ messages in thread
From: NeilBrown @ 2022-06-29  2:49 UTC (permalink / raw)
  To: Dave Chinner; +Cc: James Yonan, Amir Goldstein, Al Viro, linux-fsdevel

On Wed, 29 Jun 2022, Dave Chinner wrote:
> On Wed, Jun 29, 2022 at 12:07:04PM +1000, NeilBrown wrote:
> > On Wed, 29 Jun 2022, Dave Chinner wrote:
> > > On Tue, Jun 28, 2022 at 05:19:12PM -0600, James Yonan wrote:
> > > > On 6/28/22 12:34, Amir Goldstein wrote:
> > > > > On Tue, Jun 28, 2022 at 8:44 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > > > > On Mon, Jun 27, 2022 at 04:11:07PM -0600, James Yonan wrote:
> > > > > > 
> > > > > > >            && d_is_positive(new_dentry)
> > > > > > >            && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> > > > > > >                                  &d_backing_inode(new_dentry)->i_mtime) <= 0)
> > > > > > >                goto exit5;
> > > > > > > 
> > > > > > > It's pretty cool in a way that a new atomic file operation can even be
> > > > > > > implemented in just 5 lines of code, and it's thanks to the existing
> > > > > > > locking infrastructure around file rename/move that these operations
> > > > > > > become almost trivial.  Unfortunately, every fs must approve a new
> > > > > > > renameat2() flag, so it bloats the patch a bit.
> > > > > > How is it atomic and what's to stabilize ->i_mtime in that test?
> > > > > > Confused...
> > > > > Good point.
> > > > > RENAME_EXCHANGE_WITH_NEWER would have been better
> > > > > in that regard.
> > > > > 
> > > > > And you'd have to check in vfs_rename() after lock_two_nondirectories()
> > > > 
> > > > So I mean atomic in the sense that you are comparing the old and new mtimes
> > > > inside the lock_rename/unlock_rename critical section in do_renameat2(), so
> > > 
> > > mtime is not stable during rename, even with the inode locked. e.g. a
> > > write page fault occurring concurrently with rename will change
> > > mtime, and so which inode is "newer" can change during the rename
> > > syscall...
> > 
> > I don't think that is really important for the proposed use case.
> 
> Sure, but that's not the point. How do you explain it the API
> semantics to an app developer that might want to use this
> functionality? RENAME_EXCHANGE_WITH_NEWER would be atomic in the
> sense you either get the old or new file at the destination, but
> it's not atomic in the sense that it is serialised against all other
> potential modification operations against either the source or
> destination. Hence the "if newer" comparison is not part of the
> "atomic rename" operation that is supposedly being performed...
> 
> I'm also sceptical of the use of mtime - we can't rely on mtime to
> determine the newer file accurately on all filesystems. e.g. Some
> fileystems only have second granularity in their timestamps, so
> there's a big window where "newer" cannot actually be determined by
> timestamp comparisons.
> 
> /me is having flashbacks to the bad old days of NFS using inode
> timestamps for change ordering and cache consistency....
> 
> > In any case where you might be using this new rename flag, the target
> > file wouldn't be open for write, so the mtime wouldn't change.
> > The atomicity is really wanted to make sure the file at the destination
> > name is still the one that was expected (I think).
> 
> How would you document this, and how would the application be
> expected to handle such a "someone else has this open for write"
> error? There's nothing the app can do about the cause of the
> failure, so how is it expected to handle such an error?

I would document this by saying "The rename will fail if either file is open
for write access as in that case the mtimes cannot be considered to be
stable".

The app would respond as it would to any other unexpected error.

The expectation, as I understand it, is that this is used in situations
when the target file is only changed by a rename.  Someone else opening
the file for O_WRITE would be just as bad as someone else unlinking the
source file so renameat2 gets ENOENT.  In both cases, the new
renameat2() cannot do anything useful, so it shouldn't try.  Equally in
both cases the application cannot do anything useful.

NeilBrown


> 
> I'm not opposed to adding functionality like this, I'm just pointing
> out problems that I see arising from the insufficiently
> constrained/specified behaviour of the proposed functionality.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-28 21:56   ` James Yonan
@ 2022-06-29  5:15     ` Amir Goldstein
  2022-06-30 16:18       ` James Yonan
  0 siblings, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2022-06-29  5:15 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel, Linux API

On Wed, Jun 29, 2022 at 12:56 AM James Yonan <james@openvpn.net> wrote:
>
> On 6/28/22 03:46, Amir Goldstein wrote:
> > [+linux-api]
> >
> > On Tue, Jun 28, 2022 at 1:58 AM James Yonan <james@openvpn.net> wrote:
> >> RENAME_NEWER is a new userspace-visible flag for renameat2(), and
> >> stands alongside existing flags such as RENAME_NOREPLACE,
> >> RENAME_EXCHANGE, and RENAME_WHITEOUT.
> >>
> >> RENAME_NEWER is a conditional variation on RENAME_NOREPLACE, and
> >> indicates that if the target of the rename exists, the rename will
> >> only succeed if the source file is newer than the target (i.e. source
> >> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
> >> instead of replacing the target.  When the target doesn't exist,
> >> RENAME_NEWER does a plain rename like RENAME_NOREPLACE.
> >>
> >> RENAME_NEWER is very useful in distributed systems that mirror a
> >> directory structure, or use a directory as a key/value store, and need
> >> to guarantee that files will only be overwritten by newer files, and
> >> that all updates are atomic.
> > This feature sounds very cool.
> > For adding a new API it is always useful if you bring forward a userland
> > tool (rsync?) that intend to use it, preferably with a POC patch.
> > A concrete prospective user is always better than a hypothetical one.
> > Some people hold the opinion that only new APIs with real prospective
> > users should be merged.
>
> Not sure that rsync would be the canonical user, though it might be a
> reasonable POC.  The problem that we are solving is essentially
> near-real-time directory mirroring or replication of one source
> directory to many target directories on follower nodes.  Many writers,
> many readers, filesystem-based, strong guarantees of eventual
> convergence, infinitely scalable.  You have a source directory that
> could be an AWS S3 bucket.  You have potentially thousands of follower
> nodes that want to replicate the source directory on a local
> filesystem.  You have messages flying around the network (Kafka, AWS
> SQS, etc.) representing file updates.  These messages might be reordered
> or duplicated but each contains the file content and a unique
> nanosecond-scale timestamp.  Because the file update throughput can be
> in the thousands of files per second, you might have multiple threads on
> each node, receiving updates, and moving them into the target
> directory.  The only way to guarantee that the state of the mirror
> target directory on all nodes converges to the state of the source
> directory is to have the last-step move operation be atomic and
> conditional to the modification time (so that an earlier version of the
> file doesn't overwrite a later version).
>
> So I understand that there needs to be a strong case to extend the Linux
> API, and I think the argument is that conditional atomic file operations
> enable entirely new classes of applications.  The fact that this can be

Sure it does, but is the condition that you propose going to serve all
the prospect applications or just your application?
You may add to your arguments in favor of mtime that 'mv --update'
and 'rsync --update' could use the new flag.

> facilitated by essentially 5 lines of kernel code is remarkable.
>
> >
> >> While this patch may appear large at first glance, most of the changes
> >> deal with renameat2() flags validation, and the core logic is only
> >> 5 lines in the do_renameat2() function in fs/namei.c:
> >>
> >>          if ((flags & RENAME_NEWER)
> >>              && d_is_positive(new_dentry)
> >>              && timespec64_compare(&d_backing_inode(old_dentry)->i_mtime,
> >>                                    &d_backing_inode(new_dentry)->i_mtime) <= 0)
> >>                  goto exit5;
> >>
> > I have a few questions:
> > - Why mtime?
> > - Why not ctime?
> > - Shouldn't a feature like that protect from overwriting metadata changes
> >    to the destination file?
> >
> > In any case, I would be much more comfortable with comparing ctime
> > because when it comes to user settable times, what if rsync *wants*
> > to update the destination file's mtime to an earlier time that was set in
> > the rsync source?
> >
> > If comparing ctime does not fit your use case and you can convince
> > the community that comparing mtime is a justified use case, we would
> > need to use a flag name to reflect that like RENAME_NEWER_MTIME
> > so we don't block future use case of RENAME_NEWER_CTIME.
> > I hope that we can agree that ctime is enough and that mtime will not
> > make sense so we can settle with RENAME_NEWER that means ctime.
>
> So I actually think that mtime is the better timestamp to use because
> ctime is modified by the rename operation itself, while mtime measures
> the last modification time of the file content, which is what we care about.

That is fine. I am saying there are other use cases that replicate not
only data, but metadata too. Even Cloud files have metadata attached,
so one day, someone else may want to compare ctime, which makes
the work NEWER ambiguous and you need to disambiguate it with
something like RENAME_EXCHANGE_NEWER_MTIME.

>
> >
> >> It's pretty cool in a way that a new atomic file operation can even be
> >> implemented in just 5 lines of code, and it's thanks to the existing
> >> locking infrastructure around file rename/move that these operations
> >> become almost trivial.  Unfortunately, every fs must approve a new
> >> renameat2() flag, so it bloats the patch a bit.
> >>
> >> So one question to ask is could this functionality be implemented
> >> in userspace without adding a new renameat2() flag?  I think you
> >> could attempt it with iterative RENAME_EXCHANGE, but it's hackish,
> >> inefficient, and not atomic, because races could cause temporary
> >> mtime backtracks.  How about using file locking?  Probably not,
> >> because the problem we want to solve is maintaining file/directory
> >> atomicity for readers by creating files out-of-directory, setting
> >> their mtime, and atomically moving them into place.  The strategy
> >> to lock such an operation really requires more complex locking methods
> >> than are generally exposed to userspace.  And if you are using inotify
> >> on the directory to notify readers of changes, it certainly makes
> >> sense to reduce unnecessary churn by preventing a move operation
> >> based on the mtime check.
> >>
> >> While some people might question the utility of adding features to
> >> filesystems to make them more like databases, there is real value
> >> in the performance, atomicity, consistent VFS interface, multi-thread
> >> safety, and async-notify capabilities of modern filesystems that
> >> starts to blur the line, and actually make filesystem-based key-value
> >> stores a win for many applications.
> >>
> >> Like RENAME_NOREPLACE, the RENAME_NEWER implementation lives in
> >> the VFS, however the individual fs implementations do strict flags
> >> checking and will return -EINVAL for any flag they don't recognize.
> >> For this reason, my general approach with flags is to accept
> >> RENAME_NEWER wherever RENAME_NOREPLACE is also accepted, since
> >> RENAME_NEWER is simply a conditional variant of RENAME_NOREPLACE.
> > You are not taking into account that mtime may be cached attribute that is not
> > uptodate in network filesystems (fuse too) so behavior can be a bit
> > unpredictable,
> > unless filesystem code compares also the cache coherency of the attributes
> > on both files and even then, without extending the network protocol this is
> > questionable behavior for client side.
> > So I think your filter of which filesystems to enable is way too wide.
>
> So I'm new to the filesystem code, but my reading of do_renameat2() in
> fs/namei.c seems to indicate that if d_is_positive(dentry) is true, it's
> safe to access the inode struct via d_backing_inode(dentry) and
> dereference mtime.  But what you're saying is that the mtime value
> cached in the inode struct might not be up-to-date for network filesystems?
>

In the client's cache A may be newer than B, while on the server B is actually
newer, but that doesn't matter because...


> >
> > How many of them did you test, I'll take a wild guess that not all of them.
> >
> > Please do not enable RENAME_NEWER is any filesystem that you did
> > not test or that was not tested by some other fs developer using the
> > tests that you write.
>
> So I'm mostly interested in implementing this on local filesystems,

... so don't add the functionality to filesystems that you don't need to add to
and you do not test.

LTP all_filesystems will give you coverage for all the local fs that you should
care about. Other fs could add support for the new flag themselves if the
developers care and test it - that's not your job when adding a new API.

> because the application layer has already done the heavy lifting on the
> networking side so that the filesystem layer can be local, fast, and
> atomic.  So yes, I haven't tested this yet on networked filesystems.
> But I'm thinking that because all functionality is implemented at the
> VFS layer, it should be portable to any fs that also supports
> RENAME_NOREPLACE, with the caveat that it depends on the ability of the
> VFS to get a current and accurate mtime attribute inside the critical
> section between lock_rename() and unlock_rename().
>

The implementation is generic. You just implement the logic in the vfs and
enable it for a few tested filesystems and whoever wants to join the party
is welcome to test their own filesystems and opt-in to the new flag whether
they like. Nothing wrong with that.

w.r.t stability of i_mtime, if I am not mistaken i_mtime itself is
stable with inode
lock held (i.e. after lock_two_nondirectories()), however, as Dave pointed out,
the file's data can be modified in page cache, so as long as the file is open
for write or mmaped writable, the check of mtime is not atomic.

Neil's suggestion to deny the operation on open files makes sense.
You can use a variant of deny_write_access() that takes inode
which implies the error  ETXTBSY for an attempt to exchange newer
with a file that is open for write.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-29  5:15     ` Amir Goldstein
@ 2022-06-30 16:18       ` James Yonan
  0 siblings, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-06-30 16:18 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, Linux API

On 6/28/22 23:15, Amir Goldstein wrote:
>> because the application layer has already done the heavy lifting on the
>> networking side so that the filesystem layer can be local, fast, and
>> atomic.  So yes, I haven't tested this yet on networked filesystems.
>> But I'm thinking that because all functionality is implemented at the
>> VFS layer, it should be portable to any fs that also supports
>> RENAME_NOREPLACE, with the caveat that it depends on the ability of the
>> VFS to get a current and accurate mtime attribute inside the critical
>> section between lock_rename() and unlock_rename().
> The implementation is generic. You just implement the logic in the vfs and
> enable it for a few tested filesystems and whoever wants to join the party
> is welcome to test their own filesystems and opt-in to the new flag whether
> they like. Nothing wrong with that.
>
> w.r.t stability of i_mtime, if I am not mistaken i_mtime itself is
> stable with inode
> lock held (i.e. after lock_two_nondirectories()), however, as Dave pointed out,
> the file's data can be modified in page cache, so as long as the file is open
> for write or mmaped writable, the check of mtime is not atomic.
>
> Neil's suggestion to deny the operation on open files makes sense.
> You can use a variant of deny_write_access() that takes inode
> which implies the error  ETXTBSY for an attempt to exchange newer
> with a file that is open for write.

So I will incorporate these suggestions into an upcoming v2 patch.

Thanks,
James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace
  2022-06-29  2:35           ` Dave Chinner
  2022-06-29  2:49             ` NeilBrown
@ 2022-06-30 16:39             ` James Yonan
  2022-07-01  9:23               ` [PATCH v2] namei: implemented RENAME_NEWER_MTIME " James Yonan
  1 sibling, 1 reply; 24+ messages in thread
From: James Yonan @ 2022-06-30 16:39 UTC (permalink / raw)
  To: Dave Chinner, NeilBrown; +Cc: Amir Goldstein, Al Viro, linux-fsdevel

On 6/28/22 20:35, Dave Chinner wrote:
> How do you explain it the API
> semantics to an app developer that might want to use this
> functionality? RENAME_EXCHANGE_WITH_NEWER would be atomic in the
> sense you either get the old or new file at the destination, but
> it's not atomic in the sense that it is serialised against all other
> potential modification operations against either the source or
> destination. Hence the "if newer" comparison is not part of the
> "atomic rename" operation that is supposedly being performed...

So the current proposal based on feedback is to move the mtime 
comparison to vfs_rename() to take advantage of existing 
{lock,unlock}_two_nondirectories critical section, then nest another 
critical section {deny,allow}_write_access (adapted to inodes) to 
stabilize the mtime.  The proposed use case never needs to compare 
mtimes of files that are open for write, and the plan would be to return 
-ETXTBSY in this case.

> I'm also sceptical of the use of mtime - we can't rely on mtime to
> determine the newer file accurately on all filesystems. e.g. Some
> fileystems only have second granularity in their timestamps, so
> there's a big window where "newer" cannot actually be determined by
> timestamp comparisons.
So in the "use a directory as a key/value store" use case in distributed 
systems, the file mtime is generally determined remotely by the file 
content creator and is set locally via futimens() rather than the local 
system clock.  So this gives you nanosecond scale time resolution if the 
content creator supports it, even if the system clock has less resolution.

James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-06-30 16:39             ` James Yonan
@ 2022-07-01  9:23               ` James Yonan
  2022-07-01 10:34                 ` Amir Goldstein
  2022-07-02  8:07                 ` Dave Chinner
  0 siblings, 2 replies; 24+ messages in thread
From: James Yonan @ 2022-07-01  9:23 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, James Yonan

RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
stands alongside existing flags such as RENAME_NOREPLACE, RENAME_EXCHANGE,
and RENAME_WHITEOUT.

RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename or exchange
will only succeed if the source file is newer than the target (i.e. source
mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.

RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
conditional exchange, where the exchange only occurs if source mtime >
target mtime.  Otherwise, the operation will fail with -EEXIST.

RENAME_NEWER_MTIME is very useful in distributed systems that mirror a
directory structure, or use a directory as a key/value store, and need to
guarantee that files will only be overwritten by newer files, and that all
updates are atomic.

RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
write access to both source and target inodes before comparing their
mtimes, to stabilize the comparison.

So one question to ask is could this functionality be implemented in
userspace without adding a new renameat2() flag?  I think you could
attempt it with iterative RENAME_EXCHANGE, but it's hackish, inefficient,
and not atomic, because races could cause temporary mtime backtracks.
How about using file locking?  Probably not, because the problem we want
to solve is maintaining file/directory atomicity for readers by creating
files out-of-directory, setting their mtime, and atomically moving them
into place.  The strategy to lock such an operation really requires more
complex locking methods than are generally exposed to userspace.  And if
you are using inotify on the directory to notify readers of changes, it
certainly makes sense to reduce unnecessary churn by preventing a move
operation based on the mtime check.

While some people might question the utility of adding features to
filesystems to make them more like databases, there is real value in the
performance, atomicity, consistent VFS interface, multi-thread safety, and
async-notify capabilities of modern filesystems that starts to blur the
line, and actually make filesystem-based key-value stores a win for many
applications.

Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
the VFS, however the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
ext4, xfs, btrfs, and tmpfs.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
The self-test is written to be portable to the Linux Test Project,
and the advantage of running it there is that it automatically runs
tests on multiple filesystems.  See comments at the beginning of
renameat2_tests.c for more info.

Build and run the self-test with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Questions:

Q: Why use mtime and not ctime for timestamp comparison?

A: I think the "use a directory as a key/value store" use case
   cares about the modification time of the file content rather
   than metadata.  Also, the rename operation itself modifies
   ctime, making it less useful as a reference timestamp.
   In any event, this patch creates the infrastructure for
   conditional rename/exchange based on inode timestamp, so a
   subsequent patch to add RENAME_NEWER_CTIME would be a mostly
   trivial exercise.

Q: Why compare mtimes when some systems have poor system clock
   accuracy and resolution?

A: So in the "use a directory as a key/value store" use case
   in distributed systems, the file mtime is actually determined
   remotely by the file content creator and is set locally
   via futimens() rather than the local system clock.  So this gives
   you nanosecond-scale time resolution if the content creator
   supports it, even if the local system clock has less resolution.

Patch version history:

v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so
    as to disambiguate and make it clear that we are comparing
    mtime values.

    RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE
    for conditional exchange, where exchange only occurs if
    source mtime > target mtime.

    Moved the mtime comparison logic into vfs_rename() to take
    advantage of existing {lock,unlock}_two_nondirectories critical
    section, and then further nest another critical section
    {deny,allow}_write_access (adapted to inodes) to stabilize the
    mtime, since our use case doesn't require renaming files that
    are open for write (we will return -ETXTBSY in this case).

    Did some refactoring of inline functions in linux/fs.h that
    manage inode->i_writecount, and added inode_deny_write_access2()
    and inode_allow_write_access2() functions.

    Extended the self-test (renameat2_tests.c):

    1. Verify that RENAME_NEWER_MTIME fails with ETXTBSY when
       one of the files is open for write.

    2. Test conditional exchange use case with combined flags
       RENAME_EXCHANGE|RENAME_NEWER_MTIME.

    3. The test .c file is now drop-in portable to the Linux Test
       Project where you can take advantage of the .all_filesystems = 1
       flag to automatically run tests on multiple filesystems.

Signed-off-by: James Yonan <james@openvpn.net>
---
 Documentation/filesystems/vfs.rst             |   9 +
 fs/btrfs/inode.c                              |   2 +-
 fs/ext2/namei.c                               |   2 +-
 fs/ext4/namei.c                               |   2 +-
 fs/libfs.c                                    |   2 +-
 fs/namei.c                                    |  21 +-
 fs/xfs/xfs_iops.c                             |   2 +-
 include/linux/fs.h                            |  43 +-
 include/uapi/linux/fs.h                       |   1 +
 mm/shmem.c                                    |   2 +-
 tools/include/uapi/linux/fs.h                 |   1 +
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/renameat2/.gitignore  |   1 +
 tools/testing/selftests/renameat2/Makefile    |  10 +
 .../selftests/renameat2/renameat2_tests.c     | 447 ++++++++++++++++++
 15 files changed, 534 insertions(+), 12 deletions(-)
 create mode 100644 tools/testing/selftests/renameat2/.gitignore
 create mode 100644 tools/testing/selftests/renameat2/Makefile
 create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..9bee575c1f27 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,6 +515,15 @@ otherwise noted.
 	(2) RENAME_EXCHANGE: exchange source and target.  Both must
 	exist; this is checked by the VFS.  Unlike plain rename, source
 	and target may be of different type.
+	(3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE,
+	and indicates a conditional rename: if the target of the rename
+	exists, the rename should only succeed if the source file is newer
+	than the target (i.e. source mtime > target mtime).  Otherwise, the
+	rename should fail with -EEXIST instead of replacing the target.
+	To exchange source and target conditional on source being newer
+	than target, pass flags as RENAME_EXCHANGE|RENAME_NEWER_MTIME.
+	RENAME_NEWER_MTIME will fail with -ETXTBSY if either source or
+	target is open for write.
 
 ``get_link``
 	called by the VFS to follow a symbolic link to the inode it
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05e0c4a5affd..c6232ac0f0eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9549,7 +9549,7 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
 			 struct dentry *old_dentry, struct inode *new_dir,
 			 struct dentry *new_dentry, unsigned int flags)
 {
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 5f6b7560eb3f..35dc17f80528 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
 	struct ext2_dir_entry_2 * old_de;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = dquot_initialize(old_dir);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index db4ba99d1ceb..210537bb144b 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -4128,7 +4128,7 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/libfs.c b/fs/libfs.c
index 31b0ddf01c31..13e206f2cc58 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -479,7 +479,7 @@ int simple_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = d_is_dir(old_dentry);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..5a8940058c43 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -40,6 +40,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/time64.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
 
 	take_dentry_name_snapshot(&old_name, old_dentry);
 	dget(new_dentry);
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		lock_two_nondirectories(source, target);
 	else if (target)
 		inode_lock(target);
 
+	if ((flags & RENAME_NEWER_MTIME) && target) {
+		/* deny write access to stabilize mtime comparison below */
+		error = inode_deny_write_access2(source, target);
+		if (error) /* -ETXTBSY */
+			goto out1;
+		if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) {
+			error = -EEXIST;
+			goto out;
+		}
+	}
+
 	error = -EPERM;
 	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
 		goto out;
@@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd)
 			d_exchange(old_dentry, new_dentry);
 	}
 out:
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if ((flags & RENAME_NEWER_MTIME) && target)
+		inode_allow_write_access2(source, target);
+out1:
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		unlock_two_nondirectories(source, target);
 	else if (target)
 		inode_unlock(target);
@@ -4769,7 +4784,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	bool should_retry = false;
 	int error = -EINVAL;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
 		goto put_names;
 
 	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 29f5b8b8aca6..84efc0b02a3c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,7 +457,7 @@ xfs_vn_rename(
 	struct xfs_name	oname;
 	struct xfs_name	nname;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	/* if we are exchanging files, we need to set i_mode of both files */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..0c79f12ec51f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file)
  * use {get,deny}_write_access() - these functions check the sign and refuse
  * to do the change if sign is wrong.
  */
+static inline int inode_deny_write_access(struct inode *inode)
+{
+	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+}
+static inline void inode_allow_write_access(struct inode *inode)
+{
+	atomic_inc(&inode->i_writecount);
+}
 static inline int get_write_access(struct inode *inode)
 {
 	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
 static inline int deny_write_access(struct file *file)
 {
-	struct inode *inode = file_inode(file);
-	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+	return inode_deny_write_access(file_inode(file));
 }
 static inline void put_write_access(struct inode * inode)
 {
@@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode)
 static inline void allow_write_access(struct file *file)
 {
 	if (file)
-		atomic_inc(&file_inode(file)->i_writecount);
+		inode_allow_write_access(file_inode(file));
 }
 static inline bool inode_is_open_for_write(const struct inode *inode)
 {
 	return atomic_read(&inode->i_writecount) > 0;
 }
 
+/**
+ * inode_deny_write_access2 - deny write access on two inodes.
+ * Returns -ETXTBSY if write access cannot be denied on either inode.
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	int error = inode_deny_write_access(inode1);
+	if (error)
+		return error;
+	error = inode_deny_write_access(inode2);
+	if (error)
+		inode_allow_write_access(inode1);
+	return error;
+}
+
+/**
+ * inode_allow_write_access2 - allow write access on two inodes.
+ * This method is intended to be called after a successful call
+ * to inode_deny_write_access2().
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	inode_allow_write_access(inode1);
+	inode_allow_write_access(inode2);
+}
+
 #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
 static inline void i_readcount_dec(struct inode *inode)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..592de9245c80 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3009,7 +3009,7 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = S_ISDIR(inode->i_mode);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index de11992dc577..34226dfbca7a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -54,6 +54,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += renameat2
 TARGETS += resctrl
 TARGETS += rlimits
 TARGETS += rseq
diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
new file mode 100644
index 000000000000..79bbdf497559
--- /dev/null
+++ b/tools/testing/selftests/renameat2/.gitignore
@@ -0,0 +1 @@
+renameat2_tests
diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
new file mode 100644
index 000000000000..c39f5e281a5d
--- /dev/null
+++ b/tools/testing/selftests/renameat2/Makefile
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS = -g -Wall -O2
+CFLAGS += $(KHDR_INCLUDES)
+
+TEST_GEN_PROGS := renameat2_tests
+
+include ../lib.mk
+
+$(OUTPUT)/renameat2_tests: renameat2_tests.c
diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
new file mode 100644
index 000000000000..1fdb908cf428
--- /dev/null
+++ b/tools/testing/selftests/renameat2/renameat2_tests.c
@@ -0,0 +1,447 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Written by James Yonan <james@openvpn.net>
+ * Copyright (c) 2022 OpenVPN, Inc.
+ */
+
+/*
+ * Test renameat2() with RENAME_NOREPLACE, RENAME_EXCHANGE,
+ * and RENAME_NEWER_MTIME.
+ *
+ * This test is designed to be portable between
+ * the Linux kernel self-tests and the Linux Test Project.
+ * The cool thing about running the test in the Linux Test Project
+ * is that it will automatically iterate the test over all the
+ * filesystems available in your kernel.  In a default kernel,
+ * that includes ext2, ext3, ext4, xfs, btrfs, and tmpfs.
+ *
+ * By default we assume a Linux kernel self-test build, where
+ * you can build and run with:
+ *   make -C tools/testing/selftests TARGETS=renameat2 run_tests
+ *
+ * For a Linux Test Project build, place this source file
+ * under the ltp tree in:
+ *   testcases/kernel/syscalls/renameat2/renameat203.c
+ * Then cd to testcases/kernel/syscalls/renameat2 and add:
+ *   CPPFLAGS += -DLINUX_TEST_PROJECT
+ * to the end of the Makefile.  Then run with:
+ *   make && ./rename_newer_mtime
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <time.h>
+
+#ifdef LINUX_TEST_PROJECT
+#include "tst_test.h"
+#include "renameat2.h"
+#else
+#include "../kselftest_harness.h"
+#endif
+
+/* requires a kernel that implements renameat2() RENAME_NEWER_MTIME flag */
+#ifndef RENAME_NEWER_MTIME
+#define RENAME_NEWER_MTIME (1 << 3)
+#endif
+
+/* convert milliseconds to nanoseconds */
+#define MS_TO_NANO(x) ((x) * 1000000)
+
+#ifdef LINUX_TEST_PROJECT
+
+#define MNTPOINT "mntpoint"
+#define WORKDIR MNTPOINT "/testdir.XXXXXX"
+
+#define MY_ERROR(...) tst_brk(TFAIL, __VA_ARGS__)
+#define MY_PASS(...) tst_res(TPASS, __VA_ARGS__)
+
+#else /* Linux kernel self-test */
+
+#define WORKDIR "/tmp/ksft-renameat2-rename-newer-mtime.XXXXXX"
+
+#define MY_ERROR(fmt, ...) ksft_exit_fail_msg("%s/%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)
+#define MY_PASS(...)
+
+#endif
+
+static int create_file_with_timestamp(const char *filename,
+				      const time_t tv_sec,
+				      const long tv_nsec,
+				      struct stat *s,
+				      int *retain_fd)
+{
+	int fd;
+	struct timespec times[2];
+
+	fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777);
+	if (fd < 0)
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (futimens(fd, times)) {
+		close(fd);
+		return errno;
+	}
+	if (fstat(fd, s)) {
+		close(fd);
+		return errno;
+	}
+	if (retain_fd)
+		*retain_fd = fd;
+	else if (close(fd))
+		return errno;
+	return 0;
+}
+
+static int create_directory_with_timestamp(const char *dirname,
+					   const time_t tv_sec,
+					   const long tv_nsec,
+					   struct stat *s)
+{
+	struct timespec times[2];
+
+	if (mkdir(dirname, 0777))
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (utimensat(AT_FDCWD, dirname, times, 0) != 0)
+		return errno;
+	if (lstat(dirname, s))
+		return errno;
+	return 0;
+}
+
+static int do_rename(const char *source_path, const char *target_path,
+		     const unsigned int flags)
+{
+	if (renameat2(AT_FDCWD, source_path, AT_FDCWD, target_path, flags))
+		return errno;
+	return 0;
+}
+
+static int verify_inode(const char *path, const struct stat *orig_stat)
+{
+	struct stat s;
+
+	if (stat(path, &s))
+		return errno;
+	if (orig_stat->st_ino != s.st_ino)
+		return ENOENT;
+	return 0;
+}
+
+static int verify_exist(const char *path)
+{
+	int fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return errno;
+	if (close(fd) != 0)
+		return errno;
+	return 0;
+}
+
+static int fd_d = -1; /* retained fd from file "d" */
+
+/*
+ * Test renameat2() with RENAME_NEWER_MTIME, RENAME_NOREPLACE, and RENAME_EXCHANGE.
+ */
+static void do_rename_newer_mtime(void)
+{
+	char dirname[] = WORKDIR;
+	const time_t now = time(NULL);
+	struct stat stat_a, stat_b, stat_c, stat_d, stat_f; /* files */
+	struct stat stat_x, stat_y; /* directories */
+	int eno; /* copied errno */
+
+	/* fd_d initial state */
+	fd_d = -1;
+
+	/* make the top-level directory */
+	if (!mkdtemp(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to create tmpdir, errno=%d", eno);
+	}
+
+	/* cd to top-level directory */
+	if (chdir(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to cd to tmpdir, errno=%d", eno);
+	}
+
+	/* create files with different mtimes */
+	eno = create_file_with_timestamp("a", now, MS_TO_NANO(700), &stat_a, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'a', errno=%d", eno);
+	eno = create_file_with_timestamp("b", now+1, MS_TO_NANO(500), &stat_b, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'b', errno=%d", eno);
+	eno = create_file_with_timestamp("c", now+1, MS_TO_NANO(500), &stat_c, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'c', errno=%d", eno);
+	eno = create_file_with_timestamp("d", now+1, MS_TO_NANO(300), &stat_d, &fd_d); /* leave open */
+	if (eno)
+		MY_ERROR("failed to create file 'd', errno=%d", eno);
+	eno = create_file_with_timestamp("f", now, MS_TO_NANO(0), &stat_f, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'f', errno=%d", eno);
+
+	/* create directories with different mtimes */
+	eno = create_directory_with_timestamp("x", now+2, MS_TO_NANO(0), &stat_x);
+	if (eno)
+		MY_ERROR("failed to create directory 'x', errno=%d", eno);
+	eno = create_directory_with_timestamp("y", now+3, MS_TO_NANO(0), &stat_y);
+	if (eno)
+		MY_ERROR("failed to create directory 'y', errno=%d", eno);
+
+	/* rename b -> e with RENAME_NEWER_MTIME -- should succeed because e doesn't exist */
+	eno = do_rename("b", "e", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER_MTIME feature)", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after rename 'b' -> 'e', errno=%d", eno);
+	eno = verify_exist("b");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'b' still exists after rename 'b' -> 'e', errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME -- should fail because c and e have
+	 * the same timestamp
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EEXIST because 'c' and 'e' have the same timestamp, errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'c' -> 'e', errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted rename 'c' -> 'e', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NOREPLACE -- should fail because c exists */
+	eno = do_rename("a", "c", RENAME_NOREPLACE);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed because 'c' exists, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NEWER_MTIME -- should fail because c is newer than a */
+	eno = do_rename("a", "c", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed with EEXIST because 'c' is newer, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename c -> a with RENAME_NEWER_MTIME -- should succeed because c is newer than a */
+	eno = do_rename("c", "a", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d", eno);
+	eno = verify_inode("a", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after rename 'c' -> 'a', errno=%d", eno);
+	eno = verify_exist("c");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'c' still exists after rename 'c' -> 'a', errno=%d", eno);
+
+	/* exchange f <-> nonexistent with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * only f exists
+	 */
+	eno = do_rename("f", "nonexistent", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ENOENT)
+		MY_ERROR("exchange 'f' <-> 'nonexistent' should have failed with ENOENT, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'nonexistent', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'd' <-> 'f' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange e <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("e", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'e' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write but also because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY) /* note in this case we get ETXTBSY first (EEXIST would have
+			     * been returned if d wasn't open for write)
+			     */
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (write), errno=%d", eno);
+	}
+
+	/* reopen "d" for read access, which should not prevent RENAME_NEWER_MTIME */
+	fd_d = open("d", O_RDONLY);
+	if (fd_d < 0)
+		MY_ERROR("error reopening 'd' for read, errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail
+	 * because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with EEXIST because f is older than d, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* double exchange f <-> d with RENAME_EXCHANGE -- should succeed */
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed
+	 * because d is newer than f and fd_d is now read-only
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("exchange 'd' <-> 'f' failed, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange directories x <-> y with RENAME_EXCHANGE|RENAME_NEWER_MTIME
+	 * -- should fail because x is older than y
+	 */
+	eno = do_rename("x", "y", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("exchange 'x' <-> 'y' should have failed with EEXIST because x is older than y, errno=%d", eno);
+	eno = verify_inode("x", &stat_x);
+	if (eno)
+		MY_ERROR("could not verify inode of 'x' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+	eno = verify_inode("y", &stat_y);
+	if (eno)
+		MY_ERROR("could not verify inode of 'y' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+
+	/* exchange directories y <-> x with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed */
+	eno = do_rename("y", "x", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("exchange 'y' <-> 'x' failed, errno=%d", eno);
+	eno = verify_inode("x", &stat_y);
+	if (eno)
+		MY_ERROR("could not verify inode of 'x' after exchange 'y' <-> 'x', errno=%d", eno);
+	eno = verify_inode("y", &stat_x);
+	if (eno)
+		MY_ERROR("could not verify inode of 'y' after exchange 'y' <-> 'x', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (read), errno=%d", eno);
+	}
+	fd_d = -1;
+
+	MY_PASS("rename_newer_mtime test passed, workdir=%s", dirname);
+}
+
+#ifdef LINUX_TEST_PROJECT
+
+static void setup(void)
+{
+}
+
+static void cleanup(void)
+{
+	/* close fd_d */
+	if (fd_d >= 0)
+		close(fd_d);
+}
+
+static struct tst_test test = {
+	.test_all = do_rename_newer_mtime,
+	.setup = setup,
+	.cleanup = cleanup,
+	.needs_root = 1,
+	.all_filesystems = 1,
+	.mount_device = 1,
+	.mntpoint = MNTPOINT,
+	.skip_filesystems = (const char*[]) {
+		"exfat",
+		"ntfs",
+		"vfat",
+		NULL
+	},
+	.needs_cmds = NULL,
+};
+
+#else /* Linux kernel self-test */
+
+TEST(rename_newer_mtime)
+{
+	do_rename_newer_mtime();
+}
+
+TEST_HARNESS_MAIN
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-01  9:23               ` [PATCH v2] namei: implemented RENAME_NEWER_MTIME " James Yonan
@ 2022-07-01 10:34                 ` Amir Goldstein
  2022-07-01 20:06                   ` James Yonan
  2022-07-02  8:07                 ` Dave Chinner
  1 sibling, 1 reply; 24+ messages in thread
From: Amir Goldstein @ 2022-07-01 10:34 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel, Dave Chinner, Neil Brown, Al Viro

On Fri, Jul 1, 2022 at 12:24 PM James Yonan <james@openvpn.net> wrote:
>
> RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags such as RENAME_NOREPLACE, RENAME_EXCHANGE,
> and RENAME_WHITEOUT.
>
> RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename or exchange
> will only succeed if the source file is newer than the target (i.e. source
> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
> instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.
>
> RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
> conditional exchange, where the exchange only occurs if source mtime >
> target mtime.  Otherwise, the operation will fail with -EEXIST.
>
> RENAME_NEWER_MTIME is very useful in distributed systems that mirror a
> directory structure, or use a directory as a key/value store, and need to
> guarantee that files will only be overwritten by newer files, and that all
> updates are atomic.
>
> RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
> write access to both source and target inodes before comparing their
> mtimes, to stabilize the comparison.
>
> So one question to ask is could this functionality be implemented in
> userspace without adding a new renameat2() flag?  I think you could
> attempt it with iterative RENAME_EXCHANGE, but it's hackish, inefficient,
> and not atomic, because races could cause temporary mtime backtracks.
> How about using file locking?  Probably not, because the problem we want
> to solve is maintaining file/directory atomicity for readers by creating
> files out-of-directory, setting their mtime, and atomically moving them
> into place.  The strategy to lock such an operation really requires more
> complex locking methods than are generally exposed to userspace.  And if
> you are using inotify on the directory to notify readers of changes, it
> certainly makes sense to reduce unnecessary churn by preventing a move
> operation based on the mtime check.
>
> While some people might question the utility of adding features to
> filesystems to make them more like databases, there is real value in the
> performance, atomicity, consistent VFS interface, multi-thread safety, and
> async-notify capabilities of modern filesystems that starts to blur the
> line, and actually make filesystem-based key-value stores a win for many
> applications.
>
> Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
> ext4, xfs, btrfs, and tmpfs.
>
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests), so I created one, though
> at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
> The self-test is written to be portable to the Linux Test Project,
> and the advantage of running it there is that it automatically runs
> tests on multiple filesystems.  See comments at the beginning of
> renameat2_tests.c for more info.
>
> Build and run the self-test with:
>
>   make -C tools/testing/selftests TARGETS=renameat2 run_tests
>
> Questions:
>
> Q: Why use mtime and not ctime for timestamp comparison?
>
> A: I think the "use a directory as a key/value store" use case
>    cares about the modification time of the file content rather
>    than metadata.  Also, the rename operation itself modifies
>    ctime, making it less useful as a reference timestamp.
>    In any event, this patch creates the infrastructure for
>    conditional rename/exchange based on inode timestamp, so a
>    subsequent patch to add RENAME_NEWER_CTIME would be a mostly
>    trivial exercise.
>
> Q: Why compare mtimes when some systems have poor system clock
>    accuracy and resolution?
>
> A: So in the "use a directory as a key/value store" use case
>    in distributed systems, the file mtime is actually determined
>    remotely by the file content creator and is set locally
>    via futimens() rather than the local system clock.  So this gives
>    you nanosecond-scale time resolution if the content creator
>    supports it, even if the local system clock has less resolution.
>
> Patch version history:

This usually comes after --- line

The criteria is to judge whether this information is going to be
useful for the person reading the git log after merge.

>
> v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so
>     as to disambiguate and make it clear that we are comparing
>     mtime values.
>
>     RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE
>     for conditional exchange, where exchange only occurs if
>     source mtime > target mtime.
>
>     Moved the mtime comparison logic into vfs_rename() to take
>     advantage of existing {lock,unlock}_two_nondirectories critical
>     section, and then further nest another critical section
>     {deny,allow}_write_access (adapted to inodes) to stabilize the
>     mtime, since our use case doesn't require renaming files that
>     are open for write (we will return -ETXTBSY in this case).
>
>     Did some refactoring of inline functions in linux/fs.h that
>     manage inode->i_writecount, and added inode_deny_write_access2()
>     and inode_allow_write_access2() functions.
>
>     Extended the self-test (renameat2_tests.c):
>
>     1. Verify that RENAME_NEWER_MTIME fails with ETXTBSY when
>        one of the files is open for write.
>
>     2. Test conditional exchange use case with combined flags
>        RENAME_EXCHANGE|RENAME_NEWER_MTIME.
>
>     3. The test .c file is now drop-in portable to the Linux Test
>        Project where you can take advantage of the .all_filesystems = 1
>        flag to automatically run tests on multiple filesystems.
>
> Signed-off-by: James Yonan <james@openvpn.net>
> ---
>  Documentation/filesystems/vfs.rst             |   9 +
>  fs/btrfs/inode.c                              |   2 +-
>  fs/ext2/namei.c                               |   2 +-
>  fs/ext4/namei.c                               |   2 +-
>  fs/libfs.c                                    |   2 +-
>  fs/namei.c                                    |  21 +-
>  fs/xfs/xfs_iops.c                             |   2 +-
>  include/linux/fs.h                            |  43 +-
>  include/uapi/linux/fs.h                       |   1 +
>  mm/shmem.c                                    |   2 +-
>  tools/include/uapi/linux/fs.h                 |   1 +
>  tools/testing/selftests/Makefile              |   1 +
>  tools/testing/selftests/renameat2/.gitignore  |   1 +
>  tools/testing/selftests/renameat2/Makefile    |  10 +
>  .../selftests/renameat2/renameat2_tests.c     | 447 ++++++++++++++++++
>  15 files changed, 534 insertions(+), 12 deletions(-)
>  create mode 100644 tools/testing/selftests/renameat2/.gitignore
>  create mode 100644 tools/testing/selftests/renameat2/Makefile
>  create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c
>
> diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
> index 08069ecd49a6..9bee575c1f27 100644
> --- a/Documentation/filesystems/vfs.rst
> +++ b/Documentation/filesystems/vfs.rst
> @@ -515,6 +515,15 @@ otherwise noted.
>         (2) RENAME_EXCHANGE: exchange source and target.  Both must
>         exist; this is checked by the VFS.  Unlike plain rename, source
>         and target may be of different type.
> +       (3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE,
> +       and indicates a conditional rename: if the target of the rename
> +       exists, the rename should only succeed if the source file is newer
> +       than the target (i.e. source mtime > target mtime).  Otherwise, the
> +       rename should fail with -EEXIST instead of replacing the target.
> +       To exchange source and target conditional on source being newer
> +       than target, pass flags as RENAME_EXCHANGE|RENAME_NEWER_MTIME.
> +       RENAME_NEWER_MTIME will fail with -ETXTBSY if either source or
> +       target is open for write.
>
>  ``get_link``
>         called by the VFS to follow a symbolic link to the inode it
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 05e0c4a5affd..c6232ac0f0eb 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -9549,7 +9549,7 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
>                          struct dentry *old_dentry, struct inode *new_dir,
>                          struct dentry *new_dentry, unsigned int flags)
>  {
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))

This line may be "legally long" according to checkpatch but for no
good reason IMO.
Please break it and other similar long lines.

>                 return -EINVAL;
>
>         if (flags & RENAME_EXCHANGE)
> diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
> index 5f6b7560eb3f..35dc17f80528 100644
> --- a/fs/ext2/namei.c
> +++ b/fs/ext2/namei.c
> @@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
>         struct ext2_dir_entry_2 * old_de;
>         int err;
>
> -       if (flags & ~RENAME_NOREPLACE)
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME))
>                 return -EINVAL;
>
>         err = dquot_initialize(old_dir);
> diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
> index db4ba99d1ceb..210537bb144b 100644
> --- a/fs/ext4/namei.c
> +++ b/fs/ext4/namei.c
> @@ -4128,7 +4128,7 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
>         if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
>                 return -EIO;
>
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
>                 return -EINVAL;
>
>         err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
> diff --git a/fs/libfs.c b/fs/libfs.c
> index 31b0ddf01c31..13e206f2cc58 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -479,7 +479,7 @@ int simple_rename(struct user_namespace *mnt_userns, struct inode *old_dir,
>         struct inode *inode = d_inode(old_dentry);
>         int they_are_dirs = d_is_dir(old_dentry);
>
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER_MTIME))
>                 return -EINVAL;

Leftover?
I don't think that simple fs have meaningful mtime
and they are not tested.

>
>         if (flags & RENAME_EXCHANGE)
> diff --git a/fs/namei.c b/fs/namei.c
> index 1f28d3f463c3..5a8940058c43 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -40,6 +40,7 @@
>  #include <linux/bitops.h>
>  #include <linux/init_task.h>
>  #include <linux/uaccess.h>
> +#include <linux/time64.h>
>
>  #include "internal.h"
>  #include "mount.h"
> @@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
>
>         take_dentry_name_snapshot(&old_name, old_dentry);
>         dget(new_dentry);
> -       if (!is_dir || (flags & RENAME_EXCHANGE))
> +       if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
>                 lock_two_nondirectories(source, target);
>         else if (target)
>                 inode_lock(target);
>
> +       if ((flags & RENAME_NEWER_MTIME) && target) {
> +               /* deny write access to stabilize mtime comparison below */
> +               error = inode_deny_write_access2(source, target);

This is not needed for non regular file, but I guess it doesn't hurt...
You could do a helper lock_two_inodes_deny_write() that takes
care of both inode_lock() and inode_deny_write_access() and
call it instead of lock_two_nondirectories() above.

Then the lock and unlock routines would be more straightforward
and less error prone, e.g.:

-       if (!is_dir || (flags & RENAME_EXCHANGE))
+       if (flags & RENAME_NEWER_MTIME)
+               lock_two_inodes_deny_write(source, target);
+       else if (!is_dir || (flags & (RENAME_EXCHANGE)))
                lock_two_nondirectories(source, target);

...

out:
-       if (!is_dir || (flags & RENAME_EXCHANGE))
+       if (flags & RENAME_NEWER_MTIME)
+               unlock_two_inodes_allow_write(source, target);
+       else if (!is_dir || (flags & (RENAME_EXCHANGE)))
                unlock_two_nondirectories(source, target);

OTOH, for directory, inode_lock is needed to stabilize mtime and
lock_two_nondirectories() doesn't do that for you and it is also
non trivial to get the locking order and lockdep annotations correct.

Since you don't have a use case for RENAME_NEWER_MTIME and
directories (?), maybe the easier way around this would be to deny that
earlier in do_renameat2() with -EISDIR.


> +               if (error) /* -ETXTBSY */
> +                       goto out1;
> +               if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) {
> +                       error = -EEXIST;
> +                       goto out;
> +               }
> +       }
> +
>         error = -EPERM;
>         if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
>                 goto out;
> @@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd)
>                         d_exchange(old_dentry, new_dentry);
>         }
>  out:
> -       if (!is_dir || (flags & RENAME_EXCHANGE))
> +       if ((flags & RENAME_NEWER_MTIME) && target)
> +               inode_allow_write_access2(source, target);
> +out1:
> +       if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
>                 unlock_two_nondirectories(source, target);
>         else if (target)
>                 inode_unlock(target);
> @@ -4769,7 +4784,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
>         bool should_retry = false;
>         int error = -EINVAL;
>
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
>                 goto put_names;
>
>         if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 29f5b8b8aca6..84efc0b02a3c 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -457,7 +457,7 @@ xfs_vn_rename(
>         struct xfs_name oname;
>         struct xfs_name nname;
>
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
>                 return -EINVAL;
>
>         /* if we are exchanging files, we need to set i_mode of both files */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 9ad5e3520fae..0c79f12ec51f 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file)
>   * use {get,deny}_write_access() - these functions check the sign and refuse
>   * to do the change if sign is wrong.
>   */
> +static inline int inode_deny_write_access(struct inode *inode)
> +{
> +       return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
> +}
> +static inline void inode_allow_write_access(struct inode *inode)
> +{
> +       atomic_inc(&inode->i_writecount);
> +}
>  static inline int get_write_access(struct inode *inode)
>  {
>         return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
>  }
>  static inline int deny_write_access(struct file *file)
>  {
> -       struct inode *inode = file_inode(file);
> -       return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
> +       return inode_deny_write_access(file_inode(file));
>  }
>  static inline void put_write_access(struct inode * inode)
>  {
> @@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode)
>  static inline void allow_write_access(struct file *file)
>  {
>         if (file)
> -               atomic_inc(&file_inode(file)->i_writecount);
> +               inode_allow_write_access(file_inode(file));
>  }
>  static inline bool inode_is_open_for_write(const struct inode *inode)
>  {
>         return atomic_read(&inode->i_writecount) > 0;
>  }
>
> +/**
> + * inode_deny_write_access2 - deny write access on two inodes.
> + * Returns -ETXTBSY if write access cannot be denied on either inode.
> + * @inode1: first inode
> + * @inode2: second inode
> + */
> +static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2)
> +{
> +       int error = inode_deny_write_access(inode1);
> +       if (error)
> +               return error;
> +       error = inode_deny_write_access(inode2);
> +       if (error)
> +               inode_allow_write_access(inode1);
> +       return error;
> +}
> +
> +/**
> + * inode_allow_write_access2 - allow write access on two inodes.
> + * This method is intended to be called after a successful call
> + * to inode_deny_write_access2().
> + * @inode1: first inode
> + * @inode2: second inode
> + */
> +static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2)
> +{
> +       inode_allow_write_access(inode1);
> +       inode_allow_write_access(inode2);
> +}
> +
>  #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
>  static inline void i_readcount_dec(struct inode *inode)
>  {
> diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
> index bdf7b404b3e7..7e9c32dce3e4 100644
> --- a/include/uapi/linux/fs.h
> +++ b/include/uapi/linux/fs.h
> @@ -50,6 +50,7 @@
>  #define RENAME_NOREPLACE       (1 << 0)        /* Don't overwrite target */
>  #define RENAME_EXCHANGE                (1 << 1)        /* Exchange source and dest */
>  #define RENAME_WHITEOUT                (1 << 2)        /* Whiteout source */
> +#define RENAME_NEWER_MTIME     (1 << 3)        /* Only newer file can overwrite target */
>
>  struct file_clone_range {
>         __s64 src_fd;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index a6f565308133..592de9245c80 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3009,7 +3009,7 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
>         struct inode *inode = d_inode(old_dentry);
>         int they_are_dirs = S_ISDIR(inode->i_mode);
>
> -       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
> +       if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME))
>                 return -EINVAL;
>
>         if (flags & RENAME_EXCHANGE)
> diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
> index bdf7b404b3e7..7e9c32dce3e4 100644
> --- a/tools/include/uapi/linux/fs.h
> +++ b/tools/include/uapi/linux/fs.h
> @@ -50,6 +50,7 @@
>  #define RENAME_NOREPLACE       (1 << 0)        /* Don't overwrite target */
>  #define RENAME_EXCHANGE                (1 << 1)        /* Exchange source and dest */
>  #define RENAME_WHITEOUT                (1 << 2)        /* Whiteout source */
> +#define RENAME_NEWER_MTIME     (1 << 3)        /* Only newer file can overwrite target */
>
>  struct file_clone_range {
>         __s64 src_fd;
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index de11992dc577..34226dfbca7a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -54,6 +54,7 @@ TARGETS += proc
>  TARGETS += pstore
>  TARGETS += ptrace
>  TARGETS += openat2
> +TARGETS += renameat2
>  TARGETS += resctrl
>  TARGETS += rlimits
>  TARGETS += rseq
> diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
> new file mode 100644
> index 000000000000..79bbdf497559
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/.gitignore
> @@ -0,0 +1 @@
> +renameat2_tests
> diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
> new file mode 100644
> index 000000000000..c39f5e281a5d
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/Makefile
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +CFLAGS = -g -Wall -O2
> +CFLAGS += $(KHDR_INCLUDES)
> +
> +TEST_GEN_PROGS := renameat2_tests
> +
> +include ../lib.mk
> +
> +$(OUTPUT)/renameat2_tests: renameat2_tests.c
> diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
> new file mode 100644
> index 000000000000..1fdb908cf428
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/renameat2_tests.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Written by James Yonan <james@openvpn.net>
> + * Copyright (c) 2022 OpenVPN, Inc.
> + */
> +
> +/*
> + * Test renameat2() with RENAME_NOREPLACE, RENAME_EXCHANGE,
> + * and RENAME_NEWER_MTIME.
> + *
> + * This test is designed to be portable between
> + * the Linux kernel self-tests and the Linux Test Project.
> + * The cool thing about running the test in the Linux Test Project
> + * is that it will automatically iterate the test over all the
> + * filesystems available in your kernel.  In a default kernel,
> + * that includes ext2, ext3, ext4, xfs, btrfs, and tmpfs.
> + *
> + * By default we assume a Linux kernel self-test build, where
> + * you can build and run with:
> + *   make -C tools/testing/selftests TARGETS=renameat2 run_tests
> + *
> + * For a Linux Test Project build, place this source file
> + * under the ltp tree in:
> + *   testcases/kernel/syscalls/renameat2/renameat203.c
> + * Then cd to testcases/kernel/syscalls/renameat2 and add:
> + *   CPPFLAGS += -DLINUX_TEST_PROJECT
> + * to the end of the Makefile.  Then run with:
> + *   make && ./rename_newer_mtime
> + */
> +
> +#ifndef _GNU_SOURCE
> +#define _GNU_SOURCE
> +#endif
> +
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/types.h>
> +#include <sys/stat.h>
> +#include <unistd.h>
> +#include <fcntl.h>
> +#include <time.h>
> +
> +#ifdef LINUX_TEST_PROJECT
> +#include "tst_test.h"
> +#include "renameat2.h"
> +#else
> +#include "../kselftest_harness.h"
> +#endif
> +
> +/* requires a kernel that implements renameat2() RENAME_NEWER_MTIME flag */
> +#ifndef RENAME_NEWER_MTIME
> +#define RENAME_NEWER_MTIME (1 << 3)
> +#endif
> +
> +/* convert milliseconds to nanoseconds */
> +#define MS_TO_NANO(x) ((x) * 1000000)
> +
> +#ifdef LINUX_TEST_PROJECT
> +
> +#define MNTPOINT "mntpoint"
> +#define WORKDIR MNTPOINT "/testdir.XXXXXX"
> +
> +#define MY_ERROR(...) tst_brk(TFAIL, __VA_ARGS__)
> +#define MY_PASS(...) tst_res(TPASS, __VA_ARGS__)
> +
> +#else /* Linux kernel self-test */
> +
> +#define WORKDIR "/tmp/ksft-renameat2-rename-newer-mtime.XXXXXX"
> +
> +#define MY_ERROR(fmt, ...) ksft_exit_fail_msg("%s/%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)
> +#define MY_PASS(...)
> +
> +#endif
> +
> +static int create_file_with_timestamp(const char *filename,
> +                                     const time_t tv_sec,
> +                                     const long tv_nsec,
> +                                     struct stat *s,
> +                                     int *retain_fd)
> +{
> +       int fd;
> +       struct timespec times[2];
> +
> +       fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777);
> +       if (fd < 0)
> +               return errno;
> +       times[0].tv_sec = tv_sec;
> +       times[0].tv_nsec = tv_nsec;
> +       times[1] = times[0];
> +       if (futimens(fd, times)) {
> +               close(fd);
> +               return errno;
> +       }
> +       if (fstat(fd, s)) {
> +               close(fd);
> +               return errno;
> +       }
> +       if (retain_fd)
> +               *retain_fd = fd;
> +       else if (close(fd))
> +               return errno;
> +       return 0;
> +}
> +
> +static int create_directory_with_timestamp(const char *dirname,
> +                                          const time_t tv_sec,
> +                                          const long tv_nsec,
> +                                          struct stat *s)
> +{
> +       struct timespec times[2];
> +
> +       if (mkdir(dirname, 0777))
> +               return errno;
> +       times[0].tv_sec = tv_sec;
> +       times[0].tv_nsec = tv_nsec;
> +       times[1] = times[0];
> +       if (utimensat(AT_FDCWD, dirname, times, 0) != 0)
> +               return errno;
> +       if (lstat(dirname, s))
> +               return errno;
> +       return 0;
> +}
> +
> +static int do_rename(const char *source_path, const char *target_path,
> +                    const unsigned int flags)
> +{
> +       if (renameat2(AT_FDCWD, source_path, AT_FDCWD, target_path, flags))
> +               return errno;
> +       return 0;
> +}
> +
> +static int verify_inode(const char *path, const struct stat *orig_stat)
> +{
> +       struct stat s;
> +
> +       if (stat(path, &s))
> +               return errno;
> +       if (orig_stat->st_ino != s.st_ino)
> +               return ENOENT;
> +       return 0;
> +}
> +
> +static int verify_exist(const char *path)
> +{
> +       int fd;
> +
> +       fd = open(path, O_RDONLY);
> +       if (fd < 0)
> +               return errno;
> +       if (close(fd) != 0)
> +               return errno;
> +       return 0;
> +}
> +
> +static int fd_d = -1; /* retained fd from file "d" */
> +
> +/*
> + * Test renameat2() with RENAME_NEWER_MTIME, RENAME_NOREPLACE, and RENAME_EXCHANGE.
> + */
> +static void do_rename_newer_mtime(void)
> +{
> +       char dirname[] = WORKDIR;
> +       const time_t now = time(NULL);
> +       struct stat stat_a, stat_b, stat_c, stat_d, stat_f; /* files */
> +       struct stat stat_x, stat_y; /* directories */
> +       int eno; /* copied errno */
> +
> +       /* fd_d initial state */
> +       fd_d = -1;
> +
> +       /* make the top-level directory */
> +       if (!mkdtemp(dirname)) {
> +               eno = errno;
> +               MY_ERROR("failed to create tmpdir, errno=%d", eno);
> +       }
> +
> +       /* cd to top-level directory */
> +       if (chdir(dirname)) {
> +               eno = errno;
> +               MY_ERROR("failed to cd to tmpdir, errno=%d", eno);
> +       }
> +
> +       /* create files with different mtimes */
> +       eno = create_file_with_timestamp("a", now, MS_TO_NANO(700), &stat_a, NULL);
> +       if (eno)
> +               MY_ERROR("failed to create file 'a', errno=%d", eno);
> +       eno = create_file_with_timestamp("b", now+1, MS_TO_NANO(500), &stat_b, NULL);
> +       if (eno)
> +               MY_ERROR("failed to create file 'b', errno=%d", eno);
> +       eno = create_file_with_timestamp("c", now+1, MS_TO_NANO(500), &stat_c, NULL);
> +       if (eno)
> +               MY_ERROR("failed to create file 'c', errno=%d", eno);
> +       eno = create_file_with_timestamp("d", now+1, MS_TO_NANO(300), &stat_d, &fd_d); /* leave open */
> +       if (eno)
> +               MY_ERROR("failed to create file 'd', errno=%d", eno);
> +       eno = create_file_with_timestamp("f", now, MS_TO_NANO(0), &stat_f, NULL);
> +       if (eno)
> +               MY_ERROR("failed to create file 'f', errno=%d", eno);
> +
> +       /* create directories with different mtimes */
> +       eno = create_directory_with_timestamp("x", now+2, MS_TO_NANO(0), &stat_x);
> +       if (eno)
> +               MY_ERROR("failed to create directory 'x', errno=%d", eno);
> +       eno = create_directory_with_timestamp("y", now+3, MS_TO_NANO(0), &stat_y);
> +       if (eno)
> +               MY_ERROR("failed to create directory 'y', errno=%d", eno);
> +
> +       /* rename b -> e with RENAME_NEWER_MTIME -- should succeed because e doesn't exist */
> +       eno = do_rename("b", "e", RENAME_NEWER_MTIME);
> +       if (eno)
> +               MY_ERROR("failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER_MTIME feature)", eno);
> +       eno = verify_inode("e", &stat_b);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'e' after rename 'b' -> 'e', errno=%d", eno);
> +       eno = verify_exist("b");
> +       if (eno != ENOENT)
> +               MY_ERROR("strangely, 'b' still exists after rename 'b' -> 'e', errno=%d", eno);
> +
> +       /* rename c -> e with RENAME_NEWER_MTIME -- should fail because c and e have
> +        * the same timestamp
> +        */
> +       eno = do_rename("c", "e", RENAME_NEWER_MTIME);
> +       if (eno != EEXIST)
> +               MY_ERROR("rename 'c' -> 'e' should have failed with EEXIST because 'c' and 'e' have the same timestamp, errno=%d", eno);
> +       eno = verify_inode("c", &stat_c);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'c' after attempted rename 'c' -> 'e', errno=%d", eno);
> +       eno = verify_inode("e", &stat_b);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'e' after attempted rename 'c' -> 'e', errno=%d", eno);
> +
> +       /* rename a -> c with RENAME_NOREPLACE -- should fail because c exists */
> +       eno = do_rename("a", "c", RENAME_NOREPLACE);
> +       if (eno != EEXIST)
> +               MY_ERROR("rename 'a' -> 'c' should have failed because 'c' exists, errno=%d", eno);
> +       eno = verify_inode("a", &stat_a);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
> +       eno = verify_inode("c", &stat_c);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
> +
> +       /* rename a -> c with RENAME_NEWER_MTIME -- should fail because c is newer than a */
> +       eno = do_rename("a", "c", RENAME_NEWER_MTIME);
> +       if (eno != EEXIST)
> +               MY_ERROR("rename 'a' -> 'c' should have failed with EEXIST because 'c' is newer, errno=%d", eno);
> +       eno = verify_inode("a", &stat_a);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
> +       eno = verify_inode("c", &stat_c);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
> +
> +       /* rename c -> a with RENAME_NEWER_MTIME -- should succeed because c is newer than a */
> +       eno = do_rename("c", "a", RENAME_NEWER_MTIME);
> +       if (eno)
> +               MY_ERROR("rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d", eno);
> +       eno = verify_inode("a", &stat_c);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'a' after rename 'c' -> 'a', errno=%d", eno);
> +       eno = verify_exist("c");
> +       if (eno != ENOENT)
> +               MY_ERROR("strangely, 'c' still exists after rename 'c' -> 'a', errno=%d", eno);
> +
> +       /* exchange f <-> nonexistent with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
> +        * only f exists
> +        */
> +       eno = do_rename("f", "nonexistent", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != ENOENT)
> +               MY_ERROR("exchange 'f' <-> 'nonexistent' should have failed with ENOENT, errno=%d", eno);
> +       eno = verify_inode("f", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'nonexistent', errno=%d", eno);
> +
> +       /* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
> +        * d is open for write
> +        */
> +       eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != ETXTBSY)
> +               MY_ERROR("exchange 'd' <-> 'f' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
> +       eno = verify_inode("d", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after attempted exchange 'd' <-> 'f', errno=%d", eno);
> +       eno = verify_inode("f", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after attempted exchange 'd' <-> 'f', errno=%d", eno);
> +
> +       /* exchange e <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
> +        * d is open for write
> +        */
> +       eno = do_rename("e", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != ETXTBSY)
> +               MY_ERROR("exchange 'e' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
> +       eno = verify_inode("e", &stat_b);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'e' after attempted exchange 'e' <-> 'd', errno=%d", eno);
> +       eno = verify_inode("d", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after attempted exchange 'e' <-> 'd', errno=%d", eno);
> +
> +       /* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
> +        * d is open for write but also because f is older than d
> +        */
> +       eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != ETXTBSY) /* note in this case we get ETXTBSY first (EEXIST would have
> +                            * been returned if d wasn't open for write)
> +                            */
> +               MY_ERROR("exchange 'f' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
> +       eno = verify_inode("f", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
> +       eno = verify_inode("d", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
> +
> +       /* close fd_d */
> +       if (close(fd_d) != 0) {
> +               eno = errno;
> +               MY_ERROR("error closing fd_d (write), errno=%d", eno);
> +       }
> +
> +       /* reopen "d" for read access, which should not prevent RENAME_NEWER_MTIME */
> +       fd_d = open("d", O_RDONLY);
> +       if (fd_d < 0)
> +               MY_ERROR("error reopening 'd' for read, errno=%d", eno);
> +
> +       /* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail
> +        * because f is older than d
> +        */
> +       eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != EEXIST)
> +               MY_ERROR("exchange 'f' <-> 'd' should have failed with EEXIST because f is older than d, errno=%d", eno);
> +       eno = verify_inode("f", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
> +       eno = verify_inode("d", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
> +
> +       /* double exchange f <-> d with RENAME_EXCHANGE -- should succeed */
> +       eno = do_rename("f", "d", RENAME_EXCHANGE);
> +       if (eno)
> +               MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
> +       eno = verify_inode("d", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
> +       eno = verify_inode("f", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
> +       eno = do_rename("f", "d", RENAME_EXCHANGE);
> +       if (eno)
> +               MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
> +       eno = verify_inode("d", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
> +       eno = verify_inode("f", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
> +
> +       /* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed
> +        * because d is newer than f and fd_d is now read-only
> +        */
> +       eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno)
> +               MY_ERROR("exchange 'd' <-> 'f' failed, errno=%d", eno);
> +       eno = verify_inode("d", &stat_f);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
> +       eno = verify_inode("f", &stat_d);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
> +
> +       /* exchange directories x <-> y with RENAME_EXCHANGE|RENAME_NEWER_MTIME
> +        * -- should fail because x is older than y
> +        */
> +       eno = do_rename("x", "y", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno != EEXIST)
> +               MY_ERROR("exchange 'x' <-> 'y' should have failed with EEXIST because x is older than y, errno=%d", eno);
> +       eno = verify_inode("x", &stat_x);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'x' after attempted exchange 'x' <-> 'y', errno=%d", eno);
> +       eno = verify_inode("y", &stat_y);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'y' after attempted exchange 'x' <-> 'y', errno=%d", eno);
> +
> +       /* exchange directories y <-> x with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed */
> +       eno = do_rename("y", "x", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
> +       if (eno)
> +               MY_ERROR("exchange 'y' <-> 'x' failed, errno=%d", eno);
> +       eno = verify_inode("x", &stat_y);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'x' after exchange 'y' <-> 'x', errno=%d", eno);
> +       eno = verify_inode("y", &stat_x);
> +       if (eno)
> +               MY_ERROR("could not verify inode of 'y' after exchange 'y' <-> 'x', errno=%d", eno);
> +
> +       /* close fd_d */
> +       if (close(fd_d) != 0) {
> +               eno = errno;
> +               MY_ERROR("error closing fd_d (read), errno=%d", eno);
> +       }
> +       fd_d = -1;
> +
> +       MY_PASS("rename_newer_mtime test passed, workdir=%s", dirname);
> +}
> +
> +#ifdef LINUX_TEST_PROJECT
> +
> +static void setup(void)
> +{
> +}
> +
> +static void cleanup(void)
> +{
> +       /* close fd_d */
> +       if (fd_d >= 0)
> +               close(fd_d);
> +}
> +
> +static struct tst_test test = {
> +       .test_all = do_rename_newer_mtime,
> +       .setup = setup,
> +       .cleanup = cleanup,
> +       .needs_root = 1,
> +       .all_filesystems = 1,
> +       .mount_device = 1,
> +       .mntpoint = MNTPOINT,
> +       .skip_filesystems = (const char*[]) {
> +               "exfat",
> +               "ntfs",
> +               "vfat",
> +               NULL
> +       },
> +       .needs_cmds = NULL,
> +};
> +
> +#else /* Linux kernel self-test */
> +
> +TEST(rename_newer_mtime)
> +{
> +       do_rename_newer_mtime();
> +}
> +
> +TEST_HARNESS_MAIN
> +
> +#endif

Nice trick :)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-01 10:34                 ` Amir Goldstein
@ 2022-07-01 20:06                   ` James Yonan
  0 siblings, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-07-01 20:06 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: linux-fsdevel, Dave Chinner, Neil Brown, Al Viro

On 7/1/22 04:34, Amir Goldstein wrote:
>> @@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
>>
>>          take_dentry_name_snapshot(&old_name, old_dentry);
>>          dget(new_dentry);
>> -       if (!is_dir || (flags & RENAME_EXCHANGE))
>> +       if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
>>                  lock_two_nondirectories(source, target);
>>          else if (target)
>>                  inode_lock(target);
>>
>> +       if ((flags & RENAME_NEWER_MTIME) && target) {
>> +               /* deny write access to stabilize mtime comparison below */
>> +               error = inode_deny_write_access2(source, target);
> This is not needed for non regular file, but I guess it doesn't hurt...
> You could do a helper lock_two_inodes_deny_write() that takes
> care of both inode_lock() and inode_deny_write_access() and
> call it instead of lock_two_nondirectories() above.
>
> Then the lock and unlock routines would be more straightforward
> and less error prone, e.g.:
>
> -       if (!is_dir || (flags & RENAME_EXCHANGE))
> +       if (flags & RENAME_NEWER_MTIME)
> +               lock_two_inodes_deny_write(source, target);
> +       else if (!is_dir || (flags & (RENAME_EXCHANGE)))
>                  lock_two_nondirectories(source, target);
>
> ...
>
> out:
> -       if (!is_dir || (flags & RENAME_EXCHANGE))
> +       if (flags & RENAME_NEWER_MTIME)
> +               unlock_two_inodes_allow_write(source, target);
> +       else if (!is_dir || (flags & (RENAME_EXCHANGE)))
>                  unlock_two_nondirectories(source, target);

So keep in mind that RENAME_NEWER_MTIME can be used together with 
RENAME_EXCHANGE, and I'm a bit concerned about having RENAME_NEWER_MTIME 
usurp the locking logic of RENAME_EXCHANGE when both are used.  My 
thinking in the v2 patch was to keep the locking for each mostly 
separate and nested, to make it clear that they are independent options 
that can also be used together.  Also because 
lock_two_inodes_deny_write() can fail, it adds a new possible bailout 
point that would need to unwind take_dentry_name_snapshot() and dget().  
So it's hard to avoid having to add a new "out1" label.

> OTOH, for directory, inode_lock is needed to stabilize mtime and
> lock_two_nondirectories() doesn't do that for you and it is also
> non trivial to get the locking order and lockdep annotations correct.
>
> Since you don't have a use case for RENAME_NEWER_MTIME and
> directories (?), maybe the easier way around this would be to deny that
> earlier in do_renameat2() with -EISDIR.

Yes, that makes sense.  I will add that.

James



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-01  9:23               ` [PATCH v2] namei: implemented RENAME_NEWER_MTIME " James Yonan
  2022-07-01 10:34                 ` Amir Goldstein
@ 2022-07-02  8:07                 ` Dave Chinner
  2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
                                     ` (2 more replies)
  1 sibling, 3 replies; 24+ messages in thread
From: Dave Chinner @ 2022-07-02  8:07 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel, neilb, amir73il, viro

On Fri, Jul 01, 2022 at 03:23:26AM -0600, James Yonan wrote:
> RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags such as RENAME_NOREPLACE, RENAME_EXCHANGE,
> and RENAME_WHITEOUT.
> 
> RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename or exchange
> will only succeed if the source file is newer than the target (i.e. source
> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
> instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.
> 
> RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
> conditional exchange, where the exchange only occurs if source mtime >
> target mtime.  Otherwise, the operation will fail with -EEXIST.
> 
> RENAME_NEWER_MTIME is very useful in distributed systems that mirror a
> directory structure, or use a directory as a key/value store, and need to
> guarantee that files will only be overwritten by newer files, and that all
> updates are atomic.

You need to cc linux-api and write a renameat2() man page update
that documents how this option behaves for application developers.
The bits where it will appear to randomly fail are especially
important to document properly...

> RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
> write access to both source and target inodes before comparing their
> mtimes, to stabilize the comparison.
> 
> So one question to ask is could this functionality be implemented in
> userspace without adding a new renameat2() flag?  I think you could
> attempt it with iterative RENAME_EXCHANGE, but it's hackish, inefficient,
> and not atomic, because races could cause temporary mtime backtracks.
> How about using file locking?  Probably not, because the problem we want
> to solve is maintaining file/directory atomicity for readers by creating
> files out-of-directory, setting their mtime, and atomically moving them
> into place.  The strategy to lock such an operation really requires more
> complex locking methods than are generally exposed to userspace.  And if
> you are using inotify on the directory to notify readers of changes, it
> certainly makes sense to reduce unnecessary churn by preventing a move
> operation based on the mtime check.
> 
> While some people might question the utility of adding features to
> filesystems to make them more like databases, there is real value in the
> performance, atomicity, consistent VFS interface, multi-thread safety, and
> async-notify capabilities of modern filesystems that starts to blur the
> line, and actually make filesystem-based key-value stores a win for many
> applications.
> 
> Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
> ext4, xfs, btrfs, and tmpfs.
> 
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests), so I created one, though
> at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
> The self-test is written to be portable to the Linux Test Project,
> and the advantage of running it there is that it automatically runs
> tests on multiple filesystems.  See comments at the beginning of
> renameat2_tests.c for more info.
> 
> Build and run the self-test with:
> 
>   make -C tools/testing/selftests TARGETS=renameat2 run_tests
> 
> Questions:
> 
> Q: Why use mtime and not ctime for timestamp comparison?
> 
> A: I think the "use a directory as a key/value store" use case
>    cares about the modification time of the file content rather
>    than metadata.  Also, the rename operation itself modifies
>    ctime, making it less useful as a reference timestamp.
>    In any event, this patch creates the infrastructure for
>    conditional rename/exchange based on inode timestamp, so a
>    subsequent patch to add RENAME_NEWER_CTIME would be a mostly
>    trivial exercise.
> 
> Q: Why compare mtimes when some systems have poor system clock
>    accuracy and resolution?
> 
> A: So in the "use a directory as a key/value store" use case
>    in distributed systems, the file mtime is actually determined
>    remotely by the file content creator and is set locally
>    via futimens() rather than the local system clock.  So this gives
>    you nanosecond-scale time resolution if the content creator
>    supports it, even if the local system clock has less resolution.

That's not a useful answer to an application developer wondering if
he can use this feature in his application. This answer is the
justification of why *your application doesn't care* about poor
timestamp resolution, not an explanation of how some other
application can detect and/or address the problem when it arises...

.....

> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index de11992dc577..34226dfbca7a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -54,6 +54,7 @@ TARGETS += proc
>  TARGETS += pstore
>  TARGETS += ptrace
>  TARGETS += openat2
> +TARGETS += renameat2
>  TARGETS += resctrl
>  TARGETS += rlimits
>  TARGETS += rseq
> diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
> new file mode 100644
> index 000000000000..79bbdf497559
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/.gitignore
> @@ -0,0 +1 @@
> +renameat2_tests
> diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
> new file mode 100644
> index 000000000000..c39f5e281a5d
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/Makefile
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +CFLAGS = -g -Wall -O2
> +CFLAGS += $(KHDR_INCLUDES)
> +
> +TEST_GEN_PROGS := renameat2_tests
> +
> +include ../lib.mk
> +
> +$(OUTPUT)/renameat2_tests: renameat2_tests.c
> diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
> new file mode 100644
> index 000000000000..1fdb908cf428
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/renameat2_tests.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Written by James Yonan <james@openvpn.net>
> + * Copyright (c) 2022 OpenVPN, Inc.
> + */
.....

Please add the test in a separate patch. You probably also want to
write tests for fstests so that it gets exercised regularly by
filesystem developers who will notice when it breaks on their
filesystem......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT.
  2022-07-02  8:07                 ` Dave Chinner
@ 2022-07-05 13:30                   ` James Yonan
  2022-07-05 13:30                     ` [PATCH v3 2/2] selftests: added a new target renameat2 James Yonan
  2022-07-05 13:30                     ` [PATCH man-pages] rename.2: document new renameat2() flag RENAME_NEWER_MTIME James Yonan
  2022-07-05 14:03                   ` [RESEND PATCH v3 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace James Yonan
  2022-07-11 19:13                   ` [PATCH v4 " James Yonan
  2 siblings, 2 replies; 24+ messages in thread
From: James Yonan @ 2022-07-05 13:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename or exchange
will only succeed if the source file is newer than the target (i.e.
source mtime > target mtime).  Otherwise, the rename will fail with
-EEXIST instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.

RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
conditional exchange, where the exchange only occurs if source mtime >
target mtime.  Otherwise, the operation will fail with -EEXIST.

Some of the use cases for RENAME_NEWER_MTIME include (a) using a
directory as a key-value store, or (b) maintaining a near-real-time
mirror of a remote data source.  A common design pattern for maintaining
such a data store would be to create a file using a temporary pathname,
setting the file mtime using utimensat(2) or futimens(2) based on the
remote creation timestamp of the file content, then using
RENAME_NEWER_MTIME to move the file into place in the target directory.
If the operation returns an error with errno == EEXIST, then the source
file is not up-to-date and can safely be deleted. The goal is to
facilitate distributed systems having many concurrent writers and
readers, where update notifications are possibly delayed, duplicated, or
reordered, yet where readers see a consistent view of the target
directory with predictable semantics and atomic updates.

Note that RENAME_NEWER_MTIME depends on accurate, high-resolution
timestamps for mtime, preferably approaching nanosecond resolution.

RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
write access to both source and target inodes before comparing their
mtimes, to stabilize the comparison.

The use case for RENAME_NEWER_MTIME doesn't really align with
directories, so we return -EISDIR if either source or target is a
directory.  This makes the locking necessary to stabilize the mtime
comparison (in vfs_rename()) much more straightforward.

Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
the VFS, however the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
ext4, xfs, btrfs, and tmpfs.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
The self-test is written to be portable to the Linux Test Project,
and the advantage of running it there is that it automatically runs
tests on multiple filesystems.  See comments at the beginning of
renameat2_tests.c for more info.

Build and run the self-test with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Questions:

Q: Why use mtime and not ctime for timestamp comparison?

A: I see the "use a directory as a key/value store" use case
   as caring more about the modification time of the file content
   rather than the metadata.  Also, the rename operation itself
   modifies ctime, making it less useful as a reference timestamp.
   In any event, this patch creates the infrastructure for
   conditional rename/exchange based on inode timestamp, so a
   subsequent patch to add RENAME_NEWER_CTIME would be a mostly
   trivial exercise.

Signed-off-by: James Yonan <james@openvpn.net>
---
Patch version history:

v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so
    as to disambiguate and make it clear that we are comparing
    mtime values.

    RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE
    for conditional exchange, where exchange only occurs if
    source mtime > target mtime.

    Moved the mtime comparison logic into vfs_rename() to take
    advantage of existing {lock,unlock}_two_nondirectories critical
    section, and then further nest another critical section
    {deny,allow}_write_access (adapted to inodes) to stabilize the
    mtime, since our use case doesn't require renaming files that
    are open for write (we will return -ETXTBSY in this case).

    Did some refactoring of inline functions in linux/fs.h that
    manage inode->i_writecount, and added inode_deny_write_access2()
    and inode_allow_write_access2() functions.

    Extended the self-test (renameat2_tests.c):

    1. Verify that RENAME_NEWER_MTIME fails with errno == ETXTBSY when
       one of the files is open for write.

    2. Test conditional exchange use case with combined flags
       RENAME_EXCHANGE|RENAME_NEWER_MTIME.

    3. The test .c file is now drop-in portable to the Linux Test
       Project where you can take advantage of the .all_filesystems = 1
       flag to automatically run tests on multiple filesystems.

v3: The use case for RENAME_NEWER_MTIME doesn't really align
    with directories, so return -EISDIR if either source or
    target is a directory.  This makes the locking necessary
    to stabilize the mtime comparison (in vfs_rename())
    much more straightforward.

    simple_rename() in libfs.c doesn't need to support
    RENAME_NEWER_MTIME.

    Broke up some long lines.

    Rebased on top of 5.19-rc5.

    Break out the self-test into a separate patch.

    Documented RENAME_NEWER_MTIME in the rename.2 man page
    (separate patch).
---
 Documentation/filesystems/vfs.rst | 11 ++++++++
 fs/btrfs/inode.c                  |  3 ++-
 fs/ext2/namei.c                   |  2 +-
 fs/ext4/namei.c                   |  3 ++-
 fs/namei.c                        | 37 +++++++++++++++++++++++---
 fs/xfs/xfs_iops.c                 |  3 ++-
 include/linux/fs.h                | 43 ++++++++++++++++++++++++++++---
 include/uapi/linux/fs.h           |  1 +
 mm/shmem.c                        |  3 ++-
 tools/include/uapi/linux/fs.h     |  1 +
 10 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..495e7352cca1 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,6 +515,17 @@ otherwise noted.
 	(2) RENAME_EXCHANGE: exchange source and target.  Both must
 	exist; this is checked by the VFS.  Unlike plain rename, source
 	and target may be of different type.
+	(3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE,
+	and indicates a conditional rename: if the target of the rename
+	exists, the rename should only succeed if the source file is
+	newer than the target (i.e. source mtime > target mtime).
+	Otherwise, the rename should fail with -EEXIST instead of
+	replacing the target.  To exchange source and target conditional
+	on source being newer than target, pass flags as
+	RENAME_EXCHANGE|RENAME_NEWER_MTIME.  RENAME_NEWER_MTIME will fail
+	with -ETXTBSY if either source or target is open for write.
+	RENAME_NEWER_MTIME is not currently supported on directories, and
+	will return -EISDIR if either source or target is a directory.
 
 ``get_link``
 	called by the VFS to follow a symbolic link to the inode it
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05e0c4a5affd..0b78858d25b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9549,7 +9549,8 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
 			 struct dentry *old_dentry, struct inode *new_dir,
 			 struct dentry *new_dentry, unsigned int flags)
 {
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 5f6b7560eb3f..35dc17f80528 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
 	struct ext2_dir_entry_2 * old_de;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = dquot_initialize(old_dir);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index db4ba99d1ceb..3e393e2959b6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -4128,7 +4128,8 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..7776afc199c0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -40,6 +40,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/time64.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
 
 	take_dentry_name_snapshot(&old_name, old_dentry);
 	dget(new_dentry);
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		lock_two_nondirectories(source, target);
 	else if (target)
 		inode_lock(target);
 
+	if ((flags & RENAME_NEWER_MTIME) && target) {
+		/* deny write access to stabilize mtime comparison below */
+		error = inode_deny_write_access2(source, target);
+		if (error) /* -ETXTBSY */
+			goto out1;
+		if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) {
+			error = -EEXIST;
+			goto out;
+		}
+	}
+
 	error = -EPERM;
 	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
 		goto out;
@@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd)
 			d_exchange(old_dentry, new_dentry);
 	}
 out:
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if ((flags & RENAME_NEWER_MTIME) && target)
+		inode_allow_write_access2(source, target);
+out1:
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		unlock_two_nondirectories(source, target);
 	else if (target)
 		inode_unlock(target);
@@ -4769,11 +4784,12 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	bool should_retry = false;
 	int error = -EINVAL;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		goto put_names;
 
 	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
-	    (flags & RENAME_EXCHANGE))
+	    (flags & (RENAME_EXCHANGE | RENAME_NEWER_MTIME)))
 		goto put_names;
 
 	if (flags & RENAME_EXCHANGE)
@@ -4825,6 +4841,19 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -EEXIST;
 	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
 		goto exit5;
+	if (flags & RENAME_NEWER_MTIME) {
+		/* The use case for RENAME_NEWER_MTIME doesn't really align
+		 * with directories, so bail out here if either source or
+		 * target is a directory.  This makes the locking necessary
+		 * to stabilize the mtime comparison (in vfs_rename)
+		 * much more straightforward.
+		 */
+		error = -EISDIR;
+		if (d_is_dir(old_dentry))
+			goto exit5;
+		if (d_is_positive(new_dentry) && d_is_dir(new_dentry))
+			goto exit5;
+	}
 	if (flags & RENAME_EXCHANGE) {
 		error = -ENOENT;
 		if (d_is_negative(new_dentry))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 29f5b8b8aca6..a6ec8edd5398 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,7 +457,8 @@ xfs_vn_rename(
 	struct xfs_name	oname;
 	struct xfs_name	nname;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	/* if we are exchanging files, we need to set i_mode of both files */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..0c79f12ec51f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file)
  * use {get,deny}_write_access() - these functions check the sign and refuse
  * to do the change if sign is wrong.
  */
+static inline int inode_deny_write_access(struct inode *inode)
+{
+	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+}
+static inline void inode_allow_write_access(struct inode *inode)
+{
+	atomic_inc(&inode->i_writecount);
+}
 static inline int get_write_access(struct inode *inode)
 {
 	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
 static inline int deny_write_access(struct file *file)
 {
-	struct inode *inode = file_inode(file);
-	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+	return inode_deny_write_access(file_inode(file));
 }
 static inline void put_write_access(struct inode * inode)
 {
@@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode)
 static inline void allow_write_access(struct file *file)
 {
 	if (file)
-		atomic_inc(&file_inode(file)->i_writecount);
+		inode_allow_write_access(file_inode(file));
 }
 static inline bool inode_is_open_for_write(const struct inode *inode)
 {
 	return atomic_read(&inode->i_writecount) > 0;
 }
 
+/**
+ * inode_deny_write_access2 - deny write access on two inodes.
+ * Returns -ETXTBSY if write access cannot be denied on either inode.
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	int error = inode_deny_write_access(inode1);
+	if (error)
+		return error;
+	error = inode_deny_write_access(inode2);
+	if (error)
+		inode_allow_write_access(inode1);
+	return error;
+}
+
+/**
+ * inode_allow_write_access2 - allow write access on two inodes.
+ * This method is intended to be called after a successful call
+ * to inode_deny_write_access2().
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	inode_allow_write_access(inode1);
+	inode_allow_write_access(inode2);
+}
+
 #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
 static inline void i_readcount_dec(struct inode *inode)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..41de04b828fd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3009,7 +3009,8 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = S_ISDIR(inode->i_mode);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v3 2/2] selftests: added a new target renameat2
  2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
@ 2022-07-05 13:30                     ` James Yonan
  2022-07-05 13:30                     ` [PATCH man-pages] rename.2: document new renameat2() flag RENAME_NEWER_MTIME James Yonan
  1 sibling, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-07-05 13:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

The new renameat2 target tests the new renameat2()
flag RENAME_NEWER_MTIME along with RENAME_NOREPLACE
and RENAME_EXCHANGE.

This test is designed to be portable between
the Linux kernel self-tests and the Linux Test Project.

Signed-off-by: James Yonan <james@openvpn.net>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/renameat2/.gitignore  |   1 +
 tools/testing/selftests/renameat2/Makefile    |  11 +
 .../selftests/renameat2/renameat2_tests.c     | 451 ++++++++++++++++++
 4 files changed, 464 insertions(+)
 create mode 100644 tools/testing/selftests/renameat2/.gitignore
 create mode 100644 tools/testing/selftests/renameat2/Makefile
 create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index de11992dc577..34226dfbca7a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -54,6 +54,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += renameat2
 TARGETS += resctrl
 TARGETS += rlimits
 TARGETS += rseq
diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
new file mode 100644
index 000000000000..79bbdf497559
--- /dev/null
+++ b/tools/testing/selftests/renameat2/.gitignore
@@ -0,0 +1 @@
+renameat2_tests
diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
new file mode 100644
index 000000000000..6d5c44906b03
--- /dev/null
+++ b/tools/testing/selftests/renameat2/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS = -g -Wall -O2 -pthread
+CFLAGS += $(KHDR_INCLUDES)
+LDLIBS += -lpthread
+
+TEST_GEN_PROGS := renameat2_tests
+
+include ../lib.mk
+
+$(OUTPUT)/renameat2_tests: renameat2_tests.c
diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
new file mode 100644
index 000000000000..bc41975a565f
--- /dev/null
+++ b/tools/testing/selftests/renameat2/renameat2_tests.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Written by James Yonan <james@openvpn.net>
+ * Copyright (c) 2022 OpenVPN, Inc.
+ */
+
+/*
+ * Test renameat2() with RENAME_NOREPLACE, RENAME_EXCHANGE,
+ * and RENAME_NEWER_MTIME.
+ *
+ * This test is designed to be portable between
+ * the Linux kernel self-tests and the Linux Test Project.
+ * The cool thing about running the test in the Linux Test Project
+ * is that it will automatically iterate the test over all the
+ * filesystems available in your kernel.  In a default kernel,
+ * that includes ext2, ext3, ext4, xfs, btrfs, and tmpfs.
+ *
+ * By default we assume a Linux kernel self-test build, where
+ * you can build and run with:
+ *   make -C tools/testing/selftests TARGETS=renameat2 run_tests
+ *
+ * For a Linux Test Project build, place this source file
+ * under the ltp tree in:
+ *   testcases/kernel/syscalls/renameat2/renameat203.c
+ * Then cd to testcases/kernel/syscalls/renameat2 and add:
+ *   CPPFLAGS += -DLINUX_TEST_PROJECT
+ * to the end of the Makefile.  Then run with:
+ *   make && ./rename_newer_mtime
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <time.h>
+
+#ifdef LINUX_TEST_PROJECT
+#include "tst_test.h"
+#include "renameat2.h"
+#else
+#include "../kselftest_harness.h"
+#endif
+
+/* requires a kernel that implements renameat2() RENAME_NEWER_MTIME flag */
+#ifndef RENAME_NEWER_MTIME
+#define RENAME_NEWER_MTIME (1 << 3)
+#endif
+
+/* convert milliseconds to nanoseconds */
+#define MS_TO_NANO(x) ((x) * 1000000)
+
+#ifdef LINUX_TEST_PROJECT
+
+#define MNTPOINT "mntpoint"
+#define WORKDIR MNTPOINT "/testdir.XXXXXX"
+
+#define MY_ERROR(...) tst_brk(TFAIL, __VA_ARGS__)
+#define MY_PASS(...) tst_res(TPASS, __VA_ARGS__)
+
+#else /* Linux kernel self-test */
+
+#define WORKDIR "/tmp/ksft-renameat2-rename-newer-mtime.XXXXXX"
+
+#define MY_ERROR(fmt, ...) ksft_exit_fail_msg("%s/%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)
+#define MY_PASS(...)
+
+#endif
+
+static int create_file_with_timestamp(const char *filename,
+				      const time_t tv_sec,
+				      const long tv_nsec,
+				      struct stat *s,
+				      int *retain_fd)
+{
+	int fd;
+	struct timespec times[2];
+
+	fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777);
+	if (fd < 0)
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (futimens(fd, times)) {
+		close(fd);
+		return errno;
+	}
+	if (fstat(fd, s)) {
+		close(fd);
+		return errno;
+	}
+	if (retain_fd)
+		*retain_fd = fd;
+	else if (close(fd))
+		return errno;
+	return 0;
+}
+
+static int create_directory_with_timestamp(const char *dirname,
+					   const time_t tv_sec,
+					   const long tv_nsec,
+					   struct stat *s)
+{
+	struct timespec times[2];
+
+	if (mkdir(dirname, 0777))
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (utimensat(AT_FDCWD, dirname, times, 0) != 0)
+		return errno;
+	if (lstat(dirname, s))
+		return errno;
+	return 0;
+}
+
+static int do_rename(const char *source_path, const char *target_path,
+		     const unsigned int flags)
+{
+	if (renameat2(AT_FDCWD, source_path, AT_FDCWD, target_path, flags))
+		return errno;
+	return 0;
+}
+
+static int verify_inode(const char *path, const struct stat *orig_stat)
+{
+	struct stat s;
+
+	if (stat(path, &s))
+		return errno;
+	if (orig_stat->st_ino != s.st_ino)
+		return ENOENT;
+	return 0;
+}
+
+static int verify_exist(const char *path)
+{
+	int fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return errno;
+	if (close(fd) != 0)
+		return errno;
+	return 0;
+}
+
+static int fd_d = -1; /* retained fd from file "d" */
+
+/*
+ * Test renameat2() with RENAME_NEWER_MTIME, RENAME_NOREPLACE, and RENAME_EXCHANGE.
+ */
+static void do_rename_newer_mtime(void)
+{
+	char dirname[] = WORKDIR;
+	const time_t now = time(NULL);
+	struct stat stat_a, stat_b, stat_c, stat_d, stat_f; /* files */
+	struct stat stat_x, stat_y; /* directories */
+	int eno; /* copied errno */
+
+	/* fd_d initial state */
+	fd_d = -1;
+
+	/* make the top-level directory */
+	if (!mkdtemp(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to create tmpdir, errno=%d", eno);
+	}
+
+	/* cd to top-level directory */
+	if (chdir(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to cd to tmpdir, errno=%d", eno);
+	}
+
+	/* create files with different mtimes */
+	eno = create_file_with_timestamp("a", now, MS_TO_NANO(700), &stat_a, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'a', errno=%d", eno);
+	eno = create_file_with_timestamp("b", now+1, MS_TO_NANO(500), &stat_b, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'b', errno=%d", eno);
+	eno = create_file_with_timestamp("c", now+1, MS_TO_NANO(500), &stat_c, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'c', errno=%d", eno);
+	eno = create_file_with_timestamp("d", now+1, MS_TO_NANO(300), &stat_d, &fd_d); /* leave open for write */
+	if (eno)
+		MY_ERROR("failed to create file 'd', errno=%d", eno);
+	eno = create_file_with_timestamp("f", now, MS_TO_NANO(0), &stat_f, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'f', errno=%d", eno);
+
+	/* create directories with different mtimes */
+	eno = create_directory_with_timestamp("x", now+2, MS_TO_NANO(0), &stat_x);
+	if (eno)
+		MY_ERROR("failed to create directory 'x', errno=%d", eno);
+	eno = create_directory_with_timestamp("y", now+3, MS_TO_NANO(0), &stat_y);
+	if (eno)
+		MY_ERROR("failed to create directory 'y', errno=%d", eno);
+
+	/* rename b -> e with RENAME_NEWER_MTIME -- should succeed because e doesn't exist */
+	eno = do_rename("b", "e", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER_MTIME feature)", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after rename 'b' -> 'e', errno=%d", eno);
+	eno = verify_exist("b");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'b' still exists after rename 'b' -> 'e', errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME|RENAME_NOREPLACE -- should fail
+	 * because RENAME_NEWER_MTIME and RENAME_NOREPLACE cannot be used together
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME|RENAME_NOREPLACE);
+	if (eno != EINVAL)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EINVAL because RENAME_NEWER_MTIME and RENAME_NOREPLACE cannot be used together, errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME|RENAME_WHITEOUT -- should fail
+	 * because RENAME_NEWER_MTIME and RENAME_WHITEOUT cannot be used together
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME|RENAME_WHITEOUT);
+	if (eno != EINVAL)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EINVAL because RENAME_NEWER_MTIME and RENAME_WHITEOUT cannot be used together, errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME -- should fail because c and e have
+	 * the same timestamp
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EEXIST because 'c' and 'e' have the same timestamp, errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'c' -> 'e', errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted rename 'c' -> 'e', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NOREPLACE -- should fail because c exists */
+	eno = do_rename("a", "c", RENAME_NOREPLACE);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed because 'c' exists, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NEWER_MTIME -- should fail because c is newer than a */
+	eno = do_rename("a", "c", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed with EEXIST because 'c' is newer, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename c -> a with RENAME_NEWER_MTIME -- should succeed because c is newer than a */
+	eno = do_rename("c", "a", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d", eno);
+	eno = verify_inode("a", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after rename 'c' -> 'a', errno=%d", eno);
+	eno = verify_exist("c");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'c' still exists after rename 'c' -> 'a', errno=%d", eno);
+
+	/* exchange f <-> nonexistent with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * only f exists
+	 */
+	eno = do_rename("f", "nonexistent", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ENOENT)
+		MY_ERROR("exchange 'f' <-> 'nonexistent' should have failed with ENOENT, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'nonexistent', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'd' <-> 'f' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange e <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("e", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'e' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write but also because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY) /* note in this case we get ETXTBSY first (EEXIST would have
+			     * been returned if d wasn't open for write)
+			     */
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (write), errno=%d", eno);
+	}
+
+	/* reopen "d" for read access, which should not prevent RENAME_NEWER_MTIME */
+	fd_d = open("d", O_RDONLY);
+	if (fd_d < 0)
+		MY_ERROR("error reopening 'd' for read, errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail
+	 * because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with EEXIST because f is older than d, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* double exchange f <-> d with RENAME_EXCHANGE -- should succeed */
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed
+	 * because d is newer than f and fd_d is now read-only
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("exchange 'd' <-> 'f' failed, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange directories x <-> y with RENAME_EXCHANGE|RENAME_NEWER_MTIME
+	 * -- should fail because RENAME_NEWER_MTIME is not implemented
+	 * for directories.
+	 */
+	eno = do_rename("x", "y", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EISDIR)
+		MY_ERROR("exchange 'x' <-> 'y' should have failed with EISDIR because x and y are directories, errno=%d", eno);
+	eno = verify_inode("x", &stat_x);
+	if (eno)
+		MY_ERROR("could not verify inode of 'x' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+	eno = verify_inode("y", &stat_y);
+	if (eno)
+		MY_ERROR("could not verify inode of 'y' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (read), errno=%d", eno);
+	}
+	fd_d = -1;
+
+	MY_PASS("rename_newer_mtime test passed, workdir=%s", dirname);
+}
+
+#ifdef LINUX_TEST_PROJECT
+
+static void setup(void)
+{
+}
+
+static void cleanup(void)
+{
+	/* close fd_d */
+	if (fd_d >= 0)
+		close(fd_d);
+}
+
+static struct tst_test test = {
+	.test_all = do_rename_newer_mtime,
+	.setup = setup,
+	.cleanup = cleanup,
+	.needs_root = 1,
+	.all_filesystems = 1,
+	.mount_device = 1,
+	.mntpoint = MNTPOINT,
+	.skip_filesystems = (const char*[]) {
+		"exfat",
+		"ntfs",
+		"vfat",
+		NULL
+	},
+	.needs_cmds = NULL,
+};
+
+#else /* Linux kernel self-test */
+
+TEST(rename_newer_mtime)
+{
+	do_rename_newer_mtime();
+}
+
+TEST_HARNESS_MAIN
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH man-pages] rename.2: document new renameat2() flag RENAME_NEWER_MTIME
  2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
  2022-07-05 13:30                     ` [PATCH v3 2/2] selftests: added a new target renameat2 James Yonan
@ 2022-07-05 13:30                     ` James Yonan
  1 sibling, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-07-05 13:30 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

Signed-off-by: James Yonan <james@openvpn.net>
---
 man2/rename.2 | 138 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 138 insertions(+)

diff --git a/man2/rename.2 b/man2/rename.2
index e403c0393..13169310d 100644
--- a/man2/rename.2
+++ b/man2/rename.2
@@ -273,6 +273,106 @@ btrfs (since Linux 4.7),
 .\" btrfs: commit cdd1fedf8261cd7a73c0596298902ff4f0f04492
 and ubifs (since Linux 4.9).
 .\" ubifs: commit 9e0a1fff8db56eaaebb74b4a3ef65f86811c4798
+.TP
+.BR RENAME_NEWER_MTIME " (since Linux 5.xx)"
+.B RENAME_NEWER_MTIME
+modifies the behavior of plain rename or
+.B RENAME_EXCHANGE
+by making the rename or exchange operation conditional on the file
+modification time
+.B mtime.
+If
+.I newpath
+exists, only perform the operation if
+.I oldpath
+.B mtime
+>
+.I newpath
+.B mtime;
+otherwise return an error.  If
+.I newpath
+doesn't exist, do a plain rename.
+.IP
+.B RENAME_NEWER_MTIME
+combines
+.B mtime
+comparison and conditional replacement into
+an atomic operation that augments the existing guarantee
+of rename operations -- that not only is there no point
+at which another process attempting to access
+.I newpath
+would find it missing, but there is no point at which a reader
+could detect an
+.B mtime
+backtrack in
+.I newpath.
+.IP
+Some of the use cases for
+.B RENAME_NEWER_MTIME
+include (a) using a directory as a key-value store, or
+(b) maintaining a near-real-time mirror of a remote data source.
+A common design pattern for maintaining such a data
+store would be to create a file using a temporary pathname,
+setting the file
+.B mtime
+using
+.BR utimensat (2)
+or
+.BR futimens (2)
+based on the remote creation
+timestamp of the file content, then using
+.B RENAME_NEWER_MTIME
+to move the file into place in the target directory.  If
+the operation returns an error with
+.I errno
+set to
+.B EEXIST,
+then
+.I oldpath
+is not up-to-date and can safely be deleted.
+The goal is to facilitate distributed systems
+having many concurrent writers and readers,
+where update notifications are possibly delayed, duplicated,
+or reordered, yet where readers see a consistent view
+of the target directory with predictable semantics
+and atomic updates.
+.IP
+Note that
+.B RENAME_NEWER_MTIME
+depends on accurate, high-resolution timestamps for
+.B mtime,
+preferably approaching nanosecond resolution.
+.IP
+.B RENAME_NEWER_MTIME
+only works on non-directory files and cannot be used when
+.I oldpath
+or
+.I newpath
+is open for write.
+.IP
+.B RENAME_NEWER_MTIME
+can be combined with
+.B RENAME_EXCHANGE
+where
+.I oldpath
+and
+.I newpath
+will only be exchanged if
+.I oldpath
+.B mtime
+>
+.I newpath
+.B mtime.
+.IP
+.B RENAME_NEWER_MTIME
+cannot be combined with
+.B RENAME_NOREPLACE
+or
+.B RENAME_WHITEOUT.
+.IP
+.B RENAME_NEWER_MTIME
+requires support from the underlying filesystem.  As of Linux 5.xx,
+ext2, ext3, ext4, xfs, btrfs, and tmpfs are supported.
 .SH RETURN VALUE
 On success, zero is returned.
 On error, \-1 is returned, and
@@ -449,6 +549,37 @@ and
 .I newpath
 already exists.
 .TP
+.B EEXIST
+.I flags
+contain
+.B RENAME_NEWER_MTIME
+and
+.I oldpath
+.B mtime
+<=
+.I newpath
+.B mtime.
+.TP
+.B EISDIR
+.I flags
+contain
+.B RENAME_NEWER_MTIME
+and
+.I oldpath
+or
+.I newpath
+is a directory.
+.TP
+.B ETXTBSY
+.I flags
+contain
+.B RENAME_NEWER_MTIME
+and
+.I oldpath
+or
+.I newpath
+is open for write.
+.TP
 .B EINVAL
 An invalid flag was specified in
 .IR flags .
@@ -470,6 +601,13 @@ were specified in
 .IR flags .
 .TP
 .B EINVAL
+.B RENAME_NEWER_MTIME
+was used together with 
+.B RENAME_NOREPLACE
+or
+.B RENAME_WHITEOUT.
+.TP
+.B EINVAL
 The filesystem does not support one of the flags in
 .IR flags .
 .TP
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [RESEND PATCH v3 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-02  8:07                 ` Dave Chinner
  2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
@ 2022-07-05 14:03                   ` James Yonan
  2022-07-11 19:13                   ` [PATCH v4 " James Yonan
  2 siblings, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-07-05 14:03 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
stands alongside existing flags including RENAME_NOREPLACE,
RENAME_EXCHANGE, and RENAME_WHITEOUT.

RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename or exchange
will only succeed if the source file is newer than the target (i.e.
source mtime > target mtime).  Otherwise, the rename will fail with
-EEXIST instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.

RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
conditional exchange, where the exchange only occurs if source mtime >
target mtime.  Otherwise, the operation will fail with -EEXIST.

Some of the use cases for RENAME_NEWER_MTIME include (a) using a
directory as a key-value store, or (b) maintaining a near-real-time
mirror of a remote data source.  A common design pattern for maintaining
such a data store would be to create a file using a temporary pathname,
setting the file mtime using utimensat(2) or futimens(2) based on the
remote creation timestamp of the file content, then using
RENAME_NEWER_MTIME to move the file into place in the target directory.
If the operation returns an error with errno == EEXIST, then the source
file is not up-to-date and can safely be deleted. The goal is to
facilitate distributed systems having many concurrent writers and
readers, where update notifications are possibly delayed, duplicated, or
reordered, yet where readers see a consistent view of the target
directory with predictable semantics and atomic updates.

Note that RENAME_NEWER_MTIME depends on accurate, high-resolution
timestamps for mtime, preferably approaching nanosecond resolution.

RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
write access to both source and target inodes before comparing their
mtimes, to stabilize the comparison.

The use case for RENAME_NEWER_MTIME doesn't really align with
directories, so we return -EISDIR if either source or target is a
directory.  This makes the locking necessary to stabilize the mtime
comparison (in vfs_rename()) much more straightforward.

Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
the VFS, however the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
ext4, xfs, btrfs, and tmpfs.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
The self-test is written to be portable to the Linux Test Project,
and the advantage of running it there is that it automatically runs
tests on multiple filesystems.  See comments at the beginning of
renameat2_tests.c for more info.

Build and run the self-test with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Questions:

Q: Why use mtime and not ctime for timestamp comparison?

A: I see the "use a directory as a key/value store" use case
   as caring more about the modification time of the file content
   rather than the metadata.  Also, the rename operation itself
   modifies ctime, making it less useful as a reference timestamp.
   In any event, this patch creates the infrastructure for
   conditional rename/exchange based on inode timestamp, so a
   subsequent patch to add RENAME_NEWER_CTIME would be a mostly
   trivial exercise.

Signed-off-by: James Yonan <james@openvpn.net>
---
Patch version history:

v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so
    as to disambiguate and make it clear that we are comparing
    mtime values.

    RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE
    for conditional exchange, where exchange only occurs if
    source mtime > target mtime.

    Moved the mtime comparison logic into vfs_rename() to take
    advantage of existing {lock,unlock}_two_nondirectories critical
    section, and then further nest another critical section
    {deny,allow}_write_access (adapted to inodes) to stabilize the
    mtime, since our use case doesn't require renaming files that
    are open for write (we will return -ETXTBSY in this case).

    Did some refactoring of inline functions in linux/fs.h that
    manage inode->i_writecount, and added inode_deny_write_access2()
    and inode_allow_write_access2() functions.

    Extended the self-test (renameat2_tests.c):

    1. Verify that RENAME_NEWER_MTIME fails with errno == ETXTBSY when
       one of the files is open for write.

    2. Test conditional exchange use case with combined flags
       RENAME_EXCHANGE|RENAME_NEWER_MTIME.

    3. The test .c file is now drop-in portable to the Linux Test
       Project where you can take advantage of the .all_filesystems = 1
       flag to automatically run tests on multiple filesystems.

v3: The use case for RENAME_NEWER_MTIME doesn't really align
    with directories, so return -EISDIR if either source or
    target is a directory.  This makes the locking necessary
    to stabilize the mtime comparison (in vfs_rename())
    much more straightforward.

    simple_rename() in libfs.c doesn't need to support
    RENAME_NEWER_MTIME.

    Broke up some long lines.

    Rebased on top of 5.19-rc5.

    Break out the self-test into a separate patch.

    Documented RENAME_NEWER_MTIME in the rename.2 man page
    (separate patch).
---
 Documentation/filesystems/vfs.rst | 11 ++++++++
 fs/btrfs/inode.c                  |  3 ++-
 fs/ext2/namei.c                   |  2 +-
 fs/ext4/namei.c                   |  3 ++-
 fs/namei.c                        | 37 +++++++++++++++++++++++---
 fs/xfs/xfs_iops.c                 |  3 ++-
 include/linux/fs.h                | 43 ++++++++++++++++++++++++++++---
 include/uapi/linux/fs.h           |  1 +
 mm/shmem.c                        |  3 ++-
 tools/include/uapi/linux/fs.h     |  1 +
 10 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..495e7352cca1 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,6 +515,17 @@ otherwise noted.
 	(2) RENAME_EXCHANGE: exchange source and target.  Both must
 	exist; this is checked by the VFS.  Unlike plain rename, source
 	and target may be of different type.
+	(3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE,
+	and indicates a conditional rename: if the target of the rename
+	exists, the rename should only succeed if the source file is
+	newer than the target (i.e. source mtime > target mtime).
+	Otherwise, the rename should fail with -EEXIST instead of
+	replacing the target.  To exchange source and target conditional
+	on source being newer than target, pass flags as
+	RENAME_EXCHANGE|RENAME_NEWER_MTIME.  RENAME_NEWER_MTIME will fail
+	with -ETXTBSY if either source or target is open for write.
+	RENAME_NEWER_MTIME is not currently supported on directories, and
+	will return -EISDIR if either source or target is a directory.
 
 ``get_link``
 	called by the VFS to follow a symbolic link to the inode it
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05e0c4a5affd..0b78858d25b8 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9549,7 +9549,8 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
 			 struct dentry *old_dentry, struct inode *new_dir,
 			 struct dentry *new_dentry, unsigned int flags)
 {
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 5f6b7560eb3f..35dc17f80528 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
 	struct ext2_dir_entry_2 * old_de;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = dquot_initialize(old_dir);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index db4ba99d1ceb..3e393e2959b6 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -4128,7 +4128,8 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..7776afc199c0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -40,6 +40,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/time64.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
 
 	take_dentry_name_snapshot(&old_name, old_dentry);
 	dget(new_dentry);
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		lock_two_nondirectories(source, target);
 	else if (target)
 		inode_lock(target);
 
+	if ((flags & RENAME_NEWER_MTIME) && target) {
+		/* deny write access to stabilize mtime comparison below */
+		error = inode_deny_write_access2(source, target);
+		if (error) /* -ETXTBSY */
+			goto out1;
+		if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) {
+			error = -EEXIST;
+			goto out;
+		}
+	}
+
 	error = -EPERM;
 	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
 		goto out;
@@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd)
 			d_exchange(old_dentry, new_dentry);
 	}
 out:
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if ((flags & RENAME_NEWER_MTIME) && target)
+		inode_allow_write_access2(source, target);
+out1:
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		unlock_two_nondirectories(source, target);
 	else if (target)
 		inode_unlock(target);
@@ -4769,11 +4784,12 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	bool should_retry = false;
 	int error = -EINVAL;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		goto put_names;
 
 	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
-	    (flags & RENAME_EXCHANGE))
+	    (flags & (RENAME_EXCHANGE | RENAME_NEWER_MTIME)))
 		goto put_names;
 
 	if (flags & RENAME_EXCHANGE)
@@ -4825,6 +4841,19 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -EEXIST;
 	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
 		goto exit5;
+	if (flags & RENAME_NEWER_MTIME) {
+		/* The use case for RENAME_NEWER_MTIME doesn't really align
+		 * with directories, so bail out here if either source or
+		 * target is a directory.  This makes the locking necessary
+		 * to stabilize the mtime comparison (in vfs_rename)
+		 * much more straightforward.
+		 */
+		error = -EISDIR;
+		if (d_is_dir(old_dentry))
+			goto exit5;
+		if (d_is_positive(new_dentry) && d_is_dir(new_dentry))
+			goto exit5;
+	}
 	if (flags & RENAME_EXCHANGE) {
 		error = -ENOENT;
 		if (d_is_negative(new_dentry))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 29f5b8b8aca6..a6ec8edd5398 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,7 +457,8 @@ xfs_vn_rename(
 	struct xfs_name	oname;
 	struct xfs_name	nname;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	/* if we are exchanging files, we need to set i_mode of both files */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..0c79f12ec51f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file)
  * use {get,deny}_write_access() - these functions check the sign and refuse
  * to do the change if sign is wrong.
  */
+static inline int inode_deny_write_access(struct inode *inode)
+{
+	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+}
+static inline void inode_allow_write_access(struct inode *inode)
+{
+	atomic_inc(&inode->i_writecount);
+}
 static inline int get_write_access(struct inode *inode)
 {
 	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
 static inline int deny_write_access(struct file *file)
 {
-	struct inode *inode = file_inode(file);
-	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+	return inode_deny_write_access(file_inode(file));
 }
 static inline void put_write_access(struct inode * inode)
 {
@@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode)
 static inline void allow_write_access(struct file *file)
 {
 	if (file)
-		atomic_inc(&file_inode(file)->i_writecount);
+		inode_allow_write_access(file_inode(file));
 }
 static inline bool inode_is_open_for_write(const struct inode *inode)
 {
 	return atomic_read(&inode->i_writecount) > 0;
 }
 
+/**
+ * inode_deny_write_access2 - deny write access on two inodes.
+ * Returns -ETXTBSY if write access cannot be denied on either inode.
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	int error = inode_deny_write_access(inode1);
+	if (error)
+		return error;
+	error = inode_deny_write_access(inode2);
+	if (error)
+		inode_allow_write_access(inode1);
+	return error;
+}
+
+/**
+ * inode_allow_write_access2 - allow write access on two inodes.
+ * This method is intended to be called after a successful call
+ * to inode_deny_write_access2().
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	inode_allow_write_access(inode1);
+	inode_allow_write_access(inode2);
+}
+
 #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
 static inline void i_readcount_dec(struct inode *inode)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..41de04b828fd 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3009,7 +3009,8 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = S_ISDIR(inode->i_mode);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-02  8:07                 ` Dave Chinner
  2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
  2022-07-05 14:03                   ` [RESEND PATCH v3 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace James Yonan
@ 2022-07-11 19:13                   ` James Yonan
  2022-07-11 19:13                     ` [PATCH v4 2/2] selftests: added a new target renameat2 James Yonan
  2022-07-11 23:10                     ` [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace Dave Chinner
  2 siblings, 2 replies; 24+ messages in thread
From: James Yonan @ 2022-07-11 19:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
stands alongside existing flags including RENAME_NOREPLACE,
RENAME_EXCHANGE, and RENAME_WHITEOUT.

RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
indicates that if the target of the rename exists, the rename or exchange
will only succeed if the source file is newer than the target (i.e.
source mtime > target mtime).  Otherwise, the rename will fail with
-EEXIST instead of replacing the target.  When the target doesn't exist,
RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.

RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
conditional exchange, where the exchange only occurs if source mtime >
target mtime.  Otherwise, the operation will fail with -EEXIST.

Some of the use cases for RENAME_NEWER_MTIME include (a) using a
directory as a key-value store, or (b) maintaining a near-real-time
mirror of a remote data source.  A common design pattern for maintaining
such a data store would be to create a file using a temporary pathname,
setting the file mtime using utimensat(2) or futimens(2) based on the
remote creation timestamp of the file content, then using
RENAME_NEWER_MTIME to move the file into place in the target directory.
If the operation returns an error with errno == EEXIST, then the source
file is not up-to-date and can safely be deleted. The goal is to
facilitate distributed systems having many concurrent writers and
readers, where update notifications are possibly delayed, duplicated, or
reordered, yet where readers see a consistent view of the target
directory with predictable semantics and atomic updates.

Note that RENAME_NEWER_MTIME depends on accurate, high-resolution
timestamps for mtime, preferably approaching nanosecond resolution.

RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
write access to both source and target inodes before comparing their
mtimes, to stabilize the comparison.

The use case for RENAME_NEWER_MTIME doesn't really align with
directories, so we return -EISDIR if either source or target is a
directory.  This makes the locking necessary to stabilize the mtime
comparison (in vfs_rename()) much more straightforward.

Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
the VFS, however the individual fs implementations do strict flags
checking and will return -EINVAL for any flag they don't recognize.
At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
ext4, xfs, btrfs, and tmpfs.

I did not notice a general self-test for renameat2() at the VFS
layer (outside of fs-specific tests), so I created one, though
at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
The self-test is written to be portable to the Linux Test Project,
and the advantage of running it there is that it automatically runs
tests on multiple filesystems.  See comments at the beginning of
renameat2_tests.c for more info.

Build and run the self-test with:

  make -C tools/testing/selftests TARGETS=renameat2 run_tests

Questions:

Q: Why use mtime and not ctime for timestamp comparison?

A: I see the "use a directory as a key/value store" use case
   as caring more about the modification time of the file content
   rather than the metadata.  Also, the rename operation itself
   modifies ctime, making it less useful as a reference timestamp.
   In any event, this patch creates the infrastructure for
   conditional rename/exchange based on inode timestamp, so a
   subsequent patch to add RENAME_NEWER_CTIME would be a mostly
   trivial exercise.

Signed-off-by: James Yonan <james@openvpn.net>
---
Patch version history:

v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so
    as to disambiguate and make it clear that we are comparing
    mtime values.

    RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE
    for conditional exchange, where exchange only occurs if
    source mtime > target mtime.

    Moved the mtime comparison logic into vfs_rename() to take
    advantage of existing {lock,unlock}_two_nondirectories critical
    section, and then further nest another critical section
    {deny,allow}_write_access (adapted to inodes) to stabilize the
    mtime, since our use case doesn't require renaming files that
    are open for write (we will return -ETXTBSY in this case).

    Did some refactoring of inline functions in linux/fs.h that
    manage inode->i_writecount, and added inode_deny_write_access2()
    and inode_allow_write_access2() functions.

    Extended the self-test (renameat2_tests.c):

    1. Verify that RENAME_NEWER_MTIME fails with errno == ETXTBSY when
       one of the files is open for write.

    2. Test conditional exchange use case with combined flags
       RENAME_EXCHANGE|RENAME_NEWER_MTIME.

    3. The test .c file is now drop-in portable to the Linux Test
       Project where you can take advantage of the .all_filesystems = 1
       flag to automatically run tests on multiple filesystems.

v3: The use case for RENAME_NEWER_MTIME doesn't really align
    with directories, so return -EISDIR if either source or
    target is a directory.  This makes the locking necessary
    to stabilize the mtime comparison (in vfs_rename())
    much more straightforward.

    simple_rename() in libfs.c doesn't need to support
    RENAME_NEWER_MTIME.

    Broke up some long lines.

    Rebased on top of 5.19-rc5.

    Break out the self-test into a separate patch.

    Documented RENAME_NEWER_MTIME in the rename.2 man page
    (separate patch).

v4: Testing for d_is_postive() is unneeded if testing d_is_dir()
    on target dentry.

    Break long lines after | operator, rather than before.

    Rebased on top of 5.19-rc6.

    Submitted rename.2 man page update to linux-man@vger.kernel.org
---
 Documentation/filesystems/vfs.rst | 11 ++++++++
 fs/btrfs/inode.c                  |  3 ++-
 fs/ext2/namei.c                   |  2 +-
 fs/ext4/namei.c                   |  3 ++-
 fs/namei.c                        | 37 +++++++++++++++++++++++---
 fs/xfs/xfs_iops.c                 |  3 ++-
 include/linux/fs.h                | 43 ++++++++++++++++++++++++++++---
 include/uapi/linux/fs.h           |  1 +
 mm/shmem.c                        |  3 ++-
 tools/include/uapi/linux/fs.h     |  1 +
 10 files changed, 95 insertions(+), 12 deletions(-)

diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..495e7352cca1 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -515,6 +515,17 @@ otherwise noted.
 	(2) RENAME_EXCHANGE: exchange source and target.  Both must
 	exist; this is checked by the VFS.  Unlike plain rename, source
 	and target may be of different type.
+	(3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE,
+	and indicates a conditional rename: if the target of the rename
+	exists, the rename should only succeed if the source file is
+	newer than the target (i.e. source mtime > target mtime).
+	Otherwise, the rename should fail with -EEXIST instead of
+	replacing the target.  To exchange source and target conditional
+	on source being newer than target, pass flags as
+	RENAME_EXCHANGE|RENAME_NEWER_MTIME.  RENAME_NEWER_MTIME will fail
+	with -ETXTBSY if either source or target is open for write.
+	RENAME_NEWER_MTIME is not currently supported on directories, and
+	will return -EISDIR if either source or target is a directory.
 
 ``get_link``
 	called by the VFS to follow a symbolic link to the inode it
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 05e0c4a5affd..22c59808762a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -9549,7 +9549,8 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di
 			 struct dentry *old_dentry, struct inode *new_dir,
 			 struct dentry *new_dentry, unsigned int flags)
 {
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT |
+		      RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c
index 5f6b7560eb3f..35dc17f80528 100644
--- a/fs/ext2/namei.c
+++ b/fs/ext2/namei.c
@@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns,
 	struct ext2_dir_entry_2 * old_de;
 	int err;
 
-	if (flags & ~RENAME_NOREPLACE)
+	if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = dquot_initialize(old_dir);
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index db4ba99d1ceb..5f3f124e3f90 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -4128,7 +4128,8 @@ static int ext4_rename2(struct user_namespace *mnt_userns,
 	if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb))))
 		return -EIO;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT |
+		      RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry,
diff --git a/fs/namei.c b/fs/namei.c
index 1f28d3f463c3..e7917def510a 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -40,6 +40,7 @@
 #include <linux/bitops.h>
 #include <linux/init_task.h>
 #include <linux/uaccess.h>
+#include <linux/time64.h>
 
 #include "internal.h"
 #include "mount.h"
@@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd)
 
 	take_dentry_name_snapshot(&old_name, old_dentry);
 	dget(new_dentry);
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		lock_two_nondirectories(source, target);
 	else if (target)
 		inode_lock(target);
 
+	if ((flags & RENAME_NEWER_MTIME) && target) {
+		/* deny write access to stabilize mtime comparison below */
+		error = inode_deny_write_access2(source, target);
+		if (error) /* -ETXTBSY */
+			goto out1;
+		if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) {
+			error = -EEXIST;
+			goto out;
+		}
+	}
+
 	error = -EPERM;
 	if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target)))
 		goto out;
@@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd)
 			d_exchange(old_dentry, new_dentry);
 	}
 out:
-	if (!is_dir || (flags & RENAME_EXCHANGE))
+	if ((flags & RENAME_NEWER_MTIME) && target)
+		inode_allow_write_access2(source, target);
+out1:
+	if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME)))
 		unlock_two_nondirectories(source, target);
 	else if (target)
 		inode_unlock(target);
@@ -4769,11 +4784,12 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	bool should_retry = false;
 	int error = -EINVAL;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT
+		      | RENAME_NEWER_MTIME))
 		goto put_names;
 
 	if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) &&
-	    (flags & RENAME_EXCHANGE))
+	    (flags & (RENAME_EXCHANGE | RENAME_NEWER_MTIME)))
 		goto put_names;
 
 	if (flags & RENAME_EXCHANGE)
@@ -4825,6 +4841,19 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd,
 	error = -EEXIST;
 	if ((flags & RENAME_NOREPLACE) && d_is_positive(new_dentry))
 		goto exit5;
+
+	/* The use case for RENAME_NEWER_MTIME doesn't really align
+	 * with directories, so bail out here if either source or
+	 * target is a directory.  This makes the locking necessary
+	 * to stabilize the mtime comparison (in vfs_rename)
+	 * much more straightforward.
+	 */
+	if ((flags & RENAME_NEWER_MTIME) &&
+	    (d_is_dir(old_dentry) || d_is_dir(new_dentry))) {
+		error = -EISDIR;
+		goto exit5;
+	}
+
 	if (flags & RENAME_EXCHANGE) {
 		error = -ENOENT;
 		if (d_is_negative(new_dentry))
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 29f5b8b8aca6..d615cc2b8d36 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -457,7 +457,8 @@ xfs_vn_rename(
 	struct xfs_name	oname;
 	struct xfs_name	nname;
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT |
+		      RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	/* if we are exchanging files, we need to set i_mode of both files */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9ad5e3520fae..0c79f12ec51f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file)
  * use {get,deny}_write_access() - these functions check the sign and refuse
  * to do the change if sign is wrong.
  */
+static inline int inode_deny_write_access(struct inode *inode)
+{
+	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+}
+static inline void inode_allow_write_access(struct inode *inode)
+{
+	atomic_inc(&inode->i_writecount);
+}
 static inline int get_write_access(struct inode *inode)
 {
 	return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY;
 }
 static inline int deny_write_access(struct file *file)
 {
-	struct inode *inode = file_inode(file);
-	return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY;
+	return inode_deny_write_access(file_inode(file));
 }
 static inline void put_write_access(struct inode * inode)
 {
@@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode)
 static inline void allow_write_access(struct file *file)
 {
 	if (file)
-		atomic_inc(&file_inode(file)->i_writecount);
+		inode_allow_write_access(file_inode(file));
 }
 static inline bool inode_is_open_for_write(const struct inode *inode)
 {
 	return atomic_read(&inode->i_writecount) > 0;
 }
 
+/**
+ * inode_deny_write_access2 - deny write access on two inodes.
+ * Returns -ETXTBSY if write access cannot be denied on either inode.
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	int error = inode_deny_write_access(inode1);
+	if (error)
+		return error;
+	error = inode_deny_write_access(inode2);
+	if (error)
+		inode_allow_write_access(inode1);
+	return error;
+}
+
+/**
+ * inode_allow_write_access2 - allow write access on two inodes.
+ * This method is intended to be called after a successful call
+ * to inode_deny_write_access2().
+ * @inode1: first inode
+ * @inode2: second inode
+ */
+static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2)
+{
+	inode_allow_write_access(inode1);
+	inode_allow_write_access(inode2);
+}
+
 #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING)
 static inline void i_readcount_dec(struct inode *inode)
 {
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
diff --git a/mm/shmem.c b/mm/shmem.c
index a6f565308133..94d8324db46d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3009,7 +3009,8 @@ static int shmem_rename2(struct user_namespace *mnt_userns,
 	struct inode *inode = d_inode(old_dentry);
 	int they_are_dirs = S_ISDIR(inode->i_mode);
 
-	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT))
+	if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT |
+		      RENAME_NEWER_MTIME))
 		return -EINVAL;
 
 	if (flags & RENAME_EXCHANGE)
diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h
index bdf7b404b3e7..7e9c32dce3e4 100644
--- a/tools/include/uapi/linux/fs.h
+++ b/tools/include/uapi/linux/fs.h
@@ -50,6 +50,7 @@
 #define RENAME_NOREPLACE	(1 << 0)	/* Don't overwrite target */
 #define RENAME_EXCHANGE		(1 << 1)	/* Exchange source and dest */
 #define RENAME_WHITEOUT		(1 << 2)	/* Whiteout source */
+#define RENAME_NEWER_MTIME	(1 << 3)	/* Only newer file can overwrite target */
 
 struct file_clone_range {
 	__s64 src_fd;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v4 2/2] selftests: added a new target renameat2
  2022-07-11 19:13                   ` [PATCH v4 " James Yonan
@ 2022-07-11 19:13                     ` James Yonan
  2022-07-11 23:10                     ` [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: James Yonan @ 2022-07-11 19:13 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: david, neilb, amir73il, viro, linux-api, James Yonan

The new renameat2 target tests the new renameat2()
flag RENAME_NEWER_MTIME along with RENAME_NOREPLACE
and RENAME_EXCHANGE.

This test is designed to be portable between
the Linux kernel self-tests and the Linux Test Project.

Signed-off-by: James Yonan <james@openvpn.net>
---
 tools/testing/selftests/Makefile              |   1 +
 tools/testing/selftests/renameat2/.gitignore  |   1 +
 tools/testing/selftests/renameat2/Makefile    |  11 +
 .../selftests/renameat2/renameat2_tests.c     | 451 ++++++++++++++++++
 4 files changed, 464 insertions(+)
 create mode 100644 tools/testing/selftests/renameat2/.gitignore
 create mode 100644 tools/testing/selftests/renameat2/Makefile
 create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index de11992dc577..34226dfbca7a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -54,6 +54,7 @@ TARGETS += proc
 TARGETS += pstore
 TARGETS += ptrace
 TARGETS += openat2
+TARGETS += renameat2
 TARGETS += resctrl
 TARGETS += rlimits
 TARGETS += rseq
diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
new file mode 100644
index 000000000000..79bbdf497559
--- /dev/null
+++ b/tools/testing/selftests/renameat2/.gitignore
@@ -0,0 +1 @@
+renameat2_tests
diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
new file mode 100644
index 000000000000..6d5c44906b03
--- /dev/null
+++ b/tools/testing/selftests/renameat2/Makefile
@@ -0,0 +1,11 @@
+# SPDX-License-Identifier: GPL-2.0
+
+CFLAGS = -g -Wall -O2 -pthread
+CFLAGS += $(KHDR_INCLUDES)
+LDLIBS += -lpthread
+
+TEST_GEN_PROGS := renameat2_tests
+
+include ../lib.mk
+
+$(OUTPUT)/renameat2_tests: renameat2_tests.c
diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
new file mode 100644
index 000000000000..bc41975a565f
--- /dev/null
+++ b/tools/testing/selftests/renameat2/renameat2_tests.c
@@ -0,0 +1,451 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Written by James Yonan <james@openvpn.net>
+ * Copyright (c) 2022 OpenVPN, Inc.
+ */
+
+/*
+ * Test renameat2() with RENAME_NOREPLACE, RENAME_EXCHANGE,
+ * and RENAME_NEWER_MTIME.
+ *
+ * This test is designed to be portable between
+ * the Linux kernel self-tests and the Linux Test Project.
+ * The cool thing about running the test in the Linux Test Project
+ * is that it will automatically iterate the test over all the
+ * filesystems available in your kernel.  In a default kernel,
+ * that includes ext2, ext3, ext4, xfs, btrfs, and tmpfs.
+ *
+ * By default we assume a Linux kernel self-test build, where
+ * you can build and run with:
+ *   make -C tools/testing/selftests TARGETS=renameat2 run_tests
+ *
+ * For a Linux Test Project build, place this source file
+ * under the ltp tree in:
+ *   testcases/kernel/syscalls/renameat2/renameat203.c
+ * Then cd to testcases/kernel/syscalls/renameat2 and add:
+ *   CPPFLAGS += -DLINUX_TEST_PROJECT
+ * to the end of the Makefile.  Then run with:
+ *   make && ./rename_newer_mtime
+ */
+
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <time.h>
+
+#ifdef LINUX_TEST_PROJECT
+#include "tst_test.h"
+#include "renameat2.h"
+#else
+#include "../kselftest_harness.h"
+#endif
+
+/* requires a kernel that implements renameat2() RENAME_NEWER_MTIME flag */
+#ifndef RENAME_NEWER_MTIME
+#define RENAME_NEWER_MTIME (1 << 3)
+#endif
+
+/* convert milliseconds to nanoseconds */
+#define MS_TO_NANO(x) ((x) * 1000000)
+
+#ifdef LINUX_TEST_PROJECT
+
+#define MNTPOINT "mntpoint"
+#define WORKDIR MNTPOINT "/testdir.XXXXXX"
+
+#define MY_ERROR(...) tst_brk(TFAIL, __VA_ARGS__)
+#define MY_PASS(...) tst_res(TPASS, __VA_ARGS__)
+
+#else /* Linux kernel self-test */
+
+#define WORKDIR "/tmp/ksft-renameat2-rename-newer-mtime.XXXXXX"
+
+#define MY_ERROR(fmt, ...) ksft_exit_fail_msg("%s/%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__)
+#define MY_PASS(...)
+
+#endif
+
+static int create_file_with_timestamp(const char *filename,
+				      const time_t tv_sec,
+				      const long tv_nsec,
+				      struct stat *s,
+				      int *retain_fd)
+{
+	int fd;
+	struct timespec times[2];
+
+	fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777);
+	if (fd < 0)
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (futimens(fd, times)) {
+		close(fd);
+		return errno;
+	}
+	if (fstat(fd, s)) {
+		close(fd);
+		return errno;
+	}
+	if (retain_fd)
+		*retain_fd = fd;
+	else if (close(fd))
+		return errno;
+	return 0;
+}
+
+static int create_directory_with_timestamp(const char *dirname,
+					   const time_t tv_sec,
+					   const long tv_nsec,
+					   struct stat *s)
+{
+	struct timespec times[2];
+
+	if (mkdir(dirname, 0777))
+		return errno;
+	times[0].tv_sec = tv_sec;
+	times[0].tv_nsec = tv_nsec;
+	times[1] = times[0];
+	if (utimensat(AT_FDCWD, dirname, times, 0) != 0)
+		return errno;
+	if (lstat(dirname, s))
+		return errno;
+	return 0;
+}
+
+static int do_rename(const char *source_path, const char *target_path,
+		     const unsigned int flags)
+{
+	if (renameat2(AT_FDCWD, source_path, AT_FDCWD, target_path, flags))
+		return errno;
+	return 0;
+}
+
+static int verify_inode(const char *path, const struct stat *orig_stat)
+{
+	struct stat s;
+
+	if (stat(path, &s))
+		return errno;
+	if (orig_stat->st_ino != s.st_ino)
+		return ENOENT;
+	return 0;
+}
+
+static int verify_exist(const char *path)
+{
+	int fd;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return errno;
+	if (close(fd) != 0)
+		return errno;
+	return 0;
+}
+
+static int fd_d = -1; /* retained fd from file "d" */
+
+/*
+ * Test renameat2() with RENAME_NEWER_MTIME, RENAME_NOREPLACE, and RENAME_EXCHANGE.
+ */
+static void do_rename_newer_mtime(void)
+{
+	char dirname[] = WORKDIR;
+	const time_t now = time(NULL);
+	struct stat stat_a, stat_b, stat_c, stat_d, stat_f; /* files */
+	struct stat stat_x, stat_y; /* directories */
+	int eno; /* copied errno */
+
+	/* fd_d initial state */
+	fd_d = -1;
+
+	/* make the top-level directory */
+	if (!mkdtemp(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to create tmpdir, errno=%d", eno);
+	}
+
+	/* cd to top-level directory */
+	if (chdir(dirname)) {
+		eno = errno;
+		MY_ERROR("failed to cd to tmpdir, errno=%d", eno);
+	}
+
+	/* create files with different mtimes */
+	eno = create_file_with_timestamp("a", now, MS_TO_NANO(700), &stat_a, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'a', errno=%d", eno);
+	eno = create_file_with_timestamp("b", now+1, MS_TO_NANO(500), &stat_b, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'b', errno=%d", eno);
+	eno = create_file_with_timestamp("c", now+1, MS_TO_NANO(500), &stat_c, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'c', errno=%d", eno);
+	eno = create_file_with_timestamp("d", now+1, MS_TO_NANO(300), &stat_d, &fd_d); /* leave open for write */
+	if (eno)
+		MY_ERROR("failed to create file 'd', errno=%d", eno);
+	eno = create_file_with_timestamp("f", now, MS_TO_NANO(0), &stat_f, NULL);
+	if (eno)
+		MY_ERROR("failed to create file 'f', errno=%d", eno);
+
+	/* create directories with different mtimes */
+	eno = create_directory_with_timestamp("x", now+2, MS_TO_NANO(0), &stat_x);
+	if (eno)
+		MY_ERROR("failed to create directory 'x', errno=%d", eno);
+	eno = create_directory_with_timestamp("y", now+3, MS_TO_NANO(0), &stat_y);
+	if (eno)
+		MY_ERROR("failed to create directory 'y', errno=%d", eno);
+
+	/* rename b -> e with RENAME_NEWER_MTIME -- should succeed because e doesn't exist */
+	eno = do_rename("b", "e", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER_MTIME feature)", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after rename 'b' -> 'e', errno=%d", eno);
+	eno = verify_exist("b");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'b' still exists after rename 'b' -> 'e', errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME|RENAME_NOREPLACE -- should fail
+	 * because RENAME_NEWER_MTIME and RENAME_NOREPLACE cannot be used together
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME|RENAME_NOREPLACE);
+	if (eno != EINVAL)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EINVAL because RENAME_NEWER_MTIME and RENAME_NOREPLACE cannot be used together, errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME|RENAME_WHITEOUT -- should fail
+	 * because RENAME_NEWER_MTIME and RENAME_WHITEOUT cannot be used together
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME|RENAME_WHITEOUT);
+	if (eno != EINVAL)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EINVAL because RENAME_NEWER_MTIME and RENAME_WHITEOUT cannot be used together, errno=%d", eno);
+
+	/* rename c -> e with RENAME_NEWER_MTIME -- should fail because c and e have
+	 * the same timestamp
+	 */
+	eno = do_rename("c", "e", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'c' -> 'e' should have failed with EEXIST because 'c' and 'e' have the same timestamp, errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'c' -> 'e', errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted rename 'c' -> 'e', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NOREPLACE -- should fail because c exists */
+	eno = do_rename("a", "c", RENAME_NOREPLACE);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed because 'c' exists, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename a -> c with RENAME_NEWER_MTIME -- should fail because c is newer than a */
+	eno = do_rename("a", "c", RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("rename 'a' -> 'c' should have failed with EEXIST because 'c' is newer, errno=%d", eno);
+	eno = verify_inode("a", &stat_a);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno);
+	eno = verify_inode("c", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno);
+
+	/* rename c -> a with RENAME_NEWER_MTIME -- should succeed because c is newer than a */
+	eno = do_rename("c", "a", RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d", eno);
+	eno = verify_inode("a", &stat_c);
+	if (eno)
+		MY_ERROR("could not verify inode of 'a' after rename 'c' -> 'a', errno=%d", eno);
+	eno = verify_exist("c");
+	if (eno != ENOENT)
+		MY_ERROR("strangely, 'c' still exists after rename 'c' -> 'a', errno=%d", eno);
+
+	/* exchange f <-> nonexistent with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * only f exists
+	 */
+	eno = do_rename("f", "nonexistent", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ENOENT)
+		MY_ERROR("exchange 'f' <-> 'nonexistent' should have failed with ENOENT, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'nonexistent', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'd' <-> 'f' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange e <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write
+	 */
+	eno = do_rename("e", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY)
+		MY_ERROR("exchange 'e' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("e", &stat_b);
+	if (eno)
+		MY_ERROR("could not verify inode of 'e' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'e' <-> 'd', errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because
+	 * d is open for write but also because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != ETXTBSY) /* note in this case we get ETXTBSY first (EEXIST would have
+			     * been returned if d wasn't open for write)
+			     */
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (write), errno=%d", eno);
+	}
+
+	/* reopen "d" for read access, which should not prevent RENAME_NEWER_MTIME */
+	fd_d = open("d", O_RDONLY);
+	if (fd_d < 0)
+		MY_ERROR("error reopening 'd' for read, errno=%d", eno);
+
+	/* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail
+	 * because f is older than d
+	 */
+	eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EEXIST)
+		MY_ERROR("exchange 'f' <-> 'd' should have failed with EEXIST because f is older than d, errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno);
+
+	/* double exchange f <-> d with RENAME_EXCHANGE -- should succeed */
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = do_rename("f", "d", RENAME_EXCHANGE);
+	if (eno)
+		MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno);
+	eno = verify_inode("d", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed
+	 * because d is newer than f and fd_d is now read-only
+	 */
+	eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno)
+		MY_ERROR("exchange 'd' <-> 'f' failed, errno=%d", eno);
+	eno = verify_inode("d", &stat_f);
+	if (eno)
+		MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno);
+	eno = verify_inode("f", &stat_d);
+	if (eno)
+		MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno);
+
+	/* exchange directories x <-> y with RENAME_EXCHANGE|RENAME_NEWER_MTIME
+	 * -- should fail because RENAME_NEWER_MTIME is not implemented
+	 * for directories.
+	 */
+	eno = do_rename("x", "y", RENAME_EXCHANGE|RENAME_NEWER_MTIME);
+	if (eno != EISDIR)
+		MY_ERROR("exchange 'x' <-> 'y' should have failed with EISDIR because x and y are directories, errno=%d", eno);
+	eno = verify_inode("x", &stat_x);
+	if (eno)
+		MY_ERROR("could not verify inode of 'x' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+	eno = verify_inode("y", &stat_y);
+	if (eno)
+		MY_ERROR("could not verify inode of 'y' after attempted exchange 'x' <-> 'y', errno=%d", eno);
+
+	/* close fd_d */
+	if (close(fd_d) != 0) {
+		eno = errno;
+		MY_ERROR("error closing fd_d (read), errno=%d", eno);
+	}
+	fd_d = -1;
+
+	MY_PASS("rename_newer_mtime test passed, workdir=%s", dirname);
+}
+
+#ifdef LINUX_TEST_PROJECT
+
+static void setup(void)
+{
+}
+
+static void cleanup(void)
+{
+	/* close fd_d */
+	if (fd_d >= 0)
+		close(fd_d);
+}
+
+static struct tst_test test = {
+	.test_all = do_rename_newer_mtime,
+	.setup = setup,
+	.cleanup = cleanup,
+	.needs_root = 1,
+	.all_filesystems = 1,
+	.mount_device = 1,
+	.mntpoint = MNTPOINT,
+	.skip_filesystems = (const char*[]) {
+		"exfat",
+		"ntfs",
+		"vfat",
+		NULL
+	},
+	.needs_cmds = NULL,
+};
+
+#else /* Linux kernel self-test */
+
+TEST(rename_newer_mtime)
+{
+	do_rename_newer_mtime();
+}
+
+TEST_HARNESS_MAIN
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
  2022-07-11 19:13                   ` [PATCH v4 " James Yonan
  2022-07-11 19:13                     ` [PATCH v4 2/2] selftests: added a new target renameat2 James Yonan
@ 2022-07-11 23:10                     ` Dave Chinner
  1 sibling, 0 replies; 24+ messages in thread
From: Dave Chinner @ 2022-07-11 23:10 UTC (permalink / raw)
  To: James Yonan; +Cc: linux-fsdevel, neilb, amir73il, viro, linux-api

On Mon, Jul 11, 2022 at 01:13:30PM -0600, James Yonan wrote:
> RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags including RENAME_NOREPLACE,
> RENAME_EXCHANGE, and RENAME_WHITEOUT.
> 
> RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename or exchange
> will only succeed if the source file is newer than the target (i.e.
> source mtime > target mtime).  Otherwise, the rename will fail with
> -EEXIST instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.
> 
> RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
> conditional exchange, where the exchange only occurs if source mtime >
> target mtime.  Otherwise, the operation will fail with -EEXIST.
> 
> Some of the use cases for RENAME_NEWER_MTIME include (a) using a
> directory as a key-value store, or (b) maintaining a near-real-time
> mirror of a remote data source.  A common design pattern for maintaining
> such a data store would be to create a file using a temporary pathname,
> setting the file mtime using utimensat(2) or futimens(2) based on the
> remote creation timestamp of the file content, then using
> RENAME_NEWER_MTIME to move the file into place in the target directory.
> If the operation returns an error with errno == EEXIST, then the source
> file is not up-to-date and can safely be deleted. The goal is to
> facilitate distributed systems having many concurrent writers and
> readers, where update notifications are possibly delayed, duplicated, or
> reordered, yet where readers see a consistent view of the target
> directory with predictable semantics and atomic updates.
> 
> Note that RENAME_NEWER_MTIME depends on accurate, high-resolution
> timestamps for mtime, preferably approaching nanosecond resolution.
> 
> RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
> write access to both source and target inodes before comparing their
> mtimes, to stabilize the comparison.
> 
> The use case for RENAME_NEWER_MTIME doesn't really align with
> directories, so we return -EISDIR if either source or target is a
> directory.  This makes the locking necessary to stabilize the mtime
> comparison (in vfs_rename()) much more straightforward.
> 
> Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
> ext4, xfs, btrfs, and tmpfs.
> 
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests),

We have a whole bunch of renameat2() tests in fstests that cover all
the functionality of renameat2(), and fsstress will also exercise it
in stress workloads, too:

$ git grep -l renameat2
.gitignore
common/renameat2
configure.ac
ltp/fsstress.c
src/Makefile
src/renameat2.c
tests/btrfs/247
tests/generic/023
tests/generic/024
tests/generic/025
tests/generic/078
tests/generic/398
tests/generic/419
tests/generic/585
tests/generic/621
tests/generic/626

> so I created one, though
> at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
> The self-test is written to be portable to the Linux Test Project,
> and the advantage of running it there is that it automatically runs
> tests on multiple filesystems.  See comments at the beginning of
> renameat2_tests.c for more info.

Ideally, new renameat2 correctness tests should be added to fstests
as per the existing tests (as this is the primary test suite a lot
of fs developers use) so that we don't end up with partial test
coverage fragmented across different test suites. It does us no
favors to have non-overlapping partial coverage in different test
suites - we are better to implement complete coverage in one test
suite and focus our efforts there...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2022-07-11 23:10 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-27 22:11 [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace James Yonan
2022-06-28  9:46 ` Amir Goldstein
2022-06-28 21:56   ` James Yonan
2022-06-29  5:15     ` Amir Goldstein
2022-06-30 16:18       ` James Yonan
2022-06-28 17:34 ` Al Viro
2022-06-28 18:34   ` Amir Goldstein
2022-06-28 23:19     ` James Yonan
2022-06-29  1:43       ` Dave Chinner
2022-06-29  2:07         ` NeilBrown
2022-06-29  2:35           ` Dave Chinner
2022-06-29  2:49             ` NeilBrown
2022-06-30 16:39             ` James Yonan
2022-07-01  9:23               ` [PATCH v2] namei: implemented RENAME_NEWER_MTIME " James Yonan
2022-07-01 10:34                 ` Amir Goldstein
2022-07-01 20:06                   ` James Yonan
2022-07-02  8:07                 ` Dave Chinner
2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
2022-07-05 13:30                     ` [PATCH v3 2/2] selftests: added a new target renameat2 James Yonan
2022-07-05 13:30                     ` [PATCH man-pages] rename.2: document new renameat2() flag RENAME_NEWER_MTIME James Yonan
2022-07-05 14:03                   ` [RESEND PATCH v3 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace James Yonan
2022-07-11 19:13                   ` [PATCH v4 " James Yonan
2022-07-11 19:13                     ` [PATCH v4 2/2] selftests: added a new target renameat2 James Yonan
2022-07-11 23:10                     ` [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace Dave Chinner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.