From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A4269C433EF for ; Fri, 1 Jul 2022 10:35:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236261AbiGAKfE (ORCPT ); Fri, 1 Jul 2022 06:35:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:55340 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236281AbiGAKfC (ORCPT ); Fri, 1 Jul 2022 06:35:02 -0400 Received: from mail-vk1-xa33.google.com (mail-vk1-xa33.google.com [IPv6:2607:f8b0:4864:20::a33]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 99F7B70E4E for ; Fri, 1 Jul 2022 03:34:59 -0700 (PDT) Received: by mail-vk1-xa33.google.com with SMTP id 37so969326vkl.7 for ; Fri, 01 Jul 2022 03:34:59 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=AG48PFO30CBDekFsn3h5tQXXllhty/E8bf+KB/qCakY=; b=gLttMVSGNgGBR9wiygWoS4zO8kRCzcZc08taE9lqB4aNpBv7K5yMnEQvGNeCK8q7py vgqt6hPofG0IJOaj1NBgPCASsq6kkHmQjadRhCydoNBHBbzsJj3L9YGk3do+HewGUV2z M+w9VHPTDDL7yONAN+OP8oYVBbVSle9LyhrWJDVEVUIJ8V8GHCll4AYc0lluiQdBY3iz hqe+pwumS0cCOxAjr0TLF/sgcfqqrA7F5qcQcOXzE4P+NcJMt5vHPTR0vhAk3xborNAL yGYIdFPnBPYAowneB4H6mhHtX5Ucdf32c4/baglWTAAk4Gc9sxhFMywQoVL6UHvdGjMz ecYQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=AG48PFO30CBDekFsn3h5tQXXllhty/E8bf+KB/qCakY=; b=A7LuWPv4Zxohz/rdhE9lGWIyej2pQI+qVhwnF+c7NznFUjlTgBNTMFk73lrUyrckfL XtA8Z8QMsNkDVpJYuIle4AgbsiPob2eckEkyuT3ujk0xH60csN2NfVUbw4tFVFNgfSol cSvEgUWfaUG1HP6dSkATVTm6VLlWwIEsdZEbmtqAlDpjgJbKPNmEyNl5/yOVPU0mHafk UUZ26FaE3IIp9URxRUt1YaiqwEWo0VfsLAlZMNahE4qAk+hqXVjrf7SNnhKsaFqKmX6g oZCCp94z00ruKN8AIG046S8DX+0o/8rQZPZpIxGy0mc3TpbAJxHjGAhU1jLLkC419m05 q6bA== X-Gm-Message-State: AJIora/6r6f7re/uVocpvkWQgO7A2qYiBxcvxRlpu2koPDFR1blgGPFx dgarPNT0XJAV8WjOKJcH4jllYt2OSn9p6W9mmRg= X-Google-Smtp-Source: AGRyM1siKdXNhGh5XFA/fRp0vTAInCfYdhXSbnXNTm3k0XUvInuu+rBwDCCPQASR1ZUCSG1unO3eWQdzceOZ/FRj/1M= X-Received: by 2002:a05:6122:e10:b0:370:e49b:d1dd with SMTP id bk16-20020a0561220e1000b00370e49bd1ddmr33941vkb.25.1656671698393; Fri, 01 Jul 2022 03:34:58 -0700 (PDT) MIME-Version: 1.0 References: <20220701092326.1845210-1-james@openvpn.net> In-Reply-To: <20220701092326.1845210-1-james@openvpn.net> From: Amir Goldstein Date: Fri, 1 Jul 2022 13:34:46 +0300 Message-ID: Subject: Re: [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace To: James Yonan Cc: linux-fsdevel , Dave Chinner , Neil Brown , Al Viro Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-fsdevel@vger.kernel.org On Fri, Jul 1, 2022 at 12:24 PM James Yonan wrote: > > RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and > stands alongside existing flags such as RENAME_NOREPLACE, RENAME_EXCHANGE, > and RENAME_WHITEOUT. > > RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and > indicates that if the target of the rename exists, the rename or exchange > will only succeed if the source file is newer than the target (i.e. source > mtime > target mtime). Otherwise, the rename will fail with -EEXIST > instead of replacing the target. When the target doesn't exist, > RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE. > > RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for > conditional exchange, where the exchange only occurs if source mtime > > target mtime. Otherwise, the operation will fail with -EEXIST. > > RENAME_NEWER_MTIME is very useful in distributed systems that mirror a > directory structure, or use a directory as a key/value store, and need to > guarantee that files will only be overwritten by newer files, and that all > updates are atomic. > > RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny > write access to both source and target inodes before comparing their > mtimes, to stabilize the comparison. > > So one question to ask is could this functionality be implemented in > userspace without adding a new renameat2() flag? I think you could > attempt it with iterative RENAME_EXCHANGE, but it's hackish, inefficient, > and not atomic, because races could cause temporary mtime backtracks. > How about using file locking? Probably not, because the problem we want > to solve is maintaining file/directory atomicity for readers by creating > files out-of-directory, setting their mtime, and atomically moving them > into place. The strategy to lock such an operation really requires more > complex locking methods than are generally exposed to userspace. And if > you are using inotify on the directory to notify readers of changes, it > certainly makes sense to reduce unnecessary churn by preventing a move > operation based on the mtime check. > > While some people might question the utility of adding features to > filesystems to make them more like databases, there is real value in the > performance, atomicity, consistent VFS interface, multi-thread safety, and > async-notify capabilities of modern filesystems that starts to blur the > line, and actually make filesystem-based key-value stores a win for many > applications. > > Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in > the VFS, however the individual fs implementations do strict flags > checking and will return -EINVAL for any flag they don't recognize. > At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3, > ext4, xfs, btrfs, and tmpfs. > > I did not notice a general self-test for renameat2() at the VFS > layer (outside of fs-specific tests), so I created one, though > at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE. > The self-test is written to be portable to the Linux Test Project, > and the advantage of running it there is that it automatically runs > tests on multiple filesystems. See comments at the beginning of > renameat2_tests.c for more info. > > Build and run the self-test with: > > make -C tools/testing/selftests TARGETS=renameat2 run_tests > > Questions: > > Q: Why use mtime and not ctime for timestamp comparison? > > A: I think the "use a directory as a key/value store" use case > cares about the modification time of the file content rather > than metadata. Also, the rename operation itself modifies > ctime, making it less useful as a reference timestamp. > In any event, this patch creates the infrastructure for > conditional rename/exchange based on inode timestamp, so a > subsequent patch to add RENAME_NEWER_CTIME would be a mostly > trivial exercise. > > Q: Why compare mtimes when some systems have poor system clock > accuracy and resolution? > > A: So in the "use a directory as a key/value store" use case > in distributed systems, the file mtime is actually determined > remotely by the file content creator and is set locally > via futimens() rather than the local system clock. So this gives > you nanosecond-scale time resolution if the content creator > supports it, even if the local system clock has less resolution. > > Patch version history: This usually comes after --- line The criteria is to judge whether this information is going to be useful for the person reading the git log after merge. > > v2: Changed flag name from RENAME_NEWER to RENAME_NEWER_MTIME so > as to disambiguate and make it clear that we are comparing > mtime values. > > RENAME_NEWER_MTIME can now be combined with RENAME_EXCHANGE > for conditional exchange, where exchange only occurs if > source mtime > target mtime. > > Moved the mtime comparison logic into vfs_rename() to take > advantage of existing {lock,unlock}_two_nondirectories critical > section, and then further nest another critical section > {deny,allow}_write_access (adapted to inodes) to stabilize the > mtime, since our use case doesn't require renaming files that > are open for write (we will return -ETXTBSY in this case). > > Did some refactoring of inline functions in linux/fs.h that > manage inode->i_writecount, and added inode_deny_write_access2() > and inode_allow_write_access2() functions. > > Extended the self-test (renameat2_tests.c): > > 1. Verify that RENAME_NEWER_MTIME fails with ETXTBSY when > one of the files is open for write. > > 2. Test conditional exchange use case with combined flags > RENAME_EXCHANGE|RENAME_NEWER_MTIME. > > 3. The test .c file is now drop-in portable to the Linux Test > Project where you can take advantage of the .all_filesystems = 1 > flag to automatically run tests on multiple filesystems. > > Signed-off-by: James Yonan > --- > Documentation/filesystems/vfs.rst | 9 + > fs/btrfs/inode.c | 2 +- > fs/ext2/namei.c | 2 +- > fs/ext4/namei.c | 2 +- > fs/libfs.c | 2 +- > fs/namei.c | 21 +- > fs/xfs/xfs_iops.c | 2 +- > include/linux/fs.h | 43 +- > include/uapi/linux/fs.h | 1 + > mm/shmem.c | 2 +- > tools/include/uapi/linux/fs.h | 1 + > tools/testing/selftests/Makefile | 1 + > tools/testing/selftests/renameat2/.gitignore | 1 + > tools/testing/selftests/renameat2/Makefile | 10 + > .../selftests/renameat2/renameat2_tests.c | 447 ++++++++++++++++++ > 15 files changed, 534 insertions(+), 12 deletions(-) > create mode 100644 tools/testing/selftests/renameat2/.gitignore > create mode 100644 tools/testing/selftests/renameat2/Makefile > create mode 100644 tools/testing/selftests/renameat2/renameat2_tests.c > > diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst > index 08069ecd49a6..9bee575c1f27 100644 > --- a/Documentation/filesystems/vfs.rst > +++ b/Documentation/filesystems/vfs.rst > @@ -515,6 +515,15 @@ otherwise noted. > (2) RENAME_EXCHANGE: exchange source and target. Both must > exist; this is checked by the VFS. Unlike plain rename, source > and target may be of different type. > + (3) RENAME_NEWER_MTIME: this flag is similar to RENAME_NOREPLACE, > + and indicates a conditional rename: if the target of the rename > + exists, the rename should only succeed if the source file is newer > + than the target (i.e. source mtime > target mtime). Otherwise, the > + rename should fail with -EEXIST instead of replacing the target. > + To exchange source and target conditional on source being newer > + than target, pass flags as RENAME_EXCHANGE|RENAME_NEWER_MTIME. > + RENAME_NEWER_MTIME will fail with -ETXTBSY if either source or > + target is open for write. > > ``get_link`` > called by the VFS to follow a symbolic link to the inode it > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 05e0c4a5affd..c6232ac0f0eb 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -9549,7 +9549,7 @@ static int btrfs_rename2(struct user_namespace *mnt_userns, struct inode *old_di > struct dentry *old_dentry, struct inode *new_dir, > struct dentry *new_dentry, unsigned int flags) > { > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME)) This line may be "legally long" according to checkpatch but for no good reason IMO. Please break it and other similar long lines. > return -EINVAL; > > if (flags & RENAME_EXCHANGE) > diff --git a/fs/ext2/namei.c b/fs/ext2/namei.c > index 5f6b7560eb3f..35dc17f80528 100644 > --- a/fs/ext2/namei.c > +++ b/fs/ext2/namei.c > @@ -336,7 +336,7 @@ static int ext2_rename (struct user_namespace * mnt_userns, > struct ext2_dir_entry_2 * old_de; > int err; > > - if (flags & ~RENAME_NOREPLACE) > + if (flags & ~(RENAME_NOREPLACE | RENAME_NEWER_MTIME)) > return -EINVAL; > > err = dquot_initialize(old_dir); > diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c > index db4ba99d1ceb..210537bb144b 100644 > --- a/fs/ext4/namei.c > +++ b/fs/ext4/namei.c > @@ -4128,7 +4128,7 @@ static int ext4_rename2(struct user_namespace *mnt_userns, > if (unlikely(ext4_forced_shutdown(EXT4_SB(old_dir->i_sb)))) > return -EIO; > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME)) > return -EINVAL; > > err = fscrypt_prepare_rename(old_dir, old_dentry, new_dir, new_dentry, > diff --git a/fs/libfs.c b/fs/libfs.c > index 31b0ddf01c31..13e206f2cc58 100644 > --- a/fs/libfs.c > +++ b/fs/libfs.c > @@ -479,7 +479,7 @@ int simple_rename(struct user_namespace *mnt_userns, struct inode *old_dir, > struct inode *inode = d_inode(old_dentry); > int they_are_dirs = d_is_dir(old_dentry); > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_NEWER_MTIME)) > return -EINVAL; Leftover? I don't think that simple fs have meaningful mtime and they are not tested. > > if (flags & RENAME_EXCHANGE) > diff --git a/fs/namei.c b/fs/namei.c > index 1f28d3f463c3..5a8940058c43 100644 > --- a/fs/namei.c > +++ b/fs/namei.c > @@ -40,6 +40,7 @@ > #include > #include > #include > +#include > > #include "internal.h" > #include "mount.h" > @@ -4685,11 +4686,22 @@ int vfs_rename(struct renamedata *rd) > > take_dentry_name_snapshot(&old_name, old_dentry); > dget(new_dentry); > - if (!is_dir || (flags & RENAME_EXCHANGE)) > + if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME))) > lock_two_nondirectories(source, target); > else if (target) > inode_lock(target); > > + if ((flags & RENAME_NEWER_MTIME) && target) { > + /* deny write access to stabilize mtime comparison below */ > + error = inode_deny_write_access2(source, target); This is not needed for non regular file, but I guess it doesn't hurt... You could do a helper lock_two_inodes_deny_write() that takes care of both inode_lock() and inode_deny_write_access() and call it instead of lock_two_nondirectories() above. Then the lock and unlock routines would be more straightforward and less error prone, e.g.: - if (!is_dir || (flags & RENAME_EXCHANGE)) + if (flags & RENAME_NEWER_MTIME) + lock_two_inodes_deny_write(source, target); + else if (!is_dir || (flags & (RENAME_EXCHANGE))) lock_two_nondirectories(source, target); ... out: - if (!is_dir || (flags & RENAME_EXCHANGE)) + if (flags & RENAME_NEWER_MTIME) + unlock_two_inodes_allow_write(source, target); + else if (!is_dir || (flags & (RENAME_EXCHANGE))) unlock_two_nondirectories(source, target); OTOH, for directory, inode_lock is needed to stabilize mtime and lock_two_nondirectories() doesn't do that for you and it is also non trivial to get the locking order and lockdep annotations correct. Since you don't have a use case for RENAME_NEWER_MTIME and directories (?), maybe the easier way around this would be to deny that earlier in do_renameat2() with -EISDIR. > + if (error) /* -ETXTBSY */ > + goto out1; > + if (timespec64_compare(&source->i_mtime, &target->i_mtime) <= 0) { > + error = -EEXIST; > + goto out; > + } > + } > + > error = -EPERM; > if (IS_SWAPFILE(source) || (target && IS_SWAPFILE(target))) > goto out; > @@ -4736,7 +4748,10 @@ int vfs_rename(struct renamedata *rd) > d_exchange(old_dentry, new_dentry); > } > out: > - if (!is_dir || (flags & RENAME_EXCHANGE)) > + if ((flags & RENAME_NEWER_MTIME) && target) > + inode_allow_write_access2(source, target); > +out1: > + if (!is_dir || (flags & (RENAME_EXCHANGE|RENAME_NEWER_MTIME))) > unlock_two_nondirectories(source, target); > else if (target) > inode_unlock(target); > @@ -4769,7 +4784,7 @@ int do_renameat2(int olddfd, struct filename *from, int newdfd, > bool should_retry = false; > int error = -EINVAL; > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME)) > goto put_names; > > if ((flags & (RENAME_NOREPLACE | RENAME_WHITEOUT)) && > diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c > index 29f5b8b8aca6..84efc0b02a3c 100644 > --- a/fs/xfs/xfs_iops.c > +++ b/fs/xfs/xfs_iops.c > @@ -457,7 +457,7 @@ xfs_vn_rename( > struct xfs_name oname; > struct xfs_name nname; > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME)) > return -EINVAL; > > /* if we are exchanging files, we need to set i_mode of both files */ > diff --git a/include/linux/fs.h b/include/linux/fs.h > index 9ad5e3520fae..0c79f12ec51f 100644 > --- a/include/linux/fs.h > +++ b/include/linux/fs.h > @@ -2819,14 +2819,21 @@ static inline void file_end_write(struct file *file) > * use {get,deny}_write_access() - these functions check the sign and refuse > * to do the change if sign is wrong. > */ > +static inline int inode_deny_write_access(struct inode *inode) > +{ > + return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY; > +} > +static inline void inode_allow_write_access(struct inode *inode) > +{ > + atomic_inc(&inode->i_writecount); > +} > static inline int get_write_access(struct inode *inode) > { > return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBSY; > } > static inline int deny_write_access(struct file *file) > { > - struct inode *inode = file_inode(file); > - return atomic_dec_unless_positive(&inode->i_writecount) ? 0 : -ETXTBSY; > + return inode_deny_write_access(file_inode(file)); > } > static inline void put_write_access(struct inode * inode) > { > @@ -2835,13 +2842,43 @@ static inline void put_write_access(struct inode * inode) > static inline void allow_write_access(struct file *file) > { > if (file) > - atomic_inc(&file_inode(file)->i_writecount); > + inode_allow_write_access(file_inode(file)); > } > static inline bool inode_is_open_for_write(const struct inode *inode) > { > return atomic_read(&inode->i_writecount) > 0; > } > > +/** > + * inode_deny_write_access2 - deny write access on two inodes. > + * Returns -ETXTBSY if write access cannot be denied on either inode. > + * @inode1: first inode > + * @inode2: second inode > + */ > +static inline int inode_deny_write_access2(struct inode *inode1, struct inode *inode2) > +{ > + int error = inode_deny_write_access(inode1); > + if (error) > + return error; > + error = inode_deny_write_access(inode2); > + if (error) > + inode_allow_write_access(inode1); > + return error; > +} > + > +/** > + * inode_allow_write_access2 - allow write access on two inodes. > + * This method is intended to be called after a successful call > + * to inode_deny_write_access2(). > + * @inode1: first inode > + * @inode2: second inode > + */ > +static inline void inode_allow_write_access2(struct inode *inode1, struct inode *inode2) > +{ > + inode_allow_write_access(inode1); > + inode_allow_write_access(inode2); > +} > + > #if defined(CONFIG_IMA) || defined(CONFIG_FILE_LOCKING) > static inline void i_readcount_dec(struct inode *inode) > { > diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h > index bdf7b404b3e7..7e9c32dce3e4 100644 > --- a/include/uapi/linux/fs.h > +++ b/include/uapi/linux/fs.h > @@ -50,6 +50,7 @@ > #define RENAME_NOREPLACE (1 << 0) /* Don't overwrite target */ > #define RENAME_EXCHANGE (1 << 1) /* Exchange source and dest */ > #define RENAME_WHITEOUT (1 << 2) /* Whiteout source */ > +#define RENAME_NEWER_MTIME (1 << 3) /* Only newer file can overwrite target */ > > struct file_clone_range { > __s64 src_fd; > diff --git a/mm/shmem.c b/mm/shmem.c > index a6f565308133..592de9245c80 100644 > --- a/mm/shmem.c > +++ b/mm/shmem.c > @@ -3009,7 +3009,7 @@ static int shmem_rename2(struct user_namespace *mnt_userns, > struct inode *inode = d_inode(old_dentry); > int they_are_dirs = S_ISDIR(inode->i_mode); > > - if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT)) > + if (flags & ~(RENAME_NOREPLACE | RENAME_EXCHANGE | RENAME_WHITEOUT | RENAME_NEWER_MTIME)) > return -EINVAL; > > if (flags & RENAME_EXCHANGE) > diff --git a/tools/include/uapi/linux/fs.h b/tools/include/uapi/linux/fs.h > index bdf7b404b3e7..7e9c32dce3e4 100644 > --- a/tools/include/uapi/linux/fs.h > +++ b/tools/include/uapi/linux/fs.h > @@ -50,6 +50,7 @@ > #define RENAME_NOREPLACE (1 << 0) /* Don't overwrite target */ > #define RENAME_EXCHANGE (1 << 1) /* Exchange source and dest */ > #define RENAME_WHITEOUT (1 << 2) /* Whiteout source */ > +#define RENAME_NEWER_MTIME (1 << 3) /* Only newer file can overwrite target */ > > struct file_clone_range { > __s64 src_fd; > diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile > index de11992dc577..34226dfbca7a 100644 > --- a/tools/testing/selftests/Makefile > +++ b/tools/testing/selftests/Makefile > @@ -54,6 +54,7 @@ TARGETS += proc > TARGETS += pstore > TARGETS += ptrace > TARGETS += openat2 > +TARGETS += renameat2 > TARGETS += resctrl > TARGETS += rlimits > TARGETS += rseq > diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore > new file mode 100644 > index 000000000000..79bbdf497559 > --- /dev/null > +++ b/tools/testing/selftests/renameat2/.gitignore > @@ -0,0 +1 @@ > +renameat2_tests > diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile > new file mode 100644 > index 000000000000..c39f5e281a5d > --- /dev/null > +++ b/tools/testing/selftests/renameat2/Makefile > @@ -0,0 +1,10 @@ > +# SPDX-License-Identifier: GPL-2.0 > + > +CFLAGS = -g -Wall -O2 > +CFLAGS += $(KHDR_INCLUDES) > + > +TEST_GEN_PROGS := renameat2_tests > + > +include ../lib.mk > + > +$(OUTPUT)/renameat2_tests: renameat2_tests.c > diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c > new file mode 100644 > index 000000000000..1fdb908cf428 > --- /dev/null > +++ b/tools/testing/selftests/renameat2/renameat2_tests.c > @@ -0,0 +1,447 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +/* > + * Written by James Yonan > + * Copyright (c) 2022 OpenVPN, Inc. > + */ > + > +/* > + * Test renameat2() with RENAME_NOREPLACE, RENAME_EXCHANGE, > + * and RENAME_NEWER_MTIME. > + * > + * This test is designed to be portable between > + * the Linux kernel self-tests and the Linux Test Project. > + * The cool thing about running the test in the Linux Test Project > + * is that it will automatically iterate the test over all the > + * filesystems available in your kernel. In a default kernel, > + * that includes ext2, ext3, ext4, xfs, btrfs, and tmpfs. > + * > + * By default we assume a Linux kernel self-test build, where > + * you can build and run with: > + * make -C tools/testing/selftests TARGETS=renameat2 run_tests > + * > + * For a Linux Test Project build, place this source file > + * under the ltp tree in: > + * testcases/kernel/syscalls/renameat2/renameat203.c > + * Then cd to testcases/kernel/syscalls/renameat2 and add: > + * CPPFLAGS += -DLINUX_TEST_PROJECT > + * to the end of the Makefile. Then run with: > + * make && ./rename_newer_mtime > + */ > + > +#ifndef _GNU_SOURCE > +#define _GNU_SOURCE > +#endif > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#ifdef LINUX_TEST_PROJECT > +#include "tst_test.h" > +#include "renameat2.h" > +#else > +#include "../kselftest_harness.h" > +#endif > + > +/* requires a kernel that implements renameat2() RENAME_NEWER_MTIME flag */ > +#ifndef RENAME_NEWER_MTIME > +#define RENAME_NEWER_MTIME (1 << 3) > +#endif > + > +/* convert milliseconds to nanoseconds */ > +#define MS_TO_NANO(x) ((x) * 1000000) > + > +#ifdef LINUX_TEST_PROJECT > + > +#define MNTPOINT "mntpoint" > +#define WORKDIR MNTPOINT "/testdir.XXXXXX" > + > +#define MY_ERROR(...) tst_brk(TFAIL, __VA_ARGS__) > +#define MY_PASS(...) tst_res(TPASS, __VA_ARGS__) > + > +#else /* Linux kernel self-test */ > + > +#define WORKDIR "/tmp/ksft-renameat2-rename-newer-mtime.XXXXXX" > + > +#define MY_ERROR(fmt, ...) ksft_exit_fail_msg("%s/%d: " fmt "\n", __FILE__, __LINE__, __VA_ARGS__) > +#define MY_PASS(...) > + > +#endif > + > +static int create_file_with_timestamp(const char *filename, > + const time_t tv_sec, > + const long tv_nsec, > + struct stat *s, > + int *retain_fd) > +{ > + int fd; > + struct timespec times[2]; > + > + fd = open(filename, O_CREAT|O_TRUNC|O_WRONLY, 0777); > + if (fd < 0) > + return errno; > + times[0].tv_sec = tv_sec; > + times[0].tv_nsec = tv_nsec; > + times[1] = times[0]; > + if (futimens(fd, times)) { > + close(fd); > + return errno; > + } > + if (fstat(fd, s)) { > + close(fd); > + return errno; > + } > + if (retain_fd) > + *retain_fd = fd; > + else if (close(fd)) > + return errno; > + return 0; > +} > + > +static int create_directory_with_timestamp(const char *dirname, > + const time_t tv_sec, > + const long tv_nsec, > + struct stat *s) > +{ > + struct timespec times[2]; > + > + if (mkdir(dirname, 0777)) > + return errno; > + times[0].tv_sec = tv_sec; > + times[0].tv_nsec = tv_nsec; > + times[1] = times[0]; > + if (utimensat(AT_FDCWD, dirname, times, 0) != 0) > + return errno; > + if (lstat(dirname, s)) > + return errno; > + return 0; > +} > + > +static int do_rename(const char *source_path, const char *target_path, > + const unsigned int flags) > +{ > + if (renameat2(AT_FDCWD, source_path, AT_FDCWD, target_path, flags)) > + return errno; > + return 0; > +} > + > +static int verify_inode(const char *path, const struct stat *orig_stat) > +{ > + struct stat s; > + > + if (stat(path, &s)) > + return errno; > + if (orig_stat->st_ino != s.st_ino) > + return ENOENT; > + return 0; > +} > + > +static int verify_exist(const char *path) > +{ > + int fd; > + > + fd = open(path, O_RDONLY); > + if (fd < 0) > + return errno; > + if (close(fd) != 0) > + return errno; > + return 0; > +} > + > +static int fd_d = -1; /* retained fd from file "d" */ > + > +/* > + * Test renameat2() with RENAME_NEWER_MTIME, RENAME_NOREPLACE, and RENAME_EXCHANGE. > + */ > +static void do_rename_newer_mtime(void) > +{ > + char dirname[] = WORKDIR; > + const time_t now = time(NULL); > + struct stat stat_a, stat_b, stat_c, stat_d, stat_f; /* files */ > + struct stat stat_x, stat_y; /* directories */ > + int eno; /* copied errno */ > + > + /* fd_d initial state */ > + fd_d = -1; > + > + /* make the top-level directory */ > + if (!mkdtemp(dirname)) { > + eno = errno; > + MY_ERROR("failed to create tmpdir, errno=%d", eno); > + } > + > + /* cd to top-level directory */ > + if (chdir(dirname)) { > + eno = errno; > + MY_ERROR("failed to cd to tmpdir, errno=%d", eno); > + } > + > + /* create files with different mtimes */ > + eno = create_file_with_timestamp("a", now, MS_TO_NANO(700), &stat_a, NULL); > + if (eno) > + MY_ERROR("failed to create file 'a', errno=%d", eno); > + eno = create_file_with_timestamp("b", now+1, MS_TO_NANO(500), &stat_b, NULL); > + if (eno) > + MY_ERROR("failed to create file 'b', errno=%d", eno); > + eno = create_file_with_timestamp("c", now+1, MS_TO_NANO(500), &stat_c, NULL); > + if (eno) > + MY_ERROR("failed to create file 'c', errno=%d", eno); > + eno = create_file_with_timestamp("d", now+1, MS_TO_NANO(300), &stat_d, &fd_d); /* leave open */ > + if (eno) > + MY_ERROR("failed to create file 'd', errno=%d", eno); > + eno = create_file_with_timestamp("f", now, MS_TO_NANO(0), &stat_f, NULL); > + if (eno) > + MY_ERROR("failed to create file 'f', errno=%d", eno); > + > + /* create directories with different mtimes */ > + eno = create_directory_with_timestamp("x", now+2, MS_TO_NANO(0), &stat_x); > + if (eno) > + MY_ERROR("failed to create directory 'x', errno=%d", eno); > + eno = create_directory_with_timestamp("y", now+3, MS_TO_NANO(0), &stat_y); > + if (eno) > + MY_ERROR("failed to create directory 'y', errno=%d", eno); > + > + /* rename b -> e with RENAME_NEWER_MTIME -- should succeed because e doesn't exist */ > + eno = do_rename("b", "e", RENAME_NEWER_MTIME); > + if (eno) > + MY_ERROR("failed to rename 'b' -> 'e', errno=%d (kernel may be missing RENAME_NEWER_MTIME feature)", eno); > + eno = verify_inode("e", &stat_b); > + if (eno) > + MY_ERROR("could not verify inode of 'e' after rename 'b' -> 'e', errno=%d", eno); > + eno = verify_exist("b"); > + if (eno != ENOENT) > + MY_ERROR("strangely, 'b' still exists after rename 'b' -> 'e', errno=%d", eno); > + > + /* rename c -> e with RENAME_NEWER_MTIME -- should fail because c and e have > + * the same timestamp > + */ > + eno = do_rename("c", "e", RENAME_NEWER_MTIME); > + if (eno != EEXIST) > + MY_ERROR("rename 'c' -> 'e' should have failed with EEXIST because 'c' and 'e' have the same timestamp, errno=%d", eno); > + eno = verify_inode("c", &stat_c); > + if (eno) > + MY_ERROR("could not verify inode of 'c' after attempted rename 'c' -> 'e', errno=%d", eno); > + eno = verify_inode("e", &stat_b); > + if (eno) > + MY_ERROR("could not verify inode of 'e' after attempted rename 'c' -> 'e', errno=%d", eno); > + > + /* rename a -> c with RENAME_NOREPLACE -- should fail because c exists */ > + eno = do_rename("a", "c", RENAME_NOREPLACE); > + if (eno != EEXIST) > + MY_ERROR("rename 'a' -> 'c' should have failed because 'c' exists, errno=%d", eno); > + eno = verify_inode("a", &stat_a); > + if (eno) > + MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno); > + eno = verify_inode("c", &stat_c); > + if (eno) > + MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno); > + > + /* rename a -> c with RENAME_NEWER_MTIME -- should fail because c is newer than a */ > + eno = do_rename("a", "c", RENAME_NEWER_MTIME); > + if (eno != EEXIST) > + MY_ERROR("rename 'a' -> 'c' should have failed with EEXIST because 'c' is newer, errno=%d", eno); > + eno = verify_inode("a", &stat_a); > + if (eno) > + MY_ERROR("could not verify inode of 'a' after attempted rename 'a' -> 'c', errno=%d", eno); > + eno = verify_inode("c", &stat_c); > + if (eno) > + MY_ERROR("could not verify inode of 'c' after attempted rename 'a' -> 'c', errno=%d", eno); > + > + /* rename c -> a with RENAME_NEWER_MTIME -- should succeed because c is newer than a */ > + eno = do_rename("c", "a", RENAME_NEWER_MTIME); > + if (eno) > + MY_ERROR("rename 'c' -> 'a' should have succeeded because 'c' is newer than 'a', errno=%d", eno); > + eno = verify_inode("a", &stat_c); > + if (eno) > + MY_ERROR("could not verify inode of 'a' after rename 'c' -> 'a', errno=%d", eno); > + eno = verify_exist("c"); > + if (eno != ENOENT) > + MY_ERROR("strangely, 'c' still exists after rename 'c' -> 'a', errno=%d", eno); > + > + /* exchange f <-> nonexistent with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because > + * only f exists > + */ > + eno = do_rename("f", "nonexistent", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != ENOENT) > + MY_ERROR("exchange 'f' <-> 'nonexistent' should have failed with ENOENT, errno=%d", eno); > + eno = verify_inode("f", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'nonexistent', errno=%d", eno); > + > + /* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because > + * d is open for write > + */ > + eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != ETXTBSY) > + MY_ERROR("exchange 'd' <-> 'f' should have failed with ETXTBSY because d is open for write, errno=%d", eno); > + eno = verify_inode("d", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after attempted exchange 'd' <-> 'f', errno=%d", eno); > + eno = verify_inode("f", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after attempted exchange 'd' <-> 'f', errno=%d", eno); > + > + /* exchange e <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because > + * d is open for write > + */ > + eno = do_rename("e", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != ETXTBSY) > + MY_ERROR("exchange 'e' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno); > + eno = verify_inode("e", &stat_b); > + if (eno) > + MY_ERROR("could not verify inode of 'e' after attempted exchange 'e' <-> 'd', errno=%d", eno); > + eno = verify_inode("d", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after attempted exchange 'e' <-> 'd', errno=%d", eno); > + > + /* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail because > + * d is open for write but also because f is older than d > + */ > + eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != ETXTBSY) /* note in this case we get ETXTBSY first (EEXIST would have > + * been returned if d wasn't open for write) > + */ > + MY_ERROR("exchange 'f' <-> 'd' should have failed with ETXTBSY because d is open for write, errno=%d", eno); > + eno = verify_inode("f", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno); > + eno = verify_inode("d", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno); > + > + /* close fd_d */ > + if (close(fd_d) != 0) { > + eno = errno; > + MY_ERROR("error closing fd_d (write), errno=%d", eno); > + } > + > + /* reopen "d" for read access, which should not prevent RENAME_NEWER_MTIME */ > + fd_d = open("d", O_RDONLY); > + if (fd_d < 0) > + MY_ERROR("error reopening 'd' for read, errno=%d", eno); > + > + /* exchange f <-> d with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should fail > + * because f is older than d > + */ > + eno = do_rename("f", "d", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != EEXIST) > + MY_ERROR("exchange 'f' <-> 'd' should have failed with EEXIST because f is older than d, errno=%d", eno); > + eno = verify_inode("f", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after attempted exchange 'f' <-> 'd', errno=%d", eno); > + eno = verify_inode("d", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after attempted exchange 'f' <-> 'd', errno=%d", eno); > + > + /* double exchange f <-> d with RENAME_EXCHANGE -- should succeed */ > + eno = do_rename("f", "d", RENAME_EXCHANGE); > + if (eno) > + MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno); > + eno = verify_inode("d", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno); > + eno = verify_inode("f", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno); > + eno = do_rename("f", "d", RENAME_EXCHANGE); > + if (eno) > + MY_ERROR("exchange 'f' <-> 'd' should have succeeded, errno=%d", eno); > + eno = verify_inode("d", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno); > + eno = verify_inode("f", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno); > + > + /* exchange d <-> f with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed > + * because d is newer than f and fd_d is now read-only > + */ > + eno = do_rename("d", "f", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno) > + MY_ERROR("exchange 'd' <-> 'f' failed, errno=%d", eno); > + eno = verify_inode("d", &stat_f); > + if (eno) > + MY_ERROR("could not verify inode of 'd' after exchange 'd' <-> 'f', errno=%d", eno); > + eno = verify_inode("f", &stat_d); > + if (eno) > + MY_ERROR("could not verify inode of 'f' after exchange 'd' <-> 'f', errno=%d", eno); > + > + /* exchange directories x <-> y with RENAME_EXCHANGE|RENAME_NEWER_MTIME > + * -- should fail because x is older than y > + */ > + eno = do_rename("x", "y", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno != EEXIST) > + MY_ERROR("exchange 'x' <-> 'y' should have failed with EEXIST because x is older than y, errno=%d", eno); > + eno = verify_inode("x", &stat_x); > + if (eno) > + MY_ERROR("could not verify inode of 'x' after attempted exchange 'x' <-> 'y', errno=%d", eno); > + eno = verify_inode("y", &stat_y); > + if (eno) > + MY_ERROR("could not verify inode of 'y' after attempted exchange 'x' <-> 'y', errno=%d", eno); > + > + /* exchange directories y <-> x with RENAME_EXCHANGE|RENAME_NEWER_MTIME -- should succeed */ > + eno = do_rename("y", "x", RENAME_EXCHANGE|RENAME_NEWER_MTIME); > + if (eno) > + MY_ERROR("exchange 'y' <-> 'x' failed, errno=%d", eno); > + eno = verify_inode("x", &stat_y); > + if (eno) > + MY_ERROR("could not verify inode of 'x' after exchange 'y' <-> 'x', errno=%d", eno); > + eno = verify_inode("y", &stat_x); > + if (eno) > + MY_ERROR("could not verify inode of 'y' after exchange 'y' <-> 'x', errno=%d", eno); > + > + /* close fd_d */ > + if (close(fd_d) != 0) { > + eno = errno; > + MY_ERROR("error closing fd_d (read), errno=%d", eno); > + } > + fd_d = -1; > + > + MY_PASS("rename_newer_mtime test passed, workdir=%s", dirname); > +} > + > +#ifdef LINUX_TEST_PROJECT > + > +static void setup(void) > +{ > +} > + > +static void cleanup(void) > +{ > + /* close fd_d */ > + if (fd_d >= 0) > + close(fd_d); > +} > + > +static struct tst_test test = { > + .test_all = do_rename_newer_mtime, > + .setup = setup, > + .cleanup = cleanup, > + .needs_root = 1, > + .all_filesystems = 1, > + .mount_device = 1, > + .mntpoint = MNTPOINT, > + .skip_filesystems = (const char*[]) { > + "exfat", > + "ntfs", > + "vfat", > + NULL > + }, > + .needs_cmds = NULL, > +}; > + > +#else /* Linux kernel self-test */ > + > +TEST(rename_newer_mtime) > +{ > + do_rename_newer_mtime(); > +} > + > +TEST_HARNESS_MAIN > + > +#endif Nice trick :) Thanks, Amir.