All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: James Yonan <james@openvpn.net>
Cc: linux-fsdevel@vger.kernel.org, neilb@suse.de, amir73il@gmail.com,
	viro@zeniv.linux.org.uk
Subject: Re: [PATCH v2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace
Date: Sat, 2 Jul 2022 18:07:10 +1000	[thread overview]
Message-ID: <20220702080710.GB3108597@dread.disaster.area> (raw)
In-Reply-To: <20220701092326.1845210-1-james@openvpn.net>

On Fri, Jul 01, 2022 at 03:23:26AM -0600, James Yonan wrote:
> RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and
> stands alongside existing flags such as RENAME_NOREPLACE, RENAME_EXCHANGE,
> and RENAME_WHITEOUT.
> 
> RENAME_NEWER_MTIME is a conditional variation on RENAME_NOREPLACE, and
> indicates that if the target of the rename exists, the rename or exchange
> will only succeed if the source file is newer than the target (i.e. source
> mtime > target mtime).  Otherwise, the rename will fail with -EEXIST
> instead of replacing the target.  When the target doesn't exist,
> RENAME_NEWER_MTIME does a plain rename like RENAME_NOREPLACE.
> 
> RENAME_NEWER_MTIME can also be combined with RENAME_EXCHANGE for
> conditional exchange, where the exchange only occurs if source mtime >
> target mtime.  Otherwise, the operation will fail with -EEXIST.
> 
> RENAME_NEWER_MTIME is very useful in distributed systems that mirror a
> directory structure, or use a directory as a key/value store, and need to
> guarantee that files will only be overwritten by newer files, and that all
> updates are atomic.

You need to cc linux-api and write a renameat2() man page update
that documents how this option behaves for application developers.
The bits where it will appear to randomly fail are especially
important to document properly...

> RENAME_NEWER_MTIME is implemented in vfs_rename(), and we lock and deny
> write access to both source and target inodes before comparing their
> mtimes, to stabilize the comparison.
> 
> So one question to ask is could this functionality be implemented in
> userspace without adding a new renameat2() flag?  I think you could
> attempt it with iterative RENAME_EXCHANGE, but it's hackish, inefficient,
> and not atomic, because races could cause temporary mtime backtracks.
> How about using file locking?  Probably not, because the problem we want
> to solve is maintaining file/directory atomicity for readers by creating
> files out-of-directory, setting their mtime, and atomically moving them
> into place.  The strategy to lock such an operation really requires more
> complex locking methods than are generally exposed to userspace.  And if
> you are using inotify on the directory to notify readers of changes, it
> certainly makes sense to reduce unnecessary churn by preventing a move
> operation based on the mtime check.
> 
> While some people might question the utility of adding features to
> filesystems to make them more like databases, there is real value in the
> performance, atomicity, consistent VFS interface, multi-thread safety, and
> async-notify capabilities of modern filesystems that starts to blur the
> line, and actually make filesystem-based key-value stores a win for many
> applications.
> 
> Like RENAME_NOREPLACE, the RENAME_NEWER_MTIME implementation lives in
> the VFS, however the individual fs implementations do strict flags
> checking and will return -EINVAL for any flag they don't recognize.
> At this time, I have enabled and tested RENAME_NEWER_MTIME on ext2, ext3,
> ext4, xfs, btrfs, and tmpfs.
> 
> I did not notice a general self-test for renameat2() at the VFS
> layer (outside of fs-specific tests), so I created one, though
> at the moment it only exercises RENAME_NEWER_MTIME and RENAME_EXCHANGE.
> The self-test is written to be portable to the Linux Test Project,
> and the advantage of running it there is that it automatically runs
> tests on multiple filesystems.  See comments at the beginning of
> renameat2_tests.c for more info.
> 
> Build and run the self-test with:
> 
>   make -C tools/testing/selftests TARGETS=renameat2 run_tests
> 
> Questions:
> 
> Q: Why use mtime and not ctime for timestamp comparison?
> 
> A: I think the "use a directory as a key/value store" use case
>    cares about the modification time of the file content rather
>    than metadata.  Also, the rename operation itself modifies
>    ctime, making it less useful as a reference timestamp.
>    In any event, this patch creates the infrastructure for
>    conditional rename/exchange based on inode timestamp, so a
>    subsequent patch to add RENAME_NEWER_CTIME would be a mostly
>    trivial exercise.
> 
> Q: Why compare mtimes when some systems have poor system clock
>    accuracy and resolution?
> 
> A: So in the "use a directory as a key/value store" use case
>    in distributed systems, the file mtime is actually determined
>    remotely by the file content creator and is set locally
>    via futimens() rather than the local system clock.  So this gives
>    you nanosecond-scale time resolution if the content creator
>    supports it, even if the local system clock has less resolution.

That's not a useful answer to an application developer wondering if
he can use this feature in his application. This answer is the
justification of why *your application doesn't care* about poor
timestamp resolution, not an explanation of how some other
application can detect and/or address the problem when it arises...

.....

> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index de11992dc577..34226dfbca7a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -54,6 +54,7 @@ TARGETS += proc
>  TARGETS += pstore
>  TARGETS += ptrace
>  TARGETS += openat2
> +TARGETS += renameat2
>  TARGETS += resctrl
>  TARGETS += rlimits
>  TARGETS += rseq
> diff --git a/tools/testing/selftests/renameat2/.gitignore b/tools/testing/selftests/renameat2/.gitignore
> new file mode 100644
> index 000000000000..79bbdf497559
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/.gitignore
> @@ -0,0 +1 @@
> +renameat2_tests
> diff --git a/tools/testing/selftests/renameat2/Makefile b/tools/testing/selftests/renameat2/Makefile
> new file mode 100644
> index 000000000000..c39f5e281a5d
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/Makefile
> @@ -0,0 +1,10 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +CFLAGS = -g -Wall -O2
> +CFLAGS += $(KHDR_INCLUDES)
> +
> +TEST_GEN_PROGS := renameat2_tests
> +
> +include ../lib.mk
> +
> +$(OUTPUT)/renameat2_tests: renameat2_tests.c
> diff --git a/tools/testing/selftests/renameat2/renameat2_tests.c b/tools/testing/selftests/renameat2/renameat2_tests.c
> new file mode 100644
> index 000000000000..1fdb908cf428
> --- /dev/null
> +++ b/tools/testing/selftests/renameat2/renameat2_tests.c
> @@ -0,0 +1,447 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +/*
> + * Written by James Yonan <james@openvpn.net>
> + * Copyright (c) 2022 OpenVPN, Inc.
> + */
.....

Please add the test in a separate patch. You probably also want to
write tests for fstests so that it gets exercised regularly by
filesystem developers who will notice when it breaks on their
filesystem......

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2022-07-02  8:07 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-27 22:11 [PATCH] namei: implemented RENAME_NEWER flag for renameat2() conditional replace James Yonan
2022-06-28  9:46 ` Amir Goldstein
2022-06-28 21:56   ` James Yonan
2022-06-29  5:15     ` Amir Goldstein
2022-06-30 16:18       ` James Yonan
2022-06-28 17:34 ` Al Viro
2022-06-28 18:34   ` Amir Goldstein
2022-06-28 23:19     ` James Yonan
2022-06-29  1:43       ` Dave Chinner
2022-06-29  2:07         ` NeilBrown
2022-06-29  2:35           ` Dave Chinner
2022-06-29  2:49             ` NeilBrown
2022-06-30 16:39             ` James Yonan
2022-07-01  9:23               ` [PATCH v2] namei: implemented RENAME_NEWER_MTIME " James Yonan
2022-07-01 10:34                 ` Amir Goldstein
2022-07-01 20:06                   ` James Yonan
2022-07-02  8:07                 ` Dave Chinner [this message]
2022-07-05 13:30                   ` [PATCH v3 1/2] RENAME_NEWER_MTIME is a new userspace-visible flag for renameat2(), and stands alongside existing flags including RENAME_NOREPLACE, RENAME_EXCHANGE, and RENAME_WHITEOUT James Yonan
2022-07-05 13:30                     ` [PATCH v3 2/2] selftests: added a new target renameat2 James Yonan
2022-07-05 13:30                     ` [PATCH man-pages] rename.2: document new renameat2() flag RENAME_NEWER_MTIME James Yonan
2022-07-05 14:03                   ` [RESEND PATCH v3 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace James Yonan
2022-07-11 19:13                   ` [PATCH v4 " James Yonan
2022-07-11 19:13                     ` [PATCH v4 2/2] selftests: added a new target renameat2 James Yonan
2022-07-11 23:10                     ` [PATCH v4 1/2] namei: implemented RENAME_NEWER_MTIME flag for renameat2() conditional replace Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20220702080710.GB3108597@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=amir73il@gmail.com \
    --cc=james@openvpn.net \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=neilb@suse.de \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.