linux-fsdevel.vger.kernel.org archive mirror
From: "Theodore Ts'o" <tytso@mit.edu>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Fri, 3 May 2019 05:58:46 -0400
Message-ID: <20190503095846.GE23724@mit.edu>
In-Reply-To: <CAOQ4uxjM+ivnn-oU482GmRqOF6bYY5j89NdyHnfH++f49qB4yw@mail.gmail.com>

On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> OK. we can leave that one for later.
> Although I am not sure what the concern is.
> If we are able to agree on and document a LINK_ATOMIC flag,
> what would be the downside of documenting a RENAME_ATOMIC
> flag with the same semantics? After all, as I said, this is what many users
> already expect when renaming a temp file (as ext4 heuristics prove).

The problem is, if the "temp file" has been hardlinked into 1000
different directories, does rename() have to guarantee that the
changes to all 1000 directories have been persisted to disk?  And
that all of the parent directories of those 1000 directories have
*all* been persisted to disk as well, all the way up to the root?
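
To make the scenario concrete, here is a minimal sketch in Python
(the helper name link_into_directories is hypothetical, not part of
any proposal in this thread).  It hard-links one file into several
directories and reads back st_nlink.  Note that st_nlink only says how
many directory entries exist, not which directories hold them, which
is part of why an "all directories" guarantee would be expensive.

```python
import os
import tempfile


def link_into_directories(src, directories, name):
    """Hard-link src as `name` inside each directory.

    Returns the inode's link count afterwards.  The count tells us how
    many directory entries refer to the inode, but not where they live.
    """
    for directory in directories:
        os.link(src, os.path.join(directory, name))
    return os.stat(src).st_nlink
```

With 1000 target directories instead of a handful, every one of them
holds a directory entry for the same inode.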

With the O_TMPFILE linkat case, we know that the inode hasn't been
hard-linked into any other directory, and mercifully directories have
only one parent directory, so we only have to check that one chain of
directory inodes, all the way up to the root, has been persisted.
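
As an illustration of that single chain, the following sketch lists
the directories between a file and the root.  The function name
ancestor_dirs is made up for this example, and the walk is purely
lexical: it does not resolve symlinks and knows nothing about bind
mounts or mount namespaces.

```python
import os


def ancestor_dirs(path):
    """Return the containing directory of path and each of its parents.

    The list starts at the containing directory and ends at the root.
    This is a lexical walk over the path string only.
    """
    current = os.path.dirname(os.path.abspath(path))
    chain = [current]
    while True:
        parent = os.path.dirname(current)
        if parent == current:  # dirname("/") is "/", so we reached the root
            break
        chain.append(parent)
        current = parent
    return chain
```

For a file with a single link there is exactly one such chain; a file
with N hard links has up to N of them.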

But.... I can already imagine someone complaining that, thanks to
sloppy drafting of the atomicity rules, if bind mounts and 1000 mount
namespaces provide some *other* directory pathname which could be used
to reach said "tmpfile", then we *also* have to guarantee that all
parent directories which could be used to reach it have been
persisted, even if they span a dozen different file systems.

If we are only guaranteeing the persistence of the containing
directories of the source and destination files, that's pretty easy.
But then the consistency rules need to *explicitly* state this.  Some
of the handwaving definitions of what would be guaranteed.... scare
me.
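
For comparison, the narrow "containing directories only" behavior can
already be approximated from userspace today.  The sketch below is an
assumption-laden illustration, not a proposed kernel interface: the
helper names are hypothetical, it is not atomic, and what fsync(2) on
a directory actually persists depends on the file system.

```python
import os


def fsync_dir(path):
    """Open a directory and fsync it, then close the descriptor."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def rename_and_persist_dirs(src, dst):
    """Rename src to dst, then fsync only the two containing directories.

    No parent directories further up the tree are touched, and nothing
    is done for other hard links to the same inode.
    """
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    os.rename(src, dst)
    fsync_dir(dst_dir)
    if src_dir != dst_dir:
        fsync_dir(src_dir)
```

The point of the sketch is the scope: exactly two directories, which
is the kind of limit the rules would have to spell out.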

						- Ted

P.S.  If we were going to do this, we'd probably want to simply define
a flag, AT_FSYNC, using the strict POSIX definition of fsync, which
is to say: as a result of the linkat or renameat, the file in
question and its associated metadata are guaranteed to be persisted
to disk.  Nothing would be guaranteed about any other inode's
metadata, regardless of when those changes might have been made.
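
AT_FSYNC is only a proposed name here; no such flag exists.  The
closest userspace equivalent is a link followed by fsync(2) on the
file itself, sketched below with a hypothetical helper name.  Unlike a
real flag, the two steps are separate system calls, and strict POSIX
fsync says nothing about the new directory entry.

```python
import os


def link_then_fsync(src, dst):
    """Create a hard link, then fsync the linked file and nothing else.

    Only the file's own data and metadata are flushed; no directory
    and no other inode is explicitly synced.
    """
    os.link(src, dst)
    fd = os.open(dst, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)
```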

If people really want "global barrier" semantics, then perhaps it
would be better to simply define a barrierfs(2) system call that works
like syncfs(2) --- it applies to the whole file system, and guarantees
that all changes made *before* barrierfs(2) will be visible if any
changes made *after* barrierfs(2) are visible.  Amir, you used "global
ordering" a few times; if you really need that, let's define a new
system call which guarantees that.  Maybe some of the research
proposals for exotic changes to SSD semantics, etc., would allow
barrierfs(2) semantics to be something that we could implement more
efficiently than syncfs(2).  But let's make this be explicit, as
opposed to some magic guarantee that falls out as a side effect of the
fsync(2) system call to a single inode.
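
Since barrierfs(2) does not exist, the only explicit whole-file-system
call available today is syncfs(2), which is stronger (it flushes
rather than merely orders).  A minimal sketch of calling it from
Python through ctypes follows; the wrapper name sync_filesystem_of is
hypothetical, and it assumes a Linux libc that exports syncfs.

```python
import ctypes
import os

# Load the symbols of the running process, which include the C library.
_libc = ctypes.CDLL(None, use_errno=True)


def sync_filesystem_of(path):
    """Call syncfs(2) on the file system that contains path.

    This flushes all dirty data and metadata on that file system, which
    is a stand-in for (and costlier than) an ordering-only barrier.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        if _libc.syncfs(fd) != 0:
            err = ctypes.get_errno()
            raise OSError(err, os.strerror(err))
    finally:
        os.close(fd)
```

An application wanting "everything before this point is durable before
anything after it" would call this between the two groups of changes.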

Thread overview: 25+ messages
2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein
2019-05-02 16:12 ` Amir Goldstein
2019-05-02 17:11   ` Vijay Chidambaram
2019-05-02 17:39     ` Amir Goldstein
2019-05-03  2:30       ` Theodore Ts'o
2019-05-03  3:15         ` Vijay Chidambaram
2019-05-03  9:45           ` Theodore Ts'o
2019-05-04  0:17             ` Vijay Chidambaram
2019-05-04  1:43               ` Theodore Ts'o
2019-05-07 18:38                 ` Jan Kara
2019-05-03  4:16         ` Amir Goldstein
2019-05-03  9:58           ` Theodore Ts'o [this message]
2019-05-03 14:18             ` Amir Goldstein
2019-05-09  2:36             ` Dave Chinner
2019-05-09  1:43         ` Dave Chinner
2019-05-09  2:20           ` Theodore Ts'o
2019-05-09  2:58             ` Dave Chinner
2019-05-09  3:31               ` Theodore Ts'o
2019-05-09  5:19                 ` Darrick J. Wong
2019-05-09  5:02             ` Vijay Chidambaram
2019-05-09  5:37               ` Darrick J. Wong
2019-05-09 15:46               ` Theodore Ts'o
2019-05-09  8:47           ` Amir Goldstein
2019-05-02 21:05   ` Darrick J. Wong
2019-05-02 22:19     ` Amir Goldstein
