Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

From: "Theodore Ts'o" <tytso@mit.edu>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Fri, 3 May 2019 05:58:46 -0400
Message-ID: <20190503095846.GE23724@mit.edu>
In-Reply-To: <CAOQ4uxjM+ivnn-oU482GmRqOF6bYY5j89NdyHnfH++f49qB4yw@mail.gmail.com>

On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> OK. we can leave that one for later.
> Although I am not sure what the concern is.
> If we are able to agree on and document a LINK_ATOMIC flag,
> what would be the downside of documenting a RENAME_ATOMIC
> flag with the same semantics? After all, as I said, this is what many users
> already expect when renaming a temp file (as ext4 heuristics prove).

The problem is if the "temp file" has been hardlinked into 1000
different directories, does the rename() have to guarantee that the
changes to all 1000 directories have been persisted to disk?  And that
all of the parent directories of those 1000 directories have also
*all* been persisted to disk, all the way up to the root?

With the O_TMPFILE linkat case, we know that inode hasn't been
hard-linked to any other directory, and mercifully directories have
only one parent directory, so we only have to check one set of
directory inodes all the way up to the root having been persisted.
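
For concreteness, here is roughly what the O_TMPFILE + linkat case
looks like from userspace today, with the persistence spelled out by
hand.  This is only a sketch: publish_tmpfile() is a made-up helper
name, the /proc/self/fd trick assumes /proc is mounted, and the two
fsync() calls cover only the file and its one containing directory,
not the ancestors up to the root.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unnamed file in dirpath, fill it,
 * link it in as `name`, then fsync the file and the one directory
 * that now contains it.  Returns 0, or -1 with errno set.
 */
static int publish_tmpfile(const char *dirpath, const char *name,
                           const void *data, size_t len)
{
    char procpath[64];
    int rc = -1, saved;

    int dirfd = open(dirpath, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* Unnamed inode: it has no directory entry anywhere yet. */
    int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        goto out_dir;

    ssize_t n = write(fd, data, len);
    if (n != (ssize_t)len) {
        if (n >= 0)
            errno = EIO;        /* treat a short write as a failure */
        goto out;
    }
    if (fsync(fd) < 0)          /* file data + inode metadata */
        goto out;

    /* Give the inode its first (and only) name. */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, dirfd, name, AT_SYMLINK_FOLLOW) < 0)
        goto out;

    if (fsync(dirfd) < 0)       /* the single containing directory */
        goto out;
    rc = 0;
out:
    saved = errno;
    close(fd);
    errno = saved;
out_dir:
    saved = errno;
    close(dirfd);
    errno = saved;
    return rc;
}
```

Because the inode had no name before the linkat(), there is exactly
one directory entry to worry about, which is what makes this case
tractable.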

But.... I can already imagine someone complaining that if, due to bind
mounts and 1000 mount namespaces, there is some *other* directory
pathname which could be used to reach said "tmpfile", then all parent
directories along that path, even if they span a dozen different file
systems, *also* have to be persisted, thanks to sloppy drafting of
what the atomicity rules happen to be.

If we are only guaranteeing the persistence of the containing
directories of the source and destination files, that's pretty easy.
But then the consistency rules need to *explicitly* state this.  Some
of the handwaving definitions of what would be guaranteed.... scare
me.
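
To illustrate the narrow version, here is a sketch of what "persist
only the containing directories of the source and destination" would
amount to if an application did it by hand today.  The function name
rename_persist_parents() is invented for this example, and nothing
here says anything about ancestors, other hard links, or other mounts.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: rename srcdir/srcname to dstdir/dstname, then
 * fsync exactly two directories, the destination's parent and the
 * source's parent.  Returns 0, or -1 with errno set.
 */
static int rename_persist_parents(const char *srcdir, const char *srcname,
                                  const char *dstdir, const char *dstname)
{
    int rc = -1, saved;

    int srcfd = open(srcdir, O_RDONLY | O_DIRECTORY);
    if (srcfd < 0)
        return -1;
    int dstfd = open(dstdir, O_RDONLY | O_DIRECTORY);
    if (dstfd < 0)
        goto out_src;

    if (renameat(srcfd, srcname, dstfd, dstname) == 0 &&
        fsync(dstfd) == 0 &&    /* new name is on disk */
        fsync(srcfd) == 0)      /* old name's removal is on disk */
        rc = 0;

    saved = errno;
    close(dstfd);
    errno = saved;
out_src:
    saved = errno;
    close(srcfd);
    errno = saved;
    return rc;
}
```

The point of writing it out is that the scope is visible in the code:
two directory fds, two fsync() calls, and nothing else.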

						- Ted

P.S.  If we were going to do this, we'd probably want to simply define
a flag to be AT_FSYNC, using the strict POSIX definition of fsync,
which is to say, as a result of the linkat or renameat, the file in
question, and its associated metadata, are guaranteed to be persisted
to disk.  No guarantees would be made about any other inode's metadata,
regardless of when those changes were made.
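
As a rough userspace approximation of what such a flag would fold into
one call: AT_FSYNC does not exist, so the sketch below just does a
renameat() followed by an fsync() of the renamed file.  The helper name
renameat_then_fsync() is made up, and whether fsync() of the file also
covers its new directory entry is file system dependent, which is
exactly the sort of thing a written contract would have to pin down.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for renameat() with an AT_FSYNC-style flag
 * (no such flag exists): rename, then fsync only the renamed file.
 * No other inode is touched.  Returns 0, or -1 with errno set.
 */
static int renameat_then_fsync(int olddirfd, const char *oldname,
                               int newdirfd, const char *newname)
{
    if (renameat(olddirfd, oldname, newdirfd, newname) < 0)
        return -1;

    int fd = openat(newdirfd, newname, O_RDONLY);
    if (fd < 0)
        return -1;

    int rc = fsync(fd);         /* this file and its metadata only */
    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```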

If people really want "global barrier" semantics, then perhaps it
would be better to simply define a barrierfs(2) system call that works
like syncfs(2) --- it applies to the whole file system, and guarantees
that all changes made *before* barrierfs(2) will be visible if any
changes made *after* barrierfs(2) are visible.  Amir, you used "global
ordering" a few times; if you really need that, let's define a new
system call which guarantees that.  Maybe some of the research
proposals for exotic changes to SSD semantics, etc., would allow
barrierfs(2) semantics to be something that we could implement more
efficiently than syncfs(2).  But let's make this be explicit, as
opposed to some magic guarantee that falls out as a side effect of the
fsync(2) system call to a single inode.
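
To make the intended usage concrete, here is a sketch of how an
application might use such a barrier.  barrierfs(2) does not exist, so
the example substitutes syncfs(2), which is stronger (and slower) than
needed: it makes the first write durable outright, rather than merely
ordering it before the second.  The names put_file() and
write_ordered() are invented for the example.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create or truncate `name` under dirfd and write `text` into it. */
static int put_file(int dirfd, const char *name, const char *text)
{
    int fd = openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    int rc = (n == (ssize_t)len) ? 0 : -1;
    if (n >= 0 && rc < 0)
        errno = EIO;            /* treat a short write as a failure */

    int saved = errno;
    close(fd);
    errno = saved;
    return rc;
}

/*
 * Write `first`, issue a file-system-wide barrier, then write `second`.
 * Intended property: after a crash, if `second` is visible, `first` is
 * too.  syncfs() is used here only as a stand-in for the hypothetical
 * barrierfs().
 */
static int write_ordered(int dirfd, const char *first, const char *second)
{
    if (put_file(dirfd, first, "payload\n") < 0)
        return -1;
    if (syncfs(dirfd) < 0)      /* stand-in for barrierfs(dirfd) */
        return -1;
    return put_file(dirfd, second, "commit\n");
}
```

Note that the ordering request is explicit at the call site; nothing
depends on what an fsync() of some unrelated inode happens to drag
along with it.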