Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

From: Amir Goldstein <amir73il@gmail.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Fri, 3 May 2019 10:18:11 -0400
Message-ID: <CAOQ4uxi5AGnXRY7CbdbAwz2OJiXYxTo5NQqaFGSqw23ihSyK1g@mail.gmail.com>
In-Reply-To: <20190503095846.GE23724@mit.edu>

On Fri, May 3, 2019 at 5:59 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> > OK. we can leave that one for later.
> > Although I am not sure what the concern is.
> > If we are able to agree  and document a LINK_ATOMIC flag,
> > what would be the down side of documenting a RENAME_ATOMIC
> > flag with same semantics? After all, as I said, this is what many users
> > already expect when renaming a temp file (as ext4 heuristics prove).
>
> The problem is if the "temp file" has been hardlinked to 1000
> different directories, does the rename() have to guarantee that we
> have to make sure that the changes to all 1000 directories have been
> persisted to disk?  And all of the parent directories of those 1000
> directories have also *all* been persisted to disk, all the way up to
> the root?
>
> With the O_TMPFILE linkat case, we know that inode hasn't been
> hard-linked to any other directory, and mercifully directories have
> only one parent directory, so we only have to check one set of
> directory inodes all the way up to the root having been persisted.
>
> But.... I can already imagine someone complaining that if due to bind
> mounts and 1000 mount namespaces, there is some *other* directory
> pathname which could be used to reach said "tmpfile", we have to
> guarantee that all parent directories which could be used to reach
> said "tmpfile" even if they span a dozen different file systems,
> *also* have to be persisted due to sloppy drafting of what the
> atomicity rules might happen to be.
>
> If we are only guaranteeing the persistence of the containing
> directories of the source and destination files, that's pretty easy.
> But then the consistency rules need to *explicitly* state this.  Some
> of the handwaving definitions of what would be guaranteed.... scare
> me.
>

I see. So the issue is with the language:
"metadata modifications made to the file before being linked"
which may be interpreted to mean that hardlinking a file is itself
a modification to the file. I can't resist the pun:
"nlink doesn't count".

Tough one. We can add language that explicitly excludes such cases,
but that is not going to help the goal of a simple documented API.

OK, I'll withdraw RENAME_ATOMIC for now and concede to
having LINK_ATOMIC fail when trying to link a file with nlink > 0.
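
To make that restriction concrete, here is a rough userspace sketch of
how it could be emulated today. LINK_ATOMIC does not exist, the helper
name link_unlinked_file_sketch() and the EINVAL error code are my own
assumptions, and the fsync() is only a stand-in for the ordering that
the flag would provide. The /proc/self/fd path is the method open(2)
describes for giving an O_TMPFILE file a name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: give an unlinked file (e.g. one opened with
 * O_TMPFILE) its first name, refusing files that already have links.
 * This only emulates the proposed restriction in userspace.
 */
int link_unlinked_file_sketch(int fd, const char *newpath)
{
	struct stat st;
	char procpath[64];

	if (fstat(fd, &st) < 0)
		return -1;

	/* The restriction discussed above: fail when nlink > 0. */
	if (st.st_nlink > 0) {
		errno = EINVAL;	/* assumed error code, not a documented one */
		return -1;
	}

	/* Stand-in for the ordering guarantee: persist the file first. */
	if (fsync(fd) < 0)
		return -1;

	/* Link via /proc/self/fd, as described in open(2) for O_TMPFILE. */
	snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
	return linkat(AT_FDCWD, procpath, AT_FDCWD, newpath, AT_SYMLINK_FOLLOW);
}
```

With nlink == 0 there is exactly one directory entry being created, so
only one chain of parent directories is involved.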

How about if I implement RENAME_ATOMIC for in-kernel users
only at this point in time?

Overlayfs needs it for correctness of directory copy up operation.

>
> P.S.  If we were going to do this, we'd probably want to simply define
> a flag to be AT_FSYNC, using the strict POSIX definition of fsync,
> which is to say, as a result of the linkat or renameat, the file in
> question, and its associated metadata, are guaranteed to be persisted
> to disk.  No other guarantees about any other inode's metadata
> regardless of when they might be made, would be guaranteed.
>

I agree that may be useful. Not to my use case though.

> If people really want "global barrier" semantics, then perhaps it
> would be better to simply define a barrierfs(2) system call that works
> like syncfs(2) --- it applies to the whole file system, and guarantees
> that all changes made *before* barrierfs(2) will be visible if any
> changes made *after* barrierfs(2) are visible.  Amir, you used "global
> ordering" a few times; if you really need that, let's define a new
> system call which guarantees that.  Maybe some of the research
> proposals for exotic changes to SSD semantics, etc., would allow
> barrierfs(2) semantics to be something that we could implement more
> efficiently than syncfs(2).  But let's make this be explicit, as
> opposed to some magic guarantee that falls out as a side effect of the
> fsync(2) system call to a single inode.

Yes, maybe. For xfs/ext4.
Not sure about btrfs. It seems like fbarrier(2) would have been
more natural for the btrfs model (a file and all its dependencies).

I think barrierfs(2) would be useful, but I think it is harder to
explain to users.
See, barrierfs() should not flush all inode pages, since that would be
counterproductive, so what does it really mean to end users?
We would end up with the same problem as with the misunderstood
sync_file_range().

I would have been happy with this API:
sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT);
barrierfs(fd);
rename(...)/link(...)

Perhaps atomic_rename()/atomic_link() should be library functions
wrapping the lower level API to hide those details from end users.
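
A rough sketch of what such a wrapper might look like, under stated
assumptions: barrierfs() does not exist, so syncfs() is used here as a
much heavier stand-in, and the name atomic_rename_sketch() is made up
for illustration. The three sync_file_range() flags are spelled out
individually in case the headers do not define a combined name.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical library wrapper around the sequence above.
 * fd must refer to the file currently named by oldpath.
 */
int atomic_rename_sketch(int fd, const char *oldpath, const char *newpath)
{
	/* Start writeback of the file's dirty pages and wait for it. */
	if (sync_file_range(fd, 0, 0,
			    SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE |
			    SYNC_FILE_RANGE_WAIT_AFTER) < 0)
		return -1;

	/* Stand-in for the proposed barrierfs(fd); syncfs() does far more. */
	if (syncfs(fd) < 0)
		return -1;

	return rename(oldpath, newpath);
}
```

A real barrierfs() would only need to order the earlier changes before
the rename, not force everything to disk the way syncfs() does.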

Thanks,
Amir.