linux-fsdevel.vger.kernel.org archive mirror
From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
	Dave Chinner <david@fromorbit.com>, Theodore Tso <tytso@mit.edu>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Vijaychidambaram Velayudhan Pillai <vijay@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Thu, 2 May 2019 14:05:24 -0700
Message-ID: <20190502210524.GI5200@magnolia>
In-Reply-To: <CAOQ4uxgEicLTA4LtV2fpvx7okEEa=FtbYE7Qa_=JeVEGXz40kw@mail.gmail.com>

On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote:
> On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Suggestion for another filesystems track topic.
> >
> > Some of you may remember the emotional(?) discussions that ensued
> > when the crashmonkey developers embarked on a mission to document
> > and verify filesystem crash recovery guaranties:
> >
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
> >
> > There are two camps among filesystem developers and every camp
> > has good arguments for wanting to document existing behavior and for
> > not wanting to document anything beyond "use fsync if you want any guaranty".
> >
> > I would like to take a suggestion proposed by Jan on a related discussion:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
> >
> > and make a proposal that may be able to meet the concerns of
> > both camps.
> >
> > The proposal is to add new APIs which communicate
> > crash consistency requirements of the application to the filesystem.
> >
> > Example API could look like this:
> > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > It's just an example. The API could take another form and may need
> > more barrier types (I proposed to use new file_sync_range() flags).
> >
> > The idea is simple though.
> > METADATA_BARRIER means all the inode metadata will be observed
> > after crash if rename is observed after crash.
> > DATA_BARRIER same for file data.
> > We may also want an "ALL_METADATA_BARRIER" and/or
> > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > describe what SOMC guaranties actually provide today.
> >
> > The implementation is also simple. Filesystems that currently
> > have SOMC behavior don't need to do anything to respect
> > METADATA_BARRIER and only need to call
> > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > Filesystem developers are thus not tying their hands w.r.t. future
> > performance optimizations for operations that are not explicitly
> > requesting a barrier.
> >
> 
> An update: Following the LSF session on $SUBJECT I had a discussion
> with Ted, Jan and Chris.
> 
> We were all in agreement that linking an O_TMPFILE into the namespace
> is probably already perceived by users as the barrier/atomic operation that
> I am trying to describe.
> 
> So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> providing the required semantics when linking O_TMPFILE *as long* as
> the semantics are properly documented.
> 
> This is what open(2) man page has to say right now:
> 
>  *  Creating a file that is initially invisible, which is then populated
>     with data and adjusted to have appropriate filesystem attributes
>     (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically
>     linked into the filesystem in a fully formed state (using linkat(2)
>     as described above).
> 
> The phrase that I would like to add (probably in link(2) man page) is:
> "The filesystem provides the guaranty that after a crash, if the linked
>  O_TMPFILE is observed in the target directory, then all the data and

"if the linked O_TMPFILE is observed" ... meaning that if we can't
recover all the data+metadata information then it's ok to obliterate the
file?  Is the filesystem allowed to drop the tmpfile data if userspace
links the tmpfile into a directory but doesn't fsync the directory?
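
For reference, here is a minimal sketch of what userspace has to do today
to get that guarantee without relying on any implicit behavior: fsync the
tmpfile before linking it, then fsync the directory. The helper name
publish_tmpfile() and its arguments are hypothetical; the /proc/self/fd
linkat() idiom is the one described in open(2). This illustrates the
explicit pattern only, not the semantics proposed above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_fully(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create dir/name from an O_TMPFILE and make both
 * the file contents and the new directory entry durable with explicit
 * fsync calls. Returns 0 on success, -1 with errno set on failure.
 */
int publish_tmpfile(const char *dir, const char *name,
                    const void *buf, size_t len)
{
    char procpath[64];
    int ret = -1, saved;

    assert(dir && name && buf);

    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    /* Unnamed inode inside dir; invisible until linked. */
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        goto out_dir;

    if (write_fully(fd, buf, len) < 0)
        goto out_fd;

    /* Step 1: push file data and inode metadata to stable storage. */
    if (fsync(fd) < 0)
        goto out_fd;

    /* Step 2: give the inode a name (idiom from open(2)). */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, procpath, dirfd, name, AT_SYMLINK_FOLLOW) < 0)
        goto out_fd;

    /* Step 3: make the new directory entry durable. */
    if (fsync(dirfd) < 0)
        goto out_fd;

    ret = 0;
out_fd:
    saved = errno;
    close(fd);
    errno = saved;
out_dir:
    saved = errno;
    close(dirfd);
    errno = saved;
    return ret;
}
```

A caller would use it as publish_tmpfile("/some/dir", "config", data, len).
Dropping step 3 gives exactly the case asked about above: the file is
linked, but nothing has asked for the directory entry to be persisted.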

TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?)
user requirement would be "Until I say otherwise I want always to be
able to read <data> from this given string <pathname>."

(vs. regular Unix rename/link where we make you specify how much you
care about that by hitting us on the head with a file fsync and then a
directory fsync.)
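
As a sketch of that "file fsync and then a directory fsync" sequence for
the rename case, assuming the usual write-to-temporary-then-rename pattern:
the helper name replace_file_durably() and the ".<name>.tmp" temporary
naming are made up for illustration and do not come from any existing API.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int write_all(int fd, const char *p, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: replace dir/name with new contents so that after
 * a crash the name refers to either the complete old file or the
 * complete new one. Returns 0 on success, -1 with errno set on failure.
 */
int replace_file_durably(const char *dir, const char *name,
                         const void *buf, size_t len)
{
    char tmpname[256];
    int fd = -1, ret = -1, saved;

    assert(dir && name && buf);

    int n = snprintf(tmpname, sizeof(tmpname), ".%s.tmp", name);
    if (n < 0 || (size_t)n >= sizeof(tmpname)) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int dirfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
        return -1;

    fd = openat(dirfd, tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        goto out;

    if (write_all(fd, buf, len) < 0)
        goto out_unlink;

    /* File fsync: data and inode metadata reach stable storage. */
    if (fsync(fd) < 0)
        goto out_unlink;

    /* Atomically switch the name over to the new inode. */
    if (renameat(dirfd, tmpname, dirfd, name) < 0)
        goto out_unlink;

    /* Directory fsync: the rename itself becomes durable. */
    if (fsync(dirfd) < 0)
        goto out;

    ret = 0;
    goto out;

out_unlink:
    saved = errno;
    unlinkat(dirfd, tmpname, 0);
    errno = saved;
out:
    saved = errno;
    if (fd >= 0)
        close(fd);
    close(dirfd);
    errno = saved;
    return ret;
}
```

The two fsync calls are the part an atomic rename flag would make
implicit; everything else in the sequence stays the same.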

>  metadata modifications made to the file before being linked are also
>  observed."
> 
> For some filesystems, btrfs in particular, that would mean an implicit
> fsync on the linked inode. On other filesystems, ext4/xfs in particular,
> that would only require at least committing delayed allocations, but
> will NOT require inode fsync nor journal commit/flushing disk caches.

I don't think it does much good to commit delalloc blocks but not flush
dirty overwrites, and I don't think it makes a lot of sense to flush out
overwrite data without also pushing out the inode metadata too.

FWIW I'm ok with the "Here's an 'I'm really serious' flag" that carries
with it a full fsync, though how do we sell developers on using it?

> I would like to hear the opinion of XFS developers and filesystem
> maintainers who did not attend the LSF session.

I miss you all too.  Sorry I couldn't make it this year. :(

> I have no objection to adding an opt-in LINK_ATOMIC flag
> and passing it down to filesystems instead of changing behavior and
> patching stable kernels, but I prefer the latter.
> 
> I believe this should have been the semantics to begin with,
> if for no other reason than that users would expect it regardless
> of whatever we write in the manual page and no matter how many
> !!!!!!!! we use for disclaimers.
> 
> And if we can all agree on that, then O_TMPFILE is quite young
> from a historical perspective, so it is not too late to call the
> expectation gap a bug and fix it.(?)

Why would linking an O_TMPFILE be a special case as opposed to making
hard links in general?  If you hardlink a dirty file then surely you'd
also want to be able to read the data from the new location?

> Taking this another step forward, if we agree on the language
> I used above to describe the expected behavior, then we can
> add an opt-in RENAME_ATOMIC flag to provide the same
> semantics and document it in the same manner (this functionality
> is needed for directories and non-regular files) and all that is left
> is the fun part of choosing the flag name ;-)

Will have to think about /that/ some more.

--D

> 
> Thanks,
> Amir.
