linux-fsdevel.vger.kernel.org archive mirror
* [TOPIC] Extending the filesystem crash recovery guaranties contract
@ 2019-04-27 21:00 Amir Goldstein
  2019-05-02 16:12 ` Amir Goldstein
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2019-04-27 21:00 UTC (permalink / raw)
  To: lsf-pc
  Cc: Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel,
	Jayashree Mohan, Vijaychidambaram Velayudhan Pillai,
	Filipe Manana

Suggestion for another filesystems track topic.

Some of you may remember the emotional(?) discussions that ensued
when the crashmonkey developers embarked on a mission to document
and verify filesystem crash recovery guarantees:

https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/

There are two camps among filesystem developers. Each camp has good
arguments: one for wanting to document existing behavior, the other for
not wanting to document anything beyond "use fsync if you want any guarantee".

I would like to take a suggestion Jan made in a related discussion:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/

and make a proposal that may be able to meet the concerns of
both camps.

The proposal is to add new APIs which communicate
crash consistency requirements of the application to the filesystem.

An example API could look like this:
renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
It's just an example. The API could take another form and may need
more barrier types (I proposed using new sync_file_range() flags).

The idea is simple though.
METADATA_BARRIER means that all of the inode's metadata will be observed
after a crash if the rename is observed after the crash.
DATA_BARRIER means the same for the file's data.
We may also want an "ALL_METADATA_BARRIER" and/or
"METADATA_DEPENDENCY_BARRIER" to more accurately
describe what SOMC guarantees actually provide today.

The implementation is also simple. Filesystems that currently
have SOMC behavior don't need to do anything to respect
METADATA_BARRIER and only need to call
filemap_write_and_wait_range() to respect DATA_BARRIER.
Filesystem developers are thus not tying their hands w.r.t. future
performance optimizations for operations that are not explicitly
requesting a barrier.
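
A rough sketch of how an application might use such flags (illustrative
only; the RENAME_*_BARRIER flags are the proposal above, not an existing
kernel API, and error handling is omitted):

    int fd = open("dir/newfile.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);

    write(fd, "new contents\n", 13);    /* populate the new file */
    close(fd);

    /*
     * Hypothetical flags: ask the filesystem to order the rename after
     * the file's data and metadata, without requesting durability
     * (no fsync, no journal flush).
     */
    renameat2(AT_FDCWD, "dir/newfile.tmp", AT_FDCWD, "dir/newfile",
              RENAME_DATA_BARRIER | RENAME_METADATA_BARRIER);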

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein
@ 2019-05-02 16:12 ` Amir Goldstein
  2019-05-02 17:11   ` Vijay Chidambaram
  2019-05-02 21:05   ` Darrick J. Wong
  0 siblings, 2 replies; 25+ messages in thread
From: Amir Goldstein @ 2019-05-02 16:12 UTC (permalink / raw)
  To: lsf-pc, Dave Chinner, Darrick J. Wong
  Cc: Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan,
	Vijaychidambaram Velayudhan Pillai, Filipe Manana, Chris Mason,
	lwn

On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> Suggestion for another filesystems track topic.
>
> Some of you may remember the emotional(?) discussions that ensued
> when the crashmonkey developers embarked on a mission to document
> and verify filesystem crash recovery guaranties:
>
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
>
> There are two camps among filesystem developers and every camp
> has good arguments for wanting to document existing behavior and for
> not wanting to document anything beyond "use fsync if you want any guaranty".
>
> I would like to take a suggestion proposed by Jan on a related discussion:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
>
> and make a proposal that may be able to meet the concerns of
> both camps.
>
> The proposal is to add new APIs which communicate
> crash consistency requirements of the application to the filesystem.
>
> Example API could look like this:
> renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> It's just an example. The API could take another form and may need
> more barrier types (I proposed to use new file_sync_range() flags).
>
> The idea is simple though.
> METADATA_BARRIER means all the inode metadata will be observed
> after crash if rename is observed after crash.
> DATA_BARRIER same for file data.
> We may also want a "ALL_METADATA_BARRIER" and/or
> "METADATA_DEPENDENCY_BARRIER" to more accurately
> describe what SOMC guaranties actually provide today.
>
> The implementation is also simple. filesystem that currently
> have SOMC behavior don't need to do anything to respect
> METADATA_BARRIER and only need to call
> filemap_write_and_wait_range() to respect DATA_BARRIER.
> filesystem developers are thus not tying their hands w.r.t future
> performance optimizations for operations that are not explicitly
> requesting a barrier.
>

An update: Following the LSF session on $SUBJECT I had a discussion
with Ted, Jan and Chris.

We were all in agreement that linking an O_TMPFILE into the namespace
is probably already perceived by users as the barrier/atomic operation that
I am trying to describe.

So at least the maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
providing the required semantics when linking an O_TMPFILE, *as long as*
the semantics are properly documented.

This is what the open(2) man page has to say right now:

 *  Creating a file that is initially invisible, which is then populated
    with data and adjusted to have appropriate filesystem attributes
    (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically
    linked into the filesystem in a fully formed state (using linkat(2)
    as described above).
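
For reference, that pattern looks roughly like this (a sketch; error
handling is omitted and the paths are placeholders):

    int fd = open("/some/dir", O_TMPFILE | O_WRONLY, 0600);
    char procpath[64];

    write(fd, "fully formed contents\n", 22);    /* populate with data */
    fchmod(fd, 0644);                            /* adjust attributes */

    /* atomically link the fully formed file into the namespace */
    snprintf(procpath, sizeof(procpath), "/proc/self/fd/%d", fd);
    linkat(AT_FDCWD, procpath, AT_FDCWD, "/some/dir/file",
           AT_SYMLINK_FOLLOW);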

The phrase that I would like to add (probably in the link(2) man page) is:
"The filesystem provides the guarantee that after a crash, if the linked
 O_TMPFILE is observed in the target directory, then all the data and
 metadata modifications made to the file before being linked are also
 observed."

For some filesystems, btrfs in particular, that would mean an implicit
fsync on the linked inode. On other filesystems, ext4/xfs in particular,
that would only require committing delayed allocations, but
would NOT require an inode fsync nor a journal commit/flushing of disk caches.

I would like to hear the opinion of XFS developers and filesystem
maintainers who did not attend the LSF session.

I have no objection to adding an opt-in LINK_ATOMIC flag
and passing it down to filesystems instead of changing behavior and
patching stable kernels, but I prefer the latter (changing the behavior).

I believe this should have been the semantics to begin with,
if for no other reason than that users would expect it regardless
of whatever we write in the manual page and no matter how many
!!!!!!!! we use for disclaimers.

And if we can all agree on that, then O_TMPFILE is quite young
from a historical perspective, so it is not too late to call the expectation
gap a bug and fix it.(?)

Taking this another step forward, if we agree on the language
I used above to describe the expected behavior, then we can
add an opt-in RENAME_ATOMIC flag to provide the same
semantics and document it in the same manner (this functionality
is needed for directories and non-regular files) and all that is left
is the fun part of choosing the flag name ;-)

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-02 16:12 ` Amir Goldstein
@ 2019-05-02 17:11   ` Vijay Chidambaram
  2019-05-02 17:39     ` Amir Goldstein
  2019-05-02 21:05   ` Darrick J. Wong
  1 sibling, 1 reply; 25+ messages in thread
From: Vijay Chidambaram @ 2019-05-02 17:11 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, Dave Chinner, Darrick J. Wong, Theodore Tso, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

Thank you for driving this discussion, Amir. I'm glad ext4 and btrfs
developers want to provide these semantics.

If I'm understanding this correctly, the new semantics will be: any
data changes to files written with O_TMPFILE will be visible if the
associated metadata is also visible. Basically, there will be a
barrier between O_TMPFILE data and O_TMPFILE metadata.

The expectation is that applications will use this, and then rename
the O_TMPFILE file over the original file. Is this correct? If so, is
there also an implied barrier between O_TMPFILE metadata and the
rename?

Where does this land us on the discussion about documenting
file-system crash-recovery guarantees? Has that been deemed not
necessary?

Thanks,
Vijay Chidambaram
http://www.cs.utexas.edu/~vijay/

On Thu, May 2, 2019 at 11:12 AM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Suggestion for another filesystems track topic.
> >
> > Some of you may remember the emotional(?) discussions that ensued
> > when the crashmonkey developers embarked on a mission to document
> > and verify filesystem crash recovery guaranties:
> >
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
> >
> > There are two camps among filesystem developers and every camp
> > has good arguments for wanting to document existing behavior and for
> > not wanting to document anything beyond "use fsync if you want any guaranty".
> >
> > I would like to take a suggestion proposed by Jan on a related discussion:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
> >
> > and make a proposal that may be able to meet the concerns of
> > both camps.
> >
> > The proposal is to add new APIs which communicate
> > crash consistency requirements of the application to the filesystem.
> >
> > Example API could look like this:
> > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > It's just an example. The API could take another form and may need
> > more barrier types (I proposed to use new file_sync_range() flags).
> >
> > The idea is simple though.
> > METADATA_BARRIER means all the inode metadata will be observed
> > after crash if rename is observed after crash.
> > DATA_BARRIER same for file data.
> > We may also want a "ALL_METADATA_BARRIER" and/or
> > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > describe what SOMC guaranties actually provide today.
> >
> > The implementation is also simple. filesystem that currently
> > have SOMC behavior don't need to do anything to respect
> > METADATA_BARRIER and only need to call
> > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > filesystem developers are thus not tying their hands w.r.t future
> > performance optimizations for operations that are not explicitly
> > requesting a barrier.
> >
>
> An update: Following the LSF session on $SUBJECT I had a discussion
> with Ted, Jan and Chris.
>
> We were all in agreement that linking an O_TMPFILE into the namespace
> is probably already perceived by users as the barrier/atomic operation that
> I am trying to describe.
>
> So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> providing the required semantics when linking O_TMPFILE *as long* as
> the semantics are properly documented.
>
> This is what open(2) man page has to say right now:
>
>  *  Creating a file that is initially invisible, which is then
> populated with data
>     and adjusted to have  appropriate  filesystem  attributes  (fchown(2),
>     fchmod(2), fsetxattr(2), etc.)  before being atomically linked into the
>     filesystem in a fully formed state (using linkat(2) as described above).
>
> The phrase that I would like to add (probably in link(2) man page) is:
> "The filesystem provided the guaranty that after a crash, if the linked
>  O_TMPFILE is observed in the target directory, than all the data and
>  metadata modifications made to the file before being linked are also
>  observed."
>
> For some filesystems, btrfs in farticular, that would mean an implicit
> fsync on the linked inode. On other filesystems, ext4/xfs in particular
> that would only require at least committing delayed allocations, but
> will NOT require inode fsync nor journal commit/flushing disk caches.
>
> I would like to hear the opinion of XFS developers and filesystem
> maintainers who did not attend the LSF session.
>
> I have no objection to adding an opt-in LINK_ATOMIC flag
> and pass it down to filesystems instead of changing behavior and
> patching stable kernels, but I prefer the latter.
>
> I believe this should have been the semantics to begin with
> if for no other reason, because users would expect it regardless
> of whatever we write in manual page and no matter how many
> !!!!!!!! we use for disclaimers.
>
> And if we can all agree on that, then O_TMPFILE is quite young
> in historic perspective, so not too late to call the expectation gap
> a bug and fix it.(?)
>
> Taking this another step forward, if we agree on the language
> I used above to describe the expected behavior, then we can
> add an opt-in RENAME_ATOMIC flag to provide the same
> semantics and document it in the same manner (this functionality
> is needed for directories and non regular files) and all there is left
> is the fun part of choosing the flag name ;-)
>
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-02 17:11   ` Vijay Chidambaram
@ 2019-05-02 17:39     ` Amir Goldstein
  2019-05-03  2:30       ` Theodore Ts'o
  0 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2019-05-02 17:39 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: lsf-pc, Dave Chinner, Darrick J. Wong, Theodore Tso, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

On Thu, May 2, 2019 at 1:11 PM Vijay Chidambaram <vijay@cs.utexas.edu> wrote:
>
> Thank you for driving this discussion Amir. I'm glad ext4 and btrfs
> developers want to provide these semantics.
>
> If I'm understanding this correctly, the new semantics will be: any
> data changes to files written with O_TMPFILE will be visible if the
> associated metadata is also visible. Basically, there will be a
> barrier between O_TMPFILE data and O_TMPFILE metadata.

Mmm, this phrasing deviates from what I wrote.
The agreement is that we should document something *minimal*
that users can understand. I was hoping that this phrasing meets
those requirements:

""The filesystem provided the guaranty that after a crash, if the linked
 O_TMPFILE is observed in the target directory, than all the data and
 metadata modifications made to the file before being linked are also
 observed."

No more, no less.

>
> The expectation is that applications will use this, and then rename
> the O_TMPFILE file over the original file. Is this correct? If so, is
> there also an implied barrier between O_TMPFILE metadata and the
> rename?

Not really; the use case is when users want a file to appear
"atomically" in the namespace with certain data and metadata.

For replacing an existing file with another, the same could be
achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname,
RENAME_ATOMIC). There is no need to create the tempname
file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC
flag would be redundant.

The RENAME_ATOMIC flag is needed because directories and non-regular
files cannot be created using O_TMPFILE.
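
For example, the replace-an-existing-file case described above might look
like this (a sketch only; RENAME_ATOMIC is the proposed flag, not an
existing one, and the paths are placeholders):

    /* "dir/tempname" was created, written and closed beforehand */
    renameat2(AT_FDCWD, "dir/tempname", AT_FDCWD, "dir/newname",
              RENAME_ATOMIC);

    /*
     * Proposed semantics: if "dir/newname" is observed after a crash,
     * then the data and metadata written to the temp file before the
     * rename are also observed.  Durability is not implied.
     */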

>
> Where does this land us on the discussion about documenting
> file-system crash-recovery guarantees? Has that been deemed not
> necessary?
>

Can't say for sure.
Some filesystem maintainers hold to the opinion that they do
NOT wish to have a document describing the existing behavior of specific
filesystems, which is a large part of the document that your group posted.

They would rather that only the guarantees of the APIs are documented,
and those should already be documented in man pages anyway - if they
are not, the man pages could be improved.

I am not saying there is no room for a document that elaborates on those
guarantees. I personally think that could be useful, and I certainly think that
your group's work on adding xfstests coverage for API guarantees is useful.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-02 16:12 ` Amir Goldstein
  2019-05-02 17:11   ` Vijay Chidambaram
@ 2019-05-02 21:05   ` Darrick J. Wong
  2019-05-02 22:19     ` Amir Goldstein
  1 sibling, 1 reply; 25+ messages in thread
From: Darrick J. Wong @ 2019-05-02 21:05 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: lsf-pc, Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel,
	Jayashree Mohan, Vijaychidambaram Velayudhan Pillai,
	Filipe Manana, Chris Mason, lwn

On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote:
> On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Suggestion for another filesystems track topic.
> >
> > Some of you may remember the emotional(?) discussions that ensued
> > when the crashmonkey developers embarked on a mission to document
> > and verify filesystem crash recovery guaranties:
> >
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
> >
> > There are two camps among filesystem developers and every camp
> > has good arguments for wanting to document existing behavior and for
> > not wanting to document anything beyond "use fsync if you want any guaranty".
> >
> > I would like to take a suggestion proposed by Jan on a related discussion:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
> >
> > and make a proposal that may be able to meet the concerns of
> > both camps.
> >
> > The proposal is to add new APIs which communicate
> > crash consistency requirements of the application to the filesystem.
> >
> > Example API could look like this:
> > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > It's just an example. The API could take another form and may need
> > more barrier types (I proposed to use new file_sync_range() flags).
> >
> > The idea is simple though.
> > METADATA_BARRIER means all the inode metadata will be observed
> > after crash if rename is observed after crash.
> > DATA_BARRIER same for file data.
> > We may also want a "ALL_METADATA_BARRIER" and/or
> > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > describe what SOMC guaranties actually provide today.
> >
> > The implementation is also simple. filesystem that currently
> > have SOMC behavior don't need to do anything to respect
> > METADATA_BARRIER and only need to call
> > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > filesystem developers are thus not tying their hands w.r.t future
> > performance optimizations for operations that are not explicitly
> > requesting a barrier.
> >
> 
> An update: Following the LSF session on $SUBJECT I had a discussion
> with Ted, Jan and Chris.
> 
> We were all in agreement that linking an O_TMPFILE into the namespace
> is probably already perceived by users as the barrier/atomic operation that
> I am trying to describe.
> 
> So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> providing the required semantics when linking O_TMPFILE *as long* as
> the semantics are properly documented.
> 
> This is what open(2) man page has to say right now:
> 
>  *  Creating a file that is initially invisible, which is then
> populated with data
>     and adjusted to have  appropriate  filesystem  attributes  (fchown(2),
>     fchmod(2), fsetxattr(2), etc.)  before being atomically linked into the
>     filesystem in a fully formed state (using linkat(2) as described above).
> 
> The phrase that I would like to add (probably in link(2) man page) is:
> "The filesystem provided the guaranty that after a crash, if the linked
>  O_TMPFILE is observed in the target directory, than all the data and

"if the linked O_TMPFILE is observed" ... meaning that if we can't
recover all the data+metadata information then it's ok to obliterate the
file?  Is the filesystem allowed to drop the tmpfile data if userspace
links the tmpfile into a directory but doesn't fsync the directory?

TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?)
user requirement would be "Until I say otherwise I want always to be
able to read <data> from this given string <pathname>."

(vs. regular Unix rename/link where we make you specify how much you
care about that by hitting us on the head with a file fsync and then a
directory fsync.)

>  metadata modifications made to the file before being linked are also
>  observed."
> 
> For some filesystems, btrfs in farticular, that would mean an implicit
> fsync on the linked inode. On other filesystems, ext4/xfs in particular
> that would only require at least committing delayed allocations, but
> will NOT require inode fsync nor journal commit/flushing disk caches.

I don't think it does much good to commit delalloc blocks but not flush
dirty overwrites, and I don't think it makes a lot of sense to flush out
overwrite data without also pushing out the inode metadata too.

FWIW I'm ok with the "Here's an 'I'm really serious' flag" that carries
with it a full fsync, though how to sell developers on using it?

> I would like to hear the opinion of XFS developers and filesystem
> maintainers who did not attend the LSF session.

I miss you all too.  Sorry I couldn't make it this year. :(

> I have no objection to adding an opt-in LINK_ATOMIC flag
> and pass it down to filesystems instead of changing behavior and
> patching stable kernels, but I prefer the latter.
> 
> I believe this should have been the semantics to begin with
> if for no other reason, because users would expect it regardless
> of whatever we write in manual page and no matter how many
> !!!!!!!! we use for disclaimers.
> 
> And if we can all agree on that, then O_TMPFILE is quite young
> in historic perspective, so not too late to call the expectation gap
> a bug and fix it.(?)

Why would linking an O_TMPFILE be a special case as opposed to making
hard links in general?  If you hardlink a dirty file then surely you'd
also want to be able to read the data from the new location?

> Taking this another step forward, if we agree on the language
> I used above to describe the expected behavior, then we can
> add an opt-in RENAME_ATOMIC flag to provide the same
> semantics and document it in the same manner (this functionality
> is needed for directories and non regular files) and all there is left
> is the fun part of choosing the flag name ;-)

Will have to think about /that/ some more.

--D

> 
> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-02 21:05   ` Darrick J. Wong
@ 2019-05-02 22:19     ` Amir Goldstein
  0 siblings, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2019-05-02 22:19 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: lsf-pc, Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel,
	Jayashree Mohan, Vijaychidambaram Velayudhan Pillai,
	Filipe Manana, Chris Mason, lwn

On Thu, May 2, 2019 at 5:05 PM Darrick J. Wong <darrick.wong@oracle.com> wrote:
>
> On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote:
> > On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
> > >
> > > Suggestion for another filesystems track topic.
> > >
> > > Some of you may remember the emotional(?) discussions that ensued
> > > when the crashmonkey developers embarked on a mission to document
> > > and verify filesystem crash recovery guaranties:
> > >
> > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
> > >
> > > There are two camps among filesystem developers and every camp
> > > has good arguments for wanting to document existing behavior and for
> > > not wanting to document anything beyond "use fsync if you want any guaranty".
> > >
> > > I would like to take a suggestion proposed by Jan on a related discussion:
> > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
> > >
> > > and make a proposal that may be able to meet the concerns of
> > > both camps.
> > >
> > > The proposal is to add new APIs which communicate
> > > crash consistency requirements of the application to the filesystem.
> > >
> > > Example API could look like this:
> > > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > > It's just an example. The API could take another form and may need
> > > more barrier types (I proposed to use new file_sync_range() flags).
> > >
> > > The idea is simple though.
> > > METADATA_BARRIER means all the inode metadata will be observed
> > > after crash if rename is observed after crash.
> > > DATA_BARRIER same for file data.
> > > We may also want a "ALL_METADATA_BARRIER" and/or
> > > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > > describe what SOMC guaranties actually provide today.
> > >
> > > The implementation is also simple. filesystem that currently
> > > have SOMC behavior don't need to do anything to respect
> > > METADATA_BARRIER and only need to call
> > > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > > filesystem developers are thus not tying their hands w.r.t future
> > > performance optimizations for operations that are not explicitly
> > > requesting a barrier.
> > >
> >
> > An update: Following the LSF session on $SUBJECT I had a discussion
> > with Ted, Jan and Chris.
> >
> > We were all in agreement that linking an O_TMPFILE into the namespace
> > is probably already perceived by users as the barrier/atomic operation that
> > I am trying to describe.
> >
> > So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> > providing the required semantics when linking O_TMPFILE *as long* as
> > the semantics are properly documented.
> >
> > This is what open(2) man page has to say right now:
> >
> >  *  Creating a file that is initially invisible, which is then
> > populated with data
> >     and adjusted to have  appropriate  filesystem  attributes  (fchown(2),
> >     fchmod(2), fsetxattr(2), etc.)  before being atomically linked into the
> >     filesystem in a fully formed state (using linkat(2) as described above).
> >
> > The phrase that I would like to add (probably in link(2) man page) is:
> > "The filesystem provided the guaranty that after a crash, if the linked
> >  O_TMPFILE is observed in the target directory, than all the data and
>
> "if the linked O_TMPFILE is observed" ... meaning that if we can't
> recover all the data+metadata information then it's ok to obliterate the
> file?  Is the filesystem allowed to drop the tmpfile data if userspace
> links the tmpfile into a directory but doesn't fsync the directory?
>

Yes! Yes! Definitely allowed!

Linking an O_TMPFILE has a single possible use case -
an "atomic" creation of a fully baked file.

I am trying hard to explain that for my use case, durability
is not a requirement of the "atomic" creation; the requirement is
the "if the linked O_TMPFILE is observed" semantics.

> TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?)
> user requirement would be "Until I say otherwise I want always to be
> able to read <data> from this given string <pathname>."
>

Sadly, it is hard for me to explain the difference even to filesystem
developers, so what hope is there with mortal users? But what can I do -
the kernel has an interface for durability (several of them, in fact), but no
interface for what I need (ordering), so I must introduce one.

The good news, and the key argument in my sales pitch, is that some
users already have expectations about rename/link that are neither
documented nor correct, so hopefully, if we add and document those flags,
the situation cannot get worse.

> (vs. regular Unix rename/link where we make you specify how much you
> care about that by hitting us on the head with a file fsync and then a
> directory fsync.)

OK. Perhaps a solution to this human interface issue is introducing
a pair of flags, LINK_SYNC and LINK_ATOMIC.
I did not think that the former was needed, but maybe it is needed simply
as a way to document what LINK_ATOMIC is *not*, e.g.:

"LINK_SYNC
 If the operation succeeds, the filesystem provides the guaranty that
 after a crash, the linked O_TMPFILE will be observed in the target
 directory and that all the data and metadata modifications made to
 the file before being linked are also observed."

LINK_ATOMIC
 If the operation succeeds, the filesystem provides the guaranty that
 after a crash, if the linked O_TMPFILE is observed in the target
 directory, then all the data and metadata modifications made to the
 file before being linked are also observed.
 LINK_ATOMIC is often cheaper than LINK_SYNC, because it does
 not require flushing volatile disk write caches, but it does not provide
 the guaranty that the file will be observed in the target directory after
 crash."

My intuition about this is "less is better", so I prefer not to add two flags.

>
> >  metadata modifications made to the file before being linked are also
> >  observed."
> >
> > For some filesystems, btrfs in farticular, that would mean an implicit
> > fsync on the linked inode. On other filesystems, ext4/xfs in particular
> > that would only require at least committing delayed allocations, but
> > will NOT require inode fsync nor journal commit/flushing disk caches.
>
> I don't think it does much good to commit delalloc blocks but not flush
> dirty overwrites, and I don't think it makes a lot of sense to flush out
> overwrite data without also pushing out the inode metadata too.

My intention was that this flag would trigger filemap_write_and_wait_range()
on ext4/xfs, which is the equivalent of what my application does today to get
the desired result. From there on, we can rely on "strictly ordered metadata
consistency" (SOMC) to provide what the interface needs.
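
For illustration, a minimal sketch of what that could look like in an
SOMC filesystem's link path (the LINK_ATOMIC flag name and the hook
placement are assumptions; filemap_write_and_wait_range() is the existing
helper mentioned above):

    /*
     * Hypothetical: in the filesystem's link path, before starting the
     * transaction that adds the new directory entry.
     */
    if (flags & LINK_ATOMIC) {
            /* write back dirty pages so delayed allocations are mapped */
            int err = filemap_write_and_wait_range(inode->i_mapping,
                                                   0, LLONG_MAX);
            if (err)
                    return err;
    }
    /* the journal's existing SOMC ordering handles the rest */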

>
> FWIW I'm ok with the "Here's a 'I'm really serious' flag that carries
> with it a full fsync, though how to sell developers on using it?
>

I am an application developer and I have no need for such a flag.
I need a different flag, which is why I started this discussion...
But also, this is why my preference is to NOT add a LINK_ATOMIC
flag at all, just assume that users cannot possibly think it is a good
outcome to observe a half-baked linked O_TMPFILE after a crash,
and give users what they want.


> > I would like to hear the opinion of XFS developers and filesystem
> > maintainers who did not attend the LSF session.
>
> I miss you all too.  Sorry I couldn't make it this year. :(
>
> > I have no objection to adding an opt-in LINK_ATOMIC flag
> > and pass it down to filesystems instead of changing behavior and
> > patching stable kernels, but I prefer the latter.
> >
> > I believe this should have been the semantics to begin with
> > if for no other reason, because users would expect it regardless
> > of whatever we write in manual page and no matter how many
> > !!!!!!!! we use for disclaimers.
> >
> > And if we can all agree on that, then O_TMPFILE is quite young
> > in historic perspective, so not too late to call the expectation gap
> > a bug and fix it.(?)
>
> Why would linking an O_TMPFILE be a special case as opposed to making
> hard links in general?  If you hardlink a dirty file then surely you'd
> also want to be able to read the data from the new location?
>

Because of the use case that O_TMPFILE implies, whatever users
do before the file is linked is expected to be private and unexposed to
others. You cannot say the same about making modifications to an already
linked file. I don't mind adding LINK_ATOMIC, and then it would obviously be
respected also when linking a file with nlink > 0.


> > Taking this another step forward, if we agree on the language
> > I used above to describe the expected behavior, then we can
> > add an opt-in RENAME_ATOMIC flag to provide the same
> > semantics and document it in the same manner (this functionality
> > is needed for directories and non regular files) and all there is left
> > is the fun part of choosing the flag name ;-)
>
> Will have to think about /that/ some more.
>

For your amusement, here are some suggestions that I had
that folks here did not like:
RENAME_BARRIER
RENAME_ORDERED

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-02 17:39     ` Amir Goldstein
@ 2019-05-03  2:30       ` Theodore Ts'o
  2019-05-03  3:15         ` Vijay Chidambaram
                           ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-03  2:30 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > The expectation is that applications will use this, and then rename
> > the O_TMPFILE file over the original file. Is this correct? If so, is
> > there also an implied barrier between O_TMPFILE metadata and the
> > rename?

In the case of O_TMPFILE, the file can be brought into the namespace
using something like:

linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW);

it's not using rename.

To be clear, this discussion happened in the hallway, and it's not
clear it had full support from everyone.  After our discussion, some of
us came up with an example where forcing a call to
filemap_write_and_wait() before the linkat(2) might *not* be the right
thing.  Suppose some browser wanted to wait until a file was fully
downloaded before letting it appear in the directory --- but what was
being downloaded was a 4 GiB DVD image (say, a distribution's install
media).  If the download was done using O_TMPFILE followed by
linkat(2), that might be a case where forcing the data blocks to disk
before allowing the linkat(2) to proceed might not be what the
application or the user would want.

So it might be that we will need to add a linkat flag to indicate that
we want the kernel to call filemap_write_and_wait() before making the
metadata changes in linkat(2).

> For replacing an existing file with another the same could be
> achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname,
> RENAME_ATOMIC). There is no need to create the tempname
> file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC
> flag would be redundant.
> 
> RENAME_ATOMIC flag is needed because directories and non regular
> files cannot be created using O_TMPFILE.

I think there's much less consensus about this.  Again, most of this
happened in a hallway conversation.

> > Where does this land us on the discussion about documenting
> > file-system crash-recovery guarantees? Has that been deemed not
> > necessary?
> 
> Can't say for sure.
> Some filesystem maintainers hold on to the opinion that they do
> NOT wish to have a document describing existing behavior of specific
> filesystems, which is large parts of the document that your group posted.
> 
> They would rather that only the guaranties of the APIs are documented
> and those should already be documented in man pages anyway - if they
> are not, man pages could be improved.
> 
> I am not saying there is no room for a document that elaborates on those
> guaranties. I personally think that could be useful and certainly think that
> your group's work for adding xfstest coverage for API guaranties is useful.

Again, here is my concern.  If we promise that ext4 will always obey
Dave Chinner's SOMC model, it would forever rule out Daejun Park and
Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
latency of fsync system call"[1] published in Usenix ATC 2017.

[1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's because this provides a fast fsync() using an incremental
journal.  This fast fsync would cause the metadata associated with the
inode being fsync'ed to be persisted after the crash --- ahead of
metadata changes to other, potentially completely unrelated files,
which would *not* be persisted after the crash.  Fine-grained
journaling would provide all of the guarantees POSIX requires, and
applications that only care about the single file being fsync'ed
would be happy.  BUT it violates the proposed crash
consistency guarantees.

So if the crash consistency guarantees forbid future innovations
where applications might *want* a fast fsync() that doesn't drag
unrelated inodes into the persistence guarantees, is that really what
we want?  Do we want to forever rule out various academic
investigations such as Park and Shin's because "it violates the crash
consistency recovery model"?  Especially if some applications don't
*need* the crash consistency model?

						- Ted

P.S.  I feel especially strongly about this because I'm working with an
engineer currently trying to implement a simplified version of Park
and Shin's proposal...  So this is not a hypothetical concern of mine.
I'd much rather not invalidate all of this engineer's work to date,
especially since there is a published paper demonstrating that for
some workloads (such as sqlite), this approach can be a big win.

P.P.S.  One of the other discussions that did happen during the main
LSF/MM File system session, and for which there was general agreement
across a number of major file system maintainers, was a fsync2()
system call which would take a list of file descriptors (and flags)
that should be fsync'ed.  The semantics would be that when the
fsync2() successfully returns, all of the guarantees of fsync() or
fdatasync() requested by the list of file descriptors and flags would
be satisfied.  This would allow file systems to more optimally fsync a
batch of files, for example by implementing data integrity writebacks
for all of the files, followed by a single journal commit to guarantee
persistence for all of the metadata changes.
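
A sketch of what such an interface could look like (fsync2() does not
exist; the name, the flags, and the structure below are illustrative
assumptions only):

    /* hypothetical batch-fsync interface, one entry per file */
    struct fsync2_entry {
            int          fd;
            unsigned int flags;   /* e.g. a hypothetical FSYNC2_DATA_ONLY */
    };

    /*
     * On success, every fd in the list has the same guarantees that a
     * separate fsync()/fdatasync() on it would have provided, but the
     * filesystem may satisfy all of them with a single journal commit.
     */
    int fsync2(const struct fsync2_entry *entries, int nr_entries);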

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  2:30       ` Theodore Ts'o
@ 2019-05-03  3:15         ` Vijay Chidambaram
  2019-05-03  9:45           ` Theodore Ts'o
  2019-05-03  4:16         ` Amir Goldstein
  2019-05-09  1:43         ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Vijay Chidambaram @ 2019-05-03  3:15 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

> Again, here is my concern.  If we promise that ext4 will always obey
> Dave Chinner's SOMC model, it would forever rule out Daejun Park and
> Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
> latency of fsync system call"[1] published in Usenix ATC 2017.
>
> [1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf
>
> That's because this provides a fast fsync() using an incremental
> journal.  This fast fsync would cause the metadata associated with the
> inode being fsync'ed to be persisted after the crash --- ahead of
> metadata changes to other, potentially completely unrelated files,
> which would *not* be persisted after the crash.  Fine grained
> journalling would provide all of the guarantee all of the POSIX, and
> for applications that only care about the single file being fsync'ed
> -- they would be happy.  BUT, it violates the proposed crash
> consistency guarantees.
>
> So if the crash consistency guarantees forbids future innovations
> where applications might *want* a fast fsync() that doesn't drag
> unrelated inodes into the persistence guarantees, is that really what
> we want?  Do we want to forever rule out various academic
> investigations such as Park and Shin's because "it violates the crash
> consistency recovery model"?  Especially if some applications don't
> *need* the crash consistency model?
>
>                                                 - Ted
>
> P.S.  I feel especially strong about this because I'm working with an
> engineer currently trying to implement a simplified version of Park
> and Shin's proposal...  So this is not a hypothetical concern of mine.
> I'd much rather not invalidate all of this engineer's work to date,
> especially since there is a published paper demonstrating that for
> some workloads (such as sqlite), this approach can be a big win.

Ted, I sympathize with your position. To be clear, this is not what my
group or Amir is suggesting we do.

A few things to clarify:
1) We are not suggesting that all file systems follow SOMC semantics.
If ext4 does not want to do so, we are quite happy to document ext4
provides a different set of reasonable semantics. We can make the
ext4-related documentation as minimal as you want (or drop ext4 from
documentation entirely). I'm hoping this will satisfy you.
2) As I understand it, I do not think SOMC rules out the scenario in
your example, because it does not require fsync to push unrelated
files to storage.
3) We are not documenting how fsync works internally, merely what the
user-visible behavior is. I think this will actually free up file
systems to optimize fsync aggressively while making sure they provide
the required user-visible behavior.

Quoting from Dave Chinner's response when you brought up this concern
previously (https://patchwork.kernel.org/patch/10849903/#22538743):

"Sure, but again this is orthognal to what we are discussing here:
the user visible ordering of metadata operations after a crash.

If anyone implements a multi-segment or per-inode journal (say, like
NOVA), then it is up to that implementation to maintain the ordering
guarantees that a SOMC model requires. You can implement whatever
fsync() go-fast bits you want, as long as it provides the ordering
behaviour guarantees that the model defines.

IOWs, Ted, I think you have the wrong end of the stick here. This
isn't about optimising fsync() to provide better performance, it's
about guaranteeing order so that fsync() is not necessary and we
improve performance by allowing applications to omit order-only
synchronisation points in their workloads.

i.e. an order-based integrity model /reduces/ the need for a
hyper-optimised fsync operation because applications won't need to
use it as often."

> P.P.S.  One of the other discussions that did happen during the main
> LSF/MM File system session, and for which there was general agreement
> across a number of major file system maintainers, was a fsync2()
> system call which would take a list of file descriptors (and flags)
> that should be fsync'ed.  The semantics would be that when the
> fsync2() successfully returns, all of the guarantees of fsync() or
> fdatasync() requested by the list of file descriptors and flags would
> be satisfied.  This would allow file systems to more optimally fsync a
> batch of files, for example by implementing data integrity writebacks
> for all of the files, followed by a single journal commit to guarantee
> persistence for all of the metadata changes.

I like this "group fsync" idea. I think this is a great way to extend
the basic fsync interface.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  2:30       ` Theodore Ts'o
  2019-05-03  3:15         ` Vijay Chidambaram
@ 2019-05-03  4:16         ` Amir Goldstein
  2019-05-03  9:58           ` Theodore Ts'o
  2019-05-09  1:43         ` Dave Chinner
  2 siblings, 1 reply; 25+ messages in thread
From: Amir Goldstein @ 2019-05-03  4:16 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 2, 2019 at 10:30 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > > The expectation is that applications will use this, and then rename
> > > the O_TMPFILE file over the original file. Is this correct? If so, is
> > > there also an implied barrier between O_TMPFILE metadata and the
> > > rename?
>
> In the case of O_TMPFILE, the file can be brought into the namespace
> using something like:
>
> linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW);
>
> it's not using rename.
>
> To be clear, this discussion happened in the hallway, and it's not
> clear it had full support by everyone.  After our discussion, some of
> us came up with an example where forcing a call to
> filemap_write_and_wait() before the linkat(2) might *not* be the right
> thing.  Suppose some browser wanted to wait until a file was fully(
> downloaded before letting it appear in the directory --- but what was
> being downloaded was a 4 GiB DVD image (say, a distribution's install
> media).  If the download was done using O_TMPFILE followed by
> linkat(2), that might be a case where forcing the data blocks to disk
> before allowing the linkat(2) to proceed might not be what the
> application or the user would want.
>
> So it might be that we will need to add a linkat flag to indicate that
> we want the kernel to call filemap_write_and_wait() before making the
> metadata changes in linkat(2).
>

Agreed.

> > For replacing an existing file with another the same could be
> > achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname,
> > RENAME_ATOMIC). There is no need to create the tempname
> > file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC
> > flag would be redundant.
> >
> > RENAME_ATOMIC flag is needed because directories and non regular
> > files cannot be created using O_TMPFILE.
>
> I think there's much less consensus about this.  Again, most of this
> happened in a hallway conversation.
>

OK, we can leave that one for later.
Although I am not sure what the concern is.
If we are able to agree on and document a LINK_ATOMIC flag,
what would be the downside of documenting a RENAME_ATOMIC
flag with the same semantics? After all, as I said, this is what many users
already expect when renaming a temp file (as the ext4 heuristics prove).

I would love to get Dave's take on the proposal of LINK_ATOMIC/
RENAME_ATOMIC, preferably before the discussion wanders off
to an argument about what SOMC means...

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  3:15         ` Vijay Chidambaram
@ 2019-05-03  9:45           ` Theodore Ts'o
  2019-05-04  0:17             ` Vijay Chidambaram
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-03  9:45 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

On Thu, May 02, 2019 at 10:15:01PM -0500, Vijay Chidambaram wrote:
> 
> A few things to clarify:
> 1) We are not suggesting that all file systems follow SOMC semantics.
> If ext4 does not want to do so, we are quite happy to document ext4
> provides a different set of reasonable semantics. We can make the
> ext4-related documentation as minimal as you want (or drop ext4 from
> documentation entirely). I'm hoping this will satisfy you.
> 2) As I understand it, I do not think SOMC rules out the scenario in
> your example, because it does not require fsync to push un-related
> files to storage.
> 3) We are not documenting how fsync works internally, merely what the
> user-visible behavior is. I think this will actually free up file
> systems to optimize fsync aggressively while making sure they provide
> the required user-visible behavior.

As documented, the draft of the rules *I* saw specifically said that an
fsync() of inode B would guarantee that metadata changes for inode A,
which were made before the changes to inode B, would be persisted to
disk, since the metadata changes for B happened after the changes to
inode A.  It used fsync(2) *explicitly* as an example of how
ordering of unrelated files could be guaranteed.  And this would
invalidate Park and Shin's incremental journal for fsync.

If the guarantees apply only when fsync(2) is *not* being used, sure, then
the SOMC model is naturally what would happen with most common file
systems.  But then fsync(2) needs to appear nowhere in the crash
consistency model description, and that is not the case today.

Best regards,

						- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  4:16         ` Amir Goldstein
@ 2019-05-03  9:58           ` Theodore Ts'o
  2019-05-03 14:18             ` Amir Goldstein
  2019-05-09  2:36             ` Dave Chinner
  0 siblings, 2 replies; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-03  9:58 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> OK. we can leave that one for later.
> Although I am not sure what the concern is.
> If we are able to agree  and document a LINK_ATOMIC flag,
> what would be the down side of documenting a RENAME_ATOMIC
> flag with same semantics? After all, as I said, this is what many users
> already expect when renaming a temp file (as ext4 heuristics prove).

The problem is that if the "temp file" has been hardlinked into 1000
different directories, does the rename() have to guarantee that the
changes to all 1000 directories have been persisted to disk?  And that
all of the parent directories of those 1000 directories have also *all*
been persisted to disk, all the way up to the root?

With the O_TMPFILE linkat case, we know that the inode hasn't been
hard-linked into any other directory, and mercifully directories have
only one parent directory, so we only have to check that one chain of
directory inodes, all the way up to the root, has been persisted.

But.... I can already imagine someone complaining that if, due to bind
mounts and 1000 mount namespaces, there is some *other* directory
pathname which could be used to reach said "tmpfile", we have to
guarantee that all parent directories which could be used to reach
said "tmpfile", even if they span a dozen different file systems,
*also* have to be persisted, due to sloppy drafting of what the
atomicity rules might happen to be.

If we are only guaranteeing the persistence of the containing
directories of the source and destination files, that's pretty easy.
But then the consistency rules need to *explicitly* state this.  Some
of the handwaving definitions of what would be guaranteed.... scare
me.

						- Ted

P.S.  If we were going to do this, we'd probably want to simply define
a flag named AT_FSYNC, using the strict POSIX definition of fsync,
which is to say: as a result of the linkat or renameat, the file in
question, and its associated metadata, are guaranteed to be persisted
to disk.  No guarantees about any other inode's metadata,
regardless of when the changes were made, would be provided.
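
In sketch form (AT_FSYNC is hypothetical; the linkat() call itself and
AT_SYMLINK_FOLLOW are the existing interface):

    /* hypothetical AT_FSYNC: persist this file and its metadata to disk */
    linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname,
           AT_SYMLINK_FOLLOW | AT_FSYNC);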

If people really want "global barrier" semantics, then perhaps it
would be better to simply define a barrierfs(2) system call that works
like syncfs(2) --- it applies to the whole file system, and guarantees
that all changes made *before* barrierfs(2) will be visible if any
changes made *after* barrierfs(2) are visible.  Amir, you used "global
ordering" a few times; if you really need that, let's define a new
system call which guarantees that.  Maybe some of the research
proposals for exotic changes to SSD semantics, etc., would allow
barrierfs(2) semantics to be something that we could implement more
efficiently than syncfs(2).  But let's make this be explicit, as
opposed to some magic guarantee that falls out as a side effect of the
fsync(2) system call to a single inode.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  9:58           ` Theodore Ts'o
@ 2019-05-03 14:18             ` Amir Goldstein
  2019-05-09  2:36             ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2019-05-03 14:18 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Fri, May 3, 2019 at 5:59 AM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> > OK. we can leave that one for later.
> > Although I am not sure what the concern is.
> > If we are able to agree  and document a LINK_ATOMIC flag,
> > what would be the down side of documenting a RENAME_ATOMIC
> > flag with same semantics? After all, as I said, this is what many users
> > already expect when renaming a temp file (as ext4 heuristics prove).
>
> The problem is if the "temp file" has been hardlinked to 1000
> different directories, does the rename() have to guarantee that we
> have to make sure that the changes to all 1000 directories have been
> persisted to disk?  And all of the parent directories of those 1000
> directories have also *all* been persisted to disk, all the way up to
> the root?
>
> With the O_TMPFILE linkat case, we know that inode hasn't been
> hard-linked to any other directory, and mercifully directories have
> only one parent directory, so we only have to check one set of
> directory inodes all the way up to the root having been persisted.
>
> But.... I can already imagine someone complaining that if due to bind
> mounts and 1000 mount namespaces, there is some *other* directory
> pathname which could be used to reach said "tmpfile", we have to
> guarantee that all parent directories which could be used to reach
> said "tmpfile" even if they span a dozen different file systems,
> *also* have to be persisted due to sloppy drafting of what the
> atomicity rules might happen to be.
>
> If we are only guaranteeing the persistence of the containing
> directories of the source and destination files, that's pretty easy.
> But then the consistency rules need to *explicitly* state this.  Some
> of the handwaving definitions of what would be guaranteed.... scare
> me.
>

I see. So the issue is with the language:
"metadata modifications made to the file before being linked"
which may be interpreted to mean that hardlinking a file is itself a
modification to the file. I can't help writing the pun
"nlink doesn't count".

Tough one. We can include more exclusive language, but that
is not going to aid the goal of a simple documented API.

OK, I'll withdraw RENAME_ATOMIC for now and concede to
having LINK_ATOMIC fail when trying to link a file with nlink > 0.

How about if I implement RENAME_ATOMIC for in-kernel users
only at this point in time?

Overlayfs needs it for the correctness of its directory copy-up operation.

>
> P.S.  If we were going to do this, we'd probably want to simply define
> a flag to be AT_FSYNC, using the strict POSIX definition of fsync,
> which is to say, as a result of the linkat or renameat, the file in
> question, and its associated metadata, are guaranteed to be persisted
> to disk.  No other guarantees about any other inode's metadata
> regardless of when they might be made, would be guaranteed.
>

I agree that may be useful. Not for my use case, though.

> If people really want "global barrier" semantics, then perhaps it
> would be better to simply define a barrierfs(2) system call that works
> like syncfs(2) --- it applies to the whole file system, and guarantees
> that all changes made after barrierfs(2) will be visible if any
> changes made *after* barrierfs(2) are visible.  Amir, you used "global
> ordering" a few times; if you really need that, let's define a new
> system call which guarantees that.  Maybe some of the research
> proposals for exotic changes to SSD semantics, etc., would allow
> barrierfs(2) semantics to be something that we could implement more
> efficiently than syncfs(2).  But let's make this be explicit, as
> opposed to some magic guarantee that falls out as a side effect of the
> fsync(2) system call to a single inode.

Yes, maybe. For xfs/ext4.
Not sure about btrfs. Seems like fbarrier(2) would have been
more natural for the btrfs model (a file and all its dependencies).

I think barrierfs(2) would be useful, but I think it is harder to
explain to users.
See barrierfs() should not flush all inode pages that would be counter
productive, so what does it really mean to end users?
We would end up with the same problem of misunderstood sync_file_range().

I would have been happy with this API:
sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT);
barrierfs(fd);
rename(...)/link(...)

Perhaps atomic_rename()/atomic_link() should be library functions
wrapping the lower level API to hide those details from end users.
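
A minimal sketch of such a wrapper, assuming the hypothetical barrierfs()
above; everything else is existing API, and atomic_link() would look the
same with linkat() in place of rename():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical syscall from this thread - not implemented anywhere yet. */
extern int barrierfs(int fd);

/* "rename observed after crash" implies "file data observed after crash",
 * without forcing a journal flush the way fsync() would. */
int atomic_rename(int fd, const char *oldpath, const char *newpath)
{
	/* Write back and wait on the file's dirty pages... */
	if (sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WAIT_BEFORE |
			    SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER))
		return -1;
	/* ...order them ahead of the namespace change... */
	if (barrierfs(fd))
		return -1;
	/* ...then publish the new name. */
	return rename(oldpath, newpath);
}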

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  9:45           ` Theodore Ts'o
@ 2019-05-04  0:17             ` Vijay Chidambaram
  2019-05-04  1:43               ` Theodore Ts'o
  0 siblings, 1 reply; 25+ messages in thread
From: Vijay Chidambaram @ 2019-05-04  0:17 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

> As documented, the draft of the rules *I* saw specifically said that a
> fsync() to inode B would guarantee that metadata changes for inode A,
> which were made before the changes to inode B, would be persisted to
> disk since the metadata changes for B happened after the changes to
> inode A.  It used the fsync(2) *explicitly* as an example for how
> ordering of unrelated files could be guaranteed.  And this would
> invalidate Park and Shin's incremental journal for fsync.
>
> If the guarantees are when fsync(2) is *not* being used, sure, then
> the SOMC model is naturally what would happen with most common file
> system.  But then fsync(2) needs to appear nowhere in the crash
> consistency model description, and that is not the case today.
>

I think there might be a misunderstanding about the example
(reproduced below) and about SOMC. The relationship that matters is
not whether X happens before Y. The relationship between X and Y is
that they are in the same directory, so fsync(new file X) implies
fsync(X's parent directory) which contains Y.  In the example, X is
A/foo and Y is A/bar. For truly unrelated files such as A/foo and
B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar.

touch A/foo
echo “hello” >  A/foo
sync
mv A/foo A/bar
echo “world” > A/foo
fsync A/foo
CRASH

We could rewrite the example to not include fsync, but this example
comes directly from xfstest generic/342, so we would like to preserve
it.

But in any case, I think this is beside the point. If ext4 does not
want to provide SOMC-like behavior, I think that is totally
reasonable. The documentation does *not* say all file systems should
provide SOMC. As long as the documentation does not say ext4 provides
SOMC-like behavior, are you okay with the rest of the documentation
effort? If so, we can send out v3 with these changes.

Please forgive my continued pushing on this:  I would like to see more
documentation about these file-system aspects in the kernel. XFS and
btrfs developers approved of the effort, so there is some support for
this. We have already put in some work on the documentation, so I'd
like to see it finished up and merged. (Sorry for hijacking/forking
the thread Amir!)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-04  0:17             ` Vijay Chidambaram
@ 2019-05-04  1:43               ` Theodore Ts'o
  2019-05-07 18:38                 ` Jan Kara
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-04  1:43 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

On Fri, May 03, 2019 at 07:17:54PM -0500, Vijay Chidambaram wrote:
> 
> I think there might be a misunderstanding about the example
> (reproduced below) and about SOMC. The relationship that matters is
> not whether X happens before Y. The relationship between X and Y is
> that they are in the same directory, so fsync(new file X) implies
> fsync(X's parent directory) which contains Y.  In the example, X is
> A/foo and Y is A/bar. For truly unrelated files such as A/foo and
> B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar.

When you say "X and Y are in the same directory", how does this apply
in the face of hard links?  Remember, file X might be in 100
different directories.  Does that mean if changes to file X are visible
after a crash, all files Y in any of X's 100 containing directories
that were modified before X must have their changes visible after
the crash?

I suspect that the SOMC as articulated by Dave does make such global
guarantees.  Certainly without Park and Shin's incremental fsync,
unrelated files will have the property that if A/foo is modified after
B/bar, and A/foo's metadata changes are visible after a crash, B/bar's
metadata will also be visible.  This is true for ext4, and xfs.

Even if we ignore the hard link problem, and assume that it only
applies for files foo and bar with st_nlink == 1, the crash
consistency guarantees you've described will *still* rule out Park and
Shin's incremental fsync.  So depending on whether ext4 has fast fsync
enabled, we might or might not have behavior consistent with your
proposed crash consistency rules.

But at this point, even if we promulgate these "guarantees" in a
kernel documentation file, applications won't be able to depend on
them.  If they do, they will be unreliable depending on which file
system they use; so they won't be particularly useful for application
authors who care about portability.  (Or worse, for users who are under
the delusion that the application authors care about portability, and
who will be subject to data corruption after a crash.)  Do we *really*
want to be promulgating these semantics to application authors?

Finally, I'll note that generic/342 is much more specific, and your
proposed crash consistency rule is more general.

# Test that if we rename a file, create a new file that has the old name of the
# other file and is a child of the same parent directory, fsync the new inode,
# power fail and mount the filesystem, we do not lose the first file and that
# file has the name it was renamed to.

> touch A/foo
> echo “hello” >  A/foo
> sync
> mv A/foo A/bar
> echo “world” > A/foo
> fsync A/foo
> CRASH

Sure, that's one that fast commit will honor.

But what about:

echo "world" > A/foo
echo "hello" > A/bar
chmod 755 A/bar
sync
chmod 750 A/bar
echo "new world" >> A/foo
fsync A/foo
CRASH

.... will your crash consistency rules guarantee that the permissions
change for A/bar is visible after the fsync of A/foo?

Or if A/foo and A/bar exists, and we do:

echo "world" > A/foo
echo "hello" > A/bar
sync
mv A/bar A/quux
echo "new world" >> A/foo
fsync A/foo
CRASH

... is the rename of A/bar and A/quux guaranteed to be visible after
the crash?

With Park and Shin's incremental fsync journal, the two cases I've
described above would *not* have such guarantees.  Standard ext4 today
would in fact have these guarantees.  But I would consider this an
accident of the implementation, and *not* a promise that I would want
to make for all time, precisely because it forbids us from making
innovations that might improve performance.

Even if I didn't have an engineer working on implementing Park and
Shin's proposal, what worries me is that if I did make this guarantee, it
would tie my hands from making this optimization in the future --- and
I can't necessarily foresee all possible optimizations we might want to
make in the future.

So the question I'm trying to ask is how many applications will
actually benefit from "documenting current behavior" and effectively
turning this into a promise for all time?  Ultimately this is a
tradeoff.  Sure, this might enable applications to do things that are
more aggressive than what POSIX guarantees; but it also ties the hands
of file system engineers.

This is why I'd much rather do this via new system calls; say, maybe
something like fsync_with_barrier(fd).  This can degrade to fsync(fd)
if necessary, but it allows the application to explicitly request
certain semantics, as opposed to encouraging applications to *assume*
that certain magic side effects will be there --- and which might not
be true for all file systems, or for all time.
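
To sketch the application side (fsync_with_barrier() is purely
hypothetical here, and the wrapper degrades to plain fsync() where the
syscall does not exist):

#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

static int fsync_with_barrier_or_fallback(int fd)
{
#ifdef __NR_fsync_with_barrier		/* hypothetical syscall number */
	if (syscall(__NR_fsync_with_barrier, fd) == 0)
		return 0;
	if (errno != ENOSYS)
		return -1;
#endif
	return fsync(fd);		/* fall back to the plain, heavier call */
}

An application updating a file via a temp file would then do write(),
fsync_with_barrier_or_fallback(fd), rename().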

We still need to very carefully define what the semantics of
fsync_with_barrier(fd) would be --- especially whether
fsync_with_barrier(fd) provides local (within the same directory) or
global barrier guarantees, and if it's local, how are files with
multiple "parent directories" interact with the guarantees.  But at
least this way it's an explicit declaration of what the application
wants, and not an implicit one.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-04  1:43               ` Theodore Ts'o
@ 2019-05-07 18:38                 ` Jan Kara
  0 siblings, 0 replies; 25+ messages in thread
From: Jan Kara @ 2019-05-07 18:38 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Vijay Chidambaram, Amir Goldstein, lsf-pc, Dave Chinner,
	Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan,
	Filipe Manana, Chris Mason, lwn

On Fri 03-05-19 21:43:07, Theodore Ts'o wrote:
> On Fri, May 03, 2019 at 07:17:54PM -0500, Vijay Chidambaram wrote:
> > 
> > I think there might be a misunderstanding about the example
> > (reproduced below) and about SOMC. The relationship that matters is
> > not whether X happens before Y. The relationship between X and Y is
> > that they are in the same directory, so fsync(new file X) implies
> > fsync(X's parent directory) which contains Y.  In the example, X is
> > A/foo and Y is A/bar. For truly unrelated files such as A/foo and
> > B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar.
> 
> When you say "X and Y are in the same directory", how does this apply
> in the face of hard links?  Remember, file X might be in 100
> different directories.  Does that mean if changes to file X are visible
> after a crash, all files Y in any of X's 100 containing directories
> that were modified before X must have their changes visible after
> the crash?
> 
> I suspect that the SOMC as articulated by Dave does make such global
> guarantees.  Certainly without Park and Shin's incremental fsync,
> unrelated files will have the property that if A/foo is modified after
> B/bar, and A/foo's metadata changes are visible after a crash, B/bar's
> metadata will also be visible.  This is true for ext4, and xfs.
> 
> Even if we ignore the hard link problem, and assume that it only
> applies for files foo and bar with st_nlink == 1, the crash
> consistency guarantees you've described will *still* rule out Park and
> Shin's incremental fsync.  So depending on whether ext4 has fast fsync
> enabled, we might or might not have behavior consistent with your
> proposed crash consistency rules.
> 
> But at this point, even if we promulgate these "guarantees" in a
> kernel documentation file, applications won't be able to depend on
> them.  If they do, they will be unreliable depending on which file
> system they use; so they won't be particularly useful for application
> authors who care about portability.  (Or worse, for users who are under
> the delusion that the application authors care about portability, and
> who will be subject to data corruption after a crash.)  Do we *really*
> want to be promulgating these semantics to application authors?

I agree that having fs-specific promises for crash consistency is bad.
The application would have to detect what filesystem it is running on and
based on that issue fsync or not. I don't think many applications will get
this right, so IMO it would result in more problems in case of a crash, not less.

> Finally, I'll note that generic/342 is much more specific, and your
> proposed crash consistency rule is more general.
> 
> # Test that if we rename a file, create a new file that has the old name of the
> # other file and is a child of the same parent directory, fsync the new inode,
> # power fail and mount the filesystem, we do not lose the first file and that
> # file has the name it was renamed to.
> 
> > touch A/foo
> > echo “hello” >  A/foo
> > sync
> > mv A/foo A/bar
> > echo “world” > A/foo
> > fsync A/foo
> > CRASH
> 
> Sure, that's one that fast commit will honor.

Hum, but will this also be honored in the case of hardlinks? E.g.

echo "hello" >A/foo
ln A/foo B/foo
sync
mv A/foo A/bar
mv B/foo B/bar
echo "world" >A/foo
fsync A/foo

Will you also persist changes to B? If not, will you persist them if we
do 'ln A/foo B/foo' before 'fsync A/foo'? I'm just wondering where you draw
the borderline if you actually do care about namespace changes in addition
to the inode + its metadata...

> But what about:
> 
> echo "world" > A/foo
> echo "hello" > A/bar
> chmod 755 A/bar
> sync
> chmod 750 A/bar
> echo "new world" >> A/foo
> fsync A/foo
> CRASH
> 
> .... will your crash consistency rules guarantee that the permissions
> change for A/bar is visible after the fsync of A/foo?
> 
> Or if A/foo and A/bar exists, and we do:
> 
> echo "world" > A/foo
> echo "hello" > A/bar
> sync
> mv A/bar A/quux
> echo "new world" >> A/foo
> fsync A/foo
> CRASH
> 
> ... is the rename of A/bar and A/quux guaranteed to be visible after
> the crash?

And I agree that guaranteeing the ordering of operations on unrelated
names in a directory relative to operations on an inode in that
directory does not seem very useful.

								Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  2:30       ` Theodore Ts'o
  2019-05-03  3:15         ` Vijay Chidambaram
  2019-05-03  4:16         ` Amir Goldstein
@ 2019-05-09  1:43         ` Dave Chinner
  2019-05-09  2:20           ` Theodore Ts'o
  2019-05-09  8:47           ` Amir Goldstein
  2 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2019-05-09  1:43 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 02, 2019 at 10:30:43PM -0400, Theodore Ts'o wrote:
> On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > I am not saying there is no room for a document that elaborates on those
> > guaranties. I personally think that could be useful and certainly think that
> > your group's work for adding xfstest coverage for API guaranties is useful.
> 
> Again, here is my concern.  If we promise that ext4 will always obey
> Dave Chinner's SOMC model, it would forever rule out Daejun Park and
> Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
> latency of fsync system call"[1] published in Usenix ATC 2017.

No, it doesn't rule that out at all.

In a SOMC model, incremental journalling is just fine when there are
no external dependencies on the thing being fsync'd.  If you have
other dependencies (e.g. the file has just been created and so the dir
is dirty, too) then fsync would need to do the whole shebang, but
otherwise....

> So if the crash consistency guarantees forbids future innovations
> where applications might *want* a fast fsync() that doesn't drag
> unrelated inodes into the persistence guarantees,

.... the whole point of SOMC is that it allows filesystems to avoid
dragging external metadata into fsync() operations /unless/ there's
a user visible ordering dependency that must be maintained between
objects.  If all you are doing is stabilising file data in a stable
file/directory, then independent, incremental journaling of the
fsync operations on that file fit the SOMC model just fine.

> is that really what
> we want?  Do we want to forever rule out various academic
> investigations such as Park and Shin's because "it violates the crash
> consistency recovery model"?  Especially if some applications don't
> *need* the crash consistency model?

Stop with the silly inflammatory hyperbole already, Ted. It is not
necessary.

> P.P.S.  One of the other discussions that did happen during the main
> LSF/MM File system session, and for which there was general agreement
> across a number of major file system maintainers, was a fsync2()
> system call which would take a list of file descriptors (and flags)
> that should be fsync'ed.

Hmmmm, that wasn't on the agenda, and nobody has documented it as
yet.

> The semantics would be that when the
> fsync2() successfully returns, all of the guarantees of fsync() or
> fdatasync() requested by the list of file descriptors and flags would
> be satisfied.  This would allow file systems to more optimally fsync a
> batch of files, for example by implementing data integrity writebacks
> for all of the files, followed by a single journal commit to guarantee
> persistence for all of the metadata changes.

What happens when you get writeback errors on only some of the fds?
How do you report the failures and what do you do with the journal
commit on partial success?

Of course, this ignores the elephant in the room: applications can
/already do this/ using AIO_FSYNC and have individual error status
for each fd. Not to mention that filesystems already batch
concurrent fsync journal commits into a single operation. I'm not
seeing the point of a new syscall to do this right now....
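
For reference, a minimal sketch of that existing path using libaio
(assumes a kernel with AIO fsync support wired up; error checking of the
setup/submit calls is omitted for brevity):

#include <libaio.h>
#include <stdio.h>

/* Submit fsync for a batch of fds in one go and collect per-fd status. */
static void fsync_many(int *fds, int nr)
{
	io_context_t ctx = 0;
	struct iocb cbs[nr], *cbp[nr];
	struct io_event events[nr];
	int i;

	io_setup(nr, &ctx);
	for (i = 0; i < nr; i++) {
		io_prep_fsync(&cbs[i], fds[i]);
		cbp[i] = &cbs[i];
	}
	io_submit(ctx, nr, cbp);			/* one batched submission */
	io_getevents(ctx, nr, nr, events, NULL);	/* wait for all completions */
	for (i = 0; i < nr; i++) {
		struct iocb *cb = events[i].obj;
		long res = (long)events[i].res;		/* 0 or negative errno */

		if (res < 0)
			fprintf(stderr, "fsync of fd %d failed: %ld\n",
				cb->aio_fildes, res);
	}
	io_destroy(ctx);
}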

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  1:43         ` Dave Chinner
@ 2019-05-09  2:20           ` Theodore Ts'o
  2019-05-09  2:58             ` Dave Chinner
  2019-05-09  5:02             ` Vijay Chidambaram
  2019-05-09  8:47           ` Amir Goldstein
  1 sibling, 2 replies; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-09  2:20 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> 
> .... the whole point of SOMC is that it allows filesystems to avoid
> dragging external metadata into fsync() operations /unless/ there's
> a user visible ordering dependency that must be maintained between
> objects.  If all you are doing is stabilising file data in a stable
> file/directory, then independent, incremental journaling of the
> fsync operations on that file fit the SOMC model just fine.

Well, that's not what Vijay's crash consistency guarantees state.  It
guarantees quite a bit more than what you've written above.  Which is
my concern.

> > P.P.S.  One of the other discussions that did happen during the main
> > LSF/MM File system session, and for which there was general agreement
> > across a number of major file system maintainers, was a fsync2()
> > system call which would take a list of file descriptors (and flags)
> > that should be fsync'ed.
> 
> Hmmmm, that wasn't on the agenda, and nobody has documented it as
> yet.

It came up as a suggested alternative during Ric Wheeler's "Async all
the things" session.  The problem he was trying to address was
programs (perhaps userspace file servers) that need to fsync a large
number of files at the same time.  The problem with his suggested
solution (which we have for AIO and io_uring already) of having the
program issue a large number of asynchronous fsyncs and then waiting
for them all, is that the back-end interface is a work queue, so there
is a lot of effective serialization that takes place.

> > The semantics would be that when the
> > fsync2() successfully returns, all of the guarantees of fsync() or
> > fdatasync() requested by the list of file descriptors and flags would
> > be satisfied.  This would allow file systems to more optimally fsync a
> > batch of files, for example by implementing data integrity writebacks
> > for all of the files, followed by a single journal commit to guarantee
> > persistence for all of the metadata changes.
> 
> What happens when you get writeback errors on only some of the fds?
> How do you report the failures and what do you do with the journal
> commit on partial success?

Well, one approach would be to pass back the errors in the structure.
Say something like this:

     int fsync2(int len, struct fsync_req[]);

     struct fsync_req {
          int	fd;        /* IN */
	  int	flags;	   /* IN */
	  int	retval;    /* OUT */
     };

As far as what you do with the journal commit on partial success,
there are no atomic, "all or nothing" guarantees with this interface.
It is implementation specific whether there would be one or more file
system commits necessary before fsync2 returned.
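
To sketch the caller's side, assuming nothing beyond the prototype and
structure above (fsync2() itself remains hypothetical):

/* Batch-fsync a set of already-open descriptors, reporting per-fd errors. */
static int fsync_batch(const int *fds, int nr)
{
	struct fsync_req reqs[nr];
	int i, err = 0;

	for (i = 0; i < nr; i++) {
		reqs[i].fd = fds[i];
		reqs[i].flags = 0;		/* or a datasync-style flag */
	}
	if (fsync2(nr, reqs) < 0)
		return -1;			/* the call itself failed */
	for (i = 0; i < nr; i++)
		if (reqs[i].retval < 0)		/* writeback error on this fd */
			err = reqs[i].retval;
	return err;
}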

> Of course, this ignores the elephant in the room: applications can
> /already do this/ using AIO_FSYNC and have individual error status
> for each fd. Not to mention that filesystems already batch
> concurrent fsync journal commits into a single operation. I'm not
> seeing the point of a new syscall to do this right now....

But it doesn't work very well, because the implementation uses a
workqueue.  Sure, you could create N worker threads for N fd's that
you want to fsync, and then the file system can batch the fsync requests.
But wouldn't it be much simpler to give a list of fd's that should be
fsync'ed to the file system?  That way you don't have to do lots of
work to split up the work so they can be submitted in parallel, only
to have the file system batch up all of the requests being issued from
all of those kernel threads.

So yes, it's identical to the interfaces we already have.  Just like
select(2), poll(2) and epoll(2) are functionally identical...

     	  	     	    	   	 - Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-03  9:58           ` Theodore Ts'o
  2019-05-03 14:18             ` Amir Goldstein
@ 2019-05-09  2:36             ` Dave Chinner
  1 sibling, 0 replies; 25+ messages in thread
From: Dave Chinner @ 2019-05-09  2:36 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Fri, May 03, 2019 at 05:58:46AM -0400, Theodore Ts'o wrote:
> On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> > OK. we can leave that one for later.
> > Although I am not sure what the concern is.
> > If we are able to agree  and document a LINK_ATOMIC flag,
> > what would be the down side of documenting a RENAME_ATOMIC
> > flag with same semantics? After all, as I said, this is what many users
> > already expect when renaming a temp file (as ext4 heuristics prove).
> 
> The problem is if the "temp file" has been hardlinked to 1000
> different directories, does the rename() have to guarantee that we
> have to make sure that the changes to all 1000 directories have been
> persisted to disk?

No.

Dependency creation is directional.

If the parent directory modifies an entry that points to an inode,
then the dependency (via inode link count modification) is created.
Modifying an inode does not create a dependency on the parent
directory, because the parent directory is not modified by inode
specific changes.

Yes, sometimes the dependency graph will resolve to fsync other
directories. e.g. because hardlinks to the same inode were created
and this is the first fsync on the inode that stabilises the link
count. Because the link count is being stabilised, all the current
dependencies on that link count (i.e. all the directories with
uncommitted dirent modifications that modified the link count in
that inode) /may/ be included in the fsync.

However, if the filesystem tracks every change to the inode link
count as separate modifications, it only needs to commit the directory
modifications that occurred /before/ the one being fsync'd. i.e.
SOMC doesn't require "sync the world" behaviour, it's just that we
have filesystems that currently behave that way because it's a
simple and efficient way of tracking and resolving ordering
dependencies.

IOWs, SOMC is all about cross-object dependencies and how they are
resolved. If you have no cross-object dependencies or your
operations are isolated to a non-shared set of objects, then SOMC
allows them to operate in 100% isolation to everything else and the
filesystem can optimise this in whatever way it wants.

SOMC is not the end of the world, Ted. It's just a consistency model
that has been proposed that could allow substantial optimisation of
application operations and filesystem behaviour. You're free to go
in other directions if you want - diversity is good. :)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  2:20           ` Theodore Ts'o
@ 2019-05-09  2:58             ` Dave Chinner
  2019-05-09  3:31               ` Theodore Ts'o
  2019-05-09  5:02             ` Vijay Chidambaram
  1 sibling, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-05-09  2:58 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Wed, May 08, 2019 at 10:20:13PM -0400, Theodore Ts'o wrote:
> On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> > 
> > .... the whole point of SOMC is that it allows filesystems to avoid
> > dragging external metadata into fsync() operations /unless/ there's
> > a user visible ordering dependency that must be maintained between
> > objects.  If all you are doing is stabilising file data in a stable
> > file/directory, then independent, incremental journaling of the
> > fsync operations on that file fit the SOMC model just fine.
> 
> Well, that's not what Vijay's crash consistency guarantees state.  It
> guarantees quite a bit more than what you've written above.  Which is
> my concern.

SOMC does not define crash consistency rules - it defines change
dependencies and how ordering and atomicity impact the dependency
graph. How other people have interpreted that is out of my control.

> It came up as a suggested alternative during Ric Wheeler's "Async all
> the things" session.  The problem he was trying to address was
> programs (perhaps userspace file servers) that need to fsync a large
> number of files at the same time.  The problem with his suggested
> solution (which we have for AIO and io_uring already) of having the
> program issue a large number of asynchronous fsyncs and then waiting
> for them all, is that the back-end interface is a work queue, so there
> is a lot of effective serialization that takes place.

We got linear scaling out to device bandwidth and/or IOPS limits
with bulk fsync benchmarks on XFS with that simple workqueue
implementation.

If there's problems, then I'd suggest that people should be
reporting bugs to the developers of the AIO_FSYNC code (i.e.
Christoph and myself) or providing patches to improve it so these
problems go away.

A new syscall with essentially the same user interface doesn't
guarantee that these implementation problems will be solved.


> > > The semantics would be that when the
> > > fsync2() successfully returns, all of the guarantees of fsync() or
> > > fdatasync() requested by the list of file descriptors and flags would
> > > be satisfied.  This would allow file systems to more optimally fsync a
> > > batch of files, for example by implementing data integrity writebacks
> > > for all of the files, followed by a single journal commit to guarantee
> > > persistence for all of the metadata changes.
> > 
> > What happens when you get writeback errors on only some of the fds?
> > How do you report the failures and what do you do with the journal
> > commit on partial success?
> 
> Well, one approach would be to pass back the errors in the structure.
> Say something like this:
> 
>      int fsync2(int len, struct fsync_req[]);
> 
>      struct fsync_req {
>           int	fd;        /* IN */
> 	  int	flags;	   /* IN */
> 	  int	retval;    /* OUT */
>      };

So it's essentially identical to the AIO_FSYNC interface, except
that it is synchronous.

> As far as what you do with the journal commit on partial success,
> there are no atomic, "all or nothing" guarantees with this interface.
> It is implementation specific whether there would be one or more file
> system commits necessary before fsync2 returned.

IOWs, same guarantees as AIO_FSYNC.

> > Of course, this ignores the elephant in the room: applications can
> > /already do this/ using AIO_FSYNC and have individual error status
> > for each fd. Not to mention that filesystems already batch
> > concurrent fsync journal commits into a single operation. I'm not
> > seeing the point of a new syscall to do this right now....
> 
> But it doesn't work very well, because the implementation uses a
> workqueue.

Then fix the fucking implementation!

Sheesh! Did LSFMM include a free lobotomy for participants, or
something?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  2:58             ` Dave Chinner
@ 2019-05-09  3:31               ` Theodore Ts'o
  2019-05-09  5:19                 ` Darrick J. Wong
  0 siblings, 1 reply; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-09  3:31 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 09, 2019 at 12:58:45PM +1000, Dave Chinner wrote:
> 
> SOMC does not define crash consistency rules - it defines change
> dependencies and how ordering and atomicity impact the dependency
> graph. How other people have interpreted that is out of my control.

Fine; but it's a specific set of the crash consistency rules which I'm
objecting to; it's not a promise that I think I want to make.  (And
before you blindly sign on the bottom line, I'd suggest that you read
it very carefully before deciding whether you want to agree to those
consistency rules as something that XFS will have to honor forever.  The
way I read it, it goes beyond what you've articulated as SOMC.)

> A new syscall with essentially the same user interface doesn't
> guarantee that these implementation problems will be solved.

Well, it makes it easier to send all of the requests to the file
system in a single bundle.  I'd also argue that it's simpler and
easier for an application to use a fsync2() interface as I sketched
out than trying to use the whole AIO or io_uring machinery.

> So it's essentially identical to the AIO_FSYNC interface, except
> that it is synchronous.

Pretty much, yes.

> Sheesh! Did LSFMM include a free lobotomy for participants, or
> something?

Well, we missed your presence, alas.  No doubt your attendance would
have improved the discussion.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  2:20           ` Theodore Ts'o
  2019-05-09  2:58             ` Dave Chinner
@ 2019-05-09  5:02             ` Vijay Chidambaram
  2019-05-09  5:37               ` Darrick J. Wong
  2019-05-09 15:46               ` Theodore Ts'o
  1 sibling, 2 replies; 25+ messages in thread
From: Vijay Chidambaram @ 2019-05-09  5:02 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Amir Goldstein, lsf-pc, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

On Wed, May 8, 2019 at 9:30 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> >
> > .... the whole point of SOMC is that it allows filesystems to avoid
> > dragging external metadata into fsync() operations /unless/ there's
> > a user visible ordering dependency that must be maintained between
> > objects.  If all you are doing is stabilising file data in a stable
> > file/directory, then independent, incremental journaling of the
> > fsync operations on that file fit the SOMC model just fine.
>
> Well, that's not what Vijay's crash consistency guarantees state.  It
> guarantees quite a bit more than what you've written above.  Which is
> my concern.

The intention is to capture Dave's SOMC semantics. We can re-iterate
and re-phrase until we capture what Dave meant precisely. I am fairly
confident we can do this, given that Dave himself is participating and
helping us refine the text. So this doesn't seem like a reason not to
have documentation at all to me.

As we have stated multiple times on this and other threads, the
intention is *not* to come up with one set of crash-recovery
guarantees that every Linux file system must abide by forever. Ted,
you keep repeating this, though we have never said this was our
intention.

The intention behind this effort is to simply document the
crash-recovery guarantees provided today by different Linux file
systems. Ted, you question why this is required at all, and why we
simply can't use POSIX and man pages. The answer:

1. POSIX is vague. Not persisting data to stable media on fsync is
also allowed in POSIX (but no Linux file system actually does this),
so it's not very useful in terms of understanding what crash-recovery
guarantees file systems actually provide. Given that all Linux file
systems provide something more than POSIX, the natural question to ask
is what do they provide? We understood this from working on
CrashMonkey, and we wanted to document it.
2. Other parts of the Linux kernel have much better documentation,
even though they similarly want to provide freedom for developers to
optimize and change internal implementation. I don't think
documentation and freedom to change internals are mutually exclusive.
3. XFS provides SOMC semantics, and btrfs developers have stated they
want to provide SOMC as well. F2FS developers have a mode in which
they seek to provide SOMC semantics. Given all this, it seemed prudent
to document SOMC.
4. Apart from developers, a document like this would also help
academic researchers understand the current state-of-the-art in
crash-recovery guarantees and the different choices made by different
file systems. It is non-trivial to understand this without
documentation.

FWIW, I think the position of "if we don't write it down, application
developers can't depend on it" is wrong. Even with nothing written
down, developers noticed they could skip fsync() in ext3 when
atomically updating files with rename(). This led to the whole ext4
rename-and-delayed-allocation problem. The much better path, IMO, is
to document the current set of guarantees given by different file
systems, and talk about what is intended and what is not. This would
give application developers much better guidance in writing
applications.

If ext4 wants to develop incremental fsync and introduce a new set of
semantics that is different from SOMC and much closer to minimal
POSIX, I don't think the documentation affects that at all. As Dave
notes, diversity is good! Documentation is also good :)

That being said, I think I'll stop our push to get this documented
inside the Linux kernel at this point. We got useful comments from
Dave, Amir, and others, so we will incorporate those comments and put
up the documentation on a University of Texas web page. If someone
else wants to carry on and get this merged, you are welcome to do so
:)

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  3:31               ` Theodore Ts'o
@ 2019-05-09  5:19                 ` Darrick J. Wong
  0 siblings, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2019-05-09  5:19 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Dave Chinner, Amir Goldstein, Vijay Chidambaram, lsf-pc,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Wed, May 08, 2019 at 11:31:00PM -0400, Theodore Ts'o wrote:
> On Thu, May 09, 2019 at 12:58:45PM +1000, Dave Chinner wrote:
> > 
> > SOMC does not define crash consistency rules - it defines change
> > dependencies and how ordering and atomicity impact the dependency
> > graph. How other people have interpreted that is out of my control.
> 
> Fine; but it's a specific set of the crash consistency rules which I'm
> objecting to; it's not a promise that I think I want to make.  (And
> before you blindly sign on the bottom line, I'd suggest that you read
> it very carefully before deciding whether you want to agree to those
> consistency rules as something that XFS will have to honor forever.  The
> way I read it, it goes beyond what you've articulated as SOMC.)

I find myself (unusually) rooting for the status quo, where we /don't/
have a big SOMC rulebook that everyone has to follow, and instead we
just tell people that if they really want to know a filesystem they had
better try their workload with that fs + storage.  If they don't like
what they find, we have a reasonable amount of competition and niche
specialization amongst the many filesystems that they can try the
others, or if they're still unsatisfied, see if they can drive a
consensus.  Filesystems are like cars -- the basic interfaces are more
or less the same but the implementations can still differ.

(They also tend to crash, catch on fire, and leave a smear of
destruction in their wake.)

> > A new syscall with essentially the same user interface doesn't
> > guarantee that these implementation problems will be solved.
> 
> Well, it makes it easier to send all of the requests to the file
> system in a single bundle.  I'd also argue that it's simpler and
> easier for an application to use a fsync2() interface as I sketched
> out than trying to use the whole AIO or io_uring machinery.

I *would* like to see a more concrete fsync2 proposal.  And while I'm
asking for ponies, whatever it is that came out of the DAX file flags
discussion too.

> 
> > So it's essentially identical to the AIO_FSYNC interface, except
> > that it is synchronous.
> 
> Pretty much, yes.

OH yeah, I forgot we wired that up finally.

> > Sheesh! Did LSFMM include a free lobotomy for participants, or
> > something?

"I'd rather have a bottle in front of me..."

Peace out, see you all on the 20th!

--D

> Well, we missed your presence, alas.  No doubt your attendance would
> have improved the discussion.
> 
> Cheers,
> 
> 					- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  5:02             ` Vijay Chidambaram
@ 2019-05-09  5:37               ` Darrick J. Wong
  2019-05-09 15:46               ` Theodore Ts'o
  1 sibling, 0 replies; 25+ messages in thread
From: Darrick J. Wong @ 2019-05-09  5:37 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Theodore Ts'o, Dave Chinner, Amir Goldstein, lsf-pc,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 09, 2019 at 12:02:17AM -0500, Vijay Chidambaram wrote:
> On Wed, May 8, 2019 at 9:30 PM Theodore Ts'o <tytso@mit.edu> wrote:
> >
> > On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> > >
> > > .... the whole point of SOMC is that it allows filesystems to avoid
> > > dragging external metadata into fsync() operations /unless/ there's
> > > a user visible ordering dependency that must be maintained between
> > > objects.  If all you are doing is stabilising file data in a stable
> > > file/directory, then independent, incremental journaling of the
> > > fsync operations on that file fit the SOMC model just fine.
> >
> > Well, that's not what Vijay's crash consistency guarantees state.  It
> > guarantees quite a bit more than what you've written above.  Which is
> > my concern.
> 
> The intention is to capture Dave's SOMC semantics. We can re-iterate
> and re-phrase until we capture what Dave meant precisely. I am fairly
> confident we can do this, given that Dave himself is participating and
> helping us refine the text. So this doesn't seem like a reason not to
> have documentation at all to me.
> 
> As we have stated multiple times on this and other threads, the
> intention is *not* to come up with one set of crash-recovery
> guarantees that every Linux file system must abide by forever. Ted,
> you keep repeating this, though we have never said this was our
> intention.

It might not be your intention, but I can definitely imagine others
using such a SOMC document as a cudgel to, uh, pressure other
filesystems into implementing the same semantics ("This isn't SOMC
compliant!" "We never said it was." "It has to be compliant!").  That's
fine for XFS because that's how it's supposed to work, but I wouldn't
want other projects to have to defend themselves for lack of XFSiness.

> The intention behind this effort is to simply document the
> crash-recovery guarantees provided today by different Linux file
> systems. Ted, you question why this is required at all, and why we
> simply can't use POSIX and man pages. The answer:
> 
> 1. POSIX is vague. Not persisting data to stable media on fsync is
> also allowed in POSIX (but no Linux file system actually does this),
> so its not very useful in terms of understanding what crash-recovery
> guarantees file systems actually provide. Given that all Linux file
> systems provide something more than POSIX, the natural question to ask
> is what do they provide? We understood this from working on
> CrashMonkey, and we wanted to document it.
> 2. Other parts of the Linux kernel have much better documentation,
> even though they similarly want to provide freedom for developers to
> optimize and change internal implementation. I don't think
> documentation and freedom to change internals are mutually exclusive.
> 3. XFS provides SOMC semantics, and btrfs developers have stated they
> want to provide SOMC as well. F2FS developers have a mode in which
> they seek to provide SOMC semantics. Given all this, it seemed prudent
> to document SOMC.

Point.  To further soften/undercut my earlier email, I think we can
document the filesystem behaviors that specific projects are willing to
endorse while still making it clear that YMMV and you had better test
your workload if you want clarity of behavior. :)

> 4. Apart from developers, a document like this would also help
> academic researchers understand the current state-of-the-art in
> crash-recovery guarantees and the different choices made by different
> file systems. It is non-trivial to understand this without
> documentation.
> 
> FWIW, I think the position of "if we don't write it down, application
> developers can't depend on it" is wrong. Even with nothing written
> down, developers noticed they could skip fsync() in ext3 when
> atomically updating files with rename(). This led to the whole ext4
> rename-and-delayed-allocation problem. The much better path, IMO, is
> to document the current set of guarantees given by different file
> systems, and talk about what is intended and what is not. This would
> give application developers much better guidance in writing
> applications.
> 
> If ext4 wants to develop incremental fsync and introduce a new set of
> semantics that is different from SOMC and much closer to minimal
> POSIX, I don't think the documentation affects that at all. As Dave
> notes, diversity is good! Documentation is also good :)
> 
> That being said, I think I'll stop our push to get this documented
> inside the Linux kernel at this point. We got useful comments from
> Dave, Amir, and others, so we will incorporate those comments and put
> up the documentation on a University of Texas web page. If someone
> else wants to carry on and get this merged, you are welcome to do so
> :)

Aww, I was going to suggest merging it with an explicit warning that the
document *only* reflects those that have endorsed it, and a pileup at
the end:

Endorsed-by: fs/xfs
Endorsed-by: fs/btrfs

Rejected-by: fs/djwongcrazyfs

(I'm still ¾ tempted to just put the XFS parts in a text file and merge
it into fs/xfs/ if the broader effort doesn't succeed...)

--D

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  1:43         ` Dave Chinner
  2019-05-09  2:20           ` Theodore Ts'o
@ 2019-05-09  8:47           ` Amir Goldstein
  1 sibling, 0 replies; 25+ messages in thread
From: Amir Goldstein @ 2019-05-09  8:47 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Theodore Ts'o, Vijay Chidambaram, lsf-pc, Darrick J. Wong,
	Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana,
	Chris Mason, lwn

On Thu, May 9, 2019 at 4:43 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Thu, May 02, 2019 at 10:30:43PM -0400, Theodore Ts'o wrote:
> > On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > > I am not saying there is no room for a document that elaborates on those
> > > guaranties. I personally think that could be useful and certainly think that
> > > your group's work for adding xfstest coverage for API guaranties is useful.
> >
> > Again, here is my concern.  If we promise that ext4 will always obey
> > Dave Chinner's SOMC model, it would forever rule out Daejun Park and
> > Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
> > latency of fsync system call"[1] published in Usenix ATC 2017.
>
> No, it doesn't rule that out at all.
>

Dave and all the good people,

Please go back to read the first email in this thread before it diverged yet
again into interpretations of SOMC.

The novelty in my proposal (which I attribute to Jan's idea) is to reduce the
concerns around documenting "expected behavior of the world" to documenting
"expected behavior of linking an O_TMPFILE".

It boils down to documenting AT_LINK_ATOMIC (or whatever flag name):

""The filesystem provided the guaranty that after a crash, if the linked
 O_TMPFILE is observed in the target directory, than all the data and
 metadata modifications made to the file before being linked are also
 observed."

No more, no less.
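
A sketch of the intended usage; AT_LINK_ATOMIC is only the proposed flag
(shown commented out since it exists nowhere yet), the rest is the
standard O_TMPFILE + linkat() pattern from open(2):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create an anonymous file in dirfd, fill it, then publish it as "name". */
int publish_file(int dirfd, const char *name, const void *buf, size_t len)
{
	char proc[64];
	int fd = openat(dirfd, ".", O_TMPFILE | O_WRONLY, 0644);

	if (fd < 0)
		return -1;
	if (write(fd, buf, len) != (ssize_t)len)
		goto fail;

	snprintf(proc, sizeof(proc), "/proc/self/fd/%d", fd);
	/* With the proposed flag, observing "name" after a crash implies the
	 * data and metadata written above are observed as well. */
	if (linkat(AT_FDCWD, proc, dirfd, name,
		   AT_SYMLINK_FOLLOW /* | AT_LINK_ATOMIC */) < 0)
		goto fail;
	close(fd);
	return 0;
fail:
	close(fd);
	return -1;
}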

I intentionally reduced the scope to the point that I could get ext4 and btrfs to
sign the treaty. I think this is a good starting point, from which we can make
forward progress.

I'd appreciate if xfs camp, Dave in particular, would address the proposal
regardless of the broader SOMC documentation discussion.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
  2019-05-09  5:02             ` Vijay Chidambaram
  2019-05-09  5:37               ` Darrick J. Wong
@ 2019-05-09 15:46               ` Theodore Ts'o
  1 sibling, 0 replies; 25+ messages in thread
From: Theodore Ts'o @ 2019-05-09 15:46 UTC (permalink / raw)
  To: Vijay Chidambaram
  Cc: Dave Chinner, Amir Goldstein, lsf-pc, Darrick J. Wong, Jan Kara,
	linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn

On Thu, May 09, 2019 at 12:02:17AM -0500, Vijay Chidambaram wrote:
> As we have stated on multiple times on this and other threads, the
> intention is *not* to come up with one set of crash-recovery
> guarantees that every Linux file system must abide by forever. Ted,
> you keep repeating this, though we have never said this was our
> intention.
> 
> The intention behind this effort is to simply document the
> crash-recovery guarantees provided today by different Linux file
> systems. Ted, you question why this is required at all, and why we
> simply can't use POSIX and man pages.

But who is this documentation targeted towards?  Who is it intended to
benefit?  Most application authors do not write applications with
specific file systems in mind.  And even if they do, they can't
control how their users are going to use it.

> FWIW, I think the position of "if we don't write it down, application
> developers can't depend on it" is wrong. Even with nothing written
> down, developers noticed they could skip fsync() in ext3 when
> atomically updating files with rename(). This lead to the whole ext4
> rename-and-delayed-allocation problem. The much better path, IMO, is
> to document the current set of guarantees given by different file
> systems, and talk about what is intended and what is not. This would
> give application developers much better guidance in writing
> applications.

If we were to provide that nuance, that would be much better, I would
agree.  It's not what the current crash consistency guarantees
draft provides, alas.  I'd also want to talk about what is guaranteed
*first*; documenting the current state of affairs, parts of which may
be subject to change and are merely accidents of the implementation, is far less
important.  So I'd prefer that "documentation of current behavior" be
the last thing in the document --- perhaps in an appendix --- and not
the headliner.

Indeed, I'd use the ext3 O_PONIES discussion as a prime example of the
risk if we were to just "document current practice" and stop there.
It's the fact that your crash consistency guarantees draft claims to
"document current practice", and at the same time, uses the word
"guarantee" which causes red flags to go up for me.

If we could separate those two, that would be very helpful.  And if
the current POSIX guarantees are too vague, my preference would be to
first determine what application authors would find more useful in
terms of stricter guarantees, and provide those guarantees as we find
them.  We can always add more guarantees later.  Taking guarantees
away is much harder.  And guarantees by definition always restrict
freedom of action, so this is an engineering tradeoff.  Let's provide
those guarantees when it actually improves application performance,
and not Just Because.

It might also be that defining new system calls, like fbarrier() and
fdatabarrier() is a better approach rather than retconning new
semantics on top of fsync().  I just think a principled design
approach is better rather than taking existing semantics and slapping
the word "guarantee" in the title of said documentation.

I will also say that I have no problems with documenting strong
metadata ordering if it has nothing to do with fsync().  That makes
sense.  The moment that you try to also bring data integrity into the
mix, and give examples of what happens if you call fsync(), it
goes beyond strong metadata ordering.  So if you want to document what
happens without fsync, ext4 can probably get on board with that.
Unfortunately, in addition to including the word "guarantee", the
current crash consistency draft also includes the word "fsync".

> 4. Apart from developers, a document like this would also help
> academic researchers understand the current state-of-the-art in
> crash-recovery guarantees and the different choices made by different
> file systems. It is non-trivial to understand this without
> documentation.

It's also very hard to understand this without taking performance
constraints and implementation choices into account.  It's trivially
easy to give super-strong crash-recovery guarantees.  But if it
sacrifices performance, is it really "state-of-the-art"?

Worse, different applications may want different guarantees, and may
want different crash consistency vs. performance tradeoffs.  This is
why in general, the concept of providing new interfaces where the
application can state more explicitly what they want is much more
appealing to me.

When I have discussions with Amir, he doesn't just want strong
guarantees; he wants specific guarantees with zero overhead, and our
discussions have been about how we manage that tension between those
two goals.  And it's much easier to achieve this in terms of very
specific cases, such as what happens when you link an O_TMPFILE file
into a directory.

Cheers,

   		     	      	 	   	- Ted

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-05-09 15:47 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein
2019-05-02 16:12 ` Amir Goldstein
2019-05-02 17:11   ` Vijay Chidambaram
2019-05-02 17:39     ` Amir Goldstein
2019-05-03  2:30       ` Theodore Ts'o
2019-05-03  3:15         ` Vijay Chidambaram
2019-05-03  9:45           ` Theodore Ts'o
2019-05-04  0:17             ` Vijay Chidambaram
2019-05-04  1:43               ` Theodore Ts'o
2019-05-07 18:38                 ` Jan Kara
2019-05-03  4:16         ` Amir Goldstein
2019-05-03  9:58           ` Theodore Ts'o
2019-05-03 14:18             ` Amir Goldstein
2019-05-09  2:36             ` Dave Chinner
2019-05-09  1:43         ` Dave Chinner
2019-05-09  2:20           ` Theodore Ts'o
2019-05-09  2:58             ` Dave Chinner
2019-05-09  3:31               ` Theodore Ts'o
2019-05-09  5:19                 ` Darrick J. Wong
2019-05-09  5:02             ` Vijay Chidambaram
2019-05-09  5:37               ` Darrick J. Wong
2019-05-09 15:46               ` Theodore Ts'o
2019-05-09  8:47           ` Amir Goldstein
2019-05-02 21:05   ` Darrick J. Wong
2019-05-02 22:19     ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).