* [TOPIC] Extending the filesystem crash recovery guaranties contract @ 2019-04-27 21:00 Amir Goldstein 2019-05-02 16:12 ` Amir Goldstein 0 siblings, 1 reply; 25+ messages in thread From: Amir Goldstein @ 2019-04-27 21:00 UTC (permalink / raw) To: lsf-pc Cc: Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Vijaychidambaram Velayudhan Pillai, Filipe Manana

Suggestion for another filesystems track topic.

Some of you may remember the emotional(?) discussions that ensued when the crashmonkey developers embarked on a mission to document and verify filesystem crash recovery guaranties: https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/

There are two camps among filesystem developers, and each camp has good arguments both for wanting to document existing behavior and for not wanting to document anything beyond "use fsync if you want any guaranty".

I would like to take a suggestion proposed by Jan on a related discussion: https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/ and make a proposal that may be able to meet the concerns of both camps.

The proposal is to add new APIs which communicate the crash consistency requirements of the application to the filesystem. An example API could look like this: renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER). It's just an example. The API could take another form and may need more barrier types (I proposed to use new sync_file_range() flags).

The idea is simple, though: METADATA_BARRIER means all the inode metadata will be observed after a crash if the rename is observed after the crash. DATA_BARRIER means the same for file data. We may also want an "ALL_METADATA_BARRIER" and/or "METADATA_DEPENDENCY_BARRIER" to more accurately describe what SOMC guaranties actually provide today.

The implementation is also simple.
Filesystems that currently have SOMC behavior don't need to do anything to respect METADATA_BARRIER and only need to call filemap_write_and_wait_range() to respect DATA_BARRIER. Filesystem developers are thus not tying their hands w.r.t. future performance optimizations for operations that do not explicitly request a barrier.

Thanks, Amir.

^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein @ 2019-05-02 16:12 ` Amir Goldstein 2019-05-02 17:11 ` Vijay Chidambaram 2019-05-02 21:05 ` Darrick J. Wong 0 siblings, 2 replies; 25+ messages in thread From: Amir Goldstein @ 2019-05-02 16:12 UTC (permalink / raw) To: lsf-pc, Dave Chinner, Darrick J. Wong Cc: Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Vijaychidambaram Velayudhan Pillai, Filipe Manana, Chris Mason, lwn On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote: > > Suggestion for another filesystems track topic. > > Some of you may remember the emotional(?) discussions that ensued > when the crashmonkey developers embarked on a mission to document > and verify filesystem crash recovery guaranties: > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/ > > There are two camps among filesystem developers and every camp > has good arguments for wanting to document existing behavior and for > not wanting to document anything beyond "use fsync if you want any guaranty". > > I would like to take a suggestion proposed by Jan on a related discussion: > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/ > > and make a proposal that may be able to meet the concerns of > both camps. > > The proposal is to add new APIs which communicate > crash consistency requirements of the application to the filesystem. > > Example API could look like this: > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER) > It's just an example. The API could take another form and may need > more barrier types (I proposed to use new file_sync_range() flags). > > The idea is simple though. > METADATA_BARRIER means all the inode metadata will be observed > after crash if rename is observed after crash. 
> DATA_BARRIER same for file data. > We may also want a "ALL_METADATA_BARRIER" and/or > "METADATA_DEPENDENCY_BARRIER" to more accurately > describe what SOMC guaranties actually provide today. > > The implementation is also simple. filesystem that currently > have SOMC behavior don't need to do anything to respect > METADATA_BARRIER and only need to call > filemap_write_and_wait_range() to respect DATA_BARRIER. > filesystem developers are thus not tying their hands w.r.t future > performance optimizations for operations that are not explicitly > requesting a barrier. >

An update: Following the LSF session on $SUBJECT I had a discussion with Ted, Jan and Chris. We were all in agreement that linking an O_TMPFILE into the namespace is probably already perceived by users as the barrier/atomic operation that I am trying to describe. So at least the maintainers of btrfs/ext4/ext2 are sympathetic to the idea of providing the required semantics when linking an O_TMPFILE *as long* as the semantics are properly documented.

This is what the open(2) man page has to say right now: * Creating a file that is initially invisible, which is then populated with data and adjusted to have appropriate filesystem attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being atomically linked into the filesystem in a fully formed state (using linkat(2) as described above).

The phrase that I would like to add (probably in the link(2) man page) is: "The filesystem provides the guaranty that after a crash, if the linked O_TMPFILE is observed in the target directory, then all the data and metadata modifications made to the file before being linked are also observed."

For some filesystems, btrfs in particular, that would mean an implicit fsync on the linked inode. On other filesystems, ext4/xfs in particular, that would only require at least committing delayed allocations, but would NOT require an inode fsync nor a journal commit/flushing of disk caches.
I would like to hear the opinion of XFS developers and filesystem maintainers who did not attend the LSF session.

I have no objection to adding an opt-in LINK_ATOMIC flag and passing it down to filesystems instead of changing behavior and patching stable kernels, but I prefer the latter. I believe this should have been the semantics to begin with, if for no other reason than that users would expect it regardless of whatever we write in the manual page and no matter how many !!!!!!!! we use for disclaimers. And if we can all agree on that, then O_TMPFILE is quite young in historical perspective, so it is not too late to call the expectation gap a bug and fix it.(?)

Taking this another step forward: if we agree on the language I used above to describe the expected behavior, then we can add an opt-in RENAME_ATOMIC flag to provide the same semantics and document it in the same manner (this functionality is needed for directories and non-regular files), and all that is left is the fun part of choosing the flag name ;-)

Thanks, Amir.
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-02 16:12 ` Amir Goldstein @ 2019-05-02 17:11 ` Vijay Chidambaram 2019-05-02 17:39 ` Amir Goldstein 2019-05-02 21:05 ` Darrick J. Wong 1 sibling, 1 reply; 25+ messages in thread From: Vijay Chidambaram @ 2019-05-02 17:11 UTC (permalink / raw) To: Amir Goldstein Cc: lsf-pc, Dave Chinner, Darrick J. Wong, Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn Thank you for driving this discussion Amir. I'm glad ext4 and btrfs developers want to provide these semantics. If I'm understanding this correctly, the new semantics will be: any data changes to files written with O_TMPFILE will be visible if the associated metadata is also visible. Basically, there will be a barrier between O_TMPFILE data and O_TMPFILE metadata. The expectation is that applications will use this, and then rename the O_TMPFILE file over the original file. Is this correct? If so, is there also an implied barrier between O_TMPFILE metadata and the rename? Where does this land us on the discussion about documenting file-system crash-recovery guarantees? Has that been deemed not necessary? Thanks, Vijay Chidambaram http://www.cs.utexas.edu/~vijay/ On Thu, May 2, 2019 at 11:12 AM Amir Goldstein <amir73il@gmail.com> wrote: > > On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > Suggestion for another filesystems track topic. > > > > Some of you may remember the emotional(?) discussions that ensued > > when the crashmonkey developers embarked on a mission to document > > and verify filesystem crash recovery guaranties: > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/ > > > > There are two camps among filesystem developers and every camp > > has good arguments for wanting to document existing behavior and for > > not wanting to document anything beyond "use fsync if you want any guaranty". 
> > > > I would like to take a suggestion proposed by Jan on a related discussion: > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/ > > > > and make a proposal that may be able to meet the concerns of > > both camps. > > > > The proposal is to add new APIs which communicate > > crash consistency requirements of the application to the filesystem. > > > > Example API could look like this: > > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER) > > It's just an example. The API could take another form and may need > > more barrier types (I proposed to use new file_sync_range() flags). > > > > The idea is simple though. > > METADATA_BARRIER means all the inode metadata will be observed > > after crash if rename is observed after crash. > > DATA_BARRIER same for file data. > > We may also want a "ALL_METADATA_BARRIER" and/or > > "METADATA_DEPENDENCY_BARRIER" to more accurately > > describe what SOMC guaranties actually provide today. > > > > The implementation is also simple. filesystem that currently > > have SOMC behavior don't need to do anything to respect > > METADATA_BARRIER and only need to call > > filemap_write_and_wait_range() to respect DATA_BARRIER. > > filesystem developers are thus not tying their hands w.r.t future > > performance optimizations for operations that are not explicitly > > requesting a barrier. > > > > An update: Following the LSF session on $SUBJECT I had a discussion > with Ted, Jan and Chris. > > We were all in agreement that linking an O_TMPFILE into the namespace > is probably already perceived by users as the barrier/atomic operation that > I am trying to describe. > > So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of > providing the required semantics when linking O_TMPFILE *as long* as > the semantics are properly documented. 
> > This is what open(2) man page has to say right now: > > * Creating a file that is initially invisible, which is then > populated with data > and adjusted to have appropriate filesystem attributes (fchown(2), > fchmod(2), fsetxattr(2), etc.) before being atomically linked into the > filesystem in a fully formed state (using linkat(2) as described above). > > The phrase that I would like to add (probably in link(2) man page) is: > "The filesystem provided the guaranty that after a crash, if the linked > O_TMPFILE is observed in the target directory, than all the data and > metadata modifications made to the file before being linked are also > observed." > > For some filesystems, btrfs in farticular, that would mean an implicit > fsync on the linked inode. On other filesystems, ext4/xfs in particular > that would only require at least committing delayed allocations, but > will NOT require inode fsync nor journal commit/flushing disk caches. > > I would like to hear the opinion of XFS developers and filesystem > maintainers who did not attend the LSF session. > > I have no objection to adding an opt-in LINK_ATOMIC flag > and pass it down to filesystems instead of changing behavior and > patching stable kernels, but I prefer the latter. > > I believe this should have been the semantics to begin with > if for no other reason, because users would expect it regardless > of whatever we write in manual page and no matter how many > !!!!!!!! we use for disclaimers. > > And if we can all agree on that, then O_TMPFILE is quite young > in historic perspective, so not too late to call the expectation gap > a bug and fix it.(?) 
> > Taking this another step forward, if we agree on the language > I used above to describe the expected behavior, then we can > add an opt-in RENAME_ATOMIC flag to provide the same > semantics and document it in the same manner (this functionality > is needed for directories and non regular files) and all there is left > is the fun part of choosing the flag name ;-) > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-02 17:11 ` Vijay Chidambaram @ 2019-05-02 17:39 ` Amir Goldstein 2019-05-03 2:30 ` Theodore Ts'o 0 siblings, 1 reply; 25+ messages in thread From: Amir Goldstein @ 2019-05-02 17:39 UTC (permalink / raw) To: Vijay Chidambaram Cc: lsf-pc, Dave Chinner, Darrick J. Wong, Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 2, 2019 at 1:11 PM Vijay Chidambaram <vijay@cs.utexas.edu> wrote: > > Thank you for driving this discussion Amir. I'm glad ext4 and btrfs > developers want to provide these semantics. > > If I'm understanding this correctly, the new semantics will be: any > data changes to files written with O_TMPFILE will be visible if the > associated metadata is also visible. Basically, there will be a > barrier between O_TMPFILE data and O_TMPFILE metadata.

Mmm, this phrasing deviates from what I wrote. The agreement is that we should document something *minimal* that users can understand. I was hoping that this phrasing meets those requirements: "The filesystem provides the guaranty that after a crash, if the linked O_TMPFILE is observed in the target directory, then all the data and metadata modifications made to the file before being linked are also observed." No more, no less.

> > The expectation is that applications will use this, and then rename > the O_TMPFILE file over the original file. Is this correct? If so, is > there also an implied barrier between O_TMPFILE metadata and the > rename?

Not really; the use case is when users want to create a file to appear "atomically" in the namespace with certain data and metadata. For replacing an existing file with another, the same could be achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname, RENAME_ATOMIC). There is no need to create the tempname file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC flag would be redundant.
The RENAME_ATOMIC flag is needed because directories and non-regular files cannot be created using O_TMPFILE.

> > Where does this land us on the discussion about documenting > file-system crash-recovery guarantees? Has that been deemed not > necessary? >

Can't say for sure. Some filesystem maintainers hold on to the opinion that they do NOT wish to have a document describing the existing behavior of specific filesystems, which makes up large parts of the document that your group posted. They would rather that only the guaranties of the APIs be documented, and those should already be documented in man pages anyway; if they are not, the man pages could be improved.

I am not saying there is no room for a document that elaborates on those guaranties. I personally think that could be useful, and certainly think that your group's work for adding xfstest coverage for API guaranties is useful.

Thanks, Amir.
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-02 17:39 ` Amir Goldstein @ 2019-05-03 2:30 ` Theodore Ts'o 2019-05-03 3:15 ` Vijay Chidambaram ` (2 more replies) 0 siblings, 3 replies; 25+ messages in thread From: Theodore Ts'o @ 2019-05-03 2:30 UTC (permalink / raw) To: Amir Goldstein Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote: > > The expectation is that applications will use this, and then rename > > the O_TMPFILE file over the original file. Is this correct? If so, is > > there also an implied barrier between O_TMPFILE metadata and the > > rename?

In the case of O_TMPFILE, the file can be brought into the namespace using something like: linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW); it's not using rename.

To be clear, this discussion happened in the hallway, and it's not clear it had full support by everyone. After our discussion, some of us came up with an example where forcing a call to filemap_write_and_wait() before the linkat(2) might *not* be the right thing. Suppose some browser wanted to wait until a file was fully downloaded before letting it appear in the directory --- but what was being downloaded was a 4 GiB DVD image (say, a distribution's install media). If the download was done using O_TMPFILE followed by linkat(2), that might be a case where forcing the data blocks to disk before allowing the linkat(2) to proceed might not be what the application or the user would want. So it might be that we will need to add a linkat flag to indicate that we want the kernel to call filemap_write_and_wait() before making the metadata changes in linkat(2).

> For replacing an existing file with another the same could be > achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname, > RENAME_ATOMIC).
There is no need to create the tempname > file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC > flag would be redundant. > > RENAME_ATOMIC flag is needed because directories and non regular > files cannot be created using O_TMPFILE.

I think there's much less consensus about this. Again, most of this happened in a hallway conversation.

> > Where does this land us on the discussion about documenting > > file-system crash-recovery guarantees? Has that been deemed not > > necessary? > > Can't say for sure. > Some filesystem maintainers hold on to the opinion that they do > NOT wish to have a document describing existing behavior of specific > filesystems, which is large parts of the document that your group posted. > > They would rather that only the guaranties of the APIs are documented > and those should already be documented in man pages anyway - if they > are not, man pages could be improved. > > I am not saying there is no room for a document that elaborates on those > guaranties. I personally think that could be useful and certainly think that > your group's work for adding xfstest coverage for API guaranties is useful.

Again, here is my concern. If we promise that ext4 will always obey Dave Chinner's SOMC model, it would forever rule out Daejun Park and Dongkun Shin's "iJournaling: Fine-grained journaling for improving the latency of fsync system call"[1] published in Usenix ATC 2017. [1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's because this provides a fast fsync() using an incremental journal. This fast fsync would cause the metadata associated with the inode being fsync'ed to be persisted after the crash --- ahead of metadata changes to other, potentially completely unrelated files, which would *not* be persisted after the crash. Fine grained journalling would provide all of the guarantees of POSIX, and applications that only care about the single file being fsync'ed would be happy.
BUT, it violates the proposed crash consistency guarantees.

So if the crash consistency guarantees forbid future innovations where applications might *want* a fast fsync() that doesn't drag unrelated inodes into the persistence guarantees, is that really what we want? Do we want to forever rule out various academic investigations such as Park and Shin's because "it violates the crash consistency recovery model"? Especially if some applications don't *need* the crash consistency model?

- Ted

P.S. I feel especially strong about this because I'm working with an engineer currently trying to implement a simplified version of Park and Shin's proposal... So this is not a hypothetical concern of mine. I'd much rather not invalidate all of this engineer's work to date, especially since there is a published paper demonstrating that for some workloads (such as sqlite), this approach can be a big win.

P.P.S. One of the other discussions that did happen during the main LSF/MM File system session, and for which there was general agreement across a number of major file system maintainers, was a fsync2() system call which would take a list of file descriptors (and flags) that should be fsync'ed. The semantics would be that when the fsync2() successfully returns, all of the guarantees of fsync() or fdatasync() requested by the list of file descriptors and flags would be satisfied. This would allow file systems to more optimally fsync a batch of files, for example by implementing data integrity writebacks for all of the files, followed by a single journal commit to guarantee persistence for all of the metadata changes.
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 2:30 ` Theodore Ts'o @ 2019-05-03 3:15 ` Vijay Chidambaram 2019-05-03 9:45 ` Theodore Ts'o 2019-05-03 4:16 ` Amir Goldstein 2019-05-09 1:43 ` Dave Chinner 2 siblings, 1 reply; 25+ messages in thread From: Vijay Chidambaram @ 2019-05-03 3:15 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn > Again, here is my concern. If we promise that ext4 will always obey > Dave Chinner's SOMC model, it would forever rule out Daejun Park and > Dongkun Shin's "iJournaling: Fine-grained journaling for improving the > latency of fsync system call"[1] published in Usenix ATC 2017. > > [1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf > > That's because this provides a fast fsync() using an incremental > journal. This fast fsync would cause the metadata associated with the > inode being fsync'ed to be persisted after the crash --- ahead of > metadata changes to other, potentially completely unrelated files, > which would *not* be persisted after the crash. Fine grained > journalling would provide all of the guarantee all of the POSIX, and > for applications that only care about the single file being fsync'ed > -- they would be happy. BUT, it violates the proposed crash > consistency guarantees. > > So if the crash consistency guarantees forbids future innovations > where applications might *want* a fast fsync() that doesn't drag > unrelated inodes into the persistence guarantees, is that really what > we want? Do we want to forever rule out various academic > investigations such as Park and Shin's because "it violates the crash > consistency recovery model"? Especially if some applications don't > *need* the crash consistency model? > > - Ted > > P.S. 
> I feel especially strong about this because I'm working with an > engineer currently trying to implement a simplified version of Park > and Shin's proposal... So this is not a hypothetical concern of mine. > I'd much rather not invalidate all of this engineer's work to date, > especially since there is a published paper demonstrating that for > some workloads (such as sqlite), this approach can be a big win.

Ted, I sympathize with your position. To be clear, this is not what my group or Amir is suggesting we do.

A few things to clarify: 1) We are not suggesting that all file systems follow SOMC semantics. If ext4 does not want to do so, we are quite happy to document that ext4 provides a different set of reasonable semantics. We can make the ext4-related documentation as minimal as you want (or drop ext4 from the documentation entirely). I'm hoping this will satisfy you. 2) As I understand it, SOMC does not rule out the scenario in your example, because it does not require fsync to push unrelated files to storage. 3) We are not documenting how fsync works internally, merely what the user-visible behavior is. I think this will actually free up file systems to optimize fsync aggressively while making sure they provide the required user-visible behavior.

Quoting from Dave Chinner's response when you brought up this concern previously (https://patchwork.kernel.org/patch/10849903/#22538743):

"Sure, but again this is orthogonal to what we are discussing here: the user visible ordering of metadata operations after a crash. If anyone implements a multi-segment or per-inode journal (say, like NOVA), then it is up to that implementation to maintain the ordering guarantees that a SOMC model requires. You can implement whatever fsync() go-fast bits you want, as long as it provides the ordering behaviour guarantees that the model defines. IOWs, Ted, I think you have the wrong end of the stick here.
This isn't about optimising fsync() to provide better performance, it's about guaranteeing order so that fsync() is not necessary and we improve performance by allowing applications to omit order-only synchronisation points in their workloads. i.e. an order-based integrity model /reduces/ the need for a hyper-optimised fsync operation because applications won't need to use it as often."

> P.P.S. One of the other discussions that did happen during the main > LSF/MM File system session, and for which there was general agreement > across a number of major file system maintainers, was a fsync2() > system call which would take a list of file descriptors (and flags) > that should be fsync'ed. The semantics would be that when the > fsync2() successfully returns, all of the guarantees of fsync() or > fdatasync() requested by the list of file descriptors and flags would > be satisfied. This would allow file systems to more optimally fsync a > batch of files, for example by implementing data integrity writebacks > for all of the files, followed by a single journal commit to guarantee > persistence for all of the metadata changes.

I like this "group fsync" idea. I think this is a great way to extend the basic fsync interface.
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 3:15 ` Vijay Chidambaram @ 2019-05-03 9:45 ` Theodore Ts'o 2019-05-04 0:17 ` Vijay Chidambaram 0 siblings, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2019-05-03 9:45 UTC (permalink / raw) To: Vijay Chidambaram Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 02, 2019 at 10:15:01PM -0500, Vijay Chidambaram wrote: > > A few things to clarify: > 1) We are not suggesting that all file systems follow SOMC semantics. > If ext4 does not want to do so, we are quite happy to document ext4 > provides a different set of reasonable semantics. We can make the > ext4-related documentation as minimal as you want (or drop ext4 from > documentation entirely). I'm hoping this will satisfy you. > 2) As I understand it, I do not think SOMC rules out the scenario in > your example, because it does not require fsync to push un-related > files to storage. > 3) We are not documenting how fsync works internally, merely what the > user-visible behavior is. I think this will actually free up file > systems to optimize fsync aggressively while making sure they provide > the required user-visible behavior.

As documented, the draft of the rules *I* saw specifically said that a fsync() to inode B would guarantee that metadata changes for inode A, which were made before the changes to inode B, would be persisted to disk since the metadata changes for B happened after the changes to inode A. It used fsync(2) *explicitly* as an example for how ordering of unrelated files could be guaranteed. And this would invalidate Park and Shin's incremental journal for fsync.

If the guarantees apply when fsync(2) is *not* being used, sure, then the SOMC model is naturally what would happen with most common file systems.
But then fsync(2) needs to appear nowhere in the crash consistency model description, and that is not the case today. Best regards, - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 9:45 ` Theodore Ts'o @ 2019-05-04 0:17 ` Vijay Chidambaram 2019-05-04 1:43 ` Theodore Ts'o 0 siblings, 1 reply; 25+ messages in thread From: Vijay Chidambaram @ 2019-05-04 0:17 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn > As documented, the draft of the rules *I* saw specifically said that a > fsync() to inode B would guarantee that metadata changes for inode A, > which were made before the changes to inode B, would be persisted to > disk since the metadata changes for B happened after the changes to > inode A. It used the fsync(2) *explicitly* as an example for how > ordering of unrelated files could be guaranteed. And this would > invalidate Park and Shin's incremental journal for fsync. > > If the guarantees are when fsync(2) is *not* being used, sure, then > the SOMC model is naturally what would happen with most common file > system. But then fsync(2) needs to appear nowhere in the crash > consistency model description, and that is not the case today. >

I think there might be a misunderstanding about the example (reproduced below) and about SOMC. The relationship that matters is not whether X happens before Y. The relationship between X and Y is that they are in the same directory, so fsync(new file X) implies fsync(X's parent directory) which contains Y. In the example, X is A/foo and Y is A/bar. For truly unrelated files such as A/foo and B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar.

touch A/foo
echo "hello" > A/foo
sync
mv A/foo A/bar
echo "world" > A/foo
fsync A/foo
CRASH

We could rewrite the example to not include fsync, but this example comes directly from xfstest generic/342, so we would like to preserve it. But in any case, I think this is beside the point.
If ext4 does not want to provide SOMC-like behavior, I think that is totally reasonable. The documentation does *not* say all file systems should provide SOMC. As long as the documentation does not say ext4 provides SOMC-like behavior, are you okay with the rest of the documentation effort? If so, we can send out v3 with these changes. Please forgive my continued pushing on this: I would like to see more documentation about these file-system aspects in the kernel. XFS and btrfs developers approved of the effort, so there is some support for this. We have already put in some work on the documentation, so I'd like to see it finished up and merged. (Sorry for hijacking/forking the thread Amir!) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-04 0:17 ` Vijay Chidambaram @ 2019-05-04 1:43 ` Theodore Ts'o 2019-05-07 18:38 ` Jan Kara 0 siblings, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2019-05-04 1:43 UTC (permalink / raw) To: Vijay Chidambaram Cc: Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Fri, May 03, 2019 at 07:17:54PM -0500, Vijay Chidambaram wrote: > > I think there might be a mis-understanding about the example > (reproduced below) and about SOMC. The relationship that matters is > not whether X happens before Y. The relationship between X and Y is > that they are in the same directory, so fsync(new file X) implies > fsync(X's parent directory) which contains Y. In the example, X is > A/foo and Y is A/bar. For truly un-related files such as A/foo and > B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar.

When you say "X and Y are in the same directory", how does this apply in the face of hard links? Remember, file X might be in 100 different directories. Does that mean if changes to file X are visible after a crash, all files Y in any of X's 100 containing directories that were modified before X must have their changes be visible after the crash? I suspect that the SOMC as articulated by Dave does make such global guarantees.

Certainly without Park and Shin's incremental fsync, unrelated files will have the property that if A/foo is modified after B/bar, and A/foo's metadata changes are visible after a crash, B/bar's metadata changes will also be visible. This is true for ext4 and xfs. Even if we ignore the hard link problem, and assume that it only applies for files foo and bar with st_nlink == 1, the crash consistency guarantees you've described will *still* rule out Park and Shin's incremental fsync.
So depending on whether ext4 has fast fsyncs enabled, we might or might not have behavior consistent with your proposed crash consistency rules. But at this point, even if we promulgate these "guarantees" in a kernel documentation file, applications won't be able to depend on them. If they do, they will be unreliable depending on which file system they use; so they won't be particularly useful for application authors who care about portability. (Or worse, for users who are under the delusion that the application authors care about portability, and who will be subject to data corruption after a crash.) Do we *really* want to be promulgating these semantics to application authors? Finally, I'll note that generic/342 is much more specific, and your proposed crash consistency rule is more general. # Test that if we rename a file, create a new file that has the old name of the # other file and is a child of the same parent directory, fsync the new inode, # power fail and mount the filesystem, we do not lose the first file and that # file has the name it was renamed to. > touch A/foo > echo "hello" > A/foo > sync > mv A/foo A/bar > echo "world" > A/foo > fsync A/foo > CRASH Sure, that's one that fast commit will honor. But what about: echo "world" > A/foo echo "hello" > A/bar chmod 755 A/bar sync chmod 750 A/bar echo "new world" >> A/foo fsync A/foo CRASH .... will your crash consistency rules guarantee that the permissions change for A/bar is visible after the fsync of A/foo? Or if A/foo and A/bar exist, and we do: echo "world" > A/foo echo "hello" > A/bar sync mv A/bar A/quux echo "new world" >> A/foo fsync A/foo CRASH ... is the rename of A/bar to A/quux guaranteed to be visible after the crash? With Park and Shin's incremental fsync journal, the two cases I've described above would *not* have such guarantees. Standard ext4 today would in fact have these guarantees. 
But I would consider this an accident of the implementation, and *not* a promise that I would want to make for all time, precisely because it forbids us from making innovations that might improve performance. Even if I didn't have an engineer working on implementing Park and Shin's proposal, what worries me is that if I did make this guarantee, it would tie my hands from making this optimization in the future --- and I can't necessarily foresee all possible optimizations we might want to make in the future. So the question I'm trying to ask is how many applications will actually benefit from "documenting current behavior" and effectively turning this into a promise for all time? Ultimately this is a tradeoff. Sure, this might enable applications to do things that are more aggressive than what Posix guarantees; but it also ties the hands of file system engineers. This is why I'd much rather do this via new system calls; say, maybe something like fsync_with_barrier(fd). This can degrade to fsync(fd) if necessary, but it allows the application to explicitly request certain semantics, as opposed to encouraging applications to *assume* that certain magic side effects will be there --- and which might not be true for all file systems, or for all time. We still need to very carefully define what the semantics of fsync_with_barrier(fd) would be --- especially whether fsync_with_barrier(fd) provides local (within the same directory) or global barrier guarantees, and if it's local, how files with multiple "parent directories" interact with the guarantees. But at least this way it's an explicit declaration of what the application wants, and not an implicit one. Cheers, - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-04 1:43 ` Theodore Ts'o @ 2019-05-07 18:38 ` Jan Kara 0 siblings, 0 replies; 25+ messages in thread From: Jan Kara @ 2019-05-07 18:38 UTC (permalink / raw) To: Theodore Ts'o Cc: Vijay Chidambaram, Amir Goldstein, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Fri 03-05-19 21:43:07, Theodore Ts'o wrote: > On Fri, May 03, 2019 at 07:17:54PM -0500, Vijay Chidambaram wrote: > > > > I think there might be a mis-understanding about the example > > (reproduced below) and about SOMC. The relationship that matters is > > not whether X happens before Y. The relationship between X and Y is > > that they are in the same directory, so fsync(new file X) implies > > fsync(X's parent directory) which contains Y. In the example, X is > > A/foo and Y is A/bar. For truly un-related files such as A/foo and > > B/bar, SOMC does indeed allow fsync(A/foo) to not persist B/bar. > > When you say "X and Y are in the same directory", how does this apply > in the face of hard links? Remember, file X might be in a 100 > different directories. Does that mean if changes to file X is visible > after a crash, all files Y in any of X's 100 containing directories > that were modified before X must have their changes be visible after > the crash? > > I suspect that the SOMC as articulated by Dave does make such global > guarantees. Certainly without Park and Shin's incremental fsync, > unrelated files will have the property that if A/foo is modified after > B/bar, and B/bar's metadata changes are visible after a crash, A/foo's > metadata will also be visible. This is true for ext4, and xfs. > > Even if we ignore the hard link problem, and assume that it only > applies for files foo and bar with st_nlinks == 1, the crash > consistency guarantees you've described will *still* rule out Park and > Shin's increment fsync. 
So depending on whether ext4 has fast fsync's > enabled, we might or might not have behavior consistency with your > proposed crash consistency rules. > > But at this point, even if we promulgate these "guarantees" in a > kernel documentation file, applications won't be able to depend on > them. If they do, they will be unreliable depending on which file > system they use; so they won't be particularly useful for application > authors care about portability. (Or worse, for users who are under > the delusion that the application authors care about portability, and > who will be subject to data corruption after a crash.) Do we *really* > want to be promulgating these semantics to application authors? I agree that having fs-specific promises for crash consistency is bad. The application would have to detect what filesystem it is running on and based on that issue fsync or not. I don't think many applications will get this right, so IMO it would result in more problems in case of a crash, not fewer. > Finally, I'll note that generic/342 is much more specific, and your > proposed crash consistency rule is more general. > > # Test that if we rename a file, create a new file that has the old name of the > # other file and is a child of the same parent directory, fsync the new inode, > # power fail and mount the filesystem, we do not lose the first file and that > # file has the name it was renamed to. > > > touch A/foo > > echo "hello" > A/foo > > sync > > mv A/foo A/bar > > echo "world" > A/foo > > fsync A/foo > > CRASH > > Sure, that's one that fast commit will honor. Hum, but will this also be honored in the case of hardlinks? E.g. echo "hello" >A/foo ln A/foo B/foo sync mv A/foo A/bar mv B/foo B/bar echo "world" >A/foo fsync A/foo Will you also persist changes to B? If not, will you persist them if we do 'ln A/foo B/foo' before 'fsync A/foo'? I'm just wondering where you draw the borderline if you actually do care about namespace changes in addition to inode + its metadata... 
> But what about: > > echo "world" > A/foo > echo "hello" > A/bar > chmod 755 A/bar > sync > chmod 750 A/bar > echo "new world" >> A/foo > fsync A/foo > CRASH > > .... will your crash consistency rules guarantee that the permissions > change for A/bar is visible after the fsync of A/foo? > > Or if A/foo and A/bar exists, and we do: > > echo "world" > A/foo > echo "hello" > A/bar > sync > mv A/bar A/quux > echo "new world" >> A/foo > fsync A/foo > CRASH > > ... is the rename of A/bar and A/quux guaranteed to be visible after > the crash? And I agree that guaranteeing ordering between operations on unrelated names in the same directory and operations on an inode in that directory does not seem very useful. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 2:30 ` Theodore Ts'o 2019-05-03 3:15 ` Vijay Chidambaram @ 2019-05-03 4:16 ` Amir Goldstein 2019-05-03 9:58 ` Theodore Ts'o 2019-05-09 1:43 ` Dave Chinner 2 siblings, 1 reply; 25+ messages in thread From: Amir Goldstein @ 2019-05-03 4:16 UTC (permalink / raw) To: Theodore Ts'o Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 2, 2019 at 10:30 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote: > > > The expectation is that applications will use this, and then rename > > > the O_TMPFILE file over the original file. Is this correct? If so, is > > > there also an implied barrier between O_TMPFILE metadata and the > > > rename? > > In the case of O_TMPFILE, the file can be brought into the namespace > using something like: > > linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW); > > it's not using rename. > > To be clear, this discussion happened in the hallway, and it's not > clear it had full support by everyone. After our discussion, some of > us came up with an example where forcing a call to > filemap_write_and_wait() before the linkat(2) might *not* be the right > thing. Suppose some browser wanted to wait until a file was fully > downloaded before letting it appear in the directory --- but what was > being downloaded was a 4 GiB DVD image (say, a distribution's install > media). If the download was done using O_TMPFILE followed by > linkat(2), that might be a case where forcing the data blocks to disk > before allowing the linkat(2) to proceed might not be what the > application or the user would want. > > So it might be that we will need to add a linkat flag to indicate that > we want the kernel to call filemap_write_and_wait() before making the > metadata changes in linkat(2). > Agreed. 
> > For replacing an existing file with another the same could be > > achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname, > > RENAME_ATOMIC). There is no need to create the tempname > > file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC > > flag would be redundant. > > > > RENAME_ATOMIC flag is needed because directories and non regular > > files cannot be created using O_TMPFILE. > > I think there's much less consensus about this. Again, most of this > happened in a hallway conversation. > OK. We can leave that one for later. Although I am not sure what the concern is. If we are able to agree on and document a LINK_ATOMIC flag, what would be the downside of documenting a RENAME_ATOMIC flag with the same semantics? After all, as I said, this is what many users already expect when renaming a temp file (as ext4's heuristics prove). I would love to get Dave's take on the proposal of LINK_ATOMIC/RENAME_ATOMIC, preferably before the discussion wanders off into an argument about what SOMC means... Thanks, Amir. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 4:16 ` Amir Goldstein @ 2019-05-03 9:58 ` Theodore Ts'o 2019-05-03 14:18 ` Amir Goldstein 2019-05-09 2:36 ` Dave Chinner 0 siblings, 2 replies; 25+ messages in thread From: Theodore Ts'o @ 2019-05-03 9:58 UTC (permalink / raw) To: Amir Goldstein Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote: > OK. we can leave that one for later. > Although I am not sure what the concern is. > If we are able to agree and document a LINK_ATOMIC flag, > what would be the down side of documenting a RENAME_ATOMIC > flag with same semantics? After all, as I said, this is what many users > already expect when renaming a temp file (as ext4 heuristics prove). The problem is that if the "temp file" has been hardlinked into 1000 different directories, does the rename() have to guarantee that the changes to all 1000 directories have been persisted to disk? And that all of the parent directories of those 1000 directories have also *all* been persisted to disk, all the way up to the root? With the O_TMPFILE linkat case, we know that inode hasn't been hard-linked to any other directory, and mercifully directories have only one parent directory, so we only have to ensure that one set of directory inodes, all the way up to the root, has been persisted. But.... I can already imagine someone complaining that if due to bind mounts and 1000 mount namespaces, there is some *other* directory pathname which could be used to reach said "tmpfile", we have to guarantee that all parent directories which could be used to reach said "tmpfile" even if they span a dozen different file systems, *also* have to be persisted due to sloppy drafting of what the atomicity rules might happen to be. 
If we are only guaranteeing the persistence of the containing directories of the source and destination files, that's pretty easy. But then the consistency rules need to *explicitly* state this. Some of the handwaving definitions of what would be guaranteed.... scare me. - Ted P.S. If we were going to do this, we'd probably want to simply define a flag to be AT_FSYNC, using the strict POSIX definition of fsync, which is to say, as a result of the linkat or renameat, the file in question, and its associated metadata, are guaranteed to be persisted to disk. No guarantees would be made about any other inode's metadata, regardless of when those changes might have been made. If people really want "global barrier" semantics, then perhaps it would be better to simply define a barrierfs(2) system call that works like syncfs(2) --- it applies to the whole file system, and guarantees that all changes made *before* barrierfs(2) will be visible if any changes made after barrierfs(2) are visible. Amir, you used "global ordering" a few times; if you really need that, let's define a new system call which guarantees that. Maybe some of the research proposals for exotic changes to SSD semantics, etc., would allow barrierfs(2) semantics to be something that we could implement more efficiently than syncfs(2). But let's make this be explicit, as opposed to some magic guarantee that falls out as a side effect of the fsync(2) system call to a single inode. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 9:58 ` Theodore Ts'o @ 2019-05-03 14:18 ` Amir Goldstein 2019-05-09 2:36 ` Dave Chinner 1 sibling, 0 replies; 25+ messages in thread From: Amir Goldstein @ 2019-05-03 14:18 UTC (permalink / raw) To: Theodore Ts'o Cc: Vijay Chidambaram, lsf-pc, Dave Chinner, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Fri, May 3, 2019 at 5:59 AM Theodore Ts'o <tytso@mit.edu> wrote: > > On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote: > > OK. we can leave that one for later. > > Although I am not sure what the concern is. > > If we are able to agree and document a LINK_ATOMIC flag, > > what would be the down side of documenting a RENAME_ATOMIC > > flag with same semantics? After all, as I said, this is what many users > > already expect when renaming a temp file (as ext4 heuristics prove). > > The problem is if the "temp file" has been hardlinked to 1000 > different directories, does the rename() have to guarantee that we > have to make sure that the changes to all 1000 directories have been > persisted to disk? And all of the parent directories of those 1000 > directories have also *all* been persisted to disk, all the way up to > the root? > > With the O_TMPFILE linkat case, we know that inode hasn't been > hard-linked to any other directory, and mercifully directories have > only one parent directory, so we only have to check one set of > directory inodes all the way up to the root having been persisted. > > But.... 
I can already imagine someone complaining that if due to bind > mounts and 1000 mount namespaces, there is some *other* directory > pathname which could be used to reach said "tmpfile", we have to > guarantee that all parent directories which could be used to reach > said "tmpfile" even if they span a dozen different file systems, > *also* have to be persisted due to sloppy drafting of what the > atomicity rules might happen to be. > > If we are only guaranteeing the persistence of the containing > directories of the source and destination files, that's pretty easy. > But then the consistency rules need to *explicitly* state this. Some > of the handwaving definitions of what would be guaranteed.... scare > me. > I see. So the issue is with the language: "metadata modifications made to the file before being linked", which may be interpreted to mean that hardlinking a file is itself a modification to the file. I can't help myself writing the pun "nlink doesn't count". Tough one. We can include language that explicitly excludes that case, but that is not going to aid the goal of a simple documented API. OK, I'll withdraw RENAME_ATOMIC for now and concede to having LINK_ATOMIC fail when trying to link a file whose nlink > 0. How about if I implement RENAME_ATOMIC for in-kernel users only at this point in time? Overlayfs needs it for correctness of the directory copy-up operation. > > P.S. If we were going to do this, we'd probably want to simply define > a flag to be AT_FSYNC, using the strict POSIX definition of fsync, > which is to say, as a result of the linkat or renameat, the file in > question, and its associated metadata, are guaranteed to be persisted > to disk. No other guarantees about any other inode's metadata > regardless of when they might be made, would be guaranteed. > I agree that may be useful. Not for my use case though. 
> If people really want "global barrier" semantics, then perhaps it > would be better to simply define a barrierfs(2) system call that works > like syncfs(2) --- it applies to the whole file system, and guarantees > that all changes made *before* barrierfs(2) will be visible if any > changes made after barrierfs(2) are visible. Amir, you used "global > ordering" a few times; if you really need that, let's define a new > system call which guarantees that. Maybe some of the research > proposals for exotic changes to SSD semantics, etc., would allow > barrierfs(2) semantics to be something that we could implement more > efficiently than syncfs(2). But let's make this be explicit, as > opposed to some magic guarantee that falls out as a side effect of the > fsync(2) system call to a single inode. Yes, maybe. For xfs/ext4. Not sure about btrfs. Seems like fbarrier(2) would have been more natural for the btrfs model (a file and all its dependencies). I think barrierfs(2) would be useful, but harder to explain to users. Note that barrierfs() should not flush all inode pages (that would be counterproductive), so what does it really mean to end users? We would end up with the same problem as the misunderstood sync_file_range(). I would have been happy with this API: sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE_AND_WAIT); barrierfs(fd); rename(...)/link(...) Perhaps atomic_rename()/atomic_link() should be library functions wrapping the lower level API to hide those details from end users. Thanks, Amir. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 9:58 ` Theodore Ts'o 2019-05-03 14:18 ` Amir Goldstein @ 2019-05-09 2:36 ` Dave Chinner 1 sibling, 0 replies; 25+ messages in thread From: Dave Chinner @ 2019-05-09 2:36 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Fri, May 03, 2019 at 05:58:46AM -0400, Theodore Ts'o wrote: > On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote: > > OK. we can leave that one for later. > > Although I am not sure what the concern is. > > If we are able to agree and document a LINK_ATOMIC flag, > > what would be the down side of documenting a RENAME_ATOMIC > > flag with same semantics? After all, as I said, this is what many users > > already expect when renaming a temp file (as ext4 heuristics prove). > > The problem is if the "temp file" has been hardlinked to 1000 > different directories, does the rename() have to guarantee that we > have to make sure that the changes to all 1000 directories have been > persisted to disk? No. Dependency creation is directional. If the parent directory modifies an entry that points to an inode, then the dependency (via inode link count modification) is created. Modifying an inode does not create a dependency on the parent directory, because the parent directory is not modified by inode-specific changes. Yes, sometimes the dependency graph will resolve to fsync other directories. e.g. because hardlinks to the same inode were created and this is the first fsync on the inode that stabilises the link count. Because the link count is being stabilised, all the current dependencies on that link count (i.e. all the directories with uncommitted dirent modifications that modified the link count in that inode) /may/ be included in the fsync. 
However, if the filesystem tracks every change to the inode link count as separate modifications, it need only commit the directory modifications that occurred /before/ the one being fsync'd. i.e. SOMC doesn't require "sync the world" behaviour, it's just that we have filesystems that currently behave that way because it's a simple and efficient way of tracking and resolving ordering dependencies. IOWs, SOMC is all about cross-object dependencies and how they are resolved. If you have no cross-object dependencies or your operations are isolated to a non-shared set of objects, then SOMC allows them to operate in 100% isolation to everything else and the filesystem can optimise this in whatever way it wants. SOMC is not the end of the world, Ted. It's just a consistency model that has been proposed that could allow substantial optimisation of application operations and filesystem behaviour. You're free to go in other directions if you want - diversity is good. :) Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-03 2:30 ` Theodore Ts'o 2019-05-03 3:15 ` Vijay Chidambaram 2019-05-03 4:16 ` Amir Goldstein @ 2019-05-09 1:43 ` Dave Chinner 2019-05-09 2:20 ` Theodore Ts'o 2019-05-09 8:47 ` Amir Goldstein 2 siblings, 2 replies; 25+ messages in thread From: Dave Chinner @ 2019-05-09 1:43 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 02, 2019 at 10:30:43PM -0400, Theodore Ts'o wrote: > On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote: > > I am not saying there is no room for a document that elaborates on those > > guaranties. I personally think that could be useful and certainly think that > > your group's work for adding xfstest coverage for API guaranties is useful. > > Again, here is my concern. If we promise that ext4 will always obey > Dave Chinner's SOMC model, it would forever rule out Daejun Park and > Dongkun Shin's "iJournaling: Fine-grained journaling for improving the > latency of fsync system call"[1] published in Usenix ATC 2017. No, it doesn't rule that out at all. In a SOMC model, incremental journalling is just fine when there are no external dependencies on the thing being fsync'd. If you have other dependencies (e.g. the file has just been created and so the dir is dirty, too) then fsync would need to do the whole shebang, but otherwise.... > So if the crash consistency guarantees forbids future innovations > where applications might *want* a fast fsync() that doesn't drag > unrelated inodes into the persistence guarantees, .... the whole point of SOMC is that it allows filesystems to avoid dragging external metadata into fsync() operations /unless/ there's a user-visible ordering dependency that must be maintained between objects. 
If all you are doing is stabilising file data in a stable file/directory, then independent, incremental journaling of the fsync operations on that file fits the SOMC model just fine. > is that really what > we want? Do we want to forever rule out various academic > investigations such as Park and Shin's because "it violates the crash > consistency recovery model"? Especially if some applications don't > *need* the crash consistency model? Stop with the silly inflammatory hyperbole already, Ted. It is not necessary. > P.P.S. One of the other discussions that did happen during the main > LSF/MM File system session, and for which there was general agreement > across a number of major file system maintainers, was a fsync2() > system call which would take a list of file descriptors (and flags) > that should be fsync'ed. Hmmmm, that wasn't on the agenda, and nobody has documented it as yet. > The semantics would be that when the > fsync2() successfully returns, all of the guarantees of fsync() or > fdatasync() requested by the list of file descriptors and flags would > be satisfied. This would allow file systems to more optimally fsync a > batch of files, for example by implementing data integrity writebacks > for all of the files, followed by a single journal commit to guarantee > persistence for all of the metadata changes. What happens when you get writeback errors on only some of the fds? How do you report the failures and what do you do with the journal commit on partial success? Of course, this ignores the elephant in the room: applications can /already do this/ using AIO_FSYNC and have individual error status for each fd. Not to mention that filesystems already batch concurrent fsync journal commits into a single operation. I'm not seeing the point of a new syscall to do this right now.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 1:43 ` Dave Chinner @ 2019-05-09 2:20 ` Theodore Ts'o 2019-05-09 2:58 ` Dave Chinner 2019-05-09 5:02 ` Vijay Chidambaram 2019-05-09 8:47 ` Amir Goldstein 1 sibling, 2 replies; 25+ messages in thread From: Theodore Ts'o @ 2019-05-09 2:20 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote: > > .... the whole point of SOMC is that allows filesystems to avoid > dragging external metadata into fsync() operations /unless/ there's > a user visible ordering dependency that must be maintained between > objects. If all you are doing is stabilising file data in a stable > file/directory, then independent, incremental journaling of the > fsync operations on that file fit the SOMC model just fine. Well, that's not what Vijay's crash consistency guarantees state. It guarantees quite a bit more than what you've written above. Which is my concern. > > P.P.S. One of the other discussions that did happen during the main > > LSF/MM File system session, and for which there was general agreement > > across a number of major file system maintainers, was a fsync2() > > system call which would take a list of file descriptors (and flags) > > that should be fsync'ed. > > Hmmmm, that wasn't on the agenda, and nobody has documented it as > yet. It came up as suggested alternative during Ric Wheeler's "Async all the things" session. The problem he was trying to address was programs (perhaps userspace file servers) who need to fsync a large number of files at the same time. 
The problem with his suggested solution (which we have for AIO and io_uring already) of having the program issue a large number of asynchronous fsyncs and then waiting for them all, is that the back-end interface is a work queue, so there is a lot of effective serialization that takes place. > > The semantics would be that when the > > fsync2() successfully returns, all of the guarantees of fsync() or > > fdatasync() requested by the list of file descriptors and flags would > > be satisfied. This would allow file systems to more optimally fsync a > > batch of files, for example by implementing data integrity writebacks > > for all of the files, followed by a single journal commit to guarantee > > persistence for all of the metadata changes. > > What happens when you get writeback errors on only some of the fds? > How do you report the failures and what do you do with the journal > commit on partial success? Well, one approach would be to pass back the errors in the structure. Say something like this:

int fsync2(int len, struct fsync_req reqs[]);

struct fsync_req {
	int fd;     /* IN */
	int flags;  /* IN */
	int retval; /* OUT */
};

As far as what do you do with the journal commit on partial success, there are no atomic, "all or nothing" guarantees with this interface. It is implementation-specific whether there would be one or more file system commits necessary before fsync2 returned. > Of course, this ignores the elephant in the room: applications can > /already do this/ using AIO_FSYNC and have individual error status > for each fd. Not to mention that filesystems already batch > concurrent fsync journal commits into a single operation. I'm not > seeing the point of a new syscall to do this right now.... But it doesn't work very well, because the implementation uses a workqueue. Sure, you could create N worker threads for N fd's that you want to fsync, and then the file system can batch the fsync requests. 
But wouldn't it be so much simpler to give a list of fd's that should be fsync'ed to the file system? That way you don't have to do lots of work to split up the work so they can be submitted in parallel, only to have the file system batch up all of the requests being issued from all of those kernel threads. So yes, it's identical to the interfaces we already have. Just like select(2), poll(2) and epoll(2) are functionally identical... - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 2:20 ` Theodore Ts'o @ 2019-05-09 2:58 ` Dave Chinner 2019-05-09 3:31 ` Theodore Ts'o 2019-05-09 5:02 ` Vijay Chidambaram 1 sibling, 1 reply; 25+ messages in thread From: Dave Chinner @ 2019-05-09 2:58 UTC (permalink / raw) To: Theodore Ts'o Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Wed, May 08, 2019 at 10:20:13PM -0400, Theodore Ts'o wrote: > On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote: > > > > .... the whole point of SOMC is that allows filesystems to avoid > > dragging external metadata into fsync() operations /unless/ there's > > a user visible ordering dependency that must be maintained between > > objects. If all you are doing is stabilising file data in a stable > > file/directory, then independent, incremental journaling of the > > fsync operations on that file fit the SOMC model just fine. > > Well, that's not what Vijay's crash consistency guarantees state. It > guarantees quite a bit more than what you've written above. Which is > my concern. SOMC does not define crash consistency rules - it defines change dependencies and how ordering and atomicity impact the dependency graph. How other people have interpreted that is out of my control. > It came up as suggested alternative during Ric Wheeler's "Async all > the things" session. The problem he was trying to address was > programs (perhaps userspace file servers) who need to fsync a large > number of files at the same time. 
We got linear scaling out to device bandwidth and/or IOPS limits with bulk fsync benchmarks on XFS with that simple workqueue implementation. If there are problems, then I'd suggest that people should be reporting bugs to the developers of the AIO_FSYNC code (i.e. Christoph and myself) or providing patches to improve it so these problems go away. A new syscall with essentially the same user interface doesn't guarantee that these implementation problems will be solved. > > > The semantics would be that when the > > > fsync2() successfully returns, all of the guarantees of fsync() or > > > fdatasync() requested by the list of file descriptors and flags would > > > be satisfied. This would allow file systems to more optimally fsync a > > > batch of files, for example by implementing data integrity writebacks > > > for all of the files, followed by a single journal commit to guarantee > > > persistence for all of the metadata changes. > > > > What happens when you get writeback errors on only some of the fds? > > How do you report the failures and what do you do with the journal > > commit on partial success? > > Well, one approach would be to pass back the errors in the structure. > Say something like this: > > int fsync2(int len, struct fsync_req[]); > > struct fsync_req { > int fd; /* IN */ > int flags; /* IN */ > int retval; /* OUT */ > }; So it's essentially identical to the AIO_FSYNC interface, except that it is synchronous. > As far as what do you do with the journal commit on partial success, > there are no atomic, "all or nothing" guarantees with this interface. > It is implementation specific whether there would be one or more file > system commits necessary before fsync2 returned. IOWs, same guarantees as AIO_FSYNC. > > Of course, this ignores the elephant in the room: applications can > > /already do this/ using AIO_FSYNC and have individual error status > > for each fd.
Not to mention that filesystems already batch > > concurrent fsync journal commits into a single operation. I'm not > > seeing the point of a new syscall to do this right now.... > > But it doesn't work very well, because the implementation uses a > workqueue. Then fix the fucking implementation! Sheesh! Did LSFMM include a free lobotomy for participants, or something? -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 2:58 ` Dave Chinner @ 2019-05-09 3:31 ` Theodore Ts'o 2019-05-09 5:19 ` Darrick J. Wong 0 siblings, 1 reply; 25+ messages in thread From: Theodore Ts'o @ 2019-05-09 3:31 UTC (permalink / raw) To: Dave Chinner Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 09, 2019 at 12:58:45PM +1000, Dave Chinner wrote: > > SOMC does not define crash consistency rules - it defines change > dependencies and how ordering and atomicity impact the dependency > graph. How other people have interpreted that is out of my control. Fine; but it's a specific set of crash consistency rules which I'm objecting to; it's not a promise that I think I want to make. (And before you blindly sign on the bottom line, I'd suggest that you read it very carefully before deciding whether you want to agree to those consistency rules as something that XFS will have to honor forever. The way I read it, it goes beyond what you've articulated as SOMC.) > A new syscall with essentially the same user interface doesn't > guarantee that these implementation problems will be solved. Well, it makes it easier to send all of the requests to the file system in a single bundle. I'd also argue that it's simpler and easier for an application to use an fsync2() interface as I sketched out than trying to use the whole AIO or io_uring machinery. > So it's essentially identical to the AIO_FSYNC interface, except > that it is synchronous. Pretty much, yes. > Sheesh! Did LSFMM include a free lobotomy for participants, or > something? Well, we missed your presence, alas. No doubt your attendance would have improved the discussion. Cheers, - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 3:31 ` Theodore Ts'o @ 2019-05-09 5:19 ` Darrick J. Wong 0 siblings, 0 replies; 25+ messages in thread From: Darrick J. Wong @ 2019-05-09 5:19 UTC (permalink / raw) To: Theodore Ts'o Cc: Dave Chinner, Amir Goldstein, Vijay Chidambaram, lsf-pc, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Wed, May 08, 2019 at 11:31:00PM -0400, Theodore Ts'o wrote: > On Thu, May 09, 2019 at 12:58:45PM +1000, Dave Chinner wrote: > > > > SOMC does not define crash consistency rules - it defines change > > dependencies and how ordering and atomicity impact the dependency > > graph. How other people have interpreted that is out of my control. > > Fine; but it's a specific set of crash consistency rules which I'm > objecting to; it's not a promise that I think I want to make. (And > before you blindly sign on the bottom line, I'd suggest that you read > it very carefully before deciding whether you want to agree to those > consistency rules as something that XFS will have to honor forever. The > way I read it, it goes beyond what you've articulated as SOMC.) I find myself (unusually) rooting for the status quo, where we /don't/ have a big SOMC rulebook that everyone has to follow, and instead we just tell people that if they really want to know how a filesystem behaves they had better try their workload with that fs + storage. If they don't like what they find, we have a reasonable amount of competition and niche specialization amongst the many filesystems, so they can try the others, or if they're still unsatisfied, see if they can drive a consensus. Filesystems are like cars -- the basic interfaces are more or less the same but the implementations can still differ. (They also tend to crash, catch on fire, and leave a smear of destruction in their wake.)
> > A new syscall with essentially the same user interface doesn't > > guarantee that these implementation problems will be solved. > > Well, it makes it easier to send all of the requests to the file > system in a single bundle. I'd also argue that it's simpler and > easier for an application to use a fsync2() interface as I sketched > out than trying to use the whole AIO or io_uring machinery. I *would* like to see a more concrete fsync2 proposal. And while I'm asking for ponies, whatever it is that came out of the DAX file flags discussion too. > > > So it's essentially identical to the AIO_FSYNC interface, except > > that it is synchronous. > > Pretty much, yes. OH yeah, I forgot we wired that up finally. > > Sheesh! Did LSFMM include a free lobotomy for participants, or > > something? "I'd rather have a bottle in front of me..." Peace out, see you all on the 20th! --D > Well, we missed your presence, alas. No doubt your attendance would > have improved the discussion. > > Cheers, > > - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 2:20 ` Theodore Ts'o 2019-05-09 2:58 ` Dave Chinner @ 2019-05-09 5:02 ` Vijay Chidambaram 2019-05-09 5:37 ` Darrick J. Wong 2019-05-09 15:46 ` Theodore Ts'o 1 sibling, 2 replies; 25+ messages in thread From: Vijay Chidambaram @ 2019-05-09 5:02 UTC (permalink / raw) To: Theodore Ts'o Cc: Dave Chinner, Amir Goldstein, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Wed, May 8, 2019 at 9:30 PM Theodore Ts'o <tytso@mit.edu> wrote: > > On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote: > > > > .... the whole point of SOMC is that it allows filesystems to avoid > > dragging external metadata into fsync() operations /unless/ there's > > a user visible ordering dependency that must be maintained between > > objects. If all you are doing is stabilising file data in a stable > > file/directory, then independent, incremental journaling of the > > fsync operations on that file fits the SOMC model just fine. > > Well, that's not what Vijay's crash consistency guarantees state. It > guarantees quite a bit more than what you've written above. Which is > my concern. The intention is to capture Dave's SOMC semantics. We can re-iterate and re-phrase until we capture what Dave meant precisely. I am fairly confident we can do this, given that Dave himself is participating and helping us refine the text. So this doesn't seem like a reason not to have documentation at all to me. As we have stated multiple times on this and other threads, the intention is *not* to come up with one set of crash-recovery guarantees that every Linux file system must abide by forever. Ted, you keep repeating this, though we have never said this was our intention. The intention behind this effort is to simply document the crash-recovery guarantees provided today by different Linux file systems.
Ted, you question why this is required at all, and why we simply can't use POSIX and man pages. The answer: 1. POSIX is vague. Not persisting data to stable media on fsync is also allowed in POSIX (but no Linux file system actually does this), so it's not very useful in terms of understanding what crash-recovery guarantees file systems actually provide. Given that all Linux file systems provide something more than POSIX, the natural question to ask is what do they provide? We understood this from working on CrashMonkey, and we wanted to document it. 2. Other parts of the Linux kernel have much better documentation, even though they similarly want to provide freedom for developers to optimize and change internal implementation. I don't think documentation and freedom to change internals are mutually exclusive. 3. XFS provides SOMC semantics, and btrfs developers have stated they want to provide SOMC as well. F2FS developers have a mode in which they seek to provide SOMC semantics. Given all this, it seemed prudent to document SOMC. 4. Apart from developers, a document like this would also help academic researchers understand the current state-of-the-art in crash-recovery guarantees and the different choices made by different file systems. It is non-trivial to understand this without documentation. FWIW, I think the position of "if we don't write it down, application developers can't depend on it" is wrong. Even with nothing written down, developers noticed they could skip fsync() in ext3 when atomically updating files with rename(). This led to the whole ext4 rename-and-delayed-allocation problem. The much better path, IMO, is to document the current set of guarantees given by different file systems, and talk about what is intended and what is not. This would give application developers much better guidance in writing applications.
If ext4 wants to develop incremental fsync and introduce a new set of semantics that is different from SOMC and much closer to minimal POSIX, I don't think the documentation affects that at all. As Dave notes, diversity is good! Documentation is also good :) That being said, I think I'll stop our push to get this documented inside the Linux kernel at this point. We got useful comments from Dave, Amir, and others, so we will incorporate those comments and put up the documentation on a University of Texas web page. If someone else wants to carry on and get this merged, you are welcome to do so :) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 5:02 ` Vijay Chidambaram @ 2019-05-09 5:37 ` Darrick J. Wong 2019-05-09 15:46 ` Theodore Ts'o 1 sibling, 0 replies; 25+ messages in thread From: Darrick J. Wong @ 2019-05-09 5:37 UTC (permalink / raw) To: Vijay Chidambaram Cc: Theodore Ts'o, Dave Chinner, Amir Goldstein, lsf-pc, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 09, 2019 at 12:02:17AM -0500, Vijay Chidambaram wrote: > On Wed, May 8, 2019 at 9:30 PM Theodore Ts'o <tytso@mit.edu> wrote: > > > > On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote: > > > > > > .... the whole point of SOMC is that allows filesystems to avoid > > > dragging external metadata into fsync() operations /unless/ there's > > > a user visible ordering dependency that must be maintained between > > > objects. If all you are doing is stabilising file data in a stable > > > file/directory, then independent, incremental journaling of the > > > fsync operations on that file fit the SOMC model just fine. > > > > Well, that's not what Vijay's crash consistency guarantees state. It > > guarantees quite a bit more than what you've written above. Which is > > my concern. > > The intention is to capture Dave's SOMC semantics. We can re-iterate > and re-phrase until we capture what Dave meant precisely. I am fairly > confident we can do this, given that Dave himself is participating and > helping us refine the text. So this doesn't seem like a reason not to > have documentation at all to me. > > As we have stated on multiple times on this and other threads, the > intention is *not* to come up with one set of crash-recovery > guarantees that every Linux file system must abide by forever. Ted, > you keep repeating this, though we have never said this was our > intention. 
It might not be your intention, but I can definitely imagine others using such a SOMC document as a cudgel to, uh, pressure other filesystems into implementing the same semantics ("This isn't SOMC compliant!" "We never said it was." "It has to be compliant!"). That's fine for XFS because that's how it's supposed to work, but I wouldn't want other projects to have to defend themselves for lack of XFSiness. > The intention behind this effort is to simply document the > crash-recovery guarantees provided today by different Linux file > systems. Ted, you question why this is required at all, and why we > simply can't use POSIX and man pages. The answer: > > 1. POSIX is vague. Not persisting data to stable media on fsync is > also allowed in POSIX (but no Linux file system actually does this), > so its not very useful in terms of understanding what crash-recovery > guarantees file systems actually provide. Given that all Linux file > systems provide something more than POSIX, the natural question to ask > is what do they provide? We understood this from working on > CrashMonkey, and we wanted to document it. > 2. Other parts of the Linux kernel have much better documentation, > even though they similarly want to provide freedom for developers to > optimize and change internal implementation. I don't think > documentation and freedom to change internals are mutually exclusive. > 3. XFS provides SOMC semantics, and btrfs developers have stated they > want to provide SOMC as well. F2FS developers have a mode in which > they seek to provide SOMC semantics. Given all this, it seemed prudent > to document SOMC. Point. To further soften/undercut my earlier email, I think we can document the filesystem behaviors that specific projects are willing to endorse while still making it clear that YMMV and you had better test your workload if you want clarity of behavior. :) > 4. 
Apart from developers, a document like this would also help > academic researchers understand the current state-of-the-art in > crash-recovery guarantees and the different choices made by different > file systems. It is non-trivial to understand this without > documentation. > > FWIW, I think the position of "if we don't write it down, application > developers can't depend on it" is wrong. Even with nothing written > down, developers noticed they could skip fsync() in ext3 when > atomically updating files with rename(). This lead to the whole ext4 > rename-and-delayed-allocation problem. The much better path, IMO, is > to document the current set of guarantees given by different file > systems, and talk about what is intended and what is not. This would > give application developers much better guidance in writing > applications. > > If ext4 wants to develop incremental fsync and introduce a new set of > semantics that is different from SOMC and much closer to minimal > POSIX, I don't think the documentation affects that at all. As Dave > notes, diversity is good! Documentation is also good :) > > That being said, I think I'll stop our push to get this documented > inside the Linux kernel at this point. We got useful comments from > Dave, Amir, and others, so we will incorporate those comments and put > up the documentation on a University of Texas web page. If someone > else wants to carry on and get this merged, you are welcome to do so > :) Aww, I was going to suggest merging it with an explicit warning that the document *only* reflects those that have endorsed it, and a pileup at the end: Endorsed-by: fs/xfs Endorsed-by: fs/btrfs Rejected-by: fs/djwongcrazyfs (I'm still ¾ tempted to just put the XFS parts in a text file and merge it into fs/xfs/ if the broader effort doesn't succeed...) --D ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 5:02 ` Vijay Chidambaram 2019-05-09 5:37 ` Darrick J. Wong @ 2019-05-09 15:46 ` Theodore Ts'o 1 sibling, 0 replies; 25+ messages in thread From: Theodore Ts'o @ 2019-05-09 15:46 UTC (permalink / raw) To: Vijay Chidambaram Cc: Dave Chinner, Amir Goldstein, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 09, 2019 at 12:02:17AM -0500, Vijay Chidambaram wrote: > As we have stated multiple times on this and other threads, the > intention is *not* to come up with one set of crash-recovery > guarantees that every Linux file system must abide by forever. Ted, > you keep repeating this, though we have never said this was our > intention. > > The intention behind this effort is to simply document the > crash-recovery guarantees provided today by different Linux file > systems. Ted, you question why this is required at all, and why we > simply can't use POSIX and man pages. But who is this documentation targeted towards? Who is it intended to benefit? Most application authors do not write applications with specific file systems in mind. And even if they do, they can't control how their users are going to use it. > FWIW, I think the position of "if we don't write it down, application > developers can't depend on it" is wrong. Even with nothing written > down, developers noticed they could skip fsync() in ext3 when > atomically updating files with rename(). This led to the whole ext4 > rename-and-delayed-allocation problem. The much better path, IMO, is > to document the current set of guarantees given by different file > systems, and talk about what is intended and what is not. This would > give application developers much better guidance in writing > applications. If we were to provide that nuance, that would be much better, I would agree. It's not what the current crash consistency guarantees provide, alas.
I'd also want to talk about what is guaranteed *first*; documenting the current state of affairs, some of which may be subject to change and an artifact of the implementation, is far less important. So I'd prefer that "documentation of current behavior" be the last thing in the document --- perhaps in an appendix --- and not the headliner. Indeed, I'd use the ext3 O_PONIES discussion as a prime example of the risk if we were to just "document current practice" and stop there. It's the fact that your crash consistency guarantees draft claims to "document current practice" and, at the same time, uses the word "guarantee" which causes red flags to go up for me. If we could separate those two, that would be very helpful. And if the current POSIX guarantees are too vague, my preference would be to first determine what application authors would find more useful in terms of stricter guarantees, and provide those guarantees as we find them. We can always add more guarantees later. Taking guarantees away is much harder. And guarantees by definition always restrict freedom of action, so this is an engineering tradeoff. Let's provide those guarantees when it actually improves application performance, and not Just Because. It might also be that defining new system calls, like fbarrier() and fdatabarrier(), is a better approach rather than retconning new semantics on top of fsync(). I just think a principled design approach is better rather than taking existing semantics and slapping the word "guarantee" in the title of said documentation. I will also say that I have no problems with documenting strong metadata ordering if it has nothing to do with fsync(). That makes sense. The moment that you try to also bring data integrity into the mix, and give examples of what happens if you call fsync(), it goes beyond strong metadata ordering. So if you want to document what happens without fsync, ext4 can probably get on board with that.
Unfortunately, in addition to including the word "guarantee", the current crash consistency draft also includes the word "fsync". > 4. Apart from developers, a document like this would also help > academic researchers understand the current state-of-the-art in > crash-recovery guarantees and the different choices made by different > file systems. It is non-trivial to understand this without > documentation. It's also very hard to understand this without taking performance constraints and implementation choices into account. It's trivially easy to give super-strong crash-recovery guarantees. But if it sacrifices performance, is it really "state-of-the-art"? Worse, different applications may want different guarantees, and may want different crash consistency vs. performance tradeoffs. This is why, in general, the concept of providing new interfaces where the application can state more explicitly what it wants is much more appealing to me. When I have discussions with Amir, he doesn't just want strong guarantees; he wants specific guarantees with zero overhead, and our discussions have been about how we manage the tension between those two goals. And it's much easier to achieve this in terms of very specific cases, such as what happens when you link an O_TMPFILE file into a directory. Cheers, - Ted ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-09 1:43 ` Dave Chinner 2019-05-09 2:20 ` Theodore Ts'o @ 2019-05-09 8:47 ` Amir Goldstein 1 sibling, 0 replies; 25+ messages in thread From: Amir Goldstein @ 2019-05-09 8:47 UTC (permalink / raw) To: Dave Chinner Cc: Theodore Ts'o, Vijay Chidambaram, lsf-pc, Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn On Thu, May 9, 2019 at 4:43 AM Dave Chinner <david@fromorbit.com> wrote: > > On Thu, May 02, 2019 at 10:30:43PM -0400, Theodore Ts'o wrote: > > On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote: > > > I am not saying there is no room for a document that elaborates on those > > > guaranties. I personally think that could be useful and certainly think that > > > your group's work for adding xfstest coverage for API guaranties is useful. > > > > Again, here is my concern. If we promise that ext4 will always obey > > Dave Chinner's SOMC model, it would forever rule out Daejun Park and > > Dongkun Shin's "iJournaling: Fine-grained journaling for improving the > > latency of fsync system call"[1] published in Usenix ATC 2017. > > No, it doesn't rule that out at all. > Dave and all the good people, Please go back and read the first email in this thread before it diverges yet again into interpretations of SOMC. The novelty in my proposal (which I attribute to Jan's idea) is to reduce the concerns around documenting "expected behavior of the world" to documenting "expected behavior of linking an O_TMPFILE". It boils down to documenting AT_LINK_ATOMIC (or whatever flag name): "The filesystem provides the guarantee that after a crash, if the linked O_TMPFILE is observed in the target directory, then all the data and metadata modifications made to the file before being linked are also observed." No more, no less. I intentionally reduced the scope to the point that I could get ext4 and btrfs to sign the treaty.
I think this is a good starting point, from which we can make forward progress. I'd appreciate it if the xfs camp, Dave in particular, would address the proposal regardless of the broader SOMC documentation discussion. Thanks, Amir. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-02 16:12 ` Amir Goldstein 2019-05-02 17:11 ` Vijay Chidambaram @ 2019-05-02 21:05 ` Darrick J. Wong 2019-05-02 22:19 ` Amir Goldstein 1 sibling, 1 reply; 25+ messages in thread From: Darrick J. Wong @ 2019-05-02 21:05 UTC (permalink / raw) To: Amir Goldstein Cc: lsf-pc, Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Vijaychidambaram Velayudhan Pillai, Filipe Manana, Chris Mason, lwn On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote: > On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > Suggestion for another filesystems track topic. > > > > Some of you may remember the emotional(?) discussions that ensued > > when the crashmonkey developers embarked on a mission to document > > and verify filesystem crash recovery guaranties: > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/ > > > > There are two camps among filesystem developers and every camp > > has good arguments for wanting to document existing behavior and for > > not wanting to document anything beyond "use fsync if you want any guaranty". > > > > I would like to take a suggestion proposed by Jan on a related discussion: > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/ > > > > and make a proposal that may be able to meet the concerns of > > both camps. > > > > The proposal is to add new APIs which communicate > > crash consistency requirements of the application to the filesystem. > > > > Example API could look like this: > > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER) > > It's just an example. The API could take another form and may need > > more barrier types (I proposed to use new file_sync_range() flags). > > > > The idea is simple though. 
> > METADATA_BARRIER means all the inode metadata will be observed > > after crash if rename is observed after crash. > > DATA_BARRIER same for file data. > > We may also want a "ALL_METADATA_BARRIER" and/or > > "METADATA_DEPENDENCY_BARRIER" to more accurately > > describe what SOMC guaranties actually provide today. > > > > The implementation is also simple. filesystems that currently > > have SOMC behavior don't need to do anything to respect > > METADATA_BARRIER and only need to call > > filemap_write_and_wait_range() to respect DATA_BARRIER. > > filesystem developers are thus not tying their hands w.r.t future > > performance optimizations for operations that are not explicitly > > requesting a barrier. > > > > An update: Following the LSF session on $SUBJECT I had a discussion > with Ted, Jan and Chris. > > We were all in agreement that linking an O_TMPFILE into the namespace > is probably already perceived by users as the barrier/atomic operation that > I am trying to describe. > > So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of > providing the required semantics when linking O_TMPFILE *as long* as > the semantics are properly documented. > > This is what open(2) man page has to say right now: > > * Creating a file that is initially invisible, which is then > populated with data > and adjusted to have appropriate filesystem attributes (fchown(2), > fchmod(2), fsetxattr(2), etc.) before being atomically linked into the > filesystem in a fully formed state (using linkat(2) as described above). > > The phrase that I would like to add (probably in link(2) man page) is: > "The filesystem provides the guarantee that after a crash, if the linked > O_TMPFILE is observed in the target directory, then all the data and "if the linked O_TMPFILE is observed" ... meaning that if we can't recover all the data+metadata information then it's ok to obliterate the file?
Is the filesystem allowed to drop the tmpfile data if userspace > links the tmpfile into a directory but doesn't fsync the directory? TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?) user requirement would be "Until I say otherwise I want always to be able to read <data> from this given string <pathname>." (vs. regular Unix rename/link where we make you specify how much you care about that by hitting us on the head with a file fsync and then a directory fsync.) > metadata modifications made to the file before being linked are also > observed." > > For some filesystems, btrfs in particular, that would mean an implicit > fsync on the linked inode. On other filesystems, ext4/xfs in particular, > that would only require at least committing delayed allocations, but > will NOT require inode fsync nor journal commit/flushing disk caches. I don't think it does much good to commit delalloc blocks but not flush dirty overwrites, and I don't think it makes a lot of sense to flush out overwrite data without also pushing out the inode metadata too. FWIW I'm ok with a "here's an 'I'm really serious' flag" that carries with it a full fsync, though how do we sell developers on using it? > I would like to hear the opinion of XFS developers and filesystem > maintainers who did not attend the LSF session. I miss you all too. Sorry I couldn't make it this year. :( > I have no objection to adding an opt-in LINK_ATOMIC flag > and pass it down to filesystems instead of changing behavior and > patching stable kernels, but I prefer the latter. > > I believe this should have been the semantics to begin with > if for no other reason, because users would expect it regardless > of whatever we write in manual page and no matter how many > !!!!!!!! we use for disclaimers. > > And if we can all agree on that, then O_TMPFILE is quite young > in historic perspective, so not too late to call the expectation gap > a bug and fix it.(?)
Why would linking an O_TMPFILE be a special case as opposed to making hard links in general? If you hardlink a dirty file then surely you'd also want to be able to read the data from the new location? > Taking this another step forward, if we agree on the language > I used above to describe the expected behavior, then we can > add an opt-in RENAME_ATOMIC flag to provide the same > semantics and document it in the same manner (this functionality > is needed for directories and non regular files) and all there is left > is the fun part of choosing the flag name ;-) Will have to think about /that/ some more. --D > > Thanks, > Amir. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [TOPIC] Extending the filesystem crash recovery guaranties contract 2019-05-02 21:05 ` Darrick J. Wong @ 2019-05-02 22:19 ` Amir Goldstein 0 siblings, 0 replies; 25+ messages in thread From: Amir Goldstein @ 2019-05-02 22:19 UTC (permalink / raw) To: Darrick J. Wong Cc: lsf-pc, Dave Chinner, Theodore Tso, Jan Kara, linux-fsdevel, Jayashree Mohan, Vijaychidambaram Velayudhan Pillai, Filipe Manana, Chris Mason, lwn On Thu, May 2, 2019 at 5:05 PM Darrick J. Wong <darrick.wong@oracle.com> wrote: > > On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote: > > On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote: > > > > > > Suggestion for another filesystems track topic. > > > > > > Some of you may remember the emotional(?) discussions that ensued > > > when the crashmonkey developers embarked on a mission to document > > > and verify filesystem crash recovery guaranties: > > > > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/ > > > > > > There are two camps among filesystem developers and every camp > > > has good arguments for wanting to document existing behavior and for > > > not wanting to document anything beyond "use fsync if you want any guaranty". > > > > > > I would like to take a suggestion proposed by Jan on a related discussion: > > > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/ > > > > > > and make a proposal that may be able to meet the concerns of > > > both camps. > > > > > > The proposal is to add new APIs which communicate > > > crash consistency requirements of the application to the filesystem. > > > > > > Example API could look like this: > > > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER) > > > It's just an example. The API could take another form and may need > > > more barrier types (I proposed to use new file_sync_range() flags). > > > > > > The idea is simple though. 
> > > METADATA_BARRIER means all the inode metadata will be observed
> > > after crash if rename is observed after crash.
> > > DATA_BARRIER same for file data.
> > > We may also want a "ALL_METADATA_BARRIER" and/or
> > > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > > describe what SOMC guaranties actually provide today.
> > >
> > > The implementation is also simple. Filesystems that currently
> > > have SOMC behavior don't need to do anything to respect
> > > METADATA_BARRIER and only need to call
> > > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > > Filesystem developers are thus not tying their hands w.r.t future
> > > performance optimizations for operations that are not explicitly
> > > requesting a barrier.
> > >
> >
> > An update: Following the LSF session on $SUBJECT I had a discussion
> > with Ted, Jan and Chris.
> >
> > We were all in agreement that linking an O_TMPFILE into the namespace
> > is probably already perceived by users as the barrier/atomic operation
> > that I am trying to describe.
> >
> > So at least the maintainers of btrfs/ext4/ext2 are sympathetic to the
> > idea of providing the required semantics when linking O_TMPFILE *as
> > long* as the semantics are properly documented.
> >
> > This is what the open(2) man page has to say right now:
> >
> >   * Creating a file that is initially invisible, which is then
> >     populated with data and adjusted to have appropriate filesystem
> >     attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being
> >     atomically linked into the filesystem in a fully formed state
> >     (using linkat(2) as described above).
> >
> > The phrase that I would like to add (probably in the link(2) man page) is:
> > "The filesystem provides the guaranty that after a crash, if the linked
> > O_TMPFILE is observed in the target directory, then all the data and

> "if the linked O_TMPFILE is observed" ... meaning that if we can't
> recover all the data+metadata information then it's ok to obliterate the
> file?
> Is the filesystem allowed to drop the tmpfile data if userspace
> links the tmpfile into a directory but doesn't fsync the directory?
>

Yes! Yes! Definitely allowed!
Linking an O_TMPFILE has a single possible use case - an "atomic"
creation of a fully baked file. I am trying hard to explain that for my
use case, durability is not a requirement of the "atomic" creation, but
rather the "if the linked O_TMPFILE is observed" semantics.

> TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?)
> user requirement would be "Until I say otherwise I want always to be
> able to read <data> from this given string <pathname>."
>

Sadly, it is hard for me to explain the difference even to filesystem
developers, so what hope is there with mortal users? But what can I do -
the kernel has an interface for durability (several of them), but no
interface for what I need (ordering), so I must introduce one.

The good news, and the key argument in my sales pitch, is that some
users already have expectations about rename/link that are neither
documented nor correct, so hopefully, if we add and document those
flags, the situation cannot get worse.

> (vs. regular Unix rename/link where we make you specify how much you
> care about that by hitting us on the head with a file fsync and then a
> directory fsync.)

OK. Perhaps a solution to this human interface issue is introducing a
pair of flags, LINK_SYNC and LINK_ATOMIC. I did not think that the
former is needed, but maybe it is needed just as a way to document what
LINK_ATOMIC is *not*, e.g.:

"LINK_SYNC
If the operation succeeds, the filesystem provides the guaranty that
after a crash, the linked O_TMPFILE will be observed in the target
directory and that all the data and metadata modifications made to the
file before being linked are also observed.
LINK_ATOMIC
If the operation succeeds, the filesystem provides the guaranty that
after a crash, if the linked O_TMPFILE is observed in the target
directory, then all the data and metadata modifications made to the file
before being linked are also observed.

LINK_ATOMIC is often cheaper than LINK_SYNC, because it does not require
flushing volatile disk write caches, but it does not provide the
guaranty that the file will be observed in the target directory after a
crash."

My intuition about this is "less is better", so I prefer not to add two
flags.

> > metadata modifications made to the file before being linked are also
> > observed."
> >
> > For some filesystems, btrfs in particular, that would mean an implicit
> > fsync on the linked inode. On other filesystems, ext4/xfs in
> > particular, that would only require committing delayed allocations,
> > but will NOT require an inode fsync nor a journal commit/flushing of
> > disk caches.

> I don't think it does much good to commit delalloc blocks but not flush
> dirty overwrites, and I don't think it makes a lot of sense to flush out
> overwrite data without also pushing out the inode metadata too.

My intention was that this flag would call filemap_write_and_wait_range()
on ext4/xfs, which is what my application does today to get the desired
result. From there on, we can rely on "strictly ordered metadata
consistency" (SOMC) to provide what the interface needs.

> FWIW I'm ok with the "Here's an 'I'm really serious' flag" that carries
> with it a full fsync, though how to sell developers on using it?
>

I am an application developer and I have no need of such a flag. I have
a need for another flag, which is why I started this discussion...

But also, this is why my preference is to NOT add a LINK_ATOMIC flag at
all and to just assume that users cannot possibly think it is a good
outcome to observe a half-baked linked O_TMPFILE after a crash, and give
users what they want.
> > I would like to hear the opinion of XFS developers and filesystem
> > maintainers who did not attend the LSF session.
>
> I miss you all too. Sorry I couldn't make it this year. :(
>
> > I have no objection to adding an opt-in LINK_ATOMIC flag
> > and passing it down to filesystems instead of changing behavior and
> > patching stable kernels, but I prefer the latter.
> >
> > I believe this should have been the semantics to begin with,
> > if for no other reason than that users would expect it regardless
> > of whatever we write in the manual page and no matter how many
> > !!!!!!!! we use for disclaimers.
> >
> > And if we can all agree on that, then O_TMPFILE is quite young
> > in historic perspective, so it is not too late to call the
> > expectation gap a bug and fix it.(?)
>
> Why would linking an O_TMPFILE be a special case as opposed to making
> hard links in general?  If you hardlink a dirty file then surely you'd
> also want to be able to read the data from the new location?
>

Because of the use case that O_TMPFILE implies: whatever users do before
the file is linked is expected to be private and unexposed to others.
You cannot say the same about making modifications to an already linked
file.

I don't mind adding LINK_ATOMIC, and then it will obviously be respected
also when linking files with nlink > 0.

> > Taking this another step forward, if we agree on the language
> > I used above to describe the expected behavior, then we can
> > add an opt-in RENAME_ATOMIC flag to provide the same
> > semantics and document it in the same manner (this functionality
> > is needed for directories and non regular files) and all there is left
> > is the fun part of choosing the flag name ;-)
>
> Will have to think about /that/ some more.
>

For your amusement, here are some suggestions that I had that folks here
did not like: RENAME_BARRIER, RENAME_ORDERED.

Thanks,
Amir.
end of thread, other threads:[~2019-05-09 15:47 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein
2019-05-02 16:12 ` Amir Goldstein
2019-05-02 17:11 ` Vijay Chidambaram
2019-05-02 17:39 ` Amir Goldstein
2019-05-03  2:30 ` Theodore Ts'o
2019-05-03  3:15 ` Vijay Chidambaram
2019-05-03  9:45 ` Theodore Ts'o
2019-05-04  0:17 ` Vijay Chidambaram
2019-05-04  1:43 ` Theodore Ts'o
2019-05-07 18:38 ` Jan Kara
2019-05-03  4:16 ` Amir Goldstein
2019-05-03  9:58 ` Theodore Ts'o
2019-05-03 14:18 ` Amir Goldstein
2019-05-09  2:36 ` Dave Chinner
2019-05-09  1:43 ` Dave Chinner
2019-05-09  2:20 ` Theodore Ts'o
2019-05-09  2:58 ` Dave Chinner
2019-05-09  3:31 ` Theodore Ts'o
2019-05-09  5:19 ` Darrick J. Wong
2019-05-09  5:02 ` Vijay Chidambaram
2019-05-09  5:37 ` Darrick J. Wong
2019-05-09 15:46 ` Theodore Ts'o
2019-05-09  8:47 ` Amir Goldstein
2019-05-02 21:05 ` Darrick J. Wong
2019-05-02 22:19 ` Amir Goldstein