Date: Thu, 2 May 2019 22:30:43 -0400
From: "Theodore Ts'o"
To: Amir Goldstein
Cc: Vijay Chidambaram, lsf-pc@lists.linux-foundation.org, Dave Chinner, "Darrick J.
Wong", Jan Kara, linux-fsdevel, Jayashree Mohan, Filipe Manana, Chris Mason, lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Message-ID: <20190503023043.GB23724@mit.edu>
X-Mailing-List: linux-fsdevel@vger.kernel.org

On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > The expectation is that applications will use this, and then rename
> > the O_TMPFILE file over the original file. Is this correct? If so, is
> > there also an implied barrier between O_TMPFILE metadata and the
> > rename?

In the case of O_TMPFILE, the file can be brought into the namespace
using something like:

    linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW);

It does not use rename.

To be clear, this discussion happened in the hallway, and it's not
clear it had everyone's full support.  After our discussion, some of
us came up with an example where forcing a call to
filemap_write_and_wait() before the linkat(2) might *not* be the right
thing.  Suppose some browser wanted to wait until a file was fully
downloaded before letting it appear in the directory --- but what was
being downloaded was a 4 GiB DVD image (say, a distribution's install
media).  If the download was done using O_TMPFILE followed by
linkat(2), forcing the data blocks to disk before allowing the
linkat(2) to proceed might not be what the application or the user
would want.

So it might be that we will need to add a linkat flag to indicate that
we want the kernel to call filemap_write_and_wait() before making the
metadata changes in linkat(2).

> For replacing an existing file with another the same could be
> achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname,
> RENAME_ATOMIC).
> There is no need to create the tempname
> file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC
> flag would be redundant.
>
> RENAME_ATOMIC flag is needed because directories and non regular
> files cannot be created using O_TMPFILE.

I think there's much less consensus about this.  Again, most of this
happened in a hallway conversation.

> > Where does this land us on the discussion about documenting
> > file-system crash-recovery guarantees? Has that been deemed not
> > necessary?
>
> Can't say for sure.
> Some filesystem maintainers hold on to the opinion that they do
> NOT wish to have a document describing existing behavior of specific
> filesystems, which is large parts of the document that your group posted.
>
> They would rather that only the guaranties of the APIs are documented
> and those should already be documented in man pages anyway - if they
> are not, man pages could be improved.
>
> I am not saying there is no room for a document that elaborates on those
> guaranties. I personally think that could be useful and certainly think that
> your group's work for adding xfstest coverage for API guaranties is useful.

Again, here is my concern.  If we promise that ext4 will always obey
Dave Chinner's SOMC model, it would forever rule out Daejun Park and
Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
latency of fsync system call"[1] published in Usenix ATC 2017.

[1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's because iJournaling provides a fast fsync() using an
incremental journal.  This fast fsync would cause the metadata
associated with the inode being fsync'ed to be persisted after the
crash --- ahead of metadata changes to other, potentially completely
unrelated files, which would *not* be persisted after the crash.
Fine-grained journalling would provide all of the guarantees required
by POSIX, and applications that only care about the single file being
fsync'ed would be happy.
BUT, it violates the proposed crash consistency guarantees.

So if the crash consistency guarantees forbid future innovations
where applications might *want* a fast fsync() that doesn't drag
unrelated inodes into the persistence guarantees, is that really what
we want?  Do we want to forever rule out various academic
investigations such as Park and Shin's because "it violates the crash
consistency recovery model"?  Especially if some applications don't
*need* the crash consistency model?

						- Ted

P.S.  I feel especially strongly about this because I'm working with
an engineer currently trying to implement a simplified version of Park
and Shin's proposal...  So this is not a hypothetical concern of mine.
I'd much rather not invalidate all of this engineer's work to date,
especially since there is a published paper demonstrating that for
some workloads (such as sqlite), this approach can be a big win.

P.P.S.  One of the other discussions that did happen during the main
LSF/MM File system session, and for which there was general agreement
across a number of major file system maintainers, was about a fsync2()
system call which would take a list of file descriptors (and flags)
that should be fsync'ed.  The semantics would be that when the
fsync2() successfully returns, all of the guarantees of fsync() or
fdatasync() requested by the list of file descriptors and flags would
be satisfied.  This would allow file systems to fsync a batch of files
more efficiently, for example by performing data integrity writebacks
for all of the files, followed by a single journal commit to guarantee
persistence for all of the metadata changes.
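
To make the fsync2() semantics above a bit more concrete, here is a
minimal userspace sketch of the same contract.  fsync2() does not
exist; the names fsync_batch(), struct sync_req and SYNC_REQ_DATAONLY
are made up for illustration and are not a proposed ABI.  The loop
gives the same guarantees on successful return, but none of the
batching benefit, since each fsync()/fdatasync() call may trigger its
own journal commit.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical flag: request fdatasync() semantics for this descriptor. */
#define SYNC_REQ_DATAONLY 0x1u

/* Hypothetical request record: one descriptor plus its flags. */
struct sync_req {
    int fd;
    unsigned int flags;
};

/*
 * Userspace stand-in for the fsync2() idea: sync each descriptor in turn.
 * Returns 0 once every request has been satisfied.  On the first failure
 * it returns -1 with errno set by fsync()/fdatasync(), and stores the index
 * of the failing request in *failed if failed is non-NULL.
 */
int fsync_batch(const struct sync_req *reqs, size_t n, size_t *failed)
{
    assert(reqs != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        int rc = (reqs[i].flags & SYNC_REQ_DATAONLY)
                     ? fdatasync(reqs[i].fd)
                     : fsync(reqs[i].fd);
        if (rc != 0) {
            if (failed != NULL)
                *failed = i;
            return -1;
        }
    }
    return 0;
}
```

An application such as a database could fill the array with its data
file and its journal file and make one call instead of two.  With a
real system call, the file system would be free to write back the data
for every descriptor first and then issue a single journal commit.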