Date: Fri, 3 May 2019 05:58:46 -0400
From: Theodore Ts'o
To: Amir Goldstein
Cc: Vijay Chidambaram, lsf-pc@lists.linux-foundation.org, Dave Chinner,
	Darrick J. Wong, Jan Kara, linux-fsdevel, Jayashree Mohan,
	Filipe Manana, Chris Mason, lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Message-ID: <20190503095846.GE23724@mit.edu>
References: <20190503023043.GB23724@mit.edu>

On Fri, May 03, 2019 at 12:16:32AM -0400, Amir Goldstein wrote:
> OK, we can leave that one for later, although I am not sure what the
> concern is.  If we are able to agree on and document a LINK_ATOMIC
> flag, what would be the downside of documenting a RENAME_ATOMIC flag
> with the same semantics?  After all, as I said, this is what many
> users already expect when renaming a temp file (as the ext4
> heuristics prove).

The problem is: if the "temp file" has been hard-linked into 1000
different directories, does the rename() have to guarantee that the
changes to all 1000 directories have been persisted to disk?  And that
the parent directories of those 1000 directories have *all* been
persisted as well, all the way up to the root?

With the O_TMPFILE-plus-linkat case, we know the inode hasn't been
hard-linked into any other directory, and mercifully directories have
only one parent directory, so we only have to ensure that a single
chain of directory inodes, all the way up to the root, has been
persisted.  (Today's version of that pattern is sketched below.)
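For reference, a minimal sketch of the O_TMPFILE + linkat() pattern as
it exists today; LINK_ATOMIC is only a proposal, so the sketch still
needs the explicit fsync() of the directory that such a flag would
fold in (error handling abbreviated):

#define _GNU_SOURCE		/* for O_TMPFILE */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Create dir/name with contents buf[0..len) so that, after a crash,
 * the name either does not exist or refers to the fully written file. */
static int tmpfile_link_pattern(const char *dir, const char *name,
				const void *buf, size_t len)
{
	char proc_path[64];
	int dfd, fd;

	dfd = open(dir, O_DIRECTORY | O_RDONLY);
	if (dfd < 0)
		return -1;

	/* Anonymous inode: no directory entry exists yet. */
	fd = openat(dfd, ".", O_TMPFILE | O_WRONLY, 0644);
	if (fd < 0) {
		close(dfd);
		return -1;
	}

	if (write(fd, buf, len) != (ssize_t)len)
		goto fail;

	/* Persist the file's data and its own metadata first. */
	if (fsync(fd) < 0)
		goto fail;

	/* Materialize the name; going through /proc/self/fd avoids the
	 * CAP_DAC_READ_SEARCH requirement of AT_EMPTY_PATH. */
	snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
	if (linkat(AT_FDCWD, proc_path, dfd, name, AT_SYMLINK_FOLLOW) < 0)
		goto fail;

	/* Today: persist the new directory entry explicitly.  This is
	 * the step a LINK_ATOMIC-style flag might subsume. */
	if (fsync(dfd) < 0)
		goto fail;

	close(fd);
	close(dfd);
	return 0;

fail:
	close(fd);
	close(dfd);
	return -1;
}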
But I can already imagine someone complaining that if, due to bind
mounts and 1000 mount namespaces, there is some *other* directory
pathname which could be used to reach said "tmpfile", then all of the
parent directories along *those* paths, even if they span a dozen
different file systems, also have to be persisted, thanks to sloppy
drafting of what the atomicity rules happen to be.

If we are only guaranteeing the persistence of the containing
directories of the source and destination files, that's pretty easy.
But then the consistency rules need to state this *explicitly*.  Some
of the handwaving definitions of what would be guaranteed.... scare
me.

					- Ted

P.S.  If we were going to do this, we'd probably want to simply define
a flag such as AT_FSYNC, using the strict POSIX definition of fsync:
as a result of the linkat() or renameat(), the file in question, and
its associated metadata, are guaranteed to be persisted to disk.  No
guarantees would be made about any other inode's metadata, regardless
of when those changes were made.

If people really want "global barrier" semantics, then perhaps it
would be better to define a barrierfs(2) system call that works like
syncfs(2) --- it applies to the whole file system, and guarantees that
all changes made *before* barrierfs(2) will be visible if any changes
made *after* barrierfs(2) are visible.  Amir, you used "global
ordering" a few times; if you really need that, let's define a new
system call which guarantees it.  Maybe some of the research proposals
for exotic changes to SSD semantics, etc., would allow barrierfs(2) to
be implemented more efficiently than syncfs(2).  But let's make this
explicit, as opposed to some magic guarantee that falls out as a side
effect of an fsync(2) call on a single inode.
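P.P.S.  For concreteness, here is a sketch of how barrierfs(2) might
look from userspace.  The system call is hypothetical (it does not
exist in any kernel); the sketch only illustrates the ordering
semantics argued for above:

#include <unistd.h>

/* HYPOTHETICAL: barrierfs() does not exist.  It would apply to the
 * whole file system containing fd, like syncfs(2), but guarantee only
 * ordering: writes issued before the barrier become visible after a
 * crash if any write issued after the barrier is visible. */
extern int barrierfs(int fd);

/* logfd and datafd are assumed to be on the same file system, since
 * the barrier, like syncfs(2), is per file system. */
static void ordered_update(int logfd, int datafd)
{
	static const char intent[] = "intent: update record 42\n";
	static const char update[] = "record 42: new value\n";

	write(logfd, intent, sizeof(intent) - 1);

	/* Order, don't force: unlike fsync(logfd), this would not wait
	 * for the intent record to reach stable storage; it would only
	 * forbid any later write from being persisted ahead of it. */
	barrierfs(logfd);

	write(datafd, update, sizeof(update) - 1);
}

The contrast with fsync(2) is the point: a barrier orders without
waiting, so the common "write the intent record, then write the data"
pattern would not have to stall on the storage device.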