Date: Wed, 8 May 2019 22:20:13 -0400
From: "Theodore Ts'o"
To: Dave Chinner
Cc: Amir Goldstein, Vijay Chidambaram, lsf-pc@lists.linux-foundation.org,
	"Darrick J. Wong", Jan Kara, linux-fsdevel, Jayashree Mohan,
	Filipe Manana, Chris Mason, lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Message-ID: <20190509022013.GC7031@mit.edu>
References: <20190503023043.GB23724@mit.edu> <20190509014327.GT1454@dread.disaster.area>
In-Reply-To: <20190509014327.GT1454@dread.disaster.area>

On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
>
> .... the whole point of SOMC is that it allows filesystems to avoid
> dragging external metadata into fsync() operations /unless/ there's
> a user visible ordering dependency that must be maintained between
> objects.  If all you are doing is stabilising file data in a stable
> file/directory, then independent, incremental journaling of the
> fsync operations on that file fit the SOMC model just fine.

Well, that's not what Vijay's crash consistency guarantees state.
They guarantee quite a bit more than what you've written above, which
is my concern.

> > P.P.S.  One of the other discussions that did happen during the main
> > LSF/MM File system session, and for which there was general agreement
> > across a number of major file system maintainers, was a fsync2()
> > system call which would take a list of file descriptors (and flags)
> > that should be fsync'ed.
>
> Hmmmm, that wasn't on the agenda, and nobody has documented it as
> yet.

It came up as a suggested alternative during Ric Wheeler's "Async all
the things" session.  The problem he was trying to address was
programs (perhaps userspace file servers) that need to fsync a large
number of files at the same time.
The problem with his suggested solution (which we already have for
AIO and io_uring) of having the program issue a large number of
asynchronous fsync's and then wait for them all is that the back-end
interface is a work queue, so there is a lot of effective
serialization that takes place.

> > The semantics would be that when the
> > fsync2() successfully returns, all of the guarantees of fsync() or
> > fdatasync() requested by the list of file descriptors and flags would
> > be satisfied.  This would allow file systems to more optimally fsync a
> > batch of files, for example by implementing data integrity writebacks
> > for all of the files, followed by a single journal commit to guarantee
> > persistence for all of the metadata changes.
>
> What happens when you get writeback errors on only some of the fds?
> How do you report the failures and what do you do with the journal
> commit on partial success?

Well, one approach would be to pass back the errors in the structure.
Say something like this:

    int fsync2(int len, struct fsync_req reqs[]);

    struct fsync_req {
        int fd;        /* IN */
        int flags;     /* IN */
        int retval;    /* OUT */
    };

As far as what to do with the journal commit on partial success, there
are no atomic, "all or nothing" guarantees with this interface.  It is
implementation specific whether there would be one or more file system
commits necessary before fsync2() returned.

> Of course, this ignores the elephant in the room: applications can
> /already do this/ using AIO_FSYNC and have individual error status
> for each fd.  Not to mention that filesystems already batch
> concurrent fsync journal commits into a single operation.  I'm not
> seeing the point of a new syscall to do this right now....

But it doesn't work very well, because the implementation uses a
workqueue.  Sure, you could create N worker threads for N fd's that
you want to fsync, and then the file system can batch the fsync
requests.
But wouldn't it be so much simpler to give the file system a list of
fd's that should be fsync'ed?  That way you don't have to do lots of
work to split up the work so it can be submitted in parallel, only to
have the file system batch up all of the requests being issued from
all of those kernel threads.

So yes, it's identical to the interfaces we already have.  Just like
select(2), poll(2) and epoll(2) are functionally identical...

					- Ted