From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.8 required=3.0 tests=DKIM_SIGNED,DKIM_VALID, DKIM_VALID_AU,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_PASS, URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03DE8C10F13 for ; Thu, 11 Apr 2019 11:08:54 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by mail.kernel.org (Postfix) with ESMTP id AF7812133D for ; Thu, 11 Apr 2019 11:08:53 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="k+8y/fv1" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726800AbfDKLIw (ORCPT ); Thu, 11 Apr 2019 07:08:52 -0400 Received: from mail-yb1-f195.google.com ([209.85.219.195]:37322 "EHLO mail-yb1-f195.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726121AbfDKLIv (ORCPT ); Thu, 11 Apr 2019 07:08:51 -0400 Received: by mail-yb1-f195.google.com with SMTP id p134so2039981ybc.4; Thu, 11 Apr 2019 04:08:51 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=kwErdtBkys8pibDvhxupAtkWfYwJFOpzAPiUD0/sZFw=; b=k+8y/fv1AfZJZhl+QQF35PcKOiS/m1mARd+1I8sJDCLLzn0shR0I1eei7cgFQPHbYV aH3GSfW6k88uwji7lKFHonVa2kvvM5dlH9O4Vs6rRh9TOpGwPB6jC5ETHjUciJ4I+uqt T6lLhafqijXeTm/N/kCN2ZBKIleuPeg/0Cc3pAGckpSyZ1tDzAps8HN03vSZ8m2K5xzR tu/3e0qSiWk2xnmb5cSRC0a5cgRW7CIAZsV3My36FzPgO7hHUpOaElJq5tG1xO2PvD9P qKkBuZf7UfBTZAgjyewjMjaJBydyGFE7xqBkJYDddZaAwb9yL3N3zypaEFbueU0irO64 fBUA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=kwErdtBkys8pibDvhxupAtkWfYwJFOpzAPiUD0/sZFw=; b=Lx+BSthjAJGtn+SN5pSyzGGAuQwz4I7fLro1xitF/BO6Sccx+gCMv7J5lQJWgBw6Tk GkH1v0sli6kqMsPqvASBekGEw1V7p0k2BciP1vNelxYG2sfe93X4mH0MRzvx1z79sGFP mxHyFBocgzO6xaNk1X8d2sETfTgC4rCbuPV314RtT6FmuVzNIJ90JWGBY3ByUS3PavJw PClrNNNWE0FLi7YZtm2NsB7jjJ5gAiNHGSgCBriKOHo52UyPjG6gmbQBp/vlJEdc3oTy qe6lbIFNa/soP+DjIDCpOYFHDUE4ZEUFWpxz0+Adjab3scjt2XrLGdYHdtFS9SyWeMhV irrg== X-Gm-Message-State: APjAAAVT20u5s1fjH5OgXubhwF1814I00GgIVabTOIZyRFtcIdWxRxqT uYGPtftHGdVEJzvQGUiEM0rz8HAbUpklFQZhny4= X-Google-Smtp-Source: APXvYqw4eunRE0a+A46L7dsJhc4/FCMZl/5LTKi6nh8uBnAr2jrRAHRUfb1v+CzP2m0tbCjoKERZ//OimBnhqqOMyUg= X-Received: by 2002:a25:6c88:: with SMTP id h130mr39511746ybc.112.1554980930438; Thu, 11 Apr 2019 04:08:50 -0700 (PDT) MIME-Version: 1.0 References: <20190409114922.30095-1-amir73il@gmail.com> <20190409154303.GB1426@quack2.suse.cz> <20190410091012.GH1426@quack2.suse.cz> <20190411101639.GB1119@quack2.suse.cz> In-Reply-To: <20190411101639.GB1119@quack2.suse.cz> From: Amir Goldstein Date: Thu, 11 Apr 2019 14:08:39 +0300 Message-ID: Subject: Re: [PATCH] fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback To: Jan Kara Cc: Dave Chinner , Al Viro , linux-fsdevel , linux-kernel , Andrew Morton , Anna Schumaker , Josef Bacik Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 11, 2019 at 1:16 PM Jan Kara wrote: > > On Wed 10-04-19 13:42:23, Amir Goldstein wrote: > > On Wed, Apr 10, 2019 at 12:10 PM Jan Kara wrote: > > > > > > On Tue 09-04-19 21:01:54, Amir Goldstein wrote: > > > > On Tue, Apr 9, 2019 at 6:43 PM Jan Kara wrote: > > > > > On Tue 09-04-19 14:49:22, Amir Goldstein wrote: > > > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE > > > > > > writeback") claims that sync_file_range(2) syscall was "created for > > > > > > userspace to be able to issue background writeout and so waiting for > > > > > > in-flight IO is undesirable there" and changes the writeback (back) to > > > > > > WB_SYNC_NONE. > > > > > > > > > > > > This claim is only partially true. Is is true for users that use the flag > > > > > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was > > > > > > the reason for changing to WB_SYNC_NONE writeback. > > > > > > > > > > > > However, that claim is not true for users that use that flag combination > > > > > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}. Those users explicitly > > > > > > requested to wait for in-flight IO as well as to writeback of dirty > > > > > > pages. sync_file_range(2) man page describes this flag combintaion as > > > > > > "write-for-data-integrity operation", although some may argue against > > > > > > this definition. > > > > > > > > > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT and use > > > > > > the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL > > > > > > writeback, to perform the range sync request. > > > > > > > > > > > > One intended use case for this API is creating a dependency between > > > > > > a new file's content and its link into the namepsace without requiring > > > > > > journal commit and flushing of disk volatile caches: > > > > > > > > > > > > fd = open(".", O_TMPFILE | O_RDWR); > > > > > > write(fd, newdata, count); > > > > > > sync_file_range(fd, SYNC_FILE_RANGE_WRITE_AND_WAIT, 0, 0); > > > > > > linkat(AT_EMPTY_PATH, fd, AT_FDCWD, newfile); > > > > > > > > > > > > For many local filesystems, ext4 and xfs included, the sequence above > > > > > > will guaranty that after crash, either 'newfile' exists with 'newdata' > > > > > > content or 'newfile' does not exists. For some applications, this > > > > > > guaranty is strong enough and the effects of sync_file_range(2), even > > > > > > with WB_SYNC_ALL, are far less intrusive to other writers in the system > > > > > > than the effects of fdatasync(2). > > > > > > > > > > I agree that this paragraph is true but I don't want any userspace program > > > > > rely on this. We've been through this when ext3 got converted to ext4 and > > > > > it has caused a lot of complaints. Ext3 had provided rather strong data vs > > > > > metadata ordering due to the way journalling was implemented. Then ext4 > > > > > came, implemented delayed allocation and somewhat changed how journalling > > > > > works and suddently userspace programmers were caught by surprise their code > > > > > working by luck stopped working. And they were complaining that when their > > > > > code worked without fsync(2) before, it should work after it as well. So it > > > > > took a lot of explaining that their applications are broken and Ted has > > > > > also implemented some workarounds to make at least the most common mistakes > > > > > silently fixed by the kernel most of the time. > > > > > > > > > > So I'm against providing 90% data integrity guarantees because there'll be > > > > > many people who'll think they can get away with it and then complain when > > > > > they won't once our implementation changes. > > > > > > > > > > Rather I'd modify the manpage to not say that SYNC_FILE_RANGE_WAIT_BEFORE > > > > > | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER is a > > > > > write-for-data-integrity. > > > > > > > > OK. I do agree that manpage is misleading and that the language describing > > > > this flag combination should be toned down. I do not object to adding more > > > > and more disclaimers about this API not being a data integrity API. > > > > > > I don't think we need more disclaimers, I'd just reword that > > > "write-for-data-integrity" bit. > > > > > > > But the requirement I have is a real life workload and fdatasync(2) is not at > > > > all a viable option when application is creating many files per second. > > > > > > > > I need to export the functionality of data writeback to userspace. > > > > Objecting to expose this functionality via the interface that has been > > > > documented to expose it for years and used to expose it in the > > > > past (until the Fixes commit) is a bit weird IMO, even though I do > > > > completely relate to your concern. > > > > > > > > I don't mind removing the "intended use case" part of commit message > > > > if that helps reducing the chance that people misunderstand > > > > the guaranties of the API. > > > > > > > > Another thing I could do is introduce a new flag for sync_file_range() > > > > that will really mean what I want it to mean (all data will be written back > > > > and all related inode metadata changes will be committed to journal > > > > before the next inode metadata change is committed). For the sake of > > > > argument let's call it SYNC_FILE_RANGE_DATA_DEPENDENCY. > > > > > > So there are two parts to your requirements: > > > > > > 1) Data writeback for all file pages has been submitted (+ completed). > > > 2) What happens with related metadata. > > > > > > sync_file_range() is fine for 1) although I agree that WB_SYNC_NONE mode > > > can still end up skipping some pages due to unrelated writeback / lock > > > contention so having sync_file_range() mode doing WB_SYNC_ALL writeback > > > makes some sense to me. > > > > > > > OK. does it make enough sense for you to ACK my patch if the > > commit message was toned down to: > > > > Commit 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE > > writeback") claims that sync_file_range(2) syscall was "created for > > userspace to be able to issue background writeout and so waiting for > > in-flight IO is undesirable there" and changes the writeback (back) to > > WB_SYNC_NONE. > > > > This claim is only partially true. It is true for users that use the flag > > SYNC_FILE_RANGE_WRITE by itself, as does PostgreSQL, the user that was > > the reason for changing to WB_SYNC_NONE writeback. > > > > However, that claim is not true for users that use that flag combination > > SYNC_FILE_RANGE_{WAIT_BEFORE|WRITE|_WAIT_AFTER}. > > Those users explicitly requested to wait for in-flight IO as well as to > > writeback of dirty pages. > > > > Re-brand that flag combination as SYNC_FILE_RANGE_WRITE_AND_WAIT > > and use the helper filemap_write_and_wait_range(), that uses WB_SYNC_ALL > > writeback, to perform the range sync request. > > > > Fixes: 23d0127096cb ("fs/sync.c: make sync_file_range(2) use WB_SYNC_NONE") > > Signed-off-by: Amir Goldstein > > Yes, I'd be fine with that. Great. I'll send V2 with your ACK. I guess since Andrew merged the Fixes commit, I'll point to patch at him as well. > > > > What happens with 2) is pretty much up to the filesystem. You are speaking > > > about the journal but that is very much a filesystem specific detail. > > > What if the filesystem doesn't have journal at all? E.g. ext2? We cannot > > > really expose a generic API that silently fails to work based on fs > > > internal details. > > > > > > I guess I fail to see what kind of guarantees from the fs you're looking > > > for? When you speak about journal, I guess you're looking for some kind of > > > ordering but it's hard to be sure about exact requirements. So can you > > > please spell those out? Once we know requirements, we can speak about the > > > API. > > > > > > > The requirement is quite easy to explain with an example, less easy to > > generalize. I illustrated the requirement with the example: > > O_TMPFILE;write();DATA_BARRIER;linkat() > > > > The requirement of the DATA_BARRIER operation is that if the effects > > of linkat() are observed after crash, then the effects of write() must also > > be observed after crash. > > OK, understood and I agree this makes sense to ask for... And you're not > the only one. Application programmers often *assume* that if they do > write(file_new), rename(file_new, file), they will either see old or new > file contents after a crash. They are wrong but ext4 has heuristics to > detect such behavior and at least initiate writeback on rename to make time > window when data loss can happen smaller. Because there was lot of > controversy around this when users were transitioning from ext3 (where this > worked) to ext4. If there was API with decent performance to achieve this, > I guess at least some applications would start using it. > > > At this point one may say that fdatasync(2) meets the requirements > > of a DATA_BARRIER operation, so I need to introduce a non-requirement. > > The non-requirement is that DATA_BARRIER operation does not need > > to be a DATA_INTEGRITY operation, meaning that after crash, there > > is no requirement to be able to find or access the written data. > > Well, I guess your other real requirement is good performance ;) How that's > achieved is internal to the filesystem. > > > Now it should be clear to filesystem developers why the DATA_BARRIER > > requirement is easier to meet with far less performance penalty than > > fdatasync(2) on xfs and ext4. > > Well, it won't be free either but yes, you can save a transaction commit > / disk cache flush compared to fdatasync(2). > > > Here is a clumsy attempt to generalize the guaranty of such operation, > > without referring to internal fs implementation details: > > "Make sure that all file data hits persistent storage before the next > > file metadata change hits persistent storage". > > > > Does that make sense to you? > > Is the requirement clear? > > > > How does this sound for an API: > > sync_file_range(fd, SYNC_FILE_RANGE_WRITE_DATA_BARRIER, 0, 0); > > > > which individual filesystems could support or EOPNOTSUPP > > via the ->fsync() file_operation? > > Yes, this sounds workable to me but there are also other options - e.g. > somehow pass the information to linkat() / rename() / whatever that the > ordering is required. This might be easier for some filesystems to > implement. I like this idea. rename() and linkat() are definitely the most probable metadata operation that needs this barrier. But I wouldn't want to limit this capability to rename() and linkat() alone. Overlayfs for example, could use a similar data write barrier before remove xattr (when "hydrating" a metacopy upper). > So it might be good to make a short session at LSF/MM about this > so that developers of other filesystems can say what interface they would > be able to implement... Yeh, its on my short list for things to discuss. [CC: Anna & Josef] Thanks, Amir.