From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-6.1 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8A210C47082 for ; Mon, 31 May 2021 20:35:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 64361611CB for ; Mon, 31 May 2021 20:35:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232182AbhEaUgs (ORCPT ); Mon, 31 May 2021 16:36:48 -0400 Received: from us-smtp-delivery-124.mimecast.com ([216.205.24.124]:58163 "EHLO us-smtp-delivery-124.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231825AbhEaUgm (ORCPT ); Mon, 31 May 2021 16:36:42 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1622493301; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=eIPtoH3V1uRSKTYJIeUQjqM9k7RxE6f0M/KOCbORxjo=; b=R+BfCx6wWZBBcsQ++WCP9qx1LX1glOlH2mRtb6k5egbEKUFDOzhvyhXjQTT6peJSpAZqty 3vAnDt2m0n7j5uiob3upi57a1q/RkpFdNZJc2TE+9hG7I+TgdUUx2idXtnPJeAO6lTil7P N1U3kcQP1j29vCnMLCuHQojSUAYQK9Y= Received: from mail-wm1-f70.google.com (mail-wm1-f70.google.com [209.85.128.70]) (Using TLS) by relay.mimecast.com with ESMTP id us-mta-407-epLFk5vxPtWjtXUcUKvIgQ-1; Mon, 31 May 2021 16:32:41 -0400 X-MC-Unique: epLFk5vxPtWjtXUcUKvIgQ-1 Received: by mail-wm1-f70.google.com with SMTP id r1-20020a05600c35c1b029018ec2043223so356433wmq.7 for ; Mon, 31 May 2021 13:32:40 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=eIPtoH3V1uRSKTYJIeUQjqM9k7RxE6f0M/KOCbORxjo=; b=nGA2RGmtgMndNAX8PeSOzSuGJPN+EItWbWggsghAlPidhIsvjkGi1qgDe56r5bwGM/ Jgj5fkZqJn4jEz8H7YL+3o7/zm+lXje3A5T4ADb9Ig9tqBvpZi58liROK4VX5mTNfq8I rwpFDh65lKA6qf+VJRpsddK4S6P1cjZ3ZKCkmnhPpZwyBr2BS3v7bidL0cOhqrEwipBV B8g2Vb5jyTHhroBLHQYzGdOnEe+L0fWgQdh2JreH8EkFd4RtVV7IIDIT6u92RUCl68hM kLhgVYkKiri18I2r6S701lImtzvdU+LvSF+LLm3vTK/X1iOe8ch5r9MrTWCQ/lJlWEvM N1Ig== X-Gm-Message-State: AOAM530a+8YTjaGXh/yIJKTmQJtIR+a6JBkLmDjUPgh3i1oqy9P4NI2w hhihIy42g7SDCiZeONVVimw9piIym367wat0P/epVAulqFYfhKx+tNiK42HylVfZknLkBLpAJdI NKD47//123StbulSfpiA7LxXd/nRPtPinE3SAFoK2 X-Received: by 2002:a7b:c19a:: with SMTP id y26mr742489wmi.132.1622493159720; Mon, 31 May 2021 13:32:39 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwuUP4NISItoptr6oKYfry3Adwhj9c5n6SHwj4ZPOi4fGsXEBkf39U4RUlwoOP+qpxNp07En9r9I5ue0x82AUU= X-Received: by 2002:a7b:c19a:: with SMTP id y26mr742477wmi.132.1622493159526; Mon, 31 May 2021 13:32:39 -0700 (PDT) MIME-Version: 1.0 References: <20210531105606.228314-1-agruenba@redhat.com> In-Reply-To: From: Andreas Gruenbacher Date: Mon, 31 May 2021 22:32:28 +0200 Message-ID: Subject: Re: [GIT PULL] gfs2 fixes for v5.13-rc5 To: Linus Torvalds Cc: linux-fsdevel , cluster-devel , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, May 31, 2021 at 6:30 PM Linus Torvalds wrote: > [ Adding fsdevel, because this is not limited to gfs2 ] > > On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher > wrote: > > > > Andreas Gruenbacher (2): > > gfs2: Fix mmap locking for write faults > > This is bogus. > > I've pulled it, but this is just wrong. > > A write fault on a mmap IS NOT A WRITE to the filesystem. > > It's a read. > > Yes, it will later then allow writes to the page, but that's entirely > immaterial - because the write is going to happen not at fault time, > but long *after* the fault, and long *after* the filesystem has > installed the page. > > The actual write will happen when the kernel returns from the user space. > > And no, the explanation in that commit makes no sense either: > > "When a write fault occurs, we need to take the inode glock of the underlying > inode in exclusive mode. Otherwise, there's no guarantee that the > dirty page > will be written back to disk" > > the thing is, FAULT_FLAG_WRITE only says that the *currently* pending > access that triggered the fault was a write. That's entirely > irrelevant to a filesystem, because > > (a) it might be a private mapping, and a write to a page will cause a > COW by the VM layer, and it's not actually a filesystem write at all > > AND > > (b) if it's a shared mapping, the first access that paged things in > is likely a READ, and the page will be marked writable (because it's a > shared mapping!) and subsequent writes will not cause any faults at > all. > > In other words, a filesystem that checks for FAULT_FLAG_WRITE is > _doubly_ wrong. It's absolutely never the right thing to do. It > *cannot* be the right thing to do. > > And yes, some other filesystems do this crazy thing too. If your > friends jumped off a bridge, would you jump too? > > The xfs and ext3/ext4 cases are wrong too - but at least they spent > five seconds (but no more) thinking about it, and they added the check > for VM_SHARED. So they are only wrong for reason (b) > > But wrong is wrong. The new code is not right in gfs2, and the old > code in xfs/ext4 is not right either. > > Yeah, yeah, you can - and people do - do things like "always mark the > page readable on initial page fault, use mkwrite to catch when it > becomes writable, and do timestamps carefully, at at least have full > knowledge of "something might become dirty" > > But even then it is ENTIRELY BOGUS to do things like write locking. > > Really. > > Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE > LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think > it protects anything at all, you're wrong. > > So don't do write locking. At an absolute most, you can do things like > > - update file times (and honestly, that's quite questionable - > because again - THE WRITE HASN'T HAPPENED YET - so any tests that > depend on exact file access times to figure out when the last write > was done is not being helped by your extra code, because you're > setting the WRONG time. > > - set some "this inode will have dirty pages" flag just for some > possible future use. But honestly, if this is about consistency etc, > you need to do it not for a fault, but across the whole mmap/munmap. > > So some things may be less bogus - but still very very questionable. > > But locking? Bogus. Reads and writes aren't really any different from > a timestamp standpoint (if you think you need to mtime for write > accesses, you should do atime for reads, so from a filesystem > timestamp standpoint read and write faults are exactly the same - and > both are bogus, because by definition for a mmap both the reads and > the writes can then happen long long long afterwards, and repeatedly). > > And if that "set some flag" thing needs a write lock, but a read > doesn't, you're doing something wrong and odd. > > Look at the VM code. The VM code does this right. The mmap_sem is > taken for writing for mmap and for munmap. But a page fault is always > a read lock, even if the access that caused the page fault is a write. > > The actual real honest-to-goodness *write* happens much later, and the > only time the filesystem really knows when it is done is at writeback > time. Not at fault time. So if you take locks, logically you should > take them when the fault happens, and release them when the writeback > is done. > > Are you doing that? No. So don't do the write lock over the read > portion of the page fault. It is not a sensible operation. Thanks for the detailed explanation. I'll work on a fix. Andreas