All of lore.kernel.org
 help / color / mirror / Atom feed
From: Andreas Gruenbacher <agruenba@redhat.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	cluster-devel <cluster-devel@redhat.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [GIT PULL] gfs2 fixes for v5.13-rc5
Date: Mon, 31 May 2021 22:32:28 +0200	[thread overview]
Message-ID: <CAHc6FU51=5cNZX9hdDHj3AJ6fy4bK7nH1qwAi2m2wB45WPoq8Q@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com>

On Mon, May 31, 2021 at 6:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Adding fsdevel, because this is not limited to gfs2 ]
>
> On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > Andreas Gruenbacher (2):
> >       gfs2: Fix mmap locking for write faults
>
> This is bogus.
>
> I've pulled it, but this is just wrong.
>
> A write fault on a mmap IS NOT A WRITE to the filesystem.
>
> It's a read.
>
> Yes, it will later then allow writes to the page, but that's entirely
> immaterial - because the write is going to happen not at fault time,
> but long *after* the fault, and long *after* the filesystem has
> installed the page.
>
> The actual write will happen when the kernel returns from the user space.
>
> And no, the explanation in that commit makes no sense either:
>
>    "When a write fault occurs, we need to take the inode glock of the underlying
>     inode in exclusive mode.  Otherwise, there's no guarantee that the
> dirty page
>     will be written back to disk"
>
> the thing is, FAULT_FLAG_WRITE only says that the *currently* pending
> access that triggered the fault was a write. That's entirely
> irrelevant to a filesystem, because
>
>  (a) it might be a private mapping, and a write to a page will cause a
> COW by the VM layer, and it's not actually a filesystem write at all
>
> AND
>
>  (b) if it's a shared mapping, the first access that paged things in
> is likely a READ, and the page will be marked writable (because it's a
> shared mapping!) and subsequent writes will not cause any faults at
> all.
>
> In other words, a filesystem that checks for FAULT_FLAG_WRITE is
> _doubly_ wrong. It's absolutely never the right thing to do. It
> *cannot* be the right thing to do.
>
> And yes, some other filesystems do this crazy thing too. If your
> friends jumped off a bridge, would you jump too?
>
> The xfs and ext3/ext4 cases are wrong too - but at least they spent
> five seconds (but no more) thinking about it, and they added the check
> for VM_SHARED. So they are only wrong for reason (b)
>
> But wrong is wrong. The new code is not right in gfs2, and the old
> code in xfs/ext4 is not right either.
>
> Yeah, yeah, you can - and people do - do things like "always mark the
> page readable on initial page fault, use mkwrite to catch when it
> becomes writable, and do timestamps carefully, at at least have full
> knowledge of "something might become dirty"
>
> But even then it is ENTIRELY BOGUS to do things like write locking.
>
> Really.
>
> Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE
> LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think
> it protects anything at all, you're wrong.
>
> So don't do write locking. At an absolute most, you can do things like
>
>  - update file times (and honestly, that's quite questionable -
> because again - THE WRITE HASN'T HAPPENED YET - so any tests that
> depend on exact file access times to figure out when the last write
> was done is not being helped by your extra code, because you're
> setting the WRONG time.
>
>  - set some "this inode will have dirty pages" flag just for some
> possible future use. But honestly, if this is about consistency etc,
> you need to do it not for a fault, but across the whole mmap/munmap.
>
> So some things may be less bogus - but still very very questionable.
>
> But locking? Bogus. Reads and writes aren't really any different from
> a timestamp standpoint (if you think you need to mtime for write
> accesses, you should do atime for reads, so from a filesystem
> timestamp standpoint read and write faults are exactly the same - and
> both are bogus, because by definition for a mmap both the reads and
> the writes can then happen long long long afterwards, and repeatedly).
>
> And if that "set some flag" thing needs a write lock, but a read
> doesn't, you're doing something wrong and odd.
>
> Look at the VM code. The VM code does this right. The mmap_sem is
> taken for writing for mmap and for munmap. But a page fault is always
> a read lock, even if the access that caused the page fault is a write.
>
> The actual real honest-to-goodness *write* happens much later, and the
> only time the filesystem really knows when it is done is at writeback
> time. Not at fault time. So if you take locks, logically you should
> take them when the fault happens, and release them when the writeback
> is done.
>
> Are you doing that? No. So don't do the write lock over the read
> portion of the page fault. It is not a sensible operation.

Thanks for the detailed explanation. I'll work on a fix.

Andreas


WARNING: multiple messages have this Message-ID (diff)
From: Andreas Gruenbacher <agruenba@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [GIT PULL] gfs2 fixes for v5.13-rc5
Date: Mon, 31 May 2021 22:32:28 +0200	[thread overview]
Message-ID: <CAHc6FU51=5cNZX9hdDHj3AJ6fy4bK7nH1qwAi2m2wB45WPoq8Q@mail.gmail.com> (raw)
In-Reply-To: <CAHk-=wj8EWr_D65i4oRSj2FTbrc6RdNydNNCGxeabRnwtoU=3Q@mail.gmail.com>

On Mon, May 31, 2021 at 6:30 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> [ Adding fsdevel, because this is not limited to gfs2 ]
>
> On Mon, May 31, 2021 at 12:56 AM Andreas Gruenbacher
> <agruenba@redhat.com> wrote:
> >
> > Andreas Gruenbacher (2):
> >       gfs2: Fix mmap locking for write faults
>
> This is bogus.
>
> I've pulled it, but this is just wrong.
>
> A write fault on a mmap IS NOT A WRITE to the filesystem.
>
> It's a read.
>
> Yes, it will later then allow writes to the page, but that's entirely
> immaterial - because the write is going to happen not at fault time,
> but long *after* the fault, and long *after* the filesystem has
> installed the page.
>
> The actual write will happen when the kernel returns from the user space.
>
> And no, the explanation in that commit makes no sense either:
>
>    "When a write fault occurs, we need to take the inode glock of the underlying
>     inode in exclusive mode.  Otherwise, there's no guarantee that the
> dirty page
>     will be written back to disk"
>
> the thing is, FAULT_FLAG_WRITE only says that the *currently* pending
> access that triggered the fault was a write. That's entirely
> irrelevant to a filesystem, because
>
>  (a) it might be a private mapping, and a write to a page will cause a
> COW by the VM layer, and it's not actually a filesystem write at all
>
> AND
>
>  (b) if it's a shared mapping, the first access that paged things in
> is likely a READ, and the page will be marked writable (because it's a
> shared mapping!) and subsequent writes will not cause any faults at
> all.
>
> In other words, a filesystem that checks for FAULT_FLAG_WRITE is
> _doubly_ wrong. It's absolutely never the right thing to do. It
> *cannot* be the right thing to do.
>
> And yes, some other filesystems do this crazy thing too. If your
> friends jumped off a bridge, would you jump too?
>
> The xfs and ext3/ext4 cases are wrong too - but at least they spent
> five seconds (but no more) thinking about it, and they added the check
> for VM_SHARED. So they are only wrong for reason (b)
>
> But wrong is wrong. The new code is not right in gfs2, and the old
> code in xfs/ext4 is not right either.
>
> Yeah, yeah, you can - and people do - do things like "always mark the
> page readable on initial page fault, use mkwrite to catch when it
> becomes writable, and do timestamps carefully, at at least have full
> knowledge of "something might become dirty"
>
> But even then it is ENTIRELY BOGUS to do things like write locking.
>
> Really.
>
> Because the actual write HASN'T HAPPENED YET, AND YOU WILL RELEASE THE
> LOCK BEFORE IT EVER DOES! So the lock? It does nothing. If you think
> it protects anything at all, you're wrong.
>
> So don't do write locking. At an absolute most, you can do things like
>
>  - update file times (and honestly, that's quite questionable -
> because again - THE WRITE HASN'T HAPPENED YET - so any tests that
> depend on exact file access times to figure out when the last write
> was done is not being helped by your extra code, because you're
> setting the WRONG time.
>
>  - set some "this inode will have dirty pages" flag just for some
> possible future use. But honestly, if this is about consistency etc,
> you need to do it not for a fault, but across the whole mmap/munmap.
>
> So some things may be less bogus - but still very very questionable.
>
> But locking? Bogus. Reads and writes aren't really any different from
> a timestamp standpoint (if you think you need to mtime for write
> accesses, you should do atime for reads, so from a filesystem
> timestamp standpoint read and write faults are exactly the same - and
> both are bogus, because by definition for a mmap both the reads and
> the writes can then happen long long long afterwards, and repeatedly).
>
> And if that "set some flag" thing needs a write lock, but a read
> doesn't, you're doing something wrong and odd.
>
> Look at the VM code. The VM code does this right. The mmap_sem is
> taken for writing for mmap and for munmap. But a page fault is always
> a read lock, even if the access that caused the page fault is a write.
>
> The actual real honest-to-goodness *write* happens much later, and the
> only time the filesystem really knows when it is done is at writeback
> time. Not at fault time. So if you take locks, logically you should
> take them when the fault happens, and release them when the writeback
> is done.
>
> Are you doing that? No. So don't do the write lock over the read
> portion of the page fault. It is not a sensible operation.

Thanks for the detailed explanation. I'll work on a fix.

Andreas



  reply	other threads:[~2021-05-31 20:35 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-05-31 10:56 [GIT PULL] gfs2 fixes for v5.13-rc5 Andreas Gruenbacher
2021-05-31 10:56 ` [Cluster-devel] " Andreas Gruenbacher
2021-05-31 16:30 ` Linus Torvalds
2021-05-31 16:30   ` [Cluster-devel] " Linus Torvalds
2021-05-31 20:32   ` Andreas Gruenbacher [this message]
2021-05-31 20:32     ` Andreas Gruenbacher
2021-05-31 16:36 ` pr-tracker-bot
2021-05-31 16:36   ` [Cluster-devel] " pr-tracker-bot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAHc6FU51=5cNZX9hdDHj3AJ6fy4bK7nH1qwAi2m2wB45WPoq8Q@mail.gmail.com' \
    --to=agruenba@redhat.com \
    --cc=cluster-devel@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.