linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Hugh Dickins <hughd@google.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: "Lin, Ming" <minggr@gmail.com>, Hugh Dickins <hughd@google.com>,
	 Simon Ser <contact@emersion.fr>, Peter Xu <peterx@redhat.com>,
	 "Kirill A. Shutemov" <kirill@shutemov.name>,
	 Matthew Wilcox <willy@infradead.org>,
	 Dan Williams <dan.j.williams@intel.com>,
	 "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
	 Will Deacon <will@kernel.org>,
	 Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	 David Herrmann <dh.herrmann@gmail.com>,
	 "linux-mm@kvack.org" <linux-mm@kvack.org>,
	 Greg Kroah-Hartman <greg@kroah.com>,
	"tytso@mit.edu" <tytso@mit.edu>
Subject: Re: Sealed memfd & no-fault mmap
Date: Sat, 29 May 2021 13:15:43 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LSU.2.11.2105291315330.25425@eggly.anvils> (raw)
In-Reply-To: <CAHk-=whSGS=R8PtrfNcDTkCKOengEqygqeWjOZa2b8QkuOueDg@mail.gmail.com>

On Sat, 29 May 2021, Linus Torvalds wrote:
> On Fri, May 28, 2021 at 9:31 PM Lin, Ming <minggr@gmail.com> wrote:
> >
> > I should check the vma is not writable.
> >
> > -               if (!IS_NOFAULT(inode))
> > +               if (!IS_NOFAULT(inode) || (vma->vm_flags & VM_WRITE))
> >                          return -EINVAL;
> 
> That might be enough, yes.
> 
> But if this is sufficient for the compositor needs, and the rule is
> that this only works for read-only mappings, then I think the flag in
> the inode becomes the wrong thing to do.
> 
> Because if it's a read-only mapping, and we only ever care about
> inserting zero pages into the page tables - and they never become part
> of the shared memory region itself, then it really is purely about
> that mmap, not about the shm inode.
> 
> So then it really does become purely about one particular mmap, and it
> really should be a "madvise()" issue, not a "mark inode as no-fault".

Yes, madvise or mmap flag: the recipient of this fd ought not to be
(even capable of) interfering with other maps of the shared object.

And IIUC it would have to be the recipient (Wayland compositor) doing
the NOFAULT business, because (going back to the original mail) we are
only considering this so that Wayland might satisfy clients who predate
or refuse Linux-only APIs.  So, an ioctl (or fcntl, as sealing chose)
at the client end cannot be expected; and could not be relied on anyway.

> 
> I'd almost be inclined to just add a new "flags" field to the vma.
> We've been running out of vma flags for a long time, to the point that
> some of them are only available on 64-bit architectures.
> 
> I get the feeling that we should just bite the bullet and make
> "vm_flags" be an u64. Or possibly make it two explicitly 32-bit
> entities (vm_flags and vm_extra). Get rid of the silly 64-bit-only
> "high" flags, and get rid of our artificial "we don't have enough
> bits".

u64 saves messing around in the vma_merge() area, which has to
consider whether adjacent vm_flags are identical.

> 
> Because we already in practice *do* have enough bits, we've just
> artificially limited ourselves to "on 32-bit architectures we only
> have 32 bits in that field".

Yes, that artificial limitation to 32-bit has been silly all along.

> 
> But all of this is very much dependent on that "this NOFAULT case
> really only works for reads, not for writes".
> 
> (Alternatively, we could allow the *mapping* itself to be writable,
> but always fault on writes, and only insert a read-only zero page)

NOFAULT? Does BSD use "fault" differently, and in Linux terms we
would say NOSIGBUS to mean the same?

Can someone point to a specification of BSD's __MAP_NOFAULT?
Searching just found me references to bugs.

What mainly worries me about the suggestion is: what happens to the
zero page inserted into NOFAULT mappings, when later a page for that
offset is created and added to page cache?

Treating it as an opaque blob of zeroes, that stays there ever after,
hiding the subsequent data: easy to implement, but a hack that we would
probably regret.  (And I notice that even the quote from David Herrmann
in the original post allows for the possibility that client may want to
expand the object.)

I believe the correct behaviour would be to unmap the nofault page
then, allowing the proper page to be faulted in after.  That is
certainly doable (the old mm/filemap_xip.c used to do so), but might
get into some awkward race territory, with filesystem dependence
(reminiscent of hole punch, in reverse).  shmem could operate that
way, and be the better for it: but I wouldn't want to add that,
without also cleaning away all the shmem_recalc_inode() stuff.

Hugh


  reply	other threads:[~2021-05-29 20:16 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-27  8:24 Sealed memfd & no-fault mmap Simon Ser
2021-04-27 16:51 ` Linus Torvalds
2021-04-29 15:48   ` Kirill A. Shutemov
2021-04-29 18:38     ` Peter Xu
2021-05-04  9:29       ` Simon Ser
2021-05-04 16:08         ` Linus Torvalds
2021-05-05 10:21           ` Simon Ser
2021-05-05 18:42             ` Linus Torvalds
2021-05-28 17:07               ` Lin, Ming
2021-05-29  1:03                 ` Linus Torvalds
2021-05-29  7:31                   ` Lin, Ming
2021-05-29 15:44                     ` Linus Torvalds
2021-05-29 20:15                       ` Hugh Dickins [this message]
2021-05-29 23:36                         ` Ming Lin
2021-05-31 21:13                           ` Ming Lin
2021-06-01  6:24                             ` Linus Torvalds
2021-06-01  7:08                               ` Ming Lin
2021-06-03 13:01                                 ` Simon Ser
2021-06-03 20:07                                   ` Ming Lin
2021-06-03 20:49                                     ` Simon Ser
2021-06-03 13:14                         ` Simon Ser
2021-06-03 13:57                           ` Matthew Wilcox
2021-06-03 14:48                             ` Simon Ser

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LSU.2.11.2105291315330.25425@eggly.anvils \
    --to=hughd@google.com \
    --cc=contact@emersion.fr \
    --cc=dan.j.williams@intel.com \
    --cc=dh.herrmann@gmail.com \
    --cc=greg@kroah.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minggr@gmail.com \
    --cc=peterx@redhat.com \
    --cc=torvalds@linux-foundation.org \
    --cc=tytso@mit.edu \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).