linux-kernel.vger.kernel.org archive mirror
From: Jeff Xu <jeffxu@google.com>
To: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Jeff Xu <jeffxu@chromium.org>,  Jonathan Corbet <corbet@lwn.net>,
	akpm@linux-foundation.org, keescook@chromium.org,
	 jannh@google.com, sroettger@google.com, willy@infradead.org,
	 gregkh@linuxfoundation.org, torvalds@linux-foundation.org,
	 usama.anjum@collabora.com, rdunlap@infradead.org,
	jeffxu@google.com,  jorgelo@chromium.org, groeck@chromium.org,
	linux-kernel@vger.kernel.org,  linux-kselftest@vger.kernel.org,
	linux-mm@kvack.org, pedro.falcato@gmail.com,
	 dave.hansen@intel.com, linux-hardening@vger.kernel.org,
	deraadt@openbsd.org
Subject: Re: [PATCH v8 0/4] Introduce mseal
Date: Thu, 1 Feb 2024 19:14:42 -0800
Message-ID: <CALmYWFupdK_wc6jaamjbrZf-PzHwJ_4=b69yCtAik_7uu3hZug@mail.gmail.com>
In-Reply-To: <20240201204512.ht3e33yj77kkxi4q@revolver>

On Thu, Feb 1, 2024 at 12:45 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> * Jeff Xu <jeffxu@chromium.org> [240131 20:27]:
> > On Wed, Jan 31, 2024 at 11:34 AM Liam R. Howlett
> > <Liam.Howlett@oracle.com> wrote:
> > >
>
> Having to opt-in to allowing mseal will probably not work well.
I'm leaving the opt-in discussion in Linus's thread.

> Initial library mappings happen in one huge chunk then it's cut up into
> smaller VMAs, at least that's what I see with my maple tree tracing.  If
> you opt-in, then the entire library will have to opt-in and so the
> 'discourage inadvertent sealing' argument is not very strong.
>
Regarding "The initial library mappings happen in one huge chunk then
it is cut up into smaller VMAs", this is not a problem.

Take ELF loading (fs/binfmt_elf.c) as an example: there are just a few
places that pass in the type of memory to be allocated, e.g.
MAP_PRIVATE, MAP_FIXED_NOREPLACE, and we can add MAP_SEALABLE at those
places.
If glibc does additional splitting on the memory range, by using
mprotect(), then MAP_SEALABLE is automatically applied after
splitting.
If glibc uses mmap(MAP_FIXED), then it should use mmap(MAP_FIXED|MAP_SEALABLE).
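
To make the splitting case concrete, here is a minimal userspace sketch,
assuming Linux with /proc mounted. It maps three pages in one chunk,
calls mprotect() on the middle page, and counts the /proc/self/maps
entries overlapping the range before and after. The helper names
count_vmas() and split_demo() are made up for illustration; nothing here
uses MAP_SEALABLE or mseal(), it only shows one mapping being cut into
several VMAs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Count the /proc/self/maps entries that overlap [start, end). */
static int count_vmas(unsigned long start, unsigned long end)
{
	char line[4096];
	int n = 0;
	FILE *f = fopen("/proc/self/maps", "r");

	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		unsigned long lo, hi;

		if (sscanf(line, "%lx-%lx", &lo, &hi) == 2 &&
		    lo < end && hi > start)
			n++;
	}
	fclose(f);
	return n;
}

/*
 * Hypothetical demo helper: map three pages in one call, then change the
 * protection of the middle page only. Stores the number of VMAs covering
 * the range before and after the mprotect(); returns 0 on success.
 */
static int split_demo(int *before, int *after)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned long start;
	char *p;

	assert(page > 0);
	p = mmap(NULL, 3 * (size_t)page, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	start = (unsigned long)p;

	*before = count_vmas(start, start + 3 * (unsigned long)page);
	if (mprotect(p + page, (size_t)page, PROT_READ) != 0) {
		munmap(p, 3 * (size_t)page);
		return -1;
	}
	*after = count_vmas(start, start + 3 * (unsigned long)page);

	munmap(p, 3 * (size_t)page);
	return 0;
}
```

The counting is by overlap rather than containment, so it still works if
the kernel merges the outer pages with a neighbouring anonymous mapping.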

> It also makes a somewhat messy tracking of inheritance of the attribute
> across splitting, MAP_FIXED replacement, vma_move, vma_copy.  I think
> most of this is forced on the user?
>
The inheritance is the same as other VMA flags.

> It makes your call less flexible, it means you have to hope that the VMA
> origin was blessed before you decide you want to mseal it.
>
> What if you want to ensure the library mapped by a parent or on launch
> is mseal'ed?
>
> What about the initial relocated VMA (expand/shrink of VMA)?
>
> Creating something as "non-sealable" is pointless.  If you don't want it
> sealed, then don't mseal() that region.
>
> If your use case doesn't need it, then can we please drop the opt-in
> behaviour and just have all VMAs treated the same?
>
> If it does need it, can you explain why?
>
> The glibc relocation/fixup will then work.  glibc could mseal once it is
> complete - or an application could bypass glibc support and use the
> feature itself.

Yes. That is the idea.

>
> If we proceed to remove the MAP_SEALABLE flag to mmap, then we have the
> heap/stack concerns.  We can either let people shoot their own feet off
> or try to protect them.
>
> Right now, you seem to be trying to protect them.  Keeping with that, I
> guess we could either get the kernel to mark those VMAs or tell some
> other way?  I'd suggest a range, but people do very strange things with
> these special VMAs [1].  I don't think you can predict enough crazy
> actions to make a difference in trying to protect people.
>
> There are far fewer VMAs that should not be allowed to be mseal'ed than
> should be, and the kernel creates those so it seems logical to only let
> the kernel opt-out on those ones.
>
> I'd rather just let people shoot themselves and return an error.
>
> I also hope it reduces the complexity of this code while increasing the
> flexibility of the feature.  As stated before, we remove the dependency
> of needing support from the initial loader.
>
> Merging VMAs
> I can see this going Very Bad with brk + mseal.  But, again, if someone
> decides to mseal these VMAs then they should expect Bad Things to
> happen (or maybe they know what they are doing even in some complex
> situation?)
>
> vma_merge() can also expand a VMA.  I think this is okay as it checks
> for the same flags, so you will allow VMA expansion of two (or three)
> vma areas to become one.  Is this okay in your model?
>
> >
> > > I mean, you specifically state that this is a 'very specific
> > > requirement' in your cover letter.  Does this mean even other browsers
> > > have no use for it?
> > >
> > No, I don’t mean “other browsers have no use for it”.
> >
> > About specific requirements from Chrome, that refers to "The lifetime
> > of those mappings are not tied to the lifetime of the process, which
> > is not the case of libc" as in the cover letter. This addition to the
> > cover letter was made in V3, thus, it might be beneficial to provide
> > additional context to help answer the question.
> >
> > This patch series begins with multiple-bit approaches (v1,v2,v3), the
> > rationale for this is that I am uncertain if Chrome's specific needs
> > are common enough for other use cases.  Consequently, I am unable to
> > make this decision myself without input from the community. To
> > accommodate this, multiple bits are selected initially due to their
> > adaptability.
> >
> > Since V1, after hearing from the community, Chrome has changed its
> > design (no longer relying on separating out mprotect), and Linus
> > acknowledged the defect of madvise(MADV_DONTNEED) [1]. With those inputs,
> > today mseal() has a simple design that:
> >  - meet Chrome's specific needs.
>
> How many VMAs will chrome have that are mseal'ed?  Is this a common
> operation?
>
> PROT_SEAL seems like an extra flag we could drop.  I don't expect we'll
> be sealing enough VMAs that a handful of extra syscalls would make a
> difference?
>
> >  - meet Libc's needs.
>
> What needs of libc are you referring to?  I'm looking through the
> version changelog and I guess you mean return EPERM?
>
I meant libc sealing the RO part of the ELF binary; the lifetime of
that memory is tied to the lifetime of the process.

> >  - Chrome's specific need doesn't interfere with Libc's.
> >
> > [1] https://lore.kernel.org/all/CAHk-=wiVhHmnXviy1xqStLRozC4ziSugTk=1JOc8ORWd2_0h7g@mail.gmail.com/
>
> Linus said he'd be happier if we made the change in general.
>
> >
> > > I am very concerned this feature will land and have to be maintained by
> > > the core mm people for the one user it was specifically targeting.
> > >
> > See above. This feature is not specifically targeting Chrome.
> >
> > > Can we also get some benchmarking on the impact of this feature?  I
> > > believe my answer in v7 removed the worst offender, but since there is
> > > no benchmarking we really are guessing (educated or not, hard data would
> > > help).  We still have an extra loop in madvise, mprotect_pkey, mremap_to
> > > (and mremap syscall?).
> > >
> > Yes. There is an extra loop in mmap(MAP_FIXED), munmap(),
> > madvise(MADV_DONTNEED), mremap(), to enumerate the VMAs for the given
> > address range. I suspect the impact would be low, but having some hard
> > data would be good. I will see what I can find to assist the perf
> > testing. If you have a specific test suite in mind, I can also try it.
>
> You should look at mmtests [2]. But since you are adding loops across
> VMA ranges, you need to test loops across several ranges of VMAs.  That
> is, it would be good to see what happens on 1, 3, 6, 12, 24 VMAs, or
> some subset of small and large numbers to get an idea of complexity we
> are adding.  My hope is that the looping will be cache-hot in the maple
> tree and have minimum effect.
>
> In my personal testing, I've seen munmap often do a single VMA, or 3, or
> more rare 7 on x86_64.  There should be some good starting points in
> mmtests for the common operations.
>
Thanks. Will do.
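
As a rough starting point for the "1, 3, 6, 12, 24 VMAs" measurement, a
microbenchmark could look like the sketch below. This is not part of
mmtests and not a substitute for it; map_split_range() and
time_munmap_ns() are hypothetical names. It builds a contiguous range
made of N VMAs by alternating page protections, then times a single
munmap() across the whole range. A single sample is noisy, so real
numbers would need many iterations and a comparison between kernels with
and without the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Build a contiguous range backed by nr_vmas VMAs: map nr_vmas pages in
 * one call, then make every other page read-only so that neighbouring
 * pages have different flags and cannot be merged back together.
 */
static void *map_split_range(size_t nr_vmas, size_t page)
{
	char *p = mmap(NULL, nr_vmas * page, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	for (size_t i = 1; i < nr_vmas; i += 2) {
		if (mprotect(p + i * page, page, PROT_READ) != 0) {
			munmap(p, nr_vmas * page);
			return NULL;
		}
	}
	return p;
}

/*
 * Time one munmap() over a range spanning nr_vmas VMAs.
 * Returns elapsed nanoseconds, or -1 on error.
 */
static int64_t time_munmap_ns(size_t nr_vmas)
{
	struct timespec t0, t1;
	long page = sysconf(_SC_PAGESIZE);
	void *p;

	assert(page > 0 && nr_vmas > 0);
	p = map_split_range(nr_vmas, (size_t)page);
	if (!p)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (munmap(p, nr_vmas * (size_t)page) != 0)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000LL +
	       (t1.tv_nsec - t0.tv_nsec);
}
```

Calling time_munmap_ns() in a loop for each of 1, 3, 6, 12 and 24 and
averaging the results would give a first idea of how the cost scales
with the number of VMAs in the range.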


> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mmapstress/mmapstress03.c
> [2] https://github.com/gormanm/mmtests
>
> Thanks,
> Liam
