linux-mm.kvack.org archive mirror
* [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2020-09-24 13:28 Mike Rapoport
  2020-09-24 13:28 ` [PATCH v6 1/6] mm: add definition of PMD_PAGE_ORDER Mike Rapoport
                   ` (9 more replies)
  0 siblings, 10 replies; 58+ messages in thread
From: Mike Rapoport @ 2020-09-24 13:28 UTC
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Idan Yaniv,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Thomas Gleixner,
	Shuah Khan, Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

This is an implementation of "secret" mappings backed by a file descriptor. 
I've dropped the boot time reservation patch for now as it is not strictly
required for the basic usage and can be easily added later either with or
without CMA.

v6 changes:
* Silence the warning about missing syscall, thanks to Qian Cai
* Replace spaces with tabs in Kconfig additions, per Randy
* Add a selftest. 

v5 changes:
* rebase on v5.9-rc5
* drop boot time memory reservation patch

v4 changes:
* rebase on v5.9-rc1
* Do not redefine PMD_PAGE_ORDER in fs/dax.c, thanks Kirill
* Make secret mappings exclusive by default and only require flags to
  memfd_secret() system call for uncached mappings, thanks again Kirill :)

v3 changes:
* Squash kernel-parameters.txt update into the commit that added the
  command line option.
* Make uncached mode explicitly selectable by architectures. For now enable
  it only on x86.

v2 changes:
* Follow Michael's suggestion and name the new system call 'memfd_secret'
* Add kernel-parameters documentation about the boot option
* Fix i386-tinyconfig regression reported by the kbuild bot.
  CONFIG_SECRETMEM now depends on !EMBEDDED, which disables it on small
  systems while still making it available unconditionally on
  architectures that support SET_DIRECT_MAP.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The
mmap() of the file descriptor created with memfd_secret() will create a
"secret" memory mapping. The pages in that mapping will be marked as not
present in the direct map and will have the desired protection bits set in
the user page table. For instance, the current implementation allows
uncached mappings.
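
As a rough illustration of the flow described above (create the descriptor,
size it, then mmap() it), here is a minimal userspace sketch. It is not taken
from the series or its selftest. The helper name secret_alloc() is
hypothetical, and the fallback syscall number 447 is an assumption: it is the
number the call later received in mainline, not necessarily the one used by
this posting. The sketch passes flags == 0 and uses MAP_SHARED.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Assumption: 447 is the mainline syscall number; older headers lack it. */
#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447
#endif

/*
 * Hypothetical helper: returns a "secret" mapping of len bytes, or
 * MAP_FAILED with errno set (e.g. ENOSYS when the call is unavailable).
 */
static void *secret_alloc(size_t len)
{
	int saved_errno;
	void *p;
	int fd;

	/* There is no libc wrapper, so go through syscall(2). */
	fd = (int)syscall(__NR_memfd_secret, 0);
	if (fd < 0)
		return MAP_FAILED;

	/* The new file has size 0; give it a size before mapping it. */
	if (ftruncate(fd, (off_t)len) < 0) {
		saved_errno = errno;
		close(fd);
		errno = saved_errno;
		return MAP_FAILED;
	}

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	/* The mapping keeps the file alive, so the descriptor can go. */
	saved_errno = errno;
	close(fd);
	errno = saved_errno;
	return p;
}
```

A caller would compare the result against MAP_FAILED and fall back to
ordinary memory when the system call is not available.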

Although normally Linux userspace mappings are protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, the secret mappings may be used as a means to protect guest
memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library
[1] that does two things. First, it acts as a preloader for openssl and
redirects all the OPENSSL_malloc calls to secret memory, so any secret
keys get automatically protected. Second, it exposes the API to the user
who needs it. We anticipate that a lot of the use cases would be like the
openssl one: many toolkits that deal with secret keys already have special
handling for the memory to try to give them greater protection, so this
would simply be pluggable into the toolkits without any need for user
application modification.

I hesitated over whether to continue using new flags to memfd_create() or
to add a new system call, and decided on a new system call once I started
looking into the man pages update. There would have been two completely
independent descriptions and I think it would have been very confusing.

Hiding secret memory mappings behind an anonymous file allows (ab)use of
the page cache for tracking pages allocated for the "secret" mappings as
well as using address_space_operations for e.g. page migration callbacks.

The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

As the fragmentation of the direct map was one of the major concerns raised
during the previous postings, I've added an amortizing cache of PMD-size
pages to each file descriptor that is used as an allocation pool for the
secret memory areas.
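
To make the amortization idea concrete, here is a small userspace model of
such a pool. It is only an illustration, not the code in mm/secretmem.c: the
names (struct pmd_pool, pool_alloc_page() and friends) are hypothetical, the
sizes assume 4 KiB base pages with a 2 MiB PMD, and aligned_alloc() stands in
for the kernel's high-order page allocation. The point is that one PMD-size
chunk is obtained once and then serves many page-size requests, so the
expensive step (in the kernel, splitting the direct map) happens once per
chunk rather than once per page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Assumed sizes: 4 KiB base pages, 2 MiB PMD (as on x86-64). */
#define PAGE_SZ		4096u
#define PMD_SZ		(2u * 1024u * 1024u)
#define PAGES_PER_PMD	(PMD_SZ / PAGE_SZ)

/* Hypothetical per-descriptor pool backed by a single PMD-size chunk. */
struct pmd_pool {
	unsigned char *chunk;		/* PMD-aligned chunk, or NULL */
	bool used[PAGES_PER_PMD];	/* which pages are handed out */
	unsigned int chunk_allocs;	/* how often a chunk was obtained */
};

static void *pool_alloc_page(struct pmd_pool *pool)
{
	if (!pool->chunk) {
		/* The costly step: done once, then amortized over 512 pages. */
		pool->chunk = aligned_alloc(PMD_SZ, PMD_SZ);
		if (!pool->chunk)
			return NULL;
		pool->chunk_allocs++;
	}

	for (size_t i = 0; i < PAGES_PER_PMD; i++) {
		if (!pool->used[i]) {
			pool->used[i] = true;
			return pool->chunk + i * PAGE_SZ;
		}
	}

	/* Chunk exhausted; a fuller version would obtain another chunk. */
	return NULL;
}

static void pool_free_page(struct pmd_pool *pool, void *page)
{
	size_t idx = (size_t)((unsigned char *)page - pool->chunk) / PAGE_SZ;

	pool->used[idx] = false;
}

static void pool_destroy(struct pmd_pool *pool)
{
	free(pool->chunk);
	pool->chunk = NULL;
	memset(pool->used, 0, sizeof(pool->used));
}
```

Starting from a zero-initialized struct pmd_pool, the first allocation obtains
the chunk and every later one is served from it until all 512 pages are in use.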

v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org

Mike Rapoport (6):
  mm: add definition of PMD_PAGE_ORDER
  mmap: make mlock_future_check() global
  mm: introduce memfd_secret system call to create "secret" memory areas
  arch, mm: wire up memfd_secret system call were relevant
  mm: secretmem: use PMD-size pages to amortize direct map fragmentation
  secretmem: test: add basic selftest for memfd_secret(2)

 arch/Kconfig                              |   7 +
 arch/arm64/include/asm/unistd.h           |   2 +-
 arch/arm64/include/asm/unistd32.h         |   2 +
 arch/arm64/include/uapi/asm/unistd.h      |   1 +
 arch/riscv/include/asm/unistd.h           |   1 +
 arch/x86/Kconfig                          |   1 +
 arch/x86/entry/syscalls/syscall_32.tbl    |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl    |   1 +
 fs/dax.c                                  |  11 +-
 include/linux/pgtable.h                   |   3 +
 include/linux/syscalls.h                  |   1 +
 include/uapi/asm-generic/unistd.h         |   7 +-
 include/uapi/linux/magic.h                |   1 +
 include/uapi/linux/secretmem.h            |   8 +
 kernel/sys_ni.c                           |   2 +
 mm/Kconfig                                |   4 +
 mm/Makefile                               |   1 +
 mm/internal.h                             |   3 +
 mm/mmap.c                                 |   5 +-
 mm/secretmem.c                            | 333 ++++++++++++++++++++++
 scripts/checksyscalls.sh                  |   4 +
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 +++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 25 files changed, 703 insertions(+), 13 deletions(-)
 create mode 100644 include/uapi/linux/secretmem.h
 create mode 100644 mm/secretmem.c
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

-- 
2.28.0



* [PATCH] man2: new page describing memfd_secret() system call
@ 2021-07-27 12:41 Mike Rapoport
  2021-07-28 20:44 ` Alejandro Colomar (man-pages)
  0 siblings, 1 reply; 58+ messages in thread
From: Mike Rapoport @ 2021-07-27 12:41 UTC
  To: Michael Kerrisk
  Cc: Alejandro Colomar, Mike Rapoport, Mike Rapoport, linux-api,
	linux-man, linux-mm

From: Mike Rapoport <rppt@linux.ibm.com>

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---

Hi,

There have been a lot of changes to the memfd_secret implementation since
the previous posting of this man page, so its contents have also changed
significantly and there is not much sense in calling it v2.

 man2/memfd_secret.2 | 143 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)
 create mode 100644 man2/memfd_secret.2

diff --git a/man2/memfd_secret.2 b/man2/memfd_secret.2
new file mode 100644
index 000000000..e6eee7018
--- /dev/null
+++ b/man2/memfd_secret.2
@@ -0,0 +1,143 @@
+.\" Copyright (c) 2021, IBM Corporation.
+.\" Written by Mike Rapoport <rppt@linux.ibm.com>
+.\"
+.\" Based on memfd_create(2) man page
+.\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com>
+.\" and Copyright (C) 2014 David Herrmann <dh.herrmann@gmail.com>
+.\"
+.\" %%%LICENSE_START(GPLv2+)
+.\"
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.\"
+.TH MEMFD_SECRET 2 2020-08-02 Linux "Linux Programmer's Manual"
+.SH NAME
+memfd_secret \- create an anonymous file to access secret memory regions
+.SH SYNOPSIS
+.nf
+.BI "int memfd_secret(unsigned int " flags ");"
+.fi
+.PP
+.IR Note :
+There is no glibc wrapper for this system call; see NOTES.
+.SH DESCRIPTION
+.BR memfd_secret ()
+creates an anonymous file and returns a file descriptor that refers to it.
+The file provides a way to create and access memory regions
+with stronger protection than usual RAM-based files and
+anonymous memory mappings.
+Once all references to the file are dropped, it is automatically released.
+The initial size of the file is set to 0.
+Following the call, the file size should be set using
+.BR ftruncate (2).
+.PP
+The memory areas backing the file created with
+.BR memfd_secret ()
+are visible only to the contexts that have access to the file descriptor.
+These areas are removed from the kernel page tables
+and only the page tables of the processes holding the file descriptor
+map the corresponding physical memory.
+.PP
+The following values may be bitwise ORed in
+.IR flags
+to control the behavior of
+.BR memfd_secret (2):
+.TP
+.BR FD_CLOEXEC
+Set the close-on-exec flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.PP
+As its return value,
+.BR memfd_secret ()
+returns a new file descriptor that can be used to refer to an anonymous file.
+This file descriptor is opened for both reading and writing
+.RB ( O_RDWR )
+and
+.B O_LARGEFILE
+is set for the file descriptor.
+.PP
+With respect to
+.BR fork (2)
+and
+.BR execve (2),
+the usual semantics apply for the file descriptor created by
+.BR memfd_secret ().
+A copy of the file descriptor is inherited by the child produced by
+.BR fork (2)
+and refers to the same file.
+The file descriptor is preserved across
+.BR execve (2),
+unless the close-on-exec flag has been set.
+.PP
+The memory regions backed with
+.BR memfd_secret ()
+are locked in the same way as
+.BR mlock (2),
+however the implementation will not try to
+populate the whole range during the
+.BR mmap (2)
+call.
+The amount of memory allowed for memory mappings
+of the file descriptor obeys the same rules as
+.BR mlock (2)
+and cannot exceed
+.BR RLIMIT_MEMLOCK .
+.SH RETURN VALUE
+On success,
+.BR memfd_secret ()
+returns a new file descriptor.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EINVAL
+.I flags
+included unknown bits.
+.TP
+.B EMFILE
+The per-process limit on the number of open file descriptors has been reached.
+.TP
+.B ENFILE
+The system-wide limit on the total number of open files has been reached.
+.TP
+.B ENOMEM
+There was insufficient memory to create a new anonymous file.
+.TP
+.B ENOSYS
+.BR memfd_secret ()
+is not implemented on this architecture.
+.SH VERSIONS
+The
+.BR memfd_secret (2)
+system call first appeared in Linux 5.14.
+.SH CONFORMING TO
+The
+.BR memfd_secret (2)
+system call is Linux-specific.
+.SH NOTES
+.PP
+Glibc does not provide a wrapper for this system call; call it using
+.BR syscall (2).
+.SH SEE ALSO
+.BR fcntl (2),
+.BR ftruncate (2),
+.BR mlock (2),
+.BR mmap (2),
+.BR setrlimit (2)
-- 
2.31.1



Thread overview: 58+ messages
2020-09-24 13:28 [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2020-09-24 13:28 ` [PATCH v6 1/6] mm: add definition of PMD_PAGE_ORDER Mike Rapoport
2020-09-24 13:29 ` [PATCH v6 2/6] mmap: make mlock_future_check() global Mike Rapoport
2020-09-24 13:29 ` [PATCH v6 3/6] mm: introduce memfd_secret system call to create "secret" memory areas Mike Rapoport
2020-09-29  4:58   ` Edgecombe, Rick P
2020-09-29 13:06     ` Mike Rapoport
2020-09-29 20:06       ` Edgecombe, Rick P
2020-09-30 10:35         ` Mike Rapoport
2020-09-30 20:11           ` Edgecombe, Rick P
2020-10-11  9:42             ` Mike Rapoport
2020-09-24 13:29 ` [PATCH v6 4/6] arch, mm: wire up memfd_secret system call were relevant Mike Rapoport
2020-09-24 13:29 ` [PATCH v6 5/6] mm: secretmem: use PMD-size pages to amortize direct map fragmentation Mike Rapoport
2020-09-25  7:41   ` Peter Zijlstra
2020-09-25  9:00     ` David Hildenbrand
2020-09-25  9:50       ` Peter Zijlstra
2020-09-25 10:31         ` Mark Rutland
2020-09-29 14:04           ` Mike Rapoport
2020-09-29 13:07         ` Mike Rapoport
2020-09-29 13:06       ` Mike Rapoport
2020-09-29 13:05     ` Mike Rapoport
2020-09-29 14:12       ` Peter Zijlstra
2020-09-29 14:31         ` Dave Hansen
2020-09-29 14:58         ` Mike Rapoport
2020-09-29 15:15           ` Peter Zijlstra
2020-09-30 10:27             ` Mike Rapoport
2020-09-30 14:39               ` James Bottomley
2020-09-30 14:45                 ` David Hildenbrand
2020-09-30 15:17                   ` James Bottomley
2020-09-30 15:25                     ` David Hildenbrand
2020-09-30 15:09               ` Matthew Wilcox
2020-10-01  8:14                 ` Mike Rapoport
2020-09-29 15:03         ` James Bottomley
2020-09-30 10:20         ` Mike Rapoport
2020-09-30 10:43           ` Peter Zijlstra
2020-09-24 13:29 ` [PATCH v6 6/6] secretmem: test: add basic selftest for memfd_secret(2) Mike Rapoport
2020-09-24 13:35 ` [PATCH] man2: new page describing memfd_secret() system call Mike Rapoport
2020-09-24 14:55   ` Alejandro Colomar
2020-10-03  9:32     ` Alejandro Colomar
2020-10-05  7:32       ` Mike Rapoport
2020-11-16 21:01         ` [PATCH v2] memfd_secret.2: New " Alejandro Colomar
2020-11-17  6:26           ` Mike Rapoport
2020-09-25  2:34 ` [PATCH v6 0/6] mm: introduce memfd_secret system call to create "secret" memory areas Andrew Morton
2020-09-25  6:42   ` Mike Rapoport
2020-11-01 11:09 ` Hagen Paul Pfeifer
2020-11-02 15:40   ` Mike Rapoport
2020-11-03 13:52     ` Hagen Paul Pfeifer
2020-11-03 16:30       ` Mike Rapoport
2020-11-04 11:39         ` Hagen Paul Pfeifer
2020-11-04 17:02           ` Mike Rapoport
2020-11-09 10:41             ` Hagen Paul Pfeifer
2020-11-02  9:11 ` David Hildenbrand
2020-11-02  9:31   ` David Hildenbrand
2020-11-02 17:43   ` Mike Rapoport
2020-11-02 17:51     ` David Hildenbrand
2020-11-03  9:52       ` Mike Rapoport
2020-11-03 10:11         ` David Hildenbrand
2021-07-27 12:41 [PATCH] man2: new page describing memfd_secret() system call Mike Rapoport
2021-07-28 20:44 ` Alejandro Colomar (man-pages)
