* [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  8:49 ` Mike Rapoport
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86

From: Mike Rapoport <rppt@linux.ibm.com>

Hi,

@Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
and related patches dropped from there, I can rebase whatever way you
prefer.

This is an implementation of "secret" mappings backed by a file descriptor.

The file descriptor backing secret memory mappings is created using a
dedicated memfd_secret system call. The desired protection mode for the
memory is configured using the flags parameter of the system call. The
mmap() of the file descriptor created with memfd_secret() will create a
"secret" memory mapping. The pages in that mapping will be marked as not
present in the direct map and will be present only in the page table of
the owning mm.
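
For illustration, a minimal userspace sketch of the intended flow could look
like the following (error handling trimmed; note that there is no libc
wrapper, so it assumes the __NR_memfd_secret number provided by the newly
wired-up syscall tables):

	#include <string.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	int main(void)
	{
		/* No libc wrapper yet; assumes __NR_memfd_secret is defined. */
		int fd = syscall(__NR_memfd_secret, 0);
		if (fd < 0)
			return 1;

		/* Size the backing file, then map it; the backing pages are
		 * dropped from the kernel direct map and locked in RAM. */
		if (ftruncate(fd, 4096) < 0)
			return 1;
		char *secret = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
				    MAP_SHARED, fd, 0);
		if (secret == MAP_FAILED)
			return 1;

		memcpy(secret, "key material", 12);
		/* ... use the secret, then wipe, unmap and close ... */
		memset(secret, 0, 4096);
		munmap(secret, 4096);
		close(fd);
		return 0;
	}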

Although Linux userspace mappings are normally protected from other users,
such secret mappings are useful for environments where a hostile tenant is
trying to trick the kernel into giving them access to other tenants'
mappings.

Additionally, in the future the secret mappings may be used as a means to
protect guest memory in a virtual machine host.

For demonstration of secret memory usage we've created a userspace library

https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

that does two things: first, it acts as a preloader for openssl, redirecting
all OPENSSL_malloc calls to secret memory so that any secret keys are
automatically protected this way; second, it exposes the API to users who
need it. We anticipate that a lot of the use cases would be like the openssl
one: many toolkits that deal with secret keys already have special handling
for the memory to try to give them greater protection, so this would simply
be pluggable into the toolkits without any need for user application
modification.
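
As a rough idea of what the openssl hook amounts to, here is a deliberately
simplified sketch (not the actual preloader code): it assumes a secretmem
backed arena has been set up with memfd_secret() + mmap() as in the snippet
above, and plugs a trivial bump allocator into OpenSSL through
CRYPTO_set_mem_functions():

	#include <openssl/crypto.h>
	#include <stddef.h>

	/* Toy arena carved out of an mmap()ed memfd_secret() region. */
	static char *arena;
	static size_t arena_off, arena_size;

	static void *secret_malloc(size_t n, const char *file, int line)
	{
		char *p = arena + arena_off;

		if (!arena || arena_off + n > arena_size)
			return NULL;
		arena_off += n;		/* bump allocator, never reclaims */
		return p;
	}

	static void *secret_realloc(void *old, size_t n, const char *file, int line)
	{
		/* A real implementation must track block sizes and copy here. */
		return secret_malloc(n, file, line);
	}

	static void secret_free(void *p, const char *file, int line)
	{
		/* Intentionally a no-op in this sketch. */
	}

	/* Runs when the shim is pulled in with LD_PRELOAD, before openssl
	 * makes its first allocation. */
	__attribute__((constructor))
	static void secret_mem_init(void)
	{
		/* arena/arena_size would be initialized from memfd_secret() here. */
		CRYPTO_set_mem_functions(secret_malloc, secret_realloc, secret_free);
	}

An application then only needs LD_PRELOAD to pick this up, with no source
changes, which is the point of the preloader approach.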

Hiding secret memory mappings behind an anonymous file allows using the
page cache to track the pages allocated for the "secret" mappings, as well
as using address_space_operations for e.g. page migration callbacks.

The anonymous file may also be used implicitly, like hugetlb files, to
implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
ABIs in the future.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and a recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default, with the ability for a system administrator
to enable it at boot time.

In addition, there is also a long term goal to improve management of the
direct map.

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

v17:
* Remove pool of large pages backing secretmem allocations, per Michal Hocko
* Add secretmem pages to unevictable LRU, per Michal Hocko
* Use GFP_HIGHUSER as secretmem mapping mask, per Michal Hocko
* Make secretmem an opt-in feature that is disabled by default
 
v16:
* Fix memory leak introduced in v15
* Clean the data left from previous page user before handing the page to
  the userspace

v15: https://lore.kernel.org/lkml/20210120180612.1058-1-rppt@kernel.org
* Add riscv/Kconfig update to disable set_memory operations for nommu
  builds (patch 3)
* Update the code around add_to_page_cache() per Matthew's comments
  (patches 6,7)
* Add fixups for build/checkpatch errors discovered by CI systems

v14: https://lore.kernel.org/lkml/20201203062949.5484-1-rppt@kernel.org
* Finally s/mod_node_page_state/mod_lruvec_page_state/

v13: https://lore.kernel.org/lkml/20201201074559.27742-1-rppt@kernel.org
* Added Reviewed-by, thanks Catalin and David
* s/mod_node_page_state/mod_lruvec_page_state/ as Shakeel suggested

Older history:
v12: https://lore.kernel.org/lkml/20201125092208.12544-1-rppt@kernel.org
v11: https://lore.kernel.org/lkml/20201124092556.12009-1-rppt@kernel.org
v10: https://lore.kernel.org/lkml/20201123095432.5860-1-rppt@kernel.org
v9: https://lore.kernel.org/lkml/20201117162932.13649-1-rppt@kernel.org
v8: https://lore.kernel.org/lkml/20201110151444.20662-1-rppt@kernel.org
v7: https://lore.kernel.org/lkml/20201026083752.13267-1-rppt@kernel.org
v6: https://lore.kernel.org/lkml/20200924132904.1391-1-rppt@kernel.org
v5: https://lore.kernel.org/lkml/20200916073539.3552-1-rppt@kernel.org
v4: https://lore.kernel.org/lkml/20200818141554.13945-1-rppt@kernel.org
v3: https://lore.kernel.org/lkml/20200804095035.18778-1-rppt@kernel.org
v2: https://lore.kernel.org/lkml/20200727162935.31714-1-rppt@kernel.org
v1: https://lore.kernel.org/lkml/20200720092435.17469-1-rppt@kernel.org
rfc-v2: https://lore.kernel.org/lkml/20200706172051.19465-1-rppt@kernel.org/
rfc-v1: https://lore.kernel.org/lkml/20200130162340.GA14232@rapoport-lnx/
rfc-v0: https://lore.kernel.org/lkml/1572171452-7958-1-git-send-email-rppt@kernel.org/

Arnd Bergmann (1):
  arm64: kfence: fix header inclusion

Mike Rapoport (9):
  mm: add definition of PMD_PAGE_ORDER
  mmap: make mlock_future_check() global
  riscv/Kconfig: make direct map manipulation options depend on MMU
  set_memory: allow set_direct_map_*_noflush() for multiple pages
  set_memory: allow querying whether set_direct_map_*() is actually enabled
  mm: introduce memfd_secret system call to create "secret" memory areas
  PM: hibernate: disable when there are active secretmem users
  arch, mm: wire up memfd_secret system call where relevant
  secretmem: test: add basic selftest for memfd_secret(2)

 arch/arm64/include/asm/Kbuild             |   1 -
 arch/arm64/include/asm/cacheflush.h       |   6 -
 arch/arm64/include/asm/kfence.h           |   2 +-
 arch/arm64/include/asm/set_memory.h       |  17 ++
 arch/arm64/include/uapi/asm/unistd.h      |   1 +
 arch/arm64/kernel/machine_kexec.c         |   1 +
 arch/arm64/mm/mmu.c                       |   6 +-
 arch/arm64/mm/pageattr.c                  |  23 +-
 arch/riscv/Kconfig                        |   4 +-
 arch/riscv/include/asm/set_memory.h       |   4 +-
 arch/riscv/include/asm/unistd.h           |   1 +
 arch/riscv/mm/pageattr.c                  |   8 +-
 arch/x86/entry/syscalls/syscall_32.tbl    |   1 +
 arch/x86/entry/syscalls/syscall_64.tbl    |   1 +
 arch/x86/include/asm/set_memory.h         |   4 +-
 arch/x86/mm/pat/set_memory.c              |   8 +-
 fs/dax.c                                  |  11 +-
 include/linux/pgtable.h                   |   3 +
 include/linux/secretmem.h                 |  30 +++
 include/linux/set_memory.h                |  16 +-
 include/linux/syscalls.h                  |   1 +
 include/uapi/asm-generic/unistd.h         |   6 +-
 include/uapi/linux/magic.h                |   1 +
 kernel/power/hibernate.c                  |   5 +-
 kernel/power/snapshot.c                   |   4 +-
 kernel/sys_ni.c                           |   2 +
 mm/Kconfig                                |   3 +
 mm/Makefile                               |   1 +
 mm/gup.c                                  |  10 +
 mm/internal.h                             |   3 +
 mm/mlock.c                                |   3 +-
 mm/mmap.c                                 |   5 +-
 mm/secretmem.c                            | 261 +++++++++++++++++++
 mm/vmalloc.c                              |   5 +-
 scripts/checksyscalls.sh                  |   4 +
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 39 files changed, 726 insertions(+), 53 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

-- 
2.28.0

* [PATCH v17 01/10] mm: add definition of PMD_PAGE_ORDER
@ 2021-02-08  8:49   ` Mike Rapoport
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The definition of PMD_PAGE_ORDER, denoting the number of base pages in a
second-level leaf page, is already used by DAX and may be handy in other
cases as well.

Several architectures already have a definition of PMD_ORDER as the size of
a second-level page table, so to avoid conflicts with those definitions use
the PMD_PAGE_ORDER name and update DAX accordingly.
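
As a concrete illustration (using the common 4K-base-page configuration of
x86-64 and arm64; other page sizes give different numbers), the new macro
evaluates as follows:

	/* Illustrative values only: PAGE_SHIFT and PMD_SHIFT are per-arch. */
	#define EXAMPLE_PAGE_SHIFT	12	/* 4K base pages */
	#define EXAMPLE_PMD_SHIFT	21	/* 2M PMD-level leaf pages */
	#define EXAMPLE_PMD_PAGE_ORDER	(EXAMPLE_PMD_SHIFT - EXAMPLE_PAGE_SHIFT)
	_Static_assert(EXAMPLE_PMD_PAGE_ORDER == 9,
		       "a 2M leaf page spans 1 << 9 == 512 base pages");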

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
---
 fs/dax.c                | 11 ++++-------
 include/linux/pgtable.h |  3 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..0f109eb16196 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
+		return PMD_PAGE_ORDER;
 	return 0;
 }
 
@@ -1470,7 +1467,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync;
@@ -1529,7 +1526,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
 	if (xa_is_internal(entry)) {
 		result = xa_to_internal(entry);
 		goto fallback;
@@ -1695,7 +1692,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	if (order == 0)
 		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
+	else if (order == PMD_PAGE_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..ea5c4102c23e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
-- 
2.28.0
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 01/10] mm: add definition of PMD_PAGE_ORDER
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The definition of PMD_PAGE_ORDER denoting the number of base pages in the
second-level leaf page is already used by DAX and maybe handy in other
cases as well.

Several architectures already have definition of PMD_ORDER as the size of
second level page table, so to avoid conflict with these definitions use
PMD_PAGE_ORDER name and update DAX respectively.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
---
 fs/dax.c                | 11 ++++-------
 include/linux/pgtable.h |  3 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..0f109eb16196 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
+		return PMD_PAGE_ORDER;
 	return 0;
 }
 
@@ -1470,7 +1467,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync;
@@ -1529,7 +1526,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
 	if (xa_is_internal(entry)) {
 		result = xa_to_internal(entry);
 		goto fallback;
@@ -1695,7 +1692,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	if (order == 0)
 		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
+	else if (order == PMD_PAGE_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..ea5c4102c23e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 01/10] mm: add definition of PMD_PAGE_ORDER
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The definition of PMD_PAGE_ORDER denoting the number of base pages in the
second-level leaf page is already used by DAX and maybe handy in other
cases as well.

Several architectures already have definition of PMD_ORDER as the size of
second level page table, so to avoid conflict with these definitions use
PMD_PAGE_ORDER name and update DAX respectively.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
---
 fs/dax.c                | 11 ++++-------
 include/linux/pgtable.h |  3 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..0f109eb16196 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
+		return PMD_PAGE_ORDER;
 	return 0;
 }
 
@@ -1470,7 +1467,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync;
@@ -1529,7 +1526,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
 	if (xa_is_internal(entry)) {
 		result = xa_to_internal(entry);
 		goto fallback;
@@ -1695,7 +1692,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	if (order == 0)
 		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
+	else if (order == PMD_PAGE_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..ea5c4102c23e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
-- 
2.28.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 01/10] mm: add definition of PMD_PAGE_ORDER
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The definition of PMD_PAGE_ORDER denoting the number of base pages in the
second-level leaf page is already used by DAX and maybe handy in other
cases as well.

Several architectures already have definition of PMD_ORDER as the size of
second level page table, so to avoid conflict with these definitions use
PMD_PAGE_ORDER name and update DAX respectively.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
---
 fs/dax.c                | 11 ++++-------
 include/linux/pgtable.h |  3 +++
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 26d5dcd2d69e..0f109eb16196 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -49,9 +49,6 @@ static inline unsigned int pe_order(enum page_entry_size pe_size)
 #define PG_PMD_COLOUR	((PMD_SIZE >> PAGE_SHIFT) - 1)
 #define PG_PMD_NR	(PMD_SIZE >> PAGE_SHIFT)
 
-/* The order of a PMD entry */
-#define PMD_ORDER	(PMD_SHIFT - PAGE_SHIFT)
-
 static wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
 
 static int __init init_dax_wait_table(void)
@@ -98,7 +95,7 @@ static bool dax_is_locked(void *entry)
 static unsigned int dax_entry_order(void *entry)
 {
 	if (xa_to_value(entry) & DAX_PMD)
-		return PMD_ORDER;
+		return PMD_PAGE_ORDER;
 	return 0;
 }
 
@@ -1470,7 +1467,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 {
 	struct vm_area_struct *vma = vmf->vma;
 	struct address_space *mapping = vma->vm_file->f_mapping;
-	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_ORDER);
+	XA_STATE_ORDER(xas, &mapping->i_pages, vmf->pgoff, PMD_PAGE_ORDER);
 	unsigned long pmd_addr = vmf->address & PMD_MASK;
 	bool write = vmf->flags & FAULT_FLAG_WRITE;
 	bool sync;
@@ -1529,7 +1526,7 @@ static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
 	 * entry is already in the array, for instance), it will return
 	 * VM_FAULT_FALLBACK.
 	 */
-	entry = grab_mapping_entry(&xas, mapping, PMD_ORDER);
+	entry = grab_mapping_entry(&xas, mapping, PMD_PAGE_ORDER);
 	if (xa_is_internal(entry)) {
 		result = xa_to_internal(entry);
 		goto fallback;
@@ -1695,7 +1692,7 @@ dax_insert_pfn_mkwrite(struct vm_fault *vmf, pfn_t pfn, unsigned int order)
 	if (order == 0)
 		ret = vmf_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
 #ifdef CONFIG_FS_DAX_PMD
-	else if (order == PMD_ORDER)
+	else if (order == PMD_PAGE_ORDER)
 		ret = vmf_insert_pfn_pmd(vmf, pfn, FAULT_FLAG_WRITE);
 #endif
 	else
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8fcdfa52eb4b..ea5c4102c23e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -28,6 +28,9 @@
 #define USER_PGTABLES_CEILING	0UL
 #endif
 
+/* Number of base pages in a second level leaf page */
+#define PMD_PAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
+
 /*
  * A page table page can be thought of an array like this: pXd_t[PTRS_PER_PxD]
  *
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 02/10] mmap: make mlock_future_check() global
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

It will be used by the upcoming secret memory implementation.
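
As an illustration of the intended use, the secret memory mmap() path
can check the request against RLIMIT_MEMLOCK the same way mlock() does.
A rough sketch (not the actual secretmem code; the names below are
illustrative only):

	static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
	{
		unsigned long len = vma->vm_end - vma->vm_start;

		/* check the mapping size against RLIMIT_MEMLOCK, as mlock() would */
		if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
			return -EAGAIN;

		return 0;
	}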

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 mm/internal.h | 3 +++
 mm/mmap.c     | 5 ++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9902648f2206..8e9c660f33ca 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -353,6 +353,9 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
 extern void mlock_vma_page(struct page *page);
 extern unsigned int munlock_vma_page(struct page *page);
 
+extern int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+			      unsigned long len);
+
 /*
  * Clear the page's PageMlocked().  This can be useful in a situation where
  * we want to unconditionally remove a page from the pagecache -- e.g.,
diff --git a/mm/mmap.c b/mm/mmap.c
index 28ef5e29152a..10b9b8b88913 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1346,9 +1346,8 @@ static inline unsigned long round_hint_to_min(unsigned long hint)
 	return hint;
 }
 
-static inline int mlock_future_check(struct mm_struct *mm,
-				     unsigned long flags,
-				     unsigned long len)
+int mlock_future_check(struct mm_struct *mm, unsigned long flags,
+		       unsigned long len)
 {
 	unsigned long locked, lock_limit;
 
-- 
2.28.0

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 03/10] riscv/Kconfig: make direct map manipulation options depend on MMU
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	kernel test robot

From: Mike Rapoport <rppt@linux.ibm.com>

The ARCH_HAS_SET_DIRECT_MAP and ARCH_HAS_SET_MEMORY configuration options
have no meaning when CONFIG_MMU is disabled, so there is no point in
enabling them for the nommu case.

Add an explicit dependency on MMU for these options.
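
Without these selects, a nommu build falls back to the generic no-op
stubs in include/linux/set_memory.h, for example:

	#ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
	/* generic fallback when the architecture provides no implementation */
	static inline int set_direct_map_invalid_noflush(struct page *page)
	{
		return 0;
	}
	#endif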

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reported-by: kernel test robot <lkp@intel.com>
---
 arch/riscv/Kconfig | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e338e9579f3e..9d941794f11c 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -25,8 +25,8 @@ config RISCV
 	select ARCH_HAS_KCOV
 	select ARCH_HAS_MMIOWB
 	select ARCH_HAS_PTE_SPECIAL
-	select ARCH_HAS_SET_DIRECT_MAP
-	select ARCH_HAS_SET_MEMORY
+	select ARCH_HAS_SET_DIRECT_MAP if MMU
+	select ARCH_HAS_SET_MEMORY if MMU
 	select ARCH_HAS_STRICT_KERNEL_RWX if MMU
 	select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
 	select ARCH_OPTIONAL_KERNEL_RWX_DEFAULT
-- 
2.28.0

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 04/10] set_memory: allow set_direct_map_*_noflush() for multiple pages
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The underlying implementations of set_direct_map_invalid_noflush() and
set_direct_map_default_noflush() allow updating multiple contiguous pages
at once.

Add a numpages parameter to set_direct_map_*_noflush() to expose this
ability through these APIs.
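
For example, a caller that owns a physically contiguous block of pages
can now update the whole block with a single call instead of looping
over every page (an illustrative fragment, not taken from this series):

	static int drop_block_from_direct_map(struct page *page, int order)
	{
		/* update all 2^order contiguous pages at once */
		int err = set_direct_map_invalid_noflush(page, 1 << order);

		if (err)
			/* try to restore anything that may already have changed */
			set_direct_map_default_noflush(page, 1 << order);

		return err;
	}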

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>	[arm64]
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/cacheflush.h |  4 ++--
 arch/arm64/mm/pageattr.c            | 10 ++++++----
 arch/riscv/include/asm/set_memory.h |  4 ++--
 arch/riscv/mm/pageattr.c            |  8 ++++----
 arch/x86/include/asm/set_memory.h   |  4 ++--
 arch/x86/mm/pat/set_memory.c        |  8 ++++----
 include/linux/set_memory.h          |  4 ++--
 kernel/power/snapshot.c             |  4 ++--
 mm/vmalloc.c                        |  5 +++--
 9 files changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index 45217f21f1fe..d3598419a284 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -138,8 +138,8 @@ static __always_inline void __flush_icache_all(void)
 
 int set_memory_valid(unsigned long addr, int numpages, int enable);
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 #include <asm-generic/cacheflush.h>
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index 92eccaf595c8..b53ef37bf95a 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -148,34 +148,36 @@ int set_memory_valid(unsigned long addr, int numpages, int enable)
 					__pgprot(PTE_VALID));
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	struct page_change_data data = {
 		.set_mask = __pgprot(0),
 		.clear_mask = __pgprot(PTE_VALID),
 	};
+	unsigned long size = PAGE_SIZE * numpages;
 
 	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
 				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   size, change_page_range, &data);
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	struct page_change_data data = {
 		.set_mask = __pgprot(PTE_VALID | PTE_WRITE),
 		.clear_mask = __pgprot(PTE_RDONLY),
 	};
+	unsigned long size = PAGE_SIZE * numpages;
 
 	if (!debug_pagealloc_enabled() && !rodata_full)
 		return 0;
 
 	return apply_to_page_range(&init_mm,
 				   (unsigned long)page_address(page),
-				   PAGE_SIZE, change_page_range, &data);
+				   size, change_page_range, &data);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/arch/riscv/include/asm/set_memory.h b/arch/riscv/include/asm/set_memory.h
index 211eb8244a45..1aaf2720b8f6 100644
--- a/arch/riscv/include/asm/set_memory.h
+++ b/arch/riscv/include/asm/set_memory.h
@@ -26,8 +26,8 @@ static inline void protect_kernel_text_data(void) {};
 static inline int set_memory_rw_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index 5e49e4b4a4cc..9618181b70be 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -156,11 +156,11 @@ int set_memory_nx(unsigned long addr, int numpages)
 	return __set_memory(addr, numpages, __pgprot(0), __pgprot(_PAGE_EXEC));
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	int ret;
 	unsigned long start = (unsigned long)page_address(page);
-	unsigned long end = start + PAGE_SIZE;
+	unsigned long end = start + PAGE_SIZE * numpages;
 	struct pageattr_masks masks = {
 		.set_mask = __pgprot(0),
 		.clear_mask = __pgprot(_PAGE_PRESENT)
@@ -173,11 +173,11 @@ int set_direct_map_invalid_noflush(struct page *page)
 	return ret;
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	int ret;
 	unsigned long start = (unsigned long)page_address(page);
-	unsigned long end = start + PAGE_SIZE;
+	unsigned long end = start + PAGE_SIZE * numpages;
 	struct pageattr_masks masks = {
 		.set_mask = PAGE_KERNEL,
 		.clear_mask = __pgprot(0)
diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index 4352f08bfbb5..6224cb291f6c 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -80,8 +80,8 @@ int set_pages_wb(struct page *page, int numpages);
 int set_pages_ro(struct page *page, int numpages);
 int set_pages_rw(struct page *page, int numpages);
 
-int set_direct_map_invalid_noflush(struct page *page);
-int set_direct_map_default_noflush(struct page *page);
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
 bool kernel_page_present(struct page *page);
 
 extern int kernel_set_to_readonly;
diff --git a/arch/x86/mm/pat/set_memory.c b/arch/x86/mm/pat/set_memory.c
index 16f878c26667..d157fd617c99 100644
--- a/arch/x86/mm/pat/set_memory.c
+++ b/arch/x86/mm/pat/set_memory.c
@@ -2184,14 +2184,14 @@ static int __set_pages_np(struct page *page, int numpages)
 	return __change_page_attr_set_clr(&cpa, 0);
 }
 
-int set_direct_map_invalid_noflush(struct page *page)
+int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
-	return __set_pages_np(page, 1);
+	return __set_pages_np(page, numpages);
 }
 
-int set_direct_map_default_noflush(struct page *page)
+int set_direct_map_default_noflush(struct page *page, int numpages)
 {
-	return __set_pages_p(page, 1);
+	return __set_pages_p(page, numpages);
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index fe1aa4e54680..c650f82db813 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -15,11 +15,11 @@ static inline int set_memory_nx(unsigned long addr, int numpages) { return 0; }
 #endif
 
 #ifndef CONFIG_ARCH_HAS_SET_DIRECT_MAP
-static inline int set_direct_map_invalid_noflush(struct page *page)
+static inline int set_direct_map_invalid_noflush(struct page *page, int numpages)
 {
 	return 0;
 }
-static inline int set_direct_map_default_noflush(struct page *page)
+static inline int set_direct_map_default_noflush(struct page *page, int numpages)
 {
 	return 0;
 }
diff --git a/kernel/power/snapshot.c b/kernel/power/snapshot.c
index d63560e1cf87..64b7aab9aee4 100644
--- a/kernel/power/snapshot.c
+++ b/kernel/power/snapshot.c
@@ -86,7 +86,7 @@ static inline void hibernate_restore_unprotect_page(void *page_address) {}
 static inline void hibernate_map_page(struct page *page)
 {
 	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
-		int ret = set_direct_map_default_noflush(page);
+		int ret = set_direct_map_default_noflush(page, 1);
 
 		if (ret)
 			pr_warn_once("Failed to remap page\n");
@@ -99,7 +99,7 @@ static inline void hibernate_unmap_page(struct page *page)
 {
 	if (IS_ENABLED(CONFIG_ARCH_HAS_SET_DIRECT_MAP)) {
 		unsigned long addr = (unsigned long)page_address(page);
-		int ret  = set_direct_map_invalid_noflush(page);
+		int ret = set_direct_map_invalid_noflush(page, 1);
 
 		if (ret)
 			pr_warn_once("Failed to remap page\n");
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d5f2a84e488a..1da9cd1d0758 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2195,13 +2195,14 @@ struct vm_struct *remove_vm_area(const void *addr)
 }
 
 static inline void set_area_direct_map(const struct vm_struct *area,
-				       int (*set_direct_map)(struct page *page))
+				       int (*set_direct_map)(struct page *page,
+							     int numpages))
 {
 	int i;
 
 	for (i = 0; i < area->nr_pages; i++)
 		if (page_address(area->pages[i]))
-			set_direct_map(area->pages[i]);
+			set_direct_map(area->pages[i], 1);
 }
 
 /* Handle removing and resetting vm mappings related to the vm_struct. */
-- 
2.28.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
  2021-02-08  8:49 ` Mike Rapoport
  (?)
  (?)
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that
allows checking whether calling set_direct_map_*() will actually change
the page table, replace several occurrences of open-coded checks in arm64
with the new function, and provide a generic stub for architectures that
always modify page tables upon calls to the set_direct_map APIs.
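
For illustration only (not part of the patch; the error returns here are a
hypothetical sketch), a caller that depends on the pages really being
removed from the direct map can now query the helper up front:

	if (!can_set_direct_map())
		return -ENOSYS;	/* the direct map cannot be modified on this system */

	if (set_direct_map_invalid_noflush(page, 1))
		return -ENOMEM;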

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 7 files changed, 42 insertions(+), 14 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d3598419a284..b1bdf83a73db 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -136,12 +136,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index a0b144cfaea7..0cbc50c4fa5a 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9445eb77e3da..bd8521637120 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that
allows checking whether calling set_direct_map_*() will actually change
the page table, replace several occurrences of open-coded checks in arm64
with the new function, and provide a generic stub for architectures that
always modify page tables upon calls to the set_direct_map APIs.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 7 files changed, 42 insertions(+), 14 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d3598419a284..b1bdf83a73db 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -136,12 +136,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index a0b144cfaea7..0cbc50c4fa5a 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9445eb77e3da..bd8521637120 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that
allows checking whether calling set_direct_map_*() will actually change
the page table, replace several occurrences of open-coded checks in arm64
with the new function, and provide a generic stub for architectures that
always modify page tables upon calls to the set_direct_map APIs.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 7 files changed, 42 insertions(+), 14 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d3598419a284..b1bdf83a73db 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -136,12 +136,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index a0b144cfaea7..0cbc50c4fa5a 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9445eb77e3da..bd8521637120 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 05/10] set_memory: allow querying whether set_direct_map_*() is actually enabled
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

On arm64, set_direct_map_*() functions may return 0 without actually
changing the linear map.  This behaviour can be controlled using kernel
parameters, so we need a way to determine at runtime whether calls to
set_direct_map_invalid_noflush() and set_direct_map_default_noflush() have
any effect.

Extend the set_memory API with a can_set_direct_map() function that
allows checking whether calling set_direct_map_*() will actually change
the page table, replace several occurrences of open-coded checks in arm64
with the new function, and provide a generic stub for architectures that
always modify page tables upon calls to the set_direct_map APIs.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/Kbuild       |  1 -
 arch/arm64/include/asm/cacheflush.h |  6 ------
 arch/arm64/include/asm/set_memory.h | 17 +++++++++++++++++
 arch/arm64/kernel/machine_kexec.c   |  1 +
 arch/arm64/mm/mmu.c                 |  6 +++---
 arch/arm64/mm/pageattr.c            | 13 +++++++++----
 include/linux/set_memory.h          | 12 ++++++++++++
 7 files changed, 42 insertions(+), 14 deletions(-)
 create mode 100644 arch/arm64/include/asm/set_memory.h

diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 07ac208edc89..73aa25843f65 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -3,5 +3,4 @@ generic-y += early_ioremap.h
 generic-y += mcs_spinlock.h
 generic-y += qrwlock.h
 generic-y += qspinlock.h
-generic-y += set_memory.h
 generic-y += user.h
diff --git a/arch/arm64/include/asm/cacheflush.h b/arch/arm64/include/asm/cacheflush.h
index d3598419a284..b1bdf83a73db 100644
--- a/arch/arm64/include/asm/cacheflush.h
+++ b/arch/arm64/include/asm/cacheflush.h
@@ -136,12 +136,6 @@ static __always_inline void __flush_icache_all(void)
 	dsb(ish);
 }
 
-int set_memory_valid(unsigned long addr, int numpages, int enable);
-
-int set_direct_map_invalid_noflush(struct page *page, int numpages);
-int set_direct_map_default_noflush(struct page *page, int numpages);
-bool kernel_page_present(struct page *page);
-
 #include <asm-generic/cacheflush.h>
 
 #endif /* __ASM_CACHEFLUSH_H */
diff --git a/arch/arm64/include/asm/set_memory.h b/arch/arm64/include/asm/set_memory.h
new file mode 100644
index 000000000000..ecb6b0f449ab
--- /dev/null
+++ b/arch/arm64/include/asm/set_memory.h
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+
+#ifndef _ASM_ARM64_SET_MEMORY_H
+#define _ASM_ARM64_SET_MEMORY_H
+
+#include <asm-generic/set_memory.h>
+
+bool can_set_direct_map(void);
+#define can_set_direct_map can_set_direct_map
+
+int set_memory_valid(unsigned long addr, int numpages, int enable);
+
+int set_direct_map_invalid_noflush(struct page *page, int numpages);
+int set_direct_map_default_noflush(struct page *page, int numpages);
+bool kernel_page_present(struct page *page);
+
+#endif /* _ASM_ARM64_SET_MEMORY_H */
diff --git a/arch/arm64/kernel/machine_kexec.c b/arch/arm64/kernel/machine_kexec.c
index a0b144cfaea7..0cbc50c4fa5a 100644
--- a/arch/arm64/kernel/machine_kexec.c
+++ b/arch/arm64/kernel/machine_kexec.c
@@ -11,6 +11,7 @@
 #include <linux/kernel.h>
 #include <linux/kexec.h>
 #include <linux/page-flags.h>
+#include <linux/set_memory.h>
 #include <linux/smp.h>
 
 #include <asm/cacheflush.h>
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 9445eb77e3da..bd8521637120 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -22,6 +22,7 @@
 #include <linux/io.h>
 #include <linux/mm.h>
 #include <linux/vmalloc.h>
+#include <linux/set_memory.h>
 
 #include <asm/barrier.h>
 #include <asm/cputype.h>
@@ -492,7 +493,7 @@ static void __init map_mem(pgd_t *pgdp)
 	int flags = 0;
 	u64 i;
 
-	if (rodata_full || crash_mem_map || debug_pagealloc_enabled())
+	if (can_set_direct_map() || crash_mem_map)
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	/*
@@ -1470,8 +1471,7 @@ int arch_add_memory(int nid, u64 start, u64 size,
 	 * KFENCE requires linear map to be mapped at page granularity, so that
 	 * it is possible to protect/unprotect single pages in the KFENCE pool.
 	 */
-	if (rodata_full || debug_pagealloc_enabled() ||
-	    IS_ENABLED(CONFIG_KFENCE))
+	if (can_set_direct_map() || IS_ENABLED(CONFIG_KFENCE))
 		flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
 	__create_pgd_mapping(swapper_pg_dir, start, __phys_to_virt(start),
diff --git a/arch/arm64/mm/pageattr.c b/arch/arm64/mm/pageattr.c
index b53ef37bf95a..d505172265b0 100644
--- a/arch/arm64/mm/pageattr.c
+++ b/arch/arm64/mm/pageattr.c
@@ -19,6 +19,11 @@ struct page_change_data {
 
 bool rodata_full __ro_after_init = IS_ENABLED(CONFIG_RODATA_FULL_DEFAULT_ENABLED);
 
+bool can_set_direct_map(void)
+{
+	return rodata_full || debug_pagealloc_enabled();
+}
+
 static int change_page_range(pte_t *ptep, unsigned long addr, void *data)
 {
 	struct page_change_data *cdata = data;
@@ -156,7 +161,7 @@ int set_direct_map_invalid_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -172,7 +177,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 	};
 	unsigned long size = PAGE_SIZE * numpages;
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return 0;
 
 	return apply_to_page_range(&init_mm,
@@ -183,7 +188,7 @@ int set_direct_map_default_noflush(struct page *page, int numpages)
 #ifdef CONFIG_DEBUG_PAGEALLOC
 void __kernel_map_pages(struct page *page, int numpages, int enable)
 {
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return;
 
 	set_memory_valid((unsigned long)page_address(page), numpages, enable);
@@ -208,7 +213,7 @@ bool kernel_page_present(struct page *page)
 	pte_t *ptep;
 	unsigned long addr = (unsigned long)page_address(page);
 
-	if (!debug_pagealloc_enabled() && !rodata_full)
+	if (!can_set_direct_map())
 		return true;
 
 	pgdp = pgd_offset_k(addr);
diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index c650f82db813..7b4b6626032d 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -28,7 +28,19 @@ static inline bool kernel_page_present(struct page *page)
 {
 	return true;
 }
+#else /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
+/*
+ * Some architectures, e.g. ARM64, can disable direct map modifications at
+ * boot time. Let them override this query.
+ */
+#ifndef can_set_direct_map
+static inline bool can_set_direct_map(void)
+{
+	return true;
+}
+#define can_set_direct_map can_set_direct_map
 #endif
+#endif /* CONFIG_ARCH_HAS_SET_DIRECT_MAP */
 
 #ifndef set_mce_nospec
 static inline int set_mce_nospec(unsigned long pfn, bool unmap)
-- 
2.28.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 06/10] arm64: kfence: fix header inclusion
  2021-02-08  8:49 ` Mike Rapoport
  (?)
  (?)
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Marco Elver, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov

From: Arnd Bergmann <arnd@arndb.de>

Randconfig builds started warning about a missing function declaration
after set_memory_valid() is moved to a new file:

In file included from mm/kfence/core.c:26:
arch/arm64/include/asm/kfence.h:17:2: error: implicit declaration of function 'set_memory_valid' [-Werror,-Wimplicit-function-declaration]

Include the correct header again.

Link: https://lkml.kernel.org/r/20210125125025.102381-1-arnd@kernel.org
Fixes: 9e18ec3cfabd ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
Fixes: 204555ff8bd6 ("arm64, kfence: enable KFENCE for ARM64")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/kfence.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
-- 
2.28.0
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 06/10] arm64: kfence: fix header inclusion
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Marco Elver, Alexander Potapenko, Andrey Konovalov,
	Dmitry Vyukov

From: Arnd Bergmann <arnd@arndb.de>

Randconfig builds started warning about a missing function declaration
after set_memory_valid() is moved to a new file:

In file included from mm/kfence/core.c:26:
arch/arm64/include/asm/kfence.h:17:2: error: implicit declaration of function 'set_memory_valid' [-Werror,-Wimplicit-function-declaration]

Include the correct header again.

Link: https://lkml.kernel.org/r/20210125125025.102381-1-arnd@kernel.org
Fixes: 9e18ec3cfabd ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
Fixes: 204555ff8bd6 ("arm64, kfence: enable KFENCE for ARM64")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/kfence.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 06/10] arm64: kfence: fix header inclusion
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, Alexander Potapenko,
	linux-kselftest, H. Peter Anvin, Christopher Lameter, Shuah Khan,
	Thomas Gleixner, Elena Reshetova, linux-arch, Tycho Andersen,
	linux-nvdimm, Will Deacon, x86, Matthew Wilcox, Mike Rapoport,
	Ingo Molnar, Michael Kerrisk, Marco Elver, Arnd Bergmann,
	Andrey Konovalov, James Bottomley, Dmitry Vyukov,
	Borislav Petkov, Alexander Viro, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Dan Williams, linux-arm-kernel, linux-api,
	linux-kernel, linux-riscv, Palmer Dabbelt, linux-fsdevel,
	Shakeel Butt, Rick Edgecombe, Roman Gushchin, Mike Rapoport

From: Arnd Bergmann <arnd@arndb.de>

Randconfig builds started warning about a missing function declaration
after set_memory_valid() is moved to a new file:

In file included from mm/kfence/core.c:26:
arch/arm64/include/asm/kfence.h:17:2: error: implicit declaration of function 'set_memory_valid' [-Werror,-Wimplicit-function-declaration]

Include the correct header again.

Link: https://lkml.kernel.org/r/20210125125025.102381-1-arnd@kernel.org
Fixes: 9e18ec3cfabd ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
Fixes: 204555ff8bd6 ("arm64, kfence: enable KFENCE for ARM64")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/kfence.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
-- 
2.28.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 06/10] arm64: kfence: fix header inclusion
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, Alexander Potapenko,
	linux-kselftest, H. Peter Anvin, Christopher Lameter, Shuah Khan,
	Thomas Gleixner, Elena Reshetova, linux-arch, Tycho Andersen,
	linux-nvdimm, Will Deacon, x86, Matthew Wilcox, Mike Rapoport,
	Ingo Molnar, Michael Kerrisk, Marco Elver, Arnd Bergmann,
	Andrey Konovalov, James Bottomley, Dmitry Vyukov,
	Borislav Petkov, Alexander Viro, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Dan Williams, linux-arm-kernel, linux-api,
	linux-kernel, linux-riscv, Palmer Dabbelt, linux-fsdevel,
	Shakeel Butt, Rick Edgecombe, Roman Gushchin, Mike Rapoport

From: Arnd Bergmann <arnd@arndb.de>

Randconfig builds started warning about a missing function declaration
after set_memory_valid() is moved to a new file:

In file included from mm/kfence/core.c:26:
arch/arm64/include/asm/kfence.h:17:2: error: implicit declaration of function 'set_memory_valid' [-Werror,-Wimplicit-function-declaration]

Include the correct header again.

Link: https://lkml.kernel.org/r/20210125125025.102381-1-arnd@kernel.org
Fixes: 9e18ec3cfabd ("set_memory: allow querying whether set_direct_map_*() is actually enabled")
Fixes: 204555ff8bd6 ("arm64, kfence: enable KFENCE for ARM64")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Marco Elver <elver@google.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/kfence.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/kfence.h b/arch/arm64/include/asm/kfence.h
index d061176d57ea..aa855c6a0ae6 100644
--- a/arch/arm64/include/asm/kfence.h
+++ b/arch/arm64/include/asm/kfence.h
@@ -8,7 +8,7 @@
 #ifndef __ASM_KFENCE_H
 #define __ASM_KFENCE_H
 
-#include <asm/cacheflush.h>
+#include <asm/set_memory.h>
 
 static inline bool arch_kfence_init_pool(void) { return true; }
 
-- 
2.28.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08  8:49 ` Mike Rapoport
  (?)
  (?)
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process and are mapped
neither into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.
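
For illustration only: assuming the boot parameter name follows the
module_param_named() declaration in mm/secretmem.c below, the administrator
would enable the feature by adding

	secretmem.enable=1

to the kernel command line.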

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls on this file descriptor will be unmapped from the kernel
direct map and will only be mapped in the page tables of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful, and the file descriptor
based approach allows sealing the operations.

As the secret memory implementation is not an extension of tmpfs or
hugetlbfs, using a dedicated system call rather than hooking new
functionality into memfd_create(2) emphasises that memfd_secret(2) has
different semantics and allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel,
it will become an opt-in request in the system call flags so that the user
will have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory, so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
mlock() a secretmem range will fail and mlockall() will ignore secretmem
mappings.
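
A minimal, hypothetical user space helper (not part of this patch; the
name and the policy are illustrative only) could check the limit before
mapping secretmem:

	#include <stddef.h>
	#include <sys/resource.h>

	/* Return 0 if RLIMIT_MEMLOCK allows locking at least "need" bytes. */
	static int memlock_allows(size_t need)
	{
		struct rlimit rl;

		if (getrlimit(RLIMIT_MEMLOCK, &rl) < 0)
			return -1;
		if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= need)
			return 0;
		return -1;
	}

Processes with CAP_IPC_LOCK are exempt from the limit, so a real helper
would likely special-case that as well.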

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);
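
A more complete sketch with error handling, assuming __NR_memfd_secret is
provided by the syscall table wiring later in this series and that
secretmem was enabled at boot:

	#include <err.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>

	#define MAP_SIZE	(4UL * 4096)

	int main(void)
	{
		long fd;
		char *ptr;

		fd = syscall(__NR_memfd_secret, 0);
		if (fd < 0)
			err(1, "memfd_secret");	/* ENOSYS unless enabled at boot */

		if (ftruncate(fd, MAP_SIZE) < 0)
			err(1, "ftruncate");

		ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (ptr == MAP_FAILED)
			err(1, "mmap");

		/* data written here is not visible via the kernel direct map */
		strcpy(ptr, "secret");

		munmap(ptr, MAP_SIZE);
		close(fd);
		return 0;
	}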

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0

* [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process and are mapped
neither into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls on this file descriptor will be unmapped from the kernel
direct map and will only be mapped in the page tables of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful, and the file descriptor
based approach allows sealing the operations.

As the secret memory implementation is not an extension of tmpfs or
hugetlbfs, using a dedicated system call rather than hooking new
functionality into memfd_create(2) emphasises that memfd_secret(2) has
different semantics and allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel,
it will become an opt-in request in the system call flags so that the user
will have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory, so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
mlock() a secretmem range will fail and mlockall() will ignore secretmem
mappings.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0



* [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process and are mapped
neither into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls on this file descriptor will be unmapped from the kernel
direct map and will only be mapped in the page tables of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful, and the file descriptor
based approach allows sealing the operations.

As the secret memory implementation is not an extension of tmpfs or
hugetlbfs, using a dedicated system call rather than hooking new
functionality into memfd_create(2) emphasises that memfd_secret(2) has
different semantics and allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel,
it will become an opt-in request in the system call flags so that the user
will have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory, so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
mlock() a secretmem range will fail and mlockall() will ignore secretmem
mappings.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0



* [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process and are mapped
neither into other processes nor into the kernel page tables.

The secretmem feature is off by default and the user must explicitly enable
it at boot time.

Once secretmem is enabled, the user will be able to create a file
descriptor using the memfd_secret() system call. The memory areas created
by mmap() calls on this file descriptor will be unmapped from the kernel
direct map and will only be mapped in the page tables of the owning mm.

The file descriptor based memory has several advantages over the
"traditional" mm interfaces such as mlock(), mprotect() and madvise(). It
paves the way for VMMs to remove the secret memory range from the process;
there may be situations where sharing is useful, and the file descriptor
based approach allows sealing the operations.

As the secret memory implementation is not an extension of tmpfs or
hugetlbfs, using a dedicated system call rather than hooking new
functionality into memfd_create(2) emphasises that memfd_secret(2) has
different semantics and allows better upwards compatibility.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not exposed to the kernel otherwise; secret memory
areas are removed from the direct map and functions in the
follow_page()/get_user_page() family will refuse to return a page that
belongs to the secret memory area.

Once there is a use case that requires exposing secretmem to the kernel,
it will become an opt-in request in the system call flags so that the user
will have to decide what data can be exposed to the kernel.

Removing pages from the direct map may cause fragmentation of the direct
map on architectures that use large pages to map physical memory, which
affects system performance. However, the original Kconfig text for
CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
("x86: add gbpages switches")) and the recent report [1] showed that "...
although 1G mappings are a good default choice, there is no compelling
evidence that it must be the only choice". Hence, it is sufficient to have
secretmem disabled by default with the ability of a system administrator to
enable it at boot time.

The secretmem mappings are locked in memory, so they cannot exceed
RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
mlock() a secretmem range will fail and mlockall() will ignore secretmem
mappings.

Pages in the secretmem regions are unevictable and unmovable to avoid
accidental exposure of the sensitive data via swap or during page
migration.

A page that was a part of the secret memory area is cleared when it is
freed to ensure the data is not exposed to the next user of that page.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

[1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/mlock.c                 |   3 +-
 mm/secretmem.c             | 246 +++++++++++++++++++++++++++++++++++++
 8 files changed, 289 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 19aa806890d5..e9a2011ee4a2 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -352,6 +352,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include <linux/rmap.h>
 #include <linux/swap.h>
 #include <linux/swapops.h>
+#include <linux/secretmem.h>
 
 #include <linux/sched/signal.h>
 #include <linux/rwsem.h>
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/mlock.c b/mm/mlock.c
index 73960bb3464d..127e72dcac3d 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -23,6 +23,7 @@
 #include <linux/hugetlb.h>
 #include <linux/memcontrol.h>
 #include <linux/mm_inline.h>
+#include <linux/secretmem.h>
 
 #include "internal.h"
 
@@ -503,7 +504,7 @@ static int mlock_fixup(struct vm_area_struct *vma, struct vm_area_struct **prev,
 
 	if (newflags == vma->vm_flags || (vma->vm_flags & VM_SPECIAL) ||
 	    is_vm_hugetlb_page(vma) || vma == get_gate_vma(current->mm) ||
-	    vma_is_dax(vma))
+	    vma_is_dax(vma) || vma_is_secretmem(vma))
 		/* don't set VM_LOCKED or VM_LOCKONFAULT and don't count */
 		goto out;
 
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..fa6738e860c2
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,246 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2021
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/fs.h>
+#include <linux/swap.h>
+#include <linux/mount.h>
+#include <linux/memfd.h>
+#include <linux/bitops.h>
+#include <linux/printk.h>
+#include <linux/pagemap.h>
+#include <linux/syscalls.h>
+#include <linux/pseudo_fs.h>
+#include <linux/secretmem.h>
+#include <linux/set_memory.h>
+#include <linux/sched/signal.h>
+
+#include <uapi/linux/magic.h>
+
+#include <asm/tlbflush.h>
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+static bool secretmem_enable __ro_after_init;
+module_param_named(enable, secretmem_enable, bool, 0400);
+MODULE_PARM_DESC(secretmem_enable,
+		 "Enable secretmem and memfd_secret(2) system call");
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	gfp_t gfp = vmf->gfp_mask;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = alloc_page(gfp | __GFP_ZERO);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache_lru(page, mapping, offset, gfp);
+		if (unlikely(err)) {
+			put_page(page);
+			/*
+			 * If a split of large page was required, it
+			 * already happened when we marked the page invalid
+			 * which guarantees that this call won't fail
+			 */
+			set_direct_map_default_noflush(page, 1);
+			if (err == -EEXIST)
+				goto retry;
+
+			return vmf_error(err);
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_flags |= VM_LOCKED | VM_DONTDUMP;
+	vma->vm_ops = &secretmem_vm_ops;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap		= secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_inode;
+
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	return file;
+
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (!secretmem_enable)
+		return -ENOSYS;
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	return init_pseudo(fc, SECRETMEM_MAGIC) ? 0 : -ENOMEM;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	if (!secretmem_enable)
+		return ret;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0
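
For illustration, a minimal userspace sketch of how the descriptor created
by the code above is meant to be used (not part of the patch; the
__NR_memfd_secret fallback assumes the number assigned later in this
series, and the kernel must be booted with secretmem.enable=1):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 443	/* illustrative; wired up later in this series */
#endif

int main(void)
{
	long page = sysconf(_SC_PAGE_SIZE);
	char *secret;
	int fd;

	/* no flags other than O_CLOEXEC are defined so far */
	fd = syscall(__NR_memfd_secret, 0);
	if (fd < 0) {
		perror("memfd_secret");	/* ENOSYS if disabled or not wired up */
		return 1;
	}

	/* size the backing file; pages are only allocated at fault time */
	if (ftruncate(fd, page) < 0) {
		perror("ftruncate");
		return 1;
	}

	/* only shared mappings are accepted; the VMA is implicitly locked */
	secret = mmap(NULL, page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (secret == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	strcpy(secret, "not present in the kernel direct map");

	/* ... use the secret ... */

	memset(secret, 0, page);	/* wipe before releasing */
	munmap(secret, page);
	close(fd);
	return 0;
}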



* [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

It is unsafe to allow saving of secretmem areas to the hibernation
snapshot: they would be visible after resume, which would defeat the
purpose of secret memory mappings.

Prevent hibernation whenever there are active secret memory users.
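
The effect is visible from userspace: while at least one memfd_secret()
descriptor is open anywhere in the system, requests to hibernate are
refused.  A rough, illustrative sketch (not part of this patch; it assumes
a kernel with CONFIG_HIBERNATION, root privileges and an already-open
secretmem descriptor, and relies on hibernate() returning -EPERM once
hibernation_available() is false):

#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>

/*
 * Try to enter hibernation.  This is expected to fail with EPERM while
 * any secretmem fd is open, and to work again after the last one is
 * released and secretmem_users drops back to zero.
 */
int main(void)
{
	int fd = open("/sys/power/state", O_WRONLY);

	if (fd < 0) {
		perror("open /sys/power/state");
		return 1;
	}

	if (write(fd, "disk", 4) < 0)
		fprintf(stderr, "hibernation refused: %s\n", strerror(errno));
	else
		printf("hibernation entered\n");

	close(fd);
	return 0;
}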

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 include/linux/secretmem.h |  6 ++++++
 kernel/power/hibernate.c  |  5 ++++-
 mm/secretmem.c            | 15 +++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
index 70e7db9f94fe..907a6734059c 100644
--- a/include/linux/secretmem.h
+++ b/include/linux/secretmem.h
@@ -6,6 +6,7 @@
 
 bool vma_is_secretmem(struct vm_area_struct *vma);
 bool page_is_secretmem(struct page *page);
+bool secretmem_active(void);
 
 #else
 
@@ -19,6 +20,11 @@ static inline bool page_is_secretmem(struct page *page)
 	return false;
 }
 
+static inline bool secretmem_active(void)
+{
+	return false;
+}
+
 #endif /* CONFIG_SECRETMEM */
 
 #endif /* _LINUX_SECRETMEM_H */
diff --git a/kernel/power/hibernate.c b/kernel/power/hibernate.c
index da0b41914177..559acef3fddb 100644
--- a/kernel/power/hibernate.c
+++ b/kernel/power/hibernate.c
@@ -31,6 +31,7 @@
 #include <linux/genhd.h>
 #include <linux/ktime.h>
 #include <linux/security.h>
+#include <linux/secretmem.h>
 #include <trace/events/power.h>
 
 #include "power.h"
@@ -81,7 +82,9 @@ void hibernate_release(void)
 
 bool hibernation_available(void)
 {
-	return nohibernate == 0 && !security_locked_down(LOCKDOWN_HIBERNATION);
+	return nohibernate == 0 &&
+		!security_locked_down(LOCKDOWN_HIBERNATION) &&
+		!secretmem_active();
 }
 
 /**
diff --git a/mm/secretmem.c b/mm/secretmem.c
index fa6738e860c2..f2ae3f32a193 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -40,6 +40,13 @@ module_param_named(enable, secretmem_enable, bool, 0400);
 MODULE_PARM_DESC(secretmem_enable,
 		 "Enable secretmem and memfd_secret(2) system call");
 
+static atomic_t secretmem_users;
+
+bool secretmem_active(void)
+{
+	return !!atomic_read(&secretmem_users);
+}
+
 static vm_fault_t secretmem_fault(struct vm_fault *vmf)
 {
 	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -94,6 +101,12 @@ static const struct vm_operations_struct secretmem_vm_ops = {
 	.fault = secretmem_fault,
 };
 
+static int secretmem_release(struct inode *inode, struct file *file)
+{
+	atomic_dec(&secretmem_users);
+	return 0;
+}
+
 static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
 	unsigned long len = vma->vm_end - vma->vm_start;
@@ -116,6 +129,7 @@ bool vma_is_secretmem(struct vm_area_struct *vma)
 }
 
 static const struct file_operations secretmem_fops = {
+	.release	= secretmem_release,
 	.mmap		= secretmem_mmap,
 };
 
@@ -212,6 +226,7 @@ SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
 	file->f_flags |= O_LARGEFILE;
 
 	fd_install(fd, file);
+	atomic_inc(&secretmem_users);
 	return fd;
 
 err_put_fd:
-- 
2.28.0

* [PATCH v17 09/10] arch, mm: wire up memfd_secret system call where relevant
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Palmer Dabbelt, Hagen Paul Pfeifer

From: Mike Rapoport <rppt@linux.ibm.com>

Wire up memfd_secret system call on architectures that define
ARCH_HAS_SET_DIRECT_MAP, namely arm64, risc-v and x86.
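
No libc wrapper exists yet, so userspace has to invoke the call by number.
An illustrative probe (the fallback define matches the number added by this
patch; -ENOSYS covers both an architecture that does not select
__ARCH_WANT_MEMFD_SECRET and a kernel booted without secretmem.enable=1):

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>
#include <errno.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 443	/* x86, arm64 and riscv per this patch */
#endif

/* returns an open secretmem fd, or -errno; -ENOSYS means "not available" */
int memfd_secret_fd(unsigned int flags)
{
	long fd = syscall(__NR_memfd_secret, flags);

	return fd < 0 ? -errno : (int)fd;
}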

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Palmer Dabbelt <palmerdabbelt@google.com>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/uapi/asm/unistd.h   | 1 +
 arch/riscv/include/asm/unistd.h        | 1 +
 arch/x86/entry/syscalls/syscall_32.tbl | 1 +
 arch/x86/entry/syscalls/syscall_64.tbl | 1 +
 include/linux/syscalls.h               | 1 +
 include/uapi/asm-generic/unistd.h      | 6 +++++-
 scripts/checksyscalls.sh               | 4 ++++
 7 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/uapi/asm/unistd.h b/arch/arm64/include/uapi/asm/unistd.h
index f83a70e07df8..ce2ee8f1e361 100644
--- a/arch/arm64/include/uapi/asm/unistd.h
+++ b/arch/arm64/include/uapi/asm/unistd.h
@@ -20,5 +20,6 @@
 #define __ARCH_WANT_SET_GET_RLIMIT
 #define __ARCH_WANT_TIME32_SYSCALLS
 #define __ARCH_WANT_SYS_CLONE3
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <asm-generic/unistd.h>
diff --git a/arch/riscv/include/asm/unistd.h b/arch/riscv/include/asm/unistd.h
index 977ee6181dab..6c316093a1e5 100644
--- a/arch/riscv/include/asm/unistd.h
+++ b/arch/riscv/include/asm/unistd.h
@@ -9,6 +9,7 @@
  */
 
 #define __ARCH_WANT_SYS_CLONE
+#define __ARCH_WANT_MEMFD_SECRET
 
 #include <uapi/asm/unistd.h>
 
diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index a1c9f496fca6..34f04076a140 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -447,3 +447,4 @@
 440	i386	process_madvise		sys_process_madvise
 441	i386	epoll_pwait2		sys_epoll_pwait2		compat_sys_epoll_pwait2
 442	i386	mount_setattr		sys_mount_setattr
+443	i386	memfd_secret		sys_memfd_secret
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 7bf01cbe582f..bd3783edf27f 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -364,6 +364,7 @@
 440	common	process_madvise		sys_process_madvise
 441	common	epoll_pwait2		sys_epoll_pwait2
 442	common	mount_setattr		sys_mount_setattr
+443	common	memfd_secret		sys_memfd_secret
 
 #
 # Due to a historical design error, certain syscalls are numbered differently
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index cd7b5c817ba2..ad7ac9717884 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -1041,6 +1041,7 @@ asmlinkage long sys_pidfd_send_signal(int pidfd, int sig,
 				       siginfo_t __user *info,
 				       unsigned int flags);
 asmlinkage long sys_pidfd_getfd(int pidfd, int fd, unsigned int flags);
+asmlinkage long sys_memfd_secret(unsigned long flags);
 
 /*
  * Architecture-specific system calls
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ce58cff99b66..7ac0732dbaa4 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -863,9 +863,13 @@ __SYSCALL(__NR_process_madvise, sys_process_madvise)
 __SC_COMP(__NR_epoll_pwait2, sys_epoll_pwait2, compat_sys_epoll_pwait2)
 #define __NR_mount_setattr 442
 __SYSCALL(__NR_mount_setattr, sys_mount_setattr)
+#ifdef __ARCH_WANT_MEMFD_SECRET
+#define __NR_memfd_secret 443
+__SYSCALL(__NR_memfd_secret, sys_memfd_secret)
+#endif
 
 #undef __NR_syscalls
-#define __NR_syscalls 443
+#define __NR_syscalls 444
 
 /*
  * 32 bit systems traditionally used different
diff --git a/scripts/checksyscalls.sh b/scripts/checksyscalls.sh
index a18b47695f55..b7609958ee36 100755
--- a/scripts/checksyscalls.sh
+++ b/scripts/checksyscalls.sh
@@ -40,6 +40,10 @@ cat << EOF
 #define __IGNORE_setrlimit	/* setrlimit */
 #endif
 
+#ifndef __ARCH_WANT_MEMFD_SECRET
+#define __IGNORE_memfd_secret
+#endif
+
 /* Missing flags argument */
 #define __IGNORE_renameat	/* renameat2 */
 
-- 
2.28.0


* [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  8:49   ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote access to the secret memory with
process_vm_read() and ptrace() fails.
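
The RLIMIT_MEMLOCK part mirrors what applications have to account for:
without CAP_IPC_LOCK, mmap() of a secretmem descriptor fails with EAGAIN
once the mapping would exceed the memlock limit.  An illustrative helper
(not part of the test; fd is assumed to come from memfd_secret(), the file
still needs ftruncate() before the memory is touched, and the caller checks
for MAP_FAILED):

#include <sys/mman.h>
#include <sys/resource.h>
#include <stddef.h>

void *map_secret_area(int fd, size_t want)
{
	struct rlimit r;

	/* stay within RLIMIT_MEMLOCK, otherwise mmap() returns EAGAIN */
	if (getrlimit(RLIMIT_MEMLOCK, &r) == 0 && want > r.rlim_cur)
		want = r.rlim_cur;

	return mmap(NULL, want, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}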

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.
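
For context, the userspace pattern the test exercises looks roughly like
the sketch below. It is not part of the patch: __NR_memfd_secret is
assumed to come from the uapi headers added earlier in this series, and
error handling is reduced to the minimum.

	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <sys/mman.h>
	#include <unistd.h>
	#include <stddef.h>

	/* Map a secret area of 'len' bytes; returns NULL on failure. */
	static void *secret_alloc(size_t len)
	{
		/* no flags are defined yet, so pass 0 */
		int fd = syscall(__NR_memfd_secret, 0);
		void *p;

		if (fd < 0)
			return NULL;

		/* the backing pages are dropped from the kernel direct map */
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p != MAP_FAILED)
			ftruncate(fd, len);	/* size the file, as the test does */

		close(fd);	/* the mapping keeps its own reference to the file */
		return p == MAP_FAILED ? NULL : p;
	}

Tearing the area down is a plain munmap(), as test_mlock_limit() below
does; read()/write() on the file descriptor itself are expected to fail,
which is what test_file_apis() checks.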

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0


^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.
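
As a rough illustration of what the remote-access cases check: a process
that somehow learned the address of the secret area still cannot read it,
because the pages are present only in the owner's page tables. The sketch
below is simplified; in the test itself the address is handed to a forked
child over a pipe, and victim_pid/remote_addr are placeholders for the
owning process and the mmap()ed address.

	#define _GNU_SOURCE
	#include <sys/uio.h>
	#include <sys/types.h>

	/* returns 0 if the remote read was blocked (the expected outcome) */
	static int peek_secret(pid_t victim_pid, void *remote_addr)
	{
		char buf[64];
		struct iovec liov = { .iov_base = buf, .iov_len = sizeof(buf) };
		struct iovec riov = { .iov_base = remote_addr, .iov_len = sizeof(buf) };

		return process_vm_readv(victim_pid, &liov, 1, &riov, 1, 0) < 0 ? 0 : -1;
	}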

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* [PATCH v17 10/10] secretmem: test: add basic selftest for memfd_secret(2)
@ 2021-02-08  8:49   ` Mike Rapoport
  0 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08  8:49 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Mark Rutland, Michal Hocko, David Hildenbrand, Peter Zijlstra,
	Catalin Marinas, Dave Hansen, linux-mm, linux-kselftest,
	H. Peter Anvin, Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Palmer Dabbelt, Arnd Bergmann, James Bottomley,
	Hagen Paul Pfeifer, Borislav Petkov, Alexander Viro,
	Andy Lutomirski, Paul Walmsley, Kirill A. Shutemov, Dan Williams,
	linux-arm-kernel, linux-api, linux-kernel, linux-riscv,
	Palmer Dabbelt, linux-fsdevel, Shakeel Butt, Rick Edgecombe,
	Roman Gushchin, Mike Rapoport

From: Mike Rapoport <rppt@linux.ibm.com>

The test verifies that a file descriptor created with memfd_secret() does
not allow read/write operations, that secret memory mappings respect
RLIMIT_MEMLOCK, and that remote accesses to the secret memory with
process_vm_readv() and ptrace() fail.
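
One detail worth lifting from the test for use elsewhere: availability of
memfd_secret(2) can be probed the same way the test's main() does, by
checking for ENOSYS. This is only a sketch of that probe; it deliberately
does not try to distinguish other failure modes.

	#define _GNU_SOURCE
	#include <errno.h>
	#include <unistd.h>
	#include <sys/syscall.h>

	/* 1: supported and usable, 0: syscall missing, -1: other failure */
	static int have_memfd_secret(void)
	{
		int fd = syscall(__NR_memfd_secret, 0);

		if (fd < 0)
			return errno == ENOSYS ? 0 : -1;

		close(fd);
		return 1;
	}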

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Elena Reshetova <elena.reshetova@intel.com>
Cc: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Bottomley <jejb@linux.ibm.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Palmer Dabbelt <palmerdabbelt@google.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tycho Andersen <tycho@tycho.ws>
Cc: Will Deacon <will@kernel.org>
---
 tools/testing/selftests/vm/.gitignore     |   1 +
 tools/testing/selftests/vm/Makefile       |   3 +-
 tools/testing/selftests/vm/memfd_secret.c | 296 ++++++++++++++++++++++
 tools/testing/selftests/vm/run_vmtests    |  17 ++
 4 files changed, 316 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/vm/memfd_secret.c

diff --git a/tools/testing/selftests/vm/.gitignore b/tools/testing/selftests/vm/.gitignore
index 9a35c3f6a557..c8deddc81e7a 100644
--- a/tools/testing/selftests/vm/.gitignore
+++ b/tools/testing/selftests/vm/.gitignore
@@ -21,4 +21,5 @@ va_128TBswitch
 map_fixed_noreplace
 write_to_hugetlbfs
 hmm-tests
+memfd_secret
 local_config.*
diff --git a/tools/testing/selftests/vm/Makefile b/tools/testing/selftests/vm/Makefile
index d42115e4284d..0200fb61646c 100644
--- a/tools/testing/selftests/vm/Makefile
+++ b/tools/testing/selftests/vm/Makefile
@@ -34,6 +34,7 @@ TEST_GEN_FILES += khugepaged
 TEST_GEN_FILES += map_fixed_noreplace
 TEST_GEN_FILES += map_hugetlb
 TEST_GEN_FILES += map_populate
+TEST_GEN_FILES += memfd_secret
 TEST_GEN_FILES += mlock-random-test
 TEST_GEN_FILES += mlock2-tests
 TEST_GEN_FILES += mremap_dontunmap
@@ -133,7 +134,7 @@ warn_32bit_failure:
 endif
 endif
 
-$(OUTPUT)/mlock-random-test: LDLIBS += -lcap
+$(OUTPUT)/mlock-random-test $(OUTPUT)/memfd_secret: LDLIBS += -lcap
 
 $(OUTPUT)/gup_test: ../../../../mm/gup_test.h
 
diff --git a/tools/testing/selftests/vm/memfd_secret.c b/tools/testing/selftests/vm/memfd_secret.c
new file mode 100644
index 000000000000..c878c2b841fc
--- /dev/null
+++ b/tools/testing/selftests/vm/memfd_secret.c
@@ -0,0 +1,296 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport <rppt@linux.ibm.com>
+ */
+
+#define _GNU_SOURCE
+#include <sys/uio.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
+#include <sys/types.h>
+#include <sys/ptrace.h>
+#include <sys/syscall.h>
+#include <sys/resource.h>
+#include <sys/capability.h>
+
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <errno.h>
+#include <stdio.h>
+
+#include "../kselftest.h"
+
+#define fail(fmt, ...) ksft_test_result_fail(fmt, ##__VA_ARGS__)
+#define pass(fmt, ...) ksft_test_result_pass(fmt, ##__VA_ARGS__)
+#define skip(fmt, ...) ksft_test_result_skip(fmt, ##__VA_ARGS__)
+
+#ifdef __NR_memfd_secret
+
+#define PATTERN	0x55
+
+static const int prot = PROT_READ | PROT_WRITE;
+static const int mode = MAP_SHARED;
+
+static unsigned long page_size;
+static unsigned long mlock_limit_cur;
+static unsigned long mlock_limit_max;
+
+static int memfd_secret(unsigned long flags)
+{
+	return syscall(__NR_memfd_secret, flags);
+}
+
+static void test_file_apis(int fd)
+{
+	char buf[64];
+
+	if ((read(fd, buf, sizeof(buf)) >= 0) ||
+	    (write(fd, buf, sizeof(buf)) >= 0) ||
+	    (pread(fd, buf, sizeof(buf), 0) >= 0) ||
+	    (pwrite(fd, buf, sizeof(buf), 0) >= 0))
+		fail("unexpected file IO\n");
+	else
+		pass("file IO is blocked as expected\n");
+}
+
+static void test_mlock_limit(int fd)
+{
+	size_t len;
+	char *mem;
+
+	len = mlock_limit_cur;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("unable to mmap secret memory\n");
+		return;
+	}
+	munmap(mem, len);
+
+	len = mlock_limit_max * 2;
+	mem = mmap(NULL, len, prot, mode, fd, 0);
+	if (mem != MAP_FAILED) {
+		fail("unexpected mlock limit violation\n");
+		munmap(mem, len);
+		return;
+	}
+
+	pass("mlock limit is respected\n");
+}
+
+static void try_process_vm_read(int fd, int pipefd[2])
+{
+	struct iovec liov, riov;
+	char buf[64];
+	char *mem;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		fail("pipe read: %s\n", strerror(errno));
+		exit(KSFT_FAIL);
+	}
+
+	liov.iov_len = riov.iov_len = sizeof(buf);
+	liov.iov_base = buf;
+	riov.iov_base = mem;
+
+	if (process_vm_readv(getppid(), &liov, 1, &riov, 1, 0) < 0) {
+		if (errno == ENOSYS)
+			exit(KSFT_SKIP);
+		exit(KSFT_PASS);
+	}
+
+	exit(KSFT_FAIL);
+}
+
+static void try_ptrace(int fd, int pipefd[2])
+{
+	pid_t ppid = getppid();
+	int status;
+	char *mem;
+	long ret;
+
+	if (read(pipefd[0], &mem, sizeof(mem)) < 0) {
+		perror("pipe read");
+		exit(KSFT_FAIL);
+	}
+
+	ret = ptrace(PTRACE_ATTACH, ppid, 0, 0);
+	if (ret) {
+		perror("ptrace_attach");
+		exit(KSFT_FAIL);
+	}
+
+	ret = waitpid(ppid, &status, WUNTRACED);
+	if ((ret != ppid) || !(WIFSTOPPED(status))) {
+		fprintf(stderr, "weird waitpid result %ld stat %x\n",
+			ret, status);
+		exit(KSFT_FAIL);
+	}
+
+	if (ptrace(PTRACE_PEEKDATA, ppid, mem, 0))
+		exit(KSFT_PASS);
+
+	exit(KSFT_FAIL);
+}
+
+static void check_child_status(pid_t pid, const char *name)
+{
+	int status;
+
+	waitpid(pid, &status, 0);
+
+	if (WIFEXITED(status) && WEXITSTATUS(status) == KSFT_SKIP) {
+		skip("%s is not supported\n", name);
+		return;
+	}
+
+	if ((WIFEXITED(status) && WEXITSTATUS(status) == KSFT_PASS) ||
+	    WIFSIGNALED(status)) {
+		pass("%s is blocked as expected\n", name);
+		return;
+	}
+
+	fail("%s: unexpected memory access\n", name);
+}
+
+static void test_remote_access(int fd, const char *name,
+			       void (*func)(int fd, int pipefd[2]))
+{
+	int pipefd[2];
+	pid_t pid;
+	char *mem;
+
+	if (pipe(pipefd)) {
+		fail("pipe failed: %s\n", strerror(errno));
+		return;
+	}
+
+	pid = fork();
+	if (pid < 0) {
+		fail("fork failed: %s\n", strerror(errno));
+		return;
+	}
+
+	if (pid == 0) {
+		func(fd, pipefd);
+		return;
+	}
+
+	mem = mmap(NULL, page_size, prot, mode, fd, 0);
+	if (mem == MAP_FAILED) {
+		fail("Unable to mmap secret memory\n");
+		return;
+	}
+
+	ftruncate(fd, page_size);
+	memset(mem, PATTERN, page_size);
+
+	if (write(pipefd[1], &mem, sizeof(mem)) < 0) {
+		fail("pipe write: %s\n", strerror(errno));
+		return;
+	}
+
+	check_child_status(pid, name);
+}
+
+static void test_process_vm_read(int fd)
+{
+	test_remote_access(fd, "process_vm_read", try_process_vm_read);
+}
+
+static void test_ptrace(int fd)
+{
+	test_remote_access(fd, "ptrace", try_ptrace);
+}
+
+static int set_cap_limits(rlim_t max)
+{
+	struct rlimit new;
+	cap_t cap = cap_init();
+
+	new.rlim_cur = max;
+	new.rlim_max = max;
+	if (setrlimit(RLIMIT_MEMLOCK, &new)) {
+		perror("setrlimit() returns error");
+		return -1;
+	}
+
+	/* drop capabilities including CAP_IPC_LOCK */
+	if (cap_set_proc(cap)) {
+		perror("cap_set_proc() returns error");
+		return -2;
+	}
+
+	return 0;
+}
+
+static void prepare(void)
+{
+	struct rlimit rlim;
+
+	page_size = sysconf(_SC_PAGE_SIZE);
+	if (!page_size)
+		ksft_exit_fail_msg("Failed to get page size %s\n",
+				   strerror(errno));
+
+	if (getrlimit(RLIMIT_MEMLOCK, &rlim))
+		ksft_exit_fail_msg("Unable to detect mlock limit: %s\n",
+				   strerror(errno));
+
+	mlock_limit_cur = rlim.rlim_cur;
+	mlock_limit_max = rlim.rlim_max;
+
+	printf("page_size: %ld, mlock.soft: %ld, mlock.hard: %ld\n",
+	       page_size, mlock_limit_cur, mlock_limit_max);
+
+	if (page_size > mlock_limit_cur)
+		mlock_limit_cur = page_size;
+	if (page_size > mlock_limit_max)
+		mlock_limit_max = page_size;
+
+	if (set_cap_limits(mlock_limit_max))
+		ksft_exit_fail_msg("Unable to set mlock limit: %s\n",
+				   strerror(errno));
+}
+
+#define NUM_TESTS 4
+
+int main(int argc, char *argv[])
+{
+	int fd;
+
+	prepare();
+
+	ksft_print_header();
+	ksft_set_plan(NUM_TESTS);
+
+	fd = memfd_secret(0);
+	if (fd < 0) {
+		if (errno == ENOSYS)
+			ksft_exit_skip("memfd_secret is not supported\n");
+		else
+			ksft_exit_fail_msg("memfd_secret failed: %s\n",
+					   strerror(errno));
+	}
+
+	test_mlock_limit(fd);
+	test_file_apis(fd);
+	test_process_vm_read(fd);
+	test_ptrace(fd);
+
+	close(fd);
+
+	ksft_exit(!ksft_get_fail_cnt());
+}
+
+#else /* __NR_memfd_secret */
+
+int main(int argc, char *argv[])
+{
+	printf("skip: skipping memfd_secret test (missing __NR_memfd_secret)\n");
+	return KSFT_SKIP;
+}
+
+#endif /* __NR_memfd_secret */
diff --git a/tools/testing/selftests/vm/run_vmtests b/tools/testing/selftests/vm/run_vmtests
index e953f3cd9664..95a67382f132 100755
--- a/tools/testing/selftests/vm/run_vmtests
+++ b/tools/testing/selftests/vm/run_vmtests
@@ -346,4 +346,21 @@ else
 	exitcode=1
 fi
 
+echo "running memfd_secret test"
+echo "------------------------------------"
+./memfd_secret
+ret_val=$?
+
+if [ $ret_val -eq 0 ]; then
+	echo "[PASS]"
+elif [ $ret_val -eq $ksft_skip ]; then
+	echo "[SKIP]"
+	exitcode=$ksft_skip
+else
+	echo "[FAIL]"
+	exitcode=1
+fi
+
+exit $exitcode
+
 exit $exitcode
-- 
2.28.0


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08  8:49 ` Mike Rapoport
@ 2021-02-08  9:27   ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dave Hansen,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland, Michal Hocko,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rick Edgecombe, Roman Gushchin, Shakeel Butt,
	Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon

On 08.02.21 09:49, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> @Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
> and related patches dropped from there, I can rebase whatever way you
> prefer.
> 
> This is an implementation of "secret" mappings backed by a file descriptor.
> 
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will be present only in the page table of the owning mm.
> 
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
> 
> Additionally, in the future the secret mappings may be used as a mean to
> protect guest memory in a virtual machine host.
> 
> For demonstration of secret memory usage we've created a userspace library
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git
> 
> that does two things: the first is act as a preloader for openssl to
> redirect all the OPENSSL_malloc calls to secret memory meaning any secret
> keys get automatically protected this way and the other thing it does is
> expose the API to the user who needs it. We anticipate that a lot of the
> use cases would be like the openssl one: many toolkits that deal with
> secret keys already have special handling for the memory to try to give
> them greater protection, so this would simply be pluggable into the
> toolkits without any need for user application modification.
> 
> Hiding secret memory mappings behind an anonymous file allows usage of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
> 
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs in the future.
> 
> Removing of the pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory which affects
> the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.
> 
> In addition, there is also a long term goal to improve management of the
> direct map.

Some questions (and a request to document the answers), since we now allow
unmovable allocations all over the place and I don't see a single
comment regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space
be tackled in the future?

2. How has this issue been documented? E.g., the interaction with
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?

3. What are the plans to support migration in the future, and which
interface changes will be required? (Michal mentioned some good points
about making this configurable via the interface; we should plan ahead
and document them.)

Thanks!

-- 
Thanks,

David / dhildenb
_______________________________________________
Linux-nvdimm mailing list -- linux-nvdimm@lists.01.org
To unsubscribe send an email to linux-nvdimm-leave@lists.01.org

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  9:27   ` David Hildenbrand
  0 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland, Michal Hocko,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rick Edgecombe, Roman Gushchin, Shakeel Butt,
	Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon,
	linux-api, linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86

On 08.02.21 09:49, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> @Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
> and related patches dropped from there, I can rebase whatever way you
> prefer.
> 
> This is an implementation of "secret" mappings backed by a file descriptor.
> 
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will be present only in the page table of the owning mm.
> 
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
> 
> Additionally, in the future the secret mappings may be used as a mean to
> protect guest memory in a virtual machine host.
> 
> For demonstration of secret memory usage we've created a userspace library
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git
> 
> that does two things: the first is act as a preloader for openssl to
> redirect all the OPENSSL_malloc calls to secret memory meaning any secret
> keys get automatically protected this way and the other thing it does is
> expose the API to the user who needs it. We anticipate that a lot of the
> use cases would be like the openssl one: many toolkits that deal with
> secret keys already have special handling for the memory to try to give
> them greater protection, so this would simply be pluggable into the
> toolkits without any need for user application modification.
> 
> Hiding secret memory mappings behind an anonymous file allows usage of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
> 
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs in the future.
> 
> Removing of the pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory which affects
> the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.
> 
> In addition, there is also a long term goal to improve management of the
> direct map.

Some questions (and a request to document the answers), since we now allow
unmovable allocations all over the place and I don't see a single
comment regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space
be tackled in the future?

2. How has this issue been documented? E.g., the interaction with
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?

3. What are the plans to support migration in the future, and which
interface changes will be required? (Michal mentioned some good points
about making this configurable via the interface; we should plan ahead
and document them.)

Thanks!

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  9:27   ` David Hildenbrand
  0 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Mark Rutland, Michal Hocko, Peter Zijlstra, Catalin Marinas,
	Dave Hansen, linux-mm, linux-kselftest, H. Peter Anvin,
	Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Arnd Bergmann, James Bottomley, Borislav Petkov,
	Alexander Viro, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Dan Williams, linux-arm-kernel, linux-api,
	linux-kernel, linux-riscv, Palmer Dabbelt, linux-fsdevel,
	Shakeel Butt, Rick Edgecombe, Roman Gushchin

On 08.02.21 09:49, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> @Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
> and related patches dropped from there, I can rebase whatever way you
> prefer.
> 
> This is an implementation of "secret" mappings backed by a file descriptor.
> 
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will be present only in the page table of the owning mm.
> 
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
> 
> Additionally, in the future the secret mappings may be used as a mean to
> protect guest memory in a virtual machine host.
> 
> For demonstration of secret memory usage we've created a userspace library
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git
> 
> that does two things: the first is act as a preloader for openssl to
> redirect all the OPENSSL_malloc calls to secret memory meaning any secret
> keys get automatically protected this way and the other thing it does is
> expose the API to the user who needs it. We anticipate that a lot of the
> use cases would be like the openssl one: many toolkits that deal with
> secret keys already have special handling for the memory to try to give
> them greater protection, so this would simply be pluggable into the
> toolkits without any need for user application modification.
> 
> Hiding secret memory mappings behind an anonymous file allows usage of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
> 
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs in the future.
> 
> Removing of the pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory which affects
> the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.
> 
> In addition, there is also a long term goal to improve management of the
> direct map.

Some questions (and a request to document the answers), since we now allow
unmovable allocations all over the place and I don't see a single
comment regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space
be tackled in the future?

2. How has this issue been documented? E.g., the interaction with
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?

3. What are the plans to support migration in the future, and which
interface changes will be required? (Michal mentioned some good points
about making this configurable via the interface; we should plan ahead
and document them.)

Thanks!

-- 
Thanks,

David / dhildenb


_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-08  9:27   ` David Hildenbrand
  0 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08  9:27 UTC (permalink / raw)
  To: Mike Rapoport, Andrew Morton
  Cc: Mark Rutland, Michal Hocko, Peter Zijlstra, Catalin Marinas,
	Dave Hansen, linux-mm, linux-kselftest, H. Peter Anvin,
	Christopher Lameter, Shuah Khan, Thomas Gleixner,
	Elena Reshetova, linux-arch, Tycho Andersen, linux-nvdimm,
	Will Deacon, x86, Matthew Wilcox, Mike Rapoport, Ingo Molnar,
	Michael Kerrisk, Arnd Bergmann, James Bottomley, Borislav Petkov,
	Alexander Viro, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Dan Williams, linux-arm-kernel, linux-api,
	linux-kernel, linux-riscv, Palmer Dabbelt, linux-fsdevel,
	Shakeel Butt, Rick Edgecombe, Roman Gushchin

On 08.02.21 09:49, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Hi,
> 
> @Andrew, this is based on v5.11-rc5-mmotm-2021-01-27-23-30, with secretmem
> and related patches dropped from there, I can rebase whatever way you
> prefer.
> 
> This is an implementation of "secret" mappings backed by a file descriptor.
> 
> The file descriptor backing secret memory mappings is created using a
> dedicated memfd_secret system call The desired protection mode for the
> memory is configured using flags parameter of the system call. The mmap()
> of the file descriptor created with memfd_secret() will create a "secret"
> memory mapping. The pages in that mapping will be marked as not present in
> the direct map and will be present only in the page table of the owning mm.
> 
> Although normally Linux userspace mappings are protected from other users,
> such secret mappings are useful for environments where a hostile tenant is
> trying to trick the kernel into giving them access to other tenants
> mappings.
> 
> Additionally, in the future the secret mappings may be used as a mean to
> protect guest memory in a virtual machine host.
> 
> For demonstration of secret memory usage we've created a userspace library
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git
> 
> that does two things: the first is act as a preloader for openssl to
> redirect all the OPENSSL_malloc calls to secret memory meaning any secret
> keys get automatically protected this way and the other thing it does is
> expose the API to the user who needs it. We anticipate that a lot of the
> use cases would be like the openssl one: many toolkits that deal with
> secret keys already have special handling for the memory to try to give
> them greater protection, so this would simply be pluggable into the
> toolkits without any need for user application modification.
> 
> Hiding secret memory mappings behind an anonymous file allows usage of
> the page cache for tracking pages allocated for the "secret" mappings as
> well as using address_space_operations for e.g. page migration callbacks.
> 
> The anonymous file may be also used implicitly, like hugetlb files, to
> implement mmap(MAP_SECRET) and use the secret memory areas with "native" mm
> ABIs in the future.
> 
> Removing of the pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory which affects
> the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.
> 
> In addition, there is also a long term goal to improve management of the
> direct map.

Some questions (and a request to document the answers), since we now allow
unmovable allocations all over the place and I don't see a single
comment regarding that in the cover letter:

1. How will the issue of plenty of unmovable allocations for user space
be tackled in the future?

2. How has this issue been documented? E.g., the interaction with
ZONE_MOVABLE and CMA, alloc_contig_range()/alloc_contig_pages()?

3. What are the plans to support migration in the future, and which
interface changes will be required? (Michal mentioned some good points
about making this configurable via the interface; we should plan ahead
and document them.)

Thanks!

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 293+ messages in thread
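
As a reference for the discussion above, the userspace flow described in the
cover letter boils down to roughly the following minimal C sketch. It assumes
a kernel built with this series and booted with secretmem enabled, and that
__NR_memfd_secret is provided by the kernel's uapi headers; the mapping size
and the error handling are purely illustrative.

/* Minimal sketch of the memfd_secret() + mmap() flow (illustrative only). */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#error "uapi headers providing __NR_memfd_secret are required"
#endif

#define SECRET_SIZE	(4UL * 4096)	/* arbitrary example size */

int main(void)
{
	/* Create the secretmem file descriptor; no flags are used here. */
	int fd = syscall(__NR_memfd_secret, 0);
	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}

	/* Size the backing "file", then map it MAP_SHARED as in the
	 * changelog example; the backing pages are dropped from the
	 * kernel direct map. */
	if (ftruncate(fd, SECRET_SIZE) < 0) {
		perror("ftruncate");
		return 1;
	}

	char *ptr = mmap(NULL, SECRET_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* The owning process can use the memory normally through its own
	 * page tables; the kernel will not find it in the direct map. */
	strcpy(ptr, "top secret");
	printf("%s\n", ptr);

	munmap(ptr, SECRET_SIZE);
	close(fd);
	return 0;
}

On a kernel without the feature (or with secretmem disabled at boot) the
memfd_secret() call should simply fail and the program exits.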

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08  8:49   ` Mike Rapoport
  (?)
  (?)
@ 2021-02-08 10:18     ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 10:18 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> It is unsafe to allow saving of secretmem areas to the hibernation
> snapshot as they would be visible after the resume and this essentially
> will defeat the purpose of secret memory mappings.
> 
> Prevent hibernation whenever there are active secret memory users.

Does this feature need any special handling? As it is effectively
unevictable memory, it should behave the same as other mlock or ramfs
memory, which should already disable hibernation as those cannot be
swapped out, no?

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:18     ` Michal Hocko
  (?)
  (?)
@ 2021-02-08 10:32       ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 10:32 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On 08.02.21 11:18, Michal Hocko wrote:
> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
>> From: Mike Rapoport <rppt@linux.ibm.com>
>>
>> It is unsafe to allow saving of secretmem areas to the hibernation
>> snapshot as they would be visible after the resume and this essentially
>> will defeat the purpose of secret memory mappings.
>>
>> Prevent hibernation whenever there are active secret memory users.
> 
> Does this feature need any special handling? As it is effectivelly
> unevictable memory then it should behave the same as other mlock, ramfs
> which should already disable hibernation as those cannot be swapped out,
> no?
> 

Why should unevictable memory not go to swap when hibernating? We're 
merely dumping all of our system RAM (including any unmovable 
allocations) to swap storage and the system is essentially completely 
halted.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08  8:49   ` Mike Rapoport
  (?)
  (?)
@ 2021-02-08 10:49     ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 10:49 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> From: Mike Rapoport <rppt@linux.ibm.com>
> 
> Introduce a "memfd_secret" system call with the ability to create memory
> areas visible only in the context of the owning process and mapped neither
> into other processes nor into the kernel page tables.
> 
> The secretmem feature is off by default and the user must explicitly enable
> it at boot time.
> 
> Once secretmem is enabled, the user will be able to create a file
> descriptor using the memfd_secret() system call. The memory areas created
> by mmap() calls from this file descriptor will be unmapped from the kernel
> direct map and they will be only mapped in the page table of the owning mm.

Is this really true? I guess you meant to say that the memory will be
visible only via page tables to anybody who can mmap the respective file
descriptor. There is nothing like an owning mm, as the fd is inherently a
shareable resource and the ownership becomes a very vague and hard to
define term.

> The file descriptor based memory has several advantages over the
> "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> paves the way for VMMs to remove the secret memory range from the process;

I do not understand how it helps to remove the memory from the process,
as the interface explicitly allows adding memory that is removed from
all other processes via the direct map.

> there may be situations where sharing is useful and file descriptor based
> approach allows to seal the operations.

It would be great to expand on this some more.

> As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> usage of a dedicated system call rather than hooking new functionality into
> memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> allows better upwards compatibility.

What is this supposed to mean? What are differences?

> The secret memory remains accessible in the process context using uaccess
> primitives, but it is not exposed to the kernel otherwise; secret memory
> areas are removed from the direct map and functions in the
> follow_page()/get_user_page() family will refuse to return a page that
> belongs to the secret memory area.
> 
> Once there is a use case that requires exposing secretmem to the kernel,
> it will become an opt-in request in the system call flags so that the user
> would have to decide what data can be exposed to the kernel.
>
> Removing pages from the direct map may cause its fragmentation on
> architectures that use large pages to map the physical memory, which
> affects the system performance. However, the original Kconfig text for
> CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
> improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
> ("x86: add gbpages switches")) and the recent report [1] showed that "...
> although 1G mappings are a good default choice, there is no compelling
> evidence that it must be the only choice". Hence, it is sufficient to have
> secretmem disabled by default with the ability of a system administrator to
> enable it at boot time.

OK, this looks like a reasonable compromise for the initial
implementation. Documentation of the command line parameter should be
very explicit about this though.

> The secretmem mappings are locked in memory so they cannot exceed
> RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
> mlock() a secretmem range would fail and mlockall() will ignore secretmem
> mappings.

What about munlock?

> Pages in the secretmem regions are unevictable and unmovable to avoid
> accidental exposure of the sensitive data via swap or during page
> migration.
> 
> A page that was a part of the secret memory area is cleared when it is
> freed to ensure the data is not exposed to the next user of that page.
> 
> The following example demonstrates creation of a secret mapping (error
> handling is omitted):
> 
> 	fd = memfd_secret(0);
> 	ftruncate(fd, MAP_SIZE);
> 	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
> 		   MAP_SHARED, fd, 0);

Please also list usecases which you are aware of as well.

I am also missing some more information about the implementation. E.g.
does this memory live on an unevictable LRU and therefore participate in
stats? What about memcg accounting? What is the cross-fork (CoW)/exec
behavior? How is the memory reflected in an OOM situation? Is a shared
mapping enforced?

Anyway, thanks for improving the changelog. This is definitely much more
informative.

> [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

I have only glanced through the implementation and it looks sane. I will
have a closer look later but this should be pretty simple with the
proposed semantics.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread
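
A short sketch may make the sharing point raised above concrete. Nothing
below is specific to this series beyond the creation call: it relies only on
ordinary file-descriptor inheritance and MAP_SHARED semantics, which is
exactly why an "owning mm" is hard to pin down. The syscall number is taken
from the kernel's uapi headers and the size is illustrative.

/* Illustrative sketch: a secretmem fd crosses fork(), so two mms map it. */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#error "uapi headers providing __NR_memfd_secret are required"
#endif

#define SECRET_SIZE	4096UL

int main(void)
{
	int fd = syscall(__NR_memfd_secret, 0);

	if (fd < 0 || ftruncate(fd, SECRET_SIZE) < 0)
		return 1;

	/*
	 * The fd behaves like any other file descriptor: it is inherited
	 * across fork() and could just as well be passed over a unix
	 * socket, so several mms can end up mapping the same secret pages.
	 */
	pid_t pid = fork();
	if (pid < 0)
		return 1;

	char *ptr = mmap(NULL, SECRET_SIZE, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	if (ptr == MAP_FAILED)
		return 1;

	if (pid == 0) {
		/* Child: write through its own mapping of the shared fd. */
		strcpy(ptr, "written by the child");
		return 0;
	}

	/* Parent: after the child exits, the same bytes are visible here. */
	wait(NULL);
	printf("parent reads: %s\n", ptr);

	munmap(ptr, SECRET_SIZE);
	close(fd);
	return 0;
}

Whether such sharing should be restricted or sealable is precisely the
interface question raised in the review above.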

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:32       ` David Hildenbrand
  (?)
  (?)
@ 2021-02-08 10:51         ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 10:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
> On 08.02.21 11:18, Michal Hocko wrote:
> > On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > 
> > > It is unsafe to allow saving of secretmem areas to the hibernation
> > > snapshot as they would be visible after the resume and this essentially
> > > will defeat the purpose of secret memory mappings.
> > > 
> > > Prevent hibernation whenever there are active secret memory users.
> > 
> > Does this feature need any special handling? As it is effectivelly
> > unevictable memory then it should behave the same as other mlock, ramfs
> > which should already disable hibernation as those cannot be swapped out,
> > no?
> > 
> 
> Why should unevictable memory not go to swap when hibernating? We're merely
> dumping all of our system RAM (including any unmovable allocations) to swap
> storage and the system is essentially completely halted.
> 
My understanding is that mlock is never really made visible via swap
storage.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:51         ` Michal Hocko
  (?)
  (?)
@ 2021-02-08 10:53           ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 10:53 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On 08.02.21 11:51, Michal Hocko wrote:
> On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
>> On 08.02.21 11:18, Michal Hocko wrote:
>>> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>
>>>> It is unsafe to allow saving of secretmem areas to the hibernation
>>>> snapshot as they would be visible after the resume and this essentially
>>>> will defeat the purpose of secret memory mappings.
>>>>
>>>> Prevent hibernation whenever there are active secret memory users.
>>>
>>> Does this feature need any special handling? As it is effectivelly
>>> unevictable memory then it should behave the same as other mlock, ramfs
>>> which should already disable hibernation as those cannot be swapped out,
>>> no?
>>>
>>
>> Why should unevictable memory not go to swap when hibernating? We're merely
>> dumping all of our system RAM (including any unmovable allocations) to swap
>> storage and the system is essentially completely halted.
>>
> My understanding is that mlock is never really made visible via swap
> storage.

"Using swap storage for hibernation" and "swapping at runtime" are two 
different things. I might be wrong, though.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:53           ` David Hildenbrand
@ 2021-02-08 10:57             ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 10:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Mon 08-02-21 11:53:58, David Hildenbrand wrote:
> On 08.02.21 11:51, Michal Hocko wrote:
> > On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
> > > On 08.02.21 11:18, Michal Hocko wrote:
> > > > On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
> > > > > From: Mike Rapoport <rppt@linux.ibm.com>
> > > > > 
> > > > > It is unsafe to allow saving of secretmem areas to the hibernation
> > > > > snapshot as they would be visible after the resume and this essentially
> > > > > will defeat the purpose of secret memory mappings.
> > > > > 
> > > > > Prevent hibernation whenever there are active secret memory users.
> > > > 
> > > > Does this feature need any special handling? As it is effectivelly
> > > > unevictable memory then it should behave the same as other mlock, ramfs
> > > > which should already disable hibernation as those cannot be swapped out,
> > > > no?
> > > > 
> > > 
> > > Why should unevictable memory not go to swap when hibernating? We're merely
> > > dumping all of our system RAM (including any unmovable allocations) to swap
> > > storage and the system is essentially completely halted.
> > > 
> > My understanding is that mlock is never really made visible via swap
> > storage.
> 
> "Using swap storage for hibernation" and "swapping at runtime" are two
> different things. I might be wrong, though.

Well, mlock is certainly used to keep sensitive information, not only to
protect from major/minor faults.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:57             ` Michal Hocko
@ 2021-02-08 11:13               ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 11:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On 08.02.21 11:57, Michal Hocko wrote:
> On Mon 08-02-21 11:53:58, David Hildenbrand wrote:
>> On 08.02.21 11:51, Michal Hocko wrote:
>>> On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
>>>> On 08.02.21 11:18, Michal Hocko wrote:
>>>>> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
>>>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>>>
>>>>>> It is unsafe to allow saving of secretmem areas to the hibernation
>>>>>> snapshot as they would be visible after the resume and this essentially
>>>>>> will defeat the purpose of secret memory mappings.
>>>>>>
>>>>>> Prevent hibernation whenever there are active secret memory users.
>>>>>
>>>>> Does this feature need any special handling? As it is effectivelly
>>>>> unevictable memory then it should behave the same as other mlock, ramfs
>>>>> which should already disable hibernation as those cannot be swapped out,
>>>>> no?
>>>>>
>>>>
>>>> Why should unevictable memory not go to swap when hibernating? We're merely
>>>> dumping all of our system RAM (including any unmovable allocations) to swap
>>>> storage and the system is essentially completely halted.
>>>>
>>> My understanding is that mlock is never really made visible via swap
>>> storage.
>>
>> "Using swap storage for hibernation" and "swapping at runtime" are two
>> different things. I might be wrong, though.
> 
> Well, mlock is certainly used to keep sensitive information, not only to
> protect from major/minor faults.
> 

I think you're right in theory, the man page mentions "Cryptographic 
security software often handles critical bytes like passwords or secret 
keys as data structures" ...

however, I am not aware of any such swap handling and wasn't able to 
spot it quickly. Let me take a closer look.


-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 11:13               ` David Hildenbrand
@ 2021-02-08 11:14                 ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 11:14 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On 08.02.21 12:13, David Hildenbrand wrote:
> On 08.02.21 11:57, Michal Hocko wrote:
>> On Mon 08-02-21 11:53:58, David Hildenbrand wrote:
>>> On 08.02.21 11:51, Michal Hocko wrote:
>>>> On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
>>>>> On 08.02.21 11:18, Michal Hocko wrote:
>>>>>> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
>>>>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>>>>
>>>>>>> It is unsafe to allow saving of secretmem areas to the hibernation
>>>>>>> snapshot as they would be visible after the resume and this essentially
>>>>>>> will defeat the purpose of secret memory mappings.
>>>>>>>
>>>>>>> Prevent hibernation whenever there are active secret memory users.
>>>>>>
>>>>>> Does this feature need any special handling? As it is effectivelly
>>>>>> unevictable memory then it should behave the same as other mlock, ramfs
>>>>>> which should already disable hibernation as those cannot be swapped out,
>>>>>> no?
>>>>>>
>>>>>
>>>>> Why should unevictable memory not go to swap when hibernating? We're merely
>>>>> dumping all of our system RAM (including any unmovable allocations) to swap
>>>>> storage and the system is essentially completely halted.
>>>>>
>>>> My understanding is that mlock is never really made visible via swap
>>>> storage.
>>>
>>> "Using swap storage for hibernation" and "swapping at runtime" are two
>>> different things. I might be wrong, though.
>>
>> Well, mlock is certainly used to keep sensitive information, not only to
>> protect from major/minor faults.
>>
> 
> I think you're right in theory, the man page mentions "Cryptographic
> security software often handles critical bytes like passwords or secret
> keys as data structures" ...
> 
> however, I am not aware of any such swap handling and wasn't able to
> spot it quickly. Let me take a closer look.

s/swap/hibernate/


-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 11:14                 ` David Hildenbrand
@ 2021-02-08 11:26                   ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 11:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On 08.02.21 12:14, David Hildenbrand wrote:
> On 08.02.21 12:13, David Hildenbrand wrote:
>> On 08.02.21 11:57, Michal Hocko wrote:
>>> On Mon 08-02-21 11:53:58, David Hildenbrand wrote:
>>>> On 08.02.21 11:51, Michal Hocko wrote:
>>>>> On Mon 08-02-21 11:32:11, David Hildenbrand wrote:
>>>>>> On 08.02.21 11:18, Michal Hocko wrote:
>>>>>>> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
>>>>>>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>>>>>>
>>>>>>>> It is unsafe to allow saving of secretmem areas to the hibernation
>>>>>>>> snapshot as they would be visible after the resume and this essentially
>>>>>>>> will defeat the purpose of secret memory mappings.
>>>>>>>>
>>>>>>>> Prevent hibernation whenever there are active secret memory users.
>>>>>>>
>>>>>>> Does this feature need any special handling? As it is effectivelly
>>>>>>> unevictable memory then it should behave the same as other mlock, ramfs
>>>>>>> which should already disable hibernation as those cannot be swapped out,
>>>>>>> no?
>>>>>>>
>>>>>>
>>>>>> Why should unevictable memory not go to swap when hibernating? We're merely
>>>>>> dumping all of our system RAM (including any unmovable allocations) to swap
>>>>>> storage and the system is essentially completely halted.
>>>>>>
>>>>> My understanding is that mlock is never really made visible via swap
>>>>> storage.
>>>>
>>>> "Using swap storage for hibernation" and "swapping at runtime" are two
>>>> different things. I might be wrong, though.
>>>
>>> Well, mlock is certainly used to keep sensitive information, not only to
>>> protect from major/minor faults.
>>>
>>
>> I think you're right in theory, the man page mentions "Cryptographic
>> security software often handles critical bytes like passwords or secret
>> keys as data structures" ...
>>
>> however, I am not aware of any such swap handling and wasn't able to
>> spot it quickly. Let me take a closer look.
> 
> s/swap/hibernate/

My F33 system happily hibernates to disk, even with an application that 
succeeded in doing an mlockall().

And it somewhat makes sense. Even my freshly-booted, idle F33 has

$ cat /proc/meminfo  | grep lock
Mlocked:            4860 kB

So, refusing to hibernate whenever there is mlocked memory would essentially 
prohibit any modern Linux distro from ever hibernating.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread
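The experiment described above is easy to reproduce with a small test program.
A minimal sketch, assuming an unprivileged run with a sufficient
RLIMIT_MEMLOCK (or CAP_IPC_LOCK): lock all mappings, touch a buffer so it is
resident, then sleep so hibernation can be attempted while the memory stays
locked:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	static char secret[1 << 20];	/* stand-in for sensitive data */

	/* Lock all current and future mappings. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
		perror("mlockall");
		return EXIT_FAILURE;
	}

	/* Touch the buffer so it is resident and accounted as Mlocked. */
	memset(secret, 0xa5, sizeof(secret));

	printf("memory locked, pid %d; try hibernating now\n", (int)getpid());
	pause();	/* keep the locked memory around until interrupted */
	return 0;
}

While it runs, "grep Mlocked /proc/meminfo" should show the locked pages;
whether hibernation is then refused is exactly the question being discussed
here.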

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 11:26                   ` David Hildenbrand
@ 2021-02-08 12:17                     ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 12:17 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Mon 08-02-21 12:26:31, David Hildenbrand wrote:
[...]
> My F33 system happily hibernates to disk, even with an application that
> succeeded in doing an mlockall().
> 
> And it somewhat makes sense. Even my freshly-booted, idle F33 has
> 
> $ cat /proc/meminfo  | grep lock
> Mlocked:            4860 kB
> 
> So, refusing to hibernate whenever there is mlocked memory would essentially
> prohibit any modern Linux distro from ever hibernating.

My system seems to be completely fine without mlocked memory. It would
be interesting to see who mlocks memory on your system and check whether
the expected mlock semantics really work for those. This should be
documented at least.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread
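One way to see who mlocks memory on a given system is to walk /proc and print
every task whose status file reports a non-zero VmLck value. A rough sketch
with deliberately simple parsing:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;

	if (!proc) {
		perror("opendir /proc");
		return EXIT_FAILURE;
	}

	while ((de = readdir(proc)) != NULL) {
		char path[280], line[256], comm[64] = "?";
		long vmlck = -1;
		FILE *f;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;	/* not a PID directory */

		snprintf(path, sizeof(path), "/proc/%s/status", de->d_name);
		f = fopen(path, "r");
		if (!f)
			continue;	/* task may have exited meanwhile */

		while (fgets(line, sizeof(line), f)) {
			sscanf(line, "Name: %63s", comm);
			sscanf(line, "VmLck: %ld", &vmlck);
		}
		fclose(f);

		if (vmlck > 0)
			printf("%5s  %-16s VmLck: %ld kB\n", de->d_name, comm, vmlck);
	}
	closedir(proc);
	return 0;
}

(Kernel threads have no mm, so they never report a VmLck line and are simply
skipped.)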

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 12:17                     ` Michal Hocko
  (?)
  (?)
@ 2021-02-08 13:34                       ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-08 13:34 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Ander sen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt, Rafael J. Wysocki

Btw. I do not see Rafael involved. Maybe he can add some insight to
this. Please note that the patch in question is
http://lkml.kernel.org/r/20210208084920.2884-9-rppt@kernel.org and
the full series is http://lkml.kernel.org/r/20210208084920.2884-1-rppt@kernel.org

On Mon 08-02-21 13:17:26, Michal Hocko wrote:
> On Mon 08-02-21 12:26:31, David Hildenbrand wrote:
> [...]
> > My F33 system happily hibernates to disk, even with an application that
> > succeeded in din doing an mlockall().
> > 
> > And it somewhat makes sense. Even my freshly-booted, idle F33 has
> > 
> > $ cat /proc/meminfo  | grep lock
> > Mlocked:            4860 kB
> > 
> > So, stopping to hibernate with mlocked memory would essentially prohibit any
> > modern Linux distro to hibernate ever.
> 
> My system seems to be completely fine without mlocked memory. It would
> be interesting to see who mlocks memory on your system and check whether
> the expectated mlock semantic really works for those. This should be
> documented at least.
> -- 
> Michal Hocko
> SUSE Labs

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 12:17                     ` Michal Hocko
@ 2021-02-08 13:40                       ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 13:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On 08.02.21 13:17, Michal Hocko wrote:
> On Mon 08-02-21 12:26:31, David Hildenbrand wrote:
> [...]
>> My F33 system happily hibernates to disk, even with an application that
>> succeeded in doing an mlockall().
>>
>> And it somewhat makes sense. Even my freshly-booted, idle F33 has
>>
>> $ cat /proc/meminfo  | grep lock
>> Mlocked:            4860 kB
>>
>> So, refusing to hibernate when there is mlocked memory would essentially
>> prevent any modern Linux distro from ever hibernating.
> 
> My system seems to be completely fine without mlocked memory. It would
> be interesting to see who mlocks memory on your system and check whether
> the expected mlock semantics really work for them. This should be
> documented at least.

I checked some other installations (Ubuntu, RHEL), and they also show no
sign of mlocked memory. My notebook (F33) and desktop (F33) both have mlocked
memory; it is either related to F33 or caused by some software (e.g., Kerberos).

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08  9:27   ` David Hildenbrand
@ 2021-02-08 21:13     ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08 21:13 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, Elena Reshetova, H. Peter Anvin, Ingo Molnar,
	James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Michal Hocko, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> On 08.02.21 09:49, Mike Rapoport wrote:
> 
> Some questions (and a request to document the answers), as we now allow
> unmovable allocations all over the place and I don't see a single comment
> regarding that in the cover letter:
> 
> 1. How will the issue of plenty of unmovable allocations for user space be
> tackled in the future?
> 
> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> and CMA, alloc_contig_range()/alloc_contig_pages()?

Secretmem sets the mapping's gfp mask to GFP_HIGHUSER, so it does not
allocate movable pages in the first place.
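
As a rough illustration of what that means in code (the helper name here is
illustrative, not necessarily the literal patch), the secretmem inode's
page-cache mapping gets a gfp mask without __GFP_MOVABLE:

#include <linux/fs.h>
#include <linux/pagemap.h>

/*
 * Illustrative helper, called when the secretmem inode is set up:
 * GFP_HIGHUSER does not include __GFP_MOVABLE, so pages backing this
 * mapping are never allocated from ZONE_MOVABLE or CMA pageblocks.
 */
static void secretmem_restrict_mapping(struct inode *inode)
{
	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
}
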
 
> 3. What are the plans to support migration in the future and which interface
> changes will be required? (Michal mentioned some good points to make this
> configurable via the interface, we should plan ahead and document)

The only interface change required is the addition of a bit value to the
syscall flags; I really think it can be documented along with the addition
of migration or any other such feature.

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08 10:49     ` Michal Hocko
@ 2021-02-08 21:26       ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08 21:26 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > Introduce "memfd_secret" system call with the ability to create memory
> > areas visible only in the context of the owning process and not mapped not
> > only to other processes but in the kernel page tables as well.
> > 
> > The secretmem feature is off by default and the user must explicitly enable
> > it at boot time.
> > 
> > Once secretmem is enabled, the user will be able to create a file
> > descriptor using the memfd_secret() system call. The memory areas created
> > by mmap() calls from this file descriptor will be unmapped from the kernel
> > direct map and they will be only mapped in the page table of the owning mm.
> 
> Is this really true? I guess you meant to say that the memory will be
> visible only via page tables to anybody who can mmap the respective file
> descriptor. There is nothing like an owning mm as the fd is inherently a
> shareable resource and the ownership becomes a very vague and hard to
> define term.

Hmm, it seems I've been dragging this paragraph from the very first
mmap(MAP_EXCLUSIVE) rfc and nobody (including myself) noticed the
inconsistency.
 
> > The file descriptor based memory has several advantages over the
> > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > paves the way for VMMs to remove the secret memory range from the process;
> 
> I do not understand how it helps to remove the memory from the process
> as the interface explicitly allows adding memory that is removed from
> all other processes via the direct map.

The current implementation does not help to remove the memory from the
process, but using fd-backed memory seems a better interface to remove
guest memory from host mappings than mmap. As Andy nicely put it:

"Getting fd-backed memory into a guest will take some possibly major work in
the kernel, but getting vma-backed memory into a guest without mapping it
in the host user address space seems much, much worse."
 
> > As the secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > usage of a dedicated system call rather than hooking new functionality into
> > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > allows better upwards compatibility.
> 
> What is this supposed to mean? What are the differences?

Well, the phrasing could indeed be better. That was supposed to mean that
they differ in the semantics behind the file descriptor: memfd_create
implements sealing for shmem and hugetlbfs, while memfd_secret implements
memory hidden from the kernel.
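
For reference, the intended usage from userspace looks roughly like the
sketch below (assuming a kernel with this series, secretmem enabled on the
kernel command line, and __NR_memfd_secret defined in the syscall headers;
flags and error handling are simplified):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	/* flags semantics as defined by this series; 0 keeps the defaults */
	int fd = syscall(__NR_memfd_secret, 0);
	char *p;

	if (fd < 0) {
		perror("memfd_secret");	/* e.g. secretmem not enabled at boot */
		return 1;
	}
	if (ftruncate(fd, 4096) < 0) {
		perror("ftruncate");
		return 1;
	}
	/* pages are removed from the kernel direct map once faulted in */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "not visible in the direct map");
	munmap(p, 4096);
	close(fd);
	return 0;
}
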
 
> > The secretmem mappings are locked in memory so they cannot exceed
> > RLIMIT_MEMLOCK. Since these mappings are already locked, an attempt to
> > mlock() a secretmem range would fail and mlockall() will ignore secretmem
> > mappings.
> 
> What about munlock?

Isn't this implied? ;-)
I'll add a sentence about it.
 
-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 08/10] PM: hibernate: disable when there are active secretmem users
  2021-02-08 10:18     ` Michal Hocko
@ 2021-02-08 21:28       ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-08 21:28 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Mon, Feb 08, 2021 at 11:18:37AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 10:49:18, Mike Rapoport wrote:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> > 
> > It is unsafe to allow saving of secretmem areas to the hibernation
> > snapshot as they would be visible after resume and this would essentially
> > defeat the purpose of secret memory mappings.
> > 
> > Prevent hibernation whenever there are active secret memory users.
> 
> Does this feature need any special handling? As it is effectively
> unevictable memory, it should behave the same as other unevictable memory
> such as mlock or ramfs, which should already disable hibernation as it
> cannot be swapped out, no?

As David already said, hibernation does not care about mlocked memory, so
this feature requires special handling.
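
Roughly, that special handling boils down to something like the sketch
below (names are illustrative, not necessarily the literal patch): keep a
count of live secretmem files and refuse to enter hibernation while it is
non-zero.

#include <linux/atomic.h>
#include <linux/errno.h>
#include <linux/printk.h>
#include <linux/types.h>

/* incremented when a secretmem file is created, decremented on release */
static atomic_t secretmem_users = ATOMIC_INIT(0);

bool secretmem_active(void)
{
	return !!atomic_read(&secretmem_users);
}

/* called from the hibernation entry path before any snapshot is taken */
static int hibernation_check_secretmem(void)
{
	if (secretmem_active()) {
		pr_err("Cannot hibernate while secret memory is in use\n");
		return -EBUSY;
	}
	return 0;
}
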

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08 21:13     ` Mike Rapoport
@ 2021-02-08 21:38       ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-08 21:38 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: David Hildenbrand, Andrew Morton, Alexander Viro,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Michal Hocko, Mike Rapoport,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86


> Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> 
> On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
>> On 08.02.21 09:49, Mike Rapoport wrote:
>> 
>> Some questions (and a request to document the answers), as we now allow
>> unmovable allocations all over the place and I don't see a single comment
>> regarding that in the cover letter:
>> 
>> 1. How will the issue of plenty of unmovable allocations for user space be
>> tackled in the future?
>> 
>> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
>> and CMA, alloc_contig_range()/alloc_contig_pages()?
> 
> Secretmem sets the mapping's gfp mask to GFP_HIGHUSER, so it does not
> allocate movable pages in the first place.

That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE memory and behaves like a long-term pinning in that sense. This is a real issue when using a lot of secretmem.

Please have a look at what Pavel documents regarding long term pinnings and ZONE_MOVABLE in his patches currently on the list.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08 21:26       ` Mike Rapoport
@ 2021-02-09  8:47         ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09  8:47 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Andrew Morton, Alexander Viro, Andy Lutomirski, Arnd Bergmann,
	Borislav Petkov, Catalin Marinas, Christopher Lameter,
	Dave Hansen, David Hildenbrand, Elena Reshetova, H. Peter Anvin,
	Ingo Molnar, James Bottomley, Kirill A. Shutemov, Matthew Wilcox,
	Mark Rutland, Mike Rapoport, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
[...]
> > > The file descriptor based memory has several advantages over the
> > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > paves the way for VMMs to remove the secret memory range from the process;
> > 
> > I do not understand how it helps to remove the memory from the process
> > as the interface explicitly allows to add a memory that is removed from
> > all other processes via direct map.
> 
> The current implementation does not help to remove the memory from the
> process, but using fd-backed memory seems a better interface to remove
> guest memory from host mappings than mmap. As Andy nicely put it:
> 
> "Getting fd-backed memory into a guest will take some possibly major work in
> the kernel, but getting vma-backed memory into a guest without mapping it
> in the host user address space seems much, much worse."

OK, so IIUC this means that the model is to hand over memory from host
to guest. I thought the guest would be in control of its address
space and would therefore operate on the VMAs. This would benefit from
an additional and more specific clarification.

> > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > usage of a dedicated system call rather than hooking new functionality into
> > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > allows better upwards compatibility.
> > 
> > What is this supposed to mean? What are differences?
> 
> Well, the phrasing could be better indeed. That was supposed to mean that
> they differ in the semantics behind the file descriptor: memfd_create
> implements sealing for shmem and hugetlbfs while memfd_secret implements
> memory hidden from the kernel.

Right, but why is the memfd_create model not sufficient for the use case?
Please note that I am not arguing against it. To be honest I do not really care
much. Using an existing scheme is usually preferable from my POV but
there might be real reasons why shmem as a backing "storage" is not
appropriate.
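
For context, the proposed interface as seen from userspace; a minimal sketch,
where the syscall number is an assumption (it is whatever the series assigns
per architecture):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* assumed value; check the series for the real number */
#endif

int main(void)
{
	int fd = syscall(__NR_memfd_secret, 0);	/* flags; none used here */
	char *p;

	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}
	if (ftruncate(fd, 4096) < 0) {		/* size the area like a memfd */
		perror("ftruncate");
		return 1;
	}
	/* pages are removed from the direct map as they are faulted in */
	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "key material");
	munmap(p, 4096);
	close(fd);
	return 0;
}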
  
> > > The secretmem mappings are locked in memory so they cannot exceed
> > > RLIMIT_MEMLOCK. Since these mappings are already locked an attempt to
> > > mlock() secretmem range would fail and mlockall() will ignore secretmem
> > > mappings.
> > 
> > What about munlock?
> 
> Isn't this implied? ;-)

My bad here. I thought that munlock fails on vmas which are not mlocked
and I was curious about the behavior when mlockall() is followed by
munlock. But I do not see this being the case. So this should be ok.
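
To make the accounting side concrete: secretmem mappings are charged against
RLIMIT_MEMLOCK like mlocked memory, so a caller would typically check (or raise)
that limit before mapping. A sketch of that check (behaviour as described in the
cover letter, not verified here):

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) == 0)
		printf("RLIMIT_MEMLOCK: %llu bytes available for locked/secretmem mappings\n",
		       (unsigned long long)rl.rlim_cur);
	return 0;
}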

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-08 21:38       ` David Hildenbrand
  (?)
  (?)
@ 2021-02-09  8:59         ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09  8:59 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
> 
> > Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> > 
> > On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> >> On 08.02.21 09:49, Mike Rapoport wrote:
> >> 
> >> Some questions (and request to document the answers) as we now allow to have
> >> unmovable allocations all over the place and I don't see a single comment
> >> regarding that in the cover letter:
> >> 
> >> 1. How will the issue of plenty of unmovable allocations for user space be
> >> tackled in the future?
> >> 
> >> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> >> and CMA, alloc_contig_range()/alloc_contig_pages()?
> > 
> > Secretmem sets the mapping's gfp mask to GFP_HIGHUSER, so it does not
> > allocate movable pages in the first place.
> 
> That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
> memory and behaves like long-term pinnings in that sense. This is a
> real issue when using a lot of secretmem.

A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
As I've said, it is quite easy to land in a similar situation even with
tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is
really uncommon. It would be even worse if those were allowed to
consume both CMA and ZONE_MOVABLE.

One has to be very careful when relying on CMA or movable zones. This is
definitely worth a comment in the kernel command line parameter
documentation. But this is not a new problem.
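
To illustrate the tmpfs/MAP_ANON|MAP_SHARED point, the mapping below is
shmem-backed; on a swapless system its pages cannot be reclaimed once
populated, with no new syscall involved (a sketch):

#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 512UL << 20;	/* 512 MiB of shmem-backed memory */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;
	/* once populated there is nowhere to evict these pages to without swap */
	memset(p, 0xff, len);
	return 0;
}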

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09  8:47         ` Michal Hocko
  (?)
  (?)
@ 2021-02-09  9:09           ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-09  9:09 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, David Hildenbrand,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> > On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> [...]
> > > > The file descriptor based memory has several advantages over the
> > > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > > paves the way for VMMs to remove the secret memory range from the process;
> > > 
> > > I do not understand how it helps to remove the memory from the process
> > > as the interface explicitly allows to add a memory that is removed from
> > > all other processes via direct map.
> > 
> > The current implementation does not help to remove the memory from the
> > process, but using fd-backed memory seems a better interface to remove
> > guest memory from host mappings than mmap. As Andy nicely put it:
> > 
> > "Getting fd-backed memory into a guest will take some possibly major work in
> > the kernel, but getting vma-backed memory into a guest without mapping it
> > in the host user address space seems much, much worse."
> 
> OK, so IIUC this means that the model is to hand over memory from host
> to guest. I thought the guest would be in control of its address
> space and would therefore operate on the VMAs. This would benefit from
> an additional and more specific clarification.

How would the guest operate on VMAs if the interface between host and guest is
virtual hardware?

If you mean qemu (or any other userspace part of a VMM that uses KVM), then one
of the points Andy mentioned back then was to remove mappings of the guest
memory from the qemu process.
 
> > > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > > usage of a dedicated system call rather than hooking new functionality into
> > > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > > allows better upwards compatibility.
> > > 
> > > What is this supposed to mean? What are differences?
> > 
> > Well, the phrasing could be better indeed. That was supposed to mean that
> > they differ in the semantics behind the file descriptor: memfd_create
> > implements sealing for shmem and hugetlbfs while memfd_secret implements
> > memory hidden from the kernel.
> 
> Right, but why is the memfd_create model not sufficient for the use case?
> Please note that I am not arguing against it. To be honest I do not really care
> much. Using an existing scheme is usually preferable from my POV but
> there might be real reasons why shmem as a backing "storage" is not
> appropriate.
   
Citing my older email:

    I've hesitated whether to continue to use new flags to memfd_create() or to
    add a new system call, and I've decided to use a new system call after I
    started to look into updating the man pages. There would have been two
    completely independent descriptions and I think it would have been very confusing.
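
The rejected alternative would have looked roughly like the sketch below;
MFD_SECRET is purely hypothetical and was never part of the series, which went
with a dedicated memfd_secret(2) instead so that memfd_create(2)'s man page
would not have to describe two unrelated behaviours:

#define _GNU_SOURCE
#include <sys/mman.h>	/* memfd_create() */

#define MFD_SECRET 0x0008U	/* hypothetical flag, illustration only */

static int secret_fd_via_memfd_create(void)
{
	/* would fail with EINVAL on a real kernel; the flag does not exist */
	return memfd_create("secret", MFD_SECRET);
}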

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09  8:59         ` Michal Hocko
  (?)
  (?)
@ 2021-02-09  9:15           ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-09  9:15 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 09.02.21 09:59, Michal Hocko wrote:
> On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
>>
>>> Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
>>>
>>> On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
>>>> On 08.02.21 09:49, Mike Rapoport wrote:
>>>>
>>>> Some questions (and request to document the answers) as we now allow to have
>>>> unmovable allocations all over the place and I don't see a single comment
>>>> regarding that in the cover letter:
>>>>
>>>> 1. How will the issue of plenty of unmovable allocations for user space be
>>>> tackled in the future?
>>>>
>>>> 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
>>>> and CMA, alloc_contig_range()/alloc_contig_pages()?
>>>
>>> Secretmem sets the mapping's gfp mask to GFP_HIGHUSER, so it does not
>>> allocate movable pages in the first place.
>>
>> That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
>> memory and behaves like long-term pinnings in that sense. This is a
>> real issue when using a lot of secretmem.
> 
> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
> As I've said, it is quite easy to land in a similar situation even with
> tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is
> really uncommon. It would be even worse if those were allowed to
> consume both CMA and ZONE_MOVABLE.

IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
a) Is movable, can land in ZONE_MOVABLE/CMA
b) Can be limited by sizing tmpfs appropriately

AFAIK, what you describe is a problem with memory overcommit, not with 
zone imbalances (below). Or what am I missing?

> 
> One has to be very careful when relying on CMA or movable zones. This is
> definitely worth a comment in the kernel command line parameter
> documentation. But this is not a new problem.

I see the following thing worth documenting:

Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of 
ZONE_MOVABLE/CMA.

Assume you make use of 1.5GB of secretmem. Your system might run into 
OOM any time although you still have plenty of memory in ZONE_MOVABLE 
(and even swap!), simply because you are making excessive use of 
unmovable allocations (for user space!) in an environment where you 
should not make excessive use of unmovable allocations (e.g., where 
should page tables go?).

The existing controls (mlock limit) don't really match the current 
semantics of that memory. I repeat it once again: secretmem *currently* 
resembles long-term pinned memory, not mlocked memory. Things will 
change when implementing migration support for secretmem pages. Until 
then, the semantics are different and this should be spelled out.

For long-term pinnings this is kind of obvious, still we're now 
documenting it because it's dangerous to not be aware of. Secretmem 
behaves exactly the same and I think this is worth spelling out: 
secretmem has the potential of being used much more often than fairly 
special vfio/rdma/ ...

Looking at a cover letter that doesn't even mention the issue of 
unmovable allocations makes me think that we are either trying to ignore 
the problem or are not aware of the problem.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09  9:15           ` David Hildenbrand
@ 2021-02-09  9:53             ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09  9:53 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On Tue 09-02-21 10:15:17, David Hildenbrand wrote:
> On 09.02.21 09:59, Michal Hocko wrote:
> > On Mon 08-02-21 22:38:03, David Hildenbrand wrote:
> > > 
> > > > Am 08.02.2021 um 22:13 schrieb Mike Rapoport <rppt@kernel.org>:
> > > > 
> > > > On Mon, Feb 08, 2021 at 10:27:18AM +0100, David Hildenbrand wrote:
> > > > > On 08.02.21 09:49, Mike Rapoport wrote:
> > > > > 
> > > > > Some questions (and request to document the answers) as we now allow to have
> > > > > unmovable allocations all over the place and I don't see a single comment
> > > > > regarding that in the cover letter:
> > > > > 
> > > > > 1. How will the issue of plenty of unmovable allocations for user space be
> > > > > tackled in the future?
> > > > > 
> > > > > 2. How has this issue been documented? E.g., interaction with ZONE_MOVABLE
> > > > > and CMA, alloc_contig_range()/alloc_contig_pages()?
> > > > 
> > > > Secretmem sets the mappings gfp mask to GFP_HIGHUSER, so it does not
> > > > allocate movable pages in the first place.
> > > 
> > > That is not the point. Secretmem cannot go on CMA / ZONE_MOVABLE
> > > memory and behaves like long-term pinnings in that sense. This is a
> > > real issue when using a lot of secretmem.
> > 
> > A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
> > As I've said it is quite easy to land at the similar situation even with
> > tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
> > really uncommon. It would be even worse that those would be allowed to
> > consume both CMA/ZONE_MOVABLE.
> 
> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
> a) Is movable, can land in ZONE_MOVABLE/CMA
> b) Can be limited by sizing tmpfs appropriately
> 
> AFAIK, what you describe is a problem with memory overcommit, not with zone
> imbalances (below). Or what am I missing?

It can be a problem for both. If you have just too much of shm (do not
forget about MAP_SHARED|MAP_ANON which is much harder to size from an
admin POV) then migrateability doesn't really help because you need
free memory to migrate. Without reclaimability this can easily become a
problem. That is why I am saying this is not really a new problem.
Swapless systems are not all that uncommon.
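
To make the sizing point concrete, a rough userspace sketch (sizes
arbitrary): a tmpfs mount can be capped with the size= mount option, but an
equivalent shared anonymous mapping has no such knob and, on a swapless
system, stays resident once it has been touched:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1UL << 30;	/* 1 GiB, arbitrary */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* fault everything in; without swap these pages cannot be reclaimed */
	memset(p, 1, len);
	printf("populated %zu bytes of shared anonymous memory\n", len);
	return 0;
}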
 
> > One has to be very careful when relying on CMA or movable zones. This is
> > definitely worth a comment in the kernel command line parameter
> > documentation. But this is not a new problem.
> 
> I see the following thing worth documenting:
> 
> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
> ZONE_MOVABLE/CMA.
> 
> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
> any time although you still have plenty of memory on ZONE_MOVABLE (and even
> swap!), simply because you are making excessive use of unmovable allocations
> (for user space!) in an environment where you should not make excessive use
> of unmovable allocations (e.g., where should page tables go?).

yes, you are right of course and I am not really disputing this. But I
would argue that 2:1 Movable/Normal is something to expect problems
already. "Lowmem" allocations can easily trigger OOM even without secret
mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the
room and one has to be really careful when relying on them.
 
> The existing controls (mlock limit) don't really match the current semantics
> of that memory. I repeat it once again: secretmem *currently* resembles
> long-term pinned memory, not mlocked memory.

Well, if we had a proper user space pinning accounting then I would
agree that there is a better model to use. But we don't. And previous
attempts to achieve that have failed. So I am afraid that we do not have
much choice left other than using mlock as a model.
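
The existing control in question is the per-process RLIMIT_MEMLOCK; as a
minimal sketch (this only shows the knob, not how the series charges
against it), userspace can query it like this:

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0) {
		perror("getrlimit");
		return 1;
	}
	printf("RLIMIT_MEMLOCK: soft=%llu hard=%llu bytes\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}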

> Things will change when
> implementing migration support for secretmem pages. Until then, the
> semantics are different and this should be spelled out.
> 
> For long-term pinnings this is kind of obvious, still we're now documenting
> it because it's dangerous to not be aware of. Secretmem behaves exactly the
> same and I think this is worth spelling out: secretmem has the potential of
> being used much more often than fairly special vfio/rdma/ ...

yeah I do agree that pinning is a problem for movable/CMA but most
people simply do not care about those. Movable is the thing for hotplug 
and a really weird fragmentation avoidance IIRC and CMA is mostly to
handle crap HW. If those are to be used along with secret mem or
longterm GUP then they will constantly bump into corner cases. Do not
take me wrong, we should be looking at those problems, we should even
document them but I do not see this as anything new. We should probably
have a central place in Documentation explaining all those problems. I
would be even happy to see an explicit note in the tunables - e.g.
configuring movable/normal in 2:1 will get you back to 32b times wrt.
low mem problems.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09  9:53             ` Michal Hocko
@ 2021-02-09 10:23               ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-09 10:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

>>> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
>>> As I've said it is quite easy to land at the similar situation even with
>>> tmpfs/MAP_ANON|MAP_SHARED on swapless system. Neither of the two is
>>> really uncommon. It would be even worse that those would be allowed to
>>> consume both CMA/ZONE_MOVABLE.
>>
>> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
>> a) Is movable, can land in ZONE_MOVABLE/CMA
>> b) Can be limited by sizing tmpfs appropriately
>>
>> AFAIK, what you describe is a problem with memory overcommit, not with zone
>> imbalances (below). Or what am I missing?
> 
> It can be a problem for both. If you have just too much of shm (do not
> forget about MAP_SHARED|MAP_ANON which is much harder to size from an
> admin POV) then migrateability doesn't really help because you need
> free memory to migrate. Without reclaimability this can easily become a
> problem. That is why I am saying this is not really a new problem.
> Swapless systems are not all that uncommon.

I get your point, it's similar but still different. "no memory in the 
system" vs. "plenty of unusable free memory available in the system".

In many setups, memory for user space applications can go to 
ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user 
space memory (e.g., page tables) and other kernel stuff.

Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an 
application (database) that allocates 4GB of memory. Works just fine. 
The zone ratio ends up being a problem for example with many processes 
(-> many page tables).

Not being able to put user space memory into the movable zone is a 
special case. And we are introducing yet another special case here 
(besides vfio, rdma, unmigratable huge pages like gigantic pages).

With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be 
a big lie about how your system behaves.
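
As a quick illustration of the Total vs. Free point, a sketch that just
reads the two counters; nothing in them says whether the free pages sit in
a zone that unmovable allocations can actually use:

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256];
	FILE *f = fopen("/proc/meminfo", "r");

	if (!f) {
		perror("/proc/meminfo");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		/* MemFree can look healthy while every free page is in ZONE_MOVABLE/CMA */
		if (!strncmp(line, "MemTotal:", 9) || !strncmp(line, "MemFree:", 8))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}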

>   
>>> One has to be very careful when relying on CMA or movable zones. This is
>>> definitely worth a comment in the kernel command line parameter
>>> documentation. But this is not a new problem.
>>
>> I see the following thing worth documenting:
>>
>> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
>> ZONE_MOVABLE/CMA.
>>
>> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
>> any time although you still have plenty of memory on ZONE_MOVABLE (and even
>> swap!), simply because you are making excessive use of unmovable allocations
>> (for user space!) in an environment where you should not make excessive use
>> of unmovable allocations (e.g., where should page tables go?).
> 
> yes, you are right of course and I am not really disputing this. But I
> would argue that 2:1 Movable/Normal is something to expect problems
> already. "Lowmem" allocations can easily trigger OOM even without secret
> mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
> even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the
> room and one has to be really careful when relying on them.

Right, it's all about what the setup actually needs. Sure, there are 
cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such 
that a 2:1 ratio is not feasible. But I claim that these are corner cases.

Secretmem gives user space the option to allocate a lot of 
GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each 
application on F33 can allocate 16 GiB (!) of secretmem.

Which other ways do you know where random user space can do something 
similar? I'd be curious what other scenarios there are where user space 
can easily allocate a lot of unmovable memory.
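
For illustration, a sketch of that allocation path from unprivileged
userspace (the syscall number below is an assumption, there is no libc
wrapper; the size is arbitrary and is what gets charged against the limit
mentioned above):

#include <stdio.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* assumed number, take the real one from the series/arch headers */
#endif

int main(void)
{
	size_t len = 64UL << 20;	/* 64 MiB, arbitrary */
	int fd = syscall(__NR_memfd_secret, 0);
	char *p;

	if (fd < 0) {
		perror("memfd_secret");
		return 1;
	}
	if (ftruncate(fd, len) != 0) {
		perror("ftruncate");
		return 1;
	}
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	/* pages faulted in here are dropped from the direct map and are unmovable */
	p[0] = 1;
	return 0;
}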

>   
>> The existing controls (mlock limit) don't really match the current semantics
>> of that memory. I repeat it once again: secretmem *currently* resembles
>> long-term pinned memory, not mlocked memory.
> 
> Well, if we had a proper user space pinning accounting then I would
> agree that there is a better model to use. But we don't. And previous
> attempts to achieve that have failed. So I am afraid that we do not have
> much choice left other than using mlock as a model.

Yes, I agree.

> 
>> Things will change when
>> implementing migration support for secretmem pages. Until then, the
>> semantics are different and this should be spelled out.
>>
>> For long-term pinnings this is kind of obvious, still we're now documenting
>> it because it's dangerous to not be aware of. Secretmem behaves exactly the
>> same and I think this is worth spelling out: secretmem has the potential of
>> being used much more often than fairly special vfio/rdma/ ...
> 
> yeah I do agree that pinning is a problem for movable/CMA but most
> people simply do not care about those. Movable is the thing for hotplug
> and a really weird fragmentation avoidance IIRC and CMA is mostly to

+ handling gigantic pages dynamically

> handle crap HW. If those are to be used along with secret mem or
> longterm GUP then they will constantly bump into corner cases. Do not
> take me wrong, we should be looking at those problems, we should even
> document them but I do not see this as anything new. We should probably
> have a central place in Documentation explaining all those problems. I

Exactly.

> would be even happy to see an explicit note in the tunables - e.g.
> configuring movable/normal in 2:1 will get you back to 32b times wrt.
> low mem problems.

In most setups, ratios of 1:1 up to 4:1 work reasonably well. Of course, 
it's not applicable to all setups (obviously).

For example, oVirt has been using ratios of 3:1 for a long time. (online 
all memory to ZONE_MOVABLE in the guest, never hotplug more than 3x boot 
memory size). Most distros end up onlining all hotplugged memory on bare 
metal to ZONE_MOVABLE, and I've seen basically no bug reports related to 
that.

Highmem was a little different, yet similar. RHEL provided the bigmem 
kernel with ratios of 60:4, which worked in many setups. The main 
difference to highmem was that e.g., pagetables could be placed onto it. 
So ratios like 18:1 are completely insane with ZONE_MOVABLE.

I am constantly trying to fight for making more stuff MOVABLE instead of 
going into the other direction (e.g., because it's easier to implement, 
which feels like the wrong direction).

Maybe I am the only person that really cares about ZONE_MOVABLE these 
days :) I can't stop such new stuff from popping up, so at least I want 
it to be documented.

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09 10:23               ` David Hildenbrand
@ 2021-02-09 10:30                 ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-09 10:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 09.02.21 11:23, David Hildenbrand wrote:
>>>> A lot of unevictable memory is a concern regardless of CMA/ZONE_MOVABLE.
>>>> As I've said, it is quite easy to land in a similar situation even with
>>>> tmpfs/MAP_ANON|MAP_SHARED on a swapless system. Neither of the two is
>>>> really uncommon. It would be even worse if those were allowed to
>>>> consume both CMA and ZONE_MOVABLE.
>>>
>>> IIRC, tmpfs/MAP_ANON|MAP_SHARED memory
>>> a) Is movable, can land in ZONE_MOVABLE/CMA
>>> b) Can be limited by sizing tmpfs appropriately
>>>
>>> AFAIK, what you describe is a problem with memory overcommit, not with zone
>>> imbalances (below). Or what am I missing?
>>
>> It can be a problem for both. If you just have too much shm (do not
>> forget about MAP_SHARED|MAP_ANON, which is much harder to size from an
>> admin POV) then migratability doesn't really help, because you need
>> free memory to migrate into. Without reclaimability this can easily become
>> a problem. That is why I am saying this is not really a new problem.
>> Swapless systems are not all that uncommon.
> 
> I get your point, it's similar but still different. "no memory in the
> system" vs. "plenty of unusable free memory available in the system".
> 
> In many setups, memory for user space applications can go to
> ZONE_MOVABLE just fine. ZONE_NORMAL etc. can be used for supporting user
> space memory (e.g., page tables) and other kernel stuff.
> 
> Like, have 4GB of ZONE_MOVABLE with 2GB of ZONE_NORMAL. Have an
> application (database) that allocates 4GB of memory. Works just fine.
> The zone ratio ends up being a problem for example with many processes
> (-> many page tables).
> 
> Not being able to put user space memory into the movable zone is a
> special case. And we are introducing yet another special case here
> (besides vfio, rdma, unmigratable huge pages like gigantic pages).
> 
> With plenty of secretmem, looking at /proc/meminfo Total vs. Free can be
> a big lie about how your system behaves.
> 
>>    
>>>> One has to be very careful when relying on CMA or movable zones. This is
>>>> definitely worth a comment in the kernel command line parameter
>>>> documentation. But this is not a new problem.
>>>
>>> I see the following thing worth documenting:
>>>
>>> Assume you have a system with 2GB of ZONE_NORMAL/ZONE_DMA and 4GB of
>>> ZONE_MOVABLE/CMA.
>>>
>>> Assume you make use of 1.5GB of secretmem. Your system might run into OOM
>>> any time although you still have plenty of memory on ZONE_MOVABLE (and even
>>> swap!), simply because you are making excessive use of unmovable allocations
>>> (for user space!) in an environment where you should not make excessive use
>>> of unmovable allocations (e.g., where should page tables go?).
>>
>> yes, you are right of course and I am not really disputing this. But I
>> would argue that a 2:1 Movable/Normal ratio is already a point where problems
>> are to be expected. "Lowmem" allocations can easily trigger OOM even without
>> secret mem in the picture. All it takes is allocating a lot of GFP_KERNEL or
>> even GFP_{HIGH}USER. Really, it is CMA/MOVABLE that are the elephant in the
>> room, and one has to be really careful when relying on them.
> 
> Right, it's all about what the setup actually needs. Sure, there are
> cases where you need significantly more GFP_KERNEL/GFP_{HIGH}USER such
> that a 2:1 ratio is not feasible. But I claim that these are corner cases.
> 
> Secretmem gives user space the option to allocate a lot of
> GFP_{HIGH}USER memory. If I am not wrong, "ulimit -a" tells me that each
> application on F33 can allocate 16 GiB (!) of secretmem.

Got to learn to do my math. It's 16 MiB - so as a default it's less 
dangerous than I thought!
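
For context, that figure is simply the default RLIMIT_MEMLOCK on such a
system, which is the limit the series currently accounts secretmem against.
A minimal, purely illustrative sketch of how an application could query it:

/*
 * Illustrative sketch: query RLIMIT_MEMLOCK, the same limit that
 * "ulimit -l" reports and that secretmem allocations are charged to
 * in this series.
 */
#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_MEMLOCK, &rl))
		return 1;
	printf("RLIMIT_MEMLOCK: soft %llu bytes, hard %llu bytes\n",
	       (unsigned long long)rl.rlim_cur,
	       (unsigned long long)rl.rlim_max);
	return 0;
}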

-- 
Thanks,

David / dhildenb

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09  9:09           ` Mike Rapoport
@ 2021-02-09 13:17             ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09 13:17 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, David Hildenbrand,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> > On Mon 08-02-21 23:26:05, Mike Rapoport wrote:
> > > On Mon, Feb 08, 2021 at 11:49:22AM +0100, Michal Hocko wrote:
> > > > On Mon 08-02-21 10:49:17, Mike Rapoport wrote:
> > [...]
> > > > > The file descriptor based memory has several advantages over the
> > > > > "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). It
> > > > > paves the way for VMMs to remove the secret memory range from the process;
> > > > 
> > > > I do not understand how it helps to remove the memory from the process,
> > > > as the interface explicitly allows adding memory that is removed from
> > > > all other processes via the direct map.
> > > 
> > > The current implementation does not help to remove the memory from the
> > > process, but using fd-backed memory seems a better interface to remove
> > > guest memory from host mappings than mmap. As Andy nicely put it:
> > > 
> > > "Getting fd-backed memory into a guest will take some possibly major work in
> > > the kernel, but getting vma-backed memory into a guest without mapping it
> > > in the host user address space seems much, much worse."
> > 
> > OK, so IIUC this means that the model is to hand over memory from host
> > to guest. I thought the guest would be in control of its address
> > space and would therefore operate on the VMAs. This would benefit from
> > an additional and more specific clarification.
> 
> How would the guest operate on VMAs if the interface between host and guest
> is virtual hardware?

I have to say that I am not really familiar with this area so my view
might be misleading or completely wrong. I thought that the HW address
ranges are mapped to the guest process and therefore have a VMA.

> If you mean qemu (or any other userspace part of a VMM that uses KVM), then
> one of the points Andy mentioned back then is to remove mappings of the guest
> memory from the qemu process.
>  
> > > > > As secret memory implementation is not an extension of tmpfs or hugetlbfs,
> > > > > usage of a dedicated system call rather than hooking new functionality into
> > > > > memfd_create(2) emphasises that memfd_secret(2) has different semantics and
> > > > > allows better upwards compatibility.
> > > > 
> > > > What is this supposed to mean? What are the differences?
> > > 
> > > Well, the phrasing could indeed be better. That was supposed to mean that
> > > they differ in the semantics behind the file descriptor: memfd_create
> > > implements sealing for shmem and hugetlbfs, while memfd_secret implements
> > > memory hidden from the kernel.
> > 
> > Right, but why is the memfd_create model not sufficient for the use case?
> > Please note that I am not arguing against it. To be honest I do not really care
> > much. Using an existing scheme is usually preferable from my POV but
> > there might be real reasons why shmem as a backing "storage" is not
> > appropriate.
>    
> Citing my older email:
> 
>     I hesitated whether to continue using new flags to memfd_create() or to
>     add a new system call, and I decided on a new system call after I
>     started to look into the man page updates. There would have been two
>     completely independent descriptions and I think it would have been very
>     confusing.

Could you elaborate? Unmapping from the kernel address space can work
for both sealed and hugetlb memfds, no? Those features are completely
orthogonal AFAICS. With a dedicated syscall you will need to introduce
this functionality on top if it is required. Have you considered that?
I mean, hugetlb pages are used to back guest memory very often. Is this
something that will be a secret memory use case?

Please be really specific when giving arguments to back a new syscall
decision.
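
For reference, the userspace call pattern proposed by the series looks
roughly like the sketch below. This is only illustrative: the syscall number
is a placeholder (use whatever your headers define once the series lands),
the flags value is left at 0, and error handling is trimmed.

/*
 * Rough sketch of the proposed usage model: create a secretmem fd,
 * size it with ftruncate(), and mmap() it.  Pages backing the mapping
 * are dropped from the kernel direct map when faulted in.
 */
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* placeholder; check your uapi headers */
#endif

int main(void)
{
	const size_t len = 4096;
	int fd = syscall(__NR_memfd_secret, 0);
	char *p;

	if (fd < 0 || ftruncate(fd, len) < 0)
		return 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;

	memcpy(p, "secret key material", 20);	/* use it like normal memory */

	munmap(p, len);
	close(fd);
	return 0;
}
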
-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09 10:23               ` David Hildenbrand
@ 2021-02-09 13:25                 ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09 13:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
[...]
> I am constantly trying to fight for making more stuff MOVABLE instead of
> going into the other direction (e.g., because it's easier to implement,
> which feels like the wrong direction).
> 
> Maybe I am the only person that really cares about ZONE_MOVABLE these days
> :) I can't stop such new stuff from popping up, so at least I want it to be
> documented.

The MOVABLE zone is certainly an important thing to keep working, and there
is still quite a lot of work on the way. But as I've said, this is more
of an outlier than the norm. On the other hand, the movable zone is a rather
hard requirement for a lot of applications, and it is to be expected that
many features will be less than 100% compatible with it; some use cases will
even be impossible. That's why I am arguing that we should have a central
document where the movable zone is documented with all the potential
problems we have encountered over time, and which explicitly states which
features are fully/partially incompatible.

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09 13:25                 ` Michal Hocko
@ 2021-02-09 16:17                   ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-09 16:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On 09.02.21 14:25, Michal Hocko wrote:
> On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
> [...]
>> I am constantly trying to fight for making more stuff MOVABLE instead of
>> going into the other direction (e.g., because it's easier to implement,
>> which feels like the wrong direction).
>>
>> Maybe I am the only person that really cares about ZONE_MOVABLE these days
>> :) I can't stop such new stuff from popping up, so at least I want it to be
>> documented.
> 
> The MOVABLE zone is certainly an important thing to keep working, and there
> is still quite a lot of work on the way. But as I've said, this is more
> of an outlier than the norm. On the other hand, the movable zone is a rather
> hard requirement for a lot of applications, and it is to be expected that
> many features will be less than 100% compatible with it; some use cases will
> even be impossible. That's why I am arguing that we should have a central
> document where the movable zone is documented with all the potential
> problems we have encountered over time, and which explicitly states which
> features are fully/partially incompatible.
> 

I'll send a mail during the next few weeks to gather the current restrictions
and document them (and include my brain dump). We might see more excessive
use of ZONE_MOVABLE in the future and, as history has told us, of CMA as
well. We really should start documenting/caring.

@Mike, it would be sufficient for me if one of your patches at least
mentioned the situation in the description, along the lines of:

"Please note that secretmem currently behaves much more like long-term 
GUP instead of mlocked memory; secretmem is unmovable memory directly 
consumed/controlled by user space. secretmem cannot be placed onto 
ZONE_MOVABLE/CMA.

As long as there is no excessive use of secretmem (e.g., maximum of 16 
MiB for selected processes) in combination with ZONE_MOVABLE/CMA, this 
is barely a real issue. However, it is something to keep in mind when a 
significant amount of system RAM might be used for secretmem. In the 
future, we might support migration of secretmem and make it look much 
more like mlocked memory instead."

Just a suggestion.

-- 
Thanks,

David / dhildenb

* Re: [PATCH v17 00/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09 16:17                   ` David Hildenbrand
  (?)
  (?)
@ 2021-02-09 20:08                     ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-09 20:08 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Mike Rapoport, Michael Kerrisk,
	Palmer Dabbelt, Paul Walmsley, Peter Zijlstra, Rick Edgecombe,
	Roman Gushchin, Shakeel Butt, Shuah Khan, Thomas Gleixner,
	Tycho Andersen, Will Deacon, linux-api, linux-arch,
	linux-arm-kernel, linux-fsdevel, linux-mm, linux-kernel,
	linux-kselftest, linux-nvdimm, linux-riscv, x86

On Tue 09-02-21 17:17:22, David Hildenbrand wrote:
> On 09.02.21 14:25, Michal Hocko wrote:
> > On Tue 09-02-21 11:23:35, David Hildenbrand wrote:
> > [...]
> > > I am constantly trying to fight for making more stuff MOVABLE instead of
> > > going into the other direction (e.g., because it's easier to implement,
> > > which feels like the wrong direction).
> > > 
> > > Maybe I am the only person that really cares about ZONE_MOVABLE these days
> > > :) I can't stop such new stuff from popping up, so at least I want it to be
> > > documented.
> > 
> > MOVABLE zone is certainly an important thing to keep working. And there
> > is still quite a lot of work on the way. But as I've said this is more
> > of an outlier than the norm. On the other hand, the movable zone is kind
> > of a hard requirement for a lot of applications and it is to be expected
> > that many features will be less than 100% compatible. Some use cases are
> > even impossible. That's why I am arguing that we should have a central
> > document where the movable zone is documented with all the potential
> > problems we have encountered over time and explicitly state which
> > features are fully/partially incompatible.
> > 
> 
> I'll send a mail in the next few weeks to gather the current restrictions
> and document them (and include my brain dump). We might see more extensive
> use of ZONE_MOVABLE in the future and, as history has shown, of CMA as
> well. We really should start documenting/caring.

Excellent! Thanks a lot. I will do my best to help review it.

> @Mike, it would be sufficient for me if one of your patches at least
> mentions the situation in the description, along the lines of:
> 
> "Please note that secretmem currently behaves much more like long-term GUP
> instead of mlocked memory; secretmem is unmovable memory directly
> consumed/controlled by user space. secretmem cannot be placed onto
> ZONE_MOVABLE/CMA.

Sounds good to me.

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-09 13:17             ` Michal Hocko
  (?)
  (?)
@ 2021-02-11  7:13               ` Mike Rapoport
  -1 siblings, 0 replies; 293+ messages in thread
From: Mike Rapoport @ 2021-02-11  7:13 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, David Hildenbrand,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> > On Tue, Feb 09, 2021 at 09:47:08AM +0100, Michal Hocko wrote:
> > > 
> > > OK, so IIUC this means that the model is to hand over memory from host
> > > to guest. I thought the guest would be under control of its address
> > > space and therefore it operates on the VMAs. This would benefit from
> > > an additional and more specific clarification.
> > 
> > How guest would operate on VMAs if the interface between host and guest is
> > virtual hardware?
> 
> I have to say that I am not really familiar with this area so my view
> might be misleading or completely wrong. I thought that the HW address
> ranges are mapped to the guest process and therefore have a VMA.

There is a qemu process that has mappings of what the guest sees as its
physical memory, but qemu is part of the hypervisor, i.e. the host.
 
> > Citing my older email:
> > 
> >     I've hesitated whether to continue to use new flags to memfd_create() or to
> >     add a new system call and I've decided to use a new system call after I've
> >     started to look into man pages update. There would have been two completely
> >     independent descriptions and I think it would have been very confusing.
> 
> Could you elaborate? Unmapping from the kernel address space can work
> both for sealed or hugetlb memfds, no? Those features are completely
> orthogonal AFAICS. With a dedicated syscall you will need to introduce
> this functionality on top if that is required. Have you considered that?
> I mean hugetlb pages are used to back guest memory very often. Is this
> something that will be a secret memory usecase?
> 
> Please be really specific when giving arguments to back a new syscall
> decision.

Isn't "syscalls have completely independent description" specific enough?

We are talking about API here, not the implementation details whether
secretmem supports large pages or not.

The purpose of memfd_create() is to create a file-like access to memory.
The purpose of memfd_secret() is to create a way to access memory hidden
from the kernel.

I don't think overloading memfd_create() with the secretmem flags because
they happen to return a file descriptor will be better for users, but
rather will be more confusing.
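
For illustration, the intended usage from userspace would look roughly like
the sketch below. This is a sketch only: there is no libc wrapper at this
point, the syscall number is shown with the x86-64 value purely as a
placeholder, and the flags/ABI may still change.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* x86-64 placeholder; not final in this series */
#endif

int main(void)
{
	/* memfd_create(): file-like access to ordinary anonymous memory. */
	int plain = memfd_create("plain", MFD_CLOEXEC);

	/* memfd_secret(): the backing pages are dropped from the kernel
	 * direct map, so they are mapped only by the owning process. */
	int secret = syscall(__NR_memfd_secret, 0);
	if (secret < 0) {
		perror("memfd_secret");		/* kernel without secretmem */
		return 1;
	}

	ftruncate(secret, 4096);		/* set the size before mmap() */
	char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		       secret, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	strcpy(p, "key material");		/* not visible via the direct map */

	munmap(p, 4096);
	close(secret);
	close(plain);
	return 0;
}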

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-11  7:13               ` Mike Rapoport
  (?)
  (?)
@ 2021-02-11  8:39                 ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-11  8:39 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, David Hildenbrand,
	Elena Reshetova, H. Peter Anvin, Ingo Molnar, James Bottomley,
	Kirill A. Shutemov, Matthew Wilcox, Mark Rutland,
	Michael Kerrisk, Palmer Dabbelt, Paul Walmsley, Peter Zijlstra,
	Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
	Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api,
	linux-arch, linux-arm-kernel, linux-fsdevel, linux-mm,
	linux-kernel, linux-kselftest, linux-nvdimm, linux-riscv, x86,
	Hagen Paul Pfeifer, Palmer Dabbelt

On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
> > On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
[...]
> > > Citing my older email:
> > > 
> > >     I've hesitated whether to continue to use new flags to memfd_create() or to
> > >     add a new system call and I've decided to use a new system call after I've
> > >     started to look into man pages update. There would have been two completely
> > >     independent descriptions and I think it would have been very confusing.
> > 
> > Could you elaborate? Unmapping from the kernel address space can work
> > both for sealed or hugetlb memfds, no? Those features are completely
> > orthogonal AFAICS. With a dedicated syscall you will need to introduce
> > this functionality on top if that is required. Have you considered that?
> > I mean hugetlb pages are used to back guest memory very often. Is this
> > something that will be a secret memory usecase?
> > 
> > Please be really specific when giving arguments to back a new syscall
> > decision.
> 
> Isn't "syscalls have completely independent description" specific enough?

No, it's not as you can see from questions I've had above. More on that
below.

> We are talking about API here, not the implementation details whether
> secretmem supports large pages or not.
> 
> The purpose of memfd_create() is to create a file-like access to memory.
> The purpose of memfd_secret() is to create a way to access memory hidden
> from the kernel.
> 
> I don't think overloading memfd_create() with the secretmem flags because
> they happen to return a file descriptor will be better for users, but
> rather will be more confusing.

This is quite a subjective conclusion. I could very well argue that it
would be much better to have a single syscall to get fd-backed memory
with specific requirements (sealing, unmapping from the kernel address
space). Neither of us would be clearly right or wrong. A more important
point is future extensibility and usability, though. So let's just
think of the few usecases I have outlined above. Is it unrealistic to
expect that secret memory should be sealable? What about hugetlb? Because
if the answer is no, then a new API is a clear win, as the combination of
flags would never work and we would just suffer from the syscall
multiplexing without much gain. On the other hand, if a combination of the
functionality is to be expected, then you will have to jam it into
memfd_create() and copy the interface, likely causing more confusion. See
what I mean?

By no means do I insist one way or the other, but from what I have
seen so far I have a feeling that the interface hasn't been thought
through enough. Sure, you have landed on an fd-based approach and that
seems fair. But how to get that fd still seems to have some gaps IMHO.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-11  8:39                 ` Michal Hocko
  (?)
  (?)
@ 2021-02-11  9:01                   ` David Hildenbrand
  -1 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-11  9:01 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On 11.02.21 09:39, Michal Hocko wrote:
> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
>> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
>>> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> [...]
>>>> Citing my older email:
>>>>
>>>>      I've hesitated whether to continue to use new flags to memfd_create() or to
>>>>      add a new system call and I've decided to use a new system call after I've
>>>>      started to look into man pages update. There would have been two completely
>>>>      independent descriptions and I think it would have been very confusing.
>>>
>>> Could you elaborate? Unmapping from the kernel address space can work
>>> both for sealed or hugetlb memfds, no? Those features are completely
>>> orthogonal AFAICS. With a dedicated syscall you will need to introduce
>>> this functionality on top if that is required. Have you considered that?
>>> I mean hugetlb pages are used to back guest memory very often. Is this
>>> something that will be a secret memory usecase?
>>>
>>> Please be really specific when giving arguments to back a new syscall
>>> decision.
>>
>> Isn't "syscalls have completely independent description" specific enough?
> 
> No, it's not as you can see from questions I've had above. More on that
> below.
> 
>> We are talking about API here, not the implementation details whether
>> secretmem supports large pages or not.
>>
>> The purpose of memfd_create() is to create a file-like access to memory.
>> The purpose of memfd_secret() is to create a way to access memory hidden
>> from the kernel.
>>
>> I don't think overloading memfd_create() with the secretmem flags because
>> they happen to return a file descriptor will be better for users, but
>> rather will be more confusing.
> 
> This is quite a subjective conclusion. I could very well argue that it
> would be much better to have a single syscall to get fd-backed memory
> with specific requirements (sealing, unmapping from the kernel address
> space). Neither of us would be clearly right or wrong. A more important
> point is future extensibility and usability, though. So let's just
> think of the few usecases I have outlined above. Is it unrealistic to expect
> that secret memory should be sealable? What about hugetlb? Because if
> the answer is no, then a new API is a clear win, as the combination of
> flags would never work and we would just suffer from the syscall
> multiplexing without much gain. On the other hand, if a combination of the
> functionality is to be expected, then you will have to jam it into
> memfd_create() and copy the interface, likely causing more confusion. See
> what I mean?
> 
> By no means do I insist one way or the other, but from what I have
> seen so far I have a feeling that the interface hasn't been thought
> through enough. Sure, you have landed on an fd-based approach and that
> seems fair. But how to get that fd still seems to have some gaps IMHO.
> 

I agree with Michal. This has been raised by different
people already, including on LWN (https://lwn.net/Articles/835342/).

I can follow Mike's reasoning (man page), and I am also fine if there is
a valid reason. However, IMHO the basic description seems to match quite well:

        memfd_create() creates an anonymous file and returns a file descriptor that refers to it.  The
        file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
        However,  unlike a regular file, it lives in RAM and has a volatile backing storage.  Once all
        references to the file are dropped, it is automatically released.  Anonymous  memory  is  used
        for  all  backing pages of the file.  Therefore, files created by memfd_create() have the same
        semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
        MAP_ANONYMOUS flag.

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.

In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
temporary mappings (e.g., for migration). TBC.
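
In code, that direction would look roughly like the sketch below. Purely
illustrative: MFD_SECRET / MFD_SECRET_NEVER_MAP are made-up flag values
(the posted series adds a separate memfd_secret() syscall instead), and
current kernels reject unknown memfd_create() flags with EINVAL.

#define _GNU_SOURCE
#include <sys/mman.h>

/* Hypothetical flag values, for illustration only. */
#define MFD_SECRET		0x0010U
#define MFD_SECRET_NEVER_MAP	0x0020U

int make_secret_memfd(void)
{
	/* This combination would be allowed ... */
	int fd = memfd_create("secret", MFD_CLOEXEC | MFD_SECRET);

	/* ... while these would have to fail with -EINVAL:
	 *
	 *   memfd_create("secret", MFD_SECRET | MFD_ALLOW_SEALING);
	 *   memfd_create("secret", MFD_SECRET | MFD_HUGETLB);
	 */
	return fd;
}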

---

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fall back to 4k pages. So I assume huge pages in general?

What are the semantics of MADV()/FALLOCATE() etc. on such files?
I assume PUNCH_HOLE fails in a nice way? Does it work?

Does mremap()/mremap(FIXED) work/is it blocked?

Does mprotect() fail in a nice way?

Is userfaultfd() properly fenced? Or does it even work (doubt)?

How does it behave if I mmap(FIXED) something in between?
At what granularity can I do that (->page size?)?

What are other granularity restrictions (->page size)?

Don't want to open a big discussion here, just some random thoughts.
Maybe it has all been already figured out and most of the answers
above are "Fails with -EINVAL".
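
FWIW, most of these could be probed mechanically with something like the
sketch below (untested; it assumes the x86-64 syscall number purely as a
placeholder and simply reports whatever errno each operation returns):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef __NR_memfd_secret
#define __NR_memfd_secret 447	/* x86-64 placeholder */
#endif

static void report(const char *what, long ret)
{
	printf("%-24s -> %s\n", what, ret < 0 ? strerror(errno) : "ok");
}

int main(void)
{
	int fd = syscall(__NR_memfd_secret, 0);
	if (fd < 0 || ftruncate(fd, 2 * 4096) < 0) {
		perror("memfd_secret");
		return 1;
	}
	char *p = mmap(NULL, 2 * 4096, PROT_READ | PROT_WRITE, MAP_SHARED,
		       fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}
	p[0] = 1;	/* fault in the first page */

	report("fallocate(PUNCH_HOLE)",
	       fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 0, 4096));
	report("mprotect(PROT_READ)", mprotect(p, 4096, PROT_READ));
	void *q = mremap(p, 2 * 4096, 4 * 4096, MREMAP_MAYMOVE);
	report("mremap(grow)", q == MAP_FAILED ? -1 : 0);
	return 0;
}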

-- 
Thanks,

David / dhildenb

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-11  9:01                   ` David Hildenbrand
  0 siblings, 0 replies; 293+ messages in thread
From: David Hildenbrand @ 2021-02-11  9:01 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport
  Cc: Mike Rapoport, Andrew Morton, Alexander Viro, Andy Lutomirski,
	Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dan Williams, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On 11.02.21 09:39, Michal Hocko wrote:
> On Thu 11-02-21 09:13:19, Mike Rapoport wrote:
>> On Tue, Feb 09, 2021 at 02:17:11PM +0100, Michal Hocko wrote:
>>> On Tue 09-02-21 11:09:38, Mike Rapoport wrote:
> [...]
>>>> Citing my older email:
>>>>
>>>>      I've hesitated whether to continue to use new flags to memfd_create() or to
>>>>      add a new system call and I've decided to use a new system call after I've
>>>>      started to look into man pages update. There would have been two completely
>>>>      independent descriptions and I think it would have been very confusing.
>>>
>>> Could you elaborate? Unmapping from the kernel address space can work
>>> both for sealed or hugetlb memfds, no? Those features are completely
>>> orthogonal AFAICS. With a dedicated syscall you will need to introduce
>>> this functionality on top if that is required. Have you considered that?
>>> I mean hugetlb pages are used to back guest memory very often. Is this
>>> something that will be a secret memory usecase?
>>>
>>> Please be really specific when giving arguments to back a new syscall
>>> decision.
>>
>> Isn't "syscalls have completely independent description" specific enough?
> 
> No, it's not as you can see from questions I've had above. More on that
> below.
> 
>> We are talking about API here, not the implementation details whether
>> secretmem supports large pages or not.
>>
>> The purpose of memfd_create() is to create a file-like access to memory.
>> The purpose of memfd_secret() is to create a way to access memory hidden
>> from the kernel.
>>
>> I don't think overloading memfd_create() with the secretmem flags because
>> they happen to return a file descriptor will be better for users, but
>> rather will be more confusing.
> 
> This is quite a subjective conclusion. I could very well argue that it
> would be much better to have a single syscall to get a fd backed memory
> with spedific requirements (sealing, unmapping from the kernel address
> space). Neither of us would be clearly right or wrong. A more important
> point is a future extensibility and usability, though. So let's just
> think of few usecases I have outlined above. Is it unrealistic to expect
> that secret memory should be sealable? What about hugetlb? Because if
> the answer is no then a new API is a clear win as the combination of
> flags would never work and then we would just suffer from the syscall
> multiplexing without much gain. On the other hand if combination of the
> functionality is to be expected then you will have to jam it into
> memfd_create and copy the interface likely causing more confusion. See
> what I mean?
> 
> I by no means do not insist one way or the other but from what I have
> seen so far I have a feeling that the interface hasn't been thought
> through enough. Sure you have landed with fd based approach and that
> seems fair. But how to get that fd seems to still have some gaps IMHO.
> 

I agree with Michal. This has been raised by different
people already, including on LWN (https://lwn.net/Articles/835342/).

I can follow Mike's reasoning (man page), and I am also fine if there is
a valid reason. However, IMHO the basic description seems to match quite good:

        memfd_create() creates an anonymous file and returns a file descriptor that refers to it.  The
        file behaves like a regular file, and so can be modified, truncated, memory-mapped, and so on.
        However,  unlike a regular file, it lives in RAM and has a volatile backing storage.  Once all
        references to the file are dropped, it is automatically released.  Anonymous  memory  is  used
        for  all  backing pages of the file.  Therefore, files created by memfd_create() have the same
        semantics as other anonymous memory allocations such as those allocated using mmap(2) with the
        MAP_ANONYMOUS flag.

AFAIKS, we would need MFD_SECRET and disallow
MFD_ALLOW_SEALING and MFD_HUGETLB.

In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
temporary mappings (eor migration). TBC.

---

Some random thoughts regarding files.

What is the page size of secretmem memory? Sometimes we use huge pages,
sometimes we fallback to 4k pages. So I assume huge pages in general?

What are semantics of MADV()/FALLOCATE() etc on such files?
I assume PUNCH_HOLE fails in a nice way? does it work?

Does mremap()/mremap(FIXED) work/is it blocked?

Does mprotect() fail in a nice way?

Is userfaultfd() properly fenced? Or does it even work (doubt)?

How does it behave if I mmap(FIXED) something in between?
In which granularity can I do that (->page-size?)?

What are other granularity restrictions (->page size)?

Don't want to open a big discussion here, just some random thoughts.
Maybe it has all been already figured out and most of the answers
above are "Fails with -EINVAL".

-- 
Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
  2021-02-11  9:01                   ` David Hildenbrand
  (?)
  (?)
@ 2021-02-11  9:38                     ` Michal Hocko
  -1 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-11  9:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mike Rapoport, Mike Rapoport, Andrew Morton, Alexander Viro,
	Andy Lutomirski, Arnd Bergmann, Borislav Petkov, Catalin Marinas,
	Christopher Lameter, Dave Hansen, Elena Reshetova,
	H. Peter Anvin, Ingo Molnar, James Bottomley, Kirill A. Shutemov,
	Matthew Wilcox, Mark Rutland, Michael Kerrisk, Palmer Dabbelt,
	Paul Walmsley, Peter Zijlstra, Rick Edgecombe, Roman Gushchin,
	Shakeel Butt, Shuah Khan, Thomas Gleixner, Tycho Andersen,
	Will Deacon, linux-api, linux-arch, linux-arm-kernel,
	linux-fsdevel, linux-mm, linux-kernel, linux-kselftest,
	linux-nvdimm, linux-riscv, x86, Hagen Paul Pfeifer,
	Palmer Dabbelt

On Thu 11-02-21 10:01:32, David Hildenbrand wrote:
[...]
> AFAIKS, we would need MFD_SECRET and disallow
> MFD_ALLOW_SEALING and MFD_HUGETLB.

Yes for an initial version. But I do expect that a request to support both
features is just a matter of time.

> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
> temporary mappings (e.g., for migration). TBC.

I believe this is the mode Mike wants to have by default. A more relaxed
one would be an opt-in, e.g. MFD_SECRET_RELAXED, which would allow temporary
mappings in the kernel for content copying (e.g. for migration).
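
To illustrate, a sketch; both the flag name and its value are invented
here, only the split between a strict default and a relaxed opt-in is
the point.

#define _GNU_SOURCE
#include <sys/syscall.h>
#include <unistd.h>

/* Invented flag name and value, just to illustrate the proposed split. */
#define SECRETMEM_RELAXED 0x1

static int open_secretmem(int relaxed)
{
        /*
         * Default (flags 0): the strict mode, never mapped by the kernel.
         * Opt-in: temporary kernel mappings for content copying (e.g.
         * migration) would be allowed.
         */
        return syscall(__NR_memfd_secret, relaxed ? SECRETMEM_RELAXED : 0);
}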

> ---
> 
> Some random thoughts regarding files.
> 
> What is the page size of secretmem memory? Sometimes we use huge pages,
> sometimes we fall back to 4k pages. So I assume huge pages in general?

Unless there is an explicit request for hugetlb I would say the page
size is not really important here, just like for any other fd. Huge pages
can be used transparently.
 
> What are the semantics of MADV()/FALLOCATE() etc. on such files?

I would expect the same semantics as regular shmem (memfd_create), except
that the memory doesn't have _any_ backing storage, which makes it
unevictable. So the reclaim-related madvise calls won't work, but there
shouldn't be any real reason why e.g. MADV_DONTNEED, WILLNEED, DONT_FORK and
others wouldn't work.
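
A sketch of what that would look like from userspace; the helper name and
the assumption that addr/len come from an mmap() of a secretmem fd are
mine. Which of these calls actually succeed is exactly the open question.

#define _GNU_SOURCE
#include <sys/mman.h>

/* 'addr'/'len' are assumed to come from mmap() of a secretmem fd. */
static void secretmem_madvise(void *addr, size_t len)
{
        /*
         * Expected to behave like on a shared shmem mapping: the page
         * tables are zapped and the pages are faulted back in from the
         * file on the next access.
         */
        madvise(addr, len, MADV_DONTNEED);

        /*
         * Reclaim-oriented advice is pointless here: the memory has no
         * backing storage and is unevictable, so at best this is a no-op.
         */
        madvise(addr, len, MADV_PAGEOUT);
}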

> I assume PUNCH_HOLE fails in a nice way? Does it work?
> Does mremap()/mremap(FIXED) work/is it blocked?
> Does mprotect() fail in a nice way?

I do not see a reason why those shouldn't work.
 
> Is userfaultfd() properly fenced? Or does it even work (doubt)?
> 
> How does it behave if I mmap(FIXED) something in between?
> In which granularity can I do that (->page-size?)?

Again, nothing really exceptional here. This is a mapping like any
other from an address-space manipulation POV.
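
In other words, something like the sketch below should just split the VMA
at page granularity, the same as for any other mapping. The helper and its
preconditions are assumptions for illustration only.

#define _GNU_SOURCE
#include <sys/mman.h>

/*
 * 'base' is assumed to be a page-aligned secretmem mapping of at least
 * three pages; 'pagesz' is the system page size.
 */
static void punch_anon_page(char *base, size_t pagesz)
{
        /*
         * Replace the middle page with plain anonymous memory; the
         * secretmem VMA is expected to be split around it like any other
         * mapping, at page granularity.
         */
        mmap(base + pagesz, pagesz, PROT_READ | PROT_WRITE,
             MAP_FIXED | MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}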

> What are other granularity restrictions (->page size)?
> 
> Don't want to open a big discussion here, just some random thoughts.
> Maybe it has all been already figured out and most of the answers
> above are "Fails with -EINVAL".

I think that the behavior should really be in sync with the shmem semantics
as much as possible. Most operations should simply work, with an
additional direct map manipulation. There is no real reason to be
special. Some functionality might be missing, e.g. hugetlb support, but
that has traditionally been added on top of the shmem interface, so nothing
really new here.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 293+ messages in thread

* Re: [PATCH v17 07/10] mm: introduce memfd_secret system call to create "secret" memory areas
@ 2021-02-11  9:38                     ` Michal Hocko
  0 siblings, 0 replies; 293+ messages in thread
From: Michal Hocko @ 2021-02-11  9:38 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Mark Rutland, Peter Zijlstra, Catalin Marinas, Dave Hansen,
	linux-mm, linux-kselftest, H. Peter Anvin, Christopher Lameter,
	Shuah Khan, Thomas Gleixner, Elena Reshetova, linux-arch,
	Tycho Andersen, linux-nvdimm, Will Deacon, x86, Matthew Wilcox,
	Mike Rapoport, Ingo Molnar, Michael Kerrisk, Palmer Dabbelt,
	Arnd Bergmann, James Bottomley, Hagen Paul Pfeifer,
	Borislav Petkov, Alexander Viro, Andy Lutomirski, Paul Walmsley,
	Kirill A. Shutemov, Dan Williams, linux-arm-kernel, linux-api,
	linux-kernel, linux-riscv, Palmer Dabbelt, linux-fsdevel,
	Shakeel Butt, Andrew Morton, Rick Edgecombe, Roman Gushchin,
	Mike Rapoport

On Thu 11-02-21 10:01:32, David Hildenbrand wrote:
[...]
> AFAIKS, we would need MFD_SECRET and disallow
> MFD_ALLOW_SEALING and MFD_HUGETLB.

Yes for an initial version. But I do expect that a request to support both
features is just a matter of time.

> In addition, we could add MFD_SECRET_NEVER_MAP, which could disallow any kind of
> temporary mappings (e.g., for migration). TBC.

I believe this is the mode Mike wants to have by default. A more relaxed
one would be an opt-in, e.g. MFD_SECRET_RELAXED, which would allow temporary
mappings in the kernel for content copying (e.g. for migration).

> ---
> 
> Some random thoughts regarding files.
> 
> What is th