* [PATCH 0/6] File Sealing & memfd_create()
@ 2014-03-19 19:06 ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

Hi

This series introduces the concept of "file sealing". Sealing a file restricts
the set of allowed operations on the file in question. Multiple seals are
defined and each seal will cause a different set of operations to return EPERM
if it is set. The following seals are introduced:

 * SEAL_SHRINK: If set, the inode size cannot be reduced
 * SEAL_GROW: If set, the inode size cannot be increased
 * SEAL_WRITE: If set, the file content cannot be modified

Unlike existing techniques that provide similar protection, sealing allows
file-sharing without any trust-relationship. This is enforced by rejecting seal
modifications if you don't own an exclusive reference to the given file. So if
you own a file-descriptor, you can be sure that no-one besides you can modify
the seals on the given file. This allows mapping shared files from untrusted
parties without the fear of the file getting truncated or modified by an
attacker.
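
To make the seal semantics concrete, here is a minimal userspace sketch. It is
an illustration under stated assumptions, not code from this series: it uses
the interface as it was later merged into mainline, where the seals are named
F_SEAL_SHRINK, F_SEAL_GROW and F_SEAL_WRITE, are set with fcntl(F_ADD_SEALS),
and memfd_create() must be given MFD_ALLOW_SEALING. Those names differ from the
SEAL_* and SHMEM_*_SEALS names proposed here. The helper seal_demo() is
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for memfd_create() and the F_SEAL_* constants */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a sealable memfd of `size` bytes, set
 * F_SEAL_SHRINK on it and report what a later shrink attempt returns.
 * Returns the errno of the rejected ftruncate(), 0 if the shrink
 * unexpectedly succeeded, or -1 on setup failure.
 */
int seal_demo(off_t size)
{
	int fd, ret = -1;

	fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, size) < 0)
		goto out;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		goto out;

	/* Growing is still allowed: only the shrink seal is set. */
	if (ftruncate(fd, size * 2) < 0)
		goto out;

	/* Shrinking back is the operation the seal is meant to refuse. */
	if (ftruncate(fd, size) < 0)
		ret = errno;
	else
		ret = 0;
out:
	close(fd);
	return ret;
}
```

On a kernel with the merged interface, seal_demo(4096) is expected to return
EPERM, matching the "return EPERM if it is set" rule described above.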

Several use-cases exist that could make great use of sealing:

  1) Graphics Compositors
     If a graphics client creates a memory-backed render-buffer and passes a
     file-descriptor for it to the graphics server for display, the server
     _has_ to set up SIGBUS handlers whenever mapping the given file. Otherwise,
     the client might run ftruncate() or O_TRUNC on the file in parallel,
     thus crashing the server.
     With sealing, a compositor can reject any incoming file-descriptor that
     does _not_ have SEAL_SHRINK set. This way, any memory-mappings are
     guaranteed to stay accessible. Furthermore, we still allow clients to
     increase the buffer-size in case they want to resize the render-buffer for
     the next frame. We also allow parallel writes so the client can render new
     frames into the same buffer (the client is responsible for never rendering
     into a front-buffer if it wants to avoid artifacts).

     Real use-case: Wayland wl_shm buffers can be transparently converted

  2) General-purpose IPC
     IPC mechanisms that do not require a mutual trust-relationship (like dbus)
     cannot do zero-copy so far. With sealing, zero-copy can be easily done by
     sharing a file-descriptor that has SEAL_SHRINK | SEAL_GROW | SEAL_WRITE
     set. This way, the source can store sensible data in the file, seal the
     file and then pass it to the destination. The destination verifies these
     seals are set and then can parse the message in-line.
     Note that these files are usually one-shot files. Without any
     trust-relationship, a destination can notify the source that it released a
     file again, but a source can never rely on it. So unless the destination
     releases the file, a source cannot clear the seals for modification again.
     However, this is inherent to situations without any trust-relationship.

     Real use-case: kdbus messages already use a similar interface and can be
                    transparently converted to use these seals
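
The receiver side of both use-cases boils down to "check the seals, then map".
A minimal sketch of that check follows. It again assumes the names of the
interface as later merged into mainline (F_GET_SEALS, F_SEAL_*), not the
SHMEM_GET_SEALS command proposed here, and the helper map_if_sealed() is
hypothetical. A private read-only mapping is used because it is sufficient for
a consumer that only parses the data.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for the F_GET_SEALS / F_SEAL_* constants */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical receiver-side check: map `fd` read-only only if every seal
 * in `required` is set. Returns the mapping and stores its length in *len,
 * or returns NULL with errno set (EPERM if a required seal is missing).
 */
void *map_if_sealed(int fd, int required, size_t *len)
{
	struct stat st;
	void *p;
	int seals;

	seals = fcntl(fd, F_GET_SEALS);
	if (seals < 0)
		return NULL;		/* file does not support sealing */
	if ((seals & required) != required) {
		errno = EPERM;		/* sender did not seal: refuse to map */
		return NULL;
	}

	if (fstat(fd, &st) < 0)
		return NULL;
	if (st.st_size == 0) {
		errno = EINVAL;		/* nothing to map */
		return NULL;
	}

	p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED)
		return NULL;

	*len = (size_t)st.st_size;
	return p;
}
```

A compositor would pass only the shrink seal as `required`; a zero-copy IPC
receiver would pass all three seals and could then parse the mapping in place.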

Other similar use-cases exist (e.g., audio), but these are the two I am
personally working on. Several other camps have expressed interest in this
interface and I've put the respective maintainers in CC. If more information on
these use-cases is needed, I think they can give some insights.

The API introduced by this patchset is:

 * fcntl() extension:
   Two new fcntl() commands are added that allow retrieving (SHMEM_GET_SEALS)
   and setting (SHMEM_SET_SEALS) seals on a file. Only shmfs implements them so
   far and there is no intention to implement them on other file-systems.
   All shmfs based files support sealing.

   Patch 2/6

 * memfd_create() syscall:
   The new memfd_create() syscall is a public frontend to the shmem_file_new()
   interface in the kernel. It avoids the need for a local shmfs mount-point (as
   requested by Android people) and acts more like MAP_ANON than O_TMPFILE.

   Patch 3/6
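
As a rough illustration of "more like MAP_ANON than O_TMPFILE", the sketch
below creates a memory-backed file without touching any mount-point or path.
It assumes the memfd_create(name, flags) signature and the MFD_CLOEXEC flag of
the syscall as later merged and exposed by glibc, which may not match this
proposal in detail. The helper memfd_from_string() is hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for memfd_create() */
#endif
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous shmem-backed file holding a
 * copy of the non-empty string `msg` and return its fd, or -1 on error.
 * The name passed to memfd_create() is only a debugging label.
 */
int memfd_from_string(const char *msg)
{
	size_t len = strlen(msg);
	char *map;
	int fd;

	fd = memfd_create("example", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	/* Like any file, the object has a size that must be set first. */
	if (ftruncate(fd, (off_t)len) < 0)
		goto err;

	/* Fill it through a shared mapping, as one would anonymous memory. */
	map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		goto err;
	memcpy(map, msg, len);
	munmap(map, len);

	return fd;
err:
	close(fd);
	return -1;
}
```

The returned file-descriptor can be read with pread(), mapped again, or passed
to another process over a UNIX socket like any other fd.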

The other 4 patches are cleanups, self-tests and docs.

The commit-messages explain the API extensions in detail. Man-page proposals
are also provided. Last but not least, the extensive self-tests document the
intended behavior, in case it is still not clear.

Technically, sealing and memfd_create() are independent, but the described
use-cases would greatly benefit from the combination of both. Hence, I merged
them into the same series. Please also note that this series is based on earlier
work (ashmem, memfd, shmgetfd, ...) and unifies these attempts.

Comments welcome!

Thanks
David

David Herrmann (4):
  fs: fix i_writecount on shmem and friends
  shm: add sealing API
  shm: add memfd_create() syscall
  selftests: add memfd_create() + sealing tests

David Herrmann (2): (man-pages)
  fcntl.2: document SHMEM_SET/GET_SEALS commands
  memfd_create.2: add memfd_create() man-page

 arch/x86/syscalls/syscall_32.tbl           |   1 +
 arch/x86/syscalls/syscall_64.tbl           |   1 +
 fs/fcntl.c                                 |  12 +-
 fs/file_table.c                            |  27 +-
 include/linux/shmem_fs.h                   |  17 +
 include/linux/syscalls.h                   |   1 +
 include/uapi/linux/fcntl.h                 |  13 +
 include/uapi/linux/memfd.h                 |   9 +
 kernel/sys_ni.c                            |   1 +
 mm/shmem.c                                 | 267 +++++++-
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 972 +++++++++++++++++++++++++++++
 14 files changed, 1338 insertions(+), 15 deletions(-)
 create mode 100644 include/uapi/linux/memfd.h
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

-- 
1.9.0





* [PATCH 1/6] fs: fix i_writecount on shmem and friends
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

VM_DENYWRITE currently relies on i_writecount. As long as there is an active
writable reference to an inode, VM_DENYWRITE is not allowed.
Unfortunately, alloc_file() does not increase i_writecount and therefore
does not prevent a subsequent VM_DENYWRITE even though the new file might
have been opened with FMODE_WRITE. However, callers of alloc_file() expect
the file object to be fully instantiated so they can call fput() on it. We
could now either fix all callers to call get_write_access() if opened
with FMODE_WRITE, or simply fix alloc_file() to do that. I chose the
latter.

Note that this bug allows some rather subtle misbehavior. The following
sequence of calls should work just fine, but currently fails:
    int p[2], orig, ro, rw;
    char buf[128];

    pipe(p);
    sprintf(buf, "/proc/self/fd/%d", p[1]);
    ro = open(buf, O_RDONLY);
    close(p[1]);
    sprintf(buf, "/proc/self/fd/%d", ro);
    rw = open(buf, O_RDWR);

The final open() cannot succeed as close(p[1]) caused an integer underflow
on i_writecount, effectively causing VM_DENYWRITE on the inode. The open
will fail with -ETXTBSY.
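
The counter arithmetic behind this can be replayed with a small userspace
model. This is a sketch, not kernel code: the model_* names are hypothetical,
and the model only assumes what is described above, namely that taking write
access fails while the counter is negative and that dropping a writable file
decrements the counter unconditionally.

```c
#include <errno.h>

/* Userspace stand-in for inode->i_writecount; not kernel code. */
struct model_inode {
	int writecount;	/* > 0: active writers, < 0: writes are denied */
};

/* Models taking write access: refused while writes are denied. */
int model_get_write_access(struct model_inode *inode)
{
	if (inode->writecount < 0)
		return -ETXTBSY;
	inode->writecount++;
	return 0;
}

/* Models dropping write access: an unconditional decrement. */
void model_put_write_access(struct model_inode *inode)
{
	inode->writecount--;
}

/*
 * Replays the pipe() sequence on the model. `alloc_takes_ref` selects
 * whether allocating the writable file takes a write reference (the
 * behavior this patch proposes) or not (the current behavior).
 * Returns the result of the final O_RDWR open.
 */
int model_pipe_sequence(int alloc_takes_ref)
{
	struct model_inode inode = { 0 };

	/* pipe(): the write end is allocated with FMODE_WRITE. */
	if (alloc_takes_ref && model_get_write_access(&inode) < 0)
		return -ETXTBSY;

	/* open(O_RDONLY): no write reference is involved. */

	/* close(p[1]): releasing a writable file drops a write reference. */
	model_put_write_access(&inode);

	/* open(O_RDWR): needs a write reference. */
	return model_get_write_access(&inode);
}
```

In this model, model_pipe_sequence(0) leaves the counter at -1 after the
close, so the final step returns -ETXTBSY, while model_pipe_sequence(1) keeps
the counter balanced and returns 0.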

It's a rather odd sequence of calls and given that open() doesn't use
alloc_file() (and is thus not affected by this bug), it's rather unlikely
that this is a serious issue. But stuff like anon_inode shares a *single*
inode across a huge set of interfaces. If any of these is broken like
pipe(), it will affect all of them (ranging from dma-buf to epoll).

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
Hi

This patch is only included for reference. It was submitted to fs-devel
separately and is being worked on. However, this bug must be fixed in order to
make use of memfd_create(), so I decided to include it here.

David

 fs/file_table.c | 27 ++++++++++++++++++---------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/file_table.c b/fs/file_table.c
index 5b24008..8059d68 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -168,6 +168,7 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 		const struct file_operations *fop)
 {
 	struct file *file;
+	int error;
 
 	file = get_empty_filp();
 	if (IS_ERR(file))
@@ -179,15 +180,23 @@ struct file *alloc_file(struct path *path, fmode_t mode,
 	file->f_mode = mode;
 	file->f_op = fop;
 
-	/*
-	 * These mounts don't really matter in practice
-	 * for r/o bind mounts.  They aren't userspace-
-	 * visible.  We do this for consistency, and so
-	 * that we can do debugging checks at __fput()
-	 */
-	if ((mode & FMODE_WRITE) && !special_file(path->dentry->d_inode->i_mode)) {
-		file_take_write(file);
-		WARN_ON(mnt_clone_write(path->mnt));
+	if (mode & FMODE_WRITE) {
+		error = get_write_access(path->dentry->d_inode);
+		if (error) {
+			put_filp(file);
+			return ERR_PTR(error);
+		}
+
+		/*
+		 * These mounts don't really matter in practice
+		 * for r/o bind mounts.  They aren't userspace-
+		 * visible.  We do this for consistency, and so
+		 * that we can do debugging checks at __fput()
+		 */
+		if (!special_file(path->dentry->d_inode->i_mode)) {
+			file_take_write(file);
+			WARN_ON(mnt_clone_write(path->mnt));
+		}
 	}
 	if ((mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
 		i_readcount_inc(path->dentry->d_inode);
-- 
1.9.0



* [PATCH 2/6] shm: add sealing API
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (e.g.,
for general-purpose IPC), sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy anymore. The
     source may have put sensitive data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked until this seal is removed again.
Unlike locks, seals can only be modified if you own an exclusive reference
to the file. Hence, as long as you hold a reference to a file, you can be
sure that no one besides you can modify the seals (and you can only modify
them if you are the exclusive holder). This makes sealing useful in
situations where no trust-relationship is given.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This currently affects only ftruncate().
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
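
The effect of the three seals on resizing and writing can be sketched as
two small predicates. This is a userspace illustration only: the function
names are made up, the seal values are the ones proposed in this patch,
and the conditions mirror the checks this patch adds to shmem_setattr()
and shmem_write_begin().

```c
#include <stdbool.h>
#include <stdint.h>

/* Seal bits as proposed by this patch (include/uapi/linux/fcntl.h). */
#define SHMEM_SEAL_SHRINK	0x0001
#define SHMEM_SEAL_GROW		0x0002
#define SHMEM_SEAL_WRITE	0x0004

/* Hypothetical helper: may a file of oldsize be resized to newsize? */
static bool seal_allows_resize(uint32_t seals, int64_t oldsize,
			       int64_t newsize)
{
	if (newsize < oldsize && (seals & SHMEM_SEAL_SHRINK))
		return false;
	if (newsize > oldsize && (seals & SHMEM_SEAL_GROW))
		return false;
	return true;
}

/* Hypothetical helper: may len bytes be written at pos into the file? */
static bool seal_allows_write(uint32_t seals, int64_t size, int64_t pos,
			      int64_t len)
{
	/* SEAL_WRITE blocks every write */
	if (seals & SHMEM_SEAL_WRITE)
		return false;
	/* SEAL_GROW blocks only writes that extend past the current size */
	if ((seals & SHMEM_SEAL_GROW) && pos + len > size)
		return false;
	return true;
}
```

Note that with only SEAL_GROW set, writes inside the current file size
are still allowed; only SEAL_WRITE blocks them.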

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) Both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE
     are set. This guarantees that neither process can modify the data
     while the other side parses it. Furthermore, it guarantees that even
     with writable FDs passed to the peer, it cannot increase the size to
     hit memory-limits of the source process (in case the file-storage is
     accounted to the source).
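
On the receiving side, such a check could look roughly like the
following sketch. It assumes a kernel carrying this patch; the command
and seal values are copied from the patch and are not part of any
released header, and seals_cover() and fd_has_seals() are hypothetical
helper names.

```c
#include <fcntl.h>
#include <stdbool.h>

/* Values as proposed by this patch; assumed, not from a released header. */
#ifndef F_LINUX_SPECIFIC_BASE
#define F_LINUX_SPECIFIC_BASE	1024
#endif
#define SHMEM_GET_SEALS		(F_LINUX_SPECIFIC_BASE + 10)

#define SHMEM_SEAL_SHRINK	0x0001
#define SHMEM_SEAL_GROW		0x0002
#define SHMEM_SEAL_WRITE	0x0004

/* True if every bit in required is present in seals. */
static bool seals_cover(unsigned int seals, unsigned int required)
{
	return (seals & required) == required;
}

/*
 * Query the seals of fd and check them against required. Any error
 * (bad FD, file without sealing support, ...) is treated as "not
 * sealed", so the caller rejects the FD.
 */
static bool fd_has_seals(int fd, unsigned int required)
{
	int seals = fcntl(fd, SHMEM_GET_SEALS);

	if (seals < 0)
		return false;
	return seals_cover((unsigned int)seals, required);
}
```

A graphics server would call fd_has_seals(fd, SHMEM_SEAL_SHRINK) before
mapping a client buffer, while an RPC peer would pass all three seals.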

There is one exception to setting seals: Imagine a library makes use of
sealing. While it creates a new memory object with an FD, another thread
may fork(), retaining a copy of the FD and thus also a reference. Sealing
would no longer be possible until this process closes the FDs or
exec()s. To avoid this race, initial seals can be set on non-exclusive FDs.
This is safe as both sides can, and always have to, verify that the
required set of seals is set. Once they are set, neither side can extend,
reduce or modify the set of seals as long as they have no exclusive
reference.
Note that this exception also allows keeping read-only mmaps() around
during initial sealing (mmaps() also own a reference to the file).

The new API is an extension to fcntl(), adding two new commands:
  SHMEM_GET_SEALS: Return a bitset describing the seals on the file. This
                   can be called on any FD if the underlying file supports
                   sealing.
  SHMEM_SET_SEALS: Change the seals of a given file. This requires WRITE
                   access to the file. If at least one seal is already
                   set, this also requires an exclusive reference. Note
                   that this call will fail with EPERM if there is any
                   active mapping with MAP_SHARED set.
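
The rules for SHMEM_SET_SEALS can be summarized as a small decision
function. This is a userspace sketch of the policy only: the function
name and its boolean parameters are made up for illustration, and the
reference counting and locking done by shmem_set_seals() are left out.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Seal bits as proposed by this patch. */
#define SHMEM_SEAL_SHRINK	0x0001
#define SHMEM_SEAL_GROW		0x0002
#define SHMEM_SEAL_WRITE	0x0004
#define SHMEM_ALL_SEALS		(SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW | \
				 SHMEM_SEAL_WRITE)

/*
 * Hypothetical model of the SET_SEALS policy.
 *   seals             current seals of the file, updated on success
 *   new_seals         requested set of seals
 *   fd_writable       the FD was opened with write access
 *   has_writable_maps a shared writable mapping of the file exists
 *   exclusive         the caller holds the only reference to the file
 * Returns 0 on success or a negative errno value.
 */
static int model_set_seals(uint32_t *seals, uint32_t new_seals,
			   bool fd_writable, bool has_writable_maps,
			   bool exclusive)
{
	if (!fd_writable)
		return -EPERM;
	if (new_seals & ~(uint32_t)SHMEM_ALL_SEALS)
		return -EINVAL;

	/* initial sealing is allowed without an exclusive reference */
	if (has_writable_maps || (!exclusive && *seals != 0))
		return -EPERM;

	*seals = new_seals;
	return 0;
}
```

With this model, a first call on an unsealed, shared file succeeds, a
second call from a non-exclusive holder fails with -EPERM, and only an
exclusive holder can change or clear the seals again.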

The fcntl() handler is currently specific to shmem. There is no intention
to support this on other file-systems, which is why the bits are prefixed
with SHMEM_*. Furthermore, sealing is supported on all shmem-files.
Setting seals requires write-access, so this doesn't allow any DoS attacks
on existing shmem users (just like mandatory locking).

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |  12 ++-
 include/linux/shmem_fs.h   |  17 ++++
 include/uapi/linux/fcntl.h |  13 +++
 mm/shmem.c                 | 200 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 236 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ef68665..eea0b65 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
@@ -248,9 +249,10 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 #endif
 
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
-		struct file *filp)
+		     struct fd f)
 {
 	long err = -EINVAL;
+	struct file *filp = f.file;
 
 	switch (cmd) {
 	case F_DUPFD:
@@ -326,6 +328,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
 		break;
+	case SHMEM_SET_SEALS:
+	case SHMEM_GET_SEALS:
+		err = shmem_fcntl(f, cmd, arg);
+		break;
 	default:
 		break;
 	}
@@ -360,7 +366,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 
 	err = security_file_fcntl(f.file, cmd, arg);
 	if (!err)
-		err = do_fcntl(fd, cmd, arg, f.file);
+		err = do_fcntl(fd, cmd, arg, f);
 
 out1:
  	fdput(f);
@@ -397,7 +403,7 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
 					(struct flock64 __user *) arg);
 			break;
 		default:
-			err = do_fcntl(fd, cmd, arg, f.file);
+			err = do_fcntl(fd, cmd, arg, f);
 			break;
 	}
 out1:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9d55438..6a3f685 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H
 
+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
@@ -20,6 +21,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	u32			seals;		/* shmem seals */
 	struct inode		vfs_inode;
 };
 
@@ -57,6 +59,21 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
+#ifdef CONFIG_SHMEM
+
+extern int shmem_set_seals(struct file *file, u32 seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	return -EINVAL;
+}
+
+#endif
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..8f31bef 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,19 @@
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
 
 /*
+ * Set/Get seals
+ */
+#define SHMEM_SET_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define SHMEM_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define SHMEM_SEAL_SHRINK	0x0001	/* prevent file from shrinking */
+#define SHMEM_SEAL_GROW		0x0002	/* prevent file from growing */
+#define SHMEM_SEAL_WRITE	0x0004	/* prevent writes */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 1f18c9d..44d7f3b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -596,16 +597,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	loff_t oldsize = inode->i_size;
+	loff_t newsize = attr->ia_size;
 	int error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
 		return error;
 
-	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
-		loff_t oldsize = inode->i_size;
-		loff_t newsize = attr->ia_size;
+	/* protected by i_mutex */
+	if (attr->ia_valid & ATTR_SIZE) {
+		if ((newsize < oldsize && (info->seals & SHMEM_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & SHMEM_SEAL_GROW)))
+			return -EPERM;
+	}
 
+	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1354,6 +1362,13 @@ out_nomem:
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	/* protected by mmap_sem and owns additional file-reference */
+	if ((info->seals & SHMEM_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
+		return -EPERM;
+
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	return 0;
@@ -1433,7 +1448,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (info->seals & SHMEM_SEAL_WRITE)
+		return -EPERM;
+	if ((info->seals & SHMEM_SEAL_GROW) && pos + len > inode->i_size)
+		return -EPERM;
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
@@ -1802,11 +1825,171 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+#define SHMEM_ALL_SEALS (SHMEM_SEAL_SHRINK | \
+			 SHMEM_SEAL_GROW | \
+			 SHMEM_SEAL_WRITE)
+
+int shmem_set_seals(struct file *file, u32 seals)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	bool has_writers, has_readers;
+	int r;
+
+	/*
+	 * SHMEM SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations as long as more than
+	 * one party has access to the inode. This way, mutually untrusted
+	 * parties can share common memory regions with a well-defined policy.
+	 *
+	 * Seals can be set on any shmem-file, but always affect the whole
+	 * underlying inode. Once a seal is set, it may prevent some kinds of
+	 * access to the file. Currently, the following seals are defined:
+	 *   SHRINK: Prevent the file from shrinking
+	 *   GROW: Prevent the file from growing
+	 *   WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * cannot allow asynchronous sealing. Instead, sealing is only allowed
+	 * if you own an exclusive reference to the shmem-file. Each FD, each
+	 * mmap and any link increase the ref-count. So as long as you have any
+	 * access to the file, you can be sure no-one (besides perhaps you) can
+	 * modify the seals.
+	 * There is one exception: Setting initial seals is allowed even if
+	 * there are multiple references to the file (but no writable mappings
+	 * may exist). Once *any* seal is set, removing or changing it requires
+	 * an exclusive reference, though.
+	 *
+	 * The combination of SHRINK and WRITE also guarantees that any mapped
+	 * region will not get destructed asynchronously. Even if at some point
+	 * revoke() is supported, the region will stay mapped (maybe only
+	 * privately) and accessible.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/* require write-access to modify seals */
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (seals & ~(u32)SHMEM_ALL_SEALS)
+		return -EINVAL;
+
+	/*
+	 * - i_mutex prevents racing write/ftruncate/fallocate/..
+	 * - mmap_sem prevents racing mmap() calls
+	 * - i_lock prevents racing open() calls and new inode-refs
+	 */
+
+	mutex_lock(&inode->i_mutex);
+	down_read(&current->mm->mmap_sem);
+	spin_lock(&inode->i_lock);
+
+	/*
+	 * Changing seals is only allowed on exclusive references. Exception is
+	 * initial sealing, which allows other readers. We need to test for
+	 * i_mmap_writable to prevent VM_SHARED vmas on our exclusive writer.
+	 * i_writecount is not checked, as we explicitly allow writable FDs
+	 * even if sealed. It's the write-operation that is blocked, not the
+	 * writable FD itself.
+	 * Readers are tested the same way F_SETLEASE does it. One dentry,
+	 * inode and file ref combination is allowed.
+	 * Note that we actually allow 2 file-refs: One is the ref in the
+	 * file-table, the other is from the current context.
+	 * Note: for racing dup() calls see GET_SEALS
+	 */
+	has_writers = file->f_mapping->i_mmap_writable > 0;
+
+	has_readers = d_count(dentry) > 1 || atomic_read(&inode->i_count) > 1;
+	has_readers = has_readers || file_count(file) > 2;
+
+	if (has_writers || (has_readers && info->seals != 0)) {
+		r = -EPERM;
+	} else {
+		info->seals = seals;
+		r = 0;
+	}
+
+	spin_unlock(&inode->i_lock);
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&inode->i_mutex);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_set_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	unsigned long flags;
+	int r;
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/*
+	 * Lock i_lock so we don't read seals between file_count() and setting
+	 * the seals in SET_SEALS. Racing get_file()s could end up with an
+	 * inconsistent view.
+	 */
+
+	spin_lock_irqsave(&inode->i_lock, flags);
+	r = info->seals;
+	spin_unlock_irqrestore(&inode->i_lock, flags);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_get_seals);
+
+long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	long r;
+
+	if (f.file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	switch (cmd) {
+	case SHMEM_SET_SEALS:
+		/* disallow upper 32bit */
+		if (arg >> 32)
+			return -EINVAL;
+
+		/*
+		 * shmem_set_seals() allows 2 file-refs, one of the owner and
+		 * one of the current context. Make sure we have a real
+		 * owner-ref here, otherwise the fast-path of __fdget_light
+		 * breaks the assumptions in shmem_set_seals().
+		 */
+
+		if (!(f.flags & FDPUT_FPUT))
+			get_file(f.file);
+
+		r = shmem_set_seals(f.file, arg);
+
+		if (!(f.flags & FDPUT_FPUT))
+			fput(f.file);
+		break;
+	case SHMEM_GET_SEALS:
+		r = shmem_get_seals(f.file);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
@@ -1818,6 +2001,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		/* protected by i_mutex */
+		if (info->seals & SHMEM_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
@@ -1832,6 +2021,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	if ((info->seals & SHMEM_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
+
 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	/* Try to avoid a swapstorm if len is impossible to satisfy */
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 2/6] shm: add sealing API
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox, Ryan Lortie, Hugh Dickins, Johannes Weiner,
	Kay Sievers, dri-devel, Daniel Mack, linux-mm, linux-fsdevel,
	Karol Lewandowski, Lennart Poettering, Greg Kroah-Hartman,
	Tejun Heo, Michael Kerrisk (man-pages),
	Andrew Morton, Linus Torvalds, Alexander Viro

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (e.g.,
for general-purpose IPC), sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy anymore. The
     source may have put sensitive data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked until this seal is removed again.
Unlike locks, seals can only be modified if you own an exclusive reference
to the file. Hence, as long as you hold a reference to a file, you can be
sure that no one besides you can modify the seals (and you can only modify
them if you are the exclusive holder). This makes sealing useful in
situations where no trust-relationship is given.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This currently affects only ftruncate().
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) Both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE
     are set. This guarantees that neither process can modify the data
     while the other side parses it. Furthermore, it guarantees that even
     with writable FDs passed to the peer, it cannot increase the size to
     hit memory-limits of the source process (in case the file-storage is
     accounted to the source).

There is one exception to setting seals: Imagine a library makes use of
sealing. While it creates a new memory object with an FD, another thread
may fork(), retaining a copy of the FD and thus also a reference. Sealing
would no longer be possible until this process closes the FDs or
exec()s. To avoid this race, initial seals can be set on non-exclusive FDs.
This is safe as both sides can, and always have to, verify that the
required set of seals is set. Once they are set, neither side can extend,
reduce or modify the set of seals as long as they have no exclusive
reference.
Note that this exception also allows keeping read-only mmaps() around
during initial sealing (mmaps() also own a reference to the file).

The new API is an extension to fcntl(), adding two new commands:
  SHMEM_GET_SEALS: Return a bitset describing the seals on the file. This
                   can be called on any FD if the underlying file supports
                   sealing.
  SHMEM_SET_SEALS: Change the seals of a given file. This requires WRITE
                   access to the file. If at least one seal is already
                   set, this also requires an exclusive reference. Note
                   that this call will fail with EPERM if there is any
                   active mapping with MAP_SHARED set.
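
From userspace, setting seals could be wrapped as sketched below. This
assumes a kernel carrying this patch: the command number and seal values
are taken from the patch, and on any other kernel the same number may mean
something else, so treat it purely as an illustration. The wrapper name
set_seals() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>

/* Values as proposed by this patch (assumption: not a released ABI). */
#define SHMEM_SET_SEALS   (1024 + 9)
#define SHMEM_SEAL_SHRINK 0x0001
#define SHMEM_SEAL_GROW   0x0002
#define SHMEM_SEAL_WRITE  0x0004

/*
 * Hypothetical wrapper: replace the seal set of 'fd' with 'seals'.
 * Returns 0 on success or a negative errno value. With this patch the
 * expected failures are -EPERM (FD not writable, shared mapping active,
 * or seals already set without an exclusive reference), -EINVAL (unknown
 * seal bits) and -EBADF (not a shmem file or not a valid FD).
 */
static int set_seals(int fd, unsigned long seals)
{
	if (fcntl(fd, SHMEM_SET_SEALS, seals) < 0)
		return -errno;
	return 0;
}
```

A sender would typically call this once, before passing the FD to a peer,
and the peer would then verify the result with SHMEM_GET_SEALS.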

The fcntl() handler is currently specific to shmem. There is no intention
to support this on other file-systems, which is why the names are prefixed
with SHMEM_*. Furthermore, sealing is supported on all shmem-files.
Setting seals requires write-access, so this doesn't allow any DoS attacks
against existing shmem users (just like mandatory locking).

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |  12 ++-
 include/linux/shmem_fs.h   |  17 ++++
 include/uapi/linux/fcntl.h |  13 +++
 mm/shmem.c                 | 200 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 236 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ef68665..eea0b65 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
@@ -248,9 +249,10 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 #endif
 
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
-		struct file *filp)
+		     struct fd f)
 {
 	long err = -EINVAL;
+	struct file *filp = f.file;
 
 	switch (cmd) {
 	case F_DUPFD:
@@ -326,6 +328,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
 		break;
+	case SHMEM_SET_SEALS:
+	case SHMEM_GET_SEALS:
+		err = shmem_fcntl(f, cmd, arg);
+		break;
 	default:
 		break;
 	}
@@ -360,7 +366,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 
 	err = security_file_fcntl(f.file, cmd, arg);
 	if (!err)
-		err = do_fcntl(fd, cmd, arg, f.file);
+		err = do_fcntl(fd, cmd, arg, f);
 
 out1:
  	fdput(f);
@@ -397,7 +403,7 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
 					(struct flock64 __user *) arg);
 			break;
 		default:
-			err = do_fcntl(fd, cmd, arg, f.file);
+			err = do_fcntl(fd, cmd, arg, f);
 			break;
 	}
 out1:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9d55438..6a3f685 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H
 
+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
@@ -20,6 +21,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	u32			seals;		/* shmem seals */
 	struct inode		vfs_inode;
 };
 
@@ -57,6 +59,21 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
+#ifdef CONFIG_SHMEM
+
+extern int shmem_set_seals(struct file *file, u32 seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	return -EINVAL;
+}
+
+#endif
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..8f31bef 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,19 @@
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
 
 /*
+ * Set/Get seals
+ */
+#define SHMEM_SET_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define SHMEM_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define SHMEM_SEAL_SHRINK	0x0001	/* prevent file from shrinking */
+#define SHMEM_SEAL_GROW		0x0002	/* prevent file from growing */
+#define SHMEM_SEAL_WRITE	0x0004	/* prevent writes */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 1f18c9d..44d7f3b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -596,16 +597,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	loff_t oldsize = inode->i_size;
+	loff_t newsize = attr->ia_size;
 	int error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
 		return error;
 
-	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
-		loff_t oldsize = inode->i_size;
-		loff_t newsize = attr->ia_size;
+	/* protected by i_mutex */
+	if (attr->ia_valid & ATTR_SIZE) {
+		if ((newsize < oldsize && (info->seals & SHMEM_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & SHMEM_SEAL_GROW)))
+			return -EPERM;
+	}
 
+	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1354,6 +1362,13 @@ out_nomem:
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	/* protected by mmap_sem and owns additional file-reference */
+	if ((info->seals & SHMEM_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
+		return -EPERM;
+
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	return 0;
@@ -1433,7 +1448,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (info->seals & SHMEM_SEAL_WRITE)
+		return -EPERM;
+	if ((info->seals & SHMEM_SEAL_GROW) && pos + len > inode->i_size)
+		return -EPERM;
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
@@ -1802,11 +1825,171 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+#define SHMEM_ALL_SEALS (SHMEM_SEAL_SHRINK | \
+			 SHMEM_SEAL_GROW | \
+			 SHMEM_SEAL_WRITE)
+
+int shmem_set_seals(struct file *file, u32 seals)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	bool has_writers, has_readers;
+	int r;
+
+	/*
+	 * SHMEM SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations as long as more than
+	 * one party has access to the inode. This way, mutually untrusted
+	 * parties can share common memory regions with a well-defined policy.
+	 *
+	 * Seals can be set on any shmem-file, but always affect the whole
+	 * underlying inode. Once a seal is set, it may prevent some kinds of
+	 * access to the file. Currently, the following seals are defined:
+	 *   SHRINK: Prevent the file from shrinking
+	 *   GROW: Prevent the file from growing
+	 *   WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * cannot allow asynchronous sealing. Instead, sealing is only allowed
+	 * if you own an exclusive reference to the shmem-file. Each FD, each
+	 * mmap and any link increase the ref-count. So as long as you have any
+	 * access to the file, you can be sure no-one (besides perhaps you) can
+	 * modify the seals.
+	 * There is one exception: Setting initial seals is allowed even if
+	 * there are multiple references to the file (but no writable mappings
+	 * may exist). Once *any* seal is set, removing or changing it requires
+	 * an exclusive reference, though.
+	 *
+	 * The combination of SHRINK and WRITE also guarantees that any mapped
+	 * region will not get destructed asynchronously. Even if at some point
+	 * revoke() is supported, the region will stay mapped (maybe only
+	 * privately) and accessible.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/* require write-access to modify seals */
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (seals & ~(u32)SHMEM_ALL_SEALS)
+		return -EINVAL;
+
+	/*
+	 * - i_mutex prevents racing write/ftruncate/fallocate/..
+	 * - mmap_sem prevents racing mmap() calls
+	 * - i_lock prevents racing open() calls and new inode-refs
+	 */
+
+	mutex_lock(&inode->i_mutex);
+	down_read(&current->mm->mmap_sem);
+	spin_lock(&inode->i_lock);
+
+	/*
+	 * Changing seals is only allowed on exclusive references. Exception is
+	 * initial sealing, which allows other readers. We need to test for
+	 * i_mmap_writable to prevent VM_SHARED vmas on our exclusive writer.
+	 * i_writecount is not checked, as we explicitly allow writable FDs
+	 * even if sealed. It's the write-operation that is blocked, not the
+	 * writable FD itself.
+	 * Readers are tested the same way F_SETLEASE does it. One dentry,
+	 * inode and file ref combination is allowed.
+	 * Note that we actually allow 2 file-refs: One is the ref in the
+	 * file-table, the other is from the current context.
+	 * Note: for racing dup() calls see GET_SEALS
+	 */
+	has_writers = file->f_mapping->i_mmap_writable > 0;
+
+	has_readers = d_count(dentry) > 1 || atomic_read(&inode->i_count) > 1;
+	has_readers = has_readers || file_count(file) > 2;
+
+	if (has_writers || (has_readers && info->seals != 0)) {
+		r = -EPERM;
+	} else {
+		info->seals = seals;
+		r = 0;
+	}
+
+	spin_unlock(&inode->i_lock);
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&inode->i_mutex);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_set_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	unsigned long flags;
+	int r;
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/*
+	 * Lock i_lock so we don't read seals between file_count() and setting
+	 * the seals in SET_SEALS. Racing get_file()s could end up with an
+	 * inconsistent view.
+	 */
+
+	spin_lock_irqsave(&inode->i_lock, flags);
+	r = info->seals;
+	spin_unlock_irqrestore(&inode->i_lock, flags);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_get_seals);
+
+long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	long r;
+
+	if (f.file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	switch (cmd) {
+	case SHMEM_SET_SEALS:
+		/* disallow upper 32bit */
+		if ((u32)arg != arg)
+			return -EINVAL;
+
+		/*
+		 * shmem_set_seals() allows 2 file-refs, one of the owner and
+		 * one of the current context. Make sure we have a real
+		 * owner-ref here, otherwise the fast-path of __fdget_light
+		 * breaks the assumptions in shmem_set_seals().
+		 */
+
+		if (!(f.flags & FDPUT_FPUT))
+			get_file(f.file);
+
+		r = shmem_set_seals(f.file, arg);
+
+		if (!(f.flags & FDPUT_FPUT))
+			fput(f.file);
+		break;
+	case SHMEM_GET_SEALS:
+		r = shmem_get_seals(f.file);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
@@ -1818,6 +2001,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		/* protected by i_mutex */
+		if (info->seals & SHMEM_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
@@ -1832,6 +2021,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	if ((info->seals & SHMEM_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
+
 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	/* Try to avoid a swapstorm if len is impossible to satisfy */
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 2/6] shm: add sealing API
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox, Ryan Lortie, Hugh Dickins, Johannes Weiner,
	Kay Sievers, dri-devel, Daniel Mack, linux-mm, linux-fsdevel,
	Karol Lewandowski, Lennart Poettering, Greg Kroah-Hartman,
	Tejun Heo, Michael Kerrisk (man-pages),
	Andrew Morton, Linus Torvalds, Alexander Viro

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (e.g.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy anymore. The
     source may have put sensitive data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked until this seal is removed again.
Unlike locks, seals can only be modified if you own an exclusive reference
to the file. Hence, as long as you hold a reference to a file, you can be
sure that no-one besides you can modify the seals (and you can only modify
them if you are the exclusive holder). This makes sealing useful in
situations where no trust-relationship is given.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This currently affects only ftruncate().
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
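
As a rough illustration of the seal semantics listed above, the following
userspace sketch models the checks this patch adds to shmem_setattr() and
shmem_write_begin(). The helper names seal_check_resize() and
seal_check_write() are hypothetical and exist only for this example; the
seal values are copied from the fcntl.h hunk of this patch. It is a model
of the policy under those assumptions, not kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Seal bits as defined by this patch in include/uapi/linux/fcntl.h. */
#define SHMEM_SEAL_SHRINK 0x0001u /* prevent file from shrinking */
#define SHMEM_SEAL_GROW   0x0002u /* prevent file from growing */
#define SHMEM_SEAL_WRITE  0x0004u /* prevent writes */

/*
 * Hypothetical helper: mirrors the size check added to shmem_setattr().
 * Returns 0 if the resize is allowed, -EPERM if a seal forbids it.
 */
int seal_check_resize(uint32_t seals, int64_t oldsize, int64_t newsize)
{
	if (newsize < oldsize && (seals & SHMEM_SEAL_SHRINK))
		return -EPERM;
	if (newsize > oldsize && (seals & SHMEM_SEAL_GROW))
		return -EPERM;
	return 0;
}

/*
 * Hypothetical helper: mirrors the checks added to shmem_write_begin().
 * A write is rejected if SEAL_WRITE is set, or if SEAL_GROW is set and
 * the write would extend the file beyond its current size.
 */
int seal_check_write(uint32_t seals, int64_t pos, int64_t len, int64_t size)
{
	assert(pos >= 0 && len >= 0); /* caller contract of this model */

	if (seals & SHMEM_SEAL_WRITE)
		return -EPERM;
	if ((seals & SHMEM_SEAL_GROW) && pos + len > size)
		return -EPERM;
	return 0;
}
```

Note that, as in the patch, SEAL_GROW alone still permits writes that stay
within the current file size, and SEAL_SHRINK alone never blocks write().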

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) Both processes can verify that SEAL_SHRINK, SEAL_GROW and SEAL_WRITE
     are set. This guarantees that neither process can modify the data
     while the other side parses it. Furthermore, it guarantees that even
     with writable FDs passed to the peer, it cannot increase the size to
     hit memory-limits of the source process (in case the file-storage is
     accounted to the source).
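
The verification step in both use-cases boils down to a bitmask test on
the value a receiver would obtain from SHMEM_GET_SEALS. A minimal sketch,
assuming the seal values from this patch; seals_satisfy() and the two
REQUIRED_* masks are hypothetical names introduced only for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Seal bits as defined by this patch in include/uapi/linux/fcntl.h. */
#define SHMEM_SEAL_SHRINK 0x0001u
#define SHMEM_SEAL_GROW   0x0002u
#define SHMEM_SEAL_WRITE  0x0004u
#define SHMEM_ALL_SEALS   (SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW | \
			   SHMEM_SEAL_WRITE)

/* Hypothetical policy masks for the two use-cases described above. */
#define REQUIRED_SCANOUT  (SHMEM_SEAL_SHRINK)	/* 1) graphics server */
#define REQUIRED_RPC      (SHMEM_ALL_SEALS)	/* 2) zero-copy RPC */

/*
 * Hypothetical helper: returns true if every seal in 'required' is
 * present in 'have'. Additional seals in 'have' are acceptable.
 */
bool seals_satisfy(uint32_t have, uint32_t required)
{
	/* the policy may only name seals that are actually defined */
	assert((required & ~SHMEM_ALL_SEALS) == 0);

	return (have & required) == required;
}
```

A receiver would reject the FD whenever seals_satisfy() returns false for
its policy, before mapping or parsing anything.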

There is one exception to setting seals: Imagine a library makes use of
sealing. While creating a new memory object with an FD, another thread may
fork(), retaining a copy of the FD and thus also a reference. Sealing
wouldn't be possible anymore, until this process closes the FDs or
exec()s. To avoid this race, initial seals can be set on non-exclusive FDs.
This is safe as both sides can, and always have to, verify that the
required set of seals is set. Once they are set, neither side can extend,
reduce or modify the set of seals as long as they have no exclusive
reference.
Note that this exception also allows keeping read-only mmaps() around
during initial sealing (mmaps() also own a reference to the file).

The new API is an extension to fcntl(), adding two new commands:
  SHMEM_GET_SEALS: Return a bitset describing the seals on the file. This
                   can be called on any FD if the underlying file supports
                   sealing.
  SHMEM_SET_SEALS: Change the seals of a given file. This requires WRITE
                   access to the file. If at least one seal is already
                   set, this also requires an exclusive reference. Note
                   that this call will fail with EPERM if there is any
                   active mapping with MAP_SHARED set.
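
The rules for SHMEM_SET_SEALS can be summarized as a small decision
function. The sketch below models the checks in shmem_set_seals() from
this patch in plain userspace C. The function name set_seals_allowed() and
its boolean parameters are hypothetical: 'has_writers' stands for "a
MAP_SHARED mapping exists" and 'has_readers' for "someone else holds a
reference", both of which the kernel derives from inode and file counters.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Seal bits as defined by this patch in include/uapi/linux/fcntl.h. */
#define SHMEM_SEAL_SHRINK 0x0001u
#define SHMEM_SEAL_GROW   0x0002u
#define SHMEM_SEAL_WRITE  0x0004u
#define SHMEM_ALL_SEALS   (SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW | \
			   SHMEM_SEAL_WRITE)

/*
 * Hypothetical model of shmem_set_seals(): returns 0 if replacing
 * 'cur_seals' with 'new_seals' would be allowed, or a negative errno.
 * The order of the checks follows the patch.
 */
int set_seals_allowed(uint32_t cur_seals, uint32_t new_seals,
		      bool fd_writable, bool has_writers, bool has_readers)
{
	/* seals already stored on the inode are always valid bits */
	assert((cur_seals & ~SHMEM_ALL_SEALS) == 0);

	/* setting seals requires write access to the file */
	if (!fd_writable)
		return -EPERM;

	/* unknown seal bits are rejected */
	if (new_seals & ~SHMEM_ALL_SEALS)
		return -EINVAL;

	/*
	 * Shared writable mappings always block sealing. Other references
	 * only block it once at least one seal is already set, which is
	 * the "initial sealing" exception.
	 */
	if (has_writers || (has_readers && cur_seals != 0))
		return -EPERM;

	return 0;
}
```

With this model, a first call on an unsealed file succeeds even while
other references exist, whereas any later change requires exclusivity.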

The fcntl() handler is currently specific to shmem. There is no intention
to support this on other file-systems, which is why the bits are prefixed
with SHMEM_*. Furthermore, sealing is supported on all shmem-files.
Setting seals requires write-access, so this doesn't allow any DoS attacks
on existing shmem users (just like mandatory locking).

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |  12 ++-
 include/linux/shmem_fs.h   |  17 ++++
 include/uapi/linux/fcntl.h |  13 +++
 mm/shmem.c                 | 200 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 236 insertions(+), 6 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index ef68665..eea0b65 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
@@ -248,9 +249,10 @@ static int f_getowner_uids(struct file *filp, unsigned long arg)
 #endif
 
 static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
-		struct file *filp)
+		     struct fd f)
 {
 	long err = -EINVAL;
+	struct file *filp = f.file;
 
 	switch (cmd) {
 	case F_DUPFD:
@@ -326,6 +328,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
 		break;
+	case SHMEM_SET_SEALS:
+	case SHMEM_GET_SEALS:
+		err = shmem_fcntl(f, cmd, arg);
+		break;
 	default:
 		break;
 	}
@@ -360,7 +366,7 @@ SYSCALL_DEFINE3(fcntl, unsigned int, fd, unsigned int, cmd, unsigned long, arg)
 
 	err = security_file_fcntl(f.file, cmd, arg);
 	if (!err)
-		err = do_fcntl(fd, cmd, arg, f.file);
+		err = do_fcntl(fd, cmd, arg, f);
 
 out1:
  	fdput(f);
@@ -397,7 +403,7 @@ SYSCALL_DEFINE3(fcntl64, unsigned int, fd, unsigned int, cmd,
 					(struct flock64 __user *) arg);
 			break;
 		default:
-			err = do_fcntl(fd, cmd, arg, f.file);
+			err = do_fcntl(fd, cmd, arg, f);
 			break;
 	}
 out1:
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 9d55438..6a3f685 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H
 
+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
@@ -20,6 +21,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	u32			seals;		/* shmem seals */
 	struct inode		vfs_inode;
 };
 
@@ -57,6 +59,21 @@ extern struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 extern void shmem_truncate_range(struct inode *inode, loff_t start, loff_t end);
 extern int shmem_unuse(swp_entry_t entry, struct page *page);
 
+#ifdef CONFIG_SHMEM
+
+extern int shmem_set_seals(struct file *file, u32 seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	return -EINVAL;
+}
+
+#endif
+
 static inline struct page *shmem_read_mapping_page(
 				struct address_space *mapping, pgoff_t index)
 {
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..8f31bef 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,19 @@
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
 
 /*
+ * Set/Get seals
+ */
+#define SHMEM_SET_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define SHMEM_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define SHMEM_SEAL_SHRINK	0x0001	/* prevent file from shrinking */
+#define SHMEM_SEAL_GROW		0x0002	/* prevent file from growing */
+#define SHMEM_SEAL_WRITE	0x0004	/* prevent writes */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 1f18c9d..44d7f3b 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -596,16 +597,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	loff_t oldsize = inode->i_size;
+	loff_t newsize = attr->ia_size;
 	int error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
 		return error;
 
-	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
-		loff_t oldsize = inode->i_size;
-		loff_t newsize = attr->ia_size;
+	/* protected by i_mutex */
+	if (attr->ia_valid & ATTR_SIZE) {
+		if ((newsize < oldsize && (info->seals & SHMEM_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & SHMEM_SEAL_GROW)))
+			return -EPERM;
+	}
 
+	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1354,6 +1362,13 @@ out_nomem:
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	/* protected by mmap_sem and owns additional file-reference */
+	if ((info->seals & SHMEM_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
+		return -EPERM;
+
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	return 0;
@@ -1433,7 +1448,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (info->seals & SHMEM_SEAL_WRITE)
+		return -EPERM;
+	if ((info->seals & SHMEM_SEAL_GROW) && pos + len > inode->i_size)
+		return -EPERM;
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
@@ -1802,11 +1825,171 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+#define SHMEM_ALL_SEALS (SHMEM_SEAL_SHRINK | \
+			 SHMEM_SEAL_GROW | \
+			 SHMEM_SEAL_WRITE)
+
+int shmem_set_seals(struct file *file, u32 seals)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	bool has_writers, has_readers;
+	int r;
+
+	/*
+	 * SHMEM SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations as long as more than
+	 * one party has access to the inode. This way, mutually untrusted
+	 * parties can share common memory regions with a well-defined policy.
+	 *
+	 * Seals can be set on any shmem-file, but always affect the whole
+	 * underlying inode. Once a seal is set, it may prevent some kinds of
+	 * access to the file. Currently, the following seals are defined:
+	 *   SHRINK: Prevent the file from shrinking
+	 *   GROW: Prevent the file from growing
+	 *   WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * cannot allow asynchronous sealing. Instead, sealing is only allowed
+	 * if you own an exclusive reference to the shmem-file. Each FD, each
+	 * mmap and any link increase the ref-count. So as long as you have any
+	 * access to the file, you can be sure no-one (besides perhaps you) can
+	 * modify the seals.
+	 * There is one exception: Setting initial seals is allowed even if
+	 * there are multiple references to the file (but no writable mappings
+	 * may exist). Once *any* seal is set, removing or changing it requires
+	 * an exclusive reference, though.
+	 *
+	 * The combination of SHRINK and WRITE also guarantees that any mapped
+	 * region will not get destructed asynchronously. Even if at some point
+	 * revoke() is supported, the region will stay mapped (maybe only
+	 * privately) and accessible.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/* require write-access to modify seals */
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+
+	if (seals & ~(u32)SHMEM_ALL_SEALS)
+		return -EINVAL;
+
+	/*
+	 * - i_mutex prevents racing write/ftruncate/fallocate/..
+	 * - mmap_sem prevents racing mmap() calls
+	 * - i_lock prevents racing open() calls and new inode-refs
+	 */
+
+	mutex_lock(&inode->i_mutex);
+	down_read(&current->mm->mmap_sem);
+	spin_lock(&inode->i_lock);
+
+	/*
+	 * Changing seals is only allowed on exclusive references. Exception is
+	 * initial sealing, which allows other readers. We need to test for
+	 * i_mmap_writable to prevent VM_SHARED vmas on our exclusive writer.
+	 * i_writecount is not checked, as we explicitly allow writable FDs
+	 * even if sealed. It's the write-operation that is blocked, not the
+	 * writable FD itself.
+	 * Readers are tested the same way F_SETLEASE does it. One dentry,
+	 * inode and file ref combination is allowed.
+	 * Note that we actually allow 2 file-refs: One is the ref in the
+	 * file-table, the other is from the current context.
+	 * Note: for racing dup() calls see GET_SEALS
+	 */
+	has_writers = file->f_mapping->i_mmap_writable > 0;
+
+	has_readers = d_count(dentry) > 1 || atomic_read(&inode->i_count) > 1;
+	has_readers = has_readers || file_count(file) > 2;
+
+	if (has_writers || (has_readers && info->seals != 0)) {
+		r = -EPERM;
+	} else {
+		info->seals = seals;
+		r = 0;
+	}
+
+	spin_unlock(&inode->i_lock);
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&inode->i_mutex);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_set_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	unsigned long flags;
+	int r;
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	/*
+	 * Lock i_lock so we don't read seals between file_count() and setting
+	 * the seals in SET_SEALS. Racing get_file()s could end up with an
+	 * inconsistent view.
+	 */
+
+	spin_lock_irqsave(&inode->i_lock, flags);
+	r = info->seals;
+	spin_unlock_irqrestore(&inode->i_lock, flags);
+
+	return r;
+}
+EXPORT_SYMBOL(shmem_get_seals);
+
+long shmem_fcntl(struct fd f, unsigned int cmd, unsigned long arg)
+{
+	long r;
+
+	if (f.file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	switch (cmd) {
+	case SHMEM_SET_SEALS:
+		/* disallow upper 32bit */
+		if (arg >> 32)
+			return -EINVAL;
+
+		/*
+		 * shmem_set_seals() allows 2 file-refs, one of the owner and
+		 * one of the current context. Make sure we have a real
+		 * owner-ref here, otherwise the fast-path of __fdget_light
+		 * breaks the assumptions in shmem_set_seals().
+		 */
+
+		if (!(f.flags & FDPUT_FPUT))
+			get_file(f.file);
+
+		r = shmem_set_seals(f.file, arg);
+
+		if (!(f.flags & FDPUT_FPUT))
+			fput(f.file);
+		break;
+	case SHMEM_GET_SEALS:
+		r = shmem_get_seals(f.file);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
@@ -1818,6 +2001,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		/* protected by i_mutex */
+		if (info->seals & SHMEM_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
@@ -1832,6 +2021,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	if ((info->seals & SHMEM_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
+
 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	/* Try to avoid a swapstorm if len is impossible to satisfy */
-- 
1.9.0

^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It explicitly allows sealing and
avoids any connection to user-visible mount-points. Thus, it's not
subject to quotas on mounted file-systems and can be used like
malloc()'ed memory, only with a file-descriptor attached to it.

memfd_create() does not create a front-FD, but instead returns the raw
shmem file, so calls like ftruncate() can be used. Also, calls like fstat()
will return proper information and mark the file as a regular file. Sealing
is explicitly supported on memfds.

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and the like.
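
The syscall body in this patch derives the name of the shmem file from the
user-supplied string by prepending "memfd:" and limiting the length to
MFD_MAX_NAMELEN bytes including the terminating zero. The following
userspace sketch models only that naming rule. memfd_build_name() and
MFD_PREFIX are hypothetical names for this example, and the model uses
strlen() on a trusted string where the kernel uses strnlen_user() and
copy_from_user().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* From the patch: maximum name length, including the terminating zero. */
#define MFD_MAX_NAMELEN 256
/* Hypothetical macro for the prefix that the patch hard-codes. */
#define MFD_PREFIX "memfd:"

/*
 * Hypothetical helper: writes "memfd:" followed by 'uname' into 'dst'.
 * Returns 0 on success or -EINVAL if the name is too long, matching the
 * length check in the patch. 'dst' must hold MFD_MAX_NAMELEN + 6 bytes
 * to fit every accepted name.
 */
int memfd_build_name(const char *uname, char *dst, size_t dstlen)
{
	size_t prefix = sizeof(MFD_PREFIX) - 1;	/* 6, without the zero */
	size_t len = strlen(uname) + 1;		/* includes the zero */

	if (len > MFD_MAX_NAMELEN)
		return -EINVAL;

	/* caller contract of this model: destination is large enough */
	assert(dstlen >= prefix + len);

	memcpy(dst, MFD_PREFIX, prefix);
	memcpy(dst + prefix, uname, len);
	return 0;
}
```

Under these assumptions a name of 255 characters is the longest one that
is accepted; anything longer yields -EINVAL.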

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       |  9 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 67 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506..c943b8a 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
+353	i386	memfd_create		sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..e9d56a8 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	memfd_create		sys_memfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77..124b838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -791,6 +791,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..d74cc89
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,9 @@
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+#include <linux/types.h>
+
+/* flags for memfd_create(2) */
+#define MFD_CLOEXEC		0x0001
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..53e05af 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 44d7f3b..48feb42 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -3039,6 +3041,71 @@ out4:
 	return error;
 }
 
+/* maximum length of memfd names */
+#define MFD_MAX_NAMELEN 256
+
+SYSCALL_DEFINE3(memfd_create,
+		const char*, uname,
+		u64, size,
+		u64, flags)
+{
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_CLOEXEC)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_MAX_NAMELEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_MAX_NAMELEN)
+		return -EINVAL;
+
+	name = kmalloc(len + 6, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, "memfd:");
+	if (copy_from_user(&name[6], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + 6 - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
1.9.0


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH 3/6] shm: add memfd_create() syscall
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It explicitly allows sealing and
avoids any connection to user-visible mount-points. Thus, it's not
subject to quotas on mounted file-systems and can be used like
malloc()'ed memory, only with a file-descriptor attached to it.

memfd_create() does not create a front-FD, but instead returns the raw
shmem file, so calls like ftruncate() can be used. Also, calls like fstat()
will return proper information and mark the file as a regular file. Sealing
is explicitly supported on memfds.

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and the like.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       |  9 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 67 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506..c943b8a 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
+353	i386	memfd_create		sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..e9d56a8 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	memfd_create		sys_memfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77..124b838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -791,6 +791,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..d74cc89
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,9 @@
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+#include <linux/types.h>
+
+/* flags for memfd_create(2) */
+#define MFD_CLOEXEC		0x0001
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..53e05af 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 44d7f3b..48feb42 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -3039,6 +3041,71 @@ out4:
 	return error;
 }
 
+/* maximum length of memfd names */
+#define MFD_MAX_NAMELEN 256
+
+SYSCALL_DEFINE3(memfd_create,
+		const char __user *, uname,
+		u64, size,
+		u64, flags)
+{
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_CLOEXEC)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_MAX_NAMELEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_MAX_NAMELEN)
+		return -EINVAL;
+
+	name = kmalloc(len + 6, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, "memfd:");
+	if (copy_from_user(&name[6], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + 6 - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
1.9.0
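
The name handling in the syscall above can be modelled in plain userspace C.
This is only a sketch under stated assumptions: mfd_build_name() and
MFD_PREFIX are hypothetical names, strnlen() stands in for strnlen_user(),
malloc() for kmalloc(), and the re-check of the terminating zero is dropped
because nothing can race with a userspace copy.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MFD_MAX_NAMELEN 256
#define MFD_PREFIX "memfd:"	/* hypothetical macro for the literal prefix */

/* Returns a malloc()ed "memfd:<uname>" string, or NULL with *err set. */
static char *mfd_build_name(const char *uname, int *err)
{
	size_t prefix = strlen(MFD_PREFIX);
	size_t len;
	char *name;

	if (!uname) {
		*err = EFAULT;
		return NULL;
	}

	/* len counts the terminating zero, as strnlen_user() does */
	len = strnlen(uname, MFD_MAX_NAMELEN) + 1;
	if (len > MFD_MAX_NAMELEN) {
		*err = EINVAL;
		return NULL;
	}

	name = malloc(prefix + len);
	if (!name) {
		*err = ENOMEM;
		return NULL;
	}

	memcpy(name, MFD_PREFIX, prefix);
	memcpy(name + prefix, uname, len);	/* copies the terminator too */
	*err = 0;
	return name;
}
```

With this model, a name of up to 255 characters is accepted and anything
longer is rejected with EINVAL, which matches the length check in the patch.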

* [PATCH 3/6] shm: add memfd_create() syscall
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox, Ryan Lortie, Hugh Dickins, Johannes Weiner,
	Kay Sievers, dri-devel, Daniel Mack, linux-mm, linux-fsdevel,
	Karol Lewandowski, Lennart Poettering, Greg Kroah-Hartman,
	Tejun Heo, Michael Kerrisk (man-pages),
	Andrew Morton, Linus Torvalds, Alexander Viro

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). The resulting file has no connection to any
user-visible mount-point, so it is not subject to quotas on mounted
file-systems. It can be used like malloc()'ed memory, except that you also
get a file-descriptor to it.

memfd_create() does not create a front-FD, but instead returns the raw
shmem file, so calls like ftruncate() can be used. Calls like fstat() also
return proper information and report the file as a regular file. Sealing
is explicitly supported on memfds.

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and the like.
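
As a rough illustration of the usage described above, the following sketch
creates a memfd, sizes it with ftruncate() and checks what fstat() reports.
It assumes the two-argument memfd_create(name, flags) form that was later
merged in Linux 3.17, not the three-argument form proposed in this patch
(which takes the size directly). make_memfd() and memfd_looks_regular() are
hypothetical helper names.

```c
#define _GNU_SOURCE
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallback in case the libc headers do not provide the flag. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/* Returns a memfd of the given size, or -1 on failure.
 * Assumption: merged two-argument syscall, so the size is set separately. */
static int make_memfd(const char *name, off_t size)
{
	int fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Returns 1 if fstat() reports a regular file of exactly `size` bytes. */
static int memfd_looks_regular(int fd, off_t size)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return 0;
	return S_ISREG(st.st_mode) && st.st_size == size;
}
```

The returned descriptor can then be passed to mmap() like any other
file-descriptor, for example with MAP_SHARED to share the memory with the
receiver of the FD.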

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       |  9 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 67 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 80 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506..c943b8a 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
+353	i386	memfd_create		sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index a12bddc..e9d56a8 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -322,6 +322,7 @@
 313	common	finit_module		sys_finit_module
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
+316	common	memfd_create		sys_memfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a747a77..124b838 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -791,6 +791,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char __user *uname_ptr, u64 size, u64 flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..d74cc89
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,9 @@
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+#include <linux/types.h>
+
+/* flags for memfd_create(2) */
+#define MFD_CLOEXEC		0x0001
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7078052..53e05af 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -193,6 +193,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 44d7f3b..48feb42 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -3039,6 +3041,71 @@ out4:
 	return error;
 }
 
+/* maximum length of memfd names */
+#define MFD_MAX_NAMELEN 256
+
+SYSCALL_DEFINE3(memfd_create,
+		const char __user *, uname,
+		u64, size,
+		u64, flags)
+{
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_CLOEXEC)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_MAX_NAMELEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_MAX_NAMELEN)
+		return -EINVAL;
+
+	name = kmalloc(len + 6, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, "memfd:");
+	if (copy_from_user(&name[6], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + 6 - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
1.9.0
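
The flag and size checks at the top of the syscall can be tried out in
userspace with a small model. This is a sketch only: mfd_args_valid() is a
hypothetical name, and int64_t stands in for the kernel's loff_t.

```c
#include <stdint.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/* Userspace model of the argument checks; returns 1 if accepted. */
static int mfd_args_valid(uint64_t size, uint64_t flags)
{
	/* only MFD_CLOEXEC is defined, any other bit is rejected */
	if (flags & ~(uint64_t)MFD_CLOEXEC)
		return 0;

	/* size must survive the round trip through a signed 64-bit
	 * offset and must not turn negative on the way */
	if ((uint64_t)(int64_t)size != size || (int64_t)size < 0)
		return 0;

	return 1;
}
```

With a 64-bit signed offset type the round-trip comparison always holds, so
in practice it is the sign check that rejects sizes with the top bit set.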

* [PATCH 4/6] selftests: add memfd_create() + sealing tests
  2014-03-19 19:06 ` David Herrmann
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

Some basic tests to verify sealing on memfds works as expected and
guarantees the advertised semantics.
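
For readers who want to try the core check by hand, here is a minimal sketch
of what the SEAL_WRITE test verifies. It assumes the interface as it was
later merged (MFD_ALLOW_SEALING, fcntl(F_ADD_SEALS) and F_SEAL_WRITE), not
the SHMEM_SET_SEALS / SHMEM_SEAL_* names used in this series, and
errno_of_write_after_seal() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values are from the merged uapi. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING 0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS 1033
#endif
#ifndef F_SEAL_WRITE
#define F_SEAL_WRITE 0x0008
#endif

/*
 * Seals a fresh memfd against writes and returns the errno that a
 * subsequent write() fails with: 0 if the write unexpectedly succeeds,
 * -1 if memfd or sealing is unavailable on this system.
 */
static int errno_of_write_after_seal(void)
{
	int fd, err;
	ssize_t l;

	fd = (int)syscall(SYS_memfd_create, "seal-demo",
			  MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}

	/* write() reports failure as -1 plus errno, never as -EPERM */
	l = write(fd, "data", 4);
	err = (l == -1) ? errno : 0;
	close(fd);
	return err;
}
```

On a kernel with sealing support the helper is expected to return EPERM.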

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 972 +++++++++++++++++++++++++++++
 4 files changed, 1004 insertions(+)
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed..c57325a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
 TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += kcmp
+TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mqueue
 TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+	ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+	@./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+	$(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..41bac6f
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,972 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65536
+
+static int sys_memfd_create(const char *name,
+			    __u64 size,
+			    __u64 flags)
+{
+	return syscall(__NR_memfd_create, name, size, flags);
+}
+
+static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, sz, flags);
+	if (r < 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
+		       name, (unsigned long long)sz,
+		       (unsigned long long)flags);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, size, flags);
+	if (r >= 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
+		       name, (unsigned long long)size,
+		       (unsigned long long)flags);
+		close(r);
+		abort();
+	}
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, SHMEM_GET_SEALS);
+	if (r < 0) {
+		printf("GET_SEALS(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	if (s != seals) {
+		printf("%llu != %llu = GET_SEALS(%d)\n",
+		       (unsigned long long)seals, (unsigned long long)s, fd);
+		abort();
+	}
+}
+
+static void mfd_assert_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r < 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) failed: %m\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_fail_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r >= 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+	struct stat st;
+	int r;
+
+	r = fstat(fd, &st);
+	if (r < 0) {
+		printf("fstat(%d) failed: %m\n", fd);
+		abort();
+	} else if (st.st_size != size) {
+		printf("wrong file size %lld, but expected %lld\n",
+		       (long long)st.st_size, (long long)size);
+		abort();
+	}
+}
+
+static int mfd_assert_dup(int fd)
+{
+	int r;
+
+	r = dup(fd);
+	if (r < 0) {
+		printf("dup(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r < 0) {
+		printf("open(%s) failed: %m\n", buf);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r >= 0) {
+		printf("open(%s) didn't fail as expected\n", buf);
+		abort();
+	}
+}
+
+static void mfd_assert_read(int fd)
+{
+	char buf[16];
+	void *p;
+	ssize_t l;
+
+	l = read(fd, buf, sizeof(buf));
+	if (l != sizeof(buf)) {
+		printf("read() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ *is* allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify MAP_PRIVATE is *always* allowed (even writable) */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() succeeds */
+	l = write(fd, "\0\0\0\0", 4);
+	if (l != 4) {
+		printf("write() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_READ with MAP_SHARED is allowed and a following
+	 * mprotect(PROT_WRITE) allows writing */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+	if (r < 0) {
+		printf("mprotect() failed: %m\n");
+		abort();
+	}
+
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PUNCH_HOLE works */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_fail_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() fails */
+	l = write(fd, "data", 4);
+	if (l != -1 || errno != EPERM) {
+		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_READ with MAP_SHARED is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PUNCH_HOLE fails */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r >= 0) {
+		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_shrink(int fd)
+{
+	int r, fd2;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r < 0) {
+		printf("ftruncate(SHRINK) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+	fd2 = mfd_assert_open(fd,
+			      O_RDWR | O_CREAT | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+	close(fd2);
+
+	mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r >= 0) {
+		printf("ftruncate(SHRINK) didn't fail as expected\n");
+		abort();
+	}
+
+	mfd_fail_open(fd,
+		      O_RDWR | O_CREAT | O_TRUNC,
+		      S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r < 0) {
+		printf("ftruncate(GROW) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r >= 0) {
+		printf("ftruncate(GROW) didn't fail as expected\n");
+		abort();
+	}
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r >= 0) {
+		printf("fallocate(ALLOC) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l != sizeof(buf)) {
+		printf("pwrite() failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l == sizeof(buf)) {
+		printf("pwrite() didn't fail as expected\n");
+		abort();
+	}
+}
+
+static int idle_thread_fn(void *arg)
+{
+	sigset_t set;
+	int sig;
+
+	/* dummy waiter; SIGTERM terminates us anyway */
+	sigemptyset(&set);
+	sigaddset(&set, SIGTERM);
+	sigwait(&set, &sig);
+
+	return 0;
+}
+
+static pid_t spawn_idle_thread(void)
+{
+	uint8_t *stack;
+	pid_t pid;
+
+	stack = malloc(STACK_SIZE);
+	if (!stack) {
+		printf("malloc(STACK_SIZE) failed: %m\n");
+		abort();
+	}
+
+	pid = clone(idle_thread_fn,
+		    stack + STACK_SIZE,
+		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		printf("clone() failed: %m\n");
+		abort();
+	}
+
+	return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+static pid_t spawn_idle_proc(void)
+{
+	pid_t pid;
+	sigset_t set;
+	int sig;
+
+	pid = fork();
+	if (pid < 0) {
+		printf("fork() failed: %m\n");
+		abort();
+	} else if (!pid) {
+		/* dummy waiter; SIGTERM terminates us anyway */
+		sigemptyset(&set);
+		sigaddset(&set, SIGTERM);
+		sigwait(&set, &sig);
+		exit(0);
+	}
+
+	return pid;
+}
+
+static void join_idle_proc(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+	char buf[2048];
+	int fd;
+
+	/* test NULL name */
+	mfd_fail_new(NULL, 0, 0);
+
+	/* test over-long name (not zero-terminated) */
+	memset(buf, 0xff, sizeof(buf));
+	mfd_fail_new(buf, 0, 0);
+
+	/* test over-long zero-terminated name */
+	memset(buf, 0xff, sizeof(buf));
+	buf[sizeof(buf) - 1] = 0;
+	mfd_fail_new(buf, 0, 0);
+
+	/* verify "" is a valid name */
+	fd = mfd_assert_new("", 0, 0);
+	close(fd);
+
+	/* verify invalid MFD_* flags are rejected */
+	mfd_fail_new("", 0, 0x0100);
+	mfd_fail_new("", 0, ~MFD_CLOEXEC);
+	mfd_fail_new("", 0, ~0);
+	mfd_fail_new("", 0, 0x8000000000000000ULL);
+
+	/* verify MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+	close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_write",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_assert_read(fd);
+	mfd_fail_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_shrink",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_assert_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_grow",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_resize",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_dup(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_dup",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* try again but switch FDs to test that they're equal */
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Modifying seals is only allowed if no other mmap() refs exist, except for
+ * initial sealing, which allows read-only mappings. Test for the different
+ * combinations here.
+ */
+static void test_share_mmap(void)
+{
+	int fd;
+	void *p;
+
+	fd = mfd_assert_new("kern_memfd_share_mmap",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	/* shared/writable ref prevents sealing */
+	p = mfd_assert_mmap_shared(fd);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, 0);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* readable ref allows initial sealing, but prevents modifications */
+	p = mfd_assert_mmap_private(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* dropping all additional refs allows modifications again */
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* test that RDONLY doesn't allow setting seals, even if exclusive */
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+
+	/* same again but with writable open */
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_fork(void)
+{
+	int fd;
+	pid_t pid;
+
+	fd = mfd_assert_new("kern_memfd_share_fork",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	pid = spawn_idle_proc();
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	join_idle_proc(pid);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+
+	printf("memfd: CREATE\n");
+	test_create();
+	printf("memfd: BASIC\n");
+	test_basic();
+
+	printf("memfd: SEAL-WRITE\n");
+	test_seal_write();
+	printf("memfd: SEAL-SHRINK\n");
+	test_seal_shrink();
+	printf("memfd: SEAL-GROW\n");
+	test_seal_grow();
+	printf("memfd: SEAL-RESIZE\n");
+	test_seal_resize();
+
+	printf("memfd: SHARE-DUP\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK\n");
+	test_share_fork();
+
+	/* Run test-suite in a multi-threaded environment with a shared
+	 * file-table. This triggers the slow-path in fdget() in the kernel. */
+	pid = spawn_idle_thread();
+	printf("memfd: SHARE-DUP (shared file-table)\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP (shared file-table)\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN (shared file-table)\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK (shared file-table)\n");
+	test_share_fork();
+	join_idle_thread(pid);
+
+	printf("memfd: DONE\n");
+
+	return 0;
+}
-- 
1.9.0
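
The difference between dup() and a reopen via /proc/self/fd that the
SHARE-OPEN test relies on can be observed from userspace through the file
offset. The sketch below is an illustration only: it assumes the merged
two-argument memfd_create() syscall and a mounted /proc, and
reopen_is_separate_context() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC 0x0001U
#endif

/*
 * Returns 1 if a /proc reopen has its own file offset while dup()
 * shares it, 0 if not, -1 if memfd or /proc is unavailable.
 */
static int reopen_is_separate_context(void)
{
	char path[64];
	int fd, fd_dup, fd_open, ret;

	fd = (int)syscall(SYS_memfd_create, "reopen-demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (write(fd, "data", 4) != 4) {
		close(fd);
		return -1;
	}

	fd_dup = dup(fd);
	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	fd_open = open(path, O_RDONLY);

	if (fd_dup < 0 || fd_open < 0) {
		ret = -1;
	} else {
		/* dup() shares the offset (4), the reopen starts at 0 */
		ret = lseek(fd_dup, 0, SEEK_CUR) == 4 &&
		      lseek(fd_open, 0, SEEK_CUR) == 0;
	}

	if (fd_dup >= 0)
		close(fd_dup);
	if (fd_open >= 0)
		close(fd_open);
	close(fd);
	return ret;
}
```

The independent offset shows that the reopened descriptor refers to a
separate file-context, which is why the test checks the access mode of each
descriptor separately.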

* [PATCH 4/6] selftests: add memfd_create() + sealing tests
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

Add some basic tests to verify that sealing on memfds works as expected and
guarantees the advertised semantics.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 972 +++++++++++++++++++++++++++++
 4 files changed, 1004 insertions(+)
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed..c57325a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
 TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += kcmp
+TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mqueue
 TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+	ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+	@./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+	$(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..41bac6f
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,972 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65536
+
+static int sys_memfd_create(const char *name,
+			    __u64 size,
+			    __u64 flags)
+{
+	return syscall(__NR_memfd_create, name, size, flags);
+}
+
+static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, sz, flags);
+	if (r < 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
+		       name, (unsigned long long)sz,
+		       (unsigned long long)flags);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, size, flags);
+	if (r >= 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
+		       name, (unsigned long long)size,
+		       (unsigned long long)flags);
+		close(r);
+		abort();
+	}
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, SHMEM_GET_SEALS);
+	if (r < 0) {
+		printf("GET_SEALS(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	if (s != seals) {
+		printf("%llu != %llu = GET_SEALS(%d)\n",
+		       (unsigned long long)seals, (unsigned long long)s, fd);
+		abort();
+	}
+}
+
+static void mfd_assert_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r < 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) failed: %m\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_fail_set_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, SHMEM_SET_SEALS, seals);
+	if (r >= 0) {
+		printf("SET_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+	struct stat st;
+	int r;
+
+	r = fstat(fd, &st);
+	if (r < 0) {
+		printf("fstat(%d) failed: %m\n", fd);
+		abort();
+	} else if (st.st_size != size) {
+		printf("wrong file size %lld, but expected %lld\n",
+		       (long long)st.st_size, (long long)size);
+		abort();
+	}
+}
+
+static int mfd_assert_dup(int fd)
+{
+	int r;
+
+	r = dup(fd);
+	if (r < 0) {
+		printf("dup(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r < 0) {
+		printf("open(%s) failed: %m\n", buf);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r >= 0) {
+		printf("open(%s) didn't fail as expected\n", buf);
+		abort();
+	}
+}
+
+static void mfd_assert_read(int fd)
+{
+	char buf[16];
+	void *p;
+	ssize_t l;
+
+	l = read(fd, buf, sizeof(buf));
+	if (l != sizeof(buf)) {
+		printf("read() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ *is* allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify MAP_PRIVATE is *always* allowed (even writable) */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() succeeds */
+	l = write(fd, "\0\0\0\0", 4);
+	if (l != 4) {
+		printf("write() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_READ with MAP_SHARED is allowed and a following
+	 * mprotect(PROT_WRITE) allows writing */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+	if (r < 0) {
+		printf("mprotect() failed: %m\n");
+		abort();
+	}
+
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PUNCH_HOLE works */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_fail_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() fails */
+	l = write(fd, "data", 4);
+	if (l != -1 || errno != EPERM) {
+		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_READ with MAP_SHARED is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PUNCH_HOLE fails */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r >= 0) {
+		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_shrink(int fd)
+{
+	int r, fd2;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r < 0) {
+		printf("ftruncate(SHRINK) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+	fd2 = mfd_assert_open(fd,
+			      O_RDWR | O_CREAT | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+	close(fd2);
+
+	mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r >= 0) {
+		printf("ftruncate(SHRINK) didn't fail as expected\n");
+		abort();
+	}
+
+	mfd_fail_open(fd,
+		      O_RDWR | O_CREAT | O_TRUNC,
+		      S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r < 0) {
+		printf("ftruncate(GROW) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r >= 0) {
+		printf("ftruncate(GROW) didn't fail as expected\n");
+		abort();
+	}
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r >= 0) {
+		printf("fallocate(ALLOC) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l != sizeof(buf)) {
+		printf("pwrite() failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l == sizeof(buf)) {
+		printf("pwrite() didn't fail as expected\n");
+		abort();
+	}
+}
+
+static int idle_thread_fn(void *arg)
+{
+	sigset_t set;
+	int sig;
+
+	/* dummy waiter; SIGTERM terminates us anyway */
+	sigemptyset(&set);
+	sigaddset(&set, SIGTERM);
+	sigwait(&set, &sig);
+
+	return 0;
+}
+
+static pid_t spawn_idle_thread(void)
+{
+	uint8_t *stack;
+	pid_t pid;
+
+	stack = malloc(STACK_SIZE);
+	if (!stack) {
+		printf("malloc(STACK_SIZE) failed: %m\n");
+		abort();
+	}
+
+	pid = clone(idle_thread_fn,
+		    stack + STACK_SIZE,
+		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		printf("clone() failed: %m\n");
+		abort();
+	}
+
+	return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+static pid_t spawn_idle_proc(void)
+{
+	pid_t pid;
+	sigset_t set;
+	int sig;
+
+	pid = fork();
+	if (pid < 0) {
+		printf("fork() failed: %m\n");
+		abort();
+	} else if (!pid) {
+		/* dummy waiter; SIGTERM terminates us anyway */
+		sigemptyset(&set);
+		sigaddset(&set, SIGTERM);
+		sigwait(&set, &sig);
+		exit(0);
+	}
+
+	return pid;
+}
+
+static void join_idle_proc(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+	char buf[2048];
+	int fd;
+
+	/* test NULL name */
+	mfd_fail_new(NULL, 0, 0);
+
+	/* test over-long name (not zero-terminated) */
+	memset(buf, 0xff, sizeof(buf));
+	mfd_fail_new(buf, 0, 0);
+
+	/* test over-long zero-terminated name */
+	memset(buf, 0xff, sizeof(buf));
+	buf[sizeof(buf) - 1] = 0;
+	mfd_fail_new(buf, 0, 0);
+
+	/* verify "" is a valid name */
+	fd = mfd_assert_new("", 0, 0);
+	close(fd);
+
+	/* verify invalid O_* open flags */
+	mfd_fail_new("", 0, 0x0100);
+	mfd_fail_new("", 0, ~MFD_CLOEXEC);
+	mfd_fail_new("", 0, ~0);
+	mfd_fail_new("", 0, 0x8000000000000000ULL);
+
+	/* verify MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+	close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK |
+				 SHMEM_SEAL_GROW |
+				 SHMEM_SEAL_WRITE);
+	close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_write",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_assert_read(fd);
+	mfd_fail_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_shrink",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_assert_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_grow",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_resize",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK | SHMEM_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_dup(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_dup",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* try again but switch FDs to test that they're equal */
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Modifying seals is only allowed if no other mmap() refs exist, except for
+ * initial sealing, which allows read-only mappings. Test for the different
+ * combinations here.
+ */
+static void test_share_mmap(void)
+{
+	int fd;
+	void *p;
+
+	fd = mfd_assert_new("kern_memfd_share_mmap",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	/* shared/writable ref prevents sealing */
+	p = mfd_assert_mmap_shared(fd);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, 0);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* readable ref allows initial sealing, but prevents modifications */
+	p = mfd_assert_mmap_private(fd);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_SHRINK);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* dropping all additional refs allows modifications again */
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	close(fd2);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	/* test that RDONLY doesn't allow setting seals, even if exclusive */
+
+	fd2 = mfd_assert_open(fd, O_RDONLY, 0);
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd);
+
+	mfd_fail_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+
+	/* same again but with writable open */
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE);
+
+	close(fd);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd2, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd2, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd2, 0);
+	mfd_assert_has_seals(fd2, 0);
+
+	close(fd2);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications are correctly discarded if multiple FDs for
+ * the same file exist.
+ */
+static void test_share_fork(void)
+{
+	int fd;
+	pid_t pid;
+
+	fd = mfd_assert_new("kern_memfd_share_fork",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_assert_has_seals(fd, 0);
+
+	pid = spawn_idle_proc();
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	mfd_fail_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE);
+
+	join_idle_proc(pid);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_WRITE | SHMEM_SEAL_SHRINK);
+
+	mfd_assert_set_seals(fd, SHMEM_SEAL_GROW);
+	mfd_assert_has_seals(fd, SHMEM_SEAL_GROW);
+
+	mfd_assert_set_seals(fd, 0);
+	mfd_assert_has_seals(fd, 0);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+
+	printf("memfd: CREATE\n");
+	test_create();
+	printf("memfd: BASIC\n");
+	test_basic();
+
+	printf("memfd: SEAL-WRITE\n");
+	test_seal_write();
+	printf("memfd: SEAL-SHRINK\n");
+	test_seal_shrink();
+	printf("memfd: SEAL-GROW\n");
+	test_seal_grow();
+	printf("memfd: SEAL-RESIZE\n");
+	test_seal_resize();
+
+	printf("memfd: SHARE-DUP\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK\n");
+	test_share_fork();
+
+	/* Run test-suite in a multi-threaded environment with a shared
+	 * file-table. This triggers the slow-path in fdget() in the kernel. */
+	pid = spawn_idle_thread();
+	printf("memfd: SHARE-DUP (shared file-table)\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP (shared file-table)\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN (shared file-table)\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK (shared file-table)\n");
+	test_share_fork();
+	join_idle_thread(pid);
+
+	printf("memfd: DONE\n");
+
+	return 0;
+}
-- 
1.9.0
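
A note on the failure checks in these tests: in userspace, the libc wrappers
for write() and friends report failure by returning -1 and storing the error
code in errno. Comparing the return value against a negated errno constant
only matches when that constant happens to be 1. The sketch below shows the
convention with an error that is easy to provoke on any system (EBADF from
writing to a read-only descriptor) instead of the sealing-specific EPERM.
write_fails_with() and write_errno_demo() are hypothetical helpers, not part
of the patch, and the demo assumes /dev/null exists.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if write() on fd fails with exactly
 * expected_errno, 0 otherwise.
 */
int write_fails_with(int fd, int expected_errno)
{
	ssize_t l;

	errno = 0;
	l = write(fd, "data", 4);

	/* the libc wrapper reports failure as -1; the code is in errno */
	return l == -1 && errno == expected_errno;
}

/* Demo: a descriptor opened O_RDONLY rejects write() with EBADF. */
int write_errno_demo(void)
{
	int fd, ok;

	fd = open("/dev/null", O_RDONLY);
	/* setup precondition: /dev/null is assumed to be present */
	assert(fd >= 0);

	ok = write_fails_with(fd, EBADF);
	close(fd);

	return ok;
}
```

A sealing test would call the helper with EPERM on a descriptor whose file
carries the write seal.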



* [PATCH man-pages 5/6] fcntl.2: document SHMEM_SET/GET_SEALS commands
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

The SHMEM_GET_SEALS and SHMEM_SET_SEALS commands allow retrieving and
modifying the active set of seals on a file. They're only supported on
selected file-systems (currently shmfs) and are Linux-only.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 man2/fcntl.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index c010a49..53d55a5 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -57,6 +57,8 @@
 .\"     Document F_SETOWN_EX and F_GETOWN_EX
 .\" 2010-06-17, Michael Kerrisk
 .\"	Document F_SETPIPE_SZ and F_GETPIPE_SZ.
+.\" 2014-03-19, David Herrmann <dh.herrmann@gmail.com>
+.\"	Document SHMEM_SET_SEALS and SHMEM_GET_SEALS
 .\"
 .TH FCNTL 2 2014-02-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
@@ -1064,6 +1066,94 @@ of buffer space currently used to store data produces the error
 .BR F_GETPIPE_SZ " (\fIvoid\fP; since Linux 2.6.35)"
 Return (as the function result) the capacity of the pipe referred to by
 .IR fd .
+.SS File Sealing
+Sealing files limits the set of allowed operations on a given file. For each
+seal that is set on a file, a specific set of operations will fail with
+.B EPERM
+on this file from now on. The file is said to be sealed. A file does not have
+any seals set by default. Moreover, most filesystems do not support sealing
+(only shmfs implements it right now). The following seals are available:
+.RS
+.TP
+.BR SHMEM_SEAL_SHRINK
+If this seal is set, the file in question cannot be reduced in size. This
+affects
+.BR open (2)
+with the
+.B O_TRUNC
+flag and
+.BR ftruncate (2).
+They will fail with
+.B EPERM
+if you try to shrink the file in question. Increasing the file size is still
+possible.
+.TP
+.BR SHMEM_SEAL_GROW
+If this seal is set, the size of the file in question cannot be increased. This
+affects
+.BR write (2)
+if you write across size boundaries,
+.BR ftruncate (2)
+and
+.BR fallocate (2).
+These calls will fail with
+.B EPERM
+if you use them to increase the file size or write beyond size boundaries. If
+you keep the size or shrink it, those calls still work as expected.
+.TP
+.BR SHMEM_SEAL_WRITE
+If this seal is set, you cannot modify data contents of the file. Note that
+shrinking or growing the size of the file is still possible and allowed. Thus,
+this seal is normally used in combination with one of the other seals. This seal
+affects
+.BR write (2)
+and
+.BR fallocate (2)
+(only in combination with the
+.B FALLOC_FL_PUNCH_HOLE
+flag). Those calls will fail with
+.B EPERM
+if this seal is set. Furthermore, trying to create new memory-mappings via
+.BR mmap (2)
+in combination with
+.B MAP_SHARED
+will also fail with
+.BR EPERM .
+.RE
+.TP
+.BR SHMEM_SET_SEALS " (\fIint\fP; since Linux TBD)"
+Change the set of seals of the file referred to by
+.I fd
+to
+.IR arg .
+You are required to own an exclusive reference to the file in question in order
+to modify the seals. Otherwise, this call will fail with
+.BR EPERM .
+There is one exception: If no seals are set, this restriction does not apply and
+you can set seals even if you don't own an exclusive reference. However, no
+shared writable mapping may exist in either case; otherwise this call always
+fails with
+.BR EPERM .
+These semantics guarantee that once you have verified that a specific set of
+seals is set on a given file, nobody besides you (if you own an exclusive
+reference) can modify the seals anymore.
+
+You own an exclusive reference to a file if, and only if, the file-descriptor
+passed to
+.BR fcntl (2)
+is the only reference to the underlying inode. There must be no duplicates of
+this file-descriptor, no other open files for the same underlying inode, no
+hard links, and no active memory mappings.
+.TP
+.BR SHMEM_GET_SEALS " (\fIvoid\fP; since Linux TBD)"
+Return (as the function result) the current set of seals of the file referred to
+by
+.IR fd .
+If no seals are set, 0 is returned. If the file does not support sealing, -1 is
+returned and
+.I errno
+is set to
+.BR EINVAL .
 .SH RETURN VALUE
 For a successful call, the return value depends on the operation:
 .TP 0.9i
-- 
1.9.0
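
For readers who want to try sealing on a released kernel: the SHMEM_SET_SEALS
and SHMEM_GET_SEALS commands documented above are what this series proposes
and are not found in released headers. The sketch below swaps in the interface
that was later merged into mainline (Linux 3.17): memfd_create(name, flags)
with MFD_ALLOW_SEALING, plus the fcntl() commands F_ADD_SEALS and F_GET_SEALS.
It differs from this proposal in that seals can only be added, never removed.
The sketch assumes glibc 2.27 or later for the memfd_create() wrapper, and
sealed_memfd_demo() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo using the merged interface, not the SHMEM_* commands
 * from this patch. Returns 0 if size seals behave as described, -1 if not.
 */
int sealed_memfd_demo(void)
{
	int fd, seals, ok;

	/* sealing must be requested at creation time in the merged API */
	fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	ok = ftruncate(fd, 4096) == 0;

	/* forbid both shrinking and growing, then read the seals back */
	ok = ok && fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) == 0;
	seals = fcntl(fd, F_GET_SEALS);
	ok = ok && seals == (F_SEAL_SHRINK | F_SEAL_GROW);

	/* resizing in either direction now fails with EPERM */
	ok = ok && ftruncate(fd, 0) == -1 && errno == EPERM;
	ok = ok && ftruncate(fd, 8192) == -1 && errno == EPERM;

	/* no write seal was set, so writes inside the file still work */
	ok = ok && pwrite(fd, "x", 1, 0) == 1;

	close(fd);
	return ok ? 0 : -1;
}
```

A receiver of such a descriptor would call F_GET_SEALS first and refuse the
file unless the seals it depends on are present.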



* [PATCH man-pages 5/6] fcntl.2: document SHMEM_SET/GET_SEALS commands
@ 2014-03-19 19:06   ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Matthew Wilcox, Ryan Lortie, Hugh Dickins, Johannes Weiner,
	Kay Sievers, dri-devel, Daniel Mack, linux-mm, linux-fsdevel,
	Karol Lewandowski, Lennart Poettering, Greg Kroah-Hartman,
	Tejun Heo, Michael Kerrisk (man-pages),
	Andrew Morton, Linus Torvalds, Alexander Viro

The SHMEM_GET_SEALS and SHMEM_SET_SEALS commands allow retrieving and
modifying the active set of seals on a file. They're only supported on
selected file-systems (currently shmfs) and are linux-only.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 man2/fcntl.2 | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/man2/fcntl.2 b/man2/fcntl.2
index c010a49..53d55a5 100644
--- a/man2/fcntl.2
+++ b/man2/fcntl.2
@@ -57,6 +57,8 @@
 .\"     Document F_SETOWN_EX and F_GETOWN_EX
 .\" 2010-06-17, Michael Kerrisk
 .\"	Document F_SETPIPE_SZ and F_GETPIPE_SZ.
+.\" 2014-03-19, David Herrmann <dh.herrmann@gmail.com>
+.\"	Document SHMEM_SET_SEALS and SHMEM_GET_SEALS
 .\"
 .TH FCNTL 2 2014-02-20 "Linux" "Linux Programmer's Manual"
 .SH NAME
@@ -1064,6 +1066,94 @@ of buffer space currently used to store data produces the error
 .BR F_GETPIPE_SZ " (\fIvoid\fP; since Linux 2.6.35)"
 Return (as the function result) the capacity of the pipe referred to by
 .IR fd .
+.SS File Sealing
+Sealing files limits the set of allowed operations on a given file. For each
+seal that is set on a file, a specific set of operations will fail with
+.B EPERM
+on this file from now on. The file is said to be sealed. A file does not have
+any seals set by default. Moreover, most filesystems do not support sealing
+(only shmfs implements it right now). The following seals are available:
+.RS
+.TP
+.BR SHMEM_SEAL_SHRINK
+If this seal is set, the file in question cannot be reduced in size. This
+affects
+.BR open (2)
+with the
+.B O_TRUNC
+flag and
+.BR ftruncate (2).
+They will fail with
+.B EPERM
+if you try to shrink the file in question. Increasing the file size is still
+possible.
+.TP
+.BR SHMEM_SEAL_GROW
+If this seal is set, the size of the file in question cannot be increased. This
+affects
+.BR write (2)
+if you write across size boundaries,
+.BR ftruncate (2)
+and
+.BR fallocate (2).
+These calls will fail with
+.B EPERM
+if you use them to increase the file size or write beyond size boundaries. If
+you keep the size or shrink it, those calls still work as expected.
+.TP
+.BR SHMEM_SEAL_WRITE
+If this seal is set, you cannot modify data contents of the file. Note that
+shrinking or growing the size of the file is still possible and allowed. Thus,
+this seal is normally used in combination with one of the other seals. This seal
+affects
+.BR write (2)
+and
+.BR fallocate (2)
+(only in combination with the
+.B FALLOC_FL_PUNCH_HOLE
+flag). Those calls will fail with
+.B EPERM
+if this seal is set. Furthermore, trying to create new memory-mappings via
+.BR mmap (2)
+in combination with
+.B MAP_SHARED
+will also fail with
+.BR EPERM .
+.RE
+.TP
+.BR SHMEM_SET_SEALS " (\fIint\fP; since Linux TBD)"
+Change the set of seals of the file referred to by
+.I fd
+to
+.IR arg .
+You are required to own an exclusive reference to the file in question in order
+to modify the seals. Otherwise, this call will fail with
+.BR EPERM .
+There is one exception: if no seals are set, this restriction does not apply and
+you can set seals even if you don't own an exclusive reference. However, in any
+case, no shared writable mapping may exist; otherwise this call always
+fails with
+.BR EPERM .
+These semantics guarantee that once you have verified that a specific set of seals
+is set on a given file, nobody besides you (in case you own an exclusive reference)
+can modify the seals anymore.
+
+You own an exclusive reference to a file if, and only if, the file-descriptor
+passed to
+.BR fcntl (2)
+is the only reference to the underlying inode. There must be no duplicates
+of this file descriptor, no other open files referring to the same underlying
+inode, no hard links, and no active memory mappings.
+.TP
+.BR SHMEM_GET_SEALS " (\fIvoid\fP; since Linux TBD)"
+Return (as the function result) the current set of seals of the file referred to
+by
+.IR fd .
+If no seals are set, 0 is returned. If the file does not support sealing, -1 is
+returned and
+.I errno
+is set to
+.BR EINVAL .
 .SH RETURN VALUE
 For a successful call, the return value depends on the operation:
 .TP 0.9i
-- 
1.9.0
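
The seal rules documented in the fcntl.2 text above can be sketched as a
small userspace model in C. This is only an illustration of the documented
semantics, not kernel code and not the implementation from this series. The
numeric bit values and the helper names seal_check_resize(),
seal_check_write() and seal_check_set() are assumptions made for the
example; the man page names the seals but does not show their values. The
mmap(2)/MAP_SHARED rule of SHMEM_SEAL_WRITE is not modelled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Assumed bit values; the patch text only gives the names. */
enum {
    SHMEM_SEAL_SHRINK = 1 << 0, /* size cannot be reduced */
    SHMEM_SEAL_GROW   = 1 << 1, /* size cannot be increased */
    SHMEM_SEAL_WRITE  = 1 << 2, /* data contents cannot be modified */
};

/* Model of a size change (ftruncate(2), open(2) with O_TRUNC). */
int seal_check_resize(unsigned seals, long old_size, long new_size)
{
    assert(old_size >= 0 && new_size >= 0);
    if (new_size < old_size && (seals & SHMEM_SEAL_SHRINK))
        return -EPERM;
    if (new_size > old_size && (seals & SHMEM_SEAL_GROW))
        return -EPERM;
    return 0; /* keeping the size is always allowed */
}

/* Model of write(2) of len bytes at offset off into a file of `size` bytes. */
int seal_check_write(unsigned seals, long size, long off, long len)
{
    if (seals & SHMEM_SEAL_WRITE)
        return -EPERM; /* no modification of contents at all */
    if (off + len > size && (seals & SHMEM_SEAL_GROW))
        return -EPERM; /* write would extend the file */
    return 0;
}

/* Model of the SHMEM_SET_SEALS precondition. */
int seal_check_set(unsigned cur_seals, bool exclusive, bool shared_writable_map)
{
    if (shared_writable_map)
        return -EPERM; /* never allowed while such a mapping exists */
    if (cur_seals != 0 && !exclusive)
        return -EPERM; /* sealed files need an exclusive reference */
    return 0;          /* unsealed files may be sealed by any holder */
}
```

For instance, seal_check_resize(SHMEM_SEAL_SHRINK, 4096, 1024) is rejected
with -EPERM, while growing the same file from 4096 to 8192 bytes is allowed,
which is the property the compositor use-case from the cover letter relies
on.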

^ permalink raw reply related	[flat|nested] 123+ messages in thread

* [PATCH man-pages 6/6] memfd_create.2: add memfd_create() man-page
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-19 19:06   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-19 19:06 UTC (permalink / raw)
  To: linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	David Herrmann

The memfd_create() syscall creates anonymous files similar to O_TMPFILE
but does not require an active mount-point.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 man2/memfd_create.2 | 110 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 110 insertions(+)
 create mode 100644 man2/memfd_create.2

diff --git a/man2/memfd_create.2 b/man2/memfd_create.2
new file mode 100644
index 0000000..3e362e0
--- /dev/null
+++ b/man2/memfd_create.2
@@ -0,0 +1,110 @@
+.\" Copyright (C) 2014 David Herrmann <dh.herrmann@gmail.com>
+.\" starting from a version by Michael Kerrisk <mtk.manpages@gmail.com>
+.\"
+.\" %%%LICENSE_START(GPLv2+_SW_3_PARA)
+.\" This program is free software; you can redistribute it and/or modify
+.\" it under the terms of the GNU General Public License as published by
+.\" the Free Software Foundation; either version 2 of the License, or
+.\" (at your option) any later version.
+.\"
+.\" This program is distributed in the hope that it will be useful,
+.\" but WITHOUT ANY WARRANTY; without even the implied warranty of
+.\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+.\" GNU General Public License for more details.
+.\"
+.\" You should have received a copy of the GNU General Public
+.\" License along with this manual; if not, see
+.\" <http://www.gnu.org/licenses/>.
+.\" %%%LICENSE_END
+.\"
+.TH MEMFD_CREATE 2 2014-03-18 Linux "Linux Programmer's Manual"
+.SH NAME
+memfd_create \- create an anonymous file
+.SH SYNOPSIS
+.B #include <sys/memfd.h>
+.sp
+.BI "int memfd_create(const char *" name ", u64 " size ", u64 " flags ");"
+.SH DESCRIPTION
+.BR memfd_create ()
+creates an anonymous file and returns a file descriptor that refers to it. The file
+behaves like a regular file, so it can be modified, truncated, memory-mapped, and so on.
+However, unlike a regular file, it lives in main memory and has no non-volatile
+backing storage. Once all references to the file are dropped, it is
+automatically released. Like all shmem-based files, memfd files support
+.BR SHMEM
+sealing parameters. See
+.BR SHMEM_SET_SEALS " with " fcntl (2)
+for more information.
+
+The initial size of the file is set to
+.IR size ". " name
+is used as the internal file name and will appear as such in
+.IR /proc/self/fd/ .
+The name is always prefixed with
+.BR memfd:
+and serves only for debugging purposes.
+
+The following values may be bitwise ORed in
+.IR flags
+to change the behaviour of
+.BR memfd_create ():
+.TP
+.BR MFD_CLOEXEC
+Set the close-on-exec
+.RB ( FD_CLOEXEC )
+flag on the new file descriptor.
+See the description of the
+.B O_CLOEXEC
+flag in
+.BR open (2)
+for reasons why this may be useful.
+.PP
+Unused bits must be cleared to 0.
+
+As its return value,
+.BR memfd_create ()
+returns a new file descriptor that can be used to refer to the file.
+A copy of the file descriptor created by
+.BR memfd_create ()
+is inherited by the child produced by
+.BR fork (2).
+The duplicate file descriptor is associated with the same file.
+File descriptors created by
+.BR memfd_create ()
+are preserved across
+.BR execve (2),
+unless the close-on-exec flag has been set.
+.SH RETURN VALUE
+On success,
+.BR memfd_create ()
+returns a new file descriptor.
+On error, \-1 is returned and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EINVAL
+An unsupported value was specified in one of the arguments.
+.TP
+.B EMFILE
+The per-process limit on open file descriptors has been reached.
+.TP
+.B ENFILE
+The system-wide limit on the total number of open files has been
+reached.
+.TP
+.B EFAULT
+The address given in
+.IR name
+points to invalid memory.
+.TP
+.B ENOMEM
+There was insufficient memory to create a new anonymous file.
+.SH VERSIONS
+to-be-defined
+.SH CONFORMING TO
+.BR memfd_create ()
+is Linux-specific.
+.SH SEE ALSO
+.BR shmget (2),
+.BR fcntl (2)
-- 
1.9.0
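
For readers who want to try the idea on a current system, here is a hedged
C sketch. Note that it does NOT use the interface proposed in this patch:
the three-argument memfd_create(name, size, flags), <sys/memfd.h> and
SHMEM_SET_SEALS are what this series documents, whereas the code below
assumes the interface that later landed in mainline Linux, namely
memfd_create(name, flags) with MFD_ALLOW_SEALING, the file sized with
ftruncate(2), and seals added with fcntl(2) F_ADD_SEALS. The helper name
make_sealed_buffer() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create an anonymous memory-backed file of `size` bytes whose size can
 * no longer change. Returns a file descriptor, or -1 with errno set.
 * Uses the later mainline API, not the one proposed in this patch.
 */
int make_sealed_buffer(const char *name, off_t size)
{
    assert(name != NULL && size >= 0);

    /* Sealing must be requested explicitly at creation time. */
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    /* The file starts empty; give it its final size, then seal the size. */
    if (ftruncate(fd, size) < 0 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

A receiver of such a descriptor can query the seals with
fcntl(fd, F_GET_SEALS) and check for F_SEAL_SHRINK before mapping the file,
which corresponds to the SHMEM_GET_SEALS check described in this series.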


^ permalink raw reply related	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-19 19:06 ` David Herrmann
  (?)
  (?)
@ 2014-03-20  2:55   ` Greg Kroah-Hartman
  -1 siblings, 0 replies; 123+ messages in thread
From: Greg Kroah-Hartman @ 2014-03-20  2:55 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Hugh Dickins, Alexander Viro, Matthew Wilcox,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Tejun Heo, Johannes Weiner,
	dri-devel, linux-fsdevel, linux-mm, Andrew Morton,
	Linus Torvalds, Ryan Lortie, Michael Kerrisk (man-pages)

On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
> Hi
> 
> This series introduces the concept of "file sealing". Sealing a file restricts
> the set of allowed operations on the file in question. Multiple seals are
> defined and each seal will cause a different set of operations to return EPERM
> if it is set. The following seals are introduced:
> 
>  * SEAL_SHRINK: If set, the inode size cannot be reduced
>  * SEAL_GROW: If set, the inode size cannot be increased
>  * SEAL_WRITE: If set, the file content cannot be modified
> 
> Unlike existing techniques that provide similar protection, sealing allows
> file-sharing without any trust-relationship. This is enforced by rejecting seal
> modifications if you don't own an exclusive reference to the given file. So if
> you own a file-descriptor, you can be sure that no-one besides you can modify
> the seals on the given file. This allows mapping shared files from untrusted
> parties without the fear of the file getting truncated or modified by an
> attacker.
> 
> Several use-cases exist that could make great use of sealing:
> 
>   1) Graphics Compositors
>      If a graphics client creates a memory-backed render-buffer and passes a
>      file-descriptor to it to the graphics server for display, the server
>      _has_ to set up SIGBUS handlers whenever mapping the given file. Otherwise,
>      the client might run ftruncate() or O_TRUNC on the file in parallel,
>      thus crashing the server.
>      With sealing, a compositor can reject any incoming file-descriptor that
>      does _not_ have SEAL_SHRINK set. This way, any memory-mappings are
>      guaranteed to stay accessible. Furthermore, we still allow clients to
>      increase the buffer-size in case they want to resize the render-buffer for
>      the next frame. We also allow parallel writes so the client can render new
>      frames into the same buffer (client is responsible for never rendering into
>      a front-buffer if you want to avoid artifacts).
> 
>      Real use-case: Wayland wl_shm buffers can be transparently converted

Very nice, the Enlightenment developers have been asking for something
like this for a while, it should help them out a lot as well.

And thanks for the man pages and test code, if only all new apis came
with that already...

greg k-h

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-19 19:06 ` David Herrmann
@ 2014-03-20  3:49   ` Linus Torvalds
  -1 siblings, 0 replies; 123+ messages in thread
From: Linus Torvalds @ 2014-03-20  3:49 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linux Kernel Mailing List, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, John Stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, DRI,
	linux-fsdevel, linux-mm, Andrew Morton, Ryan Lortie,
	Michael Kerrisk (man-pages)

On Wed, Mar 19, 2014 at 12:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>
> Unlike existing techniques that provide similar protection, sealing allows
> file-sharing without any trust-relationship. This is enforced by rejecting seal
> modifications if you don't own an exclusive reference to the given file.

I like the concept, but I really hate that "exclusive reference"
approach. I see why you did it, but I also worry that it means that
people can open random shm files that are *not* expected to be sealed,
and screw up applications that don't expect it.

Is there really any use-case where the sealer isn't also the same
thing that *created* the file in the first place? Because I would be a
ton happier with the notion that you can only seal things that you
yourself created. At that point, the exclusive reference isn't such a
big deal any more, but more importantly, you can't play random
denial-of-service games on files that aren't really yours.

The fact that you bring up the races involved with the exclusive
reference approach also just makes me go "Is that really the correct
security model"?

                   Linus

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20  3:49   ` Linus Torvalds
  (?)
  (?)
@ 2014-03-20  8:07     ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20  8:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Linux Kernel Mailing List, Hugh Dickins, Alexander Viro,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, DRI, linux-fsdevel, linux-mm,
	Andrew Morton, Ryan Lortie, Michael Kerrisk (man-pages)

Hi

On Thu, Mar 20, 2014 at 4:49 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Is there really any use-case where the sealer isn't also the same
> thing that *created* the file in the first place? Because I would be a
> ton happier with the notion that you can only seal things that you
> yourself created. At that point, the exclusive reference isn't such a
> big deal any more, but more importantly, you can't play random
> denial-of-service games on files that aren't really yours.

My first idea was to add MFD_ALLOW_SEALING as a memfd_create() flag,
which enables the sealing-API for that file. Then I looked at POSIX
mandatory locking and noticed that it provides similar restrictions on
_all_ files. Mandatory locks can be more easily removed, but an
attacker could just re-apply them in a loop, so that's not really an
argument. Furthermore, sealing requires _write_ access so I wonder
what kind of DoS attacks are possible with sealing that are not
already possible with write access? And sealing is only possible if no
writable, shared mapping exists. So even if an attacker seals a file,
all that happens is EPERM, not SIGBUS (still a possible
denial-of-service scenario).

But I understand that it is quite hard to review all the possible
scenarios. So I'm fine with checking inode-ownership permissions for
SET_SEALS. We could also make sealing a one-shot operation. Given that
in a no-trust situation there is never a guarantee that the other side
drops its references, re-using a sealed file is usually not possible.
However, in sane environments, this could be a nice optimization in
case the other side plays along. The one-shot semantics would allow
dropping reference-checks entirely. The inode-ownership semantics
would still require it.

Thanks
David
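
A minimal sketch of the seal-then-share pattern discussed in this thread. It
is written against the interface that was later merged into mainline Linux
(memfd_create() with MFD_ALLOW_SEALING, plus fcntl() with F_ADD_SEALS and
F_GET_SEALS), which is not necessarily identical to what this patch series
proposes. The helper names make_unshrinkable_buffer() and has_shrink_seal()
are hypothetical, and a glibc that ships the memfd_create() wrapper is
assumed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sender side (hypothetical helper): create a memfd of `size` bytes and
 * forbid shrinking it. Returns the fd, or -1 with errno set.
 */
static int make_unshrinkable_buffer(const char *name, off_t size)
{
    assert(size > 0);

    /* Sealing must be requested at creation time. */
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, size) < 0 ||
        fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
        int saved = errno;   /* keep the original error across close() */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/*
 * Receiver side (hypothetical helper): returns 1 only if the file can no
 * longer be shrunk, 0 otherwise (including fds that do not support seals).
 */
static int has_shrink_seal(int fd)
{
    int seals = fcntl(fd, F_GET_SEALS);
    return seals >= 0 && (seals & F_SEAL_SHRINK) != 0;
}
```

A compositor-style receiver would call has_shrink_seal() on an incoming
file-descriptor before mapping it and reject the buffer otherwise. On the
sealed file, a shrinking ftruncate() fails with EPERM while growing it is
still allowed.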

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-19 19:06   ` David Herrmann
@ 2014-03-20  8:47     ` Cyrill Gorcunov
  -1 siblings, 0 replies; 123+ messages in thread
From: Cyrill Gorcunov @ 2014-03-20  8:47 UTC (permalink / raw)
  To: David Herrmann, Pavel Emelyanov
  Cc: linux-kernel, Hugh Dickins, Alexander Viro, Matthew Wilcox,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
> 
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as a regular file. Sealing
> is explicitly supported on memfds.
> 
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and the like.

If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
feature, Pavel?

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-20  8:47     ` Cyrill Gorcunov
@ 2014-03-20  9:01       ` Pavel Emelyanov
  -1 siblings, 0 replies; 123+ messages in thread
From: Pavel Emelyanov @ 2014-03-20  9:01 UTC (permalink / raw)
  To: Cyrill Gorcunov, David Herrmann
  Cc: linux-kernel, Hugh Dickins, Alexander Viro, Matthew Wilcox,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
> On Wed, Mar 19, 2014 at 08:06:48PM +0100, David Herrmann wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as a regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and the like.
> 
> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
> feature, Pavel?

Thanks, Cyrill.

It is, but the map_files will work "in the opposite direction" :) In the memfd
case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
should first mmap() a region, then open it via /proc/self/map_files.

But I don't know whether this matters.

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-20  9:01       ` Pavel Emelyanov
  (?)
  (?)
@ 2014-03-20 11:29         ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20 11:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, linux-kernel, Hugh Dickins, Alexander Viro,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

Hi

On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>> feature, Pavel?
>
> It is, but the map_files will work "in the opposite direction" :) In the memfd
> case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
> should first mmap() a region, then open it via /proc/self/map_files.
>
> But I don't know whether this matters.

Yes, you can replace memfd_create() so far with:
  p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
  sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
  fd = open(path, O_RDWR);

However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
/proc/pid/map_files/ directory is root-only (at least I get EPERM if
non-root), it doesn't provide the "name" argument which is very handy
for debugging, it doesn't explicitly support sealing (it requires
MAP_ANON to be backed by shmem) and it's a very weird API for
something this simple.

Thanks
David
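
For comparison with the map_files snippet above, the "FD first, then mmap()"
direction can be sketched as follows. This assumes the memfd_create()
interface as it was later merged into mainline (name plus flags) and a glibc
that provides the wrapper. The helper name map_memfd() is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a named anonymous file, size it, and map it
 * shared. On success returns the mapping and stores the fd in *fd_out; on
 * failure returns MAP_FAILED and leaves *fd_out untouched.
 */
static void *map_memfd(const char *name, size_t size, int *fd_out)
{
    assert(fd_out != NULL);

    /* The name is only a debugging aid (it shows up under /proc/<pid>/fd). */
    int fd = memfd_create(name, MFD_CLOEXEC);
    if (fd < 0)
        return MAP_FAILED;

    /* A fresh memfd has size 0, so it has to be grown before mapping. */
    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return MAP_FAILED;
    }

    *fd_out = fd;
    return p;
}
```

Because the mapping is MAP_SHARED, data stored through the returned pointer is
visible to read() or pread() on the same file-descriptor, and to any process
the descriptor is passed to.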

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
@ 2014-03-20 11:29         ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20 11:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: Cyrill Gorcunov, linux-kernel, Hugh Dickins, Alexander Viro,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

Hi

On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>> feature, Pavel?
>
> It is, but the map_files will work "in the opposite direction" :) In the memfd
> case one first gets an FD, then mmap()s it; in the /proc/pis/map_files case one
> should first mmap() a region, then open it via /proc/self/map_files.
>
> But I don't know whether this matters.

Yes, you can replace memfd_create() so far with:
  p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
  sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
  fd = open(path, O_RDWR);

However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
/proc/pid/map_files/ directory is root-only (at least I get EPERM if
non-root), it doesn't provide the "name" argument which is very handy
for debugging, it doesn't explicitly support sealing (it requires
MAP_ANON to be backed by shmem) and it's a very weird API for
something this simple.

Thanks
David

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
@ 2014-03-20 11:29         ` David Herrmann
  0 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20 11:29 UTC (permalink / raw)
  To: Pavel Emelyanov
  Cc: linux-mm, Ryan Lortie, Hugh Dickins, Johannes Weiner,
	Kay Sievers, linux-kernel, dri-devel, Daniel Mack,
	Cyrill Gorcunov, linux-fsdevel, Karol Lewandowski,
	Lennart Poettering, Greg Kroah-Hartman, Tejun Heo,
	Michael Kerrisk (man-pages),
	Andrew Morton, Linus Torvalds, Alexander Viro

Hi

On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>> feature, Pavel?
>
> It is, but the map_files will work "in the opposite direction" :) In the memfd
> case one first gets an FD, then mmap()s it; in the /proc/pis/map_files case one
> should first mmap() a region, then open it via /proc/self/map_files.
>
> But I don't know whether this matters.

Yes, you can replace memfd_create() so far with:
  p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
  sprintf(path, "/proc/self/map_files/%lx-%lx", p, p + size);
  fd = open(path, O_RDWR);

However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
/proc/pid/map_files/ directory is root-only (at least I get EPERM if
non-root), it doesn't provide the "name" argument which is very handy
for debugging, it doesn't explicitly support sealing (it requires
MAP_ANON to be backed by shmem) and it's a very weird API for
something this simple.

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-20 11:29         ` David Herrmann
@ 2014-03-20 11:50           ` Pavel Emelyanov
  -1 siblings, 0 replies; 123+ messages in thread
From: Pavel Emelyanov @ 2014-03-20 11:50 UTC (permalink / raw)
  To: David Herrmann
  Cc: Cyrill Gorcunov, linux-kernel, Hugh Dickins, Alexander Viro,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 03/20/2014 03:29 PM, David Herrmann wrote:
> Hi
> 
> On Thu, Mar 20, 2014 at 10:01 AM, Pavel Emelyanov <xemul@parallels.com> wrote:
>> On 03/20/2014 12:47 PM, Cyrill Gorcunov wrote:
>>> If I'm not mistaken in something obvious, this looks similar to /proc/pid/map_files
>>> feature, Pavel?
>>
>> It is, but the map_files will work "in the opposite direction" :) In the memfd
>> case one first gets an FD, then mmap()s it; in the /proc/pid/map_files case one
>> should first mmap() a region, then open it via /proc/self/map_files.
>>
>> But I don't know whether this matters.
> 
> Yes, you can replace memfd_create() so far with:
>   p = mmap(NULL, size, ..., MAP_ANON | MAP_SHARED, -1, 0);
>   sprintf(path, "/proc/self/map_files/%lx-%lx",
>           (unsigned long)p, (unsigned long)p + size);
>   fd = open(path, O_RDWR);
> 
> However, map_files is only enabled with CONFIG_CHECKPOINT_RESTORE, the
> /proc/pid/map_files/ directory is root-only (at least I get EPERM if
> non-root),

Yes. But this is something we'd also like to have fixed :) Having two
parties wanting the same thing makes it easier for the patch to get accepted.

> it doesn't provide the "name" argument which is very handy
> for debugging,

What if we make mmap's shmem_zero_setup() generate a meaningful name,
would it solve the debugging issue?

> it doesn't explicitly support sealing (it requires MAP_ANON to be backed 
> by shmem)

Can you elaborate on this? The fd generated by memfd_create() will be
shmem-backed, and so will the file opened via the map_files link for the
MAP_ANON | MAP_SHARED mapping. So what prevents it from supporting
sealing?

> and it's a very weird API for something this simple.

:)

Thanks,
Pavel

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20  8:07     ` David Herrmann
  (?)
@ 2014-03-20 14:41       ` One Thousand Gnomes
  -1 siblings, 0 replies; 123+ messages in thread
From: One Thousand Gnomes @ 2014-03-20 14:41 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linus Torvalds, Linux Kernel Mailing List, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, John Stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, DRI,
	linux-fsdevel, linux-mm, Andrew Morton, Ryan Lortie,
	Michael Kerrisk (man-pages)

> My first idea was to add MFD_ALLOW_SEALING as memfd_create() flag,
> which enables the sealing-API for that file. Then I looked at POSIX

This actually seems the most sensible to me. The reason being that if I
have some existing used object there is no way on earth I can be sure who
has existing references to it, and we don't have revoke() to fix that.

So it pretty much has to be a new object in a sane programming model.
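
As a rough illustration of the opt-in model, the sketch below uses the
interface as it was later merged (Linux 3.17: MFD_ALLOW_SEALING,
F_ADD_SEALS, F_SEAL_SHRINK), which is not the naming used in this series.
The helper try_seal_new_memfd() and the name "seal-demo" are assumptions
made for the example. It shows that only an object created with the flag
accepts seals.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: create a fresh memfd with `flags` and try to add
 * F_SEAL_SHRINK to it. Returns 0 if the seal was accepted, the errno
 * value from fcntl() if it was rejected, or -1 if the memfd could not
 * be created at all. */
static int try_seal_new_memfd(unsigned int flags)
{
	int fd = memfd_create("seal-demo", flags);
	if (fd < 0)
		return -1;

	int err = 0;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) < 0)
		err = errno;

	close(fd);
	return err;
}
```

In the merged interface a memfd created without MFD_ALLOW_SEALING starts
out with F_SEAL_SEAL set, so the second call pattern (no flag) is rejected
with EPERM while the first (flag set) succeeds.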

> mandatory locking and noticed that it provides similar restrictions on
> _all_ files. Mandatory locks can be more easily removed, but an

The fact someone got it past a standards body doesn't make it a good idea.

> attacker could just re-apply them in a loop, so that's not really an
> argument. Furthermore, sealing requires _write_ access so I wonder
> what kind of DoS attacks are possible with sealing that are not
> already possible with write access? And sealing is only possible if no
> writable, shared mapping exists. So even if an attacker seals a file,
> all that happens is EPERM, not SIGBUS (still a possible
> denial-of-service scenario).

I think you want two things at minimum

owner to seal
root can always override

I would query the name too. Right now your assumption is 'shmem only' but
that might change with other future use cases or types (e.g. some driver
file handles) so SHMEM_ in the fcntl might become misleading.

Whether you want some way to undo a seal without an exclusive reference as
the file owner is another question.

Alan

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 14:41       ` One Thousand Gnomes
  (?)
  (?)
@ 2014-03-20 15:12         ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20 15:12 UTC (permalink / raw)
  To: One Thousand Gnomes
  Cc: Linus Torvalds, Linux Kernel Mailing List, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, John Stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, DRI,
	linux-fsdevel, linux-mm, Andrew Morton, Ryan Lortie,
	Michael Kerrisk (man-pages)

Hi

On Thu, Mar 20, 2014 at 3:41 PM, One Thousand Gnomes
<gnomes@lxorguk.ukuu.org.uk> wrote:
> I think you want two things at minimum
>
> owner to seal
> root can always override

Why should root be allowed to override?

> I would query the name too. Right now your assumption is 'shmem only' but
> that might change with other future use cases or types (eg some driver
> file handles) so SHMEM_ in the fcntl might become misleading.

I'm fine with F_SET/GET_SEALS. But given you suggested requiring
MFD_ALLOW_SEALING for sealing, I don't see why we couldn't limit this
interface entirely to memfd_create().

> Whether you want some way to undo a seal without an exclusive reference as
> the file owner is another question.

No. You are never allowed to undo a seal except with an exclusive
reference. This interface was created for situations _without_ any
trust relationship. So if the owner is allowed to undo seals, the
interface doesn't make any sense. The only options I see are to not
allow un-sealing at all (which I'm fine with) or tracking users (which
is way too much overhead).
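
To make the no-trust scenario concrete, here is a small sketch using the
interface as later merged in Linux 3.17 (F_ADD_SEALS, F_GET_SEALS and the
F_SEAL_* constants), which only allows adding seals and offers no way to
remove them. It is not the fcntl naming proposed in this series. The
helper names, REQUIRED_SEALS and "sealed-demo" are assumptions for
illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define REQUIRED_SEALS (F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE)

/* Sender side (hypothetical): create a 4096-byte buffer, fill it, and
 * seal it before handing the fd to an untrusted peer. */
static int make_sealed_buffer(void)
{
	int fd = memfd_create("sealed-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0 ||
	    write(fd, "data", 4) != 4 ||
	    fcntl(fd, F_ADD_SEALS, REQUIRED_SEALS) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Receiver side (hypothetical): returns 1 if every required seal is set,
 * 0 if one is missing, -1 if the fd does not support sealing. */
static int has_required_seals(int fd)
{
	int seals = fcntl(fd, F_GET_SEALS);
	if (seals < 0)
		return -1;
	return (seals & REQUIRED_SEALS) == REQUIRED_SEALS;
}

/* Returns the errno of a one-byte write attempt, or 0 if it succeeded. */
static int write_errno(int fd)
{
	return pwrite(fd, "x", 1, 0) < 0 ? errno : 0;
}

/* Returns the errno of a truncate-to-zero attempt, or 0 if it succeeded. */
static int shrink_errno(int fd)
{
	return ftruncate(fd, 0) < 0 ? errno : 0;
}
```

A receiver would call has_required_seals() before mmap()ing the buffer and
reject the fd otherwise; once the seals are set, writes and truncation
fail with EPERM for every holder of the file, including the creator.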

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 15:12         ` David Herrmann
  (?)
@ 2014-03-20 15:26           ` One Thousand Gnomes
  -1 siblings, 0 replies; 123+ messages in thread
From: One Thousand Gnomes @ 2014-03-20 15:26 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linus Torvalds, Linux Kernel Mailing List, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, John Stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, DRI,
	linux-fsdevel, linux-mm, Andrew Morton, Ryan Lortie,
	Michael Kerrisk (man-pages)

On Thu, 20 Mar 2014 16:12:54 +0100
David Herrmann <dh.herrmann@gmail.com> wrote:

> Hi
> 
> On Thu, Mar 20, 2014 at 3:41 PM, One Thousand Gnomes
> <gnomes@lxorguk.ukuu.org.uk> wrote:
> > I think you want two things at minimum
> >
> > owner to seal
> > root can always override
> 
> Why should root be allowed to override?

Because root can already override it by, say, mmapping the kernel memory
and patching. It also tends to be valuable for debugging horrible
problems with complex systems.

Imposing fake restrictions on root just causes annoyance when doing stuff
like debugging but doesn't actually change the security situation.
> 
> I'm fine with F_SET/GET_SEALS. But given you suggested requiring
> MFD_ALLOW_SEALING for sealing, I don't see why we couldn't limit this
> interface entirely to memfd_create().

But if someone does find a non-memfd use for it, then it's useful not to
have to go "this fcntl for memfd, that fcntl for the other".

Just planning ahead.


> > Whether you want some way to undo a seal without an exclusive reference as
> > the file owner is another question.
> 
> No. You are never allowed to undo a seal except with an exclusive
> reference. This interface was created for situations _without_ any
> trust relationship. So if the owner is allowed to undo seals, the
> interface doesn't make any sense. The only options I see are to not
> allow un-sealing at all (which I'm fine with) or tracking users (which
> is way too much overhead).

Ok - that makes sense

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-19 19:06 ` David Herrmann
@ 2014-03-20 15:32   ` tytso
  -1 siblings, 0 replies; 123+ messages in thread
From: tytso @ 2014-03-20 15:32 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Hugh Dickins, Alexander Viro, Matthew Wilcox,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg,
	john.stultz, Greg Kroah-Hartman, Tejun Heo, Johannes Weiner,
	dri-devel, linux-fsdevel, linux-mm, Andrew Morton,
	Linus Torvalds, Ryan Lortie, Michael Kerrisk (man-pages)

On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
> 
> This series introduces the concept of "file sealing". Sealing a file restricts
> the set of allowed operations on the file in question. Multiple seals are
> defined and each seal will cause a different set of operations to return EPERM
> if it is set. The following seals are introduced:
> 
>  * SEAL_SHRINK: If set, the inode size cannot be reduced
>  * SEAL_GROW: If set, the inode size cannot be increased
>  * SEAL_WRITE: If set, the file content cannot be modified

Looking at your patches, and what files you are modifying, you are
enforcing this in the low-level file system.

Why not make sealing an attribute of the "struct file", and enforce it
at the VFS layer?  That way all file system objects would have access
to the sealing interface, and since for memfd_shmem you can't get
another struct file pointing at the object, the security properties
would be identical.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 15:32   ` tytso
  (?)
@ 2014-03-20 15:39   ` One Thousand Gnomes
  -1 siblings, 0 replies; 123+ messages in thread
From: One Thousand Gnomes @ 2014-03-20 15:39 UTC (permalink / raw)
  To: tytso
  Cc: David Herrmann, linux-kernel, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, john.stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, 20 Mar 2014 11:32:51 -0400
tytso@mit.edu wrote:

> On Wed, Mar 19, 2014 at 08:06:45PM +0100, David Herrmann wrote:
> > 
> > This series introduces the concept of "file sealing". Sealing a file restricts
> > the set of allowed operations on the file in question. Multiple seals are
> > defined and each seal will cause a different set of operations to return EPERM
> > if it is set. The following seals are introduced:
> > 
> >  * SEAL_SHRINK: If set, the inode size cannot be reduced
> >  * SEAL_GROW: If set, the inode size cannot be increased
> >  * SEAL_WRITE: If set, the file content cannot be modified
> 
> Looking at your patches, and what files you are modifying, you are
> enforcing this in the low-level file system.
> 
> Why not make sealing an attribute of the "struct file", and enforce it
> at the VFS layer?  That way all file system objects would have access
> to sealing interface, and for memfd_shmem, you can't get another
> struct file pointing at the object, the security properties would be
> identical.

Would it be more sensible to have a "sealer", a "device" to which you
give a file handle and which gives you back a sealable one?

So for the memfd case you'd create a private handle, pass it to the
sealer, and then pass the sealer handles to everyone else.

You have to implicitly trust the creator of the object has
- handed you the object you expect
- sealed it

so that appears no weaker, but it means you can meaningfully create
sealed versions of arbitrary objects and, if you want, keep non-sealed
ones around. It is up to the creator whether, for example, to simply
close the unsealed one immediately afterwards.

Alan

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 15:32   ` tytso
@ 2014-03-20 15:48     ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-03-20 15:48 UTC (permalink / raw)
  To: tytso, David Herrmann, linux-kernel, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, Tejun Heo,
	Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

Hi

On Thu, Mar 20, 2014 at 4:32 PM,  <tytso@mit.edu> wrote:
> Why not make sealing an attribute of the "struct file", and enforce it
> at the VFS layer?  That way all file system objects would have access
> to sealing interface, and for memfd_shmem, you can't get another
> struct file pointing at the object, the security properties would be
> identical.

Sealing as introduced here is an inode-attribute, not "struct file".
This is intentional. For instance, a gfx-client can get a read-only FD
via /proc/self/fd/ and pass it to the compositor so it can never
overwrite the contents (unless the compositor has write-access to the
inode itself, in which case it can just re-open it read-write).
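
As a rough illustration of the /proc/self/fd re-open described above
(this is not code from the patch series), a client could derive a
read-only descriptor along the following lines. The helper name
reopen_readonly() is hypothetical, and the sketch assumes a Linux system
with /proc mounted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>

/*
 * Hypothetical helper: re-open the file behind `fd` read-only by going
 * through its /proc/self/fd/<n> magic link. The result is a new open file
 * description, so its O_RDONLY access mode is independent of `fd`.
 * Returns the new descriptor, or -1 with errno set.
 */
int reopen_readonly(int fd)
{
    char path[64];
    int len = snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);

    /* "/proc/self/fd/" plus a decimal int always fits in 64 bytes. */
    assert(len > 0 && (size_t)len < sizeof(path));
    return open(path, O_RDONLY | O_CLOEXEC);
}
```

A receiver of such a descriptor can only write to the file if it is
itself able to re-open the inode with write access, which is the caveat
noted in the parentheses above.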

Furthermore, I don't see any use-case besides memfd for sealing, so I
purposely avoided changing core VFS interfaces. Protecting
page-allocation/access for SEAL_WRITE like I do in shmem.c is not that
easy to do generically. So if we moved this interface to "struct
inode", all that would change is moving "u32 seals;" from one struct
to the other. Ok, some protections might get easily implemented
generically, but without proper insight into the underlying
implementation, I couldn't verify all paths and possible races. Isn't
keeping the API generic enough for now? Changing the underlying
implementation can be done once we know what we want.

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 15:48     ` David Herrmann
@ 2014-03-20 16:38       ` tytso
  -1 siblings, 0 replies; 123+ messages in thread
From: tytso @ 2014-03-20 16:38 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Hugh Dickins, Alexander Viro, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering, John Stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
> On Thu, Mar 20, 2014 at 4:32 PM,  <tytso@mit.edu> wrote:
> > Why not make sealing an attribute of the "struct file", and enforce it
> > at the VFS layer?  That way all file system objects would have access
> > to sealing interface, and for memfd_shmem, you can't get another
> > struct file pointing at the object, the security properties would be
> > identical.
> 
> Sealing as introduced here is an inode-attribute, not "struct file".
> This is intentional. For instance, a gfx-client can get a read-only FD
> via /proc/self/fd/ and pass it to the compositor so it can never
> overwrite the contents (unless the compositor has write-access to the
> inode itself, in which case it can just re-open it read-write).

Hmm, good point.  I had forgotten about the /proc/self/fd hole.
Hmm... what if we have a SEAL_PROC which forces the permissions of
/proc/self/fd to be 000?

So if it is a property of the inode, SEAL_WRITE and SEAL_GROW are
basically equivalent to using chattr to set the immutable and
append-only attributes, except for the "you can't undo the seal unless
you have exclusive access to the inode" magic.

That does make it pretty memfd_create specific and not a very general
API, which is a little unfortunate; hence I'm trying to explore ways
of making it a bit more generic and hopefully useful for more use
cases.

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-19 19:06   ` David Herrmann
@ 2014-03-20 19:22     ` John Stultz
  -1 siblings, 0 replies; 123+ messages in thread
From: John Stultz @ 2014-03-20 19:22 UTC (permalink / raw)
  To: David Herrmann, linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, Greg Kroah-Hartman, Tejun Heo,
	Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages),
	Colin Cross

On 03/19/2014 12:06 PM, David Herrmann wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. Sealing
> is explicitly supported on memfds.
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.

This syscall would also be useful to Android, since it would satisfy the
requirement for providing atomically unlinked tmpfs fds that ashmem
provides (although upstreamed solutions to ashmem's other
functionalities are still needed).

My only comment is that I think memfd_* is sort of a new namespace.
Since this is providing shmem files, it seems it might be better named
something like shmfd_create() or my earlier suggestion of shmget_fd(). 
Otherwise, when talking about functionality like sealing, which is only
available on shmfs, we'll have to say "shmfs/tmpfs/memfd" or risk
confusing folks who might not initially grasp that it's all the same
underneath.

thanks
-john

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-03-19 19:06   ` David Herrmann
@ 2014-04-02 13:38     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 123+ messages in thread
From: Konstantin Khlebnikov @ 2014-04-02 13:38 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linux Kernel Mailing List, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, john.stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It explicitly allows sealing and
> avoids any connection to user-visible mount-points. Thus, it's not
> subject to quotas on mounted file-systems, but can be used like
> malloc()'ed memory, but with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. Sealing
> is explicitly supported on memfds.
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.

Instead of adding new syscall we can extend existing openat() a little
bit more:

openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-04-02 13:38     ` Konstantin Khlebnikov
@ 2014-04-02 14:18       ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-02 14:18 UTC (permalink / raw)
  To: Konstantin Khlebnikov
  Cc: Linux Kernel Mailing List, Alexander Viro, Kay Sievers,
	Daniel Mack, Lennart Poettering, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie

Hi

On Wed, Apr 2, 2014 at 3:38 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike.
>
> Instead of adding new syscall we can extend existing openat() a little
> bit more:
>
> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)

O_TMPFILE requires an existing directory as "name". So you have to use:
  open("/run/", O_TMPFILE | O_RDWR, 0666)
instead of
  open("/run/new_file", O_TMPFILE | O_RDWR, 0666)

We _really_ want to set a name for the inode, though. Otherwise,
debug-info via /proc/pid/fd/ is useless.
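
For illustration only: the memfd_create() that was later merged (Linux
3.17, with a glibc wrapper since 2.27) takes the name as its first
argument, and the kernel exposes it, prefixed with "memfd:", as the
target of the /proc/<pid>/fd/ link. The sketch below checks exactly
that. It uses the later API rather than the v1 patches, assumes /proc is
mounted, and memfd_name_visible() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd called `name` and report whether
 * "memfd:<name>" appears in the target of its /proc/self/fd/<n> link.
 * Returns 1 if it does, 0 if it does not, -1 on error.
 */
int memfd_name_visible(const char *name)
{
    char path[64], target[512], want[256];
    ssize_t n;
    int fd = memfd_create(name, MFD_CLOEXEC);

    if (fd < 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
    n = readlink(path, target, sizeof(target) - 1);
    close(fd);
    if (n < 0)
        return -1;
    target[n] = '\0'; /* readlink() does not NUL-terminate */

    snprintf(want, sizeof(want), "memfd:%s", name);
    return strstr(target, want) != NULL;
}
```

A file created with O_TMPFILE has no such caller-chosen name, which is
the debugging gap described above.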

Furthermore, Linus requested to allow sealing only on files that
_explicitly_ allow sealing. So v2 of this series will have
MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this
with linkat() (or is that meant to be implicit for the new AT_FDSHM?).
Last but not least, you now need a separate syscall to set the
file-size.
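
For comparison, the interface that eventually landed in Linux 3.17
matches this description: sealing is opt-in via MFD_ALLOW_SEALING, seals
are added with fcntl(F_ADD_SEALS), and the size is set with a separate
ftruncate() call. A minimal sketch of that later API (not of the v1
patches discussed here) follows; make_sealed_memfd() is a hypothetical
helper name.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of `size` bytes that allows sealing,
 * then seal its size and contents. Returns the descriptor, or -1 on error.
 */
int make_sealed_memfd(const char *name, off_t size)
{
    int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);

    if (fd < 0)
        return -1;

    /* The size must be set before the grow/shrink seals freeze it. */
    if (ftruncate(fd, size) < 0 ||
        fcntl(fd, F_ADD_SEALS,
              F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Without MFD_ALLOW_SEALING the F_ADD_SEALS call fails, so a file only
becomes sealable if its creator asked for that explicitly.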

I could live with most of these issues, except for the name-thing. Ideas?

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-04-02 14:18       ` David Herrmann
  (?)
@ 2014-04-02 14:52         ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 123+ messages in thread
From: Konstantin Khlebnikov @ 2014-04-02 14:52 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linux Kernel Mailing List, Alexander Viro, Kay Sievers,
	Daniel Mack, Lennart Poettering, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie

On Wed, Apr 2, 2014 at 6:18 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi
>
> On Wed, Apr 2, 2014 at 3:38 PM, Konstantin Khlebnikov <koct9i@gmail.com> wrote:
>> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>>> that you can pass to mmap(). It explicitly allows sealing and
>>> avoids any connection to user-visible mount-points. Thus, it's not
>>> subject to quotas on mounted file-systems, but can be used like
>>> malloc()'ed memory, but with a file-descriptor to it.
>>>
>>> memfd_create() does not create a front-FD, but instead returns the raw
>>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>>> will return proper information and mark the file as regular file. Sealing
>>> is explicitly supported on memfds.
>>>
>>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>>> subject to quotas and alike.
>>
>> Instead of adding new syscall we can extend existing openat() a little
>> bit more:
>>
>> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)
>
> O_TMPFILE requires an existing directory as "name". So you have to use:
>   open("/run/", O_TMPFILE | O_RDWR, 0666)
> instead of
>   open("/run/new_file", O_TMPFILE | O_RDWR, 0666)
>
> We _really_ want to set a name for the inode, though. Otherwise,
> debug-info via /proc/pid/fd/ is useless.
>
> Furthermore, Linus requested to allow sealing only on files that
> _explicitly_ allow sealing. So v2 of this series will have
> MFD_ALLOW_SEALING as memfd_create() flag. I don't think we can do this
> with linkat() (or is that meant to be implicit for the new AT_FDSHM?).
> Last but not least, you now need a separate syscall to set the
> file-size.
>
> I could live with most of these issues, except for the name-thing. Ideas?

Hmm, why can't the AT_FDSHM + O_TMPFILE pair have different naming
behavior? Actually the O_TMPFILE flag is optional here. AT_FDSHM is
enough, but O_TMPFILE allows moving the branching out of the common
fast-paths and hiding it inside do_tmpfile.

BTW you can set some extended attribute via fsetxattr and distinguish
files in proc by its value.

OR you could add an fcntl() for changing the 'name' of tmpfiles. In
combination with AT_FDSHM this would give a complete solution without
changing the O_TMPFILE naming scheme. But one syscall turns into three. )

--

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-19 19:06 ` David Herrmann
@ 2014-04-08 13:00   ` Florian Weimer
  -1 siblings, 0 replies; 123+ messages in thread
From: Florian Weimer @ 2014-04-08 13:00 UTC (permalink / raw)
  To: David Herrmann, linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 03/19/2014 08:06 PM, David Herrmann wrote:

> Unlike existing techniques that provide similar protection, sealing allows
> file-sharing without any trust-relationship. This is enforced by rejecting seal
> modifications if you don't own an exclusive reference to the given file. So if
> you own a file-descriptor, you can be sure that no-one besides you can modify
> the seals on the given file. This allows mapping shared files from untrusted
> parties without the fear of the file getting truncated or modified by an
> attacker.

How do you keep these promises on network and FUSE file systems?  Surely 
there is still some trust involved for such descriptors?

What happens if you create a loop device on a sealed descriptor?

Why does memfd_create not create a file backed by a memory region in the 
current process?  Wouldn't this be a far more generic primitive? 
Creating aliases of memory regions would be interesting for many things 
(not just libffi bypassing SELinux-enforced NX restrictions :-).

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-08 13:00   ` Florian Weimer
@ 2014-04-09 21:31     ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-09 21:31 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Kay Sievers, Daniel Mack, Lennart Poettering,
	dri-devel, linux-fsdevel, linux-mm

Hi

On Tue, Apr 8, 2014 at 3:00 PM, Florian Weimer <fweimer@redhat.com> wrote:
> How do you keep these promises on network and FUSE file systems?

I don't. This is shmem only.
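
The "shmem only" restriction can be observed from userspace. The sketch
below assumes the sealing interface as it was later merged (the F_ADD_SEALS
fcntl command and the MFD_ALLOW_SEALING flag); the helper names try_seal(),
open_sealable() and open_regular() are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: try to add F_SEAL_SHRINK to fd.
 * Returns 0 on success, -1 if the kernel refuses. */
static int try_seal(int fd)
{
    assert(fd >= 0);
    return fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK) == 0 ? 0 : -1;
}

/* A memfd that explicitly allows sealing. */
static int open_sealable(void)
{
    return memfd_create("sealable", MFD_CLOEXEC | MFD_ALLOW_SEALING);
}

/* An ordinary (already unlinked) temporary file on whatever backs /tmp. */
static int open_regular(void)
{
    char path[] = "/tmp/seal-demo-XXXXXX";
    int fd = mkstemp(path);
    if (fd >= 0)
        unlink(path);
    return fd;
}
```

try_seal() is expected to succeed on the descriptor from open_sealable()
and to fail on the one from open_regular(), whichever filesystem /tmp
lives on.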

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 15:32   ` tytso
@ 2014-04-10 14:45     ` Colin Walters
  -1 siblings, 0 replies; 123+ messages in thread
From: Colin Walters @ 2014-04-10 14:45 UTC (permalink / raw)
  To: tytso
  Cc: David Herrmann, linux-kernel, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie, mtk.manpages

On Thu, Mar 20, 2014 at 11:32 AM, tytso@mit.edu wrote:
> 
> Looking at your patches, and what files you are modifying, you are
> enforcing this in the low-level file system.

I would love for this to be implemented at the filesystem level as
well.  Something like the ext4 immutable bit, but with the ability to
still make hardlinks, would be *very* useful for OSTree, and for anyone
else that uses hardlinks as a data source.  The vserver people do
something similar:
http://linux-vserver.org/util-vserver:Vhashify

At the moment I have a read-only bind mount over /usr, but what I 
really want is to make the individual objects in the object store in 
/ostree/repo/objects be immutable, so even if a user or app navigates 
out to /sysroot they still can't mutate them (or the link targets in 
the visible /usr).





^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 3/6] shm: add memfd_create() syscall
  2014-04-02 13:38     ` Konstantin Khlebnikov
@ 2014-04-10 19:07       ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 19:07 UTC (permalink / raw)
  To: Konstantin Khlebnikov, David Herrmann
  Cc: Linux Kernel Mailing List, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian Høgsberg, john.stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On 04/02/2014 06:38 AM, Konstantin Khlebnikov wrote:
> On Wed, Mar 19, 2014 at 11:06 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It explicitly allows sealing and
>> avoids any connection to user-visible mount-points. Thus, it's not
>> subject to quotas on mounted file-systems, but can be used like
>> malloc()'ed memory, but with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. Sealing
>> is explicitly supported on memfds.
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and alike.
> 
> Instead of adding new syscall we can extend existing openat() a little
> bit more:
> 
> openat(AT_FDSHM, "name", O_TMPFILE | O_RDWR, 0666)

Please don't.  O_TMPFILE is a messy enough API, and the last thing we
need to do is to extend it.  If we want a fancy API for creating new
inodes with no corresponding dentry, let's create one.

Otherwise, let's just stick with a special-purpose API for these shm files.

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-03-20 16:38       ` tytso
@ 2014-04-10 19:14         ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 19:14 UTC (permalink / raw)
  To: tytso, David Herrmann, linux-kernel, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, Tejun Heo,
	Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 03/20/2014 09:38 AM, tytso@mit.edu wrote:
> On Thu, Mar 20, 2014 at 04:48:30PM +0100, David Herrmann wrote:
>> On Thu, Mar 20, 2014 at 4:32 PM,  <tytso@mit.edu> wrote:
>>> Why not make sealing an attribute of the "struct file", and enforce it
>>> at the VFS layer?  That way all file system objects would have access
>>> to sealing interface, and for memfd_shmem, you can't get another
>>> struct file pointing at the object, the security properties would be
>>> identical.
>>
>> Sealing as introduced here is an inode-attribute, not "struct file".
>> This is intentional. For instance, a gfx-client can get a read-only FD
>> via /proc/self/fd/ and pass it to the compositor so it can never
>> overwrite the contents (unless the compositor has write-access to the
>> inode itself, in which case it can just re-open it read-write).
> 
> Hmm, good point.  I had forgotten about the /proc/self/fd hole.
> Hmm... what if we have a SEAL_PROC which forces the permissions of
> /proc/self/fd to be 000?

This is the second time in a week that someone has asked for a way to
have a struct file (or struct inode or whatever) that can't be reopened
through /proc/pid/fd.  This should be quite easy to implement as a
separate feature.

Actually, that feature would solve a major pet peeve of mine, I think: I
want something like memfd that allows me to keep the thing read-write
but that whoever I pass the fd to can't change.  With this feature, I
could do:

fd_rw = memfd_create (or O_TMPFILE or whatever)
fd_ro = open(/proc/self/fd/<fd_rw>, O_RDONLY);
fcntl(fd_ro, F_RESTRICT, F_RESTRICT_REOPEN);

send fd_ro via SCM_RIGHTS.

To really make this work well, I also want to SEAL_SHRINK the inode so
that the receiver can verify that I'm not going to truncate the file out
from under it.

Bingo, fast and secure one-way IPC.
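
A rough sketch of the steps above that can be built today, assuming the
memfd_create() and F_ADD_SEALS interfaces as they were later merged. The
F_RESTRICT / F_RESTRICT_REOPEN step is only a proposal in this message and
has no real counterpart, so it appears as a comment. The helper name
make_oneway_channel() is an assumption for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a memfd of `size` bytes that can no longer
 * shrink, returning a read-write fd in *fd_rw and a read-only fd in *fd_ro.
 * Returns 0 on success, -1 on error. */
static int make_oneway_channel(size_t size, int *fd_rw, int *fd_ro)
{
    assert(fd_rw != NULL && fd_ro != NULL);

    int rw = memfd_create("oneway", MFD_CLOEXEC | MFD_ALLOW_SEALING);
    if (rw < 0)
        return -1;

    /* Size the file first, then forbid shrinking it. */
    if (ftruncate(rw, (off_t)size) < 0 ||
        fcntl(rw, F_ADD_SEALS, F_SEAL_SHRINK) < 0) {
        close(rw);
        return -1;
    }

    /* Reopen the same inode read-only through procfs. */
    char path[64];
    snprintf(path, sizeof(path), "/proc/self/fd/%d", rw);
    int ro = open(path, O_RDONLY | O_CLOEXEC);
    if (ro < 0) {
        close(rw);
        return -1;
    }

    /* The proposed fcntl(ro, F_RESTRICT, F_RESTRICT_REOPEN) would go here.
     * No such command exists, so a receiver could still reopen the file
     * read-write through its own /proc/self/fd. */
    *fd_rw = rw;
    *fd_ro = ro;
    return 0;
}
```

The sender keeps fd_rw and passes fd_ro over a unix socket with SCM_RIGHTS;
the receiver can check for F_SEAL_SHRINK with fcntl(fd, F_GET_SEALS) before
mapping the file.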

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 14:45     ` Colin Walters
@ 2014-04-10 19:15       ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 19:15 UTC (permalink / raw)
  To: Colin Walters, tytso
  Cc: David Herrmann, linux-kernel, Hugh Dickins, Alexander Viro,
	Matthew Wilcox, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, Kristian, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie, mtk.manpages

On 04/10/2014 07:45 AM, Colin Walters wrote:
> On Thu, Mar 20, 2014 at 11:32 AM, tytso@mit.edu wrote:
>>
>> Looking at your patches, and what files you are modifying, you are
>> enforcing this in the low-level file system.
> 
> I would love for this to be implemented at the filesystem level as
> well.  Something like the ext4 immutable bit, but with the ability to
> still make hardlinks would be *very* useful for OSTree.  And anyone else
> that uses hardlinks as a data source.  The vserver people do something
> similar:
> http://linux-vserver.org/util-vserver:Vhashify
> 
> At the moment I have a read-only bind mount over /usr, but what I really
> want is to make the individual objects in the object store in
> /ostree/repo/objects be immutable, so even if a user or app navigates
> out to /sysroot they still can't mutate them (or the link targets in the
> visible /usr).

COW links can do this already, I think.  Of course, you'll have to use a
filesystem that supports them.

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-08 13:00   ` Florian Weimer
@ 2014-04-10 19:17     ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 19:17 UTC (permalink / raw)
  To: Florian Weimer, David Herrmann, linux-kernel
  Cc: Hugh Dickins, Alexander Viro, Matthew Wilcox, Karol Lewandowski,
	Kay Sievers, Daniel Mack, Lennart Poettering,
	Kristian Høgsberg, john.stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 04/08/2014 06:00 AM, Florian Weimer wrote:
> On 03/19/2014 08:06 PM, David Herrmann wrote:
> 
>> Unlike existing techniques that provide similar protection, sealing
>> allows
>> file-sharing without any trust-relationship. This is enforced by
>> rejecting seal
>> modifications if you don't own an exclusive reference to the given
>> file. So if
>> you own a file-descriptor, you can be sure that no-one besides you can
>> modify
>> the seals on the given file. This allows mapping shared files from
>> untrusted
>> parties without the fear of the file getting truncated or modified by an
>> attacker.
> 
> How do you keep these promises on network and FUSE file systems?  Surely
> there is still some trust involved for such descriptors?
> 
> What happens if you create a loop device on a sealed descriptor?
> 
> Why does memfd_create not create a file backed by a memory region in the
> current process?  Wouldn't this be a far more generic primitive?
> Creating aliases of memory regions would be interesting for many things
> (not just libffi bypassing SELinux-enforced NX restrictions :-).

If you write a patch to prevent selinux from enforcing NX, I will ack
that patch with all my might.  I don't know how far it would get me, but
I think that selinux has no business going anywhere near execmem.

Adding a clone mode to mremap might be a better bet.  But memfd solves
that problem, too, albeit messily.
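
As an illustration of how memfd covers the aliasing case, the sketch below
maps one memfd at two addresses, so that a store through one mapping is
visible through the other. It assumes memfd_create() as later merged; the
helper name map_alias_pair() is made up. Presumably the "messy" part is
that the memory has to originate in the memfd, since an existing anonymous
region cannot be aliased this way.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: map `len` bytes of a fresh memfd twice. On success
 * returns 0; *rw is a writable view and *ro a read-only view of the same
 * pages. */
static int map_alias_pair(size_t len, void **rw, void **ro)
{
    assert(rw != NULL && ro != NULL && len > 0);

    int fd = memfd_create("alias", MFD_CLOEXEC);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }

    void *m1 = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *m2 = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);   /* the mappings keep the file alive */

    if (m1 == MAP_FAILED || m2 == MAP_FAILED) {
        if (m1 != MAP_FAILED)
            munmap(m1, len);
        if (m2 != MAP_FAILED)
            munmap(m2, len);
        return -1;
    }

    *rw = m1;
    *ro = m2;
    return 0;
}
```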

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 19:15       ` Andy Lutomirski
@ 2014-04-10 19:45         ` Colin Walters
  -1 siblings, 0 replies; 123+ messages in thread
From: Colin Walters @ 2014-04-10 19:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: tytso, David Herrmann, linux-kernel, Hugh Dickins,
	Alexander Viro, Matthew Wilcox, Karol Lewandowski, Kay Sievers,
	Daniel Mack, Lennart Poettering, Kristian, john.stultz,
	Greg Kroah-Hartman, Tejun Heo, Johannes Weiner, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, mtk.manpages

On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net> 
wrote:
> 
> 
> COW links can do this already, I think.  Of course, you'll have to 
> use a
> filesystem that supports them.

COW links are nice if the filesystem supports them, but my userspace code
needs to be filesystem agnostic.  Because of that, the design for
userspace simply doesn't allow arbitrary writes.

Instead, I have to painfully audit every rpm %post/dpkg postinst type 
script to ensure they break hardlinks, and furthermore only allow 
executing scripts that are known to do so.

But I think even in a btrfs world it'd still be useful to mark files as 
content-immutable.
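
To make "break hardlinks" concrete: a script that wants to change a file
which is hardlinked into an object store has to write a new inode and
rename it into place, rather than opening the existing file for writing.
A minimal sketch follows; the helper name replace_breaking_hardlink() is an
assumption for illustration and is not taken from OSTree.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: replace the contents of `path` without touching other
 * hardlinks to the old inode. Writes a new file next to `path`, then
 * rename()s it over `path`. The new file gets mkstemp()'s 0600 mode; a real
 * tool would also copy mode and ownership. Returns 0 on success, -1 on error. */
static int replace_breaking_hardlink(const char *path, const char *data)
{
    assert(path != NULL && data != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = mkstemp(tmp);              /* new inode, link count 1 */
    if (fd < 0)
        return -1;

    size_t len = strlen(data);
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);

    if (rename(tmp, path) < 0) {        /* atomically repoint the name */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

After the call, the other link (for example the object in the object
store) still refers to the old, unmodified inode.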





^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 19:14         ` Andy Lutomirski
@ 2014-04-10 20:32           ` Theodore Ts'o
  -1 siblings, 0 replies; 123+ messages in thread
From: Theodore Ts'o @ 2014-04-10 20:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: David Herrmann, linux-kernel, Hugh Dickins, Alexander Viro,
	Karol Lewandowski, Kay Sievers, Daniel Mack, Lennart Poettering,
	John Stultz, Greg Kroah-Hartman, Tejun Heo, Johannes Weiner,
	dri-devel, linux-fsdevel, linux-mm, Andrew Morton,
	Linus Torvalds, Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
> 
> This is the second time in a week that someone has asked for a way to
> have a struct file (or struct inode or whatever) that can't be reopened
> through /proc/pid/fd.  This should be quite easy to implement as a
> separate feature.

What I suggested on a different thread was to add the following new
file descriptor flags, to join FD_CLOEXEC, which would be manipulated
using the F_GETFD and F_SETFD fcntl commands:

FD_NOPROCFS	disallow being able to open the inode via /proc/<pid>/fd

FD_NOPASSFD	disallow being able to pass the fd via a unix domain socket

FD_LOCKFLAGS	if this bit is set, disallow any further changes of FD_CLOEXEC,
		FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.

Regardless of what else we might need to meet the use case for the
proposed File Sealing API, I think this is a useful feature that could
be used in many other contexts besides just the proposed
memfd_create() use case.
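
For readers unfamiliar with the descriptor-flag interface, this is how the
one existing flag, FD_CLOEXEC, is manipulated today. The proposed
FD_NOPROCFS, FD_NOPASSFD and FD_LOCKFLAGS bits do not exist, so the sketch
only uses FD_CLOEXEC; the helper name set_fd_flag() is an assumption.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: set (on != 0) or clear (on == 0) one descriptor flag
 * with a read-modify-write over F_GETFD/F_SETFD. Returns 0 or -1. */
static int set_fd_flag(int fd, int flag, int on)
{
    assert(fd >= 0);

    int flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;

    flags = on ? (flags | flag) : (flags & ~flag);
    return fcntl(fd, F_SETFD, flags) < 0 ? -1 : 0;
}
```

Under the proposal, something like set_fd_flag(fd, FD_NOPROCFS, 1) followed
by setting FD_LOCKFLAGS would presumably pin the restriction for the
lifetime of the descriptor.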

Cheers,

					- Ted

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 20:32           ` Theodore Ts'o
@ 2014-04-10 20:37             ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 20:37 UTC (permalink / raw)
  To: Theodore Ts'o, Andy Lutomirski, David Herrmann, linux-kernel,
	Hugh Dickins, Alexander Viro, Karol Lewandowski, Kay Sievers,
	Daniel Mack, Lennart Poettering, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On Thu, Apr 10, 2014 at 1:32 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
>>
>> This is the second time in a week that someone has asked for a way to
>> have a struct file (or struct inode or whatever) that can't be reopened
>> through /proc/pid/fd.  This should be quite easy to implement as a
>> separate feature.
>
> What I suggested on a different thread was to add the following new
> file descriptor flags, to join FD_CLOEXEC, which would be manipulated
> using the F_GETFD and F_SETFD fcntl commands:
>
> FD_NOPROCFS     disallow being able to open the inode via /proc/<pid>/fd
>
> FD_NOPASSFD     disallow being able to pass the fd via a unix domain socket
>
> FD_LOCKFLAGS    if this bit is set, disallow any further changes of FD_CLOEXEC,
>                 FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
>
> Regardless of what else we might need to meet the use case for the
> proposed File Sealing API, I think this is a useful feature that could
> be used in many other contexts besides just the proposed
> memfd_create() use case.

It occurs to me that, before going nuts with these kinds of flags, it
may pay to just try to fix the /proc/self/fd issue for real -- we
could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
read-only.  That may be enough for the file sealing thing.
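
To illustrate the behaviour in question, here is a minimal sketch in C
of re-opening a descriptor through procfs.  The helper name
reopen_via_proc is hypothetical, and the sketch assumes procfs is
mounted at /proc:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Ask for a new open file description on the inode behind `fd` by
 * opening its /proc/self/fd entry with `open_flags`.
 * Returns the new descriptor, or -1 with errno set by open().
 */
static int reopen_via_proc(int fd, int open_flags)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, open_flags);
}
```

If fd was opened O_RDONLY on a file whose inode mode grants the caller
write access, reopen_via_proc(fd, O_RDWR) currently hands back a
writable descriptor.  The change suggested above would make that call
fail instead.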

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 20:37             ` Andy Lutomirski
@ 2014-04-10 20:49               ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-10 20:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

Hi

On Thu, Apr 10, 2014 at 10:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> It occurs to me that, before going nuts with these kinds of flags, it
> may pay to just try to fix the /proc/self/fd issue for real -- we
> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
> read-only.  That may be enough for the file sealing thing.

For the sealing API, none of this is needed. As long as the inode is
owned by the uid who creates the memfd, you can pass it around and
no-one besides root and you can open /proc/self/fd/$fd (assuming chmod
700). If you share the fd with someone with the same uid as you,
you're screwed anyway. We don't protect users against themselves (I
mean, they can ptrace you, or kill()..). Therefore, I'm not really
convinced that we want this for memfd. At least no-one has provided a
_proper_ use-case for this so far.

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 20:49               ` David Herrmann
@ 2014-04-10 21:16                 ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 21:16 UTC (permalink / raw)
  To: David Herrmann
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, Apr 10, 2014 at 1:49 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi
>
> On Thu, Apr 10, 2014 at 10:37 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> It occurs to me that, before going nuts with these kinds of flags, it
>> may pay to just try to fix the /proc/self/fd issue for real -- we
>> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
>> read-only.  That may be enough for the file sealing thing.
>
> For the sealing API, none of this is needed. As long as the inode is
> owned by the uid who creates the memfd, you can pass it around and
> no-one besides root and you can open /proc/self/fd/$fd (assuming chmod
> 700). If you share the fd with someone with the same uid as you,
> you're screwed anyway. We don't protect users against themselves (I
> mean, they can ptrace you, or kill()..). Therefore, I'm not really
> convinced that we want this for memfd. At least no-one has provided a
> _proper_ use-case for this so far.

Hmm.  Fair enough.

Would it make sense for the initial mode on a memfd inode to be 000?
Anyone who finds this to be problematic could use fchmod to fix it.

I might even go so far as to suggest that the default uid on the inode
should be 0 (i.e. global root), since there is the odd corner case of
root setting euid != 0, creating a memfd, and setting euid back to 0.
The latter might cause resource accounting issues, though.

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 21:16                 ` Andy Lutomirski
  (?)
  (?)
@ 2014-04-10 22:57                   ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-10 22:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

Hi

On Thu, Apr 10, 2014 at 11:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Would it make sense for the initial mode on a memfd inode to be 000?
> Anyone who finds this to be problematic could use fchmod to fix it.

memfd_create() should be subject to umask() just like anything else.
That should solve any possible race here, right?

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 22:57                   ` David Herrmann
@ 2014-04-10 23:05                     ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 23:05 UTC (permalink / raw)
  To: David Herrmann
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, Apr 10, 2014 at 3:57 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi
>
> On Thu, Apr 10, 2014 at 11:16 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Would it make sense for the initial mode on a memfd inode to be 000?
>> Anyone who finds this to be problematic could use fchmod to fix it.
>
> memfd_create() should be subject to umask() just like anything else.
> That should solve any possible race here, right?

Yes, but how many people will actually think about umask when doing
things that don't really look like creating files?

/proc/pid/fd is a really weird corner case in which the mode of an
inode that doesn't have a name matters.  I suspect that almost no one
will ever want to open one of these things out of /proc/self/fd, and
those who do should be made to think about it.

It also avoids odd screwups where things are secure until someone runs
them with umask 000.

--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 23:05                     ` Andy Lutomirski
@ 2014-04-10 23:16                       ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-10 23:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

Hi

On Fri, Apr 11, 2014 at 1:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> /proc/pid/fd is a really weird corner case in which the mode of an
> inode that doesn't have a name matters.  I suspect that almost no one
> will ever want to open one of these things out of /proc/self/fd, and
> those who do should be made to think about it.

I'm arguing in the context of memfd, and there's no security leak if
people get access to the underlying inode (at least I'm not aware of
any). As I said, context information is attached to the inode, not
file context, so I'm fine if people want to open multiple file
contexts via /proc. If someone wants to forbid open(), I want to hear
_why_. I assume the memfd object has uid==uid-of-creator and
mode==(777 & ~umask) (which usually results in X00, so no access for
non-owners). I cannot see how /proc is a security issue here.
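
The arithmetic behind (777 & ~umask) can be spelled out with a short
sketch in C.  The function name apply_umask is made up for
illustration, and the mode is treated as octal 0777:

```c
#include <sys/types.h>

/*
 * Mode an inode ends up with when `requested` is filtered through
 * `mask`, following the usual creation rule requested & ~umask.
 * Only the permission bits (07777) are kept.
 */
static mode_t apply_umask(mode_t requested, mode_t mask)
{
	return requested & ~mask & 07777;
}
```

With a umask of 077, apply_umask(0777, 077) yields 0700, which is the
"X00" case with no access for non-owners.  With a umask of 022 the
same call yields 0755, and with a umask of 000 it yields 0777.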

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 23:16                       ` David Herrmann
@ 2014-04-10 23:32                         ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-04-10 23:32 UTC (permalink / raw)
  To: David Herrmann
  Cc: Theodore Ts'o, linux-kernel, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, dri-devel,
	linux-fsdevel, linux-mm, Andrew Morton, Linus Torvalds,
	Ryan Lortie, Michael Kerrisk (man-pages)

On Thu, Apr 10, 2014 at 4:16 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> Hi
>
> On Fri, Apr 11, 2014 at 1:05 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>> /proc/pid/fd is a really weird corner case in which the mode of an
>> inode that doesn't have a name matters.  I suspect that almost no one
>> will ever want to open one of these things out of /proc/self/fd, and
>> those who do should be made to think about it.
>
> I'm arguing in the context of memfd, and there's no security leak if
> people get access to the underlying inode (at least I'm not aware of
> any).

I'm not sure what you mean.

> As I said, context information is attached to the inode, not
> file context, so I'm fine if people want to open multiple file
> contexts via /proc. If someone wants to forbid open(), I want to hear
> _why_. I assume the memfd object has uid==uid-of-creator and
> mode==(777 & ~umask) (which usually results in X00, so no access for
> non-owners). I cannot see how /proc is a security issue here.

On further reflection, my argument for 000 is crap.  As far as I can
see, the only time that the mode matters at all is when playing with
/proc/pid/fd, and the only way to get a non-O_RDWR memfd is using
/proc/pid/fd, so I'll argue for 0600 instead.

Argument why 0600 is better than 0600 & ~umask: either callers don't
care because the inode mode simply doesn't matter or they're using
/proc/pid/fd to *reduce* permissions, in which case they'd probably
like to avoid having to play with umask or call fchmod.

Argument why 0600 is better than 0777 & ~umask: People using
/proc/pid/fd are the only ones who care, in which case they probably
prefer for the permissions not to be increased by other users if they
give them a reduced-permission fd.

Anyway, this is all mostly unimportant.  Some text in the man page is
probably sufficient, but I still think that 0600 is trivial to
implement and a little bit more friendly.
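
Until the default is settled, a caller who wants 0600 can get it with
fchmod().  A minimal sketch in C, assuming the memfd_create() syscall
in the form that was later merged, which may differ from the interface
in this patch series.  It calls the raw syscall so that it does not
depend on a libc wrapper, and the helper name memfd_create_private is
hypothetical:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <linux/memfd.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Create an anonymous memory file and force its inode mode to 0600,
 * independent of the caller's umask.
 * Returns the descriptor, or -1 on error with errno set.
 */
static int memfd_create_private(const char *name)
{
	int fd = (int)syscall(SYS_memfd_create, name, MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (fchmod(fd, S_IRUSR | S_IWUSR) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

Calling fstat() on the returned descriptor should then report
permission bits of 0600 whatever umask the process runs with.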

--Andy

>
> Thanks
> David



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 19:45         ` Colin Walters
@ 2014-04-11  6:09           ` Alex Elsayed
  -1 siblings, 0 replies; 123+ messages in thread
From: Alex Elsayed @ 2014-04-11  6:09 UTC (permalink / raw)
  To: linux-mm; +Cc: linux-kernel, dri-devel, linux-fsdevel

Colin Walters wrote:

> On Thu, Apr 10, 2014 at 3:15 PM, Andy Lutomirski <luto@amacapital.net>
> wrote:
>> 
>> 
>> COW links can do this already, I think.  Of course, you'll have to
>> use a
>> filesystem that supports them.
> 
> COW is nice if the filesystem supports them, but my userspace code
> needs to be filesystem agnostic.  Because of that, the design for
> userspace simply doesn't allow arbitrary writes.
> 
> Instead, I have to painfully audit every rpm %post/dpkg postinst type
> script to ensure they break hardlinks, and furthermore only allow
> executing scripts that are known to do so.
> 
> But I think even in a btrfs world it'd still be useful to mark files as
> content-immutable.

If you create each tree as a subvolume and when it's complete put it in 
place with btrfs subvolume snapshot -r FOO_inprogress /ostree/repo/FOO,
you get exactly that.

You can even use the new(ish) btrfs out-of-band dedup functionality to 
deduplicate read-only snapshots safely.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 20:37             ` Andy Lutomirski
@ 2014-04-20 15:03               ` Pavel Machek
  -1 siblings, 0 replies; 123+ messages in thread
From: Pavel Machek @ 2014-04-20 15:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Theodore Ts'o, David Herrmann, linux-kernel, Hugh Dickins,
	Alexander Viro, Karol Lewandowski, Kay Sievers, Daniel Mack,
	Lennart Poettering, John Stultz, Greg Kroah-Hartman, Tejun Heo,
	Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On Thu 2014-04-10 13:37:26, Andy Lutomirski wrote:
> On Thu, Apr 10, 2014 at 1:32 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Thu, Apr 10, 2014 at 12:14:27PM -0700, Andy Lutomirski wrote:
> >>
> >> This is the second time in a week that someone has asked for a way to
> >> have a struct file (or struct inode or whatever) that can't be reopened
> >> through /proc/pid/fd.  This should be quite easy to implement as a
> >> separate feature.
> >
> > What I suggested on a different thread was to add the following new
> > file descriptor flags, to join FD_CLOEXEC, which would be manipulated
> > using the F_GETFD and F_SETFD fcntl commands:
> >
> > FD_NOPROCFS     disallow being able to open the inode via /proc/<pid>/fd
> >
> > FD_NOPASSFD     disallow being able to pass the fd via a unix domain socket
> >
> > FD_LOCKFLAGS    if this bit is set, disallow any further changes of FD_CLOEXEC,
> >                 FD_NOPROCFS, FD_NOPASSFD, and FD_LOCKFLAGS flags.
> >
> > Regardless of what else we might need to meet the use case for the
> > proposed File Sealing API, I think this is a useful feature that could
> > be used in many other contexts besides just the proposed
> > memfd_create() use case.
> 
> It occurs to me that, before going nuts with these kinds of flags, it
> may pay to just try to fix the /proc/self/fd issue for real -- we
> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
> read-only.  That may be enough for the file sealing thing.

Yes please.

The current behaviour is very unexpected, and unexpected behaviour in
the security area is normally called a "security hole".
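
For readers who want to reproduce the behaviour under discussion, here is a
minimal sketch in C. It is not taken from the thread: the helper names
reopen_via_procfs() and demo_readonly_upgrade() are hypothetical, and it
assumes a Linux system with /proc mounted and a writable /tmp. It holds only
an O_RDONLY descriptor and then asks for a writable one through
/proc/self/fd.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: reopen an existing descriptor through
 * /proc/self/fd/<fd> with the given open(2) flags.
 * Returns the new descriptor, or -1 with errno set.
 */
static int reopen_via_procfs(int fd, int flags)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, flags);
}

/*
 * Hypothetical demo: create a scratch file, keep only an O_RDONLY
 * descriptor for it, then try to obtain a writable one via procfs.
 * Returns 1 if the upgrade succeeded and a write went through,
 * 0 if the kernel refused, -1 if the setup itself failed.
 */
static int demo_readonly_upgrade(void)
{
	char tmpl[] = "/tmp/reopen-demo-XXXXXX";
	int wfd, rofd, upgraded, result = 0;

	wfd = mkstemp(tmpl);            /* file is created with mode 0600 */
	if (wfd < 0)
		return -1;
	close(wfd);

	rofd = open(tmpl, O_RDONLY);    /* the only descriptor we keep */
	if (rofd < 0) {
		unlink(tmpl);
		return -1;
	}

	upgraded = reopen_via_procfs(rofd, O_RDWR);
	if (upgraded >= 0) {
		result = (write(upgraded, "x", 1) == 1);
		close(upgraded);
	}
	close(rofd);
	unlink(tmpl);
	return result;
}
```

The reopen is checked against the inode's permissions, not against the access
mode of the descriptor that was passed around, which is why the caller's
read-only handle does not limit what the recipient can do here.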

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-09 21:31     ` David Herrmann
@ 2014-04-22  9:10       ` Florian Weimer
  -1 siblings, 0 replies; 123+ messages in thread
From: Florian Weimer @ 2014-04-22  9:10 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Kay Sievers, Daniel Mack, Lennart Poettering,
	dri-devel, linux-fsdevel, linux-mm

On 04/09/2014 11:31 PM, David Herrmann wrote:

> On Tue, Apr 8, 2014 at 3:00 PM, Florian Weimer <fweimer@redhat.com> wrote:
>> How do you keep these promises on network and FUSE file systems?
>
> I don't. This is shmem only.

Ah.  What do you recommend for recipients to recognize such descriptors?
Would they just try to seal them and reject them if this fails?

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-22  9:10       ` Florian Weimer
@ 2014-04-22 11:55         ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-22 11:55 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Kay Sievers, Daniel Mack, Lennart Poettering,
	dri-devel, linux-fsdevel, linux-mm

Hi

On Tue, Apr 22, 2014 at 11:10 AM, Florian Weimer <fweimer@redhat.com> wrote:
> Ah.  What do you recommend for recipients to recognize such descriptors?
> Would they just try to seal them and reject them if this fails?

This highly depends on your use-case. Please see the initial email in
this thread. It describes 2 example use-cases. In both cases, the
recipients read the current set of seals and verify that a given set
of seals is set.
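
A receiver-side check along these lines could be sketched as follows. This is
an illustration only, written against the interface as it was later merged
into mainline (memfd_create(2) with MFD_ALLOW_SEALING, and the F_ADD_SEALS /
F_GET_SEALS fcntl commands), not the SHMEM_* commands proposed in this
series. The function names fd_has_seals() and make_sealed_memfd() are
hypothetical, and a glibc that provides the memfd_create() wrapper is
assumed.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical receiver-side check: accept a descriptor only if every
 * seal in `required` is already set. F_GET_SEALS fails on files that
 * do not support sealing, so those are rejected as well.
 */
static int fd_has_seals(int fd, unsigned int required)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return 0;
	return ((unsigned int)seals & required) == required;
}

/*
 * Hypothetical sender side: create a sealable memfd of `size` bytes
 * and apply `seals`. Returns the descriptor, or -1 on failure.
 */
static int make_sealed_memfd(size_t size, unsigned int seals)
{
	int fd = memfd_create("sealed-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0 ||
	    fcntl(fd, F_ADD_SEALS, seals) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

A compositor-style receiver would call fd_has_seals(fd, F_SEAL_SHRINK) before
mapping a client buffer and drop the descriptor if the check fails.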

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-22 11:55         ` David Herrmann
@ 2014-04-22 12:44           ` Florian Weimer
  -1 siblings, 0 replies; 123+ messages in thread
From: Florian Weimer @ 2014-04-22 12:44 UTC (permalink / raw)
  To: David Herrmann
  Cc: linux-kernel, Kay Sievers, Daniel Mack, Lennart Poettering,
	dri-devel, linux-fsdevel, linux-mm

On 04/22/2014 01:55 PM, David Herrmann wrote:
> Hi
>
> On Tue, Apr 22, 2014 at 11:10 AM, Florian Weimer <fweimer@redhat.com> wrote:
>> Ah.  What do you recommend for recipients to recognize such descriptors?
>> Would they just try to seal them and reject them if this fails?
>
> This highly depends on your use-case. Please see the initial email in
> this thread. It describes 2 example use-cases. In both cases, the
> recipients read the current set of seals and verify that a given set
> of seals is set.

I didn't find that very convincing.  But in v2, seals are monotonic, so 
checking them should be reliable enough.

What happens when you create a loop device on a write-sealed descriptor?

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-22 12:44           ` Florian Weimer
@ 2014-04-22 12:55             ` David Herrmann
  -1 siblings, 0 replies; 123+ messages in thread
From: David Herrmann @ 2014-04-22 12:55 UTC (permalink / raw)
  To: Florian Weimer
  Cc: linux-kernel, Kay Sievers, Daniel Mack, Lennart Poettering,
	dri-devel, linux-fsdevel, linux-mm

Hi

On Tue, Apr 22, 2014 at 2:44 PM, Florian Weimer <fweimer@redhat.com> wrote:
> I didn't find that very convincing.  But in v2, seals are monotonic, so
> checking them should be reliable enough.

Ok.

> What happens when you create a loop device on a write-sealed descriptor?

Any write-back to the loop-device will fail with EPERM as soon as the
fd gets write-sealed. See __do_lo_send_write() in
drivers/block/loop.c. It's up to the loop-device to forward the error
via bio_endio() to the caller for proper error-handling.
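
The same error can be observed from userspace without a loop device. The
sketch below is an illustration under assumptions, not code from this series:
it uses the interface as later merged (memfd_create(2) and F_ADD_SEALS with
F_SEAL_WRITE), and the function name errno_of_write_after_seal() is
hypothetical. It writes once, applies the write seal, and reports the errno
of the second write.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: write once, add F_SEAL_WRITE, write again.
 * Returns the errno of the second write (0 if it unexpectedly
 * succeeded), or -1 if the setup itself failed.
 */
static int errno_of_write_after_seal(void)
{
	int fd = memfd_create("write-seal-demo",
			      MFD_CLOEXEC | MFD_ALLOW_SEALING);
	int err = 0;

	if (fd < 0)
		return -1;

	/* The first write happens before the seal and must succeed. */
	if (write(fd, "before", 6) != 6 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}

	/* The second write happens after the seal and should be refused. */
	if (write(fd, "after", 5) < 0)
		err = errno;

	close(fd);
	return err;
}
```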

Thanks
David

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-04-10 20:37             ` Andy Lutomirski
  (?)
  (?)
@ 2014-06-17  9:48               ` Florian Weimer
  -1 siblings, 0 replies; 123+ messages in thread
From: Florian Weimer @ 2014-06-17  9:48 UTC (permalink / raw)
  To: Andy Lutomirski, Theodore Ts'o, David Herrmann, linux-kernel,
	Hugh Dickins, Alexander Viro, Karol Lewandowski, Kay Sievers,
	Daniel Mack, Lennart Poettering, John Stultz, Greg Kroah-Hartman,
	Tejun Heo, Johannes Weiner, dri-devel, linux-fsdevel, linux-mm,
	Andrew Morton, Linus Torvalds, Ryan Lortie,
	Michael Kerrisk (man-pages)

On 04/10/2014 10:37 PM, Andy Lutomirski wrote:

> It occurs to me that, before going nuts with these kinds of flags, it
> may pay to just try to fix the /proc/self/fd issue for real -- we
> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
> read-only.  That may be enough for the file sealing thing.

Increasing privilege on O_PATH descriptors via access through 
/proc/self/fd is part of the userspace API.  The same thing might be 
true for O_RDONLY descriptors, but it's a bit less likely that there are 
any users out there.  In any case, I'm not sure it makes sense to plug 
the O_RDONLY hole while leaving the O_PATH hole open.
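
To make the O_PATH case concrete, here is a minimal sketch in C. It is an
illustration only: the function name read_through_opath() is hypothetical,
and it assumes Linux with /proc mounted and a writable /tmp. It holds only an
O_PATH descriptor, confirms that read(2) on it is refused, and then reopens
it O_RDONLY through /proc/self/fd.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns the number of bytes read through the
 * descriptor reopened from an O_PATH handle, or -1 on any failure
 * (including the case where the O_PATH handle was readable directly).
 */
static ssize_t read_through_opath(void)
{
	char tmpl[] = "/tmp/opath-demo-XXXXXX";
	char path[64], buf[16];
	ssize_t n = -1;
	int wfd, pfd, rfd;

	wfd = mkstemp(tmpl);
	if (wfd < 0)
		return -1;
	if (write(wfd, "hello", 5) != 5) {
		close(wfd);
		unlink(tmpl);
		return -1;
	}
	close(wfd);

	pfd = open(tmpl, O_PATH | O_CLOEXEC);
	/* An O_PATH descriptor cannot be read from directly. */
	if (pfd >= 0 && read(pfd, buf, sizeof(buf)) < 0) {
		snprintf(path, sizeof(path), "/proc/self/fd/%d", pfd);
		rfd = open(path, O_RDONLY);     /* privilege is gained here */
		if (rfd >= 0) {
			n = read(rfd, buf, sizeof(buf));
			close(rfd);
		}
	}
	if (pfd >= 0)
		close(pfd);
	unlink(tmpl);
	return n;
}
```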

-- 
Florian Weimer / Red Hat Product Security Team

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [PATCH 0/6] File Sealing & memfd_create()
  2014-06-17  9:48               ` Florian Weimer
                                 ` (2 preceding siblings ...)
  (?)
@ 2014-06-17 16:21               ` Andy Lutomirski
  -1 siblings, 0 replies; 123+ messages in thread
From: Andy Lutomirski @ 2014-06-17 16:21 UTC (permalink / raw)
  To: Florian Weimer
  Cc: Michael Kerrisk, Greg KH, Johannes Weiner, Hugh Dickins,
	Lennart Poettering, Andrew Morton, linux-mm, linux-fsdevel,
	Kay Sievers, Alexander Viro, John Stultz, Linus Torvalds,
	Tejun Heo, Theodore Ts'o, Ryan Lortie, Daniel Mack,
	dri-devel, linux-kernel, David Herrmann, Karol Lewandowski

On Jun 17, 2014 2:48 AM, "Florian Weimer" <fweimer@redhat.com> wrote:
>
> On 04/10/2014 10:37 PM, Andy Lutomirski wrote:
>
>> It occurs to me that, before going nuts with these kinds of flags, it
>> may pay to just try to fix the /proc/self/fd issue for real -- we
>> could just make open("/proc/self/fd/3", O_RDWR) fail if fd 3 is
>> read-only.  That may be enough for the file sealing thing.
>
>
> Increasing privilege on O_PATH descriptors via access through
> /proc/self/fd is part of the userspace API.  The same thing might be true
> for O_RDONLY descriptors, but it's a bit less likely that there are any
> users out there.  In any case, I'm not sure it makes sense to plug the
> O_RDONLY hole while leaving the O_PATH hole open.

Do you mean O_PATH fds for the directory or O_PATH fds for the file
itself?  In any event, I'm much less concerned about passing O_PATH memfds
around than O_RDONLY memfds.

I have incomplete patches for this stuff.  I need to fix them so they work
and get past Al Viro.


--Andy

^ permalink raw reply	[flat|nested] 123+ messages in thread

Thread overview: 123+ messages
2014-03-19 19:06 [PATCH 0/6] File Sealing & memfd_create() David Herrmann
2014-03-19 19:06 ` [PATCH 1/6] fs: fix i_writecount on shmem and friends David Herrmann
2014-03-19 19:06 ` [PATCH 2/6] shm: add sealing API David Herrmann
2014-03-19 19:06 ` [PATCH 3/6] shm: add memfd_create() syscall David Herrmann
2014-03-20  8:47   ` Cyrill Gorcunov
2014-03-20  9:01     ` Pavel Emelyanov
2014-03-20 11:29       ` David Herrmann
2014-03-20 11:50         ` Pavel Emelyanov
2014-03-20 19:22   ` John Stultz
2014-04-02 13:38   ` Konstantin Khlebnikov
2014-04-02 14:18     ` David Herrmann
2014-04-02 14:52       ` Konstantin Khlebnikov
2014-04-10 19:07     ` Andy Lutomirski
2014-03-19 19:06 ` [PATCH 4/6] selftests: add memfd_create() + sealing tests David Herrmann
2014-03-19 19:06 ` [PATCH man-pages 5/6] fcntl.2: document SHMEM_SET/GET_SEALS commands David Herrmann
2014-03-19 19:06 ` [PATCH man-pages 6/6] memfd_create.2: add memfd_create() man-page David Herrmann
2014-03-20  2:55 ` [PATCH 0/6] File Sealing & memfd_create() Greg Kroah-Hartman
2014-03-20  3:49 ` Linus Torvalds
2014-03-20  8:07   ` David Herrmann
2014-03-20 14:41     ` One Thousand Gnomes
2014-03-20 15:12       ` David Herrmann
2014-03-20 15:26         ` One Thousand Gnomes
2014-03-20 15:32 ` tytso
2014-03-20 15:39   ` One Thousand Gnomes
2014-03-20 15:48   ` David Herrmann
2014-03-20 16:38     ` tytso
2014-04-10 19:14       ` Andy Lutomirski
2014-04-10 20:32         ` Theodore Ts'o
2014-04-10 20:37           ` Andy Lutomirski
2014-04-10 20:49             ` David Herrmann
2014-04-10 21:16               ` Andy Lutomirski
2014-04-10 22:57                 ` David Herrmann
2014-04-10 23:05                   ` Andy Lutomirski
2014-04-10 23:16                     ` David Herrmann
2014-04-10 23:32                       ` Andy Lutomirski
2014-04-20 15:03             ` Pavel Machek
2014-06-17  9:48             ` Florian Weimer
2014-06-17 16:21               ` Andy Lutomirski
2014-04-10 14:45   ` Colin Walters
2014-04-10 19:15     ` Andy Lutomirski
2014-04-10 19:45       ` Colin Walters
2014-04-11  6:09         ` Alex Elsayed
2014-04-08 13:00 ` Florian Weimer
2014-04-09 21:31   ` David Herrmann
2014-04-22  9:10     ` Florian Weimer
2014-04-22 11:55       ` David Herrmann
2014-04-22 12:44         ` Florian Weimer
2014-04-22 12:55           ` David Herrmann
2014-04-10 19:17   ` Andy Lutomirski