* [PATCH v2 0/3] File Sealing & memfd_create()
From: David Herrmann @ 2014-04-15 18:38 UTC
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

Hi

This is v2 of the File-Sealing and memfd_create() patches. You can find v1 with
a longer introduction at gmane:
  http://thread.gmane.org/gmane.comp.video.dri.devel/102241
An LWN article about memfd+sealing is available, too:
  https://lwn.net/Articles/593918/

Shortlog of changes since v1:
 - Dropped the "exclusive reference" idea
   Sealing is now a one-shot operation. Once a given seal is set, you can
   never remove it again. This allows us to drop all the ref-count checking
   and simplifies the code a lot. It also eliminates the races we previously
   had to test for.
 - The i_writecount fix is now upstream (slightly different, by Al Viro) so I
   dropped it from the series.
 - Changed the SHMEM_* prefix to F_* to avoid any API association with shmem.
 - Sealing is disabled on all files by default (even though we still haven't
   found any DoS attack). You need to pass MFD_ALLOW_SEALING to memfd_create()
   to get an object that supports the sealing API.
 - Changed F_SET_SEALS to F_ADD_SEALS. This better reflects the API: you can
   never remove seals, you can only add them. Note that the semantics also
   changed slightly: You can now _always_ call F_ADD_SEALS to add _more_ seals.
   However, a new seal was added which "seals sealing" (F_SEAL_SEAL). Once
   F_SEAL_SEAL is set, F_ADD_SEALS is no longer allowed.
   This feature was requested by the glib developers.
 - memfd_create() names are now limited to NAME_MAX instead of a hardcoded 256.
 - Rewrote the test suite
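
As a rough userspace illustration of the F_ADD_SEALS / F_SEAL_SEAL semantics
listed above, here is a minimal sketch. It assumes a libc that exposes
memfd_create() and the F_* seal constants through <sys/mman.h> and <fcntl.h>,
which was not the case when this series was posted. The helpers
make_sealable() and add_seals() are made up for this example, and error codes
on a given kernel may differ from the ones in this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a memfd of the given size that supports
 * sealing. Returns the new FD, or -1 with errno set.
 */
int make_sealable(const char *name, off_t size)
{
	int fd = memfd_create(name, MFD_ALLOW_SEALING | MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/*
 * Hypothetical helper: add seals to the existing set.
 * Returns 0 on success or a negative errno value.
 */
int add_seals(int fd, int seals)
{
	return fcntl(fd, F_ADD_SEALS, seals) < 0 ? -errno : 0;
}

/* Seals accumulate across calls until F_SEAL_SEAL is part of the set. */
void seal_seal_demo(void)
{
	int fd = make_sealable("seal-demo", 4096);

	assert(fd >= 0);
	assert(add_seals(fd, F_SEAL_SHRINK) == 0);
	assert(add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL) == 0);
	/* Sealing is now sealed: no further seal can be added. */
	assert(add_seals(fd, F_SEAL_WRITE) < 0);
	close(fd);
}
```

Calling add_seals() repeatedly is fine by design; only F_SEAL_SEAL makes the
set final.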

The biggest change in v2 is the removal of the "exclusive reference" idea. It
was a nice optimization, but the implementation was ugly and racy regarding
file-table changes. Linus didn't like it either so we decided to drop it
entirely. Sealing is a one-shot operation now. A sealed file can never be
unsealed, even if you're the only holder.

I also addressed most of the concerns regarding API naming and semantics. I got
feedback from glib, EFL, wayland, kdbus, ostree and audio developers, and we
discussed many possible use-cases (and also cases that don't make sense). So I
think we're in a very good state right now.

People requested to make this interface more generic. I renamed the API to
reflect that, but I didn't change the implementation. The thing is, seals can
never be removed. Therefore, semantics for sealing on non-volatile storage are
undefined. We don't write seals to disk and it is unclear whether a sealed file
can be unlinked/removed again. There are more issues with this and no-one came
up with a use-case, hence I didn't bother implementing it.
There's also an ongoing discussion about an AIO race, but this also affects
other inode protections like S_IMMUTABLE etc. So I don't think we should tie
the fix to this series.
Another discussion was about preventing access via /proc/self/fd/. But again,
no-one could tell me _why_, so I didn't bother. On the contrary, I even provided
several use-cases that make use of /proc/self/fd/ to get read-only FDs to pass
around.
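
To make the /proc/self/fd/ use-case concrete, the following sketch re-opens an
existing FD read-only so that the read-only FD can be handed to a peer. It
assumes procfs is mounted at /proc and a libc providing memfd_create(); the
helper name reopen_readonly() is invented for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a second, read-only file description for the
 * file behind fd by going through procfs. Returns the new FD, or -1 with
 * errno set. Assumes /proc is mounted.
 */
int reopen_readonly(int fd)
{
	char path[64];

	snprintf(path, sizeof(path), "/proc/self/fd/%d", fd);
	return open(path, O_RDONLY | O_CLOEXEC);
}

/* The creator keeps the writable FD and passes only the read-only one on. */
void readonly_demo(void)
{
	int fd = memfd_create("ro-demo", MFD_ALLOW_SEALING);
	int ro;

	assert(fd >= 0);
	assert(write(fd, "hello", 5) == 5);

	ro = reopen_readonly(fd);
	assert(ro >= 0);
	/* Writes through the read-only FD are refused. */
	assert(write(ro, "x", 1) == -1);

	close(ro);
	close(fd);
}
```

Since F_ADD_SEALS requires write access, the receiver of such an FD can still
query seals but cannot add new ones.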

If anyone wants to test this, please use 3.15-rc1 as base. The i_writecount
fixes are required for this series.

Comments welcome!
David

David Herrmann (3):
  shm: add sealing API
  shm: add memfd_create() syscall
  selftests: add memfd_create() + sealing tests

 arch/x86/syscalls/syscall_32.tbl           |   1 +
 arch/x86/syscalls/syscall_64.tbl           |   1 +
 fs/fcntl.c                                 |   5 +
 include/linux/shmem_fs.h                   |  20 +
 include/linux/syscalls.h                   |   1 +
 include/uapi/linux/fcntl.h                 |  15 +
 include/uapi/linux/memfd.h                 |  10 +
 kernel/sys_ni.c                            |   1 +
 mm/shmem.c                                 | 236 +++++++-
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
 13 files changed, 1263 insertions(+), 3 deletions(-)
 create mode 100644 include/uapi/linux/memfd.h
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

-- 
1.9.2


* [PATCH v2 1/3] shm: add sealing API
From: David Herrmann @ 2014-04-15 18:38 UTC
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (e.g.,
for general-purpose IPC), sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy anymore. The
     source may have put sensitive data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is disproportionately ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever.
Unlike locks, seals can only be set, never removed. Hence, once you have
verified that a specific set of seals is set, you're guaranteed that no-one
can perform the blocked operations on this file anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If F_SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If F_SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If F_SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(FALLOC_FL_PUNCH_HOLE),
           shared mmap() and write().
  - SEAL: If F_SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEALS operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.
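
The effect of the individual seals can be observed from userspace with a
small probe like the one below. This is only a sketch: it assumes a libc with
memfd_create() and the seal constants, the function probe_seals() is
hypothetical, and the errno values it records depend on the kernel actually
running (this patch returns EPERM for all four operations).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns the errno of a failed call, or 0 if the call succeeded. */
static int errno_of(long rc)
{
	return rc < 0 ? errno : 0;
}

/*
 * Hypothetical probe: seal a fresh 4096-byte memfd with SHRINK, GROW and
 * WRITE, then record how each blocked operation fails:
 *   out[0]: ftruncate() to a smaller size
 *   out[1]: ftruncate() to a larger size
 *   out[2]: write() inside the current file size
 *   out[3]: shared, writable mmap()
 * Returns 0 on success, -1 if the setup itself failed.
 */
int probe_seals(int out[4])
{
	int fd = memfd_create("probe", MFD_ALLOW_SEALING);
	void *p;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0 ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}

	out[0] = errno_of(ftruncate(fd, 0));
	out[1] = errno_of(ftruncate(fd, 8192));
	out[2] = errno_of(write(fd, "x", 1));

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	out[3] = (p == MAP_FAILED) ? errno : 0;
	if (p != MAP_FAILED)
		munmap(p, 4096);

	close(fd);
	return 0;
}
```

Reading the file and creating private mappings are not covered by any of
these seals, so a parser on the receiving side is unaffected.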

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has
     F_SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that F_SEAL_SHRINK,
     F_SEAL_GROW and F_SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).
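
The receiver-side check in both use-cases boils down to one F_GET_SEALS call.
A minimal sketch, assuming a libc that defines the seal constants; the helper
name has_seals() is hypothetical:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if every seal in `required` is set on fd,
 * 0 if at least one is missing, and -1 (with errno set) if the seals cannot
 * be queried, e.g. because the file does not support sealing.
 */
int has_seals(int fd, int required)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -1;
	return (seals & required) == required;
}

/* Use-case 1: a graphics server only needs the buffer to never shrink. */
int server_accepts_buffer(int fd)
{
	return has_seals(fd, F_SEAL_SHRINK) == 1;
}

/* Use-case 2: an IPC receiver wants a fully immutable message. */
int receiver_accepts_message(int fd)
{
	return has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) == 1;
}
```

Because seals can never be removed, a positive answer stays valid for the
lifetime of the file; the check does not need to be repeated.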

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.
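
The shared-mapping restriction on F_ADD_SEALS can be sketched as follows. The
function seal_after_unmap() is a made-up example, it assumes a libc with
memfd_create(), and it deliberately does not check which errno the refused
call produces, since that may differ between this series (EPERM) and other
kernel versions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical demo: adding F_SEAL_WRITE is refused while a shared,
 * writable mapping of the file exists and succeeds once it is gone.
 * Returns 0 if that is what happened, -1 otherwise.
 */
int seal_after_unmap(void)
{
	int fd = memfd_create("map-demo", MFD_ALLOW_SEALING);
	int ret = -1;
	void *p;

	if (fd < 0)
		return -1;
	if (ftruncate(fd, 4096) < 0)
		goto out;

	p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out;

	/* Must fail: the mapping could still be used to modify the file. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) == 0) {
		munmap(p, 4096);
		goto out;
	}
	munmap(p, 4096);

	/* No shared mapping left, so the seal can be added now. */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		goto out;

	ret = (fcntl(fd, F_GET_SEALS) & F_SEAL_WRITE) ? 0 : -1;
out:
	close(fd);
	return ret;
}
```

A producer would therefore typically fill the buffer through a mapping, unmap
it, and only then seal the file and pass the FD on.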

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |   5 ++
 include/linux/shmem_fs.h   |  20 ++++++
 include/uapi/linux/fcntl.h |  15 +++++
 mm/shmem.c                 | 162 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 199 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9ead159..1a7a722 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
@@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
 		break;
+	case F_ADD_SEALS:
+	case F_GET_SEALS:
+		err = shmem_fcntl(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c..c043d67 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H
 
+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
@@ -20,6 +21,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	u32			seals;		/* shmem seals */
 	struct inode		vfs_inode;
 };
 
@@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
 					mapping_gfp_mask(mapping));
 }
 
+/* marks inode to support sealing */
+#define SHMEM_ALLOW_SEALING (1U << 31)
+
+#ifdef CONFIG_SHMEM
+
+extern int shmem_add_seals(struct file *file, u32 seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
+{
+	return -EINVAL;
+}
+
+#endif
+
 #endif
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..1b9b9f4 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,21 @@
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
 
 /*
+ * Set/Get seals
+ */
+#define F_ADD_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define F_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define F_SEAL_SEAL	0x0001	/* prevent further seals from being set */
+#define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
+#define F_SEAL_GROW	0x0004	/* prevent file from growing */
+#define F_SEAL_WRITE	0x0008	/* prevent writes */
+/* (1U << 31) is reserved for internal use */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..175a5b8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	loff_t oldsize = inode->i_size;
+	loff_t newsize = attr->ia_size;
 	int error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
 		return error;
 
-	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
-		loff_t oldsize = inode->i_size;
-		loff_t newsize = attr->ia_size;
+	/* protected by i_mutex */
+	if (attr->ia_valid & ATTR_SIZE) {
+		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+			return -EPERM;
+	}
 
+	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1289,6 +1297,13 @@ out_nomem:
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	/* protected by mmap_sem */
+	if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
+		return -EPERM;
+
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	return 0;
@@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (info->seals & F_SEAL_WRITE)
+		return -EPERM;
+	if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
+		return -EPERM;
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
@@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+#define F_ALL_SEALS (F_SEAL_SEAL | \
+		     F_SEAL_SHRINK | \
+		     F_SEAL_GROW | \
+		     F_SEAL_WRITE)
+
+int shmem_add_seals(struct file *file, u32 seals)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	int r;
+
+	/* SHMEM_ALLOW_SEALING is a private, unused bit */
+	BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);
+
+	/*
+	 * SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations. Seals can only be
+	 * added, but never removed. This way, mutually untrusted parties can
+	 * share common memory regions with a well-defined policy. A malicious
+	 * peer can thus never perform unwanted operations on a shared object.
+	 *
+	 * Seals are only supported on special shmem-files and always affect
+	 * the whole underlying inode. Once a seal is set, it may prevent some
+	 * kinds of access to the file. Currently, the following seals are
+	 * defined:
+	 *   SEAL_SEAL: Prevent further seals from being set on this file
+	 *   SEAL_SHRINK: Prevent the file from shrinking
+	 *   SEAL_GROW: Prevent the file from growing
+	 *   SEAL_WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * must prevent seals from being removed. Therefore, sealing a file
+	 * only adds a given set of seals to the file, it never touches
+	 * existing seals. Furthermore, the "setting seals"-operation can be
+	 * sealed itself, which basically prevents any further seal from being
+	 * added.
+	 *
+	 * Semantics of sealing are only defined on volatile files. Only
+	 * anonymous shmem files support sealing. More importantly, seals are
+	 * never written to disk. Therefore, there's no plan to support it on
+	 * other file types.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+	if (!(info->seals & SHMEM_ALLOW_SEALING))
+		return -EBADF;
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+	if (seals & ~(u32)F_ALL_SEALS)
+		return -EINVAL;
+
+	/*
+	 * - i_mutex prevents racing write/ftruncate/fallocate/..
+	 * - mmap_sem prevents racing mmap() calls
+	 */
+
+	mutex_lock(&inode->i_mutex);
+	down_read(&current->mm->mmap_sem);
+
+	/* you cannot seal while shared mappings exist */
+	if (file->f_mapping->i_mmap_writable > 0) {
+		r = -EPERM;
+		goto unlock;
+	}
+
+	if (info->seals & F_SEAL_SEAL) {
+		r = -EPERM;
+		goto unlock;
+	}
+
+	info->seals |= seals;
+	r = 0;
+
+unlock:
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&inode->i_mutex);
+	return r;
+}
+EXPORT_SYMBOL(shmem_add_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	struct shmem_inode_info *info;
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	info = SHMEM_I(file_inode(file));
+	if (!(info->seals & SHMEM_ALLOW_SEALING))
+		return -EBADF;
+
+	return info->seals & F_ALL_SEALS;
+}
+EXPORT_SYMBOL(shmem_get_seals);
+
+long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	long r;
+
+	switch (cmd) {
+	case F_ADD_SEALS:
+		/* disallow upper 32bit */
+		if (arg > UINT_MAX)
+			return -EINVAL;
+
+		r = shmem_add_seals(file, arg);
+		break;
+	case F_GET_SEALS:
+		r = shmem_get_seals(file);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
@@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		/* protected by i_mutex */
+		if (info->seals & F_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
@@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
+
 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	/* Try to avoid a swapstorm if len is impossible to satisfy */
-- 
1.9.2


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 1/3] shm: add sealing API
@ 2014-04-15 18:38   ` David Herrmann
  0 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-04-15 18:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

If two processes share a common memory region, they usually want some
guarantees to allow safe access. This often includes:
  - one side cannot overwrite data while the other reads it
  - one side cannot shrink the buffer while the other accesses it
  - one side cannot grow the buffer beyond previously set boundaries

If there is a trust-relationship between both parties, there is no need
for policy enforcement. However, if there's no trust relationship (eg.,
for general-purpose IPC) sharing memory-regions is highly fragile and
often not possible without local copies. Look at the following two
use-cases:
  1) A graphics client wants to share its rendering-buffer with a
     graphics-server. The memory-region is allocated by the client for
     read/write access and a second FD is passed to the server. While
     scanning out from the memory region, the server has no guarantee that
     the client doesn't shrink the buffer at any time, requiring rather
     cumbersome SIGBUS handling.
  2) A process wants to perform an RPC on another process. To avoid huge
     bandwidth consumption, zero-copy is preferred. After a message is
     assembled in-memory and a FD is passed to the remote side, both sides
     want to be sure that neither modifies this shared copy, anymore. The
     source may have put sensible data into the message without a separate
     copy and the target may want to parse the message inline, to avoid a
     local copy.

While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
ways to achieve most of this, the first one is unproportionally ugly to
use in libraries and the latter two are broken/racy or even disabled due
to denial of service attacks.

This patch introduces the concept of SEALING. If you seal a file, a
specific set of operations is blocked on that file forever.
Unlike locks, seals can only be set, never removed. Hence, once you
verified a specific set of seals is set, you're guaranteed that no-one can
perform the blocked operations on this file, anymore.

An initial set of SEALS is introduced by this patch:
  - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
            in size. This affects ftruncate() and open(O_TRUNC).
  - GROW: If SEAL_GROW is set, the file in question cannot be increased
          in size. This affects ftruncate(), fallocate() and write().
  - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
           are possible. This affects fallocate(PUNCH_HOLE), mmap() and
           write().
  - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
          This basically prevents the F_ADD_SEAL operation on a file and
          can be set to prevent others from adding further seals that you
          don't want.

The described use-cases can easily use these seals to provide safe use
without any trust-relationship:
  1) The graphics server can verify that a passed file-descriptor has
     SEAL_SHRINK set. This allows safe scanout, while the client is
     allowed to increase buffer size for window-resizing on-the-fly.
     Concurrent writes are explicitly allowed.
  2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
     SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
     process can modify the data while the other side parses it.
     Furthermore, it guarantees that even with writable FDs passed to the
     peer, it cannot increase the size to hit memory-limits of the source
     process (in case the file-storage is accounted to the source).

The new API is an extension to fcntl(), adding two new commands:
  F_GET_SEALS: Return a bitset describing the seals on the file. This
               can be called on any FD if the underlying file supports
               sealing.
  F_ADD_SEALS: Change the seals of a given file. This requires WRITE
               access to the file and F_SEAL_SEAL may not already be set.
               Furthermore, the underlying file must support sealing and
               there may not be any existing shared mapping of that file.
               Otherwise, EBADF/EPERM is returned.
               The given seals are _added_ to the existing set of seals
               on the file. You cannot remove seals again.

The fcntl() handler is currently specific to shmem and disabled on all
files. A file needs to explicitly support sealing for this interface to
work. A separate syscall is added in a follow-up, which creates files that
support sealing. There is no intention to support this on other
file-systems. Semantics are unclear for non-volatile files and we lack any
use-case right now. Therefore, the implementation is specific to shmem.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 fs/fcntl.c                 |   5 ++
 include/linux/shmem_fs.h   |  20 ++++++
 include/uapi/linux/fcntl.h |  15 +++++
 mm/shmem.c                 | 162 ++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 199 insertions(+), 3 deletions(-)

diff --git a/fs/fcntl.c b/fs/fcntl.c
index 9ead159..1a7a722 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -21,6 +21,7 @@
 #include <linux/rcupdate.h>
 #include <linux/pid_namespace.h>
 #include <linux/user_namespace.h>
+#include <linux/shmem_fs.h>
 
 #include <asm/poll.h>
 #include <asm/siginfo.h>
@@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
 	case F_GETPIPE_SZ:
 		err = pipe_fcntl(filp, cmd, arg);
 		break;
+	case F_ADD_SEALS:
+	case F_GET_SEALS:
+		err = shmem_fcntl(filp, cmd, arg);
+		break;
 	default:
 		break;
 	}
diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 4d1771c..c043d67 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -1,6 +1,7 @@
 #ifndef __SHMEM_FS_H
 #define __SHMEM_FS_H
 
+#include <linux/file.h>
 #include <linux/swap.h>
 #include <linux/mempolicy.h>
 #include <linux/pagemap.h>
@@ -20,6 +21,7 @@ struct shmem_inode_info {
 	struct shared_policy	policy;		/* NUMA memory alloc policy */
 	struct list_head	swaplist;	/* chain of maybes on swap */
 	struct simple_xattrs	xattrs;		/* list of xattrs */
+	u32			seals;		/* shmem seals */
 	struct inode		vfs_inode;
 };
 
@@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
 					mapping_gfp_mask(mapping));
 }
 
+/* marks inode to support sealing */
+#define SHMEM_ALLOW_SEALING (1U << 31)
+
+#ifdef CONFIG_SHMEM
+
+extern int shmem_add_seals(struct file *file, u32 seals);
+extern int shmem_get_seals(struct file *file);
+extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
+
+#else
+
+static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
+{
+	return -EINVAL;
+}
+
+#endif
+
 #endif
diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 074b886..1b9b9f4 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -28,6 +28,21 @@
 #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
 
 /*
+ * Add/Get seals
+ */
+#define F_ADD_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
+#define F_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
+
+/*
+ * Types of seals
+ */
+#define F_SEAL_SEAL	0x0001	/* prevent further seals from being set */
+#define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
+#define F_SEAL_GROW	0x0004	/* prevent file from growing */
+#define F_SEAL_WRITE	0x0008	/* prevent writes */
+/* (1U << 31) is reserved for internal use */
+
+/*
  * Types of directory notifications that may be requested.
  */
 #define DN_ACCESS	0x00000001	/* File accessed */
diff --git a/mm/shmem.c b/mm/shmem.c
index 9f70e02..175a5b8 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/fcntl.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
 static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
 {
 	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	loff_t oldsize = inode->i_size;
+	loff_t newsize = attr->ia_size;
 	int error;
 
 	error = inode_change_ok(inode, attr);
 	if (error)
 		return error;
 
-	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
-		loff_t oldsize = inode->i_size;
-		loff_t newsize = attr->ia_size;
+	/* protected by i_mutex */
+	if (attr->ia_valid & ATTR_SIZE) {
+		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
+		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
+			return -EPERM;
+	}
 
+	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
 		if (newsize != oldsize) {
 			i_size_write(inode, newsize);
 			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
@@ -1289,6 +1297,13 @@ out_nomem:
 
 static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct inode *inode = file_inode(file);
+	struct shmem_inode_info *info = SHMEM_I(inode);
+
+	/* protected by mmap_sem */
+	if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
+		return -EPERM;
+
 	file_accessed(file);
 	vma->vm_ops = &shmem_vm_ops;
 	return 0;
@@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+
+	/* i_mutex is held by caller */
+	if (info->seals & F_SEAL_WRITE)
+		return -EPERM;
+	if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
+		return -EPERM;
+
 	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
 }
 
@@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
 	return offset;
 }
 
+#define F_ALL_SEALS (F_SEAL_SEAL | \
+		     F_SEAL_SHRINK | \
+		     F_SEAL_GROW | \
+		     F_SEAL_WRITE)
+
+int shmem_add_seals(struct file *file, u32 seals)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	int r;
+
+	/* SHMEM_ALLOW_SEALING is a private, unused bit */
+	BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);
+
+	/*
+	 * SEALING
+	 * Sealing allows multiple parties to share a shmem-file but restrict
+	 * access to a specific subset of file operations. Seals can only be
+	 * added, but never removed. This way, mutually untrusted parties can
+	 * share common memory regions with a well-defined policy. A malicious
+	 * peer can thus never perform unwanted operations on a shared object.
+	 *
+	 * Seals are only supported on special shmem-files and always affect
+	 * the whole underlying inode. Once a seal is set, it may prevent some
+	 * kinds of access to the file. Currently, the following seals are
+	 * defined:
+	 *   SEAL_SEAL: Prevent further seals from being set on this file
+	 *   SEAL_SHRINK: Prevent the file from shrinking
+	 *   SEAL_GROW: Prevent the file from growing
+	 *   SEAL_WRITE: Prevent write access to the file
+	 *
+	 * As we don't require any trust relationship between two parties, we
+	 * must prevent seals from being removed. Therefore, sealing a file
+	 * only adds a given set of seals to the file, it never touches
+	 * existing seals. Furthermore, the "setting seals"-operation can be
+	 * sealed itself, which basically prevents any further seal from being
+	 * added.
+	 *
+	 * Semantics of sealing are only defined on volatile files. Only
+	 * anonymous shmem files support sealing. More importantly, seals are
+	 * never written to disk. Therefore, there's no plan to support it on
+	 * other file types.
+	 */
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+	if (!(info->seals & SHMEM_ALLOW_SEALING))
+		return -EBADF;
+	if (!(file->f_mode & FMODE_WRITE))
+		return -EPERM;
+	if (seals & ~(u32)F_ALL_SEALS)
+		return -EINVAL;
+
+	/*
+	 * - i_mutex prevents racing write/ftruncate/fallocate/..
+	 * - mmap_sem prevents racing mmap() calls
+	 */
+
+	mutex_lock(&inode->i_mutex);
+	down_read(&current->mm->mmap_sem);
+
+	/* you cannot seal while shared mappings exist */
+	if (file->f_mapping->i_mmap_writable > 0) {
+		r = -EPERM;
+		goto unlock;
+	}
+
+	if (info->seals & F_SEAL_SEAL) {
+		r = -EPERM;
+		goto unlock;
+	}
+
+	info->seals |= seals;
+	r = 0;
+
+unlock:
+	up_read(&current->mm->mmap_sem);
+	mutex_unlock(&inode->i_mutex);
+	return r;
+}
+EXPORT_SYMBOL(shmem_add_seals);
+
+int shmem_get_seals(struct file *file)
+{
+	struct shmem_inode_info *info;
+
+	if (file->f_op != &shmem_file_operations)
+		return -EBADF;
+
+	info = SHMEM_I(file_inode(file));
+	if (!(info->seals & SHMEM_ALLOW_SEALING))
+		return -EBADF;
+
+	return info->seals & F_ALL_SEALS;
+}
+EXPORT_SYMBOL(shmem_get_seals);
+
+long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
+{
+	long r;
+
+	switch (cmd) {
+	case F_ADD_SEALS:
+		/* disallow upper 32bit */
+		if (arg > UINT_MAX)
+			return -EINVAL;
+
+		r = shmem_add_seals(file, arg);
+		break;
+	case F_GET_SEALS:
+		r = shmem_get_seals(file);
+		break;
+	default:
+		r = -EINVAL;
+		break;
+	}
+
+	return r;
+}
+
 static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 							 loff_t len)
 {
 	struct inode *inode = file_inode(file);
 	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
+	struct shmem_inode_info *info = SHMEM_I(inode);
 	struct shmem_falloc shmem_falloc;
 	pgoff_t start, index, end;
 	int error;
@@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 		loff_t unmap_start = round_up(offset, PAGE_SIZE);
 		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
 
+		/* protected by i_mutex */
+		if (info->seals & F_SEAL_WRITE) {
+			error = -EPERM;
+			goto out;
+		}
+
 		if ((u64)unmap_end > (u64)unmap_start)
 			unmap_mapping_range(mapping, unmap_start,
 					    1 + unmap_end - unmap_start, 0);
@@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 	if (error)
 		goto out;
 
+	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
+		error = -EPERM;
+		goto out;
+	}
+
 	start = offset >> PAGE_CACHE_SHIFT;
 	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
 	/* Try to avoid a swapstorm if len is impossible to satisfy */
-- 
1.9.2
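
For readers following the diff above, the add-only seal rule and the size
check can be modelled in a few lines of userspace C. This is only a sketch:
model_add_seals(), model_resize_allowed() and the MODEL_* names are
hypothetical, locking and the shared-mapping check are left out, and the
seal values are copied from the uapi hunk of this patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Seal bits, values copied from the include/uapi/linux/fcntl.h hunk. */
#define MODEL_SEAL_SEAL   0x0001u
#define MODEL_SEAL_SHRINK 0x0002u
#define MODEL_SEAL_GROW   0x0004u
#define MODEL_SEAL_WRITE  0x0008u
#define MODEL_ALL_SEALS   (MODEL_SEAL_SEAL | MODEL_SEAL_SHRINK | \
			   MODEL_SEAL_GROW | MODEL_SEAL_WRITE)

/* Hypothetical model of the add-only rule; returns 0 or a negative errno. */
static int model_add_seals(uint32_t *cur, uint32_t add)
{
	if (add & ~MODEL_ALL_SEALS)
		return -EINVAL;		/* unknown seal bits */
	if (*cur & MODEL_SEAL_SEAL)
		return -EPERM;		/* sealing itself has been sealed */
	*cur |= add;			/* seals are only ever added */
	return 0;
}

/* Hypothetical model of the size check added to shmem_setattr(). */
static int model_resize_allowed(uint32_t seals, long long oldsize,
				long long newsize)
{
	if (newsize < oldsize && (seals & MODEL_SEAL_SHRINK))
		return -EPERM;
	if (newsize > oldsize && (seals & MODEL_SEAL_GROW))
		return -EPERM;
	return 0;
}
```

Note that the model, like the patch, never clears a bit: the only way a
seal set changes is by OR-ing in new seals while F_SEAL_SEAL is unset.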

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-04-15 18:38 ` David Herrmann
@ 2014-04-15 18:38   ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-04-15 18:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
that you can pass to mmap(). It can support sealing and avoids any
connection to user-visible mount-points. Thus, it's not subject to quotas
on mounted file-systems and can be used like malloc()'ed memory that comes
with a file-descriptor.

memfd_create() does not create a front-FD, but instead returns the raw
shmem file, so calls like ftruncate() can be used. Calls like fstat() also
return proper information and report the file as a regular file. If you
want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
supported (like on all other regular files).

Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
subject to quotas and the like.
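
A minimal userspace sketch of the workflow described above. It is an
illustration under stated assumptions, not part of this series: it uses the
two-argument memfd_create(name, flags) wrapper that current glibc provides
(the syscall in this patch still takes a size argument, so ftruncate() sets
the size here), and make_sealed_buffer() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an anonymous memory file, fill it with
 * `len` bytes from `data`, then seal it against resizing and writes.
 * Returns the file descriptor, or -1 on error.
 */
static int make_sealed_buffer(const char *name, const void *data, size_t len)
{
	int fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);

	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t)len) < 0 ||
	    pwrite(fd, data, len, 0) != (ssize_t)len ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

A receiver of such a descriptor would typically call
fcntl(fd, F_GET_SEALS) and check the returned mask before mapping the
buffer.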

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 arch/x86/syscalls/syscall_32.tbl |  1 +
 arch/x86/syscalls/syscall_64.tbl |  1 +
 include/linux/syscalls.h         |  1 +
 include/uapi/linux/memfd.h       | 10 ++++++
 kernel/sys_ni.c                  |  1 +
 mm/shmem.c                       | 74 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 88 insertions(+)
 create mode 100644 include/uapi/linux/memfd.h

diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
index 96bc506..c943b8a 100644
--- a/arch/x86/syscalls/syscall_32.tbl
+++ b/arch/x86/syscalls/syscall_32.tbl
@@ -359,3 +359,4 @@
 350	i386	finit_module		sys_finit_module
 351	i386	sched_setattr		sys_sched_setattr
 352	i386	sched_getattr		sys_sched_getattr
+353	i386	memfd_create		sys_memfd_create
diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
index 04376ac..dfcfd6f 100644
--- a/arch/x86/syscalls/syscall_64.tbl
+++ b/arch/x86/syscalls/syscall_64.tbl
@@ -323,6 +323,7 @@
 314	common	sched_setattr		sys_sched_setattr
 315	common	sched_getattr		sys_sched_getattr
 316	common	renameat2		sys_renameat2
+317	common	memfd_create		sys_memfd_create
 
 #
 # x32-specific system call numbers start at 512 to avoid cache impact
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a4a0588..133b705 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
 asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
 asmlinkage long sys_eventfd(unsigned int count);
 asmlinkage long sys_eventfd2(unsigned int count, int flags);
+asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
 asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
 asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
 asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
new file mode 100644
index 0000000..c4a6db0
--- /dev/null
+++ b/include/uapi/linux/memfd.h
@@ -0,0 +1,10 @@
+#ifndef _UAPI_LINUX_MEMFD_H
+#define _UAPI_LINUX_MEMFD_H
+
+#include <linux/types.h>
+
+/* flags for memfd_create(2) (u64) */
+#define MFD_CLOEXEC		0x0001ULL
+#define MFD_ALLOW_SEALING	0x0002ULL
+
+#endif /* _UAPI_LINUX_MEMFD_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index bc8d1b7..f96c329 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -195,6 +195,7 @@ cond_syscall(compat_sys_timerfd_settime);
 cond_syscall(compat_sys_timerfd_gettime);
 cond_syscall(sys_eventfd);
 cond_syscall(sys_eventfd2);
+cond_syscall(sys_memfd_create);
 
 /* performance counters: */
 cond_syscall(sys_perf_event_open);
diff --git a/mm/shmem.c b/mm/shmem.c
index 175a5b8..203cc4e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
 #include <linux/highmem.h>
 #include <linux/seq_file.h>
 #include <linux/magic.h>
+#include <linux/syscalls.h>
 #include <linux/fcntl.h>
+#include <uapi/linux/memfd.h>
 
 #include <asm/uaccess.h>
 #include <asm/pgtable.h>
@@ -2919,6 +2921,78 @@ out4:
 	return error;
 }
 
+#define MFD_NAME_PREFIX "memfd:"
+#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
+#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
+
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
+
+SYSCALL_DEFINE3(memfd_create,
+		const char*, uname,
+		u64, size,
+		u64, flags)
+{
+	struct shmem_inode_info *info;
+	struct file *shm;
+	char *name;
+	int fd, r;
+	long len;
+
+	if (flags & ~(u64)MFD_ALL_FLAGS)
+		return -EINVAL;
+	if ((u64)(loff_t)size != size || (loff_t)size < 0)
+		return -EINVAL;
+
+	/* length includes terminating zero */
+	len = strnlen_user(uname, MFD_NAME_MAX_LEN);
+	if (len <= 0)
+		return -EFAULT;
+	else if (len > MFD_NAME_MAX_LEN)
+		return -EINVAL;
+
+	name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_KERNEL);
+	if (!name)
+		return -ENOMEM;
+
+	strcpy(name, MFD_NAME_PREFIX);
+	if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	/* terminating-zero may have changed after strnlen_user() returned */
+	if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
+		r = -EFAULT;
+		goto err_name;
+	}
+
+	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
+	if (fd < 0) {
+		r = fd;
+		goto err_name;
+	}
+
+	shm = shmem_file_setup(name, size, 0);
+	if (IS_ERR(shm)) {
+		r = PTR_ERR(shm);
+		goto err_fd;
+	}
+	info = SHMEM_I(file_inode(shm));
+	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
+	if (flags & MFD_ALLOW_SEALING)
+		info->seals |= SHMEM_ALLOW_SEALING;
+
+	fd_install(fd, shm);
+	kfree(name);
+	return fd;
+
+err_fd:
+	put_unused_fd(fd);
+err_name:
+	kfree(name);
+	return r;
+}
+
 #else /* !CONFIG_SHMEM */
 
 /*
-- 
1.9.2
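
The name handling in the syscall above prefixes the user string with
"memfd:" and bounds the result by NAME_MAX. A userspace sketch of that rule
as this version of the patch reads, with the hypothetical helper
model_memfd_name(); the user-copy race check done by the kernel code has no
equivalent here and is omitted.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#ifndef NAME_MAX
#define NAME_MAX 255	/* fallback; Linux defines this via <limits.h> */
#endif

#define MODEL_PREFIX     "memfd:"
#define MODEL_PREFIX_LEN (sizeof(MODEL_PREFIX) - 1)
#define MODEL_NAME_MAX   (NAME_MAX - MODEL_PREFIX_LEN)

/*
 * Hypothetical model: build "memfd:<uname>" in a fresh buffer.
 * Returns 0 and stores the malloc()ed string in *out, or a negative errno.
 */
static int model_memfd_name(const char *uname, char **out)
{
	size_t len;

	if (!uname)
		return -EFAULT;

	len = strlen(uname) + 1;	/* length includes terminating zero */
	if (len > MODEL_NAME_MAX)
		return -EINVAL;

	*out = malloc(len + MODEL_PREFIX_LEN);
	if (!*out)
		return -ENOMEM;

	memcpy(*out, MODEL_PREFIX, MODEL_PREFIX_LEN);
	memcpy(*out + MODEL_PREFIX_LEN, uname, len);
	return 0;
}
```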


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
  2014-04-15 18:38 ` David Herrmann
@ 2014-04-15 18:38   ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-04-15 18:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

Some basic tests to verify sealing on memfds works as expected and
guarantees the advertised semantics.
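
The helpers in the test pair up as mfd_assert_*() (abort if the call fails)
and mfd_fail_*() (abort if it succeeds). When such a helper checks for a
specific error, keep in mind that libc wrappers return -1 and report the
cause in errno. A small sketch of that convention, using the hypothetical
helper failed_with() and a read-only descriptor so that it does not depend
on memfd support:

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: true if a libc call failed with the given errno. */
static int failed_with(long ret, int expected_errno)
{
	return ret == -1 && errno == expected_errno;
}

/* Try to write through a descriptor opened read-only; expect EBADF. */
static int write_to_readonly_fails(void)
{
	int fd = open("/dev/null", O_RDONLY);
	long ret;
	int ok;

	if (fd < 0)
		return 0;

	ret = write(fd, "data", 4);
	ok = failed_with(ret, EBADF);	/* check errno before close() */
	close(fd);
	return ok;
}
```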

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
 4 files changed, 976 insertions(+)
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed..c57325a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
 TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += kcmp
+TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mqueue
 TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+	ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+	@./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+	$(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..3e105ea
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,944 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65536
+
+static int sys_memfd_create(const char *name,
+			    __u64 size,
+			    __u64 flags)
+{
+	return syscall(__NR_memfd_create, name, size, flags);
+}
+
+static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, sz, flags);
+	if (r < 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
+		       name,
+		       (unsigned long long)sz,
+		       (unsigned long long)flags);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, size, flags);
+	if (r >= 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
+		       name,
+		       (unsigned long long)size,
+		       (unsigned long long)flags);
+		close(r);
+		abort();
+	}
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r < 0) {
+		printf("GET_SEALS(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r >= 0) {
+		printf("GET_SEALS(%d) succeeded, but failure expected\n", fd);
+		abort();
+	}
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	if (s != seals) {
+		printf("%llu != %llu = GET_SEALS(%d)\n",
+		       (unsigned long long)seals, (unsigned long long)s, fd);
+		abort();
+	}
+}
+
+static void mfd_assert_add_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, F_ADD_SEALS, seals);
+	if (r < 0) {
+		printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_fail_add_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r < 0)
+		s = 0;
+	else
+		s = r;
+
+	r = fcntl(fd, F_ADD_SEALS, seals);
+	if (r >= 0) {
+		printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+	struct stat st;
+	int r;
+
+	r = fstat(fd, &st);
+	if (r < 0) {
+		printf("fstat(%d) failed: %m\n", fd);
+		abort();
+	} else if (st.st_size != size) {
+		printf("wrong file size %lld, but expected %lld\n",
+		       (long long)st.st_size, (long long)size);
+		abort();
+	}
+}
+
+static int mfd_assert_dup(int fd)
+{
+	int r;
+
+	r = dup(fd);
+	if (r < 0) {
+		printf("dup(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r < 0) {
+		printf("open(%s) failed: %m\n", buf);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r >= 0) {
+		printf("open(%s) didn't fail as expected\n", buf);
+		abort();
+	}
+}
+
+static void mfd_assert_read(int fd)
+{
+	char buf[16];
+	void *p;
+	ssize_t l;
+
+	l = read(fd, buf, sizeof(buf));
+	if (l != sizeof(buf)) {
+		printf("read() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ *is* allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify MAP_PRIVATE is *always* allowed (even writable) */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() succeeds */
+	l = write(fd, "\0\0\0\0", 4);
+	if (l != 4) {
+		printf("write() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_READ with MAP_SHARED is allowed and a following
+	 * mprotect(PROT_WRITE) allows writing */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+	if (r < 0) {
+		printf("mprotect() failed: %m\n");
+		abort();
+	}
+
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PUNCH_HOLE works */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_fail_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() fails */
+	l = write(fd, "data", 4);
+	if (l != -1 || errno != EPERM) {
+		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_READ with MAP_SHARED is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PUNCH_HOLE fails */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r >= 0) {
+		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_shrink(int fd)
+{
+	int r, fd2;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r < 0) {
+		printf("ftruncate(SHRINK) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+	fd2 = mfd_assert_open(fd,
+			      O_RDWR | O_CREAT | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+	close(fd2);
+
+	mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r >= 0) {
+		printf("ftruncate(SHRINK) didn't fail as expected\n");
+		abort();
+	}
+
+	mfd_fail_open(fd,
+		      O_RDWR | O_CREAT | O_TRUNC,
+		      S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r < 0) {
+		printf("ftruncate(GROW) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r >= 0) {
+		printf("ftruncate(GROW) didn't fail as expected\n");
+		abort();
+	}
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r >= 0) {
+		printf("fallocate(ALLOC) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l != sizeof(buf)) {
+		printf("pwrite() failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l == sizeof(buf)) {
+		printf("pwrite() didn't fail as expected\n");
+		abort();
+	}
+}
+
+static int idle_thread_fn(void *arg)
+{
+	sigset_t set;
+	int sig;
+
+	/* dummy waiter; SIGTERM terminates us anyway */
+	sigemptyset(&set);
+	sigaddset(&set, SIGTERM);
+	sigwait(&set, &sig);
+
+	return 0;
+}
+
+static pid_t spawn_idle_thread(void)
+{
+	uint8_t *stack;
+	pid_t pid;
+
+	stack = malloc(STACK_SIZE);
+	if (!stack) {
+		printf("malloc(STACK_SIZE) failed: %m\n");
+		abort();
+	}
+
+	pid = clone(idle_thread_fn,
+		    stack + STACK_SIZE,
+		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		printf("clone() failed: %m\n");
+		abort();
+	}
+
+	return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+static pid_t spawn_idle_proc(void)
+{
+	pid_t pid;
+	sigset_t set;
+	int sig;
+
+	pid = fork();
+	if (pid < 0) {
+		printf("fork() failed: %m\n");
+		abort();
+	} else if (!pid) {
+		/* dummy waiter; SIGTERM terminates us anyway */
+		sigemptyset(&set);
+		sigaddset(&set, SIGTERM);
+		sigwait(&set, &sig);
+		exit(0);
+	}
+
+	return pid;
+}
+
+static void join_idle_proc(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+	char buf[2048];
+	int fd;
+
+	/* test NULL name */
+	mfd_fail_new(NULL, 0, 0);
+
+	/* test over-long name (not zero-terminated) */
+	memset(buf, 0xff, sizeof(buf));
+	mfd_fail_new(buf, 0, 0);
+
+	/* test over-long zero-terminated name */
+	memset(buf, 0xff, sizeof(buf));
+	buf[sizeof(buf) - 1] = 0;
+	mfd_fail_new(buf, 0, 0);
+
+	/* verify "" is a valid name */
+	fd = mfd_assert_new("", 0, 0);
+	close(fd);
+
+	/* verify invalid flags are rejected */
+	mfd_fail_new("", 0, 0x0100);
+	mfd_fail_new("", 0, ~MFD_CLOEXEC);
+	mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
+	mfd_fail_new("", 0, ~0);
+	mfd_fail_new("", 0, 0x8000000000000000ULL);
+
+	/* verify MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+	close(fd);
+
+	/* verify MFD_ALLOW_SEALING is allowed */
+	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
+	close(fd);
+
+	/* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
+	close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+	/* add basic seals */
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+
+	/* add them again */
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+
+	/* add more seals and seal against sealing */
+	mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_GROW |
+				 F_SEAL_WRITE |
+				 F_SEAL_SEAL);
+
+	/* verify that sealing no longer works */
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	mfd_fail_add_seals(fd, 0);
+
+	close(fd);
+
+	/* verify sealing does not work without MFD_ALLOW_SEALING */
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_fail_get_seals(fd);
+	mfd_fail_add_seals(fd, F_SEAL_SHRINK |
+			       F_SEAL_GROW |
+			       F_SEAL_WRITE);
+	mfd_fail_get_seals(fd);
+	close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_write",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+
+	mfd_assert_read(fd);
+	mfd_fail_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_shrink",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_assert_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_grow",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_GROW);
+	mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_resize",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test that seals are shared between dupped FDs and they're all equal.
+ */
+static void test_share_dup(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_dup",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_has_seals(fd2, 0);
+
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	mfd_assert_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	mfd_fail_add_seals(fd2, F_SEAL_GROW);
+	mfd_fail_add_seals(fd, F_SEAL_SEAL);
+	mfd_fail_add_seals(fd2, F_SEAL_SEAL);
+
+	close(fd2);
+
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	close(fd);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Adding seals is only allowed if no shared, writable mmap() refs exist.
+ */
+static void test_share_mmap(void)
+{
+	int fd;
+	void *p;
+
+	fd = mfd_assert_new("kern_memfd_share_mmap",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	/* shared/writable ref prevents sealing */
+	p = mfd_assert_mmap_shared(fd);
+	mfd_fail_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, 0);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* readable ref allows sealing */
+	p = mfd_assert_mmap_private(fd);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+	munmap(p, MFD_DEF_SIZE);
+
+	close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	close(fd);
+	fd = mfd_assert_open(fd2, O_RDONLY, 0);
+
+	mfd_fail_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	close(fd2);
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+	close(fd2);
+	close(fd);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications work as expected with forked children.
+ */
+static void test_share_fork(void)
+{
+	int fd;
+	pid_t pid;
+
+	fd = mfd_assert_new("kern_memfd_share_fork",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	pid = spawn_idle_proc();
+	mfd_assert_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	mfd_fail_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	join_idle_proc(pid);
+
+	mfd_fail_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+
+	printf("memfd: CREATE\n");
+	test_create();
+	printf("memfd: BASIC\n");
+	test_basic();
+
+	printf("memfd: SEAL-WRITE\n");
+	test_seal_write();
+	printf("memfd: SEAL-SHRINK\n");
+	test_seal_shrink();
+	printf("memfd: SEAL-GROW\n");
+	test_seal_grow();
+	printf("memfd: SEAL-RESIZE\n");
+	test_seal_resize();
+
+	printf("memfd: SHARE-DUP\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK\n");
+	test_share_fork();
+
+	/* Run test-suite in a multi-threaded environment with a shared
+	 * file-table. */
+	pid = spawn_idle_thread();
+	printf("memfd: SHARE-DUP (shared file-table)\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP (shared file-table)\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN (shared file-table)\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK (shared file-table)\n");
+	test_share_fork();
+	join_idle_thread(pid);
+
+	printf("memfd: DONE\n");
+
+	return 0;
+}
-- 
1.9.2
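
For readers who want to try the seal semantics exercised by test_basic()
and test_seal_write() outside the selftest, here is a minimal sketch. It
is not part of the patch. It assumes the two-argument memfd_create()
wrapper from glibc (the interface as it was later merged), not the
(name, size, flags) form used in this v2 series, so the size is set with
ftruncate(). The function name seal_lifecycle_demo() and the step
numbers it returns are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>	/* for callers that assert on the result */
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Walk one memfd through the seal lifecycle.
 * Returns 0 if every step behaves as the selftest expects, otherwise the
 * number of the first step that did not.
 * Assumption: glibc memfd_create(name, flags), not the v2 syscall form.
 */
static int seal_lifecycle_demo(void)
{
	const int mask = F_SEAL_SEAL | F_SEAL_SHRINK |
			 F_SEAL_GROW | F_SEAL_WRITE;
	int fd, seals, ret = 0;

	fd = memfd_create("seal_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return 1;

	if (ftruncate(fd, 8192) < 0)
		ret = 2;	/* could not size the file */
	else if ((fcntl(fd, F_GET_SEALS) & mask) != 0)
		ret = 3;	/* a fresh memfd should carry none of these */
	else if (fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_WRITE) < 0)
		ret = 4;	/* adding seals should succeed */
	else if (write(fd, "data", 4) >= 0 || errno != EPERM)
		ret = 5;	/* F_SEAL_WRITE should reject write() */
	else if (ftruncate(fd, 4096) >= 0)
		ret = 6;	/* F_SEAL_SHRINK should reject shrinking */
	else if (fcntl(fd, F_ADD_SEALS, F_SEAL_SEAL) < 0)
		ret = 7;	/* sealing the seal set should succeed */
	else if (fcntl(fd, F_ADD_SEALS, F_SEAL_GROW) >= 0)
		ret = 8;	/* no further seals after F_SEAL_SEAL */
	else {
		seals = fcntl(fd, F_GET_SEALS);
		if ((seals & mask) !=
		    (F_SEAL_SHRINK | F_SEAL_WRITE | F_SEAL_SEAL))
			ret = 9;	/* seal set should be unchanged */
	}

	close(fd);
	return ret;
}
```

Calling seal_lifecycle_demo() from a main() and checking for a zero
return is enough to run it. The seals are masked before comparison so
that seal bits introduced by later kernels do not disturb the check.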


^ permalink raw reply related	[flat|nested] 53+ messages in thread

* [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
@ 2014-04-15 18:38   ` David Herrmann
  0 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-04-15 18:38 UTC (permalink / raw)
  To: linux-kernel
  Cc: Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Høgsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, David Herrmann

Some basic tests to verify sealing on memfds works as expected and
guarantees the advertised semantics.

Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
---
 tools/testing/selftests/Makefile           |   1 +
 tools/testing/selftests/memfd/.gitignore   |   2 +
 tools/testing/selftests/memfd/Makefile     |  29 +
 tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
 4 files changed, 976 insertions(+)
 create mode 100644 tools/testing/selftests/memfd/.gitignore
 create mode 100644 tools/testing/selftests/memfd/Makefile
 create mode 100644 tools/testing/selftests/memfd/memfd_test.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 32487ed..c57325a 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -2,6 +2,7 @@ TARGETS = breakpoints
 TARGETS += cpu-hotplug
 TARGETS += efivarfs
 TARGETS += kcmp
+TARGETS += memfd
 TARGETS += memory-hotplug
 TARGETS += mqueue
 TARGETS += net
diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
new file mode 100644
index 0000000..bcc8ee2
--- /dev/null
+++ b/tools/testing/selftests/memfd/.gitignore
@@ -0,0 +1,2 @@
+memfd_test
+memfd-test-file
diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
new file mode 100644
index 0000000..36653b9
--- /dev/null
+++ b/tools/testing/selftests/memfd/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+	ARCH := X86
+endif
+ifeq ($(ARCH),x86_64)
+	ARCH := X86
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/uapi/
+CFLAGS += -I../../../../arch/x86/include/uapi/
+CFLAGS += -I../../../../include/uapi/
+CFLAGS += -I../../../../include/
+
+all:
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+else
+	echo "Not an x86 target, can't build memfd selftest"
+endif
+
+run_tests: all
+ifeq ($(ARCH),X86)
+	gcc $(CFLAGS) memfd_test.c -o memfd_test
+endif
+	@./memfd_test || echo "memfd_test: [FAIL]"
+
+clean:
+	$(RM) memfd_test
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
new file mode 100644
index 0000000..3e105ea
--- /dev/null
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -0,0 +1,944 @@
+#define _GNU_SOURCE
+#define __EXPORTED_HEADERS__
+
+#include <errno.h>
+#include <inttypes.h>
+#include <limits.h>
+#include <linux/falloc.h>
+#include <linux/fcntl.h>
+#include <linux/memfd.h>
+#include <sched.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <string.h>
+#include <sys/mman.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+#define MFD_DEF_SIZE 8192
+#define STACK_SIZE 65536
+
+static int sys_memfd_create(const char *name,
+			    __u64 size,
+			    __u64 flags)
+{
+	return syscall(__NR_memfd_create, name, size, flags);
+}
+
+static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, sz, flags);
+	if (r < 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
+		       name,
+		       (unsigned long long)sz,
+		       (unsigned long long)flags);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
+{
+	int r;
+
+	r = sys_memfd_create(name, size, flags);
+	if (r >= 0) {
+		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
+		       name,
+		       (unsigned long long)size,
+		       (unsigned long long)flags);
+		close(r);
+		abort();
+	}
+}
+
+static __u64 mfd_assert_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r < 0) {
+		printf("GET_SEALS(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_get_seals(int fd)
+{
+	long r;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r >= 0) {
+		printf("GET_SEALS(%d) succeeded, but failure expected\n", fd);
+		abort();
+	}
+}
+
+static void mfd_assert_has_seals(int fd, __u64 seals)
+{
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	if (s != seals) {
+		printf("%llu != %llu = GET_SEALS(%d)\n",
+		       (unsigned long long)seals, (unsigned long long)s, fd);
+		abort();
+	}
+}
+
+static void mfd_assert_add_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	s = mfd_assert_get_seals(fd);
+	r = fcntl(fd, F_ADD_SEALS, seals);
+	if (r < 0) {
+		printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_fail_add_seals(int fd, __u64 seals)
+{
+	long r;
+	__u64 s;
+
+	r = fcntl(fd, F_GET_SEALS);
+	if (r < 0)
+		s = 0;
+	else
+		s = r;
+
+	r = fcntl(fd, F_ADD_SEALS, seals);
+	if (r >= 0) {
+		printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
+		       fd, (unsigned long long)s, (unsigned long long)seals);
+		abort();
+	}
+}
+
+static void mfd_assert_size(int fd, size_t size)
+{
+	struct stat st;
+	int r;
+
+	r = fstat(fd, &st);
+	if (r < 0) {
+		printf("fstat(%d) failed: %m\n", fd);
+		abort();
+	} else if (st.st_size != size) {
+		printf("wrong file size %lld, but expected %lld\n",
+		       (long long)st.st_size, (long long)size);
+		abort();
+	}
+}
+
+static int mfd_assert_dup(int fd)
+{
+	int r;
+
+	r = dup(fd);
+	if (r < 0) {
+		printf("dup(%d) failed: %m\n", fd);
+		abort();
+	}
+
+	return r;
+}
+
+static void *mfd_assert_mmap_shared(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static void *mfd_assert_mmap_private(int fd)
+{
+	void *p;
+
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	return p;
+}
+
+static int mfd_assert_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r < 0) {
+		printf("open(%s) failed: %m\n", buf);
+		abort();
+	}
+
+	return r;
+}
+
+static void mfd_fail_open(int fd, int flags, mode_t mode)
+{
+	char buf[512];
+	int r;
+
+	sprintf(buf, "/proc/self/fd/%d", fd);
+	r = open(buf, flags, mode);
+	if (r >= 0) {
+		printf("open(%s) didn't fail as expected\n", buf);
+		abort();
+	}
+}
+
+static void mfd_assert_read(int fd)
+{
+	char buf[16];
+	void *p;
+	ssize_t l;
+
+	l = read(fd, buf, sizeof(buf));
+	if (l != sizeof(buf)) {
+		printf("read() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ *is* allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify MAP_PRIVATE is *always* allowed (even writable) */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_PRIVATE,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	munmap(p, MFD_DEF_SIZE);
+}
+
+static void mfd_assert_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() succeeds */
+	l = write(fd, "\0\0\0\0", 4);
+	if (l != 4) {
+		printf("write() failed: %m\n");
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_WRITE is allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PROT_READ with MAP_SHARED is allowed and a following
+	 * mprotect(PROT_WRITE) allows writing */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p == MAP_FAILED) {
+		printf("mmap() failed: %m\n");
+		abort();
+	}
+
+	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
+	if (r < 0) {
+		printf("mprotect() failed: %m\n");
+		abort();
+	}
+
+	*(char*)p = 0;
+	munmap(p, MFD_DEF_SIZE);
+
+	/* verify PUNCH_HOLE works */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r < 0) {
+		printf("fallocate(PUNCH_HOLE) failed: %m\n");
+		abort();
+	}
+}
+
+static void mfd_fail_write(int fd)
+{
+	ssize_t l;
+	void *p;
+	int r;
+
+	/* verify write() fails */
+	l = write(fd, "data", 4);
+	if (l >= 0 || errno != EPERM) {
+		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
+		abort();
+	}
+
+	/* verify PROT_READ | PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ | PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_WRITE is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_WRITE,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PROT_READ with MAP_SHARED is not allowed */
+	p = mmap(NULL,
+		 MFD_DEF_SIZE,
+		 PROT_READ,
+		 MAP_SHARED,
+		 fd,
+		 0);
+	if (p != MAP_FAILED) {
+		printf("mmap() didn't fail as expected\n");
+		abort();
+	}
+
+	/* verify PUNCH_HOLE fails */
+	r = fallocate(fd,
+		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+		      0,
+		      MFD_DEF_SIZE);
+	if (r >= 0) {
+		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_shrink(int fd)
+{
+	int r, fd2;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r < 0) {
+		printf("ftruncate(SHRINK) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
+
+	fd2 = mfd_assert_open(fd,
+			      O_RDWR | O_CREAT | O_TRUNC,
+			      S_IRUSR | S_IWUSR);
+	close(fd2);
+
+	mfd_assert_size(fd, 0);
+}
+
+static void mfd_fail_shrink(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE / 2);
+	if (r >= 0) {
+		printf("ftruncate(SHRINK) didn't fail as expected\n");
+		abort();
+	}
+
+	mfd_fail_open(fd,
+		      O_RDWR | O_CREAT | O_TRUNC,
+		      S_IRUSR | S_IWUSR);
+}
+
+static void mfd_assert_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r < 0) {
+		printf("ftruncate(GROW) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r < 0) {
+		printf("fallocate(ALLOC) failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
+}
+
+static void mfd_fail_grow(int fd)
+{
+	int r;
+
+	r = ftruncate(fd, MFD_DEF_SIZE * 2);
+	if (r >= 0) {
+		printf("ftruncate(GROW) didn't fail as expected\n");
+		abort();
+	}
+
+	r = fallocate(fd,
+		      0,
+		      0,
+		      MFD_DEF_SIZE * 4);
+	if (r >= 0) {
+		printf("fallocate(ALLOC) didn't fail as expected\n");
+		abort();
+	}
+}
+
+static void mfd_assert_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l != sizeof(buf)) {
+		printf("pwrite() failed: %m\n");
+		abort();
+	}
+
+	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
+}
+
+static void mfd_fail_grow_write(int fd)
+{
+	static char buf[MFD_DEF_SIZE * 8];
+	ssize_t l;
+
+	l = pwrite(fd, buf, sizeof(buf), 0);
+	if (l == sizeof(buf)) {
+		printf("pwrite() didn't fail as expected\n");
+		abort();
+	}
+}
+
+static int idle_thread_fn(void *arg)
+{
+	sigset_t set;
+	int sig;
+
+	/* dummy waiter; SIGTERM terminates us anyway */
+	sigemptyset(&set);
+	sigaddset(&set, SIGTERM);
+	sigwait(&set, &sig);
+
+	return 0;
+}
+
+static pid_t spawn_idle_thread(void)
+{
+	uint8_t *stack;
+	pid_t pid;
+
+	stack = malloc(STACK_SIZE);
+	if (!stack) {
+		printf("malloc(STACK_SIZE) failed: %m\n");
+		abort();
+	}
+
+	pid = clone(idle_thread_fn,
+		    stack + STACK_SIZE,
+		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
+		    NULL);
+	if (pid < 0) {
+		printf("clone() failed: %m\n");
+		abort();
+	}
+
+	return pid;
+}
+
+static void join_idle_thread(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+static pid_t spawn_idle_proc(void)
+{
+	pid_t pid;
+	sigset_t set;
+	int sig;
+
+	pid = fork();
+	if (pid < 0) {
+		printf("fork() failed: %m\n");
+		abort();
+	} else if (!pid) {
+		/* dummy waiter; SIGTERM terminates us anyway */
+		sigemptyset(&set);
+		sigaddset(&set, SIGTERM);
+		sigwait(&set, &sig);
+		exit(0);
+	}
+
+	return pid;
+}
+
+static void join_idle_proc(pid_t pid)
+{
+	kill(pid, SIGTERM);
+	waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test memfd_create() syscall
+ * Verify syscall-argument validation, including name checks, flag validation
+ * and more.
+ */
+static void test_create(void)
+{
+	char buf[2048];
+	int fd;
+
+	/* test NULL name */
+	mfd_fail_new(NULL, 0, 0);
+
+	/* test over-long name (not zero-terminated) */
+	memset(buf, 0xff, sizeof(buf));
+	mfd_fail_new(buf, 0, 0);
+
+	/* test over-long zero-terminated name */
+	memset(buf, 0xff, sizeof(buf));
+	buf[sizeof(buf) - 1] = 0;
+	mfd_fail_new(buf, 0, 0);
+
+	/* verify "" is a valid name */
+	fd = mfd_assert_new("", 0, 0);
+	close(fd);
+
+	/* verify invalid flags are rejected */
+	mfd_fail_new("", 0, 0x0100);
+	mfd_fail_new("", 0, ~MFD_CLOEXEC);
+	mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
+	mfd_fail_new("", 0, ~0);
+	mfd_fail_new("", 0, 0x8000000000000000ULL);
+
+	/* verify MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
+	close(fd);
+
+	/* verify MFD_ALLOW_SEALING is allowed */
+	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
+	close(fd);
+
+	/* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
+	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
+	close(fd);
+}
+
+/*
+ * Test basic sealing
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_basic(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+	/* add basic seals */
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+
+	/* add them again */
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_WRITE);
+
+	/* add more seals and seal against sealing */
+	mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
+				 F_SEAL_GROW |
+				 F_SEAL_WRITE |
+				 F_SEAL_SEAL);
+
+	/* verify that sealing no longer works */
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	mfd_fail_add_seals(fd, 0);
+
+	close(fd);
+
+	/* verify sealing does not work without MFD_ALLOW_SEALING */
+	fd = mfd_assert_new("kern_memfd_basic",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC);
+	mfd_fail_get_seals(fd);
+	mfd_fail_add_seals(fd, F_SEAL_SHRINK |
+			       F_SEAL_GROW |
+			       F_SEAL_WRITE);
+	mfd_fail_get_seals(fd);
+	close(fd);
+}
+
+/*
+ * Test SEAL_WRITE
+ * Test whether SEAL_WRITE actually prevents modifications.
+ */
+static void test_seal_write(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_write",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+
+	mfd_assert_read(fd);
+	mfd_fail_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK
+ * Test whether SEAL_SHRINK actually prevents shrinking
+ */
+static void test_seal_shrink(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_shrink",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_assert_grow(fd);
+	mfd_assert_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_GROW
+ * Test whether SEAL_GROW actually prevents growing
+ */
+static void test_seal_grow(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_grow",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_GROW);
+	mfd_assert_has_seals(fd, F_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_assert_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test SEAL_SHRINK | SEAL_GROW
+ * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
+ */
+static void test_seal_resize(void)
+{
+	int fd;
+
+	fd = mfd_assert_new("kern_memfd_seal_resize",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
+
+	mfd_assert_read(fd);
+	mfd_assert_write(fd);
+	mfd_fail_shrink(fd);
+	mfd_fail_grow(fd);
+	mfd_fail_grow_write(fd);
+
+	close(fd);
+}
+
+/*
+ * Test sharing via dup()
+ * Test that seals are shared between dupped FDs and they're all equal.
+ */
+static void test_share_dup(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_dup",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_dup(fd);
+	mfd_assert_has_seals(fd2, 0);
+
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	mfd_assert_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	mfd_fail_add_seals(fd2, F_SEAL_GROW);
+	mfd_fail_add_seals(fd, F_SEAL_SEAL);
+	mfd_fail_add_seals(fd2, F_SEAL_SEAL);
+
+	close(fd2);
+
+	mfd_fail_add_seals(fd, F_SEAL_GROW);
+	close(fd);
+}
+
+/*
+ * Test sealing with active mmap()s
+ * Adding seals is only allowed if no shared, writable mmap() refs exist.
+ */
+static void test_share_mmap(void)
+{
+	int fd;
+	void *p;
+
+	fd = mfd_assert_new("kern_memfd_share_mmap",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	/* shared/writable ref prevents sealing */
+	p = mfd_assert_mmap_shared(fd);
+	mfd_fail_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, 0);
+	munmap(p, MFD_DEF_SIZE);
+
+	/* readable ref allows sealing */
+	p = mfd_assert_mmap_private(fd);
+	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
+	munmap(p, MFD_DEF_SIZE);
+
+	close(fd);
+}
+
+/*
+ * Test sealing with open(/proc/self/fd/%d)
+ * Via /proc we can get access to a separate file-context for the same memfd.
+ * This is *not* like dup(), but like a real separate open(). Make sure the
+ * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
+ */
+static void test_share_open(void)
+{
+	int fd, fd2;
+
+	fd = mfd_assert_new("kern_memfd_share_open",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+	mfd_assert_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	close(fd);
+	fd = mfd_assert_open(fd2, O_RDONLY, 0);
+
+	mfd_fail_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
+
+	close(fd2);
+	fd2 = mfd_assert_open(fd, O_RDWR, 0);
+
+	mfd_assert_add_seals(fd2, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
+
+	close(fd2);
+	close(fd);
+}
+
+/*
+ * Test sharing via fork()
+ * Test whether seal-modifications work as expected with forked children.
+ */
+static void test_share_fork(void)
+{
+	int fd;
+	pid_t pid;
+
+	fd = mfd_assert_new("kern_memfd_share_fork",
+			    MFD_DEF_SIZE,
+			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
+	mfd_assert_has_seals(fd, 0);
+
+	pid = spawn_idle_proc();
+	mfd_assert_add_seals(fd, F_SEAL_SEAL);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	mfd_fail_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	join_idle_proc(pid);
+
+	mfd_fail_add_seals(fd, F_SEAL_WRITE);
+	mfd_assert_has_seals(fd, F_SEAL_SEAL);
+
+	close(fd);
+}
+
+int main(int argc, char **argv)
+{
+	pid_t pid;
+
+	printf("memfd: CREATE\n");
+	test_create();
+	printf("memfd: BASIC\n");
+	test_basic();
+
+	printf("memfd: SEAL-WRITE\n");
+	test_seal_write();
+	printf("memfd: SEAL-SHRINK\n");
+	test_seal_shrink();
+	printf("memfd: SEAL-GROW\n");
+	test_seal_grow();
+	printf("memfd: SEAL-RESIZE\n");
+	test_seal_resize();
+
+	printf("memfd: SHARE-DUP\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK\n");
+	test_share_fork();
+
+	/* Run test-suite in a multi-threaded environment with a shared
+	 * file-table. */
+	pid = spawn_idle_thread();
+	printf("memfd: SHARE-DUP (shared file-table)\n");
+	test_share_dup();
+	printf("memfd: SHARE-MMAP (shared file-table)\n");
+	test_share_mmap();
+	printf("memfd: SHARE-OPEN (shared file-table)\n");
+	test_share_open();
+	printf("memfd: SHARE-FORK (shared file-table)\n");
+	test_share_fork();
+	join_idle_thread(pid);
+
+	printf("memfd: DONE\n");
+
+	return 0;
+}
-- 
1.9.2

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-04-15 18:38 ` David Herrmann
@ 2014-05-14  5:09   ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-14  5:09 UTC (permalink / raw)
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:
> 
> This is v2 of the File-Sealing and memfd_create() patches. You can find v1 with
> a longer introduction at gmane:
>   http://thread.gmane.org/gmane.comp.video.dri.devel/102241
> An LWN article about memfd+sealing is available, too:
>   https://lwn.net/Articles/593918/

Sorry it's taken so long: at last I managed to set aside a few hours at
the weekend, to read through your memfd+sealing work and let it sink in.

Good stuff.  I've a page of notes which I shall respond with, either
later in the week or at the weekend; but they're pretty much trivia, or
notes to myself, beside the async I/O issue raised by Tony Battersby.

I thought I'd better not wait longer to give warning that I do take
that issue seriously.

> 
> Shortlog of changes since v1:
>  - Dropped the "exclusive reference" idea
>    Now sealing is a one-shot operation. Once a given seal is set, you cannot
>    remove this seal again, ever. This allows us to drop all the ref-count
>    checking and simplifies the code a lot. We also no longer have all the races
>    we have to test for.
>  - The i_writecount fix is now upstream (slightly different, by Al Viro) so I
>    dropped it from the series.
>  - Change SHMEM_* prefix to F_* to avoid any API-association to shmem.
>  - Sealing is disabled on all files by default (even though we still haven't
>    found any DoS attack). You need to pass MFD_ALLOW_SEALING to memfd_create()
>    to get an object that supports the sealing API.
>  - Changed F_SET_SEALS to F_ADD_SEALS. This better reflects the API. You can
>    never remove seals, you can only add seals. Note that the semantics also
>    changed slightly: You can now _always_ call F_ADD_SEALS to add _more_ seals.
>    However, a new seal was added which "seals sealing" (F_SEAL_SEAL). So once
>    F_SEAL_SEAL is set, F_ADD_SEALS is no longer allowed.
>    This feature was requested by the glib developers.
>  - memfd_create() names are now limited to NAME_MAX instead of 256 hardcoded.
>  - Rewrote the test suite
> 
> The biggest change in v2 is the removal of the "exclusive reference" idea. It
> was a nice optimization, but the implementation was ugly and racy regarding
> file-table changes. Linus didn't like it either so we decided to drop it
> entirely. Sealing is a one-shot operation now. A sealed file can never be
> unsealed, even if you're the only holder.
> 
> I also addressed most of the concerns regarding API naming and semantics. I got
> feedback from glib, EFL, wayland, kdbus, ostree, audio developers and we
> discussed many possible use-cases (and also cases that don't make sense). So I
> think we're in a very good state right now.
> 
> People requested to make this interface more generic. I renamed the API to
> reflect that, but I didn't change the implementation. Thing is, seals cannot be
> removed, ever. Therefore, semantics for sealing on non-volatile storage are
> undefined. We don't write them to disc and it is unclear whether a sealed file
> can be unlinked/removed again. There're more issues with this and no-one came up
> with a use-case, hence I didn't bother implementing it.
> There's also an ongoing discussion about an AIO race, but this also affects
> other inode-protections like S_IMMUTABLE/etc. So I don't think we should tie
> the fix to this series.

I disagree on that.

Whatever the bugs or limitations with S_IMMUTABLE, ETXTBSY etc,
we have lived with those without complaint for many years.

You now propose an entirely new kind of guarantee, but that guarantee
is broken by the possibility of outstanding async I/O to a page of the
sealed object.
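
To make the guarantee under discussion concrete, here is a minimal
userspace sketch of what a receiver of a sealed memfd relies on.  It
assumes the interface in the form later exposed by glibc (memfd_create()
in <sys/mman.h>, F_ADD_SEALS and F_GET_SEALS in <fcntl.h>) and the EPERM
results documented for it.  The names add_seals() and demo_seal_write()
and the step-numbered return codes are made up for illustration.  The
sketch only shows the synchronous case; it does not exercise the async
I/O race itself.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* for memfd_create() and F_ADD_SEALS */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* Add seals to an open memfd; returns fcntl()'s result. */
static int add_seals(int fd, int seals)
{
	assert(fd >= 0);
	return fcntl(fd, F_ADD_SEALS, seals);
}

/*
 * Hypothetical demo: returns 0 if every step behaves as described,
 * or a negative step number for the first step that does not.
 */
int demo_seal_write(void)
{
	const int want = F_SEAL_WRITE | F_SEAL_SEAL;
	int fd, seals, ret = 0;

	fd = memfd_create("seal_demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	/* Fill the object while it is still writable. */
	if (write(fd, "hello", 5) != 5) {
		ret = -2;
		goto out;
	}

	/* Seals can only be added, never removed. */
	if (add_seals(fd, F_SEAL_WRITE) < 0) {
		ret = -3;
		goto out;
	}

	/* From now on a plain write() is expected to fail with EPERM. */
	if (write(fd, "x", 1) != -1 || errno != EPERM) {
		ret = -4;
		goto out;
	}

	/* F_SEAL_SEAL "seals sealing": no further seals may be added. */
	if (add_seals(fd, F_SEAL_SEAL) < 0) {
		ret = -5;
		goto out;
	}
	if (add_seals(fd, F_SEAL_SHRINK) != -1 || errno != EPERM) {
		ret = -6;
		goto out;
	}

	/* Both seals must be reported; other bits are not checked. */
	seals = fcntl(fd, F_GET_SEALS);
	if (seals < 0 || (seals & want) != want)
		ret = -7;
out:
	close(fd);
	return ret;
}
```

A caller would simply check that demo_seal_write() returns 0.  The point
of the objection above is that step -4 only covers writes issued after
the seal, not I/O that pinned a page beforehand.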

I don't see how we can add the new feature while knowing it broken.  We
have to devise a solution, but I haven't thought of a good solution yet.

Checking page counts in a GB file prior to sealing does not appeal at
all: we'd be lucky ever to find them all accounted for.  Adding overhead
to get_user_pages_fast() won't appeal to its adherents, and I'm not even
convinced that GUP is the only way in here.

Any ideas?

> Another discussion was about preventing /proc/self/fd/. But again, no-one could
> tell me _why_, so I didn't bother. On the contrary, I even provided several
> use-cases that make use of /proc/self/fd/ to get read-only FDs to pass around.
> 
> If anyone wants to test this, please use 3.15-rc1 as base. The i_writecount
> fixes are required for this series.
> 
> Comments welcome!
> David
> 
> David Herrmann (3):
>   shm: add sealing API
>   shm: add memfd_create() syscall
>   selftests: add memfd_create() + sealing tests
> 
>  arch/x86/syscalls/syscall_32.tbl           |   1 +
>  arch/x86/syscalls/syscall_64.tbl           |   1 +
>  fs/fcntl.c                                 |   5 +
>  include/linux/shmem_fs.h                   |  20 +
>  include/linux/syscalls.h                   |   1 +
>  include/uapi/linux/fcntl.h                 |  15 +
>  include/uapi/linux/memfd.h                 |  10 +
>  kernel/sys_ni.c                            |   1 +
>  mm/shmem.c                                 | 236 +++++++-
>  tools/testing/selftests/Makefile           |   1 +
>  tools/testing/selftests/memfd/.gitignore   |   2 +
>  tools/testing/selftests/memfd/Makefile     |  29 +
>  tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
>  13 files changed, 1263 insertions(+), 3 deletions(-)
>  create mode 100644 include/uapi/linux/memfd.h
>  create mode 100644 tools/testing/selftests/memfd/.gitignore
>  create mode 100644 tools/testing/selftests/memfd/Makefile
>  create mode 100644 tools/testing/selftests/memfd/memfd_test.c
> 
> -- 
> 1.9.2

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-14  5:09   ` Hugh Dickins
@ 2014-05-14 16:15     ` Tony Battersby
  -1 siblings, 0 replies; 53+ messages in thread
From: Tony Battersby @ 2014-05-14 16:15 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Herrmann, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hugh Dickins wrote:
> Checking page counts in a GB file prior to sealing does not appeal at
> all: we'd be lucky ever to find them all accounted for.

Here is a refinement of that idea: during a seal operation, iterate over
all the pages in the file and check their refcounts.  On any page that
has an unexpected extra reference, allocate a new page, copy the data
over to the new page, and then replace the page having the extra
reference with the newly-allocated page in the file.  That way you still
get zero-copy on pages that don't have extra references, and you don't
have to fail the seal operation if some of the pages are still being
referenced by something else.
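
The refinement above can be sketched as a userspace toy model.  This is
not kernel code: struct toy_page, struct toy_file and all toy_*()
helpers are invented for illustration, a refcount of 1 stands for "only
the file references this page", and locking is ignored entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 16
#define TOY_NPAGES    4

/* Toy stand-ins for struct page and a shmem file; not kernel structures. */
struct toy_page {
	int refcount;		/* 1 means: only the file holds it */
	char data[TOY_PAGE_SIZE];
};

struct toy_file {
	struct toy_page *pages[TOY_NPAGES];
	int sealed;
};

static struct toy_page *toy_page_alloc(void)
{
	struct toy_page *p = calloc(1, sizeof(*p));

	if (p)
		p->refcount = 1;
	return p;
}

void toy_page_put(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		free(p);
}

/* Take an extra reference, as a pending async I/O would. */
struct toy_page *toy_pin(struct toy_file *f, int idx)
{
	f->pages[idx]->refcount++;
	return f->pages[idx];
}

void toy_file_destroy(struct toy_file *f)
{
	int i;

	for (i = 0; i < TOY_NPAGES; i++) {
		if (f->pages[i])
			toy_page_put(f->pages[i]);
		f->pages[i] = NULL;
	}
}

int toy_file_init(struct toy_file *f)
{
	int i;

	memset(f, 0, sizeof(*f));
	for (i = 0; i < TOY_NPAGES; i++) {
		f->pages[i] = toy_page_alloc();
		if (!f->pages[i]) {
			toy_file_destroy(f);
			return -1;
		}
	}
	return 0;
}

/* Seal the file; returns the number of pages copied, or -1 on ENOMEM. */
int toy_seal(struct toy_file *f)
{
	int i, copied = 0;

	for (i = 0; i < TOY_NPAGES; i++) {
		struct toy_page *old = f->pages[i], *new;

		if (old->refcount == 1)
			continue;	/* fully accounted for: stays zero-copy */

		new = toy_page_alloc();
		if (!new)
			return -1;	/* file is left unsealed */
		memcpy(new->data, old->data, TOY_PAGE_SIZE);
		f->pages[i] = new;
		toy_page_put(old);	/* drop only the file's reference */
		copied++;
	}
	f->sealed = 1;
	return copied;
}
```

In this model a page pinned before toy_seal() ends up detached from the
file: a late write through the stale pin lands in the old page and the
sealed contents stay unchanged, at the price of one copy per pinned page.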

The downside of course is the extra memory usage and memcpy overhead if
something is holding extra references to the pages.  So whether this is
a good approach depends on:

*) Whether extra page references would happen frequently or infrequently
under various kernel configurations and usage scenarios.  I don't know
enough about the mm system to answer this myself.

*) Whether or not the extra memory usage and memcpy overhead could be
considered a DoS attack vector by someone who has found a way to add
extra references to the pages intentionally.

Tony Battersby

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-14 16:15     ` Tony Battersby
@ 2014-05-14 22:35       ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-14 22:35 UTC (permalink / raw)
  To: Tony Battersby
  Cc: Hugh Dickins, David Herrmann, Al Viro, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Wed, 14 May 2014, Tony Battersby wrote:
> Hugh Dickins wrote:
> > Checking page counts in a GB file prior to sealing does not appeal at
> > all: we'd be lucky ever to find them all accounted for.
> 
> Here is a refinement of that idea: during a seal operation, iterate over
> all the pages in the file and check their refcounts.  On any page that
> has an unexpected extra reference, allocate a new page, copy the data
> over to the new page, and then replace the page having the extra
> reference with the newly-allocated page in the file.  That way you still
> get zero-copy on pages that don't have extra references, and you don't
> have to fail the seal operation if some of the pages are still being
> referenced by something else.

That does seem a more promising idea than any that I'd had: thank you.

But whether it can actually be made to work (safely) is not yet clear
to me.

It would be rather like page migration; but whereas page migration
backs off whenever the page count cannot be fully accounted for
(as does KSM), that is precisely when this would have to act.

Taking action in the case of ignorance does not make me feel very
comfortable.  Page lock and radix tree lock would guard against
many surprises, but not necessarily all.

> 
> The downside of course is the extra memory usage and memcpy overhead if
> something is holding extra references to the pages.  So whether this is
> a good approach depends on:
> 
> *) Whether extra page references would happen frequently or infrequently
> under various kernel configurations and usage scenarios.  I don't know
> enough about the mm system to answer this myself.
> 
> *) Whether or not the extra memory usage and memcpy overhead could be
> considered a DoS attack vector by someone who has found a way to add
> extra references to the pages intentionally.

I may just be too naive on such issues, but neither of those worries
me particularly.  If something can already add an extra pin to many
pages, that is already a concern for memory usage.  The sealing case
would double its scale, but I don't see that as a new issue.

The aspect which really worries me is this: the maintenance burden.
This approach would add some peculiar new code, introducing a rare
special case: which we might get right today, but will very easily
forget tomorrow when making some other changes to mm.  If we compile
a list of danger areas in mm, this would surely belong on that list.

Hugh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-14 22:35       ` Hugh Dickins
@ 2014-05-19 11:44         ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-19 11:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Al Viro, Jan Kara, Michael Kerrisk, Ryan Lortie,
	Linus Torvalds, Andrew Morton, linux-mm, linux-fsdevel,
	linux-kernel, Johannes Weiner, Tejun Heo, Greg Kroah-Hartman,
	John Stultz, Kristian Hogsberg, Lennart Poettering, Daniel Mack,
	Kay Sievers

Hi

On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> The aspect which really worries me is this: the maintenance burden.
> This approach would add some peculiar new code, introducing a rare
> special case: which we might get right today, but will very easily
> forget tomorrow when making some other changes to mm.  If we compile
> a list of danger areas in mm, this would surely belong on that list.

I tried doing the page-replacement in the last 4 days, but honestly,
it's far more complex than I thought. So if no-one more experienced
with mm/ comes up with a simple implementation, I'll have to delay
this for some more weeks.

However, I still wonder why we try to fix this as part of this
patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
amount of time. Same is true for network block-devices, NFS, iscsi,
maybe loop-devices, ... This means, _any_ once mapped page can be
written to after an arbitrary delay. This can break any feature that
makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
sealing, ..).

Shouldn't we try to fix the _cause_ of this?

Isn't there a simple way to lock/mark/.. affected vmas in
get_user_pages(_fast)() and release them once done? We could increase
i_mmap_writable on all affected address_space and decrease it on
release. This would at least prevent sealing and could be checked on
other operations, too (like setting S_IMMUTABLE).
This should be as easy as checking page_mapping(page) != NULL and then
adjusting ->i_mmap_writable in
get_writable_user_pages/put_writable_user_pages, right?
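
A toy model of that accounting idea, as a sketch only: struct
toy_mapping is a stand-in for the address_space, the toy_*() helpers
are hypothetical (no such get/put pair exists in the kernel), and the
-EBUSY and -EPERM return values are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-in for struct address_space; not the kernel structure. */
struct toy_mapping {
	int i_mmap_writable;	/* writable mappings plus outstanding write pins */
	int write_sealed;
};

/* Hypothetical pin helper: account the pin like a writable mapping. */
int toy_get_writable_pages(struct toy_mapping *m)
{
	if (m->write_sealed)
		return -EPERM;	/* no new writable pins once sealed */
	m->i_mmap_writable++;
	return 0;
}

/* Hypothetical release helper: must pair with a successful get. */
void toy_put_writable_pages(struct toy_mapping *m)
{
	assert(m->i_mmap_writable > 0);
	m->i_mmap_writable--;
}

/* Sealing is refused while any writable user is still accounted. */
int toy_seal_write(struct toy_mapping *m)
{
	if (m->i_mmap_writable > 0)
		return -EBUSY;
	m->write_sealed = 1;
	return 0;
}
```

With this, a seal attempted while a pin is outstanding fails and can be
retried after the matching put, instead of succeeding with I/O still in
flight.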

Thanks
David

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-19 11:44         ` David Herrmann
@ 2014-05-19 16:09           ` Jan Kara
  -1 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2014-05-19 16:09 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Tony Battersby, Al Viro, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Mon 19-05-14 13:44:25, David Herrmann wrote:
> Hi
> 
> On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> > The aspect which really worries me is this: the maintenance burden.
> > This approach would add some peculiar new code, introducing a rare
> > special case: which we might get right today, but will very easily
> > forget tomorrow when making some other changes to mm.  If we compile
> > a list of danger areas in mm, this would surely belong on that list.
> 
> I tried doing the page-replacement in the last 4 days, but honestly,
> it's far more complex than I thought. So if no-one more experienced
> with mm/ comes up with a simple implementation, I'll have to delay
> this for some more weeks.
> 
> However, I still wonder why we try to fix this as part of this
> patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> amount of time. Same is true for network block-devices, NFS, iscsi,
> maybe loop-devices, ... This means, _any_ once mapped page can be
> written to after an arbitrary delay. This can break any feature that
> makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> sealing, ..).
> 
> Shouldn't we try to fix the _cause_ of this?
> 
> Isn't there a simple way to lock/mark/.. affected vmas in
> get_user_pages(_fast)() and release them once done? We could increase
> i_mmap_writable on all affected address_space and decrease it on
> release. This would at least prevent sealing and could be checked on
> other operations, too (like setting S_IMMUTABLE).
> This should be as easy as checking page_mapping(page) != NULL and then
> adjusting ->i_mmap_writable in
> get_writable_user_pages/put_writable_user_pages, right?
  Doing this would be quite a bit of work. Currently references returned by
get_user_pages() are page references like any other and thus are released
by put_page() or similar. Now you would make them special and they need
special releasing and there are lots of places in kernel where
get_user_pages() is used that would need changing.

Another aspect is that it could have performance implications - if there
are several processes using get_user_pages[_fast]() on a file, they would
start contending on modifying i_mmap_writable.

One somewhat crazy idea I have is that maybe we could delay unmapping of a
page, if this was the last VMA referencing it, until all extra references to
pages in there are dropped. That would make i_mmap_writable reliable for
you and it would also close those races with remount. Hugh, do you think
this might be viable?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-19 16:09           ` Jan Kara
@ 2014-05-19 22:11             ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-19 22:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: David Herrmann, Hugh Dickins, Tony Battersby, Al Viro,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Mon, 19 May 2014, Jan Kara wrote:
> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm.  If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> > 
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced

To be honest, I'm quite glad to hear that: it is still a solution worth
considering, but I'd rather continue the search for a better solution.

> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> > 
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).

We need to fix it together with your sealing patchset, because your
patchset is all about introducing a new kind of guarantee: a guarantee
which this async i/o issue makes impossible to give, as things stand.

Exasperating for you, I understand; but that's how it is.
A new feature may make new demands on the infrastructure.

I can imagine existing problems, but (I may be out of touch) I have
not heard of them as problems in practice.  Certainly they would not
be recent regressions: mm-page versus fs-file has worked in this way
for as long as I've known them (pages released independently of
unmapping the file, with the understanding that i/o might still
be in progress, so care taken not to free the pages too soon).

> > 
> > Shouldn't we try to fix the _cause_ of this?

Nobody is against fixing the cause: we are all looking for the
simplest way of doing so.

> > 
> > Isn't there a simple way to lock/mark/.. affected vmas in
> > get_user_pages(_fast)() and release them once done? We could increase
> > i_mmap_writable on all affected address_space and decrease it on
> > release. This would at least prevent sealing and could be checked on
> > other operations, too (like setting S_IMMUTABLE).
> > This should be as easy as checking page_mapping(page) != NULL and then
> > adjusting ->i_mmap_writable in
> > get_writable_user_pages/put_writable_user_pages, right?
>   Doing this would be quite a bit of work. Currently references returned by
> get_user_pages() are page references like any other and thus are released
> by put_page() or similar. Now you would make them special and they need
> special releasing and there are lots of places in kernel where
> get_user_pages() is used that would need changing.

Lots of places that would need changing, yes; but we have often
wondered in the past whether there should be a put_user_pages().
Though I'm not sure that it would actually solve anything...

> 
> Another aspect is that it could have performance implications - if there
> are several processes using get_user_pages[_fast]() on a file, they would
> start contending on modifying i_mmap_writable.

Doing extra vma work in get_user_pages() wouldn't be so bad.  But doing
any vma work in get_user_pages_fast() would upset almost all its users:
get_user_pages_fast() is a fast-path which expressly avoids the vmas,
and hates additional cachelines being added to its machinations.

If sealing had appeared before get_user_pages_fast(), maybe we wouldn't
have let get_user_pages_fast() in; but now it's the other way around.

I would be more interested in attacking from the get_user_pages() and
get_user_pages_fast() end, if I could convince myself that they do
actually delimit the problem; maybe they do, but I'm not yet convinced.

> 
> One somewhat crazy idea I have is that maybe we could delay unmapping of a
> page if this was last VMA referencing it until all extra page references of
> pages in there are dropped. That would make i_mmap_writable reliable for
> you and it would also close those races with remount. Hugh, do you think
> this might be viable?

It is definitely worth pursuing further, but I'm not very hopeful on it.
In a world of free page flags and free struct page fields, maybe.  (And
I don't see sealing as a feature sensibly restricted to 64-bit only.)

I think we would have to set a page flag, maybe bump a count, for every
leftover page that raises i_mmap_writable; and lower it (potentially from
interrupt context) at put_page() time.  Easy to make i_mmap_writable an
atomic rather than guarded by i_mmap_mutex, but we still need to
synchronize on it falling to 0.
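
As a rough illustration of the atomic-counter idea in the paragraph above,
here is a userspace model using C11 atomics. It is only a sketch under
stated assumptions: struct mapping_model and the model_*() helpers are
hypothetical names, nothing here is kernel code or part of the patch, and a
negative count is used (as an assumption of this sketch) to mean "writes
denied". It does not wait for the count to fall to 0, it simply fails,
which is exactly the synchronization question raised above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of a per-file writable-mapping count:
 *   > 0  that many writable users (mappings or pinned pages) exist
 *  == 0  no writable users; denying writes may succeed
 *   < 0  writes have been denied; new writable users must fail
 */
struct mapping_model {
	atomic_int writable;
};

/* Take a writable reference; fails once writes have been denied. */
bool model_map_writable(struct mapping_model *m)
{
	int old = atomic_load(&m->writable);

	do {
		if (old < 0)
			return false;
		/* on failure the CAS reloads 'old' and the check is repeated */
	} while (!atomic_compare_exchange_weak(&m->writable, &old, old + 1));
	return true;
}

/* Drop a writable reference taken by model_map_writable(). */
void model_unmap_writable(struct mapping_model *m)
{
	int old = atomic_fetch_sub(&m->writable, 1);

	assert(old > 0);	/* unbalanced drop would be a bug */
}

/* Deny writes: succeeds only if the count is exactly 0 right now. */
bool model_deny_writable(struct mapping_model *m)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&m->writable, &expected, -1);
}
```

The compare-and-exchange makes "count is zero" and "mark as denied" a single
step, so a racing model_map_writable() either wins before the denial or
fails after it. Deciding which put_page() calls should drop the count is
not addressed by this model.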

And how would we recognize the relevant, decrementing, put_page()?
page_count divided into "read_count" and "write_count"?  Ugh!

I also have a strong instinct against adding delays into munmap+exit;
though that mainly comes from the urge to free memory, and here we are
only delaying until a page becomes freeable, so maybe I should abandon
that bias in this case.

I did start thinking in this direction last week, but got stuck somewhere
and retreated, I forget on what issue.  At this moment I'm not really
in that zone, but anxious to complete my promised responses to David's
patches, which I almost but not quite completed last night.

Hugh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 1/3] shm: add sealing API
  2014-04-15 18:38   ` David Herrmann
@ 2014-05-20  2:16     ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-20  2:16 UTC (permalink / raw)
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirski, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:

> If two processes share a common memory region, they usually want some
> guarantees to allow safe access. This often includes:
>   - one side cannot overwrite data while the other reads it
>   - one side cannot shrink the buffer while the other accesses it
>   - one side cannot grow the buffer beyond previously set boundaries
> 
> If there is a trust-relationship between both parties, there is no need
> for policy enforcement. However, if there's no trust relationship (eg.,
> for general-purpose IPC) sharing memory-regions is highly fragile and
> often not possible without local copies. Look at the following two
> use-cases:
>   1) A graphics client wants to share its rendering-buffer with a
>      graphics-server. The memory-region is allocated by the client for
>      read/write access and a second FD is passed to the server. While
>      scanning out from the memory region, the server has no guarantee that
>      the client doesn't shrink the buffer at any time, requiring rather
>      cumbersome SIGBUS handling.
>   2) A process wants to perform an RPC on another process. To avoid huge
>      bandwidth consumption, zero-copy is preferred. After a message is
>      assembled in-memory and a FD is passed to the remote side, both sides
>      want to be sure that neither modifies this shared copy, anymore. The
>      source may have put sensitive data into the message without a separate
>      copy and the target may want to parse the message inline, to avoid a
>      local copy.
> 
> While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
> ways to achieve most of this, the first one is disproportionately ugly to
> use in libraries and the latter two are broken/racy or even disabled due
> to denial of service attacks.
> 
> This patch introduces the concept of SEALING. If you seal a file, a
> specific set of operations is blocked on that file forever.
> Unlike locks, seals can only be set, never removed. Hence, once you
> verified a specific set of seals is set, you're guaranteed that no-one can
> perform the blocked operations on this file, anymore.
> 
> An initial set of SEALS is introduced by this patch:
>   - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
>             in size. This affects ftruncate() and open(O_TRUNC).
>   - GROW: If SEAL_GROW is set, the file in question cannot be increased
>           in size. This affects ftruncate(), fallocate() and write().
>   - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
>            are possible. This affects fallocate(PUNCH_HOLE), mmap() and
>            write().
>   - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
>           This basically prevents the F_ADD_SEALS operation on a file and
>           can be set to prevent others from adding further seals that you
>           don't want.
> 
> The described use-cases can easily use these seals to provide safe use
> without any trust-relationship:
>   1) The graphics server can verify that a passed file-descriptor has
>      SEAL_SHRINK set. This allows safe scanout, while the client is
>      allowed to increase buffer size for window-resizing on-the-fly.
>      Concurrent writes are explicitly allowed.
>   2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
>      SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
>      process can modify the data while the other side parses it.
>      Furthermore, it guarantees that even with writable FDs passed to the
>      peer, it cannot increase the size to hit memory-limits of the source
>      process (in case the file-storage is accounted to the source).
> 
> The new API is an extension to fcntl(), adding two new commands:
>   F_GET_SEALS: Return a bitset describing the seals on the file. This
>                can be called on any FD if the underlying file supports
>                sealing.
>   F_ADD_SEALS: Change the seals of a given file. This requires WRITE
>                access to the file and F_SEAL_SEAL may not already be set.
>                Furthermore, the underlying file must support sealing and
>                there may not be any existing shared mapping of that file.
>                Otherwise, EBADF/EPERM is returned.
>                The given seals are _added_ to the existing set of seals
>                on the file. You cannot remove seals again.
> 
> The fcntl() handler is currently specific to shmem and disabled on all
> files. A file needs to explicitly support sealing for this interface to
> work. A separate syscall is added in a follow-up, which creates files that
> support sealing. There is no intention to support this on other
> file-systems. Semantics are unclear for non-volatile files and we lack any
> use-case right now. Therefore, the implementation is specific to shmem.

Yes, I think you've struck the right balance, by making it a general
fcntl interface, but implementing it only in shmem.
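
For readers following along, the fcntl() interface described in the quoted
commit message can be exercised from userspace roughly as follows. This is
only a sketch: it assumes the interface as later merged into mainline
(memfd_create() with MFD_ALLOW_SEALING, plus F_ADD_SEALS and F_GET_SEALS),
which may differ in detail from this v2 patch. make_sealed_buffer() and
has_seals() are hypothetical helper names, and the fallback constants are
only there for older C library headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older headers; values match the uapi definitions. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC		0x0001U
#define MFD_ALLOW_SEALING	0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS	(1024 + 9)	/* F_LINUX_SPECIFIC_BASE + 9 */
#define F_GET_SEALS	(1024 + 10)	/* F_LINUX_SPECIFIC_BASE + 10 */
#define F_SEAL_SEAL	0x0001
#define F_SEAL_SHRINK	0x0002
#define F_SEAL_GROW	0x0004
#define F_SEAL_WRITE	0x0008
#endif

/*
 * Sender side: create a sealable memfd of 'size' bytes, then seal it
 * against resizing and writing. Returns the fd, or -1 with errno set.
 * Calls the raw syscall so the sketch also builds without a libc wrapper.
 */
int make_sealed_buffer(const char *name, off_t size)
{
	int fd;

	assert(size >= 0);
	fd = (int)syscall(SYS_memfd_create, name,
			  MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0 ||
	    fcntl(fd, F_ADD_SEALS,
		  F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/*
 * Receiver side: 1 if every seal in 'required' is set on fd, 0 if not,
 * -1 if the file does not support sealing or another error occurred.
 */
int has_seals(int fd, int required)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -1;
	return (seals & required) == required;
}
```

A receiver would call has_seals() on a descriptor it was passed before
mapping or parsing the contents, and reject the descriptor otherwise.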

> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  fs/fcntl.c                 |   5 ++
>  include/linux/shmem_fs.h   |  20 ++++++
>  include/uapi/linux/fcntl.h |  15 +++++
>  mm/shmem.c                 | 162 ++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 199 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 9ead159..1a7a722 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -21,6 +21,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
> +#include <linux/shmem_fs.h>
>  
>  #include <asm/poll.h>
>  #include <asm/siginfo.h>
> @@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	case F_GETPIPE_SZ:
>  		err = pipe_fcntl(filp, cmd, arg);
>  		break;
> +	case F_ADD_SEALS:
> +	case F_GET_SEALS:
> +		err = shmem_fcntl(filp, cmd, arg);

Okay.  I agree that fcntl() is the best interface to use; and although
we always feel a bit dirty exporting a function from shmem.c for use
outside, you are following what's already done with pipe_fcntl(); and
it seems overkill to add an fcntl method to file_operations without any
wider usage.

> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 4d1771c..c043d67 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -1,6 +1,7 @@
>  #ifndef __SHMEM_FS_H
>  #define __SHMEM_FS_H
>  
> +#include <linux/file.h>
>  #include <linux/swap.h>
>  #include <linux/mempolicy.h>
>  #include <linux/pagemap.h>
> @@ -20,6 +21,7 @@ struct shmem_inode_info {
>  	struct shared_policy	policy;		/* NUMA memory alloc policy */
>  	struct list_head	swaplist;	/* chain of maybes on swap */
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
> +	u32			seals;		/* shmem seals */

Okay.  I do wonder why you chose "u32" where I would have chosen
"unsigned int": probably just our different backgrounds - kernel
internals most often use the basic types, whereas you are thinking
about explicit interfaces.  Even syscalls tend to have "int" args,
but perhaps that's just a historic mistake.  I have no good reason
to disagree with your use of "u32", but draw attention to it in
case someone else feels more strongly.

Oh, how about you move "seals" up between "lock" and "flags":
on many configurations, it will then occupy what used to be padding.
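
The padding argument can be checked with a small userspace sketch. The two
structs below are hypothetical stand-ins, not the real shmem_inode_info:
the sketch assumes a 4-byte lock followed by an "unsigned long flags", and
collapses the remaining fields into a single pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout with "seals" appended near the end, as in the patch. */
struct info_seals_last {
	uint32_t lock;		/* stand-in for a 4-byte spinlock_t */
	unsigned long flags;
	void *other;		/* stand-in for the remaining fields */
	uint32_t seals;
};

/* Hypothetical layout with "seals" moved up between "lock" and "flags". */
struct info_seals_early {
	uint32_t lock;
	uint32_t seals;		/* may occupy what used to be padding */
	unsigned long flags;
	void *other;
};

/* "seals" sits directly behind the 4-byte lock in the second layout. */
static_assert(offsetof(struct info_seals_early, seals) == sizeof(uint32_t),
	      "seals should directly follow lock");

/* Bytes saved by the earlier placement on the current ABI (may be 0). */
size_t seals_placement_saving(void)
{
	return sizeof(struct info_seals_last) -
	       sizeof(struct info_seals_early);
}
```

Where unsigned long is 8 bytes and 8-byte aligned, the early placement
typically fills a 4-byte hole; where it is 4 bytes, there is no hole to
fill and the two layouts have the same size.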

>  	struct inode		vfs_inode;
>  };
>  
> @@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
>  					mapping_gfp_mask(mapping));
>  }
>  
> +/* marks inode to support sealing */
> +#define SHMEM_ALLOW_SEALING (1U << 31)

This feels unnecessary to me: see comment on shmem_add_seals.

> +
> +#ifdef CONFIG_SHMEM

Should that rather be CONFIG_TMPFS?  I think you have placed
shmem_fcntl() and its supporting functions in the CONFIG_TMPFS
part of mm/shmem.c (and CONFIG_TMPFS depends on CONFIG_SHMEM).

It's almost certainly true that "CONFIG_TMPFS" has outlived its v2.4
usefulness, and serves as more of a confusion than a help nowadays:
particularly since !CONFIG_SHMEM gives you the ramfs filesystem, but
CONFIG_SHMEM without CONFIG_TMPFS does not give you a filesystem.

Blame me for leaving CONFIG_TMPFS around; but for now,
I think it's CONFIG_TMPFS you want there (please check).

> +
> +extern int shmem_add_seals(struct file *file, u32 seals);
> +extern int shmem_get_seals(struct file *file);
> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> +
> +#else
> +

Are you sure you want to generate a link error rather than a runtime
fallback if there's a driver using shmem_add_seals() or shmem_get_seals()
in a !CONFIG_SHMEM kernel?  That might be the right decision, but it
surprises me a little.

> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
> +{
> +	return -EINVAL;

Should be -EBADF to match what you get in the CONFIG_SHMEM case.

> +}
> +
> +#endif
> +
>  #endif
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 074b886..1b9b9f4 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -28,6 +28,21 @@
>  #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
>  
>  /*
> + * Set/Get seals
> + */
> +#define F_ADD_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
> +#define F_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
> +
> +/*
> + * Types of seals
> + */
> +#define F_SEAL_SEAL	0x0001	/* prevent further seals from being set */
> +#define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
> +#define F_SEAL_GROW	0x0004	/* prevent file from growing */
> +#define F_SEAL_WRITE	0x0008	/* prevent writes */
> +/* (1U << 31) is reserved for internal use */

I question the need to reserve that: see comment on shmem_add_seals.

> +
> +/*
>   * Types of directory notifications that may be requested.
>   */
>  #define DN_ACCESS	0x00000001	/* File accessed */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9f70e02..175a5b8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
>  #include <linux/highmem.h>
>  #include <linux/seq_file.h>
>  #include <linux/magic.h>
> +#include <linux/fcntl.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = dentry->d_inode;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +	loff_t oldsize = inode->i_size;
> +	loff_t newsize = attr->ia_size;
>  	int error;
>  
>  	error = inode_change_ok(inode, attr);
>  	if (error)
>  		return error;
>  
> -	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> -		loff_t oldsize = inode->i_size;
> -		loff_t newsize = attr->ia_size;
> +	/* protected by i_mutex */
> +	if (attr->ia_valid & ATTR_SIZE) {
> +		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
> +		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> +			return -EPERM;
> +	}
>  
> +	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>  		if (newsize != oldsize) {
>  			i_size_write(inode, newsize);
>  			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> @@ -1289,6 +1297,13 @@ out_nomem:
>  
>  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct inode *inode = file_inode(file);
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	/* protected by mmap_sem */
> +	if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
> +		return -EPERM;
> +
>  	file_accessed(file);
>  	vma->vm_ops = &shmem_vm_ops;
>  	return 0;
> @@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  			struct page **pagep, void **fsdata)
>  {
>  	struct inode *inode = mapping->host;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
>  	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> +
> +	/* i_mutex is held by caller */
> +	if (info->seals & F_SEAL_WRITE)
> +		return -EPERM;
> +	if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
> +		return -EPERM;
> +
>  	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
>  }
>  
> @@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
>  	return offset;
>  }
>  
> +#define F_ALL_SEALS (F_SEAL_SEAL | \
> +		     F_SEAL_SHRINK | \
> +		     F_SEAL_GROW | \
> +		     F_SEAL_WRITE)
> +
> +int shmem_add_seals(struct file *file, u32 seals)
> +{
> +	struct dentry *dentry = file->f_path.dentry;
> +	struct inode *inode = dentry->d_inode;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +	int r;

mm/shmem.c is currently using "int error", "int err", "int ret" or
"int retval" for this (maybe more!): I'd prefer you not to add "r"
to the menagerie, "error" or "err" would be good here.

> +
> +	/* SHMEM_ALLOW_SEALING is a private, unused bit */
> +	BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);

I see no need for SHMEM_ALLOW_SEALING.
Now that you have added F_SEAL_SEAL, why don't you just make
shmem_get_inode() initialize info->seals with F_SEAL_SEAL,
then clear that in the one place you need to in the next patch?
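
A minimal userspace model of that suggestion, to make the intended
behaviour concrete. Everything here is hypothetical (struct model_inode,
the model_*() helpers and the MODEL_SEAL_* names); it only mirrors the seal
bit values from the patch and is not the shmem implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Same bit values as the F_SEAL_* definitions in the patch. */
#define MODEL_SEAL_SEAL		0x0001u
#define MODEL_SEAL_SHRINK	0x0002u
#define MODEL_SEAL_GROW		0x0004u
#define MODEL_SEAL_WRITE	0x0008u
#define MODEL_ALL_SEALS		(MODEL_SEAL_SEAL | MODEL_SEAL_SHRINK | \
				 MODEL_SEAL_GROW | MODEL_SEAL_WRITE)

struct model_inode {
	uint32_t seals;
};

/*
 * Every new object starts out "sealed against sealing". Only the creation
 * path that wants sealable objects clears that bit again, so no separate
 * private ALLOW_SEALING flag is needed.
 */
void model_inode_init(struct model_inode *info, bool allow_sealing)
{
	assert(info != NULL);
	info->seals = MODEL_SEAL_SEAL;
	if (allow_sealing)
		info->seals &= ~MODEL_SEAL_SEAL;
}

/* Add seals; refused with -EPERM once MODEL_SEAL_SEAL is set. */
int model_add_seals(struct model_inode *info, uint32_t seals)
{
	if (seals & ~MODEL_ALL_SEALS)
		return -EINVAL;
	if (info->seals & MODEL_SEAL_SEAL)
		return -EPERM;
	info->seals |= seals;
	return 0;
}

/* Always succeeds: an ordinary object simply reports MODEL_SEAL_SEAL. */
uint32_t model_get_seals(const struct model_inode *info)
{
	return info->seals;
}
```

With this scheme an ordinary object reports the "seal" seal instead of an
error when queried, and the same check that rejects further seals also
covers objects that never allowed sealing.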

> +
> +	/*
> +	 * SEALING
> +	 * Sealing allows multiple parties to share a shmem-file but restrict
> +	 * access to a specific subset of file operations. Seals can only be
> +	 * added, but never removed. This way, mutually untrusted parties can
> +	 * share common memory regions with a well-defined policy. A malicious
> +	 * peer can thus never perform unwanted operations on a shared object.
> +	 *
> +	 * Seals are only supported on special shmem-files and always affect
> +	 * the whole underlying inode. Once a seal is set, it may prevent some
> +	 * kinds of access to the file. Currently, the following seals are
> +	 * defined:
> +	 *   SEAL_SEAL: Prevent further seals from being set on this file
> +	 *   SEAL_SHRINK: Prevent the file from shrinking
> +	 *   SEAL_GROW: Prevent the file from growing
> +	 *   SEAL_WRITE: Prevent write access to the file
> +	 *
> +	 * As we don't require any trust relationship between two parties, we
> +	 * must prevent seals from being removed. Therefore, sealing a file
> +	 * only adds a given set of seals to the file, it never touches
> +	 * existing seals. Furthermore, the "setting seals"-operation can be
> +	 * sealed itself, which basically prevents any further seal from being
> +	 * added.
> +	 *
> +	 * Semantics of sealing are only defined on volatile files. Only
> +	 * anonymous shmem files support sealing. More importantly, seals are
> +	 * never written to disk. Therefore, there's no plan to support it on
> +	 * other file types.
> +	 */
> +
> +	if (file->f_op != &shmem_file_operations)
> +		return -EBADF;

Okay: that's not what I expect -EBADF to mean, but it does follow
the precedent set by pipe_fcntl().

> +	if (!(info->seals & SHMEM_ALLOW_SEALING))
> +		return -EBADF;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		return -EPERM;
> +	if (seals & ~(u32)F_ALL_SEALS)
> +		return -EINVAL;
> +
> +	/*
> +	 * - i_mutex prevents racing write/ftruncate/fallocate/..
> +	 * - mmap_sem prevents racing mmap() calls
> +	 */
> +
> +	mutex_lock(&inode->i_mutex);
> +	down_read(&current->mm->mmap_sem);

I don't think that use of current->mm->mmap_sem can be correct:
it guards against races with other threads of this process, but
what if another process has this object open and races to mmap it?

I imagine you have to use i_mmap_mutex, and plumb an error return
into __vma_link_file() etc in mm/mmap.c, if the file is found already
sealed against writing - which may prove irritating, especially with
knowledge of sealing being private to mm/shmem.c.

But I have not stopped to work it out properly: the answer may depend
on the answer to the major issue of outstanding async I/O.  As I
mentioned last week, that's an issue I think we cannot overlook.
Tony's copy-raised-pagecount-pages suggestion is a good one, but
not so attractive that I'll give up hope for a better solution.

> +
> +	/* you cannot seal while shared mappings exist */
> +	if (file->f_mapping->i_mmap_writable > 0) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +
> +	if (info->seals & F_SEAL_SEAL) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +
> +	info->seals |= seals;
> +	r = 0;
> +
> +unlock:
> +	up_read(&current->mm->mmap_sem);
> +	mutex_unlock(&inode->i_mutex);
> +	return r;
> +}
> +EXPORT_SYMBOL(shmem_add_seals);

EXPORT_SYMBOL_GPL(shmem_add_seals).

We don't see an example of its use, but I certainly don't want to see
drivers/gpu changes as part of this patchset, so I think that's okay.

> +
> +int shmem_get_seals(struct file *file)
> +{
> +	struct shmem_inode_info *info;
> +
> +	if (file->f_op != &shmem_file_operations)
> +		return -EBADF;
> +
> +	info = SHMEM_I(file_inode(file));
> +	if (!(info->seals & SHMEM_ALLOW_SEALING))
> +		return -EBADF;

Hmm, so the F_SEAL_SEAL change I suggest would remove that -EBADF,
and instead return F_SEAL_SEAL on any shmem object.  I think that's
fine, but you may see a reason why not?

> +
> +	return info->seals & F_ALL_SEALS;
> +}
> +EXPORT_SYMBOL(shmem_get_seals);

EXPORT_SYMBOL_GPL(shmem_get_seals).

> +
> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	long r;

long ret or retval please.

> +
> +	switch (cmd) {
> +	case F_ADD_SEALS:
> +		/* disallow upper 32bit */
> +		if (arg >> 32)
> +			return -EINVAL;
> +
> +		r = shmem_add_seals(file, arg);
> +		break;
> +	case F_GET_SEALS:
> +		r = shmem_get_seals(file);
> +		break;
> +	default:
> +		r = -EINVAL;
> +		break;
> +	}
> +
> +	return r;
> +}
> +
>  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  							 loff_t len)
>  {
>  	struct inode *inode = file_inode(file);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> +	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_falloc shmem_falloc;
>  	pgoff_t start, index, end;
>  	int error;
> @@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>  
> +		/* protected by i_mutex */
> +		if (info->seals & F_SEAL_WRITE) {
> +			error = -EPERM;
> +			goto out;
> +		}
> +
>  		if ((u64)unmap_end > (u64)unmap_start)
>  			unmap_mapping_range(mapping, unmap_start,
>  					    1 + unmap_end - unmap_start, 0);
> @@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  	if (error)
>  		goto out;
>  
> +	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {

Okay.  I don't think it needs a comment, but I note in passing that we
*could* permit a FALLOC_FL_KEEP_SIZE change there, since it will make
no difference to what data is accessible; but it would also serve no
useful purpose, so fine to stick with the simpler test you have.
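
To make the two options concrete, here is a minimal userspace sketch of the grow check, assuming a hypothetical helper grow_check() and a flag exempt_keep_size that selects the more permissive variant mentioned above. MODEL_KEEP_SIZE mirrors the value of FALLOC_FL_KEEP_SIZE; none of these names come from the patch, and the quoted code implements only the strict variant.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define SEAL_GROW	0x0004	/* mirrors F_SEAL_GROW in the quoted patch */
#define MODEL_KEEP_SIZE	0x01	/* same value as FALLOC_FL_KEEP_SIZE */

/*
 * Hypothetical model of the F_SEAL_GROW test in shmem_fallocate().
 * With exempt_keep_size == false it behaves like the quoted patch;
 * with true it lets KEEP_SIZE requests through, since they never
 * change i_size.
 */
static int grow_check(unsigned int seals, int mode, long long offset,
		      long long len, long long i_size, bool exempt_keep_size)
{
	assert(offset >= 0 && len >= 0);

	if (!(seals & SEAL_GROW))
		return 0;
	if (exempt_keep_size && (mode & MODEL_KEEP_SIZE))
		return 0;
	if (offset + len > i_size)
		return -EPERM;
	return 0;
}
```

The strict variant needs no knowledge of the fallocate mode, which is the simplicity argument made above.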

> +		error = -EPERM;
> +		goto out;
> +	}
> +
>  	start = offset >> PAGE_CACHE_SHIFT;
>  	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>  	/* Try to avoid a swapstorm if len is impossible to satisfy */
> -- 
> 1.9.2

There is also, or may be, a small issue of sparse (holey) files.
I do have a question on that in comments on your next patch, and
the answer here may depend on what you want in memfd_create().

What I'm thinking of here is that once a sparse file is sealed
against writing, we must be sure not to give an error when reading
its holes: whereas there are a few unlikely ways in which reading
the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.

Most of the memory allocations here can in fact only fail when the
allocating process has already been selected for OOM-kill: that is
not guaranteed forever, but it is how __alloc_pages_slowpath()
currently behaves on ordinary low-order allocations, and will be
hard to change if we ever do so.  Though I dislike relying upon
this, I think we can allow reading holes to fail, if the process
is going to be forcibly killed before it returns to userspace.

But there might still be an issue with vm_enough_memory(),
and there might still be an issue with memcg limits.

We do already use the ZERO_PAGE instead of allocating when it's a
simple read; and on the face of it, we could extend that to mmap
once the file is sealed.  But I am rather afraid to do so - for
many years there was an mmap /dev/zero case which did that, but
it was an easily forgotten case which caught us out at least
once, so I'm reluctant to reintroduce it now for sealing.
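
As an illustration of the simple-read case only, here is a userspace model in which holes are served from one shared zero page and nothing is allocated on the read path. This is a sketch under stated assumptions: struct sparse_file, sparse_read() and the MODEL_* constants are hypothetical, and it does not reflect how mm/shmem.c is actually structured.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE	4096
#define MODEL_NPAGES	4

/* One shared, read-only page of zeroes stands in for ZERO_PAGE. */
static const unsigned char zero_page[MODEL_PAGE_SIZE];

struct sparse_file {
	unsigned char *pages[MODEL_NPAGES];	/* NULL means hole */
};

/* Copy up to len bytes starting at pos; holes read back as zeroes. */
static size_t sparse_read(const struct sparse_file *f, size_t pos,
			  void *buf, size_t len)
{
	size_t done = 0;

	assert(f && buf);
	while (done < len && pos < (size_t)MODEL_NPAGES * MODEL_PAGE_SIZE) {
		size_t idx = pos / MODEL_PAGE_SIZE;
		size_t off = pos % MODEL_PAGE_SIZE;
		size_t n = MODEL_PAGE_SIZE - off;
		const unsigned char *src;

		if (n > len - done)
			n = len - done;
		/* No allocation here, so this path cannot hit -ENOMEM. */
		src = f->pages[idx] ? f->pages[idx] : zero_page;
		memcpy((unsigned char *)buf + done, src + off, n);
		done += n;
		pos += n;
	}
	return done;
}
```

The mmap case discussed above has no equivalent in this model, which is exactly where the difficulty lies.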

Anyway, I don't expect you to resolve the issue of sealed holes:
that's very much my territory, to give you support on.

Hugh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 1/3] shm: add sealing API
@ 2014-05-20  2:16     ` Hugh Dickins
  0 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-20  2:16 UTC (permalink / raw)
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirski, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, john.stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:

> If two processes share a common memory region, they usually want some
> guarantees to allow safe access. This often includes:
>   - one side cannot overwrite data while the other reads it
>   - one side cannot shrink the buffer while the other accesses it
>   - one side cannot grow the buffer beyond previously set boundaries
> 
> If there is a trust-relationship between both parties, there is no need
> for policy enforcement. However, if there's no trust relationship (eg.,
> for general-purpose IPC) sharing memory-regions is highly fragile and
> often not possible without local copies. Look at the following two
> use-cases:
>   1) A graphics client wants to share its rendering-buffer with a
>      graphics-server. The memory-region is allocated by the client for
>      read/write access and a second FD is passed to the server. While
>      scanning out from the memory region, the server has no guarantee that
>      the client doesn't shrink the buffer at any time, requiring rather
>      cumbersome SIGBUS handling.
>   2) A process wants to perform an RPC on another process. To avoid huge
>      bandwidth consumption, zero-copy is preferred. After a message is
>      assembled in-memory and an FD is passed to the remote side, both sides
>      want to be sure that neither modifies this shared copy anymore. The
>      source may have put sensitive data into the message without a separate
>      copy and the target may want to parse the message inline, to avoid a
>      local copy.
> 
> While SIGBUS handling, POSIX mandatory locking and MAP_DENYWRITE provide
> ways to achieve most of this, the first one is disproportionately ugly to
> use in libraries and the latter two are broken/racy or even disabled due
> to denial of service attacks.
> 
> This patch introduces the concept of SEALING. If you seal a file, a
> specific set of operations is blocked on that file forever.
> Unlike locks, seals can only be set, never removed. Hence, once you have
> verified that a specific set of seals is set, you're guaranteed that no-one
> can perform the blocked operations on this file anymore.
> 
> An initial set of SEALS is introduced by this patch:
>   - SHRINK: If SEAL_SHRINK is set, the file in question cannot be reduced
>             in size. This affects ftruncate() and open(O_TRUNC).
>   - GROW: If SEAL_GROW is set, the file in question cannot be increased
>           in size. This affects ftruncate(), fallocate() and write().
>   - WRITE: If SEAL_WRITE is set, no write operations (besides resizing)
>            are possible. This affects fallocate(PUNCH_HOLE), mmap() and
>            write().
>   - SEAL: If SEAL_SEAL is set, no further seals can be added to a file.
>           This basically prevents the F_ADD_SEALS operation on a file and
>           can be set to prevent others from adding further seals that you
>           don't want.
> 
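
The list above can be read as a table from seal bits to blocked operations. A minimal sketch of that table follows, assuming a hypothetical enum file_op and helper op_allowed() invented for illustration; the seal values mirror F_SEAL_* from the uapi hunk quoted further down. A write() past the end of the file counts as both OP_GROW and OP_WRITE_DATA here.

```c
#include <assert.h>
#include <stdbool.h>

/* Seal bits; values mirror F_SEAL_* in the quoted uapi hunk. */
enum {
	SEAL_SEAL   = 0x0001,
	SEAL_SHRINK = 0x0002,
	SEAL_GROW   = 0x0004,
	SEAL_WRITE  = 0x0008,
};

/* Hypothetical operation classes, named for this sketch only. */
enum file_op {
	OP_READ,	/* read(), private mappings */
	OP_SHRINK,	/* ftruncate() to a smaller size, open(O_TRUNC) */
	OP_GROW,	/* any increase of the file size */
	OP_WRITE_DATA,	/* write(), fallocate(PUNCH_HOLE) */
	OP_MAP_SHARED,	/* mmap(MAP_SHARED), as checked in shmem_mmap() */
	OP_ADD_SEALS,	/* fcntl(F_ADD_SEALS) */
};

static bool op_allowed(unsigned int seals, enum file_op op)
{
	switch (op) {
	case OP_READ:		return true;
	case OP_SHRINK:		return !(seals & SEAL_SHRINK);
	case OP_GROW:		return !(seals & SEAL_GROW);
	case OP_WRITE_DATA:	return !(seals & SEAL_WRITE);
	case OP_MAP_SHARED:	return !(seals & SEAL_WRITE);
	case OP_ADD_SEALS:	return !(seals & SEAL_SEAL);
	}
	assert(!"unknown file_op");
	return false;
}
```
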
> The described use-cases can easily use these seals to provide safe use
> without any trust-relationship:
>   1) The graphics server can verify that a passed file-descriptor has
>      SEAL_SHRINK set. This allows safe scanout, while the client is
>      allowed to increase buffer size for window-resizing on-the-fly.
>      Concurrent writes are explicitly allowed.
>   2) For general-purpose IPC, both processes can verify that SEAL_SHRINK,
>      SEAL_GROW and SEAL_WRITE are set. This guarantees that neither
>      process can modify the data while the other side parses it.
>      Furthermore, it guarantees that even with writable FDs passed to the
>      peer, it cannot increase the size to hit memory-limits of the source
>      process (in case the file-storage is accounted to the source).
> 
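
The receiving side of both use-cases reduces to a mask comparison on the value returned by F_GET_SEALS. A minimal sketch, assuming hypothetical helpers has_seals(), scanout_buffer_ok() and ipc_message_ok(); the seal values mirror F_SEAL_* from the quoted patch, and the seals argument stands for whatever fcntl(fd, F_GET_SEALS) returned.

```c
#include <assert.h>
#include <stdbool.h>

enum {
	SEAL_SEAL   = 0x0001,
	SEAL_SHRINK = 0x0002,
	SEAL_GROW   = 0x0004,
	SEAL_WRITE  = 0x0008,
};
#define ALL_SEALS (SEAL_SEAL | SEAL_SHRINK | SEAL_GROW | SEAL_WRITE)

/* True if every bit in required is present in seals. */
static bool has_seals(unsigned int seals, unsigned int required)
{
	assert((required & ~(unsigned int)ALL_SEALS) == 0);
	return (seals & required) == required;
}

/* Use-case 1: the server only needs the buffer never to shrink. */
static bool scanout_buffer_ok(unsigned int seals)
{
	return has_seals(seals, SEAL_SHRINK);
}

/* Use-case 2: the message must be immutable in size and content. */
static bool ipc_message_ok(unsigned int seals)
{
	return has_seals(seals, SEAL_SHRINK | SEAL_GROW | SEAL_WRITE);
}
```

Because seals can never be removed, a single successful check holds for the lifetime of the file; there is no need to re-check before each access.
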
> The new API is an extension to fcntl(), adding two new commands:
>   F_GET_SEALS: Return a bitset describing the seals on the file. This
>                can be called on any FD if the underlying file supports
>                sealing.
>   F_ADD_SEALS: Change the seals of a given file. This requires WRITE
>                access to the file and F_SEAL_SEAL may not already be set.
>                Furthermore, the underlying file must support sealing and
>                there may not be any existing shared mapping of that file.
>                Otherwise, EBADF/EPERM is returned.
>                The given seals are _added_ to the existing set of seals
>                on the file. You cannot remove seals again.
> 
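
The rules for the two commands can be modelled in userspace as follows. This is a sketch, not the kernel code: struct model_file and its fields are hypothetical stand-ins for the state that the quoted shmem_add_seals() consults, and the checks appear in the same order as in that function.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum {
	SEAL_SEAL   = 0x0001,
	SEAL_SHRINK = 0x0002,
	SEAL_GROW   = 0x0004,
	SEAL_WRITE  = 0x0008,
};
#define ALL_SEALS (SEAL_SEAL | SEAL_SHRINK | SEAL_GROW | SEAL_WRITE)

/* Hypothetical stand-in for the file state consulted by the patch. */
struct model_file {
	unsigned int seals;
	bool supports_sealing;		/* SHMEM_ALLOW_SEALING in the patch */
	bool writable_fd;		/* FMODE_WRITE */
	unsigned int shared_writable_maps;	/* i_mmap_writable */
};

static int model_add_seals(struct model_file *f, unsigned int seals)
{
	assert(f);
	if (!f->supports_sealing)
		return -EBADF;
	if (!f->writable_fd)
		return -EPERM;
	if (seals & ~(unsigned int)ALL_SEALS)
		return -EINVAL;
	if (f->shared_writable_maps > 0)
		return -EPERM;
	if (f->seals & SEAL_SEAL)
		return -EPERM;
	f->seals |= seals;	/* add only; existing seals are untouched */
	return 0;
}

static int model_get_seals(const struct model_file *f)
{
	assert(f);
	if (!f->supports_sealing)
		return -EBADF;
	return (int)(f->seals & ALL_SEALS);
}
```
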
> The fcntl() handler is currently specific to shmem and disabled on all
> files. A file needs to explicitly support sealing for this interface to
> work. A separate syscall is added in a follow-up, which creates files that
> support sealing. There is no intention to support this on other
> file-systems. Semantics are unclear for non-volatile files and we lack any
> use-case right now. Therefore, the implementation is specific to shmem.

Yes, I think you've struck the right balance, by making it a general
fcntl interface, but implementing it only in shmem.

> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  fs/fcntl.c                 |   5 ++
>  include/linux/shmem_fs.h   |  20 ++++++
>  include/uapi/linux/fcntl.h |  15 +++++
>  mm/shmem.c                 | 162 ++++++++++++++++++++++++++++++++++++++++++++-
>  4 files changed, 199 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/fcntl.c b/fs/fcntl.c
> index 9ead159..1a7a722 100644
> --- a/fs/fcntl.c
> +++ b/fs/fcntl.c
> @@ -21,6 +21,7 @@
>  #include <linux/rcupdate.h>
>  #include <linux/pid_namespace.h>
>  #include <linux/user_namespace.h>
> +#include <linux/shmem_fs.h>
>  
>  #include <asm/poll.h>
>  #include <asm/siginfo.h>
> @@ -336,6 +337,10 @@ static long do_fcntl(int fd, unsigned int cmd, unsigned long arg,
>  	case F_GETPIPE_SZ:
>  		err = pipe_fcntl(filp, cmd, arg);
>  		break;
> +	case F_ADD_SEALS:
> +	case F_GET_SEALS:
> +		err = shmem_fcntl(filp, cmd, arg);

Okay.  I agree that fcntl() is the best interface to use; and although
we always feel a bit dirty exporting a function from shmem.c for use
outside, you are following what's already done with pipe_fcntl(); and
it seems overkill to add an fcntl method to file_operations without any
wider usage.

> +		break;
>  	default:
>  		break;
>  	}
> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
> index 4d1771c..c043d67 100644
> --- a/include/linux/shmem_fs.h
> +++ b/include/linux/shmem_fs.h
> @@ -1,6 +1,7 @@
>  #ifndef __SHMEM_FS_H
>  #define __SHMEM_FS_H
>  
> +#include <linux/file.h>
>  #include <linux/swap.h>
>  #include <linux/mempolicy.h>
>  #include <linux/pagemap.h>
> @@ -20,6 +21,7 @@ struct shmem_inode_info {
>  	struct shared_policy	policy;		/* NUMA memory alloc policy */
>  	struct list_head	swaplist;	/* chain of maybes on swap */
>  	struct simple_xattrs	xattrs;		/* list of xattrs */
> +	u32			seals;		/* shmem seals */

Okay.  I do wonder why you chose "u32" where I would have chosen
"unsigned int": probably just our different backgrounds - kernel
internals most often use the basic types, whereas you are thinking
about explicit interfaces.  Even syscalls tend to have "int" args,
but perhaps that's just a historic mistake.  I have no good reason
to disagree with your use of "u32", but draw attention to it in
case someone else feels more strongly.

Oh, how about you move "seals" up between "lock" and "flags":
on many configurations, it will then occupy what used to be padding.
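
The padding argument can be checked with a reduced model of the structure. This is a sketch under assumptions: the lock is modelled as a 4-byte integer (as spinlock_t commonly is without debug options), only the first few fields are kept, and both struct names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as in the patch: seals added after the other fields. */
struct seals_at_end {
	uint32_t	lock;		/* stand-in for spinlock_t */
	unsigned long	flags;
	unsigned long	alloced;
	uint32_t	seals;
};

/* Suggested layout: seals directly after lock. */
struct seals_after_lock {
	uint32_t	lock;
	uint32_t	seals;
	unsigned long	flags;
	unsigned long	alloced;
};

/* Bytes saved by the reordering on the ABI this is compiled for. */
static size_t bytes_saved(void)
{
	assert(sizeof(struct seals_after_lock) <= sizeof(struct seals_at_end));
	return sizeof(struct seals_at_end) - sizeof(struct seals_after_lock);
}
```

Where unsigned long is 8 bytes and 8-byte aligned, seals lands in the hole that would otherwise follow the 4-byte lock; where unsigned long is 4 bytes there is no hole and the reordering changes nothing, hence "on many configurations".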

>  	struct inode		vfs_inode;
>  };
>  
> @@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
>  					mapping_gfp_mask(mapping));
>  }
>  
> +/* marks inode to support sealing */
> +#define SHMEM_ALLOW_SEALING (1U << 31)

This feels unnecessary to me: see comment on shmem_add_seals.

> +
> +#ifdef CONFIG_SHMEM

Should that rather be CONFIG_TMPFS?  I think you have placed
shmem_fcntl() and its supporting functions in the CONFIG_TMPFS
part of mm/shmem.c (and CONFIG_TMPFS depends on CONFIG_SHMEM).

It's almost certainly true that "CONFIG_TMPFS" has outlived its v2.4
usefulness, and serves as more of a confusion than a help nowadays:
particularly since !CONFIG_SHMEM gives you the ramfs filesystem, but
CONFIG_SHMEM without CONFIG_TMPFS does not give you a filesystem.

Blame me for leaving CONFIG_TMPFS around; but for now,
I think it's CONFIG_TMPFS you want there (please check).

> +
> +extern int shmem_add_seals(struct file *file, u32 seals);
> +extern int shmem_get_seals(struct file *file);
> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> +
> +#else
> +

Are you sure you want to generate a link error rather than a runtime
fallback if there's a driver using shmem_add_seals() or shmem_get_seals()
in a !CONFIG_SHMEM kernel?  That might be the right decision, but it
surprises me a little.

> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
> +{
> +	return -EINVAL;

Should be -EBADF to match what you get in the CONFIG_SHMEM case.
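
Taken together, the header comments above could look roughly like the fragment below. It is a sketch of one possible outcome, not a request: guarding on CONFIG_TMPFS and providing runtime fallbacks for shmem_add_seals() and shmem_get_seals() are both assumptions still under discussion, and unsigned int stands in for u32 so that the fragment reads on its own.

```c
#include <errno.h>

struct file;	/* opaque in this sketch */

#ifdef CONFIG_TMPFS

extern int shmem_add_seals(struct file *file, unsigned int seals);
extern int shmem_get_seals(struct file *file);
extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);

#else

/* Runtime fallbacks instead of a link error (one option, not a decision). */
static inline int shmem_add_seals(struct file *file, unsigned int seals)
{
	(void)file;
	(void)seals;
	return -EBADF;
}

static inline int shmem_get_seals(struct file *file)
{
	(void)file;
	return -EBADF;
}

/* -EBADF rather than -EINVAL, matching the configured-in case. */
static inline long shmem_fcntl(struct file *file, unsigned int cmd,
			       unsigned long arg)
{
	(void)file;
	(void)cmd;
	(void)arg;
	return -EBADF;
}

#endif
```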

> +}
> +
> +#endif
> +
>  #endif
> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
> index 074b886..1b9b9f4 100644
> --- a/include/uapi/linux/fcntl.h
> +++ b/include/uapi/linux/fcntl.h
> @@ -28,6 +28,21 @@
>  #define F_GETPIPE_SZ	(F_LINUX_SPECIFIC_BASE + 8)
>  
>  /*
> + * Set/Get seals
> + */
> +#define F_ADD_SEALS	(F_LINUX_SPECIFIC_BASE + 9)
> +#define F_GET_SEALS	(F_LINUX_SPECIFIC_BASE + 10)
> +
> +/*
> + * Types of seals
> + */
> +#define F_SEAL_SEAL	0x0001	/* prevent further seals from being set */
> +#define F_SEAL_SHRINK	0x0002	/* prevent file from shrinking */
> +#define F_SEAL_GROW	0x0004	/* prevent file from growing */
> +#define F_SEAL_WRITE	0x0008	/* prevent writes */
> +/* (1U << 31) is reserved for internal use */

I question the need to reserve that: see comment on shmem_add_seals.

> +
> +/*
>   * Types of directory notifications that may be requested.
>   */
>  #define DN_ACCESS	0x00000001	/* File accessed */
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 9f70e02..175a5b8 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
>  #include <linux/highmem.h>
>  #include <linux/seq_file.h>
>  #include <linux/magic.h>
> +#include <linux/fcntl.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>  {
>  	struct inode *inode = dentry->d_inode;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +	loff_t oldsize = inode->i_size;
> +	loff_t newsize = attr->ia_size;
>  	int error;
>  
>  	error = inode_change_ok(inode, attr);
>  	if (error)
>  		return error;
>  
> -	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
> -		loff_t oldsize = inode->i_size;
> -		loff_t newsize = attr->ia_size;
> +	/* protected by i_mutex */
> +	if (attr->ia_valid & ATTR_SIZE) {
> +		if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
> +		    (newsize > oldsize && (info->seals & F_SEAL_GROW)))
> +			return -EPERM;
> +	}
>  
> +	if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>  		if (newsize != oldsize) {
>  			i_size_write(inode, newsize);
>  			inode->i_ctime = inode->i_mtime = CURRENT_TIME;
> @@ -1289,6 +1297,13 @@ out_nomem:
>  
>  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>  {
> +	struct inode *inode = file_inode(file);
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +
> +	/* protected by mmap_sem */
> +	if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
> +		return -EPERM;
> +
>  	file_accessed(file);
>  	vma->vm_ops = &shmem_vm_ops;
>  	return 0;
> @@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  			struct page **pagep, void **fsdata)
>  {
>  	struct inode *inode = mapping->host;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
>  	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
> +
> +	/* i_mutex is held by caller */
> +	if (info->seals & F_SEAL_WRITE)
> +		return -EPERM;
> +	if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
> +		return -EPERM;
> +
>  	return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
>  }
>  
> @@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
>  	return offset;
>  }
>  
> +#define F_ALL_SEALS (F_SEAL_SEAL | \
> +		     F_SEAL_SHRINK | \
> +		     F_SEAL_GROW | \
> +		     F_SEAL_WRITE)
> +
> +int shmem_add_seals(struct file *file, u32 seals)
> +{
> +	struct dentry *dentry = file->f_path.dentry;
> +	struct inode *inode = dentry->d_inode;
> +	struct shmem_inode_info *info = SHMEM_I(inode);
> +	int r;

mm/shmem.c is currently using "int error", "int err", "int ret" or
"int retval" for this (maybe more!): I'd prefer you not to add "r"
to the menagerie, "error" or "err" would be good here.

> +
> +	/* SHMEM_ALLOW_SEALING is a private, unused bit */
> +	BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);

I see no need for SHMEM_ALLOW_SEALING.
Now that you have added F_SEAL_SEAL, why don't you just make
shmem_get_inode() initialize info->seals with F_SEAL_SEAL,
then clear that in the one place you need to in the next patch?
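
A minimal userspace model of that suggestion, assuming hypothetical names throughout (struct model_inode and the model_* helpers are not from the patch): every object starts out with the sealing operation itself sealed, and only the memfd_create(MFD_ALLOW_SEALING) path clears that bit.

```c
#include <assert.h>
#include <errno.h>

enum {
	SEAL_SEAL   = 0x0001,
	SEAL_SHRINK = 0x0002,
	SEAL_GROW   = 0x0004,
	SEAL_WRITE  = 0x0008,
};
#define ALL_SEALS (SEAL_SEAL | SEAL_SHRINK | SEAL_GROW | SEAL_WRITE)

struct model_inode {
	unsigned int seals;
};

/* What shmem_get_inode() would do: sealing is sealed by default. */
static void model_inode_init(struct model_inode *i)
{
	assert(i);
	i->seals = SEAL_SEAL;
}

/* The one place that opts in, e.g. for MFD_ALLOW_SEALING. */
static void model_allow_sealing(struct model_inode *i)
{
	assert(i);
	i->seals &= ~(unsigned int)SEAL_SEAL;
}

static int model_add_seals(struct model_inode *i, unsigned int seals)
{
	assert(i);
	if (seals & ~(unsigned int)ALL_SEALS)
		return -EINVAL;
	if (i->seals & SEAL_SEAL)
		return -EPERM;
	i->seals |= seals;
	return 0;
}

static int model_get_seals(const struct model_inode *i)
{
	assert(i);
	return (int)i->seals;
}
```

In this model no private bit is needed, an ordinary object reports F_SEAL_SEAL from F_GET_SEALS, and F_ADD_SEALS on it fails with -EPERM where the patch as posted returns -EBADF.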

> +
> +	/*
> +	 * SEALING
> +	 * Sealing allows multiple parties to share a shmem-file but restrict
> +	 * access to a specific subset of file operations. Seals can only be
> +	 * added, but never removed. This way, mutually untrusted parties can
> +	 * share common memory regions with a well-defined policy. A malicious
> +	 * peer can thus never perform unwanted operations on a shared object.
> +	 *
> +	 * Seals are only supported on special shmem-files and always affect
> +	 * the whole underlying inode. Once a seal is set, it may prevent some
> +	 * kinds of access to the file. Currently, the following seals are
> +	 * defined:
> +	 *   SEAL_SEAL: Prevent further seals from being set on this file
> +	 *   SEAL_SHRINK: Prevent the file from shrinking
> +	 *   SEAL_GROW: Prevent the file from growing
> +	 *   SEAL_WRITE: Prevent write access to the file
> +	 *
> +	 * As we don't require any trust relationship between two parties, we
> +	 * must prevent seals from being removed. Therefore, sealing a file
> +	 * only adds a given set of seals to the file, it never touches
> +	 * existing seals. Furthermore, the "setting seals"-operation can be
> +	 * sealed itself, which basically prevents any further seal from being
> +	 * added.
> +	 *
> +	 * Semantics of sealing are only defined on volatile files. Only
> +	 * anonymous shmem files support sealing. More importantly, seals are
> +	 * never written to disk. Therefore, there's no plan to support it on
> +	 * other file types.
> +	 */
> +
> +	if (file->f_op != &shmem_file_operations)
> +		return -EBADF;

Okay: that's not what I expect -EBADF to mean, but it does follow
the precedent set by pipe_fcntl().

> +	if (!(info->seals & SHMEM_ALLOW_SEALING))
> +		return -EBADF;
> +	if (!(file->f_mode & FMODE_WRITE))
> +		return -EPERM;
> +	if (seals & ~(u32)F_ALL_SEALS)
> +		return -EINVAL;
> +
> +	/*
> +	 * - i_mutex prevents racing write/ftruncate/fallocate/..
> +	 * - mmap_sem prevents racing mmap() calls
> +	 */
> +
> +	mutex_lock(&inode->i_mutex);
> +	down_read(&current->mm->mmap_sem);

I don't think that use of current->mm->mmap_sem can be correct:
it guards against races with other threads of this process, but
what if another process has this object open and races to mmap it?

I imagine you have to use i_mmap_mutex, and plumb an error return
into __vma_link_file() etc in mm/mmap.c, if the file is found already
sealed against writing - which may prove irritating, especially with
knowledge of sealing being private to mm/shmem.c.

But I have not stopped to work it out properly: the answer may depend
on the answer to the major issue of outstanding async I/O.  As I
mentioned last week, that's an issue I think we cannot overlook.
Tony's copy-raised-pagecount-pages suggestion is a good one, but
not so attractive that I'll give up hope for a better solution.

> +
> +	/* you cannot seal while shared mappings exist */
> +	if (file->f_mapping->i_mmap_writable > 0) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +
> +	if (info->seals & F_SEAL_SEAL) {
> +		r = -EPERM;
> +		goto unlock;
> +	}
> +
> +	info->seals |= seals;
> +	r = 0;
> +
> +unlock:
> +	up_read(&current->mm->mmap_sem);
> +	mutex_unlock(&inode->i_mutex);
> +	return r;
> +}
> +EXPORT_SYMBOL(shmem_add_seals);

EXPORT_SYMBOL_GPL(shmem_add_seals).

We don't see an example of its use, but I certainly don't want to see
drivers/gpu changes as part of this patchset, so I think that's okay.

> +
> +int shmem_get_seals(struct file *file)
> +{
> +	struct shmem_inode_info *info;
> +
> +	if (file->f_op != &shmem_file_operations)
> +		return -EBADF;
> +
> +	info = SHMEM_I(file_inode(file));
> +	if (!(info->seals & SHMEM_ALLOW_SEALING))
> +		return -EBADF;

Hmm, so the F_SEAL_SEAL change I suggest would remove that -EBADF,
and instead return F_SEAL_SEAL on any shmem object.  I think that's
fine, but you may see a reason why not?

> +
> +	return info->seals & F_ALL_SEALS;
> +}
> +EXPORT_SYMBOL(shmem_get_seals);

EXPORT_SYMBOL_GPL(shmem_get_seals).

> +
> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> +{
> +	long r;

long ret or retval please.

> +
> +	switch (cmd) {
> +	case F_ADD_SEALS:
> +		/* disallow upper 32bit */
> +		if (arg >> 32)
> +			return -EINVAL;
> +
> +		r = shmem_add_seals(file, arg);
> +		break;
> +	case F_GET_SEALS:
> +		r = shmem_get_seals(file);
> +		break;
> +	default:
> +		r = -EINVAL;
> +		break;
> +	}
> +
> +	return r;
> +}
> +
>  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  							 loff_t len)
>  {
>  	struct inode *inode = file_inode(file);
>  	struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
> +	struct shmem_inode_info *info = SHMEM_I(inode);
>  	struct shmem_falloc shmem_falloc;
>  	pgoff_t start, index, end;
>  	int error;
> @@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  		loff_t unmap_start = round_up(offset, PAGE_SIZE);
>  		loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>  
> +		/* protected by i_mutex */
> +		if (info->seals & F_SEAL_WRITE) {
> +			error = -EPERM;
> +			goto out;
> +		}
> +
>  		if ((u64)unmap_end > (u64)unmap_start)
>  			unmap_mapping_range(mapping, unmap_start,
>  					    1 + unmap_end - unmap_start, 0);
> @@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>  	if (error)
>  		goto out;
>  
> +	if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {

Okay.  I don't think it needs a comment, but I note in passing that we
*could* permit a FALLOC_FL_KEEP_SIZE change there, since it will make
no difference to what data is accessible; but it would also serve no
useful purpose, so fine to stick with the simpler test you have.

> +		error = -EPERM;
> +		goto out;
> +	}
> +
>  	start = offset >> PAGE_CACHE_SHIFT;
>  	end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>  	/* Try to avoid a swapstorm if len is impossible to satisfy */
> -- 
> 1.9.2

There is also, or may be, a small issue of sparse (holey) files.
I do have a question on that in comments on your next patch, and
the answer here may depend on what you want in memfd_create().

What I'm thinking of here is that once a sparse file is sealed
against writing, we must be sure not to give an error when reading
its holes: whereas there are a few unlikely ways in which reading
the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.

Most of the memory allocations here can in fact only fail when the
allocating process has already been selected for OOM-kill: that is
not guaranteed forever, but it is how __alloc_pages_slowpath()
currently behaves on ordinary low-order allocations, and will be
hard to change if we ever do so.  Though I dislike relying upon
this, I think we can allow reading holes to fail, if the process
is going to be forcibly killed before it returns to userspace.

But there might still be an issue with vm_enough_memory(),
and there might still be an issue with memcg limits.

We do already use the ZERO_PAGE instead of allocating when it's a
simple read; and on the face of it, we could extend that to mmap
once the file is sealed.  But I am rather afraid to do so - for
many years there was an mmap /dev/zero case which did that, but
it was an easily forgotten case which caught us out at least
once, so I'm reluctant to reintroduce it now for sealing.

Anyway, I don't expect you to resolve the issue of sealed holes:
that's very much my territory, to give you support on.

Hugh


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-04-15 18:38   ` David Herrmann
@ 2014-05-20  2:20     ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-20  2:20 UTC (permalink / raw)
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, john.stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:

> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It can support sealing and avoids any
> connection to user-visible mount-points. Thus, it's not subject to quotas
> on mounted file-systems and can be used like malloc()'ed memory, only
> with a file-descriptor to it.
> 
> memfd_create() does not create a front-FD, but instead returns the raw

What is a front-FD?

> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. If you
> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
> supported (like on all other regular files).
> 
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and the like.

You mention quotas a couple of times, and I want to be clear about that.

I think you are mainly thinking of the "df" size limitation which comes
by default on a tmpfs mount, but can be retuned or removed with the
size= or nr_blocks= mount options.  You want memfd_create() to be free
of that limitation, which indeed it is.

(I'm not proud of the way in which an unlimited tmpfs mount can easily
be used to OOM the system, killing processes which do little to give
back the memory needed; but that's how it is, and you're not making
that worse, just adding a further interface to it.)

And we have never implemented fs/quota/-style quotas on tmpfs,
so you're certainly free from those.

But a created memfd is still subject to an RLIMIT_FSIZE limit, and
to a memcg's memory.limit_in_bytes and memory.memsw.limit_in_bytes:
I expect you don't care about those, that they would be unlimited
in the cases that you care about.

And a created memfd is still subject to __vm_enough_memory() limiting:
unlimited when OVERCOMMIT_ALWAYS, a little unpredictable when
OVERCOMMIT_GUESS, strictly accounted when OVERCOMMIT_NEVER.  I don't
think we can compromise on OVERCOMMIT_NEVER, but if OVERCOMMIT_GUESS
gives you a problem, we could probably tweak it for your case.
More on this below, when considering the size arg to memfd_create().

> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  arch/x86/syscalls/syscall_32.tbl |  1 +
>  arch/x86/syscalls/syscall_64.tbl |  1 +

Okay.  No point in cluttering the patchset with other architectures
until this is closer to merge.  Miklos Szeredi's recent patches
"add renameat2 syscall" provide a very helpful precedent to follow.

>  include/linux/syscalls.h         |  1 +
>  include/uapi/linux/memfd.h       | 10 ++++++
>  kernel/sys_ni.c                  |  1 +
>  mm/shmem.c                       | 74 ++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 88 insertions(+)
>  create mode 100644 include/uapi/linux/memfd.h
> 
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index 96bc506..c943b8a 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -359,3 +359,4 @@
>  350	i386	finit_module		sys_finit_module
>  351	i386	sched_setattr		sys_sched_setattr
>  352	i386	sched_getattr		sys_sched_getattr
> +353	i386	memfd_create		sys_memfd_create
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index 04376ac..dfcfd6f 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
>  316	common	renameat2		sys_renameat2
> +317	common	memfd_create		sys_memfd_create
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index a4a0588..133b705 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>  asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>  asmlinkage long sys_eventfd(unsigned int count);
>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
> +asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>  asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>  asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> new file mode 100644
> index 0000000..c4a6db0
> --- /dev/null
> +++ b/include/uapi/linux/memfd.h
> @@ -0,0 +1,10 @@
> +#ifndef _UAPI_LINUX_MEMFD_H
> +#define _UAPI_LINUX_MEMFD_H
> +
> +#include <linux/types.h>

Why include linux/types.h in this one?

> +
> +/* flags for memfd_create(2) (u64) */
> +#define MFD_CLOEXEC		0x0001ULL
> +#define MFD_ALLOW_SEALING	0x0002ULL
> +
> +#endif /* _UAPI_LINUX_MEMFD_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index bc8d1b7..f96c329 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -195,6 +195,7 @@ cond_syscall(compat_sys_timerfd_settime);
>  cond_syscall(compat_sys_timerfd_gettime);
>  cond_syscall(sys_eventfd);
>  cond_syscall(sys_eventfd2);
> +cond_syscall(sys_memfd_create);
>  
>  /* performance counters: */
>  cond_syscall(sys_perf_event_open);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 175a5b8..203cc4e 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
>  #include <linux/highmem.h>
>  #include <linux/seq_file.h>
>  #include <linux/magic.h>
> +#include <linux/syscalls.h>
>  #include <linux/fcntl.h>
> +#include <uapi/linux/memfd.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
> @@ -2919,6 +2921,78 @@ out4:
>  	return error;
>  }
>  

Whereas 1/3's sealing stuff was under CONFIG_TMPFS, this is in a
CONFIG_SHMEM part of mm/shmem.c, built even when !CONFIG_TMPFS: in
which case you could not write to or truncate the object created,
just mmap it and access it that way (like SysV SHM).  Not necessarily
wrong, but it may prevent surprises to put this under CONFIG_TMPFS:
the user gets an fd, so probably expects filesystem operations to work.

> +#define MFD_NAME_PREFIX "memfd:"
> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> +
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
> +
> +SYSCALL_DEFINE3(memfd_create,
> +		const char*, uname,
> +		u64, size,
> +		u64, flags)

If I'd come in earlier, I'd have probably looked for another name
than memfd_create; but I don't have anything better in mind, and
you've done a great job of sounding out potential users, so let's
stick with the name everyone is expecting.

The uname: it's a funny thing, not belonging in a filesystem tree;
but you're very sure you want it, and we already make up funny
names for SysV SHM and /dev/zero objects, so okay.

The size: u64 or loff_t or size_t?  But more on size below.

The flags: u64?  That's a big future you're allowing for!
open and mmap use ints for their flags, will this really need more?

But I don't think I've been present at the birth of a syscall before:
there are probably several considerations that I'm unaware of, that
you may have factored in - listen to the experts, not to me.

> +{
> +	struct shmem_inode_info *info;
> +	struct file *shm;

"struct file *file" is more usual.

> +	char *name;
> +	int fd, r;

"int err" or "int error" rather than "int r".

> +	long len;
> +
> +	if (flags & ~(u64)MFD_ALL_FLAGS)
> +		return -EINVAL;
> +	if ((u64)(loff_t)size != size || (loff_t)size < 0)
> +		return -EINVAL;
> +
> +	/* length includes terminating zero */
> +	len = strnlen_user(uname, MFD_NAME_MAX_LEN);
> +	if (len <= 0)
> +		return -EFAULT;
> +	else if (len > MFD_NAME_MAX_LEN)

Please omit the "else ".

And, since strnlen_user() returns length including terminating NUL,
wouldn't it be more exact to use MFD_NAME_MAX_LEN + 1 in those two
places above?
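
The off-by-one can be illustrated in userspace. This is a sketch only: model_strnlen_user() and name_accepted() are hypothetical helpers, the model merely mimics the documented strnlen_user() contract (length including the NUL, or a value larger than the limit when no NUL is found in time), and NAME_MAX is assumed to be 255 as on Linux.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#ifndef NAME_MAX
#define NAME_MAX 255	/* assumption: the usual Linux value */
#endif

#define MFD_NAME_PREFIX "memfd:"
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)

/*
 * Hypothetical userspace model of strnlen_user(): returns the string
 * length including the terminating NUL, or limit + 1 if no NUL is
 * found within the first 'limit' bytes.
 */
static size_t model_strnlen_user(const char *s, size_t limit)
{
	size_t i;

	for (i = 0; i < limit; i++)
		if (!s[i])
			return i + 1;
	return limit + 1;
}

/* Returns 1 if a name of name_len characters passes "len <= limit". */
static int name_accepted(size_t name_len, size_t limit)
{
	char buf[NAME_MAX + 2];
	size_t i;

	for (i = 0; i < name_len; i++)
		buf[i] = 'a';
	buf[name_len] = '\0';
	return model_strnlen_user(buf, limit) <= limit;
}
```

Under this model the v2 limit accepts at most 248 name characters, so the prefixed name tops out at 254 bytes; with MFD_NAME_MAX_LEN + 1 in both places it accepts 249, which fills NAME_MAX exactly.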

> +		return -EINVAL;
> +
> +	name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_KERNEL);

Probably better to say GFP_TEMPORARY than GFP_KERNEL,
though it doesn't seem to be used very much at all.

> +	if (!name)
> +		return -ENOMEM;
> +
> +	strcpy(name, MFD_NAME_PREFIX);
> +	if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
> +		r = -EFAULT;
> +		goto err_name;
> +	}
> +
> +	/* terminating-zero may have changed after strnlen_user() returned */
> +	if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
> +		r = -EFAULT;
> +		goto err_name;
> +	}
> +
> +	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
> +	if (fd < 0) {
> +		r = fd;
> +		goto err_name;
> +	}
> +
> +	shm = shmem_file_setup(name, size, 0);

That's an interesting line: I am anxious to know whether you mean to
pass flags 0 there, or would rather pass VM_NORESERVE.  Passing 0
makes the object resemble mmap or SysV SHM, in accounting for the
whole size upfront; passing VM_NORESERVE makes the object resemble
a tmpfs file, accounted page by page as they are instantiated.

Accounting meaning calls to __vm_enough_memory() in mm/mmap.c:
whose behaviour is governed by /proc/sys/vm/overcommit_memory
(and overcommit_kbytes or overcommit_ratio): OVERCOMMIT_ALWAYS
(no enforcement), OVERCOMMIT_GUESS (default) or OVERCOMMIT_NEVER
(enforcing strict no-overcommit).
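
To see which of those policies a given machine is running, a userspace sketch along these lines may help. The helper names are hypothetical; the numeric values 0, 1 and 2 are the ones the kernel uses for OVERCOMMIT_GUESS, OVERCOMMIT_ALWAYS and OVERCOMMIT_NEVER.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Map the number in /proc/sys/vm/overcommit_memory to its policy name. */
static const char *overcommit_policy_name(int mode)
{
	switch (mode) {
	case 0:
		return "OVERCOMMIT_GUESS";	/* heuristic, the default */
	case 1:
		return "OVERCOMMIT_ALWAYS";	/* no enforcement */
	case 2:
		return "OVERCOMMIT_NEVER";	/* strict no-overcommit */
	default:
		return "unknown";
	}
}

/* Read the current mode; returns -1 if the file cannot be read or parsed. */
static int read_overcommit_mode(void)
{
	FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");
	int mode = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &mode) != 1)
		mode = -1;
	fclose(f);
	return mode;
}
```

Calling overcommit_policy_name(read_overcommit_mode()) from a small main() prints the policy that __vm_enough_memory() would apply on the running system.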

We have a small problem if you really intend flags 0: because then
that size is preaccounted, yet we also allow these objects to grow
or be truncated without accounting, and the number (/proc/meminfo's
Committed_AS) will go wrong.

If you really intend that preaccounting, then we need to add an
orig_size field to shmem_inode_info, treating pages below that
as preaccounted and accounting pages above it one by one.
If you don't intend preaccounting, then please pass VM_NORESERVE
to shmem_file_setup().
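
The orig_size idea can be modelled in a few lines of userspace C. This is a toy sketch, not kernel code: struct toy_shmem_info and both functions are hypothetical, sizes are counted in pages, and the return values stand for the number of pages that would be charged against the commit limit.

```c
#include <assert.h>

/* Toy stand-in for the relevant part of shmem_inode_info (hypothetical). */
struct toy_shmem_info {
	unsigned long orig_pages;	/* pages covered by the size given at creation */
};

/* Creation with flags 0: the whole initial size is charged up front. */
static unsigned long toy_charge_at_create(struct toy_shmem_info *info,
					  unsigned long pages)
{
	info->orig_pages = pages;
	return pages;
}

/* Instantiating the page at 'index': charge only beyond the preaccounted range. */
static unsigned long toy_charge_at_instantiate(const struct toy_shmem_info *info,
					       unsigned long index)
{
	return index < info->orig_pages ? 0 : 1;
}
```

With VM_NORESERVE there would be no charge at creation and every instantiated page would be charged individually, which is the tmpfs-file behaviour described above.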

But this does highlight how the "size" arg to memfd_create() is
perhaps redundant.  Why give a size there, when size can be changed
afterwards?  I expect your answer is that many callers want to choose
the size at the beginning, and would prefer to avoid the extra call.
I'm not sure if that's a good enough reason for a redundant argument.
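
For comparison, this is roughly what a caller does when the size is not part of the creation call. The sketch assumes a kernel and headers that provide SYS_memfd_create in the two-argument (name, flags) form that was eventually merged, not the three-argument v2 form quoted above; memfd_with_size() and fd_size() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an empty memfd, then set its size with a separate ftruncate(). */
static int memfd_with_size(const char *name, off_t size)
{
	int fd = (int)syscall(SYS_memfd_create, name, 0U);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Report the current size of the object, or -1 on error. */
static off_t fd_size(int fd)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -1;
	return st.st_size;
}
```

The cost of dropping the size argument is the one extra ftruncate() call per object.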

> +	if (IS_ERR(shm)) {
> +		r = PTR_ERR(shm);
> +		goto err_fd;
> +	}
> +	info = SHMEM_I(file_inode(shm));
> +	shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +	if (flags & MFD_ALLOW_SEALING)
> +		info->seals |= SHMEM_ALLOW_SEALING;

In comments on 1/3 I suggest removing F_SEAL_SEAL instead here.

> +
> +	fd_install(fd, shm);
> +	kfree(name);
> +	return fd;
> +
> +err_fd:
> +	put_unused_fd(fd);
> +err_name:
> +	kfree(name);
> +	return r;
> +}
> +
>  #else /* !CONFIG_SHMEM */
>  
>  /*
> -- 
> 1.9.2

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
  2014-04-15 18:38   ` David Herrmann
@ 2014-05-20  2:22     ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-20  2:22 UTC (permalink / raw)
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirsky, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, john.stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:

> Some basic tests to verify sealing on memfds works as expected and
> guarantees the advertised semantics.

Thanks for providing these.

A few remarks below, and I should note one oddity.

Curious about leaks (probably none, I was merely curious), I tried to
run memfd_test 4096 times in succession, and never succeeded.  After
many iterations, the 32-bit one tended to hang somewhere just before
reaching the DONE, and the 64-bit one gave me some kind of assert
error from a library.

I expect there's some threading race around join_idle_thread():
which I think you will sort out infinitely sooner than I would.
No need to fix it right now: the test works well enough.

> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  tools/testing/selftests/Makefile           |   1 +
>  tools/testing/selftests/memfd/.gitignore   |   2 +
>  tools/testing/selftests/memfd/Makefile     |  29 +
>  tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
>  4 files changed, 976 insertions(+)
>  create mode 100644 tools/testing/selftests/memfd/.gitignore
>  create mode 100644 tools/testing/selftests/memfd/Makefile
>  create mode 100644 tools/testing/selftests/memfd/memfd_test.c
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 32487ed..c57325a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -2,6 +2,7 @@ TARGETS = breakpoints
>  TARGETS += cpu-hotplug
>  TARGETS += efivarfs
>  TARGETS += kcmp
> +TARGETS += memfd
>  TARGETS += memory-hotplug
>  TARGETS += mqueue
>  TARGETS += net
> diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
> new file mode 100644
> index 0000000..bcc8ee2
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/.gitignore
> @@ -0,0 +1,2 @@
> +memfd_test
> +memfd-test-file
> diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
> new file mode 100644
> index 0000000..36653b9
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/Makefile
> @@ -0,0 +1,29 @@
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),i386)
> +	ARCH := X86
> +endif
> +ifeq ($(ARCH),x86_64)
> +	ARCH := X86
> +endif
> +
> +CFLAGS += -I../../../../arch/x86/include/generated/uapi/
> +CFLAGS += -I../../../../arch/x86/include/uapi/
> +CFLAGS += -I../../../../include/uapi/
> +CFLAGS += -I../../../../include/
> +
> +all:
> +ifeq ($(ARCH),X86)
> +	gcc $(CFLAGS) memfd_test.c -o memfd_test
> +else
> +	echo "Not an x86 target, can't build memfd selftest"
> +endif
> +
> +run_tests: all
> +ifeq ($(ARCH),X86)
> +	gcc $(CFLAGS) memfd_test.c -o memfd_test
> +endif
> +	@./memfd_test || echo "memfd_test: [FAIL]"
> +
> +clean:
> +	$(RM) memfd_test
> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
> new file mode 100644
> index 0000000..3e105ea
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -0,0 +1,944 @@
> +#define _GNU_SOURCE
> +#define __EXPORTED_HEADERS__
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <linux/falloc.h>
> +#include <linux/fcntl.h>
> +#include <linux/memfd.h>
> +#include <sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#define MFD_DEF_SIZE 8192
> +#define STACK_SIZE 65535
> +
> +static int sys_memfd_create(const char *name,
> +			    __u64 size,
> +			    __u64 flags)
> +{
> +	return syscall(__NR_memfd_create, name, size, flags);
> +}
> +
> +static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
> +{
> +	int r;
> +
> +	r = sys_memfd_create(name, sz, flags);
> +	if (r < 0) {
> +		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
> +		       name,
> +		       (unsigned long long)sz,
> +		       (unsigned long long)flags);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
> +{
> +	int r;
> +
> +	r = sys_memfd_create(name, size, flags);
> +	if (r >= 0) {
> +		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",

scripts/checkpatch.pl complains about line-length: please ignore it on this.

> +		       name,
> +		       (unsigned long long)size,
> +		       (unsigned long long)flags);
> +		close(r);
> +		abort();
> +	}
> +}
> +
> +static __u64 mfd_assert_get_seals(int fd)
> +{
> +	long r;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r < 0) {
> +		printf("GET_SEALS(%d) failed: %m\n", fd);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_get_seals(int fd)
> +{
> +	long r;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r >= 0) {
> +		printf("GET_SEALS(%d) succeeded, but failure expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_has_seals(int fd, __u64 seals)
> +{
> +	__u64 s;
> +
> +	s = mfd_assert_get_seals(fd);
> +	if (s != seals) {
> +		printf("%llu != %llu = GET_SEALS(%d)\n",
> +		       (unsigned long long)seals, (unsigned long long)s, fd);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_add_seals(int fd, __u64 seals)
> +{
> +	long r;
> +	__u64 s;
> +
> +	s = mfd_assert_get_seals(fd);
> +	r = fcntl(fd, F_ADD_SEALS, seals);
> +	if (r < 0) {
> +		printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
> +		       fd, (unsigned long long)s, (unsigned long long)seals);
> +		abort();
> +	}
> +}
> +
> +static void mfd_fail_add_seals(int fd, __u64 seals)
> +{
> +	long r;
> +	__u64 s;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r < 0)
> +		s = 0;
> +	else
> +		s = r;
> +
> +	r = fcntl(fd, F_ADD_SEALS, seals);
> +	if (r >= 0) {
> +		printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
> +		       fd, (unsigned long long)s, (unsigned long long)seals);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_size(int fd, size_t size)
> +{
> +	struct stat st;
> +	int r;
> +
> +	r = fstat(fd, &st);
> +	if (r < 0) {
> +		printf("fstat(%d) failed: %m\n", fd);
> +		abort();
> +	} else if (st.st_size != size) {
> +		printf("wrong file size %lld, but expected %lld\n",
> +		       (long long)st.st_size, (long long)size);
> +		abort();
> +	}
> +}
> +
> +static int mfd_assert_dup(int fd)
> +{
> +	int r;
> +
> +	r = dup(fd);
> +	if (r < 0) {
> +		printf("dup(%d) failed: %m\n", fd);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void *mfd_assert_mmap_shared(int fd)
> +{
> +	void *p;
> +
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	return p;
> +}
> +
> +static void *mfd_assert_mmap_private(int fd)
> +{
> +	void *p;
> +
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	return p;
> +}
> +
> +static int mfd_assert_open(int fd, int flags, mode_t mode)
> +{
> +	char buf[512];
> +	int r;
> +
> +	sprintf(buf, "/proc/self/fd/%d", fd);
> +	r = open(buf, flags, mode);
> +	if (r < 0) {
> +		printf("open(%s) failed: %m\n", buf);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_open(int fd, int flags, mode_t mode)
> +{
> +	char buf[512];
> +	int r;
> +
> +	sprintf(buf, "/proc/self/fd/%d", fd);
> +	r = open(buf, flags, mode);
> +	if (r >= 0) {
> +		printf("open(%s) didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_read(int fd)
> +{
> +	char buf[16];
> +	void *p;
> +	ssize_t l;
> +
> +	l = read(fd, buf, sizeof(buf));
> +	if (l != sizeof(buf)) {
> +		printf("read() failed: %m\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ *is* allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify MAP_PRIVATE is *always* allowed (even writable) */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	munmap(p, MFD_DEF_SIZE);
> +}
> +
> +static void mfd_assert_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	int r;
> +
> +	/* verify write() succeeds */
> +	l = write(fd, "\0\0\0\0", 4);
> +	if (l != 4) {
> +		printf("write() failed: %m\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ | PROT_WRITE is allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	*(char*)p = 0;

scripts/checkpatch.pl complains about (char*): better calm it with (char *).
Same on two other lines below.

> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PROT_WRITE is allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	*(char*)p = 0;
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PROT_READ with MAP_SHARED is allowed and a following
> +	 * mprotect(PROT_WRITE) allows writing */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
> +	if (r < 0) {
> +		printf("mprotect() failed: %m\n");
> +		abort();
> +	}
> +
> +	*(char*)p = 0;
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PUNCH_HOLE works */
> +	r = fallocate(fd,
> +		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +		      0,
> +		      MFD_DEF_SIZE);
> +	if (r < 0) {
> +		printf("fallocate(PUNCH_HOLE) failed: %m\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_fail_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	int r;
> +
> +	/* verify write() fails */
> +	l = write(fd, "data", 4);
> +	if (l != -EPERM) {
> +		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
> +		abort();
> +	}
> +
> +	/* verify PROT_READ | PROT_WRITE is not allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_WRITE is not allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ with MAP_SHARED is not allowed */

This is a particularly interesting case, checking that PROT_READ with
MAP_SHARED is not allowed in mfd_fail_write().  It feels invidious to
ask for more of a comment, in a test which you have been generous to
provide at all.  But it stopped me short for a while: more comment
might help others too.

The reason being (right?) that this fd was opened O_RDWR, so a
MAP_SHARED mapping would permit a subsequent mprotect(,,PROT_WRITE),
which sealing the file against writes must prevent.

Your kernel checks rely on VM_SHARED and i_mmap_writable for this
protection: which is fine, but an implementation detail which could
be modified in future, if this case were ever to pose a difficulty.
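
The mprotect() upgrade that this check guards against can be demonstrated without sealing at all. The following userspace sketch uses an ordinary temporary file rather than a memfd (an assumption made so it runs on kernels without this patchset), and ro_shared_map_upgradable() is a hypothetical helper name. It maps the file PROT_READ, MAP_SHARED through an fd opened with the given flags, then tries to add PROT_WRITE.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a PROT_READ, MAP_SHARED mapping could be upgraded to
 * PROT_WRITE with mprotect(), 0 if the upgrade failed with EACCES,
 * and -1 on any other error.
 */
static int ro_shared_map_upgradable(int open_flags)
{
	char path[] = "/tmp/mprotect-demo-XXXXXX";
	long page = sysconf(_SC_PAGESIZE);
	int result = -1;
	int tmp, fd;
	void *p;

	tmp = mkstemp(path);
	if (tmp < 0)
		return -1;
	if (ftruncate(tmp, page) < 0)
		goto out_tmp;

	/* Second fd on the same file, opened with the flags under test. */
	fd = open(path, open_flags);
	if (fd < 0)
		goto out_tmp;

	p = mmap(NULL, page, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out_fd;

	if (mprotect(p, page, PROT_READ | PROT_WRITE) == 0)
		result = 1;
	else if (errno == EACCES)
		result = 0;

	munmap(p, page);
out_fd:
	close(fd);
out_tmp:
	close(tmp);
	unlink(path);
	return result;
}
```

With O_RDWR the upgrade succeeds, which is why a write-sealed file has to refuse even a read-only shared mapping through such an fd; with O_RDONLY the kernel rejects the upgrade with EACCES.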

> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PUNCH_HOLE fails */
> +	r = fallocate(fd,
> +		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +		      0,
> +		      MFD_DEF_SIZE);
> +	if (r >= 0) {
> +		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_shrink(int fd)
> +{
> +	int r, fd2;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE / 2);
> +	if (r < 0) {
> +		printf("ftruncate(SHRINK) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
> +
> +	fd2 = mfd_assert_open(fd,
> +			      O_RDWR | O_CREAT | O_TRUNC,
> +			      S_IRUSR | S_IWUSR);
> +	close(fd2);
> +
> +	mfd_assert_size(fd, 0);
> +}
> +
> +static void mfd_fail_shrink(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE / 2);
> +	if (r >= 0) {
> +		printf("ftruncate(SHRINK) didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	mfd_fail_open(fd,
> +		      O_RDWR | O_CREAT | O_TRUNC,
> +		      S_IRUSR | S_IWUSR);
> +}
> +
> +static void mfd_assert_grow(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE * 2);
> +	if (r < 0) {
> +		printf("ftruncate(GROW) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
> +
> +	r = fallocate(fd,
> +		      0,
> +		      0,
> +		      MFD_DEF_SIZE * 4);
> +	if (r < 0) {
> +		printf("fallocate(ALLOC) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
> +}
> +
> +static void mfd_fail_grow(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE * 2);
> +	if (r >= 0) {
> +		printf("ftruncate(GROW) didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	r = fallocate(fd,
> +		      0,
> +		      0,
> +		      MFD_DEF_SIZE * 4);
> +	if (r >= 0) {
> +		printf("fallocate(ALLOC) didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_grow_write(int fd)
> +{
> +	static char buf[MFD_DEF_SIZE * 8];
> +	ssize_t l;
> +
> +	l = pwrite(fd, buf, sizeof(buf), 0);
> +	if (l != sizeof(buf)) {
> +		printf("pwrite() failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
> +}
> +
> +static void mfd_fail_grow_write(int fd)
> +{
> +	static char buf[MFD_DEF_SIZE * 8];
> +	ssize_t l;
> +
> +	l = pwrite(fd, buf, sizeof(buf), 0);
> +	if (l == sizeof(buf)) {
> +		printf("pwrite() didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static int idle_thread_fn(void *arg)
> +{
> +	sigset_t set;
> +	int sig;
> +
> +	/* dummy waiter; SIGTERM terminates us anyway */
> +	sigemptyset(&set);
> +	sigaddset(&set, SIGTERM);
> +	sigwait(&set, &sig);
> +
> +	return 0;
> +}
> +
> +static pid_t spawn_idle_thread(void)
> +{
> +	uint8_t *stack;
> +	pid_t pid;
> +
> +	stack = malloc(STACK_SIZE);
> +	if (!stack) {
> +		printf("malloc(STACK_SIZE) failed: %m\n");
> +		abort();
> +	}
> +
> +	pid = clone(idle_thread_fn,
> +		    stack + STACK_SIZE,
> +		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
> +		    NULL);
> +	if (pid < 0) {
> +		printf("clone() failed: %m\n");
> +		abort();
> +	}
> +
> +	return pid;
> +}
> +
> +static void join_idle_thread(pid_t pid)
> +{
> +	kill(pid, SIGTERM);
> +	waitpid(pid, NULL, 0);
> +}
> +
> +static pid_t spawn_idle_proc(void)
> +{
> +	pid_t pid;
> +	sigset_t set;
> +	int sig;
> +
> +	pid = fork();
> +	if (pid < 0) {
> +		printf("fork() failed: %m\n");
> +		abort();
> +	} else if (!pid) {
> +		/* dummy waiter; SIGTERM terminates us anyway */
> +		sigemptyset(&set);
> +		sigaddset(&set, SIGTERM);
> +		sigwait(&set, &sig);
> +		exit(0);
> +	}
> +
> +	return pid;
> +}
> +
> +static void join_idle_proc(pid_t pid)
> +{
> +	kill(pid, SIGTERM);
> +	waitpid(pid, NULL, 0);
> +}
> +
> +/*
> + * Test memfd_create() syscall
> + * Verify syscall-argument validation, including name checks, flag validation
> + * and more.
> + */
> +static void test_create(void)
> +{
> +	char buf[2048];
> +	int fd;
> +
> +	/* test NULL name */
> +	mfd_fail_new(NULL, 0, 0);
> +
> +	/* test over-long name (not zero-terminated) */
> +	memset(buf, 0xff, sizeof(buf));
> +	mfd_fail_new(buf, 0, 0);
> +
> +	/* test over-long zero-terminated name */
> +	memset(buf, 0xff, sizeof(buf));
> +	buf[sizeof(buf) - 1] = 0;
> +	mfd_fail_new(buf, 0, 0);
> +
> +	/* verify "" is a valid name */
> +	fd = mfd_assert_new("", 0, 0);
> +	close(fd);
> +
> +	/* verify invalid O_* open flags */
> +	mfd_fail_new("", 0, 0x0100);
> +	mfd_fail_new("", 0, ~MFD_CLOEXEC);
> +	mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
> +	mfd_fail_new("", 0, ~0);
> +	mfd_fail_new("", 0, 0x8000000000000000ULL);
> +
> +	/* verify MFD_CLOEXEC is allowed */
> +	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
> +	close(fd);
> +
> +	/* verify MFD_ALLOW_SEALING is allowed */
> +	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
> +	close(fd);
> +
> +	/* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
> +	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
> +	close(fd);
> +}
> +
> +/*
> + * Test basic sealing
> + * A very basic sealing test to see whether setting/retrieving seals works.
> + */
> +static void test_basic(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_basic",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +
> +	/* add basic seals */
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +
> +	/* add them again */
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +
> +	/* add more seals and seal against sealing */
> +	mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_GROW |
> +				 F_SEAL_WRITE |
> +				 F_SEAL_SEAL);
> +
> +	/* verify that sealing no longer works */
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd, 0);
> +
> +	close(fd);
> +
> +	/* verify sealing does not work without MFD_ALLOW_SEALING */
> +	fd = mfd_assert_new("kern_memfd_basic",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC);
> +	mfd_fail_get_seals(fd);
> +	mfd_fail_add_seals(fd, F_SEAL_SHRINK |
> +			       F_SEAL_GROW |
> +			       F_SEAL_WRITE);
> +	mfd_fail_get_seals(fd);
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_WRITE
> + * Test whether SEAL_WRITE actually prevents modifications.
> + */
> +static void test_seal_write(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_write",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +
> +	mfd_assert_read(fd);
> +	mfd_fail_write(fd);
> +	mfd_assert_shrink(fd);
> +	mfd_assert_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_SHRINK
> + * Test whether SEAL_SHRINK actually prevents shrinking
> + */
> +static void test_seal_shrink(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_shrink",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_fail_shrink(fd);
> +	mfd_assert_grow(fd);
> +	mfd_assert_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_GROW
> + * Test whether SEAL_GROW actually prevents growing
> + */
> +static void test_seal_grow(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_grow",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_GROW);
> +	mfd_assert_has_seals(fd, F_SEAL_GROW);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_assert_shrink(fd);
> +	mfd_fail_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_SHRINK | SEAL_GROW
> + * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
> + */
> +static void test_seal_resize(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_resize",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_fail_shrink(fd);
> +	mfd_fail_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test sharing via dup()
> + * Test that seals are shared between dupped FDs and they're all equal.
> + */
> +static void test_share_dup(void)
> +{
> +	int fd, fd2;
> +
> +	fd = mfd_assert_new("kern_memfd_share_dup",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	fd2 = mfd_assert_dup(fd);
> +	mfd_assert_has_seals(fd2, 0);
> +
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	mfd_assert_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd2, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd, F_SEAL_SEAL);
> +	mfd_fail_add_seals(fd2, F_SEAL_SEAL);
> +
> +	close(fd2);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	close(fd);
> +}
> +
> +/*
> + * Test sealing with active mmap()s
> + * Modifying seals is only allowed if no other mmap() refs exist.
> + */
> +static void test_share_mmap(void)
> +{
> +	int fd;
> +	void *p;
> +
> +	fd = mfd_assert_new("kern_memfd_share_mmap",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	/* shared/writable ref prevents sealing */
> +	p = mfd_assert_mmap_shared(fd);
> +	mfd_fail_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, 0);
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* readable ref allows sealing */
> +	p = mfd_assert_mmap_private(fd);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test sealing with open(/proc/self/fd/%d)
> + * Via /proc we can get access to a separate file-context for the same memfd.
> + * This is *not* like dup(), but like a real separate open(). Make sure the
> + * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
> + */
> +static void test_share_open(void)
> +{
> +	int fd, fd2;
> +
> +	fd = mfd_assert_new("kern_memfd_share_open",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	fd2 = mfd_assert_open(fd, O_RDWR, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	close(fd);
> +	fd = mfd_assert_open(fd2, O_RDONLY, 0);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	close(fd2);
> +	fd2 = mfd_assert_open(fd, O_RDWR, 0);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +
> +	close(fd2);
> +	close(fd);
> +}
> +
> +/*
> + * Test sharing via fork()
> + * Test whether seal-modifications work as expected with forked childs.
> + */
> +static void test_share_fork(void)
> +{
> +	int fd;
> +	pid_t pid;
> +
> +	fd = mfd_assert_new("kern_memfd_share_fork",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	pid = spawn_idle_proc();
> +	mfd_assert_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	join_idle_proc(pid);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	close(fd);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	pid_t pid;
> +
> +	printf("memfd: CREATE\n");
> +	test_create();
> +	printf("memfd: BASIC\n");
> +	test_basic();
> +
> +	printf("memfd: SEAL-WRITE\n");
> +	test_seal_write();
> +	printf("memfd: SEAL-SHRINK\n");
> +	test_seal_shrink();
> +	printf("memfd: SEAL-GROW\n");
> +	test_seal_grow();
> +	printf("memfd: SEAL-RESIZE\n");
> +	test_seal_resize();
> +
> +	printf("memfd: SHARE-DUP\n");
> +	test_share_dup();
> +	printf("memfd: SHARE-MMAP\n");
> +	test_share_mmap();
> +	printf("memfd: SHARE-OPEN\n");
> +	test_share_open();
> +	printf("memfd: SHARE-FORK\n");
> +	test_share_fork();
> +
> +	/* Run test-suite in a multi-threaded environment with a shared
> +	 * file-table. */
> +	pid = spawn_idle_thread();
> +	printf("memfd: SHARE-DUP (shared file-table)\n");
> +	test_share_dup();
> +	printf("memfd: SHARE-MMAP (shared file-table)\n");
> +	test_share_mmap();
> +	printf("memfd: SHARE-OPEN (shared file-table)\n");
> +	test_share_open();
> +	printf("memfd: SHARE-FORK (shared file-table)\n");
> +	test_share_fork();
> +	join_idle_thread(pid);
> +
> +	printf("memfd: DONE\n");
> +
> +	return 0;
> +}
> -- 
> 1.9.2

* Re: [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
@ 2014-05-20  2:22     ` Hugh Dickins
From: Hugh Dickins @ 2014-05-20  2:22 UTC
  To: David Herrmann
  Cc: Tony Battersby, Andy Lutomirsky, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, john.stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Tue, 15 Apr 2014, David Herrmann wrote:

> Some basic tests to verify sealing on memfds works as expected and
> guarantees the advertised semantics.

Thanks for providing these.

A few remarks below, and I should note one oddity.

Curious about leaks (probably none, I was merely curious), I tried to
run memfd_test 4096 times in succession, and it never got through all
of them.  After many iterations, the 32-bit build tends to hang
somewhere just before reaching DONE, and the 64-bit build gave me some
kind of assertion failure from a library.

I expect there's some threading race around join_idle_thread(),
which I think you will sort out infinitely sooner than I would.
No need to fix it right now: the test works well enough.
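
Whether signal delivery has anything to do with the hang is not established
in this thread. Purely as a debugging aid under that assumption, here is a
minimal sketch of an idle task that is woken through a pipe instead of
SIGTERM, so that spawn and join no longer depend on signal semantics. The
names idle_task, idle_task_spawn() and idle_task_join(), and the 65536-byte
stack size, are hypothetical and not part of the patch; the clone() flags
are the ones the test already uses.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical size; a multiple of 16 keeps the stack top aligned. */
#define IDLE_STACK_SIZE 65536

/* Hypothetical bookkeeping for one idle task; not part of the patch. */
struct idle_task {
	pid_t pid;
	int pipe_fd[2];	/* [0] is read by the task, [1] written by the parent */
	void *stack;
};

static int idle_task_fn(void *arg)
{
	struct idle_task *t = arg;
	char c;

	/* Sleep until the parent writes one byte into the pipe. */
	if (read(t->pipe_fd[0], &c, 1) != 1)
		return 1;

	return 0;
}

/* Returns 0 on success, -1 on failure. */
static int idle_task_spawn(struct idle_task *t)
{
	if (pipe(t->pipe_fd) < 0)
		return -1;

	t->stack = malloc(IDLE_STACK_SIZE);
	if (!t->stack)
		goto err_pipe;

	/* Same flags as the test: shared file-table, fs info and memory. */
	t->pid = clone(idle_task_fn,
		       (char *)t->stack + IDLE_STACK_SIZE,
		       CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
		       t);
	if (t->pid < 0)
		goto err_stack;

	return 0;

err_stack:
	free(t->stack);
err_pipe:
	close(t->pipe_fd[0]);
	close(t->pipe_fd[1]);
	return -1;
}

/* Wakes the task, reaps it, and returns its exit code (or -1). */
static int idle_task_join(struct idle_task *t)
{
	int status;

	/* If the task cannot be woken or reaped, leave its stack alone. */
	if (write(t->pipe_fd[1], "x", 1) != 1)
		return -1;
	if (waitpid(t->pid, &status, 0) != t->pid)
		return -1;

	close(t->pipe_fd[0]);
	close(t->pipe_fd[1]);
	free(t->stack);	/* safe only now that the task has been reaped */

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pair idle_task_spawn(&t) with idle_task_join(&t) around the
shared file-table tests. If the hang persists with this variant, signal
handling can be ruled out as the cause.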

> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---
>  tools/testing/selftests/Makefile           |   1 +
>  tools/testing/selftests/memfd/.gitignore   |   2 +
>  tools/testing/selftests/memfd/Makefile     |  29 +
>  tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
>  4 files changed, 976 insertions(+)
>  create mode 100644 tools/testing/selftests/memfd/.gitignore
>  create mode 100644 tools/testing/selftests/memfd/Makefile
>  create mode 100644 tools/testing/selftests/memfd/memfd_test.c
> 
> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
> index 32487ed..c57325a 100644
> --- a/tools/testing/selftests/Makefile
> +++ b/tools/testing/selftests/Makefile
> @@ -2,6 +2,7 @@ TARGETS = breakpoints
>  TARGETS += cpu-hotplug
>  TARGETS += efivarfs
>  TARGETS += kcmp
> +TARGETS += memfd
>  TARGETS += memory-hotplug
>  TARGETS += mqueue
>  TARGETS += net
> diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
> new file mode 100644
> index 0000000..bcc8ee2
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/.gitignore
> @@ -0,0 +1,2 @@
> +memfd_test
> +memfd-test-file
> diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
> new file mode 100644
> index 0000000..36653b9
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/Makefile
> @@ -0,0 +1,29 @@
> +uname_M := $(shell uname -m 2>/dev/null || echo not)
> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
> +ifeq ($(ARCH),i386)
> +	ARCH := X86
> +endif
> +ifeq ($(ARCH),x86_64)
> +	ARCH := X86
> +endif
> +
> +CFLAGS += -I../../../../arch/x86/include/generated/uapi/
> +CFLAGS += -I../../../../arch/x86/include/uapi/
> +CFLAGS += -I../../../../include/uapi/
> +CFLAGS += -I../../../../include/
> +
> +all:
> +ifeq ($(ARCH),X86)
> +	gcc $(CFLAGS) memfd_test.c -o memfd_test
> +else
> +	echo "Not an x86 target, can't build memfd selftest"
> +endif
> +
> +run_tests: all
> +ifeq ($(ARCH),X86)
> +	gcc $(CFLAGS) memfd_test.c -o memfd_test
> +endif
> +	@./memfd_test || echo "memfd_test: [FAIL]"
> +
> +clean:
> +	$(RM) memfd_test
> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
> new file mode 100644
> index 0000000..3e105ea
> --- /dev/null
> +++ b/tools/testing/selftests/memfd/memfd_test.c
> @@ -0,0 +1,944 @@
> +#define _GNU_SOURCE
> +#define __EXPORTED_HEADERS__
> +
> +#include <errno.h>
> +#include <inttypes.h>
> +#include <limits.h>
> +#include <linux/falloc.h>
> +#include <linux/fcntl.h>
> +#include <linux/memfd.h>
> +#include <sched.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <signal.h>
> +#include <string.h>
> +#include <sys/mman.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +#define MFD_DEF_SIZE 8192
> +#define STACK_SIZE 65536
> +
> +static int sys_memfd_create(const char *name,
> +			    __u64 size,
> +			    __u64 flags)
> +{
> +	return syscall(__NR_memfd_create, name, size, flags);
> +}
> +
> +static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
> +{
> +	int r;
> +
> +	r = sys_memfd_create(name, sz, flags);
> +	if (r < 0) {
> +		printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
> +		       name,
> +		       (unsigned long long)sz,
> +		       (unsigned long long)flags);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
> +{
> +	int r;
> +
> +	r = sys_memfd_create(name, size, flags);
> +	if (r >= 0) {
> +		printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",

scripts/checkpatch.pl complains about line-length: please ignore it on this.

> +		       name,
> +		       (unsigned long long)size,
> +		       (unsigned long long)flags);
> +		close(r);
> +		abort();
> +	}
> +}
> +
> +static __u64 mfd_assert_get_seals(int fd)
> +{
> +	long r;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r < 0) {
> +		printf("GET_SEALS(%d) failed: %m\n", fd);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_get_seals(int fd)
> +{
> +	long r;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r >= 0) {
> +		printf("GET_SEALS(%d) succeeded, but failure expected\n", fd);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_has_seals(int fd, __u64 seals)
> +{
> +	__u64 s;
> +
> +	s = mfd_assert_get_seals(fd);
> +	if (s != seals) {
> +		printf("%llu != %llu = GET_SEALS(%d)\n",
> +		       (unsigned long long)seals, (unsigned long long)s, fd);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_add_seals(int fd, __u64 seals)
> +{
> +	long r;
> +	__u64 s;
> +
> +	s = mfd_assert_get_seals(fd);
> +	r = fcntl(fd, F_ADD_SEALS, seals);
> +	if (r < 0) {
> +		printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
> +		       fd, (unsigned long long)s, (unsigned long long)seals);
> +		abort();
> +	}
> +}
> +
> +static void mfd_fail_add_seals(int fd, __u64 seals)
> +{
> +	long r;
> +	__u64 s;
> +
> +	r = fcntl(fd, F_GET_SEALS);
> +	if (r < 0)
> +		s = 0;
> +	else
> +		s = r;
> +
> +	r = fcntl(fd, F_ADD_SEALS, seals);
> +	if (r >= 0) {
> +		printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
> +		       fd, (unsigned long long)s, (unsigned long long)seals);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_size(int fd, size_t size)
> +{
> +	struct stat st;
> +	int r;
> +
> +	r = fstat(fd, &st);
> +	if (r < 0) {
> +		printf("fstat(%d) failed: %m\n", fd);
> +		abort();
> +	} else if (st.st_size != size) {
> +		printf("wrong file size %lld, but expected %lld\n",
> +		       (long long)st.st_size, (long long)size);
> +		abort();
> +	}
> +}
> +
> +static int mfd_assert_dup(int fd)
> +{
> +	int r;
> +
> +	r = dup(fd);
> +	if (r < 0) {
> +		printf("dup(%d) failed: %m\n", fd);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void *mfd_assert_mmap_shared(int fd)
> +{
> +	void *p;
> +
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	return p;
> +}
> +
> +static void *mfd_assert_mmap_private(int fd)
> +{
> +	void *p;
> +
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	return p;
> +}
> +
> +static int mfd_assert_open(int fd, int flags, mode_t mode)
> +{
> +	char buf[512];
> +	int r;
> +
> +	sprintf(buf, "/proc/self/fd/%d", fd);
> +	r = open(buf, flags, mode);
> +	if (r < 0) {
> +		printf("open(%s) failed: %m\n", buf);
> +		abort();
> +	}
> +
> +	return r;
> +}
> +
> +static void mfd_fail_open(int fd, int flags, mode_t mode)
> +{
> +	char buf[512];
> +	int r;
> +
> +	sprintf(buf, "/proc/self/fd/%d", fd);
> +	r = open(buf, flags, mode);
> +	if (r >= 0) {
> +		printf("open(%s) didn't fail as expected\n", buf);
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_read(int fd)
> +{
> +	char buf[16];
> +	void *p;
> +	ssize_t l;
> +
> +	l = read(fd, buf, sizeof(buf));
> +	if (l != sizeof(buf)) {
> +		printf("read() failed: %m\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ *is* allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify MAP_PRIVATE is *always* allowed (even writable) */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_PRIVATE,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	munmap(p, MFD_DEF_SIZE);
> +}
> +
> +static void mfd_assert_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	int r;
> +
> +	/* verify write() succeeds */
> +	l = write(fd, "\0\0\0\0", 4);
> +	if (l != 4) {
> +		printf("write() failed: %m\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ | PROT_WRITE is allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	*(char*)p = 0;

scripts/checkpatch.pl complains about (char*): better calm it with (char *).
Same on two other lines below.

> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PROT_WRITE is allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +	*(char*)p = 0;
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PROT_READ with MAP_SHARED is allowed and a following
> +	 * mprotect(PROT_WRITE) allows writing */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p == MAP_FAILED) {
> +		printf("mmap() failed: %m\n");
> +		abort();
> +	}
> +
> +	r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
> +	if (r < 0) {
> +		printf("mprotect() failed: %m\n");
> +		abort();
> +	}
> +
> +	*(char*)p = 0;
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* verify PUNCH_HOLE works */
> +	r = fallocate(fd,
> +		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +		      0,
> +		      MFD_DEF_SIZE);
> +	if (r < 0) {
> +		printf("fallocate(PUNCH_HOLE) failed: %m\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_fail_write(int fd)
> +{
> +	ssize_t l;
> +	void *p;
> +	int r;
> +
> +	/* verify write() fails */
> +	l = write(fd, "data", 4);
> +	if (l >= 0 || errno != EPERM) {
> +		printf("expected EPERM on write(), but got %d: %m\n", (int)l);
> +		abort();
> +	}
> +
> +	/* verify PROT_READ | PROT_WRITE is not allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ | PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_WRITE is not allowed */
> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_WRITE,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PROT_READ with MAP_SHARED is not allowed */

This is a particularly interesting case: checking that a PROT_READ,
MAP_SHARED mapping is not allowed in mfd_fail_write().  It feels
invidious to ask for more of a comment, in a test which you have been
generous to provide at all.  But it stopped me short for a while: more
comment might help others too.

The reason being (right?) that this fd was opened O_RDWR, so a
MAP_SHARED mapping would permit a subsequent mprotect(,,PROT_WRITE),
which sealing the file against writes must prevent.

Your kernel checks rely on VM_SHARED and i_mmap_writable for this
protection: which is fine, but an implementation detail which could
be modified in future, if this case were ever to pose a difficulty.
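
The O_RDWR point can be seen without memfd at all. What follows is a
minimal sketch, assuming an ordinary temporary file under /tmp; the helper
name shared_ro_map_can_upgrade() is made up for illustration and is not
part of the patch. It maps the file PROT_READ, MAP_SHARED through a
descriptor opened with the given flags and reports whether a later
mprotect() to PROT_WRITE is accepted.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_LEN 4096

/*
 * Hypothetical helper, for illustration only.
 * Returns 1 if a PROT_READ, MAP_SHARED mapping made through a descriptor
 * opened with open_flags can be upgraded to PROT_WRITE via mprotect(),
 * 0 if the kernel refuses the upgrade, and -1 if the setup itself fails.
 */
static int shared_ro_map_can_upgrade(int open_flags)
{
	char path[] = "/tmp/shared_map_demo_XXXXXX";
	int tmp, fd, r;
	void *p;

	/* Create a scratch file that is large enough to map. */
	tmp = mkstemp(path);
	if (tmp < 0)
		return -1;
	if (ftruncate(tmp, MAP_LEN) < 0) {
		close(tmp);
		unlink(path);
		return -1;
	}

	/* Re-open it with the access mode under test. */
	fd = open(path, open_flags);
	unlink(path);
	close(tmp);
	if (fd < 0)
		return -1;

	/* The mapping itself only asks for read access. */
	p = mmap(NULL, MAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return -1;

	/* Now try to gain write access after the fact. */
	r = mprotect(p, MAP_LEN, PROT_READ | PROT_WRITE);
	munmap(p, MAP_LEN);

	return r == 0;
}
```

Called with O_RDWR the upgrade should be accepted, and with O_RDONLY it
should be refused. That later upgrade through a writable descriptor is what
the PROT_READ, MAP_SHARED check in mfd_fail_write() guards against.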

> +	p = mmap(NULL,
> +		 MFD_DEF_SIZE,
> +		 PROT_READ,
> +		 MAP_SHARED,
> +		 fd,
> +		 0);
> +	if (p != MAP_FAILED) {
> +		printf("mmap() didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	/* verify PUNCH_HOLE fails */
> +	r = fallocate(fd,
> +		      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +		      0,
> +		      MFD_DEF_SIZE);
> +	if (r >= 0) {
> +		printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_shrink(int fd)
> +{
> +	int r, fd2;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE / 2);
> +	if (r < 0) {
> +		printf("ftruncate(SHRINK) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE / 2);
> +
> +	fd2 = mfd_assert_open(fd,
> +			      O_RDWR | O_CREAT | O_TRUNC,
> +			      S_IRUSR | S_IWUSR);
> +	close(fd2);
> +
> +	mfd_assert_size(fd, 0);
> +}
> +
> +static void mfd_fail_shrink(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE / 2);
> +	if (r >= 0) {
> +		printf("ftruncate(SHRINK) didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	mfd_fail_open(fd,
> +		      O_RDWR | O_CREAT | O_TRUNC,
> +		      S_IRUSR | S_IWUSR);
> +}
> +
> +static void mfd_assert_grow(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE * 2);
> +	if (r < 0) {
> +		printf("ftruncate(GROW) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 2);
> +
> +	r = fallocate(fd,
> +		      0,
> +		      0,
> +		      MFD_DEF_SIZE * 4);
> +	if (r < 0) {
> +		printf("fallocate(ALLOC) failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 4);
> +}
> +
> +static void mfd_fail_grow(int fd)
> +{
> +	int r;
> +
> +	r = ftruncate(fd, MFD_DEF_SIZE * 2);
> +	if (r >= 0) {
> +		printf("ftruncate(GROW) didn't fail as expected\n");
> +		abort();
> +	}
> +
> +	r = fallocate(fd,
> +		      0,
> +		      0,
> +		      MFD_DEF_SIZE * 4);
> +	if (r >= 0) {
> +		printf("fallocate(ALLOC) didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static void mfd_assert_grow_write(int fd)
> +{
> +	static char buf[MFD_DEF_SIZE * 8];
> +	ssize_t l;
> +
> +	l = pwrite(fd, buf, sizeof(buf), 0);
> +	if (l != sizeof(buf)) {
> +		printf("pwrite() failed: %m\n");
> +		abort();
> +	}
> +
> +	mfd_assert_size(fd, MFD_DEF_SIZE * 8);
> +}
> +
> +static void mfd_fail_grow_write(int fd)
> +{
> +	static char buf[MFD_DEF_SIZE * 8];
> +	ssize_t l;
> +
> +	l = pwrite(fd, buf, sizeof(buf), 0);
> +	if (l == sizeof(buf)) {
> +		printf("pwrite() didn't fail as expected\n");
> +		abort();
> +	}
> +}
> +
> +static int idle_thread_fn(void *arg)
> +{
> +	sigset_t set;
> +	int sig;
> +
> +	/* dummy waiter; SIGTERM terminates us anyway */
> +	sigemptyset(&set);
> +	sigaddset(&set, SIGTERM);
> +	sigwait(&set, &sig);
> +
> +	return 0;
> +}
> +
> +static pid_t spawn_idle_thread(void)
> +{
> +	uint8_t *stack;
> +	pid_t pid;
> +
> +	stack = malloc(STACK_SIZE);
> +	if (!stack) {
> +		printf("malloc(STACK_SIZE) failed: %m\n");
> +		abort();
> +	}
> +
> +	pid = clone(idle_thread_fn,
> +		    stack + STACK_SIZE,
> +		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
> +		    NULL);
> +	if (pid < 0) {
> +		printf("clone() failed: %m\n");
> +		abort();
> +	}
> +
> +	return pid;
> +}
> +
> +static void join_idle_thread(pid_t pid)
> +{
> +	kill(pid, SIGTERM);
> +	waitpid(pid, NULL, 0);
> +}
> +
> +static pid_t spawn_idle_proc(void)
> +{
> +	pid_t pid;
> +	sigset_t set;
> +	int sig;
> +
> +	pid = fork();
> +	if (pid < 0) {
> +		printf("fork() failed: %m\n");
> +		abort();
> +	} else if (!pid) {
> +		/* dummy waiter; SIGTERM terminates us anyway */
> +		sigemptyset(&set);
> +		sigaddset(&set, SIGTERM);
> +		sigwait(&set, &sig);
> +		exit(0);
> +	}
> +
> +	return pid;
> +}
> +
> +static void join_idle_proc(pid_t pid)
> +{
> +	kill(pid, SIGTERM);
> +	waitpid(pid, NULL, 0);
> +}
> +
> +/*
> + * Test memfd_create() syscall
> + * Verify syscall-argument validation, including name checks, flag validation
> + * and more.
> + */
> +static void test_create(void)
> +{
> +	char buf[2048];
> +	int fd;
> +
> +	/* test NULL name */
> +	mfd_fail_new(NULL, 0, 0);
> +
> +	/* test over-long name (not zero-terminated) */
> +	memset(buf, 0xff, sizeof(buf));
> +	mfd_fail_new(buf, 0, 0);
> +
> +	/* test over-long zero-terminated name */
> +	memset(buf, 0xff, sizeof(buf));
> +	buf[sizeof(buf) - 1] = 0;
> +	mfd_fail_new(buf, 0, 0);
> +
> +	/* verify "" is a valid name */
> +	fd = mfd_assert_new("", 0, 0);
> +	close(fd);
> +
> +	/* verify invalid O_* open flags */
> +	mfd_fail_new("", 0, 0x0100);
> +	mfd_fail_new("", 0, ~MFD_CLOEXEC);
> +	mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
> +	mfd_fail_new("", 0, ~0);
> +	mfd_fail_new("", 0, 0x8000000000000000ULL);
> +
> +	/* verify MFD_CLOEXEC is allowed */
> +	fd = mfd_assert_new("", 0, MFD_CLOEXEC);
> +	close(fd);
> +
> +	/* verify MFD_ALLOW_SEALING is allowed */
> +	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
> +	close(fd);
> +
> +	/* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
> +	fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
> +	close(fd);
> +}
> +
> +/*
> + * Test basic sealing
> + * A very basic sealing test to see whether setting/retrieving seals works.
> + */
> +static void test_basic(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_basic",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +
> +	/* add basic seals */
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +
> +	/* add them again */
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_WRITE);
> +
> +	/* add more seals and seal against sealing */
> +	mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK |
> +				 F_SEAL_GROW |
> +				 F_SEAL_WRITE |
> +				 F_SEAL_SEAL);
> +
> +	/* verify that sealing no longer works */
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd, 0);
> +
> +	close(fd);
> +
> +	/* verify sealing does not work without MFD_ALLOW_SEALING */
> +	fd = mfd_assert_new("kern_memfd_basic",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC);
> +	mfd_fail_get_seals(fd);
> +	mfd_fail_add_seals(fd, F_SEAL_SHRINK |
> +			       F_SEAL_GROW |
> +			       F_SEAL_WRITE);
> +	mfd_fail_get_seals(fd);
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_WRITE
> + * Test whether SEAL_WRITE actually prevents modifications.
> + */
> +static void test_seal_write(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_write",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +
> +	mfd_assert_read(fd);
> +	mfd_fail_write(fd);
> +	mfd_assert_shrink(fd);
> +	mfd_assert_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_SHRINK
> + * Test whether SEAL_SHRINK actually prevents shrinking
> + */
> +static void test_seal_shrink(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_shrink",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_fail_shrink(fd);
> +	mfd_assert_grow(fd);
> +	mfd_assert_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_GROW
> + * Test whether SEAL_GROW actually prevents growing
> + */
> +static void test_seal_grow(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_grow",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_GROW);
> +	mfd_assert_has_seals(fd, F_SEAL_GROW);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_assert_shrink(fd);
> +	mfd_fail_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test SEAL_SHRINK | SEAL_GROW
> + * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
> + */
> +static void test_seal_resize(void)
> +{
> +	int fd;
> +
> +	fd = mfd_assert_new("kern_memfd_seal_resize",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
> +
> +	mfd_assert_read(fd);
> +	mfd_assert_write(fd);
> +	mfd_fail_shrink(fd);
> +	mfd_fail_grow(fd);
> +	mfd_fail_grow_write(fd);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test sharing via dup()
> + * Test that seals are shared between dupped FDs and they're all equal.
> + */
> +static void test_share_dup(void)
> +{
> +	int fd, fd2;
> +
> +	fd = mfd_assert_new("kern_memfd_share_dup",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	fd2 = mfd_assert_dup(fd);
> +	mfd_assert_has_seals(fd2, 0);
> +
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	mfd_assert_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd2, F_SEAL_GROW);
> +	mfd_fail_add_seals(fd, F_SEAL_SEAL);
> +	mfd_fail_add_seals(fd2, F_SEAL_SEAL);
> +
> +	close(fd2);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_GROW);
> +	close(fd);
> +}
> +
> +/*
> + * Test sealing with active mmap()s
> + * Modifying seals is only allowed if no other mmap() refs exist.
> + */
> +static void test_share_mmap(void)
> +{
> +	int fd;
> +	void *p;
> +
> +	fd = mfd_assert_new("kern_memfd_share_mmap",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	/* shared/writable ref prevents sealing */
> +	p = mfd_assert_mmap_shared(fd);
> +	mfd_fail_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, 0);
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	/* readable ref allows sealing */
> +	p = mfd_assert_mmap_private(fd);
> +	mfd_assert_add_seals(fd, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_SHRINK);
> +	munmap(p, MFD_DEF_SIZE);
> +
> +	close(fd);
> +}
> +
> +/*
> + * Test sealing with open(/proc/self/fd/%d)
> + * Via /proc we can get access to a separate file-context for the same memfd.
> + * This is *not* like dup(), but like a real separate open(). Make sure the
> + * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
> + */
> +static void test_share_open(void)
> +{
> +	int fd, fd2;
> +
> +	fd = mfd_assert_new("kern_memfd_share_open",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	fd2 = mfd_assert_open(fd, O_RDWR, 0);
> +	mfd_assert_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	close(fd);
> +	fd = mfd_assert_open(fd2, O_RDONLY, 0);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
> +
> +	close(fd2);
> +	fd2 = mfd_assert_open(fd, O_RDWR, 0);
> +
> +	mfd_assert_add_seals(fd2, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
> +
> +	close(fd2);
> +	close(fd);
> +}
> +
> +/*
> + * Test sharing via fork()
> + * Test whether seal-modifications work as expected with forked childs.
> + */
> +static void test_share_fork(void)
> +{
> +	int fd;
> +	pid_t pid;
> +
> +	fd = mfd_assert_new("kern_memfd_share_fork",
> +			    MFD_DEF_SIZE,
> +			    MFD_CLOEXEC | MFD_ALLOW_SEALING);
> +	mfd_assert_has_seals(fd, 0);
> +
> +	pid = spawn_idle_proc();
> +	mfd_assert_add_seals(fd, F_SEAL_SEAL);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	join_idle_proc(pid);
> +
> +	mfd_fail_add_seals(fd, F_SEAL_WRITE);
> +	mfd_assert_has_seals(fd, F_SEAL_SEAL);
> +
> +	close(fd);
> +}
> +
> +int main(int argc, char **argv)
> +{
> +	pid_t pid;
> +
> +	printf("memfd: CREATE\n");
> +	test_create();
> +	printf("memfd: BASIC\n");
> +	test_basic();
> +
> +	printf("memfd: SEAL-WRITE\n");
> +	test_seal_write();
> +	printf("memfd: SEAL-SHRINK\n");
> +	test_seal_shrink();
> +	printf("memfd: SEAL-GROW\n");
> +	test_seal_grow();
> +	printf("memfd: SEAL-RESIZE\n");
> +	test_seal_resize();
> +
> +	printf("memfd: SHARE-DUP\n");
> +	test_share_dup();
> +	printf("memfd: SHARE-MMAP\n");
> +	test_share_mmap();
> +	printf("memfd: SHARE-OPEN\n");
> +	test_share_open();
> +	printf("memfd: SHARE-FORK\n");
> +	test_share_fork();
> +
> +	/* Run test-suite in a multi-threaded environment with a shared
> +	 * file-table. */
> +	pid = spawn_idle_thread();
> +	printf("memfd: SHARE-DUP (shared file-table)\n");
> +	test_share_dup();
> +	printf("memfd: SHARE-MMAP (shared file-table)\n");
> +	test_share_mmap();
> +	printf("memfd: SHARE-OPEN (shared file-table)\n");
> +	test_share_open();
> +	printf("memfd: SHARE-FORK (shared file-table)\n");
> +	test_share_fork();
> +	join_idle_thread(pid);
> +
> +	printf("memfd: DONE\n");
> +
> +	return 0;
> +}
> -- 
> 1.9.2


* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-04-15 18:38   ` David Herrmann
@ 2014-05-21 10:50     ` Konstantin Khlebnikov
  -1 siblings, 0 replies; 53+ messages in thread
From: Konstantin Khlebnikov @ 2014-05-21 10:50 UTC (permalink / raw)
  To: David Herrmann
  Cc: Linux Kernel Mailing List, Michael Kerrisk, Ryan Lortie,
	Linus Torvalds, Andrew Morton, linux-mm, linux-fsdevel,
	Johannes Weiner, Tejun Heo, Greg Kroah-Hartman, John Stultz,
	Kristian Høgsberg, Lennart Poettering, Daniel Mack,
	Kay Sievers

On Tue, Apr 15, 2014 at 10:38 PM, David Herrmann <dh.herrmann@gmail.com> wrote:
> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It can support sealing and avoids any
> connection to user-visible mount-points. Thus, it's not subject to quotas
> on mounted file-systems, but can be used like malloc()'ed memory, but
> with a file-descriptor to it.
>
> memfd_create() does not create a front-FD, but instead returns the raw
> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
> will return proper information and mark the file as regular file. If you
> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
> support (like on all other regular files).
>
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to quotas and alike.
>
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
> ---

<cut>

> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>  asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>  asmlinkage long sys_eventfd(unsigned int count);
>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
> +asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);

Is it right to use u64 here? I think the arguments should be 'loff_t' and 'int'.

>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>  asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>  asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,



* Re: [PATCH v2 1/3] shm: add sealing API
  2014-05-20  2:16     ` Hugh Dickins
@ 2014-05-23 16:37       ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-23 16:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirski, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi Hugh

Thanks for the review! It all looks good; a few comments are inline where
I didn't agree. Everything I didn't comment on is fixed in my tree.

On Tue, May 20, 2014 at 4:16 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
>> index 4d1771c..c043d67 100644
>> --- a/include/linux/shmem_fs.h
>> +++ b/include/linux/shmem_fs.h
>> @@ -1,6 +1,7 @@
>>  #ifndef __SHMEM_FS_H
>>  #define __SHMEM_FS_H
>>
>> +#include <linux/file.h>
>>  #include <linux/swap.h>
>>  #include <linux/mempolicy.h>
>>  #include <linux/pagemap.h>
>> @@ -20,6 +21,7 @@ struct shmem_inode_info {
>>       struct shared_policy    policy;         /* NUMA memory alloc policy */
>>       struct list_head        swaplist;       /* chain of maybes on swap */
>>       struct simple_xattrs    xattrs;         /* list of xattrs */
>> +     u32                     seals;          /* shmem seals */
>
> Okay.  I do wonder why you chose "u32" where I would have chosen
> "unsigned int": probably just our different backgrounds - kernel
> internals most often use the basic types, whereas you are thinking
> about explicit interfaces.  Even syscalls tend to have "int" args,
> but perhaps that's just a historic mistake.  I have no good reason
> to disagree with your use of "u32", but draw attention to it in
> case someone else feels more strongly.
>
> Oh, how about you move "seals" up between "lock" and "flags":
> on many configurations, it will then occupy what used to be padding.

No specific reason for u32, just personal preference. I've changed it
to "unsigned int" and moved it up.
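
For illustration only, a minimal userspace sketch of the padding point.
The two structs are hypothetical stand-ins and not the real
shmem_inode_info layout: a plain int stands in for the spinlock, and
only a few fields are modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: "seals" appended after the long-sized fields. */
struct info_seals_last {
	int		lock;		/* stands in for spinlock_t */
	unsigned long	flags;
	unsigned long	alloced;
	unsigned int	seals;
};

/* Same fields, with "seals" moved up between "lock" and "flags". */
struct info_seals_after_lock {
	int		lock;
	unsigned int	seals;
	unsigned long	flags;
	unsigned long	alloced;
};

/* Moving the field up must never make the struct larger. */
static_assert(sizeof(struct info_seals_after_lock) <=
	      sizeof(struct info_seals_last),
	      "reordering must not grow the struct");

/* Bytes saved by the reordering on this ABI (0 if there was no padding). */
static size_t seals_reorder_savings(void)
{
	return sizeof(struct info_seals_last) -
	       sizeof(struct info_seals_after_lock);
}
```

On an LP64 ABI the first layout should carry 4 bytes of padding after
"lock" and 4 more at the tail, so the function would be expected to
report 8 there; on ABIs where long is 32 bits it reports 0.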

>>       struct inode            vfs_inode;
>>  };
>>
>> @@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
>>                                       mapping_gfp_mask(mapping));
>>  }
>>
>> +/* marks inode to support sealing */
>> +#define SHMEM_ALLOW_SEALING (1U << 31)
>
> This feels unnecessary to me: see comment on shmem_add_seals.

Indeed, we can just mark all files as "already sealed" except for
memfd-files. This causes F_GET_SEALS to succeed on non-memfd
files, but I think that's fine. Fixed!
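
A rough userspace model of that scheme, assuming every inode starts
with the "seal sealing" bit set and only the memfd path clears it. All
sk_* names are hypothetical; the bit values mirror the F_SEAL_*
definitions in the quoted uapi hunk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local copies of the seal bits from the quoted uapi hunk (F_SEAL_*). */
enum {
	SK_SEAL_SEAL   = 0x0001,
	SK_SEAL_SHRINK = 0x0002,
	SK_SEAL_GROW   = 0x0004,
	SK_SEAL_WRITE  = 0x0008,
	SK_ALL_SEALS   = 0x000f,
};

/* Hypothetical stand-in for the per-inode state. */
struct sk_inode {
	unsigned int seals;
};

/* Every inode starts out "already sealed against sealing". */
static void sk_inode_init(struct sk_inode *info)
{
	info->seals = SK_SEAL_SEAL;
}

/* What memfd_create() with MFD_ALLOW_SEALING would do in this scheme. */
static void sk_allow_sealing(struct sk_inode *info)
{
	info->seals &= ~(unsigned int)SK_SEAL_SEAL;
}

/* Seals can only be added, and only while SK_SEAL_SEAL is not set. */
static int sk_add_seals(struct sk_inode *info, unsigned int seals)
{
	assert(info != NULL);

	if (seals & ~(unsigned int)SK_ALL_SEALS)
		return -EINVAL;
	if (info->seals & SK_SEAL_SEAL)
		return -EPERM;
	info->seals |= seals;
	return 0;
}

/* Succeeds on any inode; a non-memfd inode simply reports SK_SEAL_SEAL. */
static unsigned int sk_get_seals(const struct sk_inode *info)
{
	return info->seals & SK_ALL_SEALS;
}
```

With this model no separate "allow sealing" flag is needed: a file that
never went through sk_allow_sealing() rejects every sk_add_seals() call
with -EPERM, which is the same outcome as a file sealed with
F_SEAL_SEAL.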

>> +
>> +#ifdef CONFIG_SHMEM
>
> Should that rather be CONFIG_TMPFS?  I think you have placed
> shmem_fcntl() and its supporting functions in the CONFIG_TMPFS
> part of mm/shmem.c (and CONFIG_TMPFS depends on CONFIG_SHMEM).
>
> It's almost certainly true that "CONFIG_TMPFS" has outlived its v2.4
> usefulness, and serves as more of a confusion than a help nowadays:
> particularly since !CONFIG_SHMEM gives you the ramfs filesystem, but
> CONFIG_SHMEM without CONFIG_TMPFS does not give you a filesystem.
>
> Blame me for leaving CONFIG_TMPFS around; but for now,
> I think it's CONFIG_TMPFS you want there (please check).

We definitely want TMPFS for ftruncate/fallocate and friends. Fixed!

>> +
>> +extern int shmem_add_seals(struct file *file, u32 seals);
>> +extern int shmem_get_seals(struct file *file);
>> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
>> +
>> +#else
>> +
>
> Are you sure you want to generate a link error rather than a runtime
> fallback if there's a driver using shmem_add_seals() or shmem_get_seals()
> in a !CONFIG_SHMEM kernel?  That might be the right decision, but it
> surprises me a little.

I have some experimental kernel patches that depend on sealing. As
there is currently no way to test for sealing via Kconfig, I thought a
link error was the best solution. I expect people to change this once
we actually have code that can deal with a fallback. But I couldn't
come up with a use case where people would want sealing as an optional
feature.

>> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
>> +{
>> +     return -EINVAL;
>
> Should be -EBADF to match what you get in the CONFIG_SHMEM case.
>
>> +}
>> +
>> +#endif
>> +
>>  #endif
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 074b886..1b9b9f4 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -28,6 +28,21 @@
>>  #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
>>
>>  /*
>> + * Set/Get seals
>> + */
>> +#define F_ADD_SEALS  (F_LINUX_SPECIFIC_BASE + 9)
>> +#define F_GET_SEALS  (F_LINUX_SPECIFIC_BASE + 10)
>> +
>> +/*
>> + * Types of seals
>> + */
>> +#define F_SEAL_SEAL  0x0001  /* prevent further seals from being set */
>> +#define F_SEAL_SHRINK        0x0002  /* prevent file from shrinking */
>> +#define F_SEAL_GROW  0x0004  /* prevent file from growing */
>> +#define F_SEAL_WRITE 0x0008  /* prevent writes */
>> +/* (1U << 31) is reserved for internal use */
>
> I question the need to reserve that: see comment on shmem_add_seals.
>
>> +
>> +/*
>>   * Types of directory notifications that may be requested.
>>   */
>>  #define DN_ACCESS    0x00000001      /* File accessed */
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 9f70e02..175a5b8 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
>>  #include <linux/highmem.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/magic.h>
>> +#include <linux/fcntl.h>
>>
>>  #include <asm/uaccess.h>
>>  #include <asm/pgtable.h>
>> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>       struct inode *inode = dentry->d_inode;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +     loff_t oldsize = inode->i_size;
>> +     loff_t newsize = attr->ia_size;
>>       int error;
>>
>>       error = inode_change_ok(inode, attr);
>>       if (error)
>>               return error;
>>
>> -     if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>> -             loff_t oldsize = inode->i_size;
>> -             loff_t newsize = attr->ia_size;
>> +     /* protected by i_mutex */
>> +     if (attr->ia_valid & ATTR_SIZE) {
>> +             if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
>> +                 (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>> +                     return -EPERM;
>> +     }
>>
>> +     if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>>               if (newsize != oldsize) {
>>                       i_size_write(inode, newsize);
>>                       inode->i_ctime = inode->i_mtime = CURRENT_TIME;
>> @@ -1289,6 +1297,13 @@ out_nomem:
>>
>>  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>>  {
>> +     struct inode *inode = file_inode(file);
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +     /* protected by mmap_sem */
>> +     if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
>> +             return -EPERM;
>> +
>>       file_accessed(file);
>>       vma->vm_ops = &shmem_vm_ops;
>>       return 0;
>> @@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>>                       struct page **pagep, void **fsdata)
>>  {
>>       struct inode *inode = mapping->host;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>>       pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>> +
>> +     /* i_mutex is held by caller */
>> +     if (info->seals & F_SEAL_WRITE)
>> +             return -EPERM;
>> +     if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
>> +             return -EPERM;
>> +
>>       return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
>>  }
>>
>> @@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
>>       return offset;
>>  }
>>
>> +#define F_ALL_SEALS (F_SEAL_SEAL | \
>> +                  F_SEAL_SHRINK | \
>> +                  F_SEAL_GROW | \
>> +                  F_SEAL_WRITE)
>> +
>> +int shmem_add_seals(struct file *file, u32 seals)
>> +{
>> +     struct dentry *dentry = file->f_path.dentry;
>> +     struct inode *inode = dentry->d_inode;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +     int r;
>
> mm/shmem.c is currently using "int error", "int err", "int ret" or
> "int retval" for this (maybe more!): I'd prefer you not to add "r"
> to the menagerie, "error" or "err" would be good here.
>
>> +
>> +     /* SHMEM_ALLOW_SEALING is a private, unused bit */
>> +     BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);
>
> I see no need for SHMEM_ALLOW_SEALING.
> Now that you have added F_SEAL_SEAL, why don't you just make
> shmem_get_inode() initialize info->seals with F_SEAL_SEAL,
> then clear that in the one place you need to in the next patch?
>
>> +
>> +     /*
>> +      * SEALING
>> +      * Sealing allows multiple parties to share a shmem-file but restrict
>> +      * access to a specific subset of file operations. Seals can only be
>> +      * added, but never removed. This way, mutually untrusted parties can
>> +      * share common memory regions with a well-defined policy. A malicious
>> +      * peer can thus never perform unwanted operations on a shared object.
>> +      *
>> +      * Seals are only supported on special shmem-files and always affect
>> +      * the whole underlying inode. Once a seal is set, it may prevent some
>> +      * kinds of access to the file. Currently, the following seals are
>> +      * defined:
>> +      *   SEAL_SEAL: Prevent further seals from being set on this file
>> +      *   SEAL_SHRINK: Prevent the file from shrinking
>> +      *   SEAL_GROW: Prevent the file from growing
>> +      *   SEAL_WRITE: Prevent write access to the file
>> +      *
>> +      * As we don't require any trust relationship between two parties, we
>> +      * must prevent seals from being removed. Therefore, sealing a file
>> +      * only adds a given set of seals to the file, it never touches
>> +      * existing seals. Furthermore, the "setting seals"-operation can be
>> +      * sealed itself, which basically prevents any further seal from being
>> +      * added.
>> +      *
>> +      * Semantics of sealing are only defined on volatile files. Only
>> +      * anonymous shmem files support sealing. More importantly, seals are
>> +      * never written to disk. Therefore, there's no plan to support it on
>> +      * other file types.
>> +      */
>> +
>> +     if (file->f_op != &shmem_file_operations)
>> +             return -EBADF;
>
> Okay: that's not what I expect -EBADF to mean, but it does follow
> the precedent set by pipe_fcntl().

I wasn't sure about that either, but as you noticed this behavior is
copied from pipe_fcntl(). I'm open to discussion, but if no one objects
I will keep this behavior.

>> +     if (!(info->seals & SHMEM_ALLOW_SEALING))
>> +             return -EBADF;
>> +     if (!(file->f_mode & FMODE_WRITE))
>> +             return -EPERM;
>> +     if (seals & ~(u32)F_ALL_SEALS)
>> +             return -EINVAL;
>> +
>> +     /*
>> +      * - i_mutex prevents racing write/ftruncate/fallocate/..
>> +      * - mmap_sem prevents racing mmap() calls
>> +      */
>> +
>> +     mutex_lock(&inode->i_mutex);
>> +     down_read(&current->mm->mmap_sem);
>
> I don't think that use of current->mm->mmap_sem can be correct:
> it guards against races with other threads of this process, but
> what if another process has this object open and races to mmap it?
>
> I imagine you have to use i_mmap_mutex, and plumb an error return
> into __vma_link_file() etc in mm/mmap.c, if the file is found already
> sealed against writing - which may prove irritating, especially with
> knowledge of sealing being private to mm/shmem.c.

Yes, that access to "current->mm" is wrong.

i_mmap_mutex is the only per-object lock that is taken in the mmap()
path and all vma_link() users can easily be changed to deal with
errors. So I think it should be easy to make __vma_link_file() fail if
no writable mappings are allowed. Testing for shmem-seals seems odd
here, indeed. We could instead make i_mmap_writable work like
i_writecount. If it's negative, no new writable mappings are allowed.
shmem_add_seals() could then decrement it below zero, and
__vma_link_file() would just test whether it's negative. Comments?
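
To make the proposed counter semantics concrete, here is a userspace
sketch using C11 atomics. It is only a model under assumptions: the
sk_* names are hypothetical, the error code follows the -EPERM used in
the quoted patch, and real kernel code would rely on i_mmap_mutex or
kernel atomic helpers rather than <stdatomic.h>.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* > 0: number of writable shared mappings; < 0: writable mappings denied. */
struct sk_mapping {
	atomic_int writable;
};

/* mmap() path: take a writable reference unless mappings are denied. */
static int sk_map_writable(struct sk_mapping *m)
{
	int cur = atomic_load(&m->writable);

	do {
		if (cur < 0)
			return -EPERM;
	} while (!atomic_compare_exchange_weak(&m->writable, &cur, cur + 1));
	return 0;
}

/* munmap() path: drop a writable reference taken earlier. */
static void sk_unmap_writable(struct sk_mapping *m)
{
	int old = atomic_fetch_sub(&m->writable, 1);

	assert(old > 0);
	(void)old;
}

/* Sealing path: deny new writable mappings unless some already exist. */
static int sk_deny_writable(struct sk_mapping *m)
{
	int cur = atomic_load(&m->writable);

	do {
		if (cur > 0)
			return -EPERM;
	} while (!atomic_compare_exchange_weak(&m->writable, &cur, cur - 1));
	return 0;
}
```

The two paths exclude each other through the sign of a single counter,
which is how i_writecount already arbitrates between writers and
deny-write users.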

> But I have not stopped to work it out properly: the answer may depend
> on the answer to the major issue of outstanding async I/O.  As I
> mentioned last week, that's an issue I think we cannot overlook.
> Tony's copy-raised-pagecount-pages suggestion is a good one, but
> not so attractive that I'll give up hope for a better solution.
>
>> +
>> +     /* you cannot seal while shared mappings exist */
>> +     if (file->f_mapping->i_mmap_writable > 0) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +
>> +     if (info->seals & F_SEAL_SEAL) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +
>> +     info->seals |= seals;
>> +     r = 0;
>> +
>> +unlock:
>> +     up_read(&current->mm->mmap_sem);
>> +     mutex_unlock(&inode->i_mutex);
>> +     return r;
>> +}
>> +EXPORT_SYMBOL(shmem_add_seals);
>
> EXPORT_SYMBOL_GPL(shmem_add_seals).
>
> We don't see an example of its use, but I certainly don't want to see
> drivers/gpu changes as part of this patchset, so I think that's okay.
>
>> +
>> +int shmem_get_seals(struct file *file)
>> +{
>> +     struct shmem_inode_info *info;
>> +
>> +     if (file->f_op != &shmem_file_operations)
>> +             return -EBADF;
>> +
>> +     info = SHMEM_I(file_inode(file));
>> +     if (!(info->seals & SHMEM_ALLOW_SEALING))
>> +             return -EBADF;
>
> Hmm, so the F_SEAL_SEAL change I suggest would remove that -EBADF,
> and instead return F_SEAL_SEAL on any shmem object.  I think that's
> fine, but you may see a reason why not?

Fine with me.

>> +
>> +     return info->seals & F_ALL_SEALS;
>> +}
>> +EXPORT_SYMBOL(shmem_get_seals);
>
> EXPORT_SYMBOL_GPL(shmem_get_seals).
>
>> +
>> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> +     long r;
>
> long ret or retval please.
>
>> +
>> +     switch (cmd) {
>> +     case F_ADD_SEALS:
>> +             /* disallow upper 32bit */
>> +             if (arg >> 32)
>> +                     return -EINVAL;
>> +
>> +             r = shmem_add_seals(file, arg);
>> +             break;
>> +     case F_GET_SEALS:
>> +             r = shmem_get_seals(file);
>> +             break;
>> +     default:
>> +             r = -EINVAL;
>> +             break;
>> +     }
>> +
>> +     return r;
>> +}
>> +
>>  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>                                                        loff_t len)
>>  {
>>       struct inode *inode = file_inode(file);
>>       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>>       struct shmem_falloc shmem_falloc;
>>       pgoff_t start, index, end;
>>       int error;
>> @@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>               loff_t unmap_start = round_up(offset, PAGE_SIZE);
>>               loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>>
>> +             /* protected by i_mutex */
>> +             if (info->seals & F_SEAL_WRITE) {
>> +                     error = -EPERM;
>> +                     goto out;
>> +             }
>> +
>>               if ((u64)unmap_end > (u64)unmap_start)
>>                       unmap_mapping_range(mapping, unmap_start,
>>                                           1 + unmap_end - unmap_start, 0);
>> @@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>       if (error)
>>               goto out;
>>
>> +     if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
>
> Okay.  I don't think it needs a comment, but I note in passing that we
> *could* permit a FALLOC_FL_KEEP_SIZE change there, since it will make
> no difference to what data is accessible; but it would also serve no
> useful purpose, so fine to stick with the simpler test you have.

Yeah, it could be used to fill previously punched holes, but on the
other hand that sounds like a very odd use-case. I will think about
it, but it doesn't hurt to fix it right now.
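
The two variants of the grow check can be compared side by side in a
small sketch. The helper names and the local constants are
hypothetical stand-ins (for F_SEAL_GROW and FALLOC_FL_KEEP_SIZE); only
the simple test corresponds to the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins so the sketch does not depend on kernel headers. */
#define SKETCH_SEAL_GROW	0x0004	/* mirrors F_SEAL_GROW from the patch */
#define SKETCH_FALLOC_KEEP_SIZE	0x01	/* stands in for FALLOC_FL_KEEP_SIZE */

/* The test from the patch: any request reaching past i_size is refused. */
static bool sk_grow_denied_simple(unsigned int seals, int64_t offset,
				  int64_t len, int64_t i_size)
{
	assert(offset >= 0 && len >= 0);
	return (seals & SKETCH_SEAL_GROW) && offset + len > i_size;
}

/* Relaxed variant: a KEEP_SIZE request never changes i_size, so allow it. */
static bool sk_grow_denied_relaxed(unsigned int seals, int mode,
				   int64_t offset, int64_t len,
				   int64_t i_size)
{
	if (mode & SKETCH_FALLOC_KEEP_SIZE)
		return false;
	return sk_grow_denied_simple(seals, offset, len, i_size);
}
```

The variants differ only for KEEP_SIZE requests that extend past the
current end of file.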

>> +             error = -EPERM;
>> +             goto out;
>> +     }
>> +
>>       start = offset >> PAGE_CACHE_SHIFT;
>>       end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>>       /* Try to avoid a swapstorm if len is impossible to satisfy */
>> --
>> 1.9.2
>
> There is also, or may be, a small issue of sparse (holey) files.
> I do have a question on that in comments on your next patch, and
> the answer here may depend on what you want in memfd_create().
>
> What I'm thinking of here is that once a sparse file is sealed
> against writing, we must be sure not to give an error when reading
> its holes: whereas there are a few unlikely ways in which reading
> the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.
>
> Most of the memory allocations here can in fact only fail when the
> allocating process has already been selected for OOM-kill: that is
> not guaranteed forever, but it is how __alloc_pages_slowpath()
> currently behaves on ordinary low-order allocations, and will be
> hard to change if we ever do so.  Though I dislike relying upon
> this, I think we can allow reading holes to fail, if the process
> is going to be forcibly killed before it returns to userspace.
>
> But there might still be an issue with vm_enough_memory(),
> and there might still be an issue with memcg limits.
>
> We do already use the ZERO_PAGE instead of allocating when it's a
> simple read; and on the face of it, we could extend that to mmap
> once the file is sealed.  But I am rather afraid to do so - for
> many years there was an mmap /dev/zero case which did that, but
> it was an easily forgotten case which caught us out at least
> once, so I'm reluctant to reintroduce it now for sealing.
>
> Anyway, I don't expect you to resolve the issue of sealed holes:
> that's very much my territory, to give you support on.

Why not require users to use mlock() if they want to protect
themselves against OOM situations? At least the man-page says that
mlock() guarantees that all pages in the specified range are loaded. I
didn't verify whether that includes holes, though. And if
RLIMIT_MEMLOCK is too small, users ought to access the object in
smaller chunks.
And it's not specific to sparse files. Any other page may be swapped
out and the swap-in can fail due to ENOMEM (page-table allocations,
tree-inserts, and so on). But you definitely know better what to do
here, so suggestions welcome.

Anyway, sealing is not meant to protect against OOM situations. I
mean, any mapping is subject to OOM, so processes that care should
have a suitable infrastructure via SIGBUS or mlock() for all mappings,
including sealed files. Furthermore, write-sealing is meant to prevent
targeted attacks that modify data while it is being parsed. We
properly protect users against that. OOM is an orthogonal issue, imho.

Moreover, if we guarantee that sealed files are always present in
memory, we give users a way to circumvent RLIMIT_MEMLOCK (only for
readable mappings, but still..).
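
For readers who want to try the write-seal guarantee from userspace,
here is a hedged sketch. It assumes the interface as it was later
merged in Linux 3.17, where memfd_create() takes only a name and flags
(no size argument as in this v2 series). It uses the raw syscall and
fallback constants so that older libc headers still compile;
sk_write_seal_demo() is a hypothetical helper name.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older libc headers; values as in the merged uapi headers. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC		0x0001U
#define MFD_ALLOW_SEALING	0x0002U
#endif
#ifndef F_ADD_SEALS
#define F_ADD_SEALS	(1024 + 9)
#define F_GET_SEALS	(1024 + 10)
#define F_SEAL_WRITE	0x0008
#endif

static_assert(F_SEAL_WRITE == 0x0008, "value from the quoted uapi hunk");

/*
 * Returns 0 if the write seal behaved as described, 77 if memfd_create()
 * is unavailable here, and 1 on any unexpected result.
 */
static int sk_write_seal_demo(void)
{
#ifdef SYS_memfd_create
	int fd, ret = 1;

	fd = (int)syscall(SYS_memfd_create, "sk_seal_demo",
			  MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return 77;

	/* Writing works while no seal is set. */
	if (write(fd, "data", 4) != 4)
		goto out;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) != 0)
		goto out;
	/* Once sealed, further writes are refused with EPERM. */
	if (write(fd, "x", 1) != -1 || errno != EPERM)
		goto out;
	if (!(fcntl(fd, F_GET_SEALS) & F_SEAL_WRITE))
		goto out;
	ret = 0;
out:
	close(fd);
	return ret;
#else
	return 77;
#endif
}
```

A receiver of such an fd can check F_GET_SEALS before parsing and then
rely on the contents not changing underneath it. As argued above, that
says nothing about whether the pages stay resident.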

Thanks a lot for the review!
David


* Re: [PATCH v2 1/3] shm: add sealing API
@ 2014-05-23 16:37       ` David Herrmann
  0 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-23 16:37 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirski, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi Hugh

Thanks for the review! It all looks good; a few comments are inline where
I didn't agree. Everything I didn't comment on is fixed in my tree.

On Tue, May 20, 2014 at 4:16 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>> diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
>> index 4d1771c..c043d67 100644
>> --- a/include/linux/shmem_fs.h
>> +++ b/include/linux/shmem_fs.h
>> @@ -1,6 +1,7 @@
>>  #ifndef __SHMEM_FS_H
>>  #define __SHMEM_FS_H
>>
>> +#include <linux/file.h>
>>  #include <linux/swap.h>
>>  #include <linux/mempolicy.h>
>>  #include <linux/pagemap.h>
>> @@ -20,6 +21,7 @@ struct shmem_inode_info {
>>       struct shared_policy    policy;         /* NUMA memory alloc policy */
>>       struct list_head        swaplist;       /* chain of maybes on swap */
>>       struct simple_xattrs    xattrs;         /* list of xattrs */
>> +     u32                     seals;          /* shmem seals */
>
> Okay.  I do wonder why you chose "u32" where I would have chosen
> "unsigned int": probably just our different backgrounds - kernel
> internals most often use the basic types, whereas you are thinking
> about explicit interfaces.  Even syscalls tend to have "int" args,
> but perhaps that's just a historic mistake.  I have no good reason
> to disagree with your use of "u32", but draw attention to it in
> case someone else feels more strongly.
>
> Oh, how about you move "seals" up between "lock" and "flags":
> on many configurations, it will then occupy what used to be padding.

No specific reason for u32, just personal preference. I've changed it
to "unsigned int" and moved it up.
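
To illustrate why moving "seals" up helps, here is a minimal sketch with
two hypothetical stand-in structs. The field names and types are
assumptions made for illustration (the real shmem_inode_info has a
spinlock_t and more members), so the numbers only show the padding
effect, not the actual layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the start of shmem_inode_info. */
struct info_seals_last {
	uint32_t lock;          /* stands in for a 4-byte spinlock_t */
	unsigned long flags;    /* 8-byte aligned on LP64: 4 bytes of padding before it */
	unsigned long alloced;
	uint32_t seals;         /* appended at the end: new slot plus tail padding */
};

struct info_seals_packed {
	uint32_t lock;
	uint32_t seals;         /* reuses the hole between lock and flags */
	unsigned long flags;
	unsigned long alloced;
};

/* Bytes saved by moving seals next to lock (0 where there was no hole). */
static size_t seals_bytes_saved(void)
{
	/* seals must sit directly behind the 4-byte lock stand-in */
	assert(offsetof(struct info_seals_packed, seals) == sizeof(uint32_t));
	return sizeof(struct info_seals_last) - sizeof(struct info_seals_packed);
}
```

On ABIs where unsigned long is only 4-byte aligned there is no hole to
fill, and the two layouts come out the same size.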

>>       struct inode            vfs_inode;
>>  };
>>
>> @@ -65,4 +67,22 @@ static inline struct page *shmem_read_mapping_page(
>>                                       mapping_gfp_mask(mapping));
>>  }
>>
>> +/* marks inode to support sealing */
>> +#define SHMEM_ALLOW_SEALING (1U << 31)
>
> This feels unnecessary to me: see comment on shmem_add_seals.

Indeed, we can just mark all files as "already sealed" except for
memfd-files. This causes F_GET_SEALS to succeed on non-memfd
files, but I think that's fine. Fixed!
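
A minimal userspace model of that idea, assuming the seal values from
the quoted uapi hunk. The model_* names and the struct are hypothetical
and only mirror the logic: every inode starts with SEAL_SEAL set, and
only the memfd path clears it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Same values as F_SEAL_* in the quoted hunk; renamed to avoid clashes. */
enum {
	SEAL_SEAL   = 0x0001,
	SEAL_SHRINK = 0x0002,
	SEAL_GROW   = 0x0004,
	SEAL_WRITE  = 0x0008,
	ALL_SEALS   = SEAL_SEAL | SEAL_SHRINK | SEAL_GROW | SEAL_WRITE
};

struct model_inode {
	unsigned int seals;
};

/* Every new shmem inode starts out "sealed against sealing". */
static void model_inode_init(struct model_inode *inode)
{
	assert(inode != NULL);
	inode->seals = SEAL_SEAL;
}

/* memfd_create(MFD_ALLOW_SEALING) would be the one place that clears it. */
static void model_allow_sealing(struct model_inode *inode)
{
	inode->seals &= ~(unsigned int)SEAL_SEAL;
}

static int model_add_seals(struct model_inode *inode, unsigned int seals)
{
	if (seals & ~(unsigned int)ALL_SEALS)
		return -EINVAL;
	if (inode->seals & SEAL_SEAL)
		return -EPERM;          /* sealing itself has been sealed */
	inode->seals |= seals;
	return 0;
}

/* Succeeds on any shmem inode; plain files simply report SEAL_SEAL. */
static int model_get_seals(const struct model_inode *inode)
{
	return (int)(inode->seals & ALL_SEALS);
}
```

With this scheme no private SHMEM_ALLOW_SEALING bit is needed, and
(1U << 31) does not have to be reserved in the uapi header.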

>> +
>> +#ifdef CONFIG_SHMEM
>
> Should that rather be CONFIG_TMPFS?  I think you have placed
> shmem_fcntl() and its supporting functions in the CONFIG_TMPFS
> part of mm/shmem.c (and CONFIG_TMPFS depends on CONFIG_SHMEM).
>
> It's almost certainly true that "CONFIG_TMPFS" has outlived its v2.4
> usefulness, and serves as more of a confusion than a help nowadays:
> particularly since !CONFIG_SHMEM gives you the ramfs filesystem, but
> CONFIG_SHMEM without CONFIG_TMPFS does not give you a filesystem.
>
> Blame me for leaving CONFIG_TMPFS around; but for now,
> I think it's CONFIG_TMPFS you want there (please check).

We definitely want TMPFS for ftruncate/fallocate and friends. Fixed!

>> +
>> +extern int shmem_add_seals(struct file *file, u32 seals);
>> +extern int shmem_get_seals(struct file *file);
>> +extern long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
>> +
>> +#else
>> +
>
> Are you sure you want to generate a link error rather than a runtime
> fallback if there's a driver using shmem_add_seals() or shmem_get_seals()
> in a !CONFIG_SHMEM kernel?  That might be the right decision, but it
> surprises me a little.

I have some experimental kernel patches that depend on sealing. As
there is currently no way to test for sealing via Kconfig, I thought a
link-error is the best solution. I expect people to change this once
we actually have code that can deal with a fallback. But I couldn't
come up with a use-case where people want sealing as an optional
feature.

>> +static inline long shmem_fcntl(struct file *f, unsigned int c, unsigned long a)
>> +{
>> +     return -EINVAL;
>
> Should be -EBADF to match what you get in the CONFIG_SHMEM case.
>
>> +}
>> +
>> +#endif
>> +
>>  #endif
>> diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
>> index 074b886..1b9b9f4 100644
>> --- a/include/uapi/linux/fcntl.h
>> +++ b/include/uapi/linux/fcntl.h
>> @@ -28,6 +28,21 @@
>>  #define F_GETPIPE_SZ (F_LINUX_SPECIFIC_BASE + 8)
>>
>>  /*
>> + * Set/Get seals
>> + */
>> +#define F_ADD_SEALS  (F_LINUX_SPECIFIC_BASE + 9)
>> +#define F_GET_SEALS  (F_LINUX_SPECIFIC_BASE + 10)
>> +
>> +/*
>> + * Types of seals
>> + */
>> +#define F_SEAL_SEAL  0x0001  /* prevent further seals from being set */
>> +#define F_SEAL_SHRINK        0x0002  /* prevent file from shrinking */
>> +#define F_SEAL_GROW  0x0004  /* prevent file from growing */
>> +#define F_SEAL_WRITE 0x0008  /* prevent writes */
>> +/* (1U << 31) is reserved for internal use */
>
> I question the need to reserve that: see comment on shmem_add_seals.
>
>> +
>> +/*
>>   * Types of directory notifications that may be requested.
>>   */
>>  #define DN_ACCESS    0x00000001      /* File accessed */
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 9f70e02..175a5b8 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,6 +66,7 @@ static struct vfsmount *shm_mnt;
>>  #include <linux/highmem.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/magic.h>
>> +#include <linux/fcntl.h>
>>
>>  #include <asm/uaccess.h>
>>  #include <asm/pgtable.h>
>> @@ -531,16 +532,23 @@ EXPORT_SYMBOL_GPL(shmem_truncate_range);
>>  static int shmem_setattr(struct dentry *dentry, struct iattr *attr)
>>  {
>>       struct inode *inode = dentry->d_inode;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +     loff_t oldsize = inode->i_size;
>> +     loff_t newsize = attr->ia_size;
>>       int error;
>>
>>       error = inode_change_ok(inode, attr);
>>       if (error)
>>               return error;
>>
>> -     if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>> -             loff_t oldsize = inode->i_size;
>> -             loff_t newsize = attr->ia_size;
>> +     /* protected by i_mutex */
>> +     if (attr->ia_valid & ATTR_SIZE) {
>> +             if ((newsize < oldsize && (info->seals & F_SEAL_SHRINK)) ||
>> +                 (newsize > oldsize && (info->seals & F_SEAL_GROW)))
>> +                     return -EPERM;
>> +     }
>>
>> +     if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
>>               if (newsize != oldsize) {
>>                       i_size_write(inode, newsize);
>>                       inode->i_ctime = inode->i_mtime = CURRENT_TIME;
>> @@ -1289,6 +1297,13 @@ out_nomem:
>>
>>  static int shmem_mmap(struct file *file, struct vm_area_struct *vma)
>>  {
>> +     struct inode *inode = file_inode(file);
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +
>> +     /* protected by mmap_sem */
>> +     if ((info->seals & F_SEAL_WRITE) && (vma->vm_flags & VM_SHARED))
>> +             return -EPERM;
>> +
>>       file_accessed(file);
>>       vma->vm_ops = &shmem_vm_ops;
>>       return 0;
>> @@ -1373,7 +1388,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>>                       struct page **pagep, void **fsdata)
>>  {
>>       struct inode *inode = mapping->host;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>>       pgoff_t index = pos >> PAGE_CACHE_SHIFT;
>> +
>> +     /* i_mutex is held by caller */
>> +     if (info->seals & F_SEAL_WRITE)
>> +             return -EPERM;
>> +     if ((info->seals & F_SEAL_GROW) && pos + len > inode->i_size)
>> +             return -EPERM;
>> +
>>       return shmem_getpage(inode, index, pagep, SGP_WRITE, NULL);
>>  }
>>
>> @@ -1719,11 +1742,133 @@ static loff_t shmem_file_llseek(struct file *file, loff_t offset, int whence)
>>       return offset;
>>  }
>>
>> +#define F_ALL_SEALS (F_SEAL_SEAL | \
>> +                  F_SEAL_SHRINK | \
>> +                  F_SEAL_GROW | \
>> +                  F_SEAL_WRITE)
>> +
>> +int shmem_add_seals(struct file *file, u32 seals)
>> +{
>> +     struct dentry *dentry = file->f_path.dentry;
>> +     struct inode *inode = dentry->d_inode;
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>> +     int r;
>
> mm/shmem.c is currently using "int error", "int err", "int ret" or
> "int retval" for this (maybe more!): I'd prefer you not to add "r"
> to the menagerie, "error" or "err" would be good here.
>
>> +
>> +     /* SHMEM_ALLOW_SEALING is a private, unused bit */
>> +     BUILD_BUG_ON(F_ALL_SEALS & SHMEM_ALLOW_SEALING);
>
> I see no need for SHMEM_ALLOW_SEALING.
> Now that you have added F_SEAL_SEAL, why don't you just make
> shmem_get_inode() initialize info->seals with F_SEAL_SEAL,
> then clear that in the one place you need to in the next patch?
>
>> +
>> +     /*
>> +      * SEALING
>> +      * Sealing allows multiple parties to share a shmem-file but restrict
>> +      * access to a specific subset of file operations. Seals can only be
>> +      * added, but never removed. This way, mutually untrusted parties can
>> +      * share common memory regions with a well-defined policy. A malicious
>> +      * peer can thus never perform unwanted operations on a shared object.
>> +      *
>> +      * Seals are only supported on special shmem-files and always affect
>> +      * the whole underlying inode. Once a seal is set, it may prevent some
>> +      * kinds of access to the file. Currently, the following seals are
>> +      * defined:
>> +      *   SEAL_SEAL: Prevent further seals from being set on this file
>> +      *   SEAL_SHRINK: Prevent the file from shrinking
>> +      *   SEAL_GROW: Prevent the file from growing
>> +      *   SEAL_WRITE: Prevent write access to the file
>> +      *
>> +      * As we don't require any trust relationship between two parties, we
>> +      * must prevent seals from being removed. Therefore, sealing a file
>> +      * only adds a given set of seals to the file, it never touches
>> +      * existing seals. Furthermore, the "setting seals"-operation can be
>> +      * sealed itself, which basically prevents any further seal from being
>> +      * added.
>> +      *
>> +      * Semantics of sealing are only defined on volatile files. Only
>> +      * anonymous shmem files support sealing. More importantly, seals are
>> +      * never written to disk. Therefore, there's no plan to support it on
>> +      * other file types.
>> +      */
>> +
>> +     if (file->f_op != &shmem_file_operations)
>> +             return -EBADF;
>
> Okay: that's not what I expect -EBADF to mean, but it does follow
> the precedent set by pipe_fcntl().

I wasn't sure about that either, but as you noticed this behavior is
copied from pipe_fcntl(). I'm open to discussion, but if no-one cares
I will keep this behavior.

>> +     if (!(info->seals & SHMEM_ALLOW_SEALING))
>> +             return -EBADF;
>> +     if (!(file->f_mode & FMODE_WRITE))
>> +             return -EPERM;
>> +     if (seals & ~(u32)F_ALL_SEALS)
>> +             return -EINVAL;
>> +
>> +     /*
>> +      * - i_mutex prevents racing write/ftruncate/fallocate/..
>> +      * - mmap_sem prevents racing mmap() calls
>> +      */
>> +
>> +     mutex_lock(&inode->i_mutex);
>> +     down_read(&current->mm->mmap_sem);
>
> I don't think that use of current->mm->mmap_sem can be correct:
> it guards against races with other threads of this process, but
> what if another process has this object open and races to mmap it?
>
> I imagine you have to use i_mmap_mutex, and plumb an error return
> into __vma_link_file() etc in mm/mmap.c, if the file is found already
> sealed against writing - which may prove irritating, especially with
> knowledge of sealing being private to mm/shmem.c.

Yes, that access to "current->mm" is wrong.

i_mmap_mutex is the only per-object lock that is taken in the mmap()
path and all vma_link() users can easily be changed to deal with
errors. So I think it should be easy to make __vma_link_file() fail if
no writable mappings are allowed. Testing for shmem-seals seems odd
here, indeed. We could instead make i_mmap_writable work like
i_writecount. If it's negative, no new writable mappings are allowed.
shmem_add_seals() could then decrement it to <0 and __vma_link_file()
just tests whether it's negative. Comments?
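
The counter semantics proposed above can be sketched as a small
userspace model. The struct and the model_* helpers are hypothetical
names; in the kernel the counter would be updated under i_mmap_mutex,
which this sketch does not model.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of address_space->i_mmap_writable. */
struct model_mapping {
	int i_mmap_writable;    /* > 0: writable mappings exist, < 0: denied */
};

/* Called when linking a shared writable vma: fails once writes are denied. */
static int model_map_writable(struct model_mapping *m)
{
	assert(m != NULL);
	if (m->i_mmap_writable < 0)
		return -EPERM;
	m->i_mmap_writable++;
	return 0;
}

/* Called when a shared writable vma goes away. */
static void model_unmap_writable(struct model_mapping *m)
{
	assert(m->i_mmap_writable > 0);
	m->i_mmap_writable--;
}

/* Called when adding the write seal: fails while writable mappings exist. */
static int model_deny_writable(struct model_mapping *m)
{
	if (m->i_mmap_writable > 0)
		return -EPERM;
	m->i_mmap_writable--;
	return 0;
}
```

This mirrors how i_writecount is used to deny write access, and it keeps
knowledge of seals out of mm/mmap.c: the mmap path only has to test the
sign of the counter.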

> But I have not stopped to work it out properly: the answer may depend
> on the answer to the major issue of outstanding async I/O.  As I
> mentioned last week, that's an issue I think we cannot overlook.
> Tony's copy-raised-pagecount-pages suggestion is a good one, but
> not so attractive that I'll give up hope for a better solution.
>
>> +
>> +     /* you cannot seal while shared mappings exist */
>> +     if (file->f_mapping->i_mmap_writable > 0) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +
>> +     if (info->seals & F_SEAL_SEAL) {
>> +             r = -EPERM;
>> +             goto unlock;
>> +     }
>> +
>> +     info->seals |= seals;
>> +     r = 0;
>> +
>> +unlock:
>> +     up_read(&current->mm->mmap_sem);
>> +     mutex_unlock(&inode->i_mutex);
>> +     return r;
>> +}
>> +EXPORT_SYMBOL(shmem_add_seals);
>
> EXPORT_SYMBOL_GPL(shmem_add_seals).
>
> We don't see an example of its use, but I certainly don't want to see
> drivers/gpu changes as part of this patchset, so I think that's okay.
>
>> +
>> +int shmem_get_seals(struct file *file)
>> +{
>> +     struct shmem_inode_info *info;
>> +
>> +     if (file->f_op != &shmem_file_operations)
>> +             return -EBADF;
>> +
>> +     info = SHMEM_I(file_inode(file));
>> +     if (!(info->seals & SHMEM_ALLOW_SEALING))
>> +             return -EBADF;
>
> Hmm, so the F_SEAL_SEAL change I suggest would remove that -EBADF,
> and instead return F_SEAL_SEAL on any shmem object.  I think that's
> fine, but you may see a reason why not?

Fine with me.

>> +
>> +     return info->seals & F_ALL_SEALS;
>> +}
>> +EXPORT_SYMBOL(shmem_get_seals);
>
> EXPORT_SYMBOL_GPL(shmem_get_seals).
>
>> +
>> +long shmem_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
>> +{
>> +     long r;
>
> long ret or retval please.
>
>> +
>> +     switch (cmd) {
>> +     case F_ADD_SEALS:
>> +             /* disallow upper 32bit */
>> +             if (arg >> 32)
>> +                     return -EINVAL;
>> +
>> +             r = shmem_add_seals(file, arg);
>> +             break;
>> +     case F_GET_SEALS:
>> +             r = shmem_get_seals(file);
>> +             break;
>> +     default:
>> +             r = -EINVAL;
>> +             break;
>> +     }
>> +
>> +     return r;
>> +}
>> +
>>  static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>                                                        loff_t len)
>>  {
>>       struct inode *inode = file_inode(file);
>>       struct shmem_sb_info *sbinfo = SHMEM_SB(inode->i_sb);
>> +     struct shmem_inode_info *info = SHMEM_I(inode);
>>       struct shmem_falloc shmem_falloc;
>>       pgoff_t start, index, end;
>>       int error;
>> @@ -1735,6 +1880,12 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>               loff_t unmap_start = round_up(offset, PAGE_SIZE);
>>               loff_t unmap_end = round_down(offset + len, PAGE_SIZE) - 1;
>>
>> +             /* protected by i_mutex */
>> +             if (info->seals & F_SEAL_WRITE) {
>> +                     error = -EPERM;
>> +                     goto out;
>> +             }
>> +
>>               if ((u64)unmap_end > (u64)unmap_start)
>>                       unmap_mapping_range(mapping, unmap_start,
>>                                           1 + unmap_end - unmap_start, 0);
>> @@ -1749,6 +1900,11 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
>>       if (error)
>>               goto out;
>>
>> +     if ((info->seals & F_SEAL_GROW) && offset + len > inode->i_size) {
>
> Okay.  I don't think it needs a comment, but I note in passing that we
> *could* permit a FALLOC_FL_KEEP_SIZE change there, since it will make
> no difference to what data is accessible; but it would also serve no
> useful purpose, so fine to stick with the simpler test you have.

Yeah, it could be used to fill previously punched holes, but on the
other hand that sounds like a very odd use-case. I will think about
it, but it doesn't hurt to fix it right now.
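
As a rough sketch of the relaxed check, assuming FALLOC_FL_KEEP_SIZE
requests are simply let through: the helper name and the boolean
parameter are hypothetical, and the real check would live in
shmem_fallocate() under i_mutex.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SEAL_GROW 0x0004        /* same value as F_SEAL_GROW in the quoted hunk */

/*
 * Returns true if an fallocate() of [offset, offset + len) must be refused.
 * keep_size models FALLOC_FL_KEEP_SIZE: i_size stays unchanged, so no new
 * data becomes accessible and the request can be permitted.
 */
static bool model_falloc_denied(unsigned int seals, int64_t offset,
				int64_t len, int64_t i_size, bool keep_size)
{
	assert(offset >= 0 && len > 0 && len <= INT64_MAX - offset);

	if (!(seals & SEAL_GROW))
		return false;
	if (keep_size)
		return false;
	return offset + len > i_size;
}
```

Filling a hole that lies entirely below i_size passes either way, since
offset + len does not exceed the current size.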

>> +             error = -EPERM;
>> +             goto out;
>> +     }
>> +
>>       start = offset >> PAGE_CACHE_SHIFT;
>>       end = (offset + len + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT;
>>       /* Try to avoid a swapstorm if len is impossible to satisfy */
>> --
>> 1.9.2
>
> There is also, or may be, a small issue of sparse (holey) files.
> I do have a question on that in comments on your next patch, and
> the answer here may depend on what you want in memfd_create().
>
> What I'm thinking of here is that once a sparse file is sealed
> against writing, we must be sure not to give an error when reading
> its holes: whereas there are a few unlikely ways in which reading
> the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.
>
> Most of the memory allocations here can in fact only fail when the
> allocating process has already been selected for OOM-kill: that is
> not guaranteed forever, but it is how __alloc_pages_slowpath()
> currently behaves on ordinary low-order allocations, and will be
> hard to change if we ever do so.  Though I dislike relying upon
> this, I think we can allow reading holes to fail, if the process
> is going to be forcibly killed before it returns to userspace.
>
> But there might still be an issue with vm_enough_memory(),
> and there might still be an issue with memcg limits.
>
> We do already use the ZERO_PAGE instead of allocating when it's a
> simple read; and on the face of it, we could extend that to mmap
> once the file is sealed.  But I am rather afraid to do so - for
> many years there was an mmap /dev/zero case which did that, but
> it was an easily forgotten case which caught us out at least
> once, so I'm reluctant to reintroduce it now for sealing.
>
> Anyway, I don't expect you to resolve the issue of sealed holes:
> that's very much my territory, to give you support on.

Why not require users to use mlock() if they want to protect
themselves against OOM situations? At least the man-page says that
mlock() guarantees that all pages in the specified range are loaded. I
didn't verify whether that includes holes, though. And if
RLIMIT_MEMLOCK is too small, users ought to access the object in
smaller chunks.
And it's not specific to sparse files. Any other page may be swapped
out and the swap-in can fail due to ENOMEM (page-table allocations,
tree-inserts, and so on). But you definitely know better what to do
here, so suggestions welcome.
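
For reference, a minimal userspace sketch of the mlock() approach. It
pins a single anonymous page rather than a memfd mapping to keep the
example portable, and the helper name is hypothetical. Whether mlock()
also instantiates holes of a sparse shmem file is not checked here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Maps one page, pins it with mlock(), touches it, and cleans up.
 * Returns 0 on success or a negative errno value.
 */
static int lock_one_page(void)
{
	long page = sysconf(_SC_PAGESIZE);
	unsigned char *p;
	size_t len;
	int rc = 0;

	assert(page > 0);
	len = (size_t)page;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -errno;

	if (mlock(p, len) != 0) {
		/* e.g. ENOMEM or EPERM when RLIMIT_MEMLOCK is too small */
		rc = -errno;
	} else {
		p[0] = 1;       /* resident now, so this access cannot fault */
		munlock(p, len);
	}

	munmap(p, len);
	return rc;
}
```

A caller that hits the RLIMIT_MEMLOCK failure can retry with a smaller
range and walk the object chunk by chunk.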

Anyway, sealing is not meant to protect against OOM situations. I
mean, any mapping is subject to OOM, so processes that care should
have a suitable infrastructure via SIGBUS or mlock() for all mappings,
including sealed files. Furthermore, write-sealing is meant to prevent
targeted attacks that modify data while it is being parsed. We
properly protect users against that. OOM is an orthogonal issue, imho.
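
A hedged userspace sketch of that use-case: the sender seals a buffer
and the receiver checks the seals before parsing. It assumes the
interface as it was later merged, where memfd_create() takes only a
name and flags (unlike the three-argument v2 discussed in this thread),
and it relies on the glibc wrappers. The helper names and the
"sealed-demo" name are made up for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if fd carries at least the seals in `required`, 0 if not,
 * or a negative errno value if seals cannot be queried on this fd.
 */
static int has_seals(int fd, int required)
{
	int seals = fcntl(fd, F_GET_SEALS);

	if (seals < 0)
		return -errno;
	return (seals & required) == required;
}

/* Sender side: create, fill, and seal a buffer. Returns an fd or -errno. */
static int make_sealed_buffer(const void *data, size_t len)
{
	ssize_t n;
	int err = 0;
	int fd;

	assert(data != NULL);

	fd = memfd_create("sealed-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -errno;

	n = write(fd, data, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;
	else if (fcntl(fd, F_ADD_SEALS,
		       F_SEAL_SHRINK | F_SEAL_GROW | F_SEAL_WRITE) != 0)
		err = -errno;

	if (err) {
		close(fd);
		return err;
	}
	return fd;
}
```

A receiver would call has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK) and
refuse to parse the buffer unless it returns 1.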

Moreover, if we guarantee that sealed files are always present in
memory, we give users a way to circumvent RLIMIT_MEMLOCK (only for
readable mappings, but still..).

Thanks a lot for the review!
David


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-05-20  2:20     ` Hugh Dickins
@ 2014-05-23 16:57       ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-23 16:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi

On Tue, May 20, 2014 at 4:20 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It can support sealing and avoids any
>> connection to user-visible mount-points. Thus, it's not subject to quotas
>> on mounted file-systems, and can be used like malloc()'ed memory, but
>> with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>
> What is a front-FD?

With 'front-FD' I refer to things like dma-buf: They allocate a
file-descriptor which is just a wrapper around a kernel-internal FD.
For instance, DRM-gem buffers exported as dma-buf. fops on the dma-buf
are forwarded to the shmem-fd of the given gem-object, but any access
to the inode of the dma-buf fd is a no-op as the dma-buf fd uses
anon-inode, not the shmem-inode.

A previous revision of memfd used something like that, but that was
inherently racy.

>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. If you
>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>> supported (like on all other regular files).
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and the like.
>
> You mention quotas a couple of times, and I want to be clear about that.
>
> I think you are mainly thinking of the "df" size limitation which comes
> by default on a tmpfs mount, but can be retuned or removed with the
> size= or nr_blocks= mount options.  You want memfd_create() to be free
> of that limitation, which indeed it is.
>
> (I'm not proud of the way in which an unlimited tmpfs mount can easily
> be used to OOM the system, killing processes which do little to give
> back the memory needed; but that's how it is, and you're not making
> that worse, just adding a further interface to it.)
>
> And we have never implemented fs/quota/-style quotas on tmpfs,
> so you're certainly free from those.
>
> But a created memfd is still subject to an RLIMIT_FSIZE limit, and
> to a memcg's memory.limit_in_bytes and memory.memsw.limit_in_bytes:
> I expect you don't care about those, that they would be unlimited
> in the cases that you care about.
>
> And a created memfd is still subject to __vm_enough_memory() limiting:
> unlimited when OVERCOMMIT_ALWAYS, a little unpredictable when
> OVERCOMMIT_GUESS, strictly accounted when OVERCOMMIT_NEVER.  I don't
> think we can compromise on OVERCOMMIT_NEVER, but if OVERCOMMIT_GUESS
> gives you a problem, we could probably tweak it for your case.
> More on this below, when considering the size arg to memfd_create().

Yes, memfd_create() is supposed to be the same as mmap(MAP_ANON) and
I'm aware of the limits you describe (at least some of them..). But
thanks a lot for listing them, I will try to document them in my
man-pages.
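
A minimal sketch of the "anonymous memory with a file-descriptor" usage
described above. It assumes the interface as later merged, where
memfd_create() takes only a name and flags and the size is set with
ftruncate(); the helper name and the "demo" name are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates a memfd of `size` bytes, writes a greeting through a shared
 * mapping and reads it back through the fd. Returns 0 or -errno.
 */
static int memfd_roundtrip(size_t size)
{
	static const char msg[] = "hello";
	char buf[sizeof(msg)] = { 0 };
	struct stat st;
	char *p;
	int err = 0;
	int fd;

	assert(size >= sizeof(msg));

	fd = memfd_create("demo", MFD_CLOEXEC);
	if (fd < 0)
		return -errno;

	/* Still subject to RLIMIT_FSIZE, memcg and overcommit limits. */
	if (ftruncate(fd, (off_t)size) != 0) {
		err = -errno;
		goto out;
	}

	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		err = -errno;
		goto out;
	}
	memcpy(p, msg, sizeof(msg));
	munmap(p, size);

	/* The fd behaves like a regular file: pread() and fstat() work. */
	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf) ||
	    memcmp(buf, msg, sizeof(msg)) != 0)
		err = -EIO;
	else if (fstat(fd, &st) != 0)
		err = -errno;
	else if (!S_ISREG(st.st_mode) || st.st_size != (off_t)size)
		err = -EIO;
out:
	close(fd);
	return err;
}
```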

>>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> ---
>>  arch/x86/syscalls/syscall_32.tbl |  1 +
>>  arch/x86/syscalls/syscall_64.tbl |  1 +
>
> Okay.  No point in cluttering the patchset with other architectures
> until this is closer to merge.  Miklos Szeredi's recent patches
> "add renameat2 syscall" provide a very helpful precedent to follow.

Thanks for the hint to renameat2. I will include other architectures
as a follow-up once we have agreed on the implementation.

>>  include/linux/syscalls.h         |  1 +
>>  include/uapi/linux/memfd.h       | 10 ++++++
>>  kernel/sys_ni.c                  |  1 +
>>  mm/shmem.c                       | 74 ++++++++++++++++++++++++++++++++++++++++
>>  6 files changed, 88 insertions(+)
>>  create mode 100644 include/uapi/linux/memfd.h
>>
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index 96bc506..c943b8a 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -359,3 +359,4 @@
>>  350  i386    finit_module            sys_finit_module
>>  351  i386    sched_setattr           sys_sched_setattr
>>  352  i386    sched_getattr           sys_sched_getattr
>> +353  i386    memfd_create            sys_memfd_create
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index 04376ac..dfcfd6f 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -323,6 +323,7 @@
>>  314  common  sched_setattr           sys_sched_setattr
>>  315  common  sched_getattr           sys_sched_getattr
>>  316  common  renameat2               sys_renameat2
>> +317  common  memfd_create            sys_memfd_create
>>
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index a4a0588..133b705 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>>  asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>>  asmlinkage long sys_eventfd(unsigned int count);
>>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
>> +asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
>>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>>  asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>>  asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
>> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
>> new file mode 100644
>> index 0000000..c4a6db0
>> --- /dev/null
>> +++ b/include/uapi/linux/memfd.h
>> @@ -0,0 +1,10 @@
>> +#ifndef _UAPI_LINUX_MEMFD_H
>> +#define _UAPI_LINUX_MEMFD_H
>> +
>> +#include <linux/types.h>
>
> Why include linux/types.h in this one?
>
>> +
>> +/* flags for memfd_create(2) (u64) */
>> +#define MFD_CLOEXEC          0x0001ULL
>> +#define MFD_ALLOW_SEALING    0x0002ULL
>> +
>> +#endif /* _UAPI_LINUX_MEMFD_H */
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index bc8d1b7..f96c329 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -195,6 +195,7 @@ cond_syscall(compat_sys_timerfd_settime);
>>  cond_syscall(compat_sys_timerfd_gettime);
>>  cond_syscall(sys_eventfd);
>>  cond_syscall(sys_eventfd2);
>> +cond_syscall(sys_memfd_create);
>>
>>  /* performance counters: */
>>  cond_syscall(sys_perf_event_open);
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 175a5b8..203cc4e 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
>>  #include <linux/highmem.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/magic.h>
>> +#include <linux/syscalls.h>
>>  #include <linux/fcntl.h>
>> +#include <uapi/linux/memfd.h>
>>
>>  #include <asm/uaccess.h>
>>  #include <asm/pgtable.h>
>> @@ -2919,6 +2921,78 @@ out4:
>>       return error;
>>  }
>>
>
> Whereas 1/3's sealing stuff was under CONFIG_TMPFS, this is in a
> CONFIG_SHMEM part of mm/shmem.c, built even when !CONFIG_TMPFS: in
> which case you could not write to or truncate the object created,
> just mmap it and access it that way (like SysV SHM).  Not necessarily
> wrong, but it may prevent surprises to put this under CONFIG_TMPFS:
> the user gets an fd, so probably expects filesystem operations to work.
>
>> +#define MFD_NAME_PREFIX "memfd:"
>> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>> +
>> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
>> +
>> +SYSCALL_DEFINE3(memfd_create,
>> +             const char*, uname,
>> +             u64, size,
>> +             u64, flags)
>
> If I'd come in earlier, I'd have probably looked for another name
> than memfd_create; but I don't have anything better in mind, and
> you've done a great job of sounding out potential users, so let's
> stick with the name everyone is expecting.
>
> The uname: it's a funny thing, not belonging in a filesystem tree;
> but you're very sure you want it, and we already make up funny
> names for SysV SHM and /dev/zero objects, so okay.
>
> The size: u64 or loff_t or size_t?  But more on size below.
>
> The flags: u64?  That's a big future you're allowing for!
> open and mmap use ints for their flags, will this really need more?
>
> But I don't think I've been present at the birth of a syscall before:
> there are probably several considerations that I'm unaware of, that
> you may have factored in - listen to the experts, not to me.
>
>> +{
>> +     struct shmem_inode_info *info;
>> +     struct file *shm;
>
> "struct file *file" is more usual.
>
>> +     char *name;
>> +     int fd, r;
>
> "int err" or "int error" rather than "int r".
>
>> +     long len;
>> +
>> +     if (flags & ~(u64)MFD_ALL_FLAGS)
>> +             return -EINVAL;
>> +     if ((u64)(loff_t)size != size || (loff_t)size < 0)
>> +             return -EINVAL;
>> +
>> +     /* length includes terminating zero */
>> +     len = strnlen_user(uname, MFD_NAME_MAX_LEN);
>> +     if (len <= 0)
>> +             return -EFAULT;
>> +     else if (len > MFD_NAME_MAX_LEN)
>
> Please omit the "else ".
>
> And, since strnlen_user() returns length including terminating NUL,
> wouldn't it be more exact to use MFD_NAME_MAX_LEN + 1 in those two
> places above?
>
>> +             return -EINVAL;
>> +
>> +     name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_KERNEL);
>
> Probably better to say GFP_TEMPORARY than GFP_KERNEL,
> though it doesn't seem to be used very much at all.
>
>> +     if (!name)
>> +             return -ENOMEM;
>> +
>> +     strcpy(name, MFD_NAME_PREFIX);
>> +     if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
>> +             r = -EFAULT;
>> +             goto err_name;
>> +     }
>> +
>> +     /* terminating-zero may have changed after strnlen_user() returned */
>> +     if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
>> +             r = -EFAULT;
>> +             goto err_name;
>> +     }
>> +
>> +     fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
>> +     if (fd < 0) {
>> +             r = fd;
>> +             goto err_name;
>> +     }
>> +
>> +     shm = shmem_file_setup(name, size, 0);
>
> That's an interesting line: I am anxious to know whether you mean to
> pass flags 0 there, or would rather pass VM_NORESERVE.  Passing 0
> makes the object resemble mmap or SysV SHM, in accounting for the
> whole size upfront; passing VM_NORESERVE makes the object resemble
> a tmpfs file, accounted page by page as they are instantiated.

I definitely meant to use VM_NORESERVE. Thanks for pointing that out.

> Accounting meaning calls to __vm_enough_memory() in mm/mmap.c:
> whose behaviour is governed by /proc/sys/vm/overcommit_memory
> (and overcommit_kbytes or overcommit_ratio): OVERCOMMIT_ALWAYS
> (no enforcement), OVERCOMMIT_GUESS (default) or OVERCOMMIT_NEVER
> (enforcing strict no-overcommit).
>
> We have a small problem if you really intend flags 0: because then
> that size is preaccounted, yet we also allow these objects to grow
> or be truncated without accounting, and the number (/proc/meminfo's
> Committed_AS) will go wrong.
>
> If you really intend that preaccounting, then we need to add an
> orig_size field to shmem_inode_info, and treat pages below that
> as preaccounted, but pages above it to be accounted one by one.
> If you don't intend preaccounting, then please pass VM_NORESERVE
> to shmem_file_setup().
>
> But this does highlight how the "size" arg to memfd_create() is
> perhaps redundant.  Why give a size there, when size can be changed
> afterwards?  I expect your answer is that many callers want to choose
> the size at the beginning, and would prefer to avoid the extra call.
> I'm not sure if that's a good enough reason for a redundant argument.

At some point we might be required to support atomic-sealing.
So a memfd_create() call takes the initial seals as upper 32bits in
"flags" and sets them before returning the object. If these seals
contain SEAL_GROW/SHRINK, we must pass the size during setup (think
CLOEXEC with fork()).

Note that we spent a lot of time discussing whether such
atomic-sealing is necessary and no-one came up with a real race so
far. Therefore, I didn't include that. But especially if we add new
seals (like SHMEM_SEAL_OPEN, which I still think is not needed and
just hides real problems), we might at one point be required to
support that. That's also the reason why "flags" is 64bits.

One might argue that we can just add memfd_create2() once that
happens, but I didn't see any harm in including "size" and making them
64bit.
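
To make the atomic-sealing idea concrete, here is a purely hypothetical
sketch of how initial seals could be packed into the upper 32 bits of a
64-bit flags word. Nothing like this is part of the series; the DEMO_*
names and the validation rules are assumptions for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror MFD_* and F_SEAL_* from the quoted hunks. */
#define DEMO_MFD_CLOEXEC        0x0001u
#define DEMO_MFD_ALLOW_SEALING  0x0002u
#define DEMO_MFD_ALL_FLAGS      (DEMO_MFD_CLOEXEC | DEMO_MFD_ALLOW_SEALING)
#define DEMO_ALL_SEALS          0x000fu /* SEAL | SHRINK | GROW | WRITE */

/* Lower 32 bits: creation flags. Upper 32 bits: seals to set atomically. */
static uint64_t demo_pack(uint32_t mfd_flags, uint32_t initial_seals)
{
	uint64_t packed = ((uint64_t)initial_seals << 32) | mfd_flags;

	assert((uint32_t)(packed >> 32) == initial_seals);
	return packed;
}

static uint32_t demo_flags(uint64_t packed)
{
	return (uint32_t)packed;
}

static uint32_t demo_seals(uint64_t packed)
{
	return (uint32_t)(packed >> 32);
}

/* Returns 1 if the packed word is acceptable, 0 otherwise. */
static int demo_valid(uint64_t packed)
{
	uint32_t flags = demo_flags(packed);
	uint32_t seals = demo_seals(packed);

	if (flags & ~DEMO_MFD_ALL_FLAGS)
		return 0;
	if (seals & ~DEMO_ALL_SEALS)
		return 0;
	/* initial seals only make sense on an object that allows sealing */
	if (seals && !(flags & DEMO_MFD_ALLOW_SEALING))
		return 0;
	return 1;
}
```

With initial seals that include SEAL_GROW or SEAL_SHRINK, the size has
to be known at creation time, which is the argument for keeping "size"
in the syscall.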

I've fixed all your other concerns.

Thanks
David

>> +     if (IS_ERR(shm)) {
>> +             r = PTR_ERR(shm);
>> +             goto err_fd;
>> +     }
>> +     info = SHMEM_I(file_inode(shm));
>> +     shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>> +     if (flags & MFD_ALLOW_SEALING)
>> +             info->seals |= SHMEM_ALLOW_SEALING;
>
> In comments on 1/3 I suggest removing F_SEAL_SEAL instead here.
>
>> +
>> +     fd_install(fd, shm);
>> +     kfree(name);
>> +     return fd;
>> +
>> +err_fd:
>> +     put_unused_fd(fd);
>> +err_name:
>> +     kfree(name);
>> +     return r;
>> +}
>> +
>>  #else /* !CONFIG_SHMEM */
>>
>>  /*
>> --
>> 1.9.2

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
@ 2014-05-23 16:57       ` David Herrmann
  0 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-23 16:57 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirsky, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi

On Tue, May 20, 2014 at 4:20 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>
>> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
>> that you can pass to mmap(). It can support sealing and avoids any
>> connection to user-visible mount-points. Thus, it's not subject to quotas
>> on mounted file-systems and can be used like malloc()'ed memory, but
>> with a file-descriptor to it.
>>
>> memfd_create() does not create a front-FD, but instead returns the raw
>
> What is a front-FD?

With 'front-FD' I refer to things like dma-buf: They allocate a
file-descriptor which is just a wrapper around a kernel-internal FD.
For instance, DRM-gem buffers exported as dma-buf. fops on the dma-buf
are forwarded to the shmem-fd of the given gem-object, but any access
to the inode of the dma-buf fd is a no-op as the dma-buf fd uses
anon-inode, not the shmem-inode.

A previous revision of memfd used something like that, but that was
inherently racy.

>> shmem file, so calls like ftruncate() can be used. Also calls like fstat()
>> will return proper information and mark the file as regular file. If you
>> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
>> supported (like on all other regular files).
>>
>> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
>> subject to quotas and the like.
>
> You mention quotas a couple of times, and I want to be clear about that.
>
> I think you are mainly thinking of the "df" size limitation which comes
> by default on a tmpfs mount, but can be retuned or removed with the
> size= or nr_blocks= mount options.  You want memfd_create() to be free
> of that limitation, which indeed it is.
>
> (I'm not proud of the way in which an unlimited tmpfs mount can easily
> be used to OOM the system, killing processes which do little to give
> back the memory needed; but that's how it is, and you're not making
> that worse, just adding a further interface to it.)
>
> And we have never implemented fs/quota/-style quotas on tmpfs,
> so you're certainly free from those.
>
> But a created memfd is still subject to an RLIMIT_FSIZE limit, and
> to a memcg's memory.limit_in_bytes and memory.memsw.limit_in_bytes:
> I expect you don't care about those, that they would be unlimited
> in the cases that you care about.
>
> And a created memfd is still subject to __vm_enough_memory() limiting:
> unlimited when OVERCOMMIT_ALWAYS, a little unpredictable when
> OVERCOMMIT_GUESS, strictly accounted when OVERCOMMIT_NEVER.  I don't
> think we can compromise on OVERCOMMIT_NEVER, but if OVERCOMMIT_GUESS
> gives you a problem, we could probably tweak it for your case.
> More on this below, when considering the size arg to memfd_create().

Yes, memfd_create() is supposed to be the same as mmap(MAP_ANON) and
I'm aware of the limits you describe (at least some of them..). But
thanks a lot for listing them, I will try to document them in my
man-pages.
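
To make the "regular file" behaviour discussed above concrete, here is a
minimal userspace sketch. It assumes the two-argument memfd_create(name,
flags) form that was eventually merged (Linux 3.17, glibc 2.27 wrapper), not
the three-argument prototype proposed in this v2 series, so the size is set
with ftruncate() afterwards. The function name and the 8192-byte size are
hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* for memfd_create() and MFD_CLOEXEC */
#endif
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if the memfd behaves like a resizable regular file, -1 otherwise. */
static int memfd_regular_file_demo(void)
{
	const off_t size = 8192;	/* hypothetical size for this sketch */
	char buf[6] = { 0 };
	struct stat st;
	char *p;
	int fd, ret = -1;

	/* Merged two-argument form; the v2 patch also passed a size here. */
	fd = memfd_create("demo", MFD_CLOEXEC);
	if (fd < 0)
		return -1;

	/* The object starts empty and is resized like any regular file. */
	if (ftruncate(fd, size) < 0 || fstat(fd, &st) < 0)
		goto out;
	if (!S_ISREG(st.st_mode) || st.st_size != size)
		goto out;

	/* A shared mapping and the fd refer to the same pages. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out;
	strcpy(p, "hello");
	if (pread(fd, buf, 5, 0) == 5 && strcmp(buf, "hello") == 0)
		ret = 0;
	munmap(p, size);
out:
	close(fd);
	return ret;
}
```

A caller would simply check memfd_regular_file_demo() == 0. The limits
listed above (RLIMIT_FSIZE, memcg limits, overcommit accounting) still apply
to the ftruncate() and to the page faults behind the strcpy().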

>>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> ---
>>  arch/x86/syscalls/syscall_32.tbl |  1 +
>>  arch/x86/syscalls/syscall_64.tbl |  1 +
>
> Okay.  No point in cluttering the patchset with other architectures
> until this is closer to merge.  Miklos Szeredi's recent patches
> "add renameat2 syscall" provide a very helpful precedent to follow.

Thanks for the hint to renameat2. I will include other architectures
as a follow-up once we have agreed on the implementation.

>>  include/linux/syscalls.h         |  1 +
>>  include/uapi/linux/memfd.h       | 10 ++++++
>>  kernel/sys_ni.c                  |  1 +
>>  mm/shmem.c                       | 74 ++++++++++++++++++++++++++++++++++++++++
>>  6 files changed, 88 insertions(+)
>>  create mode 100644 include/uapi/linux/memfd.h
>>
>> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
>> index 96bc506..c943b8a 100644
>> --- a/arch/x86/syscalls/syscall_32.tbl
>> +++ b/arch/x86/syscalls/syscall_32.tbl
>> @@ -359,3 +359,4 @@
>>  350  i386    finit_module            sys_finit_module
>>  351  i386    sched_setattr           sys_sched_setattr
>>  352  i386    sched_getattr           sys_sched_getattr
>> +353  i386    memfd_create            sys_memfd_create
>> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
>> index 04376ac..dfcfd6f 100644
>> --- a/arch/x86/syscalls/syscall_64.tbl
>> +++ b/arch/x86/syscalls/syscall_64.tbl
>> @@ -323,6 +323,7 @@
>>  314  common  sched_setattr           sys_sched_setattr
>>  315  common  sched_getattr           sys_sched_getattr
>>  316  common  renameat2               sys_renameat2
>> +317  common  memfd_create            sys_memfd_create
>>
>>  #
>>  # x32-specific system call numbers start at 512 to avoid cache impact
>> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
>> index a4a0588..133b705 100644
>> --- a/include/linux/syscalls.h
>> +++ b/include/linux/syscalls.h
>> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>>  asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>>  asmlinkage long sys_eventfd(unsigned int count);
>>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
>> +asmlinkage long sys_memfd_create(const char *uname_ptr, u64 size, u64 flags);
>>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>>  asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>>  asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
>> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
>> new file mode 100644
>> index 0000000..c4a6db0
>> --- /dev/null
>> +++ b/include/uapi/linux/memfd.h
>> @@ -0,0 +1,10 @@
>> +#ifndef _UAPI_LINUX_MEMFD_H
>> +#define _UAPI_LINUX_MEMFD_H
>> +
>> +#include <linux/types.h>
>
> Why include linux/types.h in this one?
>
>> +
>> +/* flags for memfd_create(2) (u64) */
>> +#define MFD_CLOEXEC          0x0001ULL
>> +#define MFD_ALLOW_SEALING    0x0002ULL
>> +
>> +#endif /* _UAPI_LINUX_MEMFD_H */
>> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
>> index bc8d1b7..f96c329 100644
>> --- a/kernel/sys_ni.c
>> +++ b/kernel/sys_ni.c
>> @@ -195,6 +195,7 @@ cond_syscall(compat_sys_timerfd_settime);
>>  cond_syscall(compat_sys_timerfd_gettime);
>>  cond_syscall(sys_eventfd);
>>  cond_syscall(sys_eventfd2);
>> +cond_syscall(sys_memfd_create);
>>
>>  /* performance counters: */
>>  cond_syscall(sys_perf_event_open);
>> diff --git a/mm/shmem.c b/mm/shmem.c
>> index 175a5b8..203cc4e 100644
>> --- a/mm/shmem.c
>> +++ b/mm/shmem.c
>> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
>>  #include <linux/highmem.h>
>>  #include <linux/seq_file.h>
>>  #include <linux/magic.h>
>> +#include <linux/syscalls.h>
>>  #include <linux/fcntl.h>
>> +#include <uapi/linux/memfd.h>
>>
>>  #include <asm/uaccess.h>
>>  #include <asm/pgtable.h>
>> @@ -2919,6 +2921,78 @@ out4:
>>       return error;
>>  }
>>
>
> Whereas 1/3's sealing stuff was under CONFIG_TMPFS, this is in a
> CONFIG_SHMEM part of mm/shmem.c, built even when !CONFIG_TMPFS: in
> which case you could not write to or truncate the object created,
> just mmap it and access it that way (like SysV SHM).  Not necessarily
> wrong, but it may prevent surprises to put this under CONFIG_TMPFS:
> the user gets an fd, so probably expects filesystem operations to work.
>
>> +#define MFD_NAME_PREFIX "memfd:"
>> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
>> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>> +
>> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
>> +
>> +SYSCALL_DEFINE3(memfd_create,
>> +             const char*, uname,
>> +             u64, size,
>> +             u64, flags)
>
> If I'd come in earlier, I'd have probably looked for another name
> than memfd_create; but I don't have anything better in mind, and
> you've done a great job of sounding out potential users, so let's
> stick with the name everyone is expecting.
>
> The uname: it's a funny thing, not belonging in a filesystem tree;
> but you're very sure you want it, and we already make up funny
> names for SysV SHM and /dev/zero objects, so okay.
>
> The size: u64 or loff_t or size_t?  But more on size below.
>
> The flags: u64?  That's a big future you're allowing for!
> open and mmap use ints for their flags, will this really need more?
>
> But I don't think I've been present at the birth of a syscall before:
> there are probably several considerations that I'm unaware of, that
> you may have factored in - listen to the experts, not to me.
>
>> +{
>> +     struct shmem_inode_info *info;
>> +     struct file *shm;
>
> "struct file *file" is more usual.
>
>> +     char *name;
>> +     int fd, r;
>
> "int err" or "int error" rather than "int r".
>
>> +     long len;
>> +
>> +     if (flags & ~(u64)MFD_ALL_FLAGS)
>> +             return -EINVAL;
>> +     if ((u64)(loff_t)size != size || (loff_t)size < 0)
>> +             return -EINVAL;
>> +
>> +     /* length includes terminating zero */
>> +     len = strnlen_user(uname, MFD_NAME_MAX_LEN);
>> +     if (len <= 0)
>> +             return -EFAULT;
>> +     else if (len > MFD_NAME_MAX_LEN)
>
> Please omit the "else ".
>
> And, since strnlen_user() returns length including terminating NUL,
> wouldn't it be more exact to use MFD_NAME_MAX_LEN + 1 in those two
> places above?
>
>> +             return -EINVAL;
>> +
>> +     name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_KERNEL);
>
> Probably better to say GFP_TEMPORARY than GFP_KERNEL,
> though it doesn't seem to be used very much at all.
>
>> +     if (!name)
>> +             return -ENOMEM;
>> +
>> +     strcpy(name, MFD_NAME_PREFIX);
>> +     if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
>> +             r = -EFAULT;
>> +             goto err_name;
>> +     }
>> +
>> +     /* terminating-zero may have changed after strnlen_user() returned */
>> +     if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
>> +             r = -EFAULT;
>> +             goto err_name;
>> +     }
>> +
>> +     fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
>> +     if (fd < 0) {
>> +             r = fd;
>> +             goto err_name;
>> +     }
>> +
>> +     shm = shmem_file_setup(name, size, 0);
>
> That's an interesting line: I am anxious to know whether you mean to
> pass flags 0 there, or would rather pass VM_NORESERVE.  Passing 0
> makes the object resemble mmap or SysV SHM, in accounting for the
> whole size upfront; passing VM_NORESERVE makes the object resemble
> a tmpfs file, accounted page by page as they are instantiated.

I definitely meant to use VM_NORESERVE. Thanks for pointing that out.

> Accounting meaning calls to __vm_enough_memory() in mm/mmap.c:
> whose behaviour is governed by /proc/sys/vm/overcommit_memory
> (and overcommit_kbytes or overcommit_ratio): OVERCOMMIT_ALWAYS
> (no enforcement), OVERCOMMIT_GUESS (default) or OVERCOMMIT_NEVER
> (enforcing strict no-overcommit).
>
> We have a small problem if you really intend flags 0: because then
> that size is preaccounted, yet we also allow these objects to grow
> or be truncated without accounting, and the number (/proc/meminfo's
> Committed_AS) will go wrong.
>
> If you really intend that preaccounting, then we need to add an
> orig_size field to shmem_inode_info, and treat pages below that
> as preaccounted, but pages above it to be accounted one by one.
> If you don't intend preaccounting, then please pass VM_NORESERVE
> to shmem_file_setup().
>
> But this does highlight how the "size" arg to memfd_create() is
> perhaps redundant.  Why give a size there, when size can be changed
> afterwards?  I expect your answer is that many callers want to choose
> the size at the beginning, and would prefer to avoid the extra call.
> I'm not sure if that's a good enough reason for a redundant argument.

At some point we might be required to support atomic sealing: a
memfd_create() call would then take the initial seals in the upper
32 bits of "flags" and set them before returning the object. If these
seals contain SEAL_GROW/SHRINK, we must pass the size during setup
(think CLOEXEC with fork()).

Note that we spent a lot of time discussing whether such
atomic-sealing is necessary and no-one came up with a real race so
far. Therefore, I didn't include that. But especially if we add new
seals (like SHMEM_SEAL_OPEN, which I still think is not needed and
just hides real problems), we might at one point be required to
support that. That's also the reason why "flags" is 64bits.

One might argue that we can just add memfd_create2() once that
happens, but I didn't see any harm in including "size" and making them
64bit.
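
Purely as an illustration of the "seals in the upper 32 bits of flags" idea,
the packing could look like the sketch below. Every name here (HYP_*, hyp_*)
is hypothetical; this series does not implement atomic sealing, and nothing
in the thread fixes such a layout.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: low 32 bits carry MFD_* flags, high 32 bits seals. */
#define HYP_MFD_SEAL_SHIFT	32
#define HYP_MFD_FLAG_MASK	0xffffffffULL

/* Combine creation flags and initial seals into one 64-bit word. */
static uint64_t hyp_pack_flags(uint32_t mfd_flags, uint32_t seals)
{
	uint64_t word = ((uint64_t)seals << HYP_MFD_SEAL_SHIFT) | mfd_flags;

	/* Packing must be lossless in both halves. */
	assert((uint32_t)(word & HYP_MFD_FLAG_MASK) == mfd_flags);
	assert((uint32_t)(word >> HYP_MFD_SEAL_SHIFT) == seals);
	return word;
}

/* Extract the creation flags (low half). */
static uint32_t hyp_unpack_flags(uint64_t word)
{
	return (uint32_t)(word & HYP_MFD_FLAG_MASK);
}

/* Extract the initial seals (high half). */
static uint32_t hyp_unpack_seals(uint64_t word)
{
	return (uint32_t)(word >> HYP_MFD_SEAL_SHIFT);
}
```

With such a layout the kernel would have to apply the seals before
fd_install(), which is why an initial size would need to be passed in the
same call if the seals included GROW/SHRINK.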

I've fixed all your other concerns.

Thanks
David

>> +     if (IS_ERR(shm)) {
>> +             r = PTR_ERR(shm);
>> +             goto err_fd;
>> +     }
>> +     info = SHMEM_I(file_inode(shm));
>> +     shm->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
>> +     if (flags & MFD_ALLOW_SEALING)
>> +             info->seals |= SHMEM_ALLOW_SEALING;
>
> In comments on 1/3 I suggest removing F_SEAL_SEAL instead here.
>
>> +
>> +     fd_install(fd, shm);
>> +     kfree(name);
>> +     return fd;
>> +
>> +err_fd:
>> +     put_unused_fd(fd);
>> +err_name:
>> +     kfree(name);
>> +     return r;
>> +}
>> +
>>  #else /* !CONFIG_SHMEM */
>>
>>  /*
>> --
>> 1.9.2


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
  2014-05-20  2:22     ` Hugh Dickins
@ 2014-05-23 17:06       ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-23 17:06 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirsky, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi

On Tue, May 20, 2014 at 4:22 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>
>> Some basic tests to verify sealing on memfds works as expected and
>> guarantees the advertised semantics.
>
> Thanks for providing these.
>
> A few remarks below, and I should note one oddity.
>
> Curious about leaks (probably none, I was merely curious), I tried to
> run memfd_test 4096 times in succession, and never succeeded.  After
> many iterations, the 32-bit one tends to hang somewhere just before
> reaching the DONE, and the 64-bit one gave me some kind of assert
> error from a library.
>
> I expect there's some threading race around join_idle_thread():
> which I think you will sort out infinitely sooner than I would.
> No need to fix it right now: the test works well enough.

Ugh, I will look into that. Didn't see anything obvious so far.

>>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> ---
>>  tools/testing/selftests/Makefile           |   1 +
>>  tools/testing/selftests/memfd/.gitignore   |   2 +
>>  tools/testing/selftests/memfd/Makefile     |  29 +
>>  tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
>>  4 files changed, 976 insertions(+)
>>  create mode 100644 tools/testing/selftests/memfd/.gitignore
>>  create mode 100644 tools/testing/selftests/memfd/Makefile
>>  create mode 100644 tools/testing/selftests/memfd/memfd_test.c
>>
>> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
>> index 32487ed..c57325a 100644
>> --- a/tools/testing/selftests/Makefile
>> +++ b/tools/testing/selftests/Makefile
>> @@ -2,6 +2,7 @@ TARGETS = breakpoints
>>  TARGETS += cpu-hotplug
>>  TARGETS += efivarfs
>>  TARGETS += kcmp
>> +TARGETS += memfd
>>  TARGETS += memory-hotplug
>>  TARGETS += mqueue
>>  TARGETS += net
>> diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
>> new file mode 100644
>> index 0000000..bcc8ee2
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/.gitignore
>> @@ -0,0 +1,2 @@
>> +memfd_test
>> +memfd-test-file
>> diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
>> new file mode 100644
>> index 0000000..36653b9
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/Makefile
>> @@ -0,0 +1,29 @@
>> +uname_M := $(shell uname -m 2>/dev/null || echo not)
>> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
>> +ifeq ($(ARCH),i386)
>> +     ARCH := X86
>> +endif
>> +ifeq ($(ARCH),x86_64)
>> +     ARCH := X86
>> +endif
>> +
>> +CFLAGS += -I../../../../arch/x86/include/generated/uapi/
>> +CFLAGS += -I../../../../arch/x86/include/uapi/
>> +CFLAGS += -I../../../../include/uapi/
>> +CFLAGS += -I../../../../include/
>> +
>> +all:
>> +ifeq ($(ARCH),X86)
>> +     gcc $(CFLAGS) memfd_test.c -o memfd_test
>> +else
>> +     echo "Not an x86 target, can't build memfd selftest"
>> +endif
>> +
>> +run_tests: all
>> +ifeq ($(ARCH),X86)
>> +     gcc $(CFLAGS) memfd_test.c -o memfd_test
>> +endif
>> +     @./memfd_test || echo "memfd_test: [FAIL]"
>> +
>> +clean:
>> +     $(RM) memfd_test
>> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
>> new file mode 100644
>> index 0000000..3e105ea
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/memfd_test.c
>> @@ -0,0 +1,944 @@
>> +#define _GNU_SOURCE
>> +#define __EXPORTED_HEADERS__
>> +
>> +#include <errno.h>
>> +#include <inttypes.h>
>> +#include <limits.h>
>> +#include <linux/falloc.h>
>> +#include <linux/fcntl.h>
>> +#include <linux/memfd.h>
>> +#include <sched.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <signal.h>
>> +#include <string.h>
>> +#include <sys/mman.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#define MFD_DEF_SIZE 8192
>> +#define STACK_SIZE 65535
>> +
>> +static int sys_memfd_create(const char *name,
>> +                         __u64 size,
>> +                         __u64 flags)
>> +{
>> +     return syscall(__NR_memfd_create, name, size, flags);
>> +}
>> +
>> +static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
>> +{
>> +     int r;
>> +
>> +     r = sys_memfd_create(name, sz, flags);
>> +     if (r < 0) {
>> +             printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
>> +                    name,
>> +                    (unsigned long long)sz,
>> +                    (unsigned long long)flags);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
>> +{
>> +     int r;
>> +
>> +     r = sys_memfd_create(name, size, flags);
>> +     if (r >= 0) {
>> +             printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
>
> scripts/checkpatch.pl complains about line-length: please ignore it on this.
>
>> +                    name,
>> +                    (unsigned long long)size,
>> +                    (unsigned long long)flags);
>> +             close(r);
>> +             abort();
>> +     }
>> +}
>> +
>> +static __u64 mfd_assert_get_seals(int fd)
>> +{
>> +     long r;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r < 0) {
>> +             printf("GET_SEALS(%d) failed: %m\n", fd);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_get_seals(int fd)
>> +{
>> +     long r;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r >= 0) {
>> +             printf("GET_SEALS(%d) succeeded, but failure expected\n", fd);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_has_seals(int fd, __u64 seals)
>> +{
>> +     __u64 s;
>> +
>> +     s = mfd_assert_get_seals(fd);
>> +     if (s != seals) {
>> +             printf("%llu != %llu = GET_SEALS(%d)\n",
>> +                    (unsigned long long)seals, (unsigned long long)s, fd);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_add_seals(int fd, __u64 seals)
>> +{
>> +     long r;
>> +     __u64 s;
>> +
>> +     s = mfd_assert_get_seals(fd);
>> +     r = fcntl(fd, F_ADD_SEALS, seals);
>> +     if (r < 0) {
>> +             printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
>> +                    fd, (unsigned long long)s, (unsigned long long)seals);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_fail_add_seals(int fd, __u64 seals)
>> +{
>> +     long r;
>> +     __u64 s;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r < 0)
>> +             s = 0;
>> +     else
>> +             s = r;
>> +
>> +     r = fcntl(fd, F_ADD_SEALS, seals);
>> +     if (r >= 0) {
>> +             printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
>> +                    fd, (unsigned long long)s, (unsigned long long)seals);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_size(int fd, size_t size)
>> +{
>> +     struct stat st;
>> +     int r;
>> +
>> +     r = fstat(fd, &st);
>> +     if (r < 0) {
>> +             printf("fstat(%d) failed: %m\n", fd);
>> +             abort();
>> +     } else if (st.st_size != size) {
>> +             printf("wrong file size %lld, but expected %lld\n",
>> +                    (long long)st.st_size, (long long)size);
>> +             abort();
>> +     }
>> +}
>> +
>> +static int mfd_assert_dup(int fd)
>> +{
>> +     int r;
>> +
>> +     r = dup(fd);
>> +     if (r < 0) {
>> +             printf("dup(%d) failed: %m\n", fd);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void *mfd_assert_mmap_shared(int fd)
>> +{
>> +     void *p;
>> +
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return p;
>> +}
>> +
>> +static void *mfd_assert_mmap_private(int fd)
>> +{
>> +     void *p;
>> +
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return p;
>> +}
>> +
>> +static int mfd_assert_open(int fd, int flags, mode_t mode)
>> +{
>> +     char buf[512];
>> +     int r;
>> +
>> +     sprintf(buf, "/proc/self/fd/%d", fd);
>> +     r = open(buf, flags, mode);
>> +     if (r < 0) {
>> +             printf("open(%s) failed: %m\n", buf);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_open(int fd, int flags, mode_t mode)
>> +{
>> +     char buf[512];
>> +     int r;
>> +
>> +     sprintf(buf, "/proc/self/fd/%d", fd);
>> +     r = open(buf, flags, mode);
>> +     if (r >= 0) {
>> +             printf("open(%s) didn't fail as expected\n", buf);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_read(int fd)
>> +{
>> +     char buf[16];
>> +     void *p;
>> +     ssize_t l;
>> +
>> +     l = read(fd, buf, sizeof(buf));
>> +     if (l != sizeof(buf)) {
>> +             printf("read() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ *is* allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify MAP_PRIVATE is *always* allowed (even writable) */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     munmap(p, MFD_DEF_SIZE);
>> +}
>> +
>> +static void mfd_assert_write(int fd)
>> +{
>> +     ssize_t l;
>> +     void *p;
>> +     int r;
>> +
>> +     /* verify write() succeeds */
>> +     l = write(fd, "\0\0\0\0", 4);
>> +     if (l != 4) {
>> +             printf("write() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ | PROT_WRITE is allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     *(char*)p = 0;
>
> scripts/checkpatch.pl complains about (char*): better calm it with (char *).
> Same on two other lines below.
>
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PROT_WRITE is allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     *(char*)p = 0;
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PROT_READ with MAP_SHARED is allowed and a following
>> +      * mprotect(PROT_WRITE) allows writing */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
>> +     if (r < 0) {
>> +             printf("mprotect() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     *(char*)p = 0;
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PUNCH_HOLE works */
>> +     r = fallocate(fd,
>> +                   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                   0,
>> +                   MFD_DEF_SIZE);
>> +     if (r < 0) {
>> +             printf("fallocate(PUNCH_HOLE) failed: %m\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_fail_write(int fd)
>> +{
>> +     ssize_t l;
>> +     void *p;
>> +     int r;
>> +
>> +     /* verify write() fails */
>> +     l = write(fd, "data", 4);
>> +     if (l != -1 || errno != EPERM) {
>> +             printf("expected EPERM on write(), but got %d: %m\n", (int)l);
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ | PROT_WRITE is not allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_WRITE is not allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ with MAP_SHARED is not allowed */
>
> This is a particularly interesting case, checking PROT_READ,MAP_SHARED
> not allowed in mfd_fail_write().  It feels invidious to ask for more
> of a comment, in a test which you have been generous to provide at all.
> But it stopped me short for a while: more comment might help others too.
>
> The reason being (right?) that this fd was opened O_RDWR, so a
> MAP_SHARED mapping would permit a subsequent mprotect(,,PROT_WRITE),
> which sealing the file against writes must prevent.
>
> Your kernel checks rely on VM_SHARED and i_mmap_writable for this
> protection: which is fine, but an implementation detail which could
> be modified in future, if this case were ever to pose a difficulty.

Yes indeed, this is meant to catch VM_MAYWRITE. Currently, every
mmap(MAP_SHARED) on a writable FD allows mprotect(PROT_WRITE) later
on. I thought that's hard-coded ABI so I rely on it here. But I can
definitely add a comment mentioning VM_MAYWRITE.

Given that this test-case fails if I run mfd_fail_write() on a
read-only FD, I might even want to change it to run mmap()+mprotect().
This should clear up all doubts.
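
A rough standalone sketch of that mmap()+mprotect() check follows. It
assumes the merged userspace API (two-argument memfd_create() with
MFD_ALLOW_SEALING, fcntl(F_ADD_SEALS) with F_SEAL_WRITE, glibc 2.27 or
later); the function name and DEMO_SIZE are hypothetical. Because kernels
may differ on whether a read-only MAP_SHARED mapping of a write-sealed file
is refused outright, the sketch only insists that such a mapping, if
granted, cannot be upgraded to writable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* memfd_create(), F_ADD_SEALS, F_SEAL_WRITE */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_SIZE 4096	/* hypothetical object size for this sketch */

/* Returns 0 on the expected behaviour, else the number of the failed step. */
static int seal_write_demo(void)
{
	int step = 1, fd;
	void *p;

	fd = memfd_create("seal-demo", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return step;
	step = 2;
	if (ftruncate(fd, DEMO_SIZE) < 0)
		goto out;

	/* Unsealed, read-write fd: a PROT_READ shared mapping can be upgraded. */
	step = 3;
	p = mmap(NULL, DEMO_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		goto out;
	if (mprotect(p, DEMO_SIZE, PROT_READ | PROT_WRITE) < 0) {
		munmap(p, DEMO_SIZE);
		goto out;
	}
	*(char *)p = 1;
	munmap(p, DEMO_SIZE);	/* no writable mapping may remain when sealing */

	step = 4;
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) < 0)
		goto out;

	/* write() signals failure with -1 and errno, not a negative errno. */
	step = 5;
	if (write(fd, "x", 1) != -1 || errno != EPERM)
		goto out;

	/* A writable shared mapping must be refused after sealing. */
	step = 6;
	p = mmap(NULL, DEMO_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p != MAP_FAILED) {
		munmap(p, DEMO_SIZE);
		goto out;
	}

	/* If a read-only shared mapping is granted, upgrading it must fail. */
	step = 7;
	p = mmap(NULL, DEMO_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (p != MAP_FAILED) {
		int upgraded = mprotect(p, DEMO_SIZE, PROT_READ | PROT_WRITE) == 0;

		munmap(p, DEMO_SIZE);
		if (upgraded)
			goto out;
	}
	step = 0;
out:
	close(fd);
	return step;
}
```

Checking the mprotect() upgrade directly, rather than only the initial
mmap(), keeps the test independent of how the kernel chooses to block
VM_MAYWRITE.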

Thanks!
David

>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PUNCH_HOLE fails */
>> +     r = fallocate(fd,
>> +                   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                   0,
>> +                   MFD_DEF_SIZE);
>> +     if (r >= 0) {
>> +             printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_shrink(int fd)
>> +{
>> +     int r, fd2;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE / 2);
>> +     if (r < 0) {
>> +             printf("ftruncate(SHRINK) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE / 2);
>> +
>> +     fd2 = mfd_assert_open(fd,
>> +                           O_RDWR | O_CREAT | O_TRUNC,
>> +                           S_IRUSR | S_IWUSR);
>> +     close(fd2);
>> +
>> +     mfd_assert_size(fd, 0);
>> +}
>> +
>> +static void mfd_fail_shrink(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE / 2);
>> +     if (r >= 0) {
>> +             printf("ftruncate(SHRINK) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_fail_open(fd,
>> +                   O_RDWR | O_CREAT | O_TRUNC,
>> +                   S_IRUSR | S_IWUSR);
>> +}
>> +
>> +static void mfd_assert_grow(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE * 2);
>> +     if (r < 0) {
>> +             printf("ftruncate(GROW) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 2);
>> +
>> +     r = fallocate(fd,
>> +                   0,
>> +                   0,
>> +                   MFD_DEF_SIZE * 4);
>> +     if (r < 0) {
>> +             printf("fallocate(ALLOC) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 4);
>> +}
>> +
>> +static void mfd_fail_grow(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE * 2);
>> +     if (r >= 0) {
>> +             printf("ftruncate(GROW) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     r = fallocate(fd,
>> +                   0,
>> +                   0,
>> +                   MFD_DEF_SIZE * 4);
>> +     if (r >= 0) {
>> +             printf("fallocate(ALLOC) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_grow_write(int fd)
>> +{
>> +     static char buf[MFD_DEF_SIZE * 8];
>> +     ssize_t l;
>> +
>> +     l = pwrite(fd, buf, sizeof(buf), 0);
>> +     if (l != sizeof(buf)) {
>> +             printf("pwrite() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 8);
>> +}
>> +
>> +static void mfd_fail_grow_write(int fd)
>> +{
>> +     static char buf[MFD_DEF_SIZE * 8];
>> +     ssize_t l;
>> +
>> +     l = pwrite(fd, buf, sizeof(buf), 0);
>> +     if (l == sizeof(buf)) {
>> +             printf("pwrite() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static int idle_thread_fn(void *arg)
>> +{
>> +     sigset_t set;
>> +     int sig;
>> +
>> +     /* dummy waiter; SIGTERM terminates us anyway */
>> +     sigemptyset(&set);
>> +     sigaddset(&set, SIGTERM);
>> +     sigwait(&set, &sig);
>> +
>> +     return 0;
>> +}
>> +
>> +static pid_t spawn_idle_thread(void)
>> +{
>> +     uint8_t *stack;
>> +     pid_t pid;
>> +
>> +     stack = malloc(STACK_SIZE);
>> +     if (!stack) {
>> +             printf("malloc(STACK_SIZE) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     pid = clone(idle_thread_fn,
>> +                 stack + STACK_SIZE,
>> +                 CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
>> +                 NULL);
>> +     if (pid < 0) {
>> +             printf("clone() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return pid;
>> +}
>> +
>> +static void join_idle_thread(pid_t pid)
>> +{
>> +     kill(pid, SIGTERM);
>> +     waitpid(pid, NULL, 0);
>> +}
>> +
>> +static pid_t spawn_idle_proc(void)
>> +{
>> +     pid_t pid;
>> +     sigset_t set;
>> +     int sig;
>> +
>> +     pid = fork();
>> +     if (pid < 0) {
>> +             printf("fork() failed: %m\n");
>> +             abort();
>> +     } else if (!pid) {
>> +             /* dummy waiter; SIGTERM terminates us anyway */
>> +             sigemptyset(&set);
>> +             sigaddset(&set, SIGTERM);
>> +             sigwait(&set, &sig);
>> +             exit(0);
>> +     }
>> +
>> +     return pid;
>> +}
>> +
>> +static void join_idle_proc(pid_t pid)
>> +{
>> +     kill(pid, SIGTERM);
>> +     waitpid(pid, NULL, 0);
>> +}
>> +
>> +/*
>> + * Test memfd_create() syscall
>> + * Verify syscall-argument validation, including name checks, flag validation
>> + * and more.
>> + */
>> +static void test_create(void)
>> +{
>> +     char buf[2048];
>> +     int fd;
>> +
>> +     /* test NULL name */
>> +     mfd_fail_new(NULL, 0, 0);
>> +
>> +     /* test over-long name (not zero-terminated) */
>> +     memset(buf, 0xff, sizeof(buf));
>> +     mfd_fail_new(buf, 0, 0);
>> +
>> +     /* test over-long zero-terminated name */
>> +     memset(buf, 0xff, sizeof(buf));
>> +     buf[sizeof(buf) - 1] = 0;
>> +     mfd_fail_new(buf, 0, 0);
>> +
>> +     /* verify "" is a valid name */
>> +     fd = mfd_assert_new("", 0, 0);
>> +     close(fd);
>> +
>> +     /* verify invalid O_* open flags */
>> +     mfd_fail_new("", 0, 0x0100);
>> +     mfd_fail_new("", 0, ~MFD_CLOEXEC);
>> +     mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
>> +     mfd_fail_new("", 0, ~0);
>> +     mfd_fail_new("", 0, 0x8000000000000000ULL);
>> +
>> +     /* verify MFD_CLOEXEC is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_CLOEXEC);
>> +     close(fd);
>> +
>> +     /* verify MFD_ALLOW_SEALING is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
>> +     close(fd);
>> +
>> +     /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test basic sealing
>> + * A very basic sealing test to see whether setting/retrieving seals works.
>> + */
>> +static void test_basic(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_basic",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +
>> +     /* add basic seals */
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +
>> +     /* add them again */
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +
>> +     /* add more seals and seal against sealing */
>> +     mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_GROW |
>> +                              F_SEAL_WRITE |
>> +                              F_SEAL_SEAL);
>> +
>> +     /* verify that sealing no longer works */
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd, 0);
>> +
>> +     close(fd);
>> +
>> +     /* verify sealing does not work without MFD_ALLOW_SEALING */
>> +     fd = mfd_assert_new("kern_memfd_basic",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC);
>> +     mfd_fail_get_seals(fd);
>> +     mfd_fail_add_seals(fd, F_SEAL_SHRINK |
>> +                            F_SEAL_GROW |
>> +                            F_SEAL_WRITE);
>> +     mfd_fail_get_seals(fd);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_WRITE
>> + * Test whether SEAL_WRITE actually prevents modifications.
>> + */
>> +static void test_seal_write(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_write",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_fail_write(fd);
>> +     mfd_assert_shrink(fd);
>> +     mfd_assert_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_SHRINK
>> + * Test whether SEAL_SHRINK actually prevents shrinking
>> + */
>> +static void test_seal_shrink(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_shrink",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_fail_shrink(fd);
>> +     mfd_assert_grow(fd);
>> +     mfd_assert_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_GROW
>> + * Test whether SEAL_GROW actually prevents growing
>> + */
>> +static void test_seal_grow(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_grow",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_GROW);
>> +     mfd_assert_has_seals(fd, F_SEAL_GROW);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_assert_shrink(fd);
>> +     mfd_fail_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_SHRINK | SEAL_GROW
>> + * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
>> + */
>> +static void test_seal_resize(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_resize",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_fail_shrink(fd);
>> +     mfd_fail_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sharing via dup()
>> + * Test that seals are shared between dupped FDs and they're all equal.
>> + */
>> +static void test_share_dup(void)
>> +{
>> +     int fd, fd2;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_dup",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     fd2 = mfd_assert_dup(fd);
>> +     mfd_assert_has_seals(fd2, 0);
>> +
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     mfd_assert_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd2, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_fail_add_seals(fd2, F_SEAL_SEAL);
>> +
>> +     close(fd2);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sealing with active mmap()s
>> + * Modifying seals is only allowed if no other mmap() refs exist.
>> + */
>> +static void test_share_mmap(void)
>> +{
>> +     int fd;
>> +     void *p;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_mmap",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     /* shared/writable ref prevents sealing */
>> +     p = mfd_assert_mmap_shared(fd);
>> +     mfd_fail_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, 0);
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* readable ref allows sealing */
>> +     p = mfd_assert_mmap_private(fd);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK);
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sealing with open(/proc/self/fd/%d)
>> + * Via /proc we can get access to a separate file-context for the same memfd.
>> + * This is *not* like dup(), but like a real separate open(). Make sure the
>> + * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
>> + */
>> +static void test_share_open(void)
>> +{
>> +     int fd, fd2;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_open",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     fd2 = mfd_assert_open(fd, O_RDWR, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     close(fd);
>> +     fd = mfd_assert_open(fd2, O_RDONLY, 0);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     close(fd2);
>> +     fd2 = mfd_assert_open(fd, O_RDWR, 0);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +
>> +     close(fd2);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sharing via fork()
>> + * Test whether seal-modifications work as expected with forked children.
>> + */
>> +static void test_share_fork(void)
>> +{
>> +     int fd;
>> +     pid_t pid;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_fork",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     pid = spawn_idle_proc();
>> +     mfd_assert_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     join_idle_proc(pid);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     close(fd);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     pid_t pid;
>> +
>> +     printf("memfd: CREATE\n");
>> +     test_create();
>> +     printf("memfd: BASIC\n");
>> +     test_basic();
>> +
>> +     printf("memfd: SEAL-WRITE\n");
>> +     test_seal_write();
>> +     printf("memfd: SEAL-SHRINK\n");
>> +     test_seal_shrink();
>> +     printf("memfd: SEAL-GROW\n");
>> +     test_seal_grow();
>> +     printf("memfd: SEAL-RESIZE\n");
>> +     test_seal_resize();
>> +
>> +     printf("memfd: SHARE-DUP\n");
>> +     test_share_dup();
>> +     printf("memfd: SHARE-MMAP\n");
>> +     test_share_mmap();
>> +     printf("memfd: SHARE-OPEN\n");
>> +     test_share_open();
>> +     printf("memfd: SHARE-FORK\n");
>> +     test_share_fork();
>> +
>> +     /* Run test-suite in a multi-threaded environment with a shared
>> +      * file-table. */
>> +     pid = spawn_idle_thread();
>> +     printf("memfd: SHARE-DUP (shared file-table)\n");
>> +     test_share_dup();
>> +     printf("memfd: SHARE-MMAP (shared file-table)\n");
>> +     test_share_mmap();
>> +     printf("memfd: SHARE-OPEN (shared file-table)\n");
>> +     test_share_open();
>> +     printf("memfd: SHARE-FORK (shared file-table)\n");
>> +     test_share_fork();
>> +     join_idle_thread(pid);
>> +
>> +     printf("memfd: DONE\n");
>> +
>> +     return 0;
>> +}
>> --
>> 1.9.2


* Re: [PATCH v2 3/3] selftests: add memfd_create() + sealing tests
@ 2014-05-23 17:06       ` David Herrmann
From: David Herrmann @ 2014-05-23 17:06 UTC
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirsky, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi

On Tue, May 20, 2014 at 4:22 AM, Hugh Dickins <hughd@google.com> wrote:
> On Tue, 15 Apr 2014, David Herrmann wrote:
>
>> Some basic tests to verify sealing on memfds works as expected and
>> guarantees the advertised semantics.
>
> Thanks for providing these.
>
> A few remarks below, and I should note one oddity.
>
> Curious about leaks (probably none, I was merely curious), I tried to
> run memfd_test 4096 times in succession, and never succeeded.  After
> many iterations, the 32-bit one tends to hang somewhere just before
> reaching the DONE, and the 64-bit one gave me some kind of assert
> error from a library.
>
> I expect there's some threading race around join_idle_thread():
> which I think you will sort out infinitely sooner than I would.
> No need to fix it right now: the test works well enough.

Ugh, I will look into that. Didn't see anything obvious so far.
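
For readers following along, here is a minimal sketch of the spawn/join pattern under discussion, with two defensive changes. It is not a diagnosis of the hang Hugh reports; the thread does not identify the cause. The sketch assumes two things worth ruling out: POSIX leaves sigwait() undefined unless the awaited signal is blocked, and a stack size of 65535 puts the top of the child stack at an odd offset from the malloc() pointer. The names spawn_idle_thread_sketch(), join_idle_thread_sketch() and SKETCH_STACK_SIZE are hypothetical and not part of the patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical size: a multiple of 16 keeps the stack top as aligned
 * as the pointer returned by malloc(). */
#define SKETCH_STACK_SIZE 65536

static int idle_fn(void *arg)
{
	sigset_t *set = arg;
	int sig;

	/* SIGTERM is blocked (mask inherited from the parent), so it stays
	 * pending until sigwait() consumes it. Return 0 on a clean wakeup. */
	return sigwait(set, &sig) != 0 || sig != SIGTERM;
}

/* Returns the child pid, or -1 on error; *stack_out receives the stack. */
static pid_t spawn_idle_thread_sketch(void **stack_out)
{
	static sigset_t set;	/* static: the child reads it via CLONE_VM */
	uint8_t *stack;
	pid_t pid;

	assert(SKETCH_STACK_SIZE % 16 == 0);

	/* Block SIGTERM before clone() so there is no window in which the
	 * default action could run in the child. */
	sigemptyset(&set);
	sigaddset(&set, SIGTERM);
	if (sigprocmask(SIG_BLOCK, &set, NULL) < 0)
		return -1;

	stack = malloc(SKETCH_STACK_SIZE);
	if (!stack)
		return -1;

	pid = clone(idle_fn,
		    stack + SKETCH_STACK_SIZE,
		    CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
		    &set);
	if (pid < 0) {
		free(stack);
		return -1;
	}

	*stack_out = stack;
	return pid;
}

/* Returns the child's exit code, or -1 if it did not exit normally. */
static int join_idle_thread_sketch(pid_t pid, void *stack)
{
	int status;

	if (kill(pid, SIGTERM) < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;

	free(stack);	/* safe: the child has been reaped */
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A caller would pair the two helpers around the shared-file-table tests and treat any non-zero return from join_idle_thread_sketch() as a failure. Note that the calling process keeps SIGTERM blocked afterwards, which a real test might want to undo.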

>>
>> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>
>> ---
>>  tools/testing/selftests/Makefile           |   1 +
>>  tools/testing/selftests/memfd/.gitignore   |   2 +
>>  tools/testing/selftests/memfd/Makefile     |  29 +
>>  tools/testing/selftests/memfd/memfd_test.c | 944 +++++++++++++++++++++++++++++
>>  4 files changed, 976 insertions(+)
>>  create mode 100644 tools/testing/selftests/memfd/.gitignore
>>  create mode 100644 tools/testing/selftests/memfd/Makefile
>>  create mode 100644 tools/testing/selftests/memfd/memfd_test.c
>>
>> diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
>> index 32487ed..c57325a 100644
>> --- a/tools/testing/selftests/Makefile
>> +++ b/tools/testing/selftests/Makefile
>> @@ -2,6 +2,7 @@ TARGETS = breakpoints
>>  TARGETS += cpu-hotplug
>>  TARGETS += efivarfs
>>  TARGETS += kcmp
>> +TARGETS += memfd
>>  TARGETS += memory-hotplug
>>  TARGETS += mqueue
>>  TARGETS += net
>> diff --git a/tools/testing/selftests/memfd/.gitignore b/tools/testing/selftests/memfd/.gitignore
>> new file mode 100644
>> index 0000000..bcc8ee2
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/.gitignore
>> @@ -0,0 +1,2 @@
>> +memfd_test
>> +memfd-test-file
>> diff --git a/tools/testing/selftests/memfd/Makefile b/tools/testing/selftests/memfd/Makefile
>> new file mode 100644
>> index 0000000..36653b9
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/Makefile
>> @@ -0,0 +1,29 @@
>> +uname_M := $(shell uname -m 2>/dev/null || echo not)
>> +ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
>> +ifeq ($(ARCH),i386)
>> +     ARCH := X86
>> +endif
>> +ifeq ($(ARCH),x86_64)
>> +     ARCH := X86
>> +endif
>> +
>> +CFLAGS += -I../../../../arch/x86/include/generated/uapi/
>> +CFLAGS += -I../../../../arch/x86/include/uapi/
>> +CFLAGS += -I../../../../include/uapi/
>> +CFLAGS += -I../../../../include/
>> +
>> +all:
>> +ifeq ($(ARCH),X86)
>> +     gcc $(CFLAGS) memfd_test.c -o memfd_test
>> +else
>> +     echo "Not an x86 target, can't build memfd selftest"
>> +endif
>> +
>> +run_tests: all
>> +ifeq ($(ARCH),X86)
>> +     gcc $(CFLAGS) memfd_test.c -o memfd_test
>> +endif
>> +     @./memfd_test || echo "memfd_test: [FAIL]"
>> +
>> +clean:
>> +     $(RM) memfd_test
>> diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
>> new file mode 100644
>> index 0000000..3e105ea
>> --- /dev/null
>> +++ b/tools/testing/selftests/memfd/memfd_test.c
>> @@ -0,0 +1,944 @@
>> +#define _GNU_SOURCE
>> +#define __EXPORTED_HEADERS__
>> +
>> +#include <errno.h>
>> +#include <inttypes.h>
>> +#include <limits.h>
>> +#include <linux/falloc.h>
>> +#include <linux/fcntl.h>
>> +#include <linux/memfd.h>
>> +#include <sched.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <signal.h>
>> +#include <string.h>
>> +#include <sys/mman.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +#define MFD_DEF_SIZE 8192
>> +#define STACK_SIZE 65536
>> +
>> +static int sys_memfd_create(const char *name,
>> +                         __u64 size,
>> +                         __u64 flags)
>> +{
>> +     return syscall(__NR_memfd_create, name, size, flags);
>> +}
>> +
>> +static int mfd_assert_new(const char *name, __u64 sz, __u64 flags)
>> +{
>> +     int r;
>> +
>> +     r = sys_memfd_create(name, sz, flags);
>> +     if (r < 0) {
>> +             printf("memfd_create(\"%s\", %llu, %llu) failed: %m\n",
>> +                    name,
>> +                    (unsigned long long)sz,
>> +                    (unsigned long long)flags);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_new(const char *name, __u64 size, __u64 flags)
>> +{
>> +     int r;
>> +
>> +     r = sys_memfd_create(name, size, flags);
>> +     if (r >= 0) {
>> +             printf("memfd_create(\"%s\", %llu, %llu) succeeded, but failure expected\n",
>
> scripts/checkpatch.pl complains about line-length: please ignore it on this.
>
>> +                    name,
>> +                    (unsigned long long)size,
>> +                    (unsigned long long)flags);
>> +             close(r);
>> +             abort();
>> +     }
>> +}
>> +
>> +static __u64 mfd_assert_get_seals(int fd)
>> +{
>> +     long r;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r < 0) {
>> +             printf("GET_SEALS(%d) failed: %m\n", fd);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_get_seals(int fd)
>> +{
>> +     long r;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r >= 0) {
>> +             printf("GET_SEALS(%d) succeeded, but failure expected\n", fd);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_has_seals(int fd, __u64 seals)
>> +{
>> +     __u64 s;
>> +
>> +     s = mfd_assert_get_seals(fd);
>> +     if (s != seals) {
>> +             printf("%llu != %llu = GET_SEALS(%d)\n",
>> +                    (unsigned long long)seals, (unsigned long long)s, fd);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_add_seals(int fd, __u64 seals)
>> +{
>> +     long r;
>> +     __u64 s;
>> +
>> +     s = mfd_assert_get_seals(fd);
>> +     r = fcntl(fd, F_ADD_SEALS, seals);
>> +     if (r < 0) {
>> +             printf("ADD_SEALS(%d, %llu -> %llu) failed: %m\n",
>> +                    fd, (unsigned long long)s, (unsigned long long)seals);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_fail_add_seals(int fd, __u64 seals)
>> +{
>> +     long r;
>> +     __u64 s;
>> +
>> +     r = fcntl(fd, F_GET_SEALS);
>> +     if (r < 0)
>> +             s = 0;
>> +     else
>> +             s = r;
>> +
>> +     r = fcntl(fd, F_ADD_SEALS, seals);
>> +     if (r >= 0) {
>> +             printf("ADD_SEALS(%d, %llu -> %llu) didn't fail as expected\n",
>> +                    fd, (unsigned long long)s, (unsigned long long)seals);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_size(int fd, size_t size)
>> +{
>> +     struct stat st;
>> +     int r;
>> +
>> +     r = fstat(fd, &st);
>> +     if (r < 0) {
>> +             printf("fstat(%d) failed: %m\n", fd);
>> +             abort();
>> +     } else if (st.st_size != size) {
>> +             printf("wrong file size %lld, but expected %lld\n",
>> +                    (long long)st.st_size, (long long)size);
>> +             abort();
>> +     }
>> +}
>> +
>> +static int mfd_assert_dup(int fd)
>> +{
>> +     int r;
>> +
>> +     r = dup(fd);
>> +     if (r < 0) {
>> +             printf("dup(%d) failed: %m\n", fd);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void *mfd_assert_mmap_shared(int fd)
>> +{
>> +     void *p;
>> +
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return p;
>> +}
>> +
>> +static void *mfd_assert_mmap_private(int fd)
>> +{
>> +     void *p;
>> +
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return p;
>> +}
>> +
>> +static int mfd_assert_open(int fd, int flags, mode_t mode)
>> +{
>> +     char buf[512];
>> +     int r;
>> +
>> +     sprintf(buf, "/proc/self/fd/%d", fd);
>> +     r = open(buf, flags, mode);
>> +     if (r < 0) {
>> +             printf("open(%s) failed: %m\n", buf);
>> +             abort();
>> +     }
>> +
>> +     return r;
>> +}
>> +
>> +static void mfd_fail_open(int fd, int flags, mode_t mode)
>> +{
>> +     char buf[512];
>> +     int r;
>> +
>> +     sprintf(buf, "/proc/self/fd/%d", fd);
>> +     r = open(buf, flags, mode);
>> +     if (r >= 0) {
>> +             printf("open(%s) didn't fail as expected\n", buf);
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_read(int fd)
>> +{
>> +     char buf[16];
>> +     void *p;
>> +     ssize_t l;
>> +
>> +     l = read(fd, buf, sizeof(buf));
>> +     if (l != sizeof(buf)) {
>> +             printf("read() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ *is* allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify MAP_PRIVATE is *always* allowed (even writable) */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_PRIVATE,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     munmap(p, MFD_DEF_SIZE);
>> +}
>> +
>> +static void mfd_assert_write(int fd)
>> +{
>> +     ssize_t l;
>> +     void *p;
>> +     int r;
>> +
>> +     /* verify write() succeeds */
>> +     l = write(fd, "\0\0\0\0", 4);
>> +     if (l != 4) {
>> +             printf("write() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ | PROT_WRITE is allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     *(char*)p = 0;
>
> scripts/checkpatch.pl complains about (char*): better calm it with (char *).
> Same on two other lines below.
>
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PROT_WRITE is allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +     *(char*)p = 0;
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PROT_READ with MAP_SHARED is allowed and a following
>> +      * mprotect(PROT_WRITE) allows writing */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p == MAP_FAILED) {
>> +             printf("mmap() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     r = mprotect(p, MFD_DEF_SIZE, PROT_READ | PROT_WRITE);
>> +     if (r < 0) {
>> +             printf("mprotect() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     *(char*)p = 0;
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* verify PUNCH_HOLE works */
>> +     r = fallocate(fd,
>> +                   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                   0,
>> +                   MFD_DEF_SIZE);
>> +     if (r < 0) {
>> +             printf("fallocate(PUNCH_HOLE) failed: %m\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_fail_write(int fd)
>> +{
>> +     ssize_t l;
>> +     void *p;
>> +     int r;
>> +
>> +     /* verify write() fails */
>> +     l = write(fd, "data", 4);
>> +     if (l >= 0 || errno != EPERM) {
>> +             printf("expected EPERM on write(), but got %d: %m\n", (int)l);
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ | PROT_WRITE is not allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ | PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_WRITE is not allowed */
>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_WRITE,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PROT_READ with MAP_SHARED is not allowed */
>
> This is a particularly interesting case, checking PROT_READ,MAP_SHARED
> not allowed in mfd_fail_write().  It feels invidious to ask for more
> of a comment, in a test which you have been generous to provide at all.
> But it stopped me short for a while: more comment might help others too.
>
> The reason being (right?) that this fd was opened O_RDWR, so a
> MAP_SHARED mapping would permit a subsequent mprotect(,,PROT_WRITE),
> which sealing the file against writes must prevent.
>
> Your kernel checks rely on VM_SHARED and i_mmap_writable for this
> protection: which is fine, but an implementation detail which could
> be modified in future, if this case were ever to pose a difficulty.

Yes indeed, this is meant to catch VM_MAYWRITE. Currently, every
mmap(MAP_SHARED) on a writable FD allows mprotect(PROT_WRITE) later
on. I thought that's hard-coded ABI so I rely on it here. But I can
definitely add a comment mentioning VM_MAYWRITE.

Given that this test-case fails if I run mfd_fail_write() on a
read-only FD, I might even want to change it to run mmap()+mprotect().
This should clear up all doubts.

Thanks!
David
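
The mmap()+mprotect() check mentioned above could look roughly like the sketch below. It is an illustration, not the code from this series. It assumes the two-argument memfd_create(name, flags) wrapper that later shipped in glibc, rather than the three-argument form used in this v2 patch. The helper names and SKETCH_SIZE are hypothetical. The helper treats either outcome as "blocked": the kernel may refuse the shared mapping outright, or it may allow a read-only mapping and refuse the later mprotect(PROT_WRITE).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define SKETCH_SIZE 8192	/* hypothetical; mirrors MFD_DEF_SIZE */

/*
 * Returns 1 if write access through a MAP_SHARED mapping is blocked,
 * 0 if a read-only shared mapping could be upgraded to writable.
 */
static int shared_write_blocked(int fd)
{
	void *p;
	int r;

	assert(fd >= 0);

	p = mmap(NULL, SKETCH_SIZE, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return 1;	/* shared mapping refused outright */

	/* On a writable fd without seals this upgrade normally succeeds,
	 * which is exactly what F_SEAL_WRITE has to prevent. */
	r = mprotect(p, SKETCH_SIZE, PROT_READ | PROT_WRITE);
	munmap(p, SKETCH_SIZE);

	return r < 0;
}

/* Creates a sealable memfd of SKETCH_SIZE bytes; returns -1 on error. */
static int new_sealable_memfd(void)
{
	int fd;

	fd = memfd_create("sketch", MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, SKETCH_SIZE) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

Checking the helper once before and once after fcntl(fd, F_ADD_SEALS, F_SEAL_WRITE) shows the difference the seal makes. Because the mapping is removed before the helper returns, it does not get in the way of adding the seal afterwards.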

>> +     p = mmap(NULL,
>> +              MFD_DEF_SIZE,
>> +              PROT_READ,
>> +              MAP_SHARED,
>> +              fd,
>> +              0);
>> +     if (p != MAP_FAILED) {
>> +             printf("mmap() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     /* verify PUNCH_HOLE fails */
>> +     r = fallocate(fd,
>> +                   FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                   0,
>> +                   MFD_DEF_SIZE);
>> +     if (r >= 0) {
>> +             printf("fallocate(PUNCH_HOLE) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_shrink(int fd)
>> +{
>> +     int r, fd2;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE / 2);
>> +     if (r < 0) {
>> +             printf("ftruncate(SHRINK) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE / 2);
>> +
>> +     fd2 = mfd_assert_open(fd,
>> +                           O_RDWR | O_CREAT | O_TRUNC,
>> +                           S_IRUSR | S_IWUSR);
>> +     close(fd2);
>> +
>> +     mfd_assert_size(fd, 0);
>> +}
>> +
>> +static void mfd_fail_shrink(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE / 2);
>> +     if (r >= 0) {
>> +             printf("ftruncate(SHRINK) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_fail_open(fd,
>> +                   O_RDWR | O_CREAT | O_TRUNC,
>> +                   S_IRUSR | S_IWUSR);
>> +}
>> +
>> +static void mfd_assert_grow(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE * 2);
>> +     if (r < 0) {
>> +             printf("ftruncate(GROW) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 2);
>> +
>> +     r = fallocate(fd,
>> +                   0,
>> +                   0,
>> +                   MFD_DEF_SIZE * 4);
>> +     if (r < 0) {
>> +             printf("fallocate(ALLOC) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 4);
>> +}
>> +
>> +static void mfd_fail_grow(int fd)
>> +{
>> +     int r;
>> +
>> +     r = ftruncate(fd, MFD_DEF_SIZE * 2);
>> +     if (r >= 0) {
>> +             printf("ftruncate(GROW) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +
>> +     r = fallocate(fd,
>> +                   0,
>> +                   0,
>> +                   MFD_DEF_SIZE * 4);
>> +     if (r >= 0) {
>> +             printf("fallocate(ALLOC) didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static void mfd_assert_grow_write(int fd)
>> +{
>> +     static char buf[MFD_DEF_SIZE * 8];
>> +     ssize_t l;
>> +
>> +     l = pwrite(fd, buf, sizeof(buf), 0);
>> +     if (l != sizeof(buf)) {
>> +             printf("pwrite() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     mfd_assert_size(fd, MFD_DEF_SIZE * 8);
>> +}
>> +
>> +static void mfd_fail_grow_write(int fd)
>> +{
>> +     static char buf[MFD_DEF_SIZE * 8];
>> +     ssize_t l;
>> +
>> +     l = pwrite(fd, buf, sizeof(buf), 0);
>> +     if (l == sizeof(buf)) {
>> +             printf("pwrite() didn't fail as expected\n");
>> +             abort();
>> +     }
>> +}
>> +
>> +static int idle_thread_fn(void *arg)
>> +{
>> +     sigset_t set;
>> +     int sig;
>> +
>> +     /* dummy waiter; SIGTERM terminates us anyway */
>> +     sigemptyset(&set);
>> +     sigaddset(&set, SIGTERM);
>> +     sigwait(&set, &sig);
>> +
>> +     return 0;
>> +}
>> +
>> +static pid_t spawn_idle_thread(void)
>> +{
>> +     uint8_t *stack;
>> +     pid_t pid;
>> +
>> +     stack = malloc(STACK_SIZE);
>> +     if (!stack) {
>> +             printf("malloc(STACK_SIZE) failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     pid = clone(idle_thread_fn,
>> +                 stack + STACK_SIZE,
>> +                 CLONE_FILES | CLONE_FS | CLONE_VM | SIGCHLD,
>> +                 NULL);
>> +     if (pid < 0) {
>> +             printf("clone() failed: %m\n");
>> +             abort();
>> +     }
>> +
>> +     return pid;
>> +}
>> +
>> +static void join_idle_thread(pid_t pid)
>> +{
>> +     kill(pid, SIGTERM);
>> +     waitpid(pid, NULL, 0);
>> +}
>> +
>> +static pid_t spawn_idle_proc(void)
>> +{
>> +     pid_t pid;
>> +     sigset_t set;
>> +     int sig;
>> +
>> +     pid = fork();
>> +     if (pid < 0) {
>> +             printf("fork() failed: %m\n");
>> +             abort();
>> +     } else if (!pid) {
>> +             /* dummy waiter; SIGTERM terminates us anyway */
>> +             sigemptyset(&set);
>> +             sigaddset(&set, SIGTERM);
>> +             sigwait(&set, &sig);
>> +             exit(0);
>> +     }
>> +
>> +     return pid;
>> +}
>> +
>> +static void join_idle_proc(pid_t pid)
>> +{
>> +     kill(pid, SIGTERM);
>> +     waitpid(pid, NULL, 0);
>> +}
>> +
>> +/*
>> + * Test memfd_create() syscall
>> + * Verify syscall-argument validation, including name checks, flag validation
>> + * and more.
>> + */
>> +static void test_create(void)
>> +{
>> +     char buf[2048];
>> +     int fd;
>> +
>> +     /* test NULL name */
>> +     mfd_fail_new(NULL, 0, 0);
>> +
>> +     /* test over-long name (not zero-terminated) */
>> +     memset(buf, 0xff, sizeof(buf));
>> +     mfd_fail_new(buf, 0, 0);
>> +
>> +     /* test over-long zero-terminated name */
>> +     memset(buf, 0xff, sizeof(buf));
>> +     buf[sizeof(buf) - 1] = 0;
>> +     mfd_fail_new(buf, 0, 0);
>> +
>> +     /* verify "" is a valid name */
>> +     fd = mfd_assert_new("", 0, 0);
>> +     close(fd);
>> +
>> +     /* verify invalid O_* open flags */
>> +     mfd_fail_new("", 0, 0x0100);
>> +     mfd_fail_new("", 0, ~MFD_CLOEXEC);
>> +     mfd_fail_new("", 0, ~MFD_ALLOW_SEALING);
>> +     mfd_fail_new("", 0, ~0);
>> +     mfd_fail_new("", 0, 0x8000000000000000ULL);
>> +
>> +     /* verify MFD_CLOEXEC is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_CLOEXEC);
>> +     close(fd);
>> +
>> +     /* verify MFD_ALLOW_SEALING is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING);
>> +     close(fd);
>> +
>> +     /* verify MFD_ALLOW_SEALING | MFD_CLOEXEC is allowed */
>> +     fd = mfd_assert_new("", 0, MFD_ALLOW_SEALING | MFD_CLOEXEC);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test basic sealing
>> + * A very basic sealing test to see whether setting/retrieving seals works.
>> + */
>> +static void test_basic(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_basic",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +
>> +     /* add basic seals */
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +
>> +     /* add them again */
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_WRITE);
>> +
>> +     /* add more seals and seal against sealing */
>> +     mfd_assert_add_seals(fd, F_SEAL_GROW | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK |
>> +                              F_SEAL_GROW |
>> +                              F_SEAL_WRITE |
>> +                              F_SEAL_SEAL);
>> +
>> +     /* verify that sealing no longer works */
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd, 0);
>> +
>> +     close(fd);
>> +
>> +     /* verify sealing does not work without MFD_ALLOW_SEALING */
>> +     fd = mfd_assert_new("kern_memfd_basic",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC);
>> +     mfd_fail_get_seals(fd);
>> +     mfd_fail_add_seals(fd, F_SEAL_SHRINK |
>> +                            F_SEAL_GROW |
>> +                            F_SEAL_WRITE);
>> +     mfd_fail_get_seals(fd);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_WRITE
>> + * Test whether SEAL_WRITE actually prevents modifications.
>> + */
>> +static void test_seal_write(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_write",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_fail_write(fd);
>> +     mfd_assert_shrink(fd);
>> +     mfd_assert_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_SHRINK
>> + * Test whether SEAL_SHRINK actually prevents shrinking
>> + */
>> +static void test_seal_shrink(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_shrink",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_fail_shrink(fd);
>> +     mfd_assert_grow(fd);
>> +     mfd_assert_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_GROW
>> + * Test whether SEAL_GROW actually prevents growing
>> + */
>> +static void test_seal_grow(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_grow",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_GROW);
>> +     mfd_assert_has_seals(fd, F_SEAL_GROW);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_assert_shrink(fd);
>> +     mfd_fail_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test SEAL_SHRINK | SEAL_GROW
>> + * Test whether SEAL_SHRINK | SEAL_GROW actually prevents resizing
>> + */
>> +static void test_seal_resize(void)
>> +{
>> +     int fd;
>> +
>> +     fd = mfd_assert_new("kern_memfd_seal_resize",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK | F_SEAL_GROW);
>> +
>> +     mfd_assert_read(fd);
>> +     mfd_assert_write(fd);
>> +     mfd_fail_shrink(fd);
>> +     mfd_fail_grow(fd);
>> +     mfd_fail_grow_write(fd);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sharing via dup()
>> + * Test that seals are shared between dupped FDs and they're all equal.
>> + */
>> +static void test_share_dup(void)
>> +{
>> +     int fd, fd2;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_dup",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     fd2 = mfd_assert_dup(fd);
>> +     mfd_assert_has_seals(fd2, 0);
>> +
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     mfd_assert_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd2, F_SEAL_GROW);
>> +     mfd_fail_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_fail_add_seals(fd2, F_SEAL_SEAL);
>> +
>> +     close(fd2);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_GROW);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sealing with active mmap()s
>> + * Modifying seals is only allowed if no other mmap() refs exist.
>> + */
>> +static void test_share_mmap(void)
>> +{
>> +     int fd;
>> +     void *p;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_mmap",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     /* shared/writable ref prevents sealing */
>> +     p = mfd_assert_mmap_shared(fd);
>> +     mfd_fail_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, 0);
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     /* readable ref allows sealing */
>> +     p = mfd_assert_mmap_private(fd);
>> +     mfd_assert_add_seals(fd, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_SHRINK);
>> +     munmap(p, MFD_DEF_SIZE);
>> +
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sealing with open(/proc/self/fd/%d)
>> + * Via /proc we can get access to a separate file-context for the same memfd.
>> + * This is *not* like dup(), but like a real separate open(). Make sure the
>> + * semantics are as expected and we correctly check for RDONLY / WRONLY / RDWR.
>> + */
>> +static void test_share_open(void)
>> +{
>> +     int fd, fd2;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_open",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     fd2 = mfd_assert_open(fd, O_RDWR, 0);
>> +     mfd_assert_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     close(fd);
>> +     fd = mfd_assert_open(fd2, O_RDONLY, 0);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK);
>> +
>> +     close(fd2);
>> +     fd2 = mfd_assert_open(fd, O_RDWR, 0);
>> +
>> +     mfd_assert_add_seals(fd2, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd2, F_SEAL_WRITE | F_SEAL_SHRINK | F_SEAL_SEAL);
>> +
>> +     close(fd2);
>> +     close(fd);
>> +}
>> +
>> +/*
>> + * Test sharing via fork()
>> + * Test whether seal-modifications work as expected with forked children.
>> + */
>> +static void test_share_fork(void)
>> +{
>> +     int fd;
>> +     pid_t pid;
>> +
>> +     fd = mfd_assert_new("kern_memfd_share_fork",
>> +                         MFD_DEF_SIZE,
>> +                         MFD_CLOEXEC | MFD_ALLOW_SEALING);
>> +     mfd_assert_has_seals(fd, 0);
>> +
>> +     pid = spawn_idle_proc();
>> +     mfd_assert_add_seals(fd, F_SEAL_SEAL);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     join_idle_proc(pid);
>> +
>> +     mfd_fail_add_seals(fd, F_SEAL_WRITE);
>> +     mfd_assert_has_seals(fd, F_SEAL_SEAL);
>> +
>> +     close(fd);
>> +}
>> +
>> +int main(int argc, char **argv)
>> +{
>> +     pid_t pid;
>> +
>> +     printf("memfd: CREATE\n");
>> +     test_create();
>> +     printf("memfd: BASIC\n");
>> +     test_basic();
>> +
>> +     printf("memfd: SEAL-WRITE\n");
>> +     test_seal_write();
>> +     printf("memfd: SEAL-SHRINK\n");
>> +     test_seal_shrink();
>> +     printf("memfd: SEAL-GROW\n");
>> +     test_seal_grow();
>> +     printf("memfd: SEAL-RESIZE\n");
>> +     test_seal_resize();
>> +
>> +     printf("memfd: SHARE-DUP\n");
>> +     test_share_dup();
>> +     printf("memfd: SHARE-MMAP\n");
>> +     test_share_mmap();
>> +     printf("memfd: SHARE-OPEN\n");
>> +     test_share_open();
>> +     printf("memfd: SHARE-FORK\n");
>> +     test_share_fork();
>> +
>> +     /* Run test-suite in a multi-threaded environment with a shared
>> +      * file-table. */
>> +     pid = spawn_idle_thread();
>> +     printf("memfd: SHARE-DUP (shared file-table)\n");
>> +     test_share_dup();
>> +     printf("memfd: SHARE-MMAP (shared file-table)\n");
>> +     test_share_mmap();
>> +     printf("memfd: SHARE-OPEN (shared file-table)\n");
>> +     test_share_open();
>> +     printf("memfd: SHARE-FORK (shared file-table)\n");
>> +     test_share_fork();
>> +     join_idle_thread(pid);
>> +
>> +     printf("memfd: DONE\n");
>> +
>> +     return 0;
>> +}
>> --
>> 1.9.2


^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-19 22:11             ` Hugh Dickins
@ 2014-05-26 11:44               ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-05-26 11:44 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Jan Kara, Tony Battersby, Al Viro, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, John Stultz, Christoph Lameter, Mel Gorman

Hi

(CC migrate.c committers)

On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins <hughd@google.com> wrote:
> On Mon, 19 May 2014, Jan Kara wrote:
>> On Mon 19-05-14 13:44:25, David Herrmann wrote:
>> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
>> > > The aspect which really worries me is this: the maintenance burden.
>> > > This approach would add some peculiar new code, introducing a rare
>> > > special case: which we might get right today, but will very easily
>> > > forget tomorrow when making some other changes to mm.  If we compile
>> > > a list of danger areas in mm, this would surely belong on that list.
>> >
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>
> To be honest, I'm quite glad to hear that: it is still a solution worth
> considering, but I'd rather continue the search for a better solution.

What if we set VM_IO for memory-mappings if a file supports sealing?
That might be a hack and quite restrictive, but we could also add a
VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
also a side-effect of VM_IO). This is basically what we do to protect
PCI BARs from that race during hotplug (well, VM_PFNMAP is what
protects those, but the code is the same). If we mention in the
man-page that memfd-objects don't support direct-IO, we'd be fine, I
think. Not sure if that hack is better than the page-replacement,
though. It'd be definitely much simpler.

Regarding page-replacement, I tried using migrate_page(), however,
this obviously fails in page_freeze_refs() due to the elevated
ref-count and we cannot account for those, as they might vanish
asynchronously. Now I wonder whether we could just add a new mode
MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
Existing refs would still operate on the old page, but any new access
gets the new page. This way, we could collect pages with elevated
ref-counts in shmem similar to do_move_page_to_node_array() and then
call migrate_pages(). Now migrate_pages() takes good care to prevent
any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
as 'in-migration', so accesses are delayed. Page-faults wait on the
page-lock and retry due to mapping==NULL. lru is disabled beforehand.
Therefore, there cannot be any racing page-lookups as they all stall
on the migration. Moreover, page_freeze_refs() fails only if the page
is pinned by independent users (usually some form of I/O).
Question is what those additional ref-counts might be. Given that
shmem 'owns' its pages, none of these external references should pass
those refs around. All they use it for is I/O. Therefore, we shouldn't
even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
expect those external refs to never pass page-refs around. If that's a
valid assumption (and I haven't found any offenders so far), we should
be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.

Comments?

While skimming over migrate.c I noticed two odd behaviors:
1) migration_entry_wait() is used to wait on a migration to finish,
before accessing PTE entries. However, we call get_page() there, which
increases the ref-count of the old page and causes page_freeze_refs()
to fail. There's no way we can know how many tasks wait on a migration
entry when calling page_freeze_refs(). I have no idea how that's
supposed to work? Why don't we store the new page in the migration-swp
entry so any lookups stall on the new page? We don't care for
ref-counts on that page and if the migration fails, new->mapping is
set to NULL and any lookup is retried. remove_migration_pte() can
restore the old page correctly.
2) remove_migration_pte() calls get_page(new) before writing the PTE.
But who releases the ref of the old page?

Thanks
David

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-26 11:44               ` David Herrmann
@ 2014-05-31  4:44                 ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-05-31  4:44 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Jan Kara, Tony Battersby, Al Viro, Linus Torvalds,
	Andrew Morton, linux-mm, linux-fsdevel, linux-kernel,
	Johannes Weiner, Tejun Heo, John Stultz, Christoph Lameter,
	Mel Gorman

On Mon, 26 May 2014, David Herrmann wrote:
> 
> (CC migrate.c committers)
> 
> On Tue, May 20, 2014 at 12:11 AM, Hugh Dickins <hughd@google.com> wrote:
> > On Mon, 19 May 2014, Jan Kara wrote:
> >> On Mon 19-05-14 13:44:25, David Herrmann wrote:
> >> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> >> > > The aspect which really worries me is this: the maintenance burden.
> >> > > This approach would add some peculiar new code, introducing a rare
> >> > > special case: which we might get right today, but will very easily
> >> > > forget tomorrow when making some other changes to mm.  If we compile
> >> > > a list of danger areas in mm, this would surely belong on that list.
> >> >
> >> > I tried doing the page-replacement in the last 4 days, but honestly,
> >> > it's far more complex than I thought. So if no-one more experienced
> >
> > To be honest, I'm quite glad to hear that: it is still a solution worth
> > considering, but I'd rather continue the search for a better solution.
> 
> What if we set VM_IO for memory-mappings if a file supports sealing?
> That might be a hack and quite restrictive, but we could also add a
> VM_DONTPIN flag that just prevents any page-pinning like GUP (which is
> also a side-effect of VM_IO). This is basically what we do to protect
> PCI BARs from that race during hotplug (well, VM_PFNMAP is what
> protects those, but the code is the same). If we mention in the
> man-page that memfd-objects don't support direct-IO, we'd be fine, I
> think. Not sure if that hack is better than the page-replacement,
> though. It'd be definitely much simpler.

I admire your resourcefulness, but three things put me off using
VM_IO.

One is just a great distaste for using VM_IO in mm, when it's about
device and memory regions unknown to mm.  You can indeed get around
that objection by making a new VM_DONTPIN with similar restrictions.

Two is that I'm afraid of those restrictions which you foresee.
Soon after putting this in, we shall have people saying "I can't
do such-and-such on my memfd-object, can you please enable it"
and then we shall have to find another solution.

But three, more importantly, that gives no protection against
get_user_pages_fast() callers: as mentioned before, GUP-fast works
on page tables, not examining vmas at all, so it is blind to VM_IO.

get_user_pages_fast() does check pte access, and pte_special bit,
falling back to get_user_pages() if unsuitable: IIRC, pte_special
support is important for an architecture to support GUP-fast.
But I definitely don't want any shmem to be !vm_normal_page().

> 
> Regarding page-replacement, I tried using migrate_page(), however,
> this obviously fails in page_freeze_refs() due to the elevated
> ref-count and we cannot account for those, as they might vanish
> asynchronously. Now I wonder whether we could just add a new mode
> MIGRATE_PHASE_OUT that avoids freezing the page and forces the copy.
> Existing refs would still operate on the old page, but any new access
> gets the new page. This way, we could collect pages with elevated
> ref-counts in shmem similar to do_move_page_to_node_array() and then
> call migrate_pages(). Now migrate_pages() takes good care to prevent
> any new refs during migration. try_to_unmap(TTU_MIGRATION) marks PTEs
> as 'in-migration', so accesses are delayed. Page-faults wait on the
> page-lock and retry due to mapping==NULL. lru is disabled beforehand.
> Therefore, there cannot be any racing page-lookups as they all stall
> on the migration. Moreover, page_freeze_refs() fails only if the page
> is pinned by independent users (usually some form of I/O).
> Question is what those additional ref-counts might be. Given that
> shmem 'owns' its pages, none of these external references should pass
> those refs around. All they use it for is I/O. Therefore, we shouldn't
> even need an additional try_to_unmap() _after_ MIGRATE_PHASE_OUT as we
> expect those external refs to never pass page-refs around. If that's a
> valid assumption (and I haven't found any offenders so far), we should
> be good with migrate_pages(MIGRATE_PHASE_OUT) as I described.
> 
> Comments?

I have not given these details all the attention that they deserve.
It's Tony's copy-on-seal suggestion, fleshed out to use the migration
infrastructure; but I still feel the same way about it as I did before.

It's complexity (and complexity outside of mm/shmem.c) that I would
prefer to avoid; and I find it hard to predict the consequence of
this copy-on-write-in-reverse, where existing users are left holding
something that used to be part of the object and now is not.  Perhaps
the existing sudden-truncation codepaths would handle it appropriately,
perhaps they would not.

(I wonder if Daniel Phillips's Tux3 "page fork"ing is relevant here?
But don't want to explore that right now.  "Revoke" also comes to mind.)

I do think it may turn out to be a good future way forward, if what
we do now turns out to be too restrictive; but we don't want to turn
mm (slightly) upside down just to get sealing in.

> 
> While skimming over migrate.c I noticed two odd behaviors:
> 1) migration_entry_wait() is used to wait on a migration to finish,
> before accessing PTE entries. However, we call get_page() there, which
> increases the ref-count of the old page and causes page_freeze_refs()
> to fail. There's no way we can know how many tasks wait on a migration
> entry when calling page_freeze_refs(). I have no idea how that's
> supposed to work? Why don't we store the new page in the migration-swp
> entry so any lookups stall on the new page? We don't care for
> ref-counts on that page and if the migration fails, new->mapping is
> set to NULL and any lookup is retried. remove_migration_pte() can
> restore the old page correctly.

Good point.  IIRC one reason for using the old page in the migration
pte, is that the migration may fail and the new page be discarded before
the old ptes have been restored.  Perhaps some rearrangement could handle
that.  As to how it works at present, I think it simply relies on the
unlikelihood that any migration_entry_wait() will raise the page count
before the "expected_count" is checked.  And there's a fair bit of
-EAGAIN retrying too.

But again, I've only just caught up with you here, and haven't thought
about it much yet.  I don't remember considering migration_entry_wait()
as counter-productive in the way you indicate: worth more thought.

> 2) remove_migration_pte() calls get_page(new) before writing the PTE.
> But who releases the ref of the old page?

putback_lru_page()?  You may want a longer answer, or you may have
worked it out for yourself meanwhile - I'm confident that we do not
actually have a common leak there, that would have been noticed by now.


After all that negativity, below is a function for you.  Basically,
it's for your original idea of checking page references at sealing
time.  I was put off by the thought of endless attempts to get every
page of a large file accountable at the same time, but didn't think
of radix_tree tags until a couple of nights ago: they make it much
more palatable.

This is not wonderful (possibility of -EBUSY after 150ms - even if
pages are pinned for read only); but I think it's good enough, to
start with anyway.  You may disagree - and please don't jump at it, 
just because you're keen to get the sealing in.

I imagine you calling this near the i_mmap_writable check, and only
bothering to call it when the object has at some time been mapped
VM_SHARED (note VM_SHARED in shmem_inode_info->flags).  Precisely what
locking is needed, we can see once you have i_mmap_writable right.

I still have your replies on 1-3/3 to respond to: not tonight.
Oh, before I forget, linux-api@vger.kernel.org - that's a mailing
list that we've only recently become conscious of, good for Cc'ing
patches such as yours to.

Hugh

/*
 * We need a tag: a new tag would expand every radix_tree_node by 8 bytes,
 * so reuse a tag which we firmly believe is never set or cleared on shmem.
 */
#define SHMEM_TAG_PINNED	PAGECACHE_TAG_TOWRITE
#define LAST_SCAN		4	/* about 150ms max */

static int shmem_wait_for_pins(struct address_space *mapping)
{
	struct radix_tree_iter iter;
	void **slot;
	pgoff_t start;
	struct page *page;
	int scan;
	int error;

	start = 0;
	rcu_read_lock();
restart1:
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
		page = radix_tree_deref_slot(slot);
		if (!page || radix_tree_exception(page)) {
			if (radix_tree_deref_retry(page))
				goto restart1;
			goto continue1;
		}

		if (page_count(page) - page_mapcount(page) == 1)
			goto continue1;

		spin_lock_irq(&mapping->tree_lock);
		radix_tree_tag_set(&mapping->page_tree, iter.index,
						SHMEM_TAG_PINNED);
		spin_unlock_irq(&mapping->tree_lock);
continue1:
		if (need_resched()) {
			cond_resched_rcu();
			start = iter.index + 1;
			goto restart1;
		}
	}
	rcu_read_unlock();

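	/*
	 * Rescan the tagged pages, clearing each tag once the extra
	 * references have gone: drain the per-cpu LRU caches before the
	 * first rescan, then sleep (HZ << scan) / 200 jiffies before each
	 * later one, i.e. about 10, 20, 40 and 80ms.
	 */
	error = 0;
	for (scan = 0; scan <= LAST_SCAN; scan++) {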
		if (!radix_tree_tagged(&mapping->page_tree, SHMEM_TAG_PINNED))
			break;

		if (!scan)
			lru_add_drain_all();
		else if (schedule_timeout_killable((HZ << scan) / 200))
			scan = LAST_SCAN;

		start = 0;
		rcu_read_lock();
restart2:
		radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter,
						start, SHMEM_TAG_PINNED) {
			page = radix_tree_deref_slot(slot);
			if (radix_tree_exception(page)) {
				if (radix_tree_deref_retry(page))
					goto restart2;
				page = NULL;
			}

			if (page &&
			    page_count(page) - page_mapcount(page) != 1) {
				if (scan < LAST_SCAN)
					goto continue2;
				/*
				 * On the last scan, we should probably clean
				 * up all those tags we inserted; but make a
				 * note that we still found pages pinned.
				 */
				error = -EBUSY;
			}

			spin_lock_irq(&mapping->tree_lock);
			radix_tree_tag_clear(&mapping->page_tree,
						iter.index, SHMEM_TAG_PINNED);
			spin_unlock_irq(&mapping->tree_lock);
continue2:
			if (need_resched()) {
				cond_resched_rcu();
				start = iter.index + 1;
				goto restart2;
			}
		}
		rcu_read_unlock();
	}

	return error;
}
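
The pin test in both loops rests on simple reference arithmetic, which
the following toy model spells out.  It is only a sketch: struct
toy_page and the toy_* helpers are hypothetical stand-ins, not kernel
types, and the two fields merely mimic what page_count() and
page_mapcount() would report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: only the two counters that matter here. */
struct toy_page {
	int count;	/* total references, as page_count() would report */
	int mapcount;	/* page-table mappings, as page_mapcount() would report */
};

/*
 * A page-cache page holds one reference for the cache itself plus one
 * per mapping.  Anything beyond that is somebody else's pin.
 */
static int toy_extra_refs(const struct toy_page *page)
{
	assert(page != NULL);
	return page->count - page->mapcount - 1;
}

/* Equivalent to the posted test: count - mapcount != 1. */
static bool toy_page_is_pinned(const struct toy_page *page)
{
	return toy_extra_refs(page) != 0;
}
```

A page with count 3 and mapcount 2 is unpinned (cache reference plus
two mappings), while count 4 with mapcount 2 leaves one unexplained
reference and would be tagged.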

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-05-19 11:44         ` David Herrmann
@ 2014-06-02  4:42           ` Minchan Kim
  -1 siblings, 0 replies; 53+ messages in thread
From: Minchan Kim @ 2014-06-02  4:42 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Tony Battersby, Al Viro, Jan Kara, Michael Kerrisk,
	Ryan Lortie, Linus Torvalds, Andrew Morton, linux-mm,
	linux-fsdevel, linux-kernel, Johannes Weiner, Tejun Heo,
	Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, Peter Zijlstra

Hello,

On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> Hi
> 
> On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> > The aspect which really worries me is this: the maintenance burden.
> > This approach would add some peculiar new code, introducing a rare
> > special case: which we might get right today, but will very easily
> > forget tomorrow when making some other changes to mm.  If we compile
> > a list of danger areas in mm, this would surely belong on that list.
> 
> I tried doing the page-replacement in the last 4 days, but honestly,
> it's far more complex than I thought. So if no-one more experienced
> with mm/ comes up with a simple implementation, I'll have to delay
> this for some more weeks.
> 
> However, I still wonder why we try to fix this as part of this
> patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> amount of time. Same is true for network block-devices, NFS, iscsi,
> maybe loop-devices, ... This means, _any_ once mapped page can be
> written to after an arbitrary delay. This can break any feature that
> makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> sealing, ..).
> 
> Shouldn't we try to fix the _cause_ of this?

I didn't follow this patchset and couldn't find what your main concern is,
but at first glance it seems you are having trouble with pinned pages.
If so, that's a really big problem for CMA too, and peterz's approach
(i.e. mm_mpin) really makes sense to me.

https://lkml.org/lkml/2014/5/26/340


> 
> Isn't there a simple way to lock/mark/.. affected vmas in
> get_user_pages(_fast)() and release them once done? We could increase
> i_mmap_writable on all affected address_space and decrease it on
> release. This would at least prevent sealing and could be checked on
> other operations, too (like setting S_IMMUTABLE).
> This should be as easy as checking page_mapping(page) != NULL and then
> adjusting ->i_mmap_writable in
> get_writable_user_pages/put_writable_user_pages, right?
> 
> Thanks
> David

-- 
Kind regards,
Minchan Kim

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-06-02  4:42           ` Minchan Kim
@ 2014-06-02  9:14             ` Jan Kara
  -1 siblings, 0 replies; 53+ messages in thread
From: Jan Kara @ 2014-06-02  9:14 UTC (permalink / raw)
  To: Minchan Kim
  Cc: David Herrmann, Hugh Dickins, Tony Battersby, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, Peter Zijlstra

On Mon 02-06-14 13:42:59, Minchan Kim wrote:
> Hello,
> 
> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
> > Hi
> > 
> > On Thu, May 15, 2014 at 12:35 AM, Hugh Dickins <hughd@google.com> wrote:
> > > The aspect which really worries me is this: the maintenance burden.
> > > This approach would add some peculiar new code, introducing a rare
> > > special case: which we might get right today, but will very easily
> > > forget tomorrow when making some other changes to mm.  If we compile
> > > a list of danger areas in mm, this would surely belong on that list.
> > 
> > I tried doing the page-replacement in the last 4 days, but honestly,
> > it's far more complex than I thought. So if no-one more experienced
> > with mm/ comes up with a simple implementation, I'll have to delay
> > this for some more weeks.
> > 
> > However, I still wonder why we try to fix this as part of this
> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
> > amount of time. Same is true for network block-devices, NFS, iscsi,
> > maybe loop-devices, ... This means, _any_ once mapped page can be
> > written to after an arbitrary delay. This can break any feature that
> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
> > sealing, ..).
> > 
> > Shouldn't we try to fix the _cause_ of this?
> 
> I didn't follow this patchset and couldn't find what your main concern is,
> but at first glance it seems you are having trouble with pinned pages.
> If so, that's a really big problem for CMA too, and peterz's approach
> (i.e. mm_mpin) really makes sense to me.
  Well, his concern is pinned pages (and also pages used for direct IO and
similar): not because they are pinned, but because they can be modified
while someone holds a reference to them. So I'm not sure Peter's patches
will help here.
 
> https://lkml.org/lkml/2014/5/26/340

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 1/3] shm: add sealing API
  2014-05-23 16:37       ` David Herrmann
@ 2014-06-02 10:30         ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-06-02 10:30 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Tony Battersby, Andy Lutomirski, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Fri, 23 May 2014, David Herrmann wrote:
> 
> i_mmap_mutex is the only per-object lock that is taken in the mmap()
> path and all vma_link() users can easily be changed to deal with
> errors. So I think it should be easy to make __vma_link_file() fail if
> no writable mappings are allowed. Testing for shmem-seals seems odd
> here, indeed. We could instead make i_mmap_writable work like
> i_writecount. If it's negative, no new writable mappings are allowed.
> shmem_set_seals() could then decrement it to <0 and __vma_link_file()
> just tests whether it's negative. Comments?

i_mmap_mutex is certainly the right lock, and I'm happy with making
i_mmap_writable use the negative like i_writecount if that helps.
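
A toy model of that counter convention, to make the three states
concrete.  Everything here is an assumption for illustration: struct
toy_mapping, the toy_* function names and the choice of -EPERM and
-EBUSY are not from any posted patch, and where the kernel would hold
i_mmap_mutex this single-threaded sketch takes no lock.

```c
#include <assert.h>
#include <errno.h>

/*
 * > 0: number of writable shared mappings
 *   0: no writable mappings, not denied
 * < 0: writable mappings denied (object sealed)
 */
struct toy_mapping {
	int mmap_writable;
};

/* mmap() side: refuse a new writable mapping once the count is negative. */
static int toy_map_writable(struct toy_mapping *m)
{
	if (m->mmap_writable < 0)
		return -EPERM;
	m->mmap_writable++;
	return 0;
}

/* munmap() side: drop one writable mapping. */
static void toy_unmap_writable(struct toy_mapping *m)
{
	assert(m->mmap_writable > 0);
	m->mmap_writable--;
}

/* Sealing side: only possible while no writable mapping exists. */
static int toy_deny_writable(struct toy_mapping *m)
{
	if (m->mmap_writable > 0)
		return -EBUSY;
	if (m->mmap_writable == 0)
		m->mmap_writable--;	/* 0 -> -1; already-denied stays put */
	return 0;
}
```

The point of the convention is that a single integer, checked under
one lock, answers both "may I map this writable?" and "may I seal
this?" without either side needing to know about the other's flags.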

But I have to confess that I'm annoyingly stalled on this.  The part
I do not like (although I suggested it) is giving an error return to
__vma_link_file() and hence to vma_link().

Because successful return from file->f_op->mmap() is supposed to be
mmap's point of no return, and if we allow vma_link() to fail, then the
file->f_op->mmap() ought to be undone in a way never needed before.

Now, you know and I know that the vma_link() can only fail on sealed
shmem objects, and shmem_mmap() doesn't do anything that we need to
recover from (I don't think we need worry too much about the atime
update in file_accessed()).  But error from vma_link() does set a
trap (or a puzzle) for the unwary, and I'd prefer to avoid it.
We can comment it, but it still feels dirty.

I'm inclined to say that your shmem_mmap() (which already checks
sealed against shared) ought to manage i_mmap_writable itself (under
i_mmap_mutex); but then we need a funny little VM_flag for shmem_mmap()
to tell __vma_link_file() that i_mmap_writable++ has already been done;
or else some dance of ->opens and ->closes to keep its accounting right.

As I say, I am annoyingly stalled on this: so I'd better just let you
get on with it, and see how I feel about whatever you come up with.

> >
> > There is also, or may be, a small issue of sparse (holey) files.
> > I do have a question on that in comments on your next patch, and
> > the answer here may depend on what you want in memfd_create().
> >
> > What I'm thinking of here is that once a sparse file is sealed
> > against writing, we must be sure not to give an error when reading
> > its holes: whereas there are a few unlikely ways in which reading
> > the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.
> >
> > Most of the memory allocations here can in fact only fail when the
> > allocating process has already been selected for OOM-kill: that is
> > not guaranteed forever, but it is how __alloc_pages_slowpath()
> > currently behaves on ordinary low-order allocations, and will be
> > hard to change if we ever do so.  Though I dislike relying upon
> > this, I think we can allow reading holes to fail, if the process
> > is going to be forcibly killed before it returns to userspace.
> >
> > But there might still be an issue with vm_enough_memory(),
> > and there might still be an issue with memcg limits.
> >
> > We do already use the ZERO_PAGE instead of allocating when it's a
> > simple read; and on the face of it, we could extend that to mmap
> > once the file is sealed.  But I am rather afraid to do so - for
> > many years there was an mmap /dev/zero case which did that, but
> > it was an easily forgotten case which caught us out at least
> > once, so I'm reluctant to reintroduce it now for sealing.
> >
> > Anyway, I don't expect you to resolve the issue of sealed holes:
> > that's very much my territory, to give you support on.
> 
> Why not require users to use mlock() if they want to protect
> themselves against OOM situations? At least the man-page says that
> mlock() guarantees that all pages in the specified range are loaded. I
> didn't verify whether that includes holes, though. And if
> RLIMIT_MEMLOCK is too small, users ought to access the object in
> smaller chunks.

Fair enough.
mlock() does instantiate the holes, in shmem's case at least.
mlock() is an mm operation, whereas in general we have a file here,
which is not necessarily mmap'ed.  It's a pity to ask the user to
mmap+mlock to achieve that effect; but okay, that does the job.
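
For reference, the mmap+mlock sequence looks roughly like this from
userspace.  This is a hedged sketch, not a tested recipe for memfd
objects: lock_sparse_file() is a hypothetical helper, and a tmpfile()
stands in for the memfd, so on a filesystem other than tmpfs the
locked zero pages may live only in the page cache.  It uses mincore()
just to observe that every page, holes included, is resident after
the mlock().

```c
#define _DEFAULT_SOURCE		/* for mincore() */
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Make a sparse temporary file of npages pages (all holes), map it
 * shared and read-only, mlock() the mapping, then ask mincore() how
 * many pages are resident.  Returns that count, or a negative errno
 * value if any step fails (for example RLIMIT_MEMLOCK too small).
 */
static long lock_sparse_file(size_t npages)
{
	unsigned char vec[64];
	long psz = sysconf(_SC_PAGESIZE);
	long resident = 0;
	size_t len, i;
	FILE *f;
	void *p;
	int err = 0;

	if (psz <= 0 || npages == 0 || npages > sizeof(vec))
		return -EINVAL;
	len = npages * (size_t)psz;

	f = tmpfile();				/* stand-in for a memfd object */
	if (!f)
		return -errno;
	if (ftruncate(fileno(f), (off_t)len) < 0) {
		err = errno;
		goto out_close;
	}

	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fileno(f), 0);
	if (p == MAP_FAILED) {
		err = errno;
		goto out_close;
	}

	if (mlock(p, len) < 0 || mincore(p, len, vec) < 0) {
		err = errno;
		goto out_unmap;
	}
	for (i = 0; i < npages; i++)
		resident += vec[i] & 1;		/* bit 0: page is resident */

out_unmap:
	munmap(p, len);				/* also drops the mlock */
out_close:
	fclose(f);
	return err ? -err : resident;
}
```

A caller that gets back fewer pages than it asked for, or a negative
value, knows the range could not be fully instantiated and can fall
back to locking smaller chunks.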

> And it's not specific to sparse files. Any other page may be swapped
> out and the swap-in can fail due to ENOMEM (page-table allocations,
> tree-inserts, and so on). But you definitely know better what to do
> here, so suggestions welcome.

You're right that OOM can hit you, even when just swapping in a page
that was properly instantiated before.  But those pages are better
accounted than holes: I still feel that the holes could be seen as
a sealed bomb, which explodes into OOM when read by the caller.

> 
> Anyway, sealing is not meant to protect against OOM situations. I
> mean, any mapping is subject to OOM, so processes that care should
> have a suitable infrastructure via SIGBUS or mlock() for all mappings,
> including sealed files. Furthermore, write-sealing is meant to prevent
> targeted attacks that modify data while it is being parsed. We
> properly protect users against that. OOM is an orthogonal issue, imho.

But I'm happy to hear that OOM doesn't trouble you, that you see it
as orthogonal.  Sealing does prompt me again to look into reworking
the issue of sparse files (never well handled in shmem), but from
what you say that's not urgent - a relief to both of us, thank you.

Hugh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 1/3] shm: add sealing API
@ 2014-06-02 10:30         ` Hugh Dickins
  0 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-06-02 10:30 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Tony Battersby, Andy Lutomirski, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Fri, 23 May 2014, David Herrmann wrote:
> 
> i_mmap_mutex is the only per-object lock that is taken in the mmap()
> path and all vma_link() users can easily be changed to deal with
> errors. So I think it should be easy to make __vma_link_file() fail if
> no writable mappings are allowed. Testing for shmem-seals seems odd
> here, indeed. We could instead make i_mmap_writable work like
> i_writecount. If it's negative, no new writable mappings are allowed.
> shmem_set_seals() could then decrement it to <0 and __vma_link_file()
> just tests whether it's negative. Comments?

i_mmap_mutex is certainly the right lock, and I'm happy with making
i_mmap_writable use the negative like i_writecount if that helps.

But I have to confess that I'm annoyingly stalled on this.  The part
I do not like (although I suggested it) is giving an error return to
__vma_link_file() and hence to vma_link().

Because successful return from file->f_op->mmap() is supposed to be
mmap's point of no return, and if we allow vma_link() to fail, then the
file->f_op->mmap() ought to be undone in a way never needed before.

Now, you know and I know that the vma_link() can only fail on sealed
shmem objects, and shmem_mmap() doesn't do anything that we need to
recover from (I don't think we need worry too much about the atime
update in file_accessed()).  But error from vma_link() does set a
trap (or a puzzle) for the unwary, and I'd prefer to avoid it.
We can comment it, but it still feels dirty.

I'm inclined to say that your shmem_mmap() (which already checks
sealed against shared) ought to manage i_mmap_writable itself (under
i_mmap_mutex); but then we need a funny little VM_flag for shmem_mmap()
to tell __vma_link_file() that i_mmap_writable++ has already been done;
or else some dance of ->opens and ->closes to keep its accounting right.

As I say, I am annoyingly stalled on this: so I'd better just let you
get on with it, and see how I feel about whatever you come up with.
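
To make the counting scheme concrete, here is a minimal userspace sketch of an i_mmap_writable that behaves like i_writecount. It is a model, not kernel code: the struct, the helper names and the error codes (-EPERM, -EBUSY) are assumptions made for illustration, and the i_mmap_mutex locking is left out.

```c
#include <assert.h>
#include <errno.h>

/*
 * Userspace model only. The struct and helper names are hypothetical;
 * in the kernel these updates would run under i_mmap_mutex.
 */
struct mapping_model {
	int i_mmap_writable;	/* > 0: writable mappings exist, < 0: denied */
};

/* A new shared writable mapping is linked (the __vma_link_file() step). */
static int model_map_writable(struct mapping_model *m)
{
	if (m->i_mmap_writable < 0)
		return -EPERM;	/* assumed error code for a sealed object */
	m->i_mmap_writable++;
	return 0;
}

/* A shared writable mapping goes away. */
static void model_unmap_writable(struct mapping_model *m)
{
	m->i_mmap_writable--;
}

/* A write seal is being set (the shmem_set_seals() step). */
static int model_deny_writable(struct mapping_model *m)
{
	if (m->i_mmap_writable > 0)
		return -EBUSY;	/* writable mappings still exist */
	m->i_mmap_writable--;	/* drop below zero: refuse new writable mappings */
	return 0;
}

/* Walk through one map / seal / remap sequence; returns 0 on success. */
static int model_demo(void)
{
	struct mapping_model m = { 0 };

	assert(model_map_writable(&m) == 0);		/* first writer maps */
	assert(model_deny_writable(&m) == -EBUSY);	/* cannot seal yet */
	model_unmap_writable(&m);
	assert(model_deny_writable(&m) == 0);		/* seal succeeds */
	assert(model_map_writable(&m) == -EPERM);	/* later writers refused */
	return 0;
}
```

The model also shows where the discomfort above comes from: model_map_writable() can fail, and in the kernel that failure would surface after file->f_op->mmap() has already returned successfully.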

> >
> > There is also, or may be, a small issue of sparse (holey) files.
> > I do have a question on that in comments on your next patch, and
> > the answer here may depend on what you want in memfd_create().
> >
> > What I'm thinking of here is that once a sparse file is sealed
> > against writing, we must be sure not to give an error when reading
> > its holes: whereas there are a few unlikely ways in which reading
> > the holes of a sparse tmpfs file can give -ENOMEM or -ENOSPC.
> >
> > Most of the memory allocations here can in fact only fail when the
> > allocating process has already been selected for OOM-kill: that is
> > not guaranteed forever, but it is how __alloc_pages_slowpath()
> > currently behaves on ordinary low-order allocations, and will be
> > hard to change if we ever do so.  Though I dislike relying upon
> > this, I think we can allow reading holes to fail, if the process
> > is going to be forcibly killed before it returns to userspace.
> >
> > But there might still be an issue with vm_enough_memory(),
> > and there might still be an issue with memcg limits.
> >
> > We do already use the ZERO_PAGE instead of allocating when it's a
> > simple read; and on the face of it, we could extend that to mmap
> > once the file is sealed.  But I am rather afraid to do so - for
> > many years there was an mmap /dev/zero case which did that, but
> > it was an easily forgotten case which caught us out at least
> > once, so I'm reluctant to reintroduce it now for sealing.
> >
> > Anyway, I don't expect you to resolve the issue of sealed holes:
> > that's very much my territory, to give you support on.
> 
> Why not require users to use mlock() if they want to protect
> themselves against OOM situations? At least the man-page says that
> mlock() guarantees that all pages in the specified range are loaded. I
> didn't verify whether that includes holes, though. And if
> RLIMIT_MEMLOCK is too small, users ought to access the object in
> smaller chunks.

Fair enough.
mlock() does instantiate the holes, in shmem's case at least.
mlock() is an mm operation, whereas in general we have a file here,
which is not necessarily mmap'ed.  It's a pity to ask the user to
mmap+mlock to achieve that effect; but okay, that does the job.
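
A rough sketch of the mmap+mlock approach, for illustration only. It assumes the two-argument memfd_create(name, flags) wrapper that later became available in glibc, not the v2 signature discussed in this thread; sparse_memfd() and map_and_lock() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unsealed memfd of `len` bytes that consists of a single hole. */
static int sparse_memfd(size_t len)
{
	int fd = memfd_create("sparse-demo", MFD_CLOEXEC);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}

/*
 * Map `len` bytes of `fd` read-only and mlock() the range so that every
 * page, holes included, is faulted in up front instead of on first read.
 * Returns the mapping, or NULL with errno set. mlock() can fail with
 * ENOMEM, EPERM or EAGAIN when RLIMIT_MEMLOCK or memory is exhausted.
 */
static void *map_and_lock(int fd, size_t len)
{
	void *p;

	assert(len > 0);	/* mmap() rejects zero-length mappings */
	p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return NULL;
	if (mlock(p, len) < 0) {
		int saved = errno;

		munmap(p, len);
		errno = saved;
		return NULL;
	}
	return p;
}
```

With this pattern any allocation failure is reported by mlock() at a point where the caller can still handle it, rather than as a fault while the data is being parsed. As noted above, a caller with a small RLIMIT_MEMLOCK would have to work through the object in smaller chunks.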

> And it's not specific to sparse files. Any other page may be swapped
> out and the swap-in can fail due to ENOMEM (page-table allocations,
> tree-inserts, and so on). But you definitely know better what to do
> here, so suggestions welcome.

You're right that OOM can hit you, even when just swapping in a page
that was properly instantiated before.  But those pages are better
accounted than holes: I still feel that the holes could be seen as
a sealed bomb, which explodes into OOM when read by the caller.

> 
> Anyway, sealing is not meant to protect against OOM situations. I
> mean, any mapping is subject to OOM, so processes that care should
> have a suitable infrastructure via SIGBUS or mlock() for all mappings,
> including sealed files. Furthermore, write-sealing is meant to prevent
> targeted attacks that modify data while it is being parsed. We
> properly protect users against that. OOM is an orthogonal issue, imho.

But I'm happy to hear that OOM doesn't trouble you, that you see it
as orthogonal.  Sealing does prompt me again to look into reworking
the issue of sparse files (never well handled in shmem), but from
what you say that's not urgent - a relief to both of us, thank you.

Hugh

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-05-23 16:57       ` David Herrmann
@ 2014-06-02 10:59         ` Hugh Dickins
  -1 siblings, 0 replies; 53+ messages in thread
From: Hugh Dickins @ 2014-06-02 10:59 UTC (permalink / raw)
  To: David Herrmann
  Cc: Hugh Dickins, Tony Battersby, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Fri, 23 May 2014, David Herrmann wrote:
> On Tue, May 20, 2014 at 4:20 AM, Hugh Dickins <hughd@google.com> wrote:
> >
> > What is a front-FD?
> 
> With 'front-FD' I refer to things like dma-buf: They allocate a
> file-descriptor which is just a wrapper around a kernel-internal FD.
> For instance, DRM-gem buffers exported as dma-buf. fops on the dma-buf
> are forwarded to the shmem-fd of the given gem-object, but any access
> to the inode of the dma-buf fd is a no-op as the dma-buf fd uses
> anon-inode, not the shmem-inode.
> 
> A previous revision of memfd used something like that, but that was
> inherently racy.

Thanks for explaining: then I guess you can leave "front-FD" out of the
description next time around, in case there are others like me who are
more mystified than enlightened by it.

> > But this does highlight how the "size" arg to memfd_create() is
> > perhaps redundant.  Why give a size there, when size can be changed
> > afterwards?  I expect your answer is that many callers want to choose
> > the size at the beginning, and would prefer to avoid the extra call.
> > I'm not sure if that's a good enough reason for a redundant argument.
> 
> At one point in time we might be required to support atomic-sealing.
> So a memfd_create() call takes the initial seals as upper 32bits in
> "flags" and sets them before returning the object. If these seals
> contain SEAL_GROW/SHRINK, we must pass the size during setup (think
> CLOEXEC with fork()).

That does sound like over-design to me.  You stop short of passing
in an optional buffer of the data it's to contain, good.

I think it would be a clearer interface without the size, but really
that's an issue for the linux-api people you'll be Cc'ing next time.

You say "think CLOEXEC with fork()": you have thought about this, I
have not, please spell out for me what the atomic size guards against.
Do you want an fd that's not shared across fork?

> 
> Note that we spent a lot of time discussing whether such
> atomic-sealing is necessary and no-one came up with a real race so
> far. Therefore, I didn't include that. But especially if we add new
> seals (like SHMEM_SEAL_OPEN, which I still think is not needed and
> just hides real problems), we might at one point be required to
> support that. That's also the reason why "flags" is 64bits.
> 
> One might argue that we can just add memfd_create2() once that
> happens, but I didn't see any harm in including "size" and making them
> 64bit.

I've not noticed another system call with 64-bit flags, it does seem
over the top to me: the familiar ones all use int.  But again,
a matter for linux-api not for me.

Hugh

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-06-02  9:14             ` Jan Kara
@ 2014-06-02 16:04               ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-06-02 16:04 UTC (permalink / raw)
  To: Jan Kara
  Cc: Minchan Kim, Hugh Dickins, Tony Battersby, Al Viro,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers, Peter Zijlstra

Hi

On Mon, Jun 2, 2014 at 11:14 AM, Jan Kara <jack@suse.cz> wrote:
> On Mon 02-06-14 13:42:59, Minchan Kim wrote:
>> On Mon, May 19, 2014 at 01:44:25PM +0200, David Herrmann wrote:
>> > I tried doing the page-replacement in the last 4 days, but honestly,
>> > it's far more complex than I thought. So if no-one more experienced
>> > with mm/ comes up with a simple implementation, I'll have to delay
>> > this for some more weeks.
>> >
>> > However, I still wonder why we try to fix this as part of this
>> > patchset. Using FUSE, a DIRECT-IO call can be delayed for an arbitrary
>> > amount of time. Same is true for network block-devices, NFS, iscsi,
>> > maybe loop-devices, ... This means, _any_ once mapped page can be
>> > written to after an arbitrary delay. This can break any feature that
>> > makes FS objects read-only (remounting read-only, setting S_IMMUTABLE,
>> > sealing, ..).
>> >
>> > Shouldn't we try to fix the _cause_ of this?
>>
>> I didn't follow this patchset and couldn't find what your main concern is,
>> but at first glance it seems you are having trouble with pinned pages.
>> If so, that's a really big problem for CMA, and I think peterz's approach
>> (i.e., mm_mpin) really makes sense to me.
>   Well, his concern is pinned pages (and also pages used for direct IO and
> similar), not because they are pinned but because they can be modified
> while someone holds a reference to them. So I'm not sure Peter's patches will
> help here.

Correct, the problem is not accounting for pinned-pages, but waiting
for them to get released. Furthermore, Peter's patches make VM_PINNED
an optional feature, so we'd still miss all the short-term GUP users.
Sadly, that means we cannot even use it to test for pending GUP users.

Thanks
David

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-06-02 10:59         ` Hugh Dickins
@ 2014-06-02 17:50           ` Andy Lutomirski
  -1 siblings, 0 replies; 53+ messages in thread
From: Andy Lutomirski @ 2014-06-02 17:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: David Herrmann, Tony Battersby, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Mon, Jun 2, 2014 at 3:59 AM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 23 May 2014, David Herrmann wrote:
>> On Tue, May 20, 2014 at 4:20 AM, Hugh Dickins <hughd@google.com> wrote:
>> >
>> > What is a front-FD?
>>
>> With 'front-FD' I refer to things like dma-buf: They allocate a
>> file-descriptor which is just a wrapper around a kernel-internal FD.
>> For instance, DRM-gem buffers exported as dma-buf. fops on the dma-buf
>> are forwarded to the shmem-fd of the given gem-object, but any access
>> to the inode of the dma-buf fd is a no-op as the dma-buf fd uses
>> anon-inode, not the shmem-inode.
>>
>> A previous revision of memfd used something like that, but that was
>> inherently racy.
>
> Thanks for explaining: then I guess you can leave "front-FD" out of the
> description next time around, in case there are others like me who are
> more mystified than enlightened by it.
>
>> > But this does highlight how the "size" arg to memfd_create() is
>> > perhaps redundant.  Why give a size there, when size can be changed
>> > afterwards?  I expect your answer is that many callers want to choose
>> > the size at the beginning, and would prefer to avoid the extra call.
>> > I'm not sure if that's a good enough reason for a redundant argument.
>>
>> At one point in time we might be required to support atomic-sealing.
>> So a memfd_create() call takes the initial seals as upper 32bits in
>> "flags" and sets them before returning the object. If these seals
>> contain SEAL_GROW/SHRINK, we must pass the size during setup (think
>> CLOEXEC with fork()).
>
> That does sound like over-design to me.  You stop short of passing
> in an optional buffer of the data it's to contain, good.
>
> I think it would be a clearer interface without the size, but really
> that's an issue for the linux-api people you'll be Cc'ing next time.

I agree that the interface is more orthogonal without size, but I
suspect that every single user of memfd_create will follow up with an
immediate ftruncate or fallocate.  That being said, maybe it's better
to leave size out so that users have to think about whether to use
ftruncate or fallocate.
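
For illustration, the two follow-up calls might look like this. The sketch assumes a size-less memfd_create(name, flags) as suggested above (this is the form glibc later exposed); memfd_sparse(), memfd_prealloc(), fail_close() and fd_size() are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Close fd while preserving errno, then report failure. */
static int fail_close(int fd)
{
	int saved = errno;

	close(fd);
	errno = saved;
	return -1;
}

/*
 * Size the object with ftruncate(): the file is one large hole and
 * pages are allocated lazily on first touch.
 */
static int memfd_sparse(const char *name, off_t size)
{
	int fd;

	assert(size > 0);
	fd = memfd_create(name, MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0)
		return fail_close(fd);
	return fd;
}

/*
 * Size the object with fallocate(): backing pages are allocated right
 * away, so ENOSPC or ENOMEM shows up here rather than on a later access.
 */
static int memfd_prealloc(const char *name, off_t size)
{
	int fd;

	assert(size > 0);
	fd = memfd_create(name, MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (fallocate(fd, 0, 0, size) < 0)
		return fail_close(fd);
	return fd;
}

/* Logical size of the object, or -1 on error. */
static off_t fd_size(int fd)
{
	struct stat st;

	return fstat(fd, &st) < 0 ? -1 : st.st_size;
}
```

Both helpers report the same st_size afterwards; they differ in when the memory is charged, which is the choice a size-less memfd_create() would force callers to make explicitly.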

>
> You say "think CLOEXEC with fork()": you have thought about this, I
> have not, please spell out for me what the atomic size guards against.
> Do you want an fd that's not shared across fork?
>
>>
>> Note that we spent a lot of time discussing whether such
>> atomic-sealing is necessary and no-one came up with a real race so
>> far. Therefore, I didn't include that. But especially if we add new
>> seals (like SHMEM_SEAL_OPEN, which I still think is not needed and
>> just hides real problems), we might at one point be required to
>> support that. That's also the reason why "flags" is 64bits.
>>
>> One might argue that we can just add memfd_create2() once that
>> happens, but I didn't see any harm in including "size" and making them
>> 64bit.
>
> I've not noticed another system call with 64-bit flags, it does seem
> over the top to me: the familiar ones all use int.  But again,
> a matter for linux-api not for me.

I think that 64-bit flags are barely better than just having two flags
arguments: 64-bit syscall arguments take up two slots on 32-bit
architectures, so they don't save any space.  (They save a tiny amount
of time on 64-bit architectures.)
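
As a small illustration of the two-slot point: the sketch below packs seals into the upper 32 bits of a 64-bit "flags" value, as floated for v2, and then splits that value into the two 32-bit halves a 32-bit syscall ABI would have to carry. pack_flags(), split_arg(), join_arg() and struct arg_slots are hypothetical names; which half travels in which register is architecture-specific and not modelled here.

```c
#include <stdint.h>

/* Hypothetical v2-style layout: seals in the upper 32 bits of "flags". */
static uint64_t pack_flags(uint32_t mfd_flags, uint32_t seals)
{
	return ((uint64_t)seals << 32) | mfd_flags;
}

/* One 64-bit argument as it occupies two 32-bit argument slots. */
struct arg_slots {
	uint32_t lo;	/* bits 0..31 */
	uint32_t hi;	/* bits 32..63 */
};

/* Split a 64-bit argument into its two 32-bit halves. */
static struct arg_slots split_arg(uint64_t arg)
{
	struct arg_slots s = { (uint32_t)arg, (uint32_t)(arg >> 32) };

	return s;
}

/* Reassemble the 64-bit argument on the receiving side. */
static uint64_t join_arg(struct arg_slots s)
{
	return ((uint64_t)s.hi << 32) | s.lo;
}
```

Seen this way, a 64-bit "flags" on a 32-bit architecture is equivalent to passing the two halves as separate arguments, so it uses exactly as many slots as two plain 32-bit flags arguments would.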

>
> Hugh



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 0/3] File Sealing & memfd_create()
  2014-06-02 16:04               ` David Herrmann
  (?)
@ 2014-06-03  8:31               ` Peter Zijlstra
  -1 siblings, 0 replies; 53+ messages in thread
From: Peter Zijlstra @ 2014-06-03  8:31 UTC (permalink / raw)
  To: David Herrmann
  Cc: Jan Kara, Minchan Kim, Hugh Dickins, Tony Battersby, Al Viro,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

On Mon, Jun 02, 2014 at 06:04:02PM +0200, David Herrmann wrote:
> Correct, the problem is not accounting for pinned-pages, but waiting
> for them to get released. Furthermore, Peter's patches make VM_PINNED
> an optional feature, so we'd still miss all the short-term GUP users.
> Sadly, that means we cannot even use it to test for pending GUP users.

Right, I'm not bothered about temporary pins, they'll go away quickly
and generally not bother reclaim and the like (much). The thing I
'worry' about is the persistent pins, which have unbounded life spans.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* Re: [PATCH v2 2/3] shm: add memfd_create() syscall
  2014-06-02 10:59         ` Hugh Dickins
@ 2014-06-13 10:42           ` David Herrmann
  -1 siblings, 0 replies; 53+ messages in thread
From: David Herrmann @ 2014-06-13 10:42 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Tony Battersby, Andy Lutomirski, Al Viro, Jan Kara,
	Michael Kerrisk, Ryan Lortie, Linus Torvalds, Andrew Morton,
	linux-mm, linux-fsdevel, linux-kernel, Johannes Weiner,
	Tejun Heo, Greg Kroah-Hartman, John Stultz, Kristian Hogsberg,
	Lennart Poettering, Daniel Mack, Kay Sievers

Hi

On Mon, Jun 2, 2014 at 12:59 PM, Hugh Dickins <hughd@google.com> wrote:
> On Fri, 23 May 2014, David Herrmann wrote:
>> On Tue, May 20, 2014 at 4:20 AM, Hugh Dickins <hughd@google.com> wrote:
>> > But this does highlight how the "size" arg to memfd_create() is
>> > perhaps redundant.  Why give a size there, when size can be changed
>> > afterwards?  I expect your answer is that many callers want to choose
>> > the size at the beginning, and would prefer to avoid the extra call.
>> > I'm not sure if that's a good enough reason for a redundant argument.
>>
>> At one point in time we might be required to support atomic-sealing.
>> So a memfd_create() call takes the initial seals as upper 32bits in
>> "flags" and sets them before returning the object. If these seals
>> contain SEAL_GROW/SHRINK, we must pass the size during setup (think
>> CLOEXEC with fork()).
>
> That does sound like over-design to me.  You stop short of passing
> in an optional buffer of the data it's to contain, good.
>
> I think it would be a clearer interface without the size, but really
> that's an issue for the linux-api people you'll be Cc'ing next time.
>
> You say "think CLOEXEC with fork()": you have thought about this, I
> have not, please spell out for me what the atomic size guards against.
> Do you want an fd that's not shared across fork?

My thinking was:
Imagine a seal called SEAL_OPEN that protects against open()
(specifically on /proc/self/fd/). That seal obviously has to be set
before creating the object, otherwise there's a race. Therefore, I'd
need a "seals" argument for memfd_create(). Now imagine there's a
similar seal that has such a race but prevents any following resize.
Then I'd have to set the size during initialization, too.

However, in my opinion SEAL_OPEN does not protect against any real
attack (it only protects you from yourself). Therefore, I never added
it. Furthermore, I couldn't think of any similar situation, so I now
removed the "size" argument and made "flags" just an "unsigned int".
It was just a precaution, but I'm fine with dropping it as we cannot
come up with a real possible race.
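
A sketch of how a caller might get a fixed-size object without a "size" argument, assuming the interface as it was later merged: memfd_create(name, flags) with an unsigned int flags, MFD_ALLOW_SEALING, and fcntl() with F_ADD_SEALS, F_SEAL_SHRINK and F_SEAL_GROW. memfd_fixed_size() is a hypothetical helper name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Create a memfd, set its size with ftruncate(), then seal it against
 * any further resize. Returns the fd, or -1 with errno set.
 */
static int memfd_fixed_size(const char *name, off_t size)
{
	int fd;

	assert(size > 0);
	fd = memfd_create(name, MFD_CLOEXEC | MFD_ALLOW_SEALING);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, size) < 0 ||
	    fcntl(fd, F_ADD_SEALS, F_SEAL_SHRINK | F_SEAL_GROW) < 0) {
		int saved = errno;

		close(fd);
		errno = saved;
		return -1;
	}
	return fd;
}
```

The window between memfd_create() and F_ADD_SEALS is only visible to the creating process, since the fd has not been passed to anyone yet. That is consistent with the conclusion above that no real race was found to justify setting the size atomically at creation.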

Sorry for the confusion. I'll send v3 in a minute.

Thanks
David

^ permalink raw reply	[flat|nested] 53+ messages in thread

end of thread, other threads:[~2014-06-13 10:42 UTC | newest]

Thread overview: 53+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-04-15 18:38 [PATCH v2 0/3] File Sealing & memfd_create() David Herrmann
2014-04-15 18:38 ` David Herrmann
2014-04-15 18:38 ` [PATCH v2 1/3] shm: add sealing API David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:16   ` Hugh Dickins
2014-05-20  2:16     ` Hugh Dickins
2014-05-23 16:37     ` David Herrmann
2014-05-23 16:37       ` David Herrmann
2014-06-02 10:30       ` Hugh Dickins
2014-06-02 10:30         ` Hugh Dickins
2014-04-15 18:38 ` [PATCH v2 2/3] shm: add memfd_create() syscall David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:20   ` Hugh Dickins
2014-05-20  2:20     ` Hugh Dickins
2014-05-23 16:57     ` David Herrmann
2014-05-23 16:57       ` David Herrmann
2014-06-02 10:59       ` Hugh Dickins
2014-06-02 10:59         ` Hugh Dickins
2014-06-02 17:50         ` Andy Lutomirski
2014-06-02 17:50           ` Andy Lutomirski
2014-06-13 10:42         ` David Herrmann
2014-06-13 10:42           ` David Herrmann
2014-05-21 10:50   ` Konstantin Khlebnikov
2014-05-21 10:50     ` Konstantin Khlebnikov
2014-04-15 18:38 ` [PATCH v2 3/3] selftests: add memfd_create() + sealing tests David Herrmann
2014-04-15 18:38   ` David Herrmann
2014-05-20  2:22   ` Hugh Dickins
2014-05-20  2:22     ` Hugh Dickins
2014-05-23 17:06     ` David Herrmann
2014-05-23 17:06       ` David Herrmann
2014-05-14  5:09 ` [PATCH v2 0/3] File Sealing & memfd_create() Hugh Dickins
2014-05-14  5:09   ` Hugh Dickins
2014-05-14 16:15   ` Tony Battersby
2014-05-14 16:15     ` Tony Battersby
2014-05-14 22:35     ` Hugh Dickins
2014-05-14 22:35       ` Hugh Dickins
2014-05-19 11:44       ` David Herrmann
2014-05-19 11:44         ` David Herrmann
2014-05-19 16:09         ` Jan Kara
2014-05-19 16:09           ` Jan Kara
2014-05-19 22:11           ` Hugh Dickins
2014-05-19 22:11             ` Hugh Dickins
2014-05-26 11:44             ` David Herrmann
2014-05-26 11:44               ` David Herrmann
2014-05-31  4:44               ` Hugh Dickins
2014-05-31  4:44                 ` Hugh Dickins
2014-06-02  4:42         ` Minchan Kim
2014-06-02  4:42           ` Minchan Kim
2014-06-02  9:14           ` Jan Kara
2014-06-02  9:14             ` Jan Kara
2014-06-02 16:04             ` David Herrmann
2014-06-02 16:04               ` David Herrmann
2014-06-03  8:31               ` Peter Zijlstra