From: Hugh Dickins <hughd@google.com>
To: David Herrmann <dh.herrmann@gmail.com>
Cc: linux-kernel@vger.kernel.org,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Ryan Lortie <desrt@desrt.ca>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, Greg Kroah-Hartman <greg@kroah.com>,
	john.stultz@linaro.org,
	Lennart Poettering <lennart@poettering.net>,
	Daniel Mack <zonque@gmail.com>, Kay Sievers <kay@vrfy.org>,
	Hugh Dickins <hughd@google.com>,
	Andy Lutomirski <luto@amacapital.net>,
	Alexander Viro <viro@zeniv.linux.org.uk>
Subject: Re: [PATCH v4 3/6] shm: add memfd_create() syscall
Date: Wed, 23 Jul 2014 21:19:32 -0700 (PDT)
Message-ID: <alpine.LSU.2.11.1407232112040.991@eggly.anvils>
In-Reply-To: <1405877680-999-4-git-send-email-dh.herrmann@gmail.com>

On Sun, 20 Jul 2014, David Herrmann wrote:

> memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
> that you can pass to mmap(). It can support sealing and avoids any
> connection to user-visible mount-points. Thus, it's not subject to quotas
> on mounted file-systems, and can be used like malloc()'ed memory, but
> with a file-descriptor to it.
> 
> memfd_create() returns the raw shmem file, so calls like ftruncate() can
> be used to modify the underlying inode. Also calls like fstat()
> will return proper information and mark the file as a regular file. If you
> want sealing, you can specify MFD_ALLOW_SEALING. Otherwise, sealing is not
> supported (like on all other regular files).
> 
> Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
> subject to a filesystem size limit. It is still properly accounted to
> memcg limits, though, and to the same overcommit or no-overcommit
> accounting as all user memory.
> 
> Signed-off-by: David Herrmann <dh.herrmann@gmail.com>

Acked-by: Hugh Dickins <hughd@google.com>

It appears to be new-syscall season, and I'm afraid I've delayed
you just long enough for two or three more to interpose themselves
after sys_renameat2.  I think that messiness is best left to akpm,
if he agrees: I expect he would adapt the syscall numbers in the
patch according to what has actually reached Linus by the time he's
ready to send this in.
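
For readers who want to see the behaviour described in the commit message
from the userspace side, here is a minimal sketch. It is an illustration,
not part of the patch. The helper names memfd_make(), memfd_map() and
memfd_seal() are hypothetical. The call goes through syscall(2) on the
assumption that the libc may not provide a wrapper, the MFD_* fallbacks
repeat the values from include/uapi/linux/memfd.h in this patch, and the
F_ADD_SEALS / F_SEAL_SHRINK fallbacks assume the values used by the sealing
API (patch 2/6) as it appears in mainline headers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers; MFD_* values are those in the patch. */
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC		0x0001U
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING	0x0002U
#endif
/* Assumed from the sealing API as found in mainline <linux/fcntl.h>. */
#ifndef F_ADD_SEALS
#define F_ADD_SEALS		(1024 + 9)
#endif
#ifndef F_SEAL_SHRINK
#define F_SEAL_SHRINK		0x0002
#endif

/* Hypothetical helper: create a memfd and give it a size via ftruncate(). */
static int memfd_make(const char *name, size_t size, unsigned int flags)
{
	int fd = (int)syscall(SYS_memfd_create, name, flags);

	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)size) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Hypothetical helper: shared read/write mapping of the whole file. */
static void *memfd_map(int fd, size_t size)
{
	void *p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/*
 * Hypothetical helper: add seals to the file. This is only expected to
 * succeed if the fd was created with MFD_ALLOW_SEALING.
 */
static int memfd_seal(int fd, int seals)
{
	return fcntl(fd, F_ADD_SEALS, seals);
}
```

A caller would typically create the file with
memfd_make("example", size, MFD_CLOEXEC | MFD_ALLOW_SEALING), check with
fstat() that it reports a regular file of the requested size, map it with
memfd_map(), and add F_SEAL_SHRINK through memfd_seal() before handing the
fd to a peer; a later ftruncate() to a smaller size should then be refused.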

> ---
>  arch/x86/syscalls/syscall_32.tbl |  1 +
>  arch/x86/syscalls/syscall_64.tbl |  1 +
>  include/linux/syscalls.h         |  1 +
>  include/uapi/linux/memfd.h       |  8 +++++
>  kernel/sys_ni.c                  |  1 +
>  mm/shmem.c                       | 73 ++++++++++++++++++++++++++++++++++++++++
>  6 files changed, 85 insertions(+)
>  create mode 100644 include/uapi/linux/memfd.h
> 
> diff --git a/arch/x86/syscalls/syscall_32.tbl b/arch/x86/syscalls/syscall_32.tbl
> index d6b8679..e7495b4 100644
> --- a/arch/x86/syscalls/syscall_32.tbl
> +++ b/arch/x86/syscalls/syscall_32.tbl
> @@ -360,3 +360,4 @@
>  351	i386	sched_setattr		sys_sched_setattr
>  352	i386	sched_getattr		sys_sched_getattr
>  353	i386	renameat2		sys_renameat2
> +354	i386	memfd_create		sys_memfd_create
> diff --git a/arch/x86/syscalls/syscall_64.tbl b/arch/x86/syscalls/syscall_64.tbl
> index ec255a1..28be0e1 100644
> --- a/arch/x86/syscalls/syscall_64.tbl
> +++ b/arch/x86/syscalls/syscall_64.tbl
> @@ -323,6 +323,7 @@
>  314	common	sched_setattr		sys_sched_setattr
>  315	common	sched_getattr		sys_sched_getattr
>  316	common	renameat2		sys_renameat2
> +317	common	memfd_create		sys_memfd_create
>  
>  #
>  # x32-specific system call numbers start at 512 to avoid cache impact
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index b0881a0..de00585 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -802,6 +802,7 @@ asmlinkage long sys_timerfd_settime(int ufd, int flags,
>  asmlinkage long sys_timerfd_gettime(int ufd, struct itimerspec __user *otmr);
>  asmlinkage long sys_eventfd(unsigned int count);
>  asmlinkage long sys_eventfd2(unsigned int count, int flags);
> +asmlinkage long sys_memfd_create(const char __user *uname_ptr, unsigned int flags);
>  asmlinkage long sys_fallocate(int fd, int mode, loff_t offset, loff_t len);
>  asmlinkage long sys_old_readdir(unsigned int, struct old_linux_dirent __user *, unsigned int);
>  asmlinkage long sys_pselect6(int, fd_set __user *, fd_set __user *,
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> new file mode 100644
> index 0000000..534e364
> --- /dev/null
> +++ b/include/uapi/linux/memfd.h
> @@ -0,0 +1,8 @@
> +#ifndef _UAPI_LINUX_MEMFD_H
> +#define _UAPI_LINUX_MEMFD_H
> +
> +/* flags for memfd_create(2) (unsigned int) */
> +#define MFD_CLOEXEC		0x0001U
> +#define MFD_ALLOW_SEALING	0x0002U
> +
> +#endif /* _UAPI_LINUX_MEMFD_H */
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index 36441b5..489a4e6 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -197,6 +197,7 @@ cond_syscall(compat_sys_timerfd_settime);
>  cond_syscall(compat_sys_timerfd_gettime);
>  cond_syscall(sys_eventfd);
>  cond_syscall(sys_eventfd2);
> +cond_syscall(sys_memfd_create);
>  
>  /* performance counters: */
>  cond_syscall(sys_perf_event_open);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 51dccd0..770e072 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -66,7 +66,9 @@ static struct vfsmount *shm_mnt;
>  #include <linux/highmem.h>
>  #include <linux/seq_file.h>
>  #include <linux/magic.h>
> +#include <linux/syscalls.h>
>  #include <linux/fcntl.h>
> +#include <uapi/linux/memfd.h>
>  
>  #include <asm/uaccess.h>
>  #include <asm/pgtable.h>
> @@ -2678,6 +2680,77 @@ static int shmem_show_options(struct seq_file *seq, struct dentry *root)
>  	shmem_show_mpol(seq, sbinfo->mpol);
>  	return 0;
>  }
> +
> +#define MFD_NAME_PREFIX "memfd:"
> +#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> +#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> +
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)
> +
> +SYSCALL_DEFINE2(memfd_create,
> +		const char __user *, uname,
> +		unsigned int, flags)
> +{
> +	struct shmem_inode_info *info;
> +	struct file *file;
> +	int fd, error;
> +	char *name;
> +	long len;
> +
> +	if (flags & ~(unsigned int)MFD_ALL_FLAGS)
> +		return -EINVAL;
> +
> +	/* length includes terminating zero */
> +	len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
> +	if (len <= 0)
> +		return -EFAULT;
> +	if (len > MFD_NAME_MAX_LEN + 1)
> +		return -EINVAL;
> +
> +	name = kmalloc(len + MFD_NAME_PREFIX_LEN, GFP_TEMPORARY);
> +	if (!name)
> +		return -ENOMEM;
> +
> +	strcpy(name, MFD_NAME_PREFIX);
> +	if (copy_from_user(&name[MFD_NAME_PREFIX_LEN], uname, len)) {
> +		error = -EFAULT;
> +		goto err_name;
> +	}
> +
> +	/* terminating-zero may have changed after strnlen_user() returned */
> +	if (name[len + MFD_NAME_PREFIX_LEN - 1]) {
> +		error = -EFAULT;
> +		goto err_name;
> +	}
> +
> +	fd = get_unused_fd_flags((flags & MFD_CLOEXEC) ? O_CLOEXEC : 0);
> +	if (fd < 0) {
> +		error = fd;
> +		goto err_name;
> +	}
> +
> +	file = shmem_file_setup(name, 0, VM_NORESERVE);
> +	if (IS_ERR(file)) {
> +		error = PTR_ERR(file);
> +		goto err_fd;
> +	}
> +	info = SHMEM_I(file_inode(file));
> +	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> +	file->f_flags |= O_RDWR | O_LARGEFILE;
> +	if (flags & MFD_ALLOW_SEALING)
> +		info->seals &= ~F_SEAL_SEAL;
> +
> +	fd_install(fd, file);
> +	kfree(name);
> +	return fd;
> +
> +err_fd:
> +	put_unused_fd(fd);
> +err_name:
> +	kfree(name);
> +	return error;
> +}
> +
>  #endif /* CONFIG_TMPFS */
>  
>  static void shmem_put_super(struct super_block *sb)
> -- 
> 2.0.2
> 
> 

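The flag and name-length checks at the top of the syscall can be tried out
in plain userspace C. The following is a rough sketch that mirrors them
under stated assumptions: mfd_check_args() is a hypothetical name, strnlen()
and malloc() stand in for strnlen_user() and kmalloc(), and the -EFAULT
paths are left out because they only concern faults on user pointers. The
macros are copied from the patch, with a NAME_MAX fallback of 255.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#ifndef NAME_MAX
#define NAME_MAX 255
#endif
#ifndef MFD_CLOEXEC
#define MFD_CLOEXEC		0x0001U
#endif
#ifndef MFD_ALLOW_SEALING
#define MFD_ALLOW_SEALING	0x0002U
#endif

#define MFD_NAME_PREFIX "memfd:"
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING)

/*
 * Hypothetical userspace mirror of the argument checks in the patch.
 * Returns 0 and stores a malloc()ed "memfd:<uname>" string in *out,
 * or returns a negative errno value and leaves *out untouched.
 */
static int mfd_check_args(const char *uname, unsigned int flags, char **out)
{
	size_t len;
	char *name;

	/* reject any flag bit the syscall does not know about */
	if (flags & ~(unsigned int)MFD_ALL_FLAGS)
		return -EINVAL;

	/* as in the patch, len includes the terminating zero */
	len = strnlen(uname, MFD_NAME_MAX_LEN + 1) + 1;
	if (len > MFD_NAME_MAX_LEN + 1)
		return -EINVAL;

	name = malloc(len + MFD_NAME_PREFIX_LEN);
	if (!name)
		return -ENOMEM;

	/* prefix first, then the caller's name including its terminator */
	memcpy(name, MFD_NAME_PREFIX, MFD_NAME_PREFIX_LEN);
	memcpy(name + MFD_NAME_PREFIX_LEN, uname, len);
	*out = name;
	return 0;
}
```

With NAME_MAX at 255 and a six-character prefix, the longest name accepted
under these checks is 249 characters, so that the prefixed name still fits
in NAME_MAX.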