Date: Tue, 4 Apr 2023 11:25:07 +0300
From: "Kirill A. Shutemov"
To: Ackerley Tng
Cc: kvm@vger.kernel.org, linux-api@vger.kernel.org,
	linux-arch@vger.kernel.org, linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, qemu-devel@nongnu.org, aarcange@redhat.com,
	ak@linux.intel.com, akpm@linux-foundation.org, arnd@arndb.de,
	bfields@fieldses.org, bp@alien8.de, chao.p.peng@linux.intel.com,
	corbet@lwn.net, dave.hansen@intel.com, david@redhat.com,
	ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com,
	hughd@google.com, jlayton@kernel.org, jmattson@google.com,
	joro@8bytes.org, jun.nakajima@intel.com,
	kirill.shutemov@linux.intel.com, linmiaohe@huawei.com,
	luto@kernel.org, mail@maciej.szmigiero.name, mhocko@suse.com,
	michael.roth@amd.com, mingo@redhat.com, naoya.horiguchi@nec.com,
	pbonzini@redhat.com, qperret@google.com, rppt@kernel.org,
	seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
	tabba@google.com, tglx@linutronix.de, vannapurve@google.com,
	vbabka@suse.cz, vkuznets@redhat.com, wanpengli@tencent.com,
	wei.w.wang@intel.com, x86@kernel.org, yu.c.zhang@linux.intel.com
Subject: Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to
	specify mount for memfd_restricted
Message-ID: <20230404082507.sbyfahwc4gdupmya@box.shutemov.name>
References: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>
In-Reply-To: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>

On Fri, Mar 31, 2023 at 11:50:39PM +0000, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
>
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
>
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
>
> Permissions for the fd passed to memfd_restricted() are modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
>
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
>
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
>
> Signed-off-by: Ackerley Tng
> ---
>  include/linux/syscalls.h           |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 ++++
>  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>  3 files changed, 77 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U
> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"
> +#include
>  #include
>  #include
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
>
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>  {
>  	struct file *file, *restricted_file;
>  	int fd, err;
>
> -	if (flags)
> -		return -EINVAL;
> -
>  	fd = get_unused_fd_flags(0);
>  	if (fd < 0)
>  		return fd;
>
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>  	if (IS_ERR(file)) {
>  		err = PTR_ERR(file);
>  		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	return err;
>  }
>
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> +}
> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);

Why MAY_EXEC?

> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}

We need review from fs folks. Looks mostly sensible, but I have no
experience in fs.

> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {
> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}

Maybe restructure with a single restrictedmem_create() call?

	struct vfsmount *mnt = NULL;

	if (flags == RMFD_USERMNT) {
		...
		mnt = ...();
	}

	return restrictedmem_create(mnt);

> +}
> +
>  int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  		       struct restrictedmem_notifier *notifier, bool exclusive)
>  {
> --
> 2.40.0.348.gf938b09366-goog

-- 
  Kiryl Shutsemau / Kirill A. Shutemov
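
To make the control flow of the single-call restructuring concrete, here
is a rough userspace model of it. This is only a sketch, not kernel code:
struct vfsmount is a stand-in, restrictedmem_create() is a stub that
returns a fake fd, and lookup_user_mount() is a hypothetical helper that
does not exist in the patch. It stands in for whatever resolves and
validates mount_fd. Only RMFD_USERMNT and the flag checks are taken from
the quoted diff.

```c
/* Userspace model only: kernel types and helpers are replaced by stubs. */
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define RMFD_USERMNT 0x0001U	/* flag value from the quoted patch */

struct vfsmount { int id; };	/* stand-in for the kernel structure */

static struct vfsmount user_mount = { 1 };
static const struct vfsmount *last_mount_used;

/*
 * Hypothetical helper (not in the patch): resolve mount_fd to a mount.
 * Returns NULL and stores a negative errno in *err on failure.
 */
static struct vfsmount *lookup_user_mount(int mount_fd, int *err)
{
	if (mount_fd < 0) {
		*err = -EINVAL;
		return NULL;
	}
	*err = 0;
	return &user_mount;
}

/* Stub: records which mount was requested and returns a fake fd. */
static int restrictedmem_create(struct vfsmount *mnt)
{
	last_mount_used = mnt;
	return 3;
}

/* Model of the syscall body with a single restrictedmem_create() call. */
static int memfd_restricted(unsigned int flags, int mount_fd)
{
	struct vfsmount *mnt = NULL;
	int err;

	if (flags & ~RMFD_USERMNT)
		return -EINVAL;

	if (flags == RMFD_USERMNT) {
		mnt = lookup_user_mount(mount_fd, &err);
		if (!mnt)
			return err;
	}

	/* NULL selects the default mount, as in the quoted patch. */
	return restrictedmem_create(mnt);
}
```

The model ignores one thing the real code would have to handle: in the
quoted patch, the fd reference and the mnt_want_write()/mnt_drop_write()
pair are held around restrictedmem_create(), so a real version of this
shape would still need to release them after the single call returns.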