From: Mike Rapoport
To: Andrew Morton
Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
	Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
	David Hildenbrand, Elena Reshetova, "H. Peter Anvin", Ingo Molnar,
	James Bottomley, "Kirill A. Shutemov", Matthew Wilcox, Mark Rutland,
	Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
	Peter Zijlstra, Rick Edgecombe, Roman Gushchin, Shakeel Butt,
	Shuah Khan, Thomas Gleixner, Tycho Andersen, Will Deacon,
	Hagen Paul Pfeifer,
	linux-api@vger.kernel.org, linux-arch@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-fsdevel@vger.kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-nvdimm@lists.01.org,
	linux-riscv@lists.infradead.org, x86@kernel.org
Subject: [PATCH v16 06/11] mm: introduce memfd_secret system call to create "secret" memory areas
Date: Thu, 21 Jan 2021 14:27:18 +0200
Message-Id: <20210121122723.3446-7-rppt@kernel.org>
In-Reply-To: <20210121122723.3446-1-rppt@kernel.org>
References: <20210121122723.3446-1-rppt@kernel.org>

From: Mike Rapoport

Introduce "memfd_secret" system call with the ability to create memory
areas visible only in the context of the owning process and mapped
neither into other processes nor into the kernel page tables.

The user will create a file descriptor using the memfd_secret() system
call. The memory areas created by mmap() calls from this file descriptor
will be unmapped from the kernel direct map and will only be mapped in
the page table of the owning mm.

The secret memory remains accessible in the process context using
uaccess primitives, but it is not accessible using direct/linear map
addresses.

Functions in the follow_page()/get_user_page() family will refuse to
return a page that belongs to the secret memory area.

A page that was a part of the secret memory area is cleared when it is
freed.
The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
		   MAP_SHARED, fd, 0);

Signed-off-by: Mike Rapoport
Acked-by: Hagen Paul Pfeifer
Cc: Alexander Viro
Cc: Andy Lutomirski
Cc: Arnd Bergmann
Cc: Borislav Petkov
Cc: Catalin Marinas
Cc: Christopher Lameter
Cc: Dan Williams
Cc: Dave Hansen
Cc: David Hildenbrand
Cc: Elena Reshetova
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: James Bottomley
Cc: "Kirill A. Shutemov"
Cc: Mark Rutland
Cc: Matthew Wilcox
Cc: Michael Kerrisk
Cc: Palmer Dabbelt
Cc: Paul Walmsley
Cc: Peter Zijlstra
Cc: Rick Edgecombe
Cc: Roman Gushchin
Cc: Shakeel Butt
Cc: Shuah Khan
Cc: Thomas Gleixner
Cc: Tycho Andersen
Cc: Will Deacon
---
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/secretmem.c             | 278 +++++++++++++++++++++++++++++++++++++
 7 files changed, 319 insertions(+)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define Z3FOLD_MAGIC		0x33
 #define PPC_CMM_MAGIC		0xc7571590
+#define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 769ad6225ab1..869aa6b5bf34 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -355,6 +355,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index 24c045b24b95..5f8243442f66 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -872,4 +872,7 @@ config MAPPING_DIRTY_HELPERS
 config KMAP_LOCAL
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 72227b24a616..b2a564eec27f 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -120,3 +120,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index e4c224cd9661..3e086b073624 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -759,6 +760,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -892,6 +896,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if ((gup_flags & FOLL_LONGTERM) && vma_is_fsdax(vma))
 		return -EOPNOTSUPP;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2031,6 +2038,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..904351d12c33
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,278 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK	(0x0)
+#define SECRETMEM_FLAGS_MASK	SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+	unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp | __GFP_ZERO);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+retry:
+	page = find_lock_page(mapping, offset);
+	if (!page) {
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return VM_FAULT_OOM;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err) {
+			put_page(page);
+			return vmf_error(err);
+		}
+
+		__SetPageUptodate(page);
+		err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(err)) {
+			put_page(page);
+			if (err == -EEXIST)
+				goto retry;
+			goto err_restore_direct_map;
+		}
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+	}
+
+	vmf->page = page;
+	return VM_FAULT_LOCKED;
+
+err_restore_direct_map:
+	/*
+	 * If a split of large page was required, it already happened
+	 * when we marked the page invalid which guarantees that this call
+	 * won't fail
+	 */
+	set_direct_map_default_noflush(page, 1);
+	return vmf_error(err);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_ops = &secretmem_vm_ops;
+	vma->vm_flags |= VM_LOCKED;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap = secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage	= secretmem_freepage,
+	.migratepage	= secretmem_migratepage,
+	.isolate_page	= secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_ctx *ctx;
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_ctx;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = ctx;
+
+	ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+	return file;
+
+err_free_ctx:
+	kfree(ctx);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+	struct secretmem_ctx *ctx = inode->i_private;
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+	kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+	.evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+	if (!ctx)
+		return -ENOMEM;
+	ctx->ops = &secretmem_super_ops;
+
+	return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name		= "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb	= kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;

+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0