From: Mike Rapoport
To: Andrew Morton
Cc: Alexander Viro, Andy Lutomirski, Arnd Bergmann, Borislav Petkov,
 Catalin Marinas, Christopher Lameter, Dan Williams, Dave Hansen,
 David Hildenbrand, Elena Reshetova, "H. Peter Anvin", Ingo Molnar,
 James Bottomley, "Kirill A. Shutemov", Matthew Wilcox, Mark Rutland,
 Mike Rapoport, Michael Kerrisk, Palmer Dabbelt, Paul Walmsley,
 Peter Zijlstra, Rick Edgecombe, Roman Gushchin, Shakeel Butt, Shuah Khan,
 Thomas Gleixner, Tycho Andersen, Will Deacon, linux-api@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
 linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
 linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org,
 linux-nvdimm@lists.01.org, linux-riscv@lists.infradead.org, x86@kernel.org,
 Hagen Paul Pfeifer
Subject: [PATCH v14 05/10] mm: introduce memfd_secret system call to create "secret" memory areas
Date: Thu, 3 Dec 2020 08:29:44 +0200
Message-Id: <20201203062949.5484-6-rppt@kernel.org>
In-Reply-To: <20201203062949.5484-1-rppt@kernel.org>
References: <20201203062949.5484-1-rppt@kernel.org>

From: Mike Rapoport

Introduce the "memfd_secret" system call, which creates memory areas that
are visible only in the context of the owning process and are mapped
neither into other processes nor into the kernel page tables.

The user creates a file descriptor using the memfd_secret() system call.
The memory areas created by mmap() calls on this file descriptor are
unmapped from the kernel direct map and are mapped only in the page tables
of the owning mm.

The secret memory remains accessible in the process context using uaccess
primitives, but it is not accessible using direct/linear map addresses.

Functions in the follow_page()/get_user_page() family will refuse to return
a page that belongs to the secret memory area.

A page that was a part of the secret memory area is cleared when it is
freed.

The following example demonstrates creation of a secret mapping (error
handling is omitted):

	fd = memfd_secret(0);
	ftruncate(fd, MAP_SIZE);
	ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

Signed-off-by: Mike Rapoport
Acked-by: Hagen Paul Pfeifer
---
 arch/x86/Kconfig           |   2 +-
 include/linux/secretmem.h  |  24 ++++
 include/uapi/linux/magic.h |   1 +
 kernel/sys_ni.c            |   2 +
 mm/Kconfig                 |   3 +
 mm/Makefile                |   1 +
 mm/gup.c                   |  10 ++
 mm/secretmem.c             | 273 +++++++++++++++++++++++++++++++++++++
 8 files changed, 315 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/secretmem.h
 create mode 100644 mm/secretmem.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 34d5fb82f674..7d781fea79c2 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -41,7 +41,7 @@ config FORCE_DYNAMIC_FTRACE
 	 in order to test the non static function tracing in the
 	 generic code, as other architectures still use it. But we
 	 only need to keep it around for x86_64. No need to keep it
-	 for x86_32. For x86_32, force DYNAMIC_FTRACE. 
+	 for x86_32. For x86_32, force DYNAMIC_FTRACE.
 #
 # Arch settings
 #
diff --git a/include/linux/secretmem.h b/include/linux/secretmem.h
new file mode 100644
index 000000000000..70e7db9f94fe
--- /dev/null
+++ b/include/linux/secretmem.h
@@ -0,0 +1,24 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _LINUX_SECRETMEM_H
+#define _LINUX_SECRETMEM_H
+
+#ifdef CONFIG_SECRETMEM
+
+bool vma_is_secretmem(struct vm_area_struct *vma);
+bool page_is_secretmem(struct page *page);
+
+#else
+
+static inline bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return false;
+}
+
+static inline bool page_is_secretmem(struct page *page)
+{
+	return false;
+}
+
+#endif /* CONFIG_SECRETMEM */
+
+#endif /* _LINUX_SECRETMEM_H */
diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index f3956fc11de6..35687dcb1a42 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -97,5 +97,6 @@
 #define DEVMEM_MAGIC 0x454d444d /* "DMEM" */
 #define Z3FOLD_MAGIC 0x33
 #define PPC_CMM_MAGIC 0xc7571590
+#define SECRETMEM_MAGIC 0x5345434d /* "SECM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 2dd6cbb8cabc..805fd7a668be 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -353,6 +353,8 @@ COND_SYSCALL(pkey_mprotect);
 COND_SYSCALL(pkey_alloc);
 COND_SYSCALL(pkey_free);
 
+/* memfd_secret */
+COND_SYSCALL(memfd_secret);
 
 /*
  * Architecture specific weak syscall entries.
diff --git a/mm/Kconfig b/mm/Kconfig
index c89c5444924b..d8d170fa5210 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -884,4 +884,7 @@ config ARCH_HAS_HUGEPD
 config MAPPING_DIRTY_HELPERS
 	bool
 
+config SECRETMEM
+	def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
+
 endmenu
diff --git a/mm/Makefile b/mm/Makefile
index 6eeb4b29efb8..dfda14c48a75 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -121,3 +121,4 @@ obj-$(CONFIG_MEMFD_CREATE) += memfd.o
 obj-$(CONFIG_MAPPING_DIRTY_HELPERS) += mapping_dirty_helpers.o
 obj-$(CONFIG_PTDUMP_CORE) += ptdump.o
 obj-$(CONFIG_PAGE_REPORTING) += page_reporting.o
+obj-$(CONFIG_SECRETMEM) += secretmem.o
diff --git a/mm/gup.c b/mm/gup.c
index 5ec98de1e5de..71164fa83114 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -10,6 +10,7 @@
 #include
 #include
 #include
+#include
 
 #include
 #include
@@ -793,6 +794,9 @@ struct page *follow_page(struct vm_area_struct *vma, unsigned long address,
 	struct follow_page_context ctx = { NULL };
 	struct page *page;
 
+	if (vma_is_secretmem(vma))
+		return NULL;
+
 	page = follow_page_mask(vma, address, foll_flags, &ctx);
 	if (ctx.pgmap)
 		put_dev_pagemap(ctx.pgmap);
@@ -923,6 +927,9 @@ static int check_vma_flags(struct vm_area_struct *vma, unsigned long gup_flags)
 	if (gup_flags & FOLL_ANON && !vma_is_anonymous(vma))
 		return -EFAULT;
 
+	if (vma_is_secretmem(vma))
+		return -EFAULT;
+
 	if (write) {
 		if (!(vm_flags & VM_WRITE)) {
 			if (!(gup_flags & FOLL_FORCE))
@@ -2196,6 +2203,9 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
 		VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
 		page = pte_page(pte);
 
+		if (page_is_secretmem(page))
+			goto pte_unmap;
+
 		head = try_grab_compound_head(page, 1, flags);
 		if (!head)
 			goto pte_unmap;
diff --git a/mm/secretmem.c b/mm/secretmem.c
new file mode 100644
index 000000000000..781aaaca8c70
--- /dev/null
+++ b/mm/secretmem.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright IBM Corporation, 2020
+ *
+ * Author: Mike Rapoport
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#include
+
+#include
+
+#include "internal.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "secretmem: " fmt
+
+/*
+ * Define mode and flag masks to allow validation of the system call
+ * parameters.
+ */
+#define SECRETMEM_MODE_MASK (0x0)
+#define SECRETMEM_FLAGS_MASK SECRETMEM_MODE_MASK
+
+struct secretmem_ctx {
+	unsigned int mode;
+};
+
+static struct page *secretmem_alloc_page(gfp_t gfp)
+{
+	/*
+	 * FIXME: use a cache of large pages to reduce the direct map
+	 * fragmentation
+	 */
+	return alloc_page(gfp);
+}
+
+static vm_fault_t secretmem_fault(struct vm_fault *vmf)
+{
+	struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+	struct inode *inode = file_inode(vmf->vma->vm_file);
+	pgoff_t offset = vmf->pgoff;
+	vm_fault_t ret = 0;
+	unsigned long addr;
+	struct page *page;
+	int err;
+
+	if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
+		return vmf_error(-EINVAL);
+
+	page = find_get_page(mapping, offset);
+	if (!page) {
+
+		page = secretmem_alloc_page(vmf->gfp_mask);
+		if (!page)
+			return vmf_error(-ENOMEM);
+
+		err = add_to_page_cache(page, mapping, offset, vmf->gfp_mask);
+		if (unlikely(err))
+			goto err_put_page;
+
+		err = set_direct_map_invalid_noflush(page, 1);
+		if (err)
+			goto err_del_page_cache;
+
+		addr = (unsigned long)page_address(page);
+		flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
+
+		__SetPageUptodate(page);
+
+		ret = VM_FAULT_LOCKED;
+	}
+
+	vmf->page = page;
+	return ret;
+
+err_del_page_cache:
+	delete_from_page_cache(page);
+err_put_page:
+	put_page(page);
+	return vmf_error(err);
+}
+
+static const struct vm_operations_struct secretmem_vm_ops = {
+	.fault = secretmem_fault,
+};
+
+static int secretmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	unsigned long len = vma->vm_end - vma->vm_start;
+
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) == 0)
+		return -EINVAL;
+
+	if (mlock_future_check(vma->vm_mm, vma->vm_flags | VM_LOCKED, len))
+		return -EAGAIN;
+
+	vma->vm_ops = &secretmem_vm_ops;
+	vma->vm_flags |= VM_LOCKED;
+
+	return 0;
+}
+
+bool vma_is_secretmem(struct vm_area_struct *vma)
+{
+	return vma->vm_ops == &secretmem_vm_ops;
+}
+
+static const struct file_operations secretmem_fops = {
+	.mmap = secretmem_mmap,
+};
+
+static bool secretmem_isolate_page(struct page *page, isolate_mode_t mode)
+{
+	return false;
+}
+
+static int secretmem_migratepage(struct address_space *mapping,
+				 struct page *newpage, struct page *page,
+				 enum migrate_mode mode)
+{
+	return -EBUSY;
+}
+
+static void secretmem_freepage(struct page *page)
+{
+	set_direct_map_default_noflush(page, 1);
+	clear_highpage(page);
+}
+
+static const struct address_space_operations secretmem_aops = {
+	.freepage = secretmem_freepage,
+	.migratepage = secretmem_migratepage,
+	.isolate_page = secretmem_isolate_page,
+};
+
+bool page_is_secretmem(struct page *page)
+{
+	struct address_space *mapping = page_mapping(page);
+
+	if (!mapping)
+		return false;
+
+	return mapping->a_ops == &secretmem_aops;
+}
+
+static struct vfsmount *secretmem_mnt;
+
+static struct file *secretmem_file_create(unsigned long flags)
+{
+	struct file *file = ERR_PTR(-ENOMEM);
+	struct secretmem_ctx *ctx;
+	struct inode *inode;
+
+	inode = alloc_anon_inode(secretmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
+	if (!ctx)
+		goto err_free_inode;
+
+	file = alloc_file_pseudo(inode, secretmem_mnt, "secretmem",
+				 O_RDWR, &secretmem_fops);
+	if (IS_ERR(file))
+		goto err_free_ctx;
+
+	mapping_set_unevictable(inode->i_mapping);
+
+	inode->i_mapping->private_data = ctx;
+	inode->i_mapping->a_ops = &secretmem_aops;
+
+	/* pretend we are a normal file with zero size */
+	inode->i_mode |= S_IFREG;
+	inode->i_size = 0;
+
+	file->private_data = ctx;
+
+	ctx->mode = flags & SECRETMEM_MODE_MASK;
+
+	return file;
+
+err_free_ctx:
+	kfree(ctx);
+err_free_inode:
+	iput(inode);
+	return file;
+}
+
+SYSCALL_DEFINE1(memfd_secret, unsigned long, flags)
+{
+	struct file *file;
+	int fd, err;
+
+	/* make sure local flags do not conflict with global fcntl.h */
+	BUILD_BUG_ON(SECRETMEM_FLAGS_MASK & O_CLOEXEC);
+
+	if (flags & ~(SECRETMEM_FLAGS_MASK | O_CLOEXEC))
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(flags & O_CLOEXEC);
+	if (fd < 0)
+		return fd;
+
+	file = secretmem_file_create(flags);
+	if (IS_ERR(file)) {
+		err = PTR_ERR(file);
+		goto err_put_fd;
+	}
+
+	file->f_flags |= O_LARGEFILE;
+
+	fd_install(fd, file);
+	return fd;
+
+err_put_fd:
+	put_unused_fd(fd);
+	return err;
+}
+
+static void secretmem_evict_inode(struct inode *inode)
+{
+	struct secretmem_ctx *ctx = inode->i_private;
+
+	truncate_inode_pages_final(&inode->i_data);
+	clear_inode(inode);
+	kfree(ctx);
+}
+
+static const struct super_operations secretmem_super_ops = {
+	.evict_inode = secretmem_evict_inode,
+};
+
+static int secretmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx = init_pseudo(fc, SECRETMEM_MAGIC);
+
+	if (!ctx)
+		return -ENOMEM;
+	ctx->ops = &secretmem_super_ops;
+
+	return 0;
+}
+
+static struct file_system_type secretmem_fs = {
+	.name = "secretmem",
+	.init_fs_context = secretmem_init_fs_context,
+	.kill_sb = kill_anon_super,
+};
+
+static int secretmem_init(void)
+{
+	int ret = 0;
+
+	secretmem_mnt = kern_mount(&secretmem_fs);
+	if (IS_ERR(secretmem_mnt))
+		ret = PTR_ERR(secretmem_mnt);
+
+	return ret;
+}
+fs_initcall(secretmem_init);
-- 
2.28.0
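
Note (not part of the patch): the commit message example omits error
handling. Below is a minimal userspace sketch of the same three steps with
the error paths filled in. It rests on assumptions that this patch does not
establish: that the installed kernel headers expose the syscall number as
__NR_memfd_secret (this patch does not wire it up for any architecture), and
that libc has no memfd_secret() wrapper. The names memfd_secret_wrapper() and
secret_map() are hypothetical helpers made up for this sketch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical wrapper. Assumes the kernel headers define
 * __NR_memfd_secret; if they do not, behave as if the kernel
 * lacked the system call.
 */
static int memfd_secret_wrapper(unsigned long flags)
{
#ifdef __NR_memfd_secret
	return (int)syscall(__NR_memfd_secret, flags);
#else
	(void)flags;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hypothetical helper: create a secret mapping of len bytes.
 * Returns 0 and stores the address in *out, or returns a negative
 * errno value and leaves *out set to NULL.
 */
static int secret_map(size_t len, void **out)
{
	void *ptr;
	int fd, err;

	assert(len > 0 && out != NULL);
	*out = NULL;

	fd = memfd_secret_wrapper(0);
	if (fd < 0)
		return -errno;

	/* The new file has zero size, so it must be sized before use. */
	if (ftruncate(fd, (off_t)len) < 0) {
		err = errno;
		close(fd);
		return -err;
	}

	/* secretmem_mmap() in the patch rejects non-shared mappings. */
	ptr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	err = errno;
	close(fd);	/* an established mapping keeps the file alive */
	if (ptr == MAP_FAILED)
		return -err;

	*out = ptr;
	return 0;
}
```

A caller would pass a page-multiple length, treat any negative return as
"secret memory unavailable" (for example -ENOSYS on kernels without the
call, or -EAGAIN when the RLIMIT_MEMLOCK check in secretmem_mmap() fails),
and release the area with munmap() when done.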