Date: Mon, 19 Sep 2022 11:12:46 +0200
To: Chao Peng, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
 linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
 qemu-devel@nongnu.org
Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
 Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 Thomas Gleixner, Ingo Molnar, Borislav Petkov, x86@kernel.org,
 "H . Peter Anvin", Hugh Dickins, Jeff Layton, "J . Bruce Fields",
 Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price,
 "Maciej S . Szmigiero", Vlastimil Babka, Vishal Annapurve, Yu Zhang,
 "Kirill A . Shutemov", luto@kernel.org, jun.nakajima@intel.com,
 dave.hansen@intel.com, ak@linux.intel.com, aarcange@redhat.com,
 ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret, Michael Roth,
 mhocko@suse.com, Muchun Song, wei.w.wang@intel.com
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
 <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
In-Reply-To: <20220915142913.2213336-2-chao.p.peng@linux.intel.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 15.09.22 16:29, Chao Peng wrote:
> From: "Kirill A. Shutemov"
>
> KVM can use memfd-provided memory for guest memory. For normal userspace
> accessible memory, KVM userspace (e.g.
> QEMU) mmaps the memfd into its
> virtual address space and then tells KVM to use the virtual address to
> setup the mapping in the secondary page table (e.g. EPT).
>
> With confidential computing technologies like Intel TDX, the
> memfd-provided memory may be encrypted with special key for special
> software domain (e.g. KVM guest) and is not expected to be directly
> accessed by userspace. Precisely, userspace access to such encrypted
> memory may lead to host crash so it should be prevented.

Initially my thought was that this whole inaccessible thing is
TDX-specific and there is no need to force that on other mechanisms.
That's why I suggested to not expose this to user space but handle the
notifier requirements internally.

IIUC now, protected KVM has similar demands. Either access (read/write)
of guest RAM would result in a fault and possibly crash the hypervisor
(at least not the whole machine IIUC).

>
> This patch introduces userspace inaccessible memfd (created with
> MFD_INACCESSIBLE). Its memory is inaccessible from userspace through
> ordinary MMU access (e.g. read/write/mmap) but can be accessed via
> in-kernel interface so KVM can directly interact with core-mm without
> the need to map the memory into KVM userspace.

With secretmem we decided to not add such "concept switch" flags and
instead use a dedicated syscall.

What about memfd_inaccessible()? Especially, sealing and hugetlb are not
even supported and it might take a while to support either.

>
> It provides semantics required for KVM guest private(encrypted) memory
> support that a file descriptor with this flag set is going to be used as
> the source of guest memory in confidential computing environments such
> as Intel TDX/AMD SEV.
>
> KVM userspace is still in charge of the lifecycle of the memfd. It
> should pass the opened fd to KVM. KVM uses the kernel APIs newly added
> in this patch to obtain the physical memory address and then populate
> the secondary page table entries.
>
> The userspace inaccessible memfd can be fallocate-ed and hole-punched
> from userspace. When hole-punching happens, KVM can get notified through
> inaccessible_notifier it then gets chance to remove any mapped entries
> of the range in the secondary page tables.
>
> The userspace inaccessible memfd itself is implemented as a shim layer
> on top of real memory file systems like tmpfs/hugetlbfs but this patch
> only implemented tmpfs. The allocated memory is currently marked as
> unmovable and unevictable, this is required for current confidential
> usage. But in future this might be changed.
>
> Signed-off-by: Kirill A. Shutemov
> Signed-off-by: Chao Peng
> ---
>  include/linux/memfd.h      |  24 ++++
>  include/uapi/linux/magic.h |   1 +
>  include/uapi/linux/memfd.h |   1 +
>  mm/Makefile                |   2 +-
>  mm/memfd.c                 |  25 ++++-
>  mm/memfd_inaccessible.c    | 219 +++++++++++++++++++++++++++++++++++++
>  6 files changed, 270 insertions(+), 2 deletions(-)
>  create mode 100644 mm/memfd_inaccessible.c
>
> diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> index 4f1600413f91..334ddff08377 100644
> --- a/include/linux/memfd.h
> +++ b/include/linux/memfd.h
> @@ -3,6 +3,7 @@
>  #define __LINUX_MEMFD_H
>
>  #include
> +#include
>
>  #ifdef CONFIG_MEMFD_CREATE
>  extern long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg);
> @@ -13,4 +14,27 @@ static inline long memfd_fcntl(struct file *f, unsigned int c, unsigned long a)
>  }
>  #endif
>
> +struct inaccessible_notifier;
> +
> +struct inaccessible_notifier_ops {
> +	void (*invalidate)(struct inaccessible_notifier *notifier,
> +			   pgoff_t start, pgoff_t end);
> +};
> +
> +struct inaccessible_notifier {
> +	struct list_head list;
> +	const struct inaccessible_notifier_ops *ops;
> +};
> +
> +void inaccessible_register_notifier(struct file *file,
> +				    struct inaccessible_notifier *notifier);
> +void inaccessible_unregister_notifier(struct file *file,
> +				      struct inaccessible_notifier *notifier);
> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order);
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn);
> +
> +struct file *memfd_mkinaccessible(struct file *memfd);
> +
>  #endif /* __LINUX_MEMFD_H */
> diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
> index 6325d1d0e90f..9d066be3d7e8 100644
> --- a/include/uapi/linux/magic.h
> +++ b/include/uapi/linux/magic.h
> @@ -101,5 +101,6 @@
>  #define DMA_BUF_MAGIC		0x444d4142	/* "DMAB" */
>  #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
>  #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
> +#define INACCESSIBLE_MAGIC	0x494e4143	/* "INAC" */

[...]

> +
> +int inaccessible_get_pfn(struct file *file, pgoff_t offset, pfn_t *pfn,
> +			 int *order)
> +{
> +	struct inaccessible_data *data = file->f_mapping->private_data;
> +	struct file *memfd = data->memfd;
> +	struct page *page;
> +	int ret;
> +
> +	ret = shmem_getpage(file_inode(memfd), offset, &page, SGP_WRITE);
> +	if (ret)
> +		return ret;
> +
> +	*pfn = page_to_pfn_t(page);
> +	*order = thp_order(compound_head(page));
> +	SetPageUptodate(page);
> +	unlock_page(page);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_get_pfn);
> +
> +void inaccessible_put_pfn(struct file *file, pfn_t pfn)
> +{
> +	struct page *page = pfn_t_to_page(pfn);
> +
> +	if (WARN_ON_ONCE(!page))
> +		return;
> +
> +	put_page(page);
> +}
> +EXPORT_SYMBOL_GPL(inaccessible_put_pfn);

Sorry, I missed your reply regarding get/put interface.

https://lore.kernel.org/linux-mm/20220810092532.GD862421@chaop.bj.intel.com/

"We have a design assumption that somedays this can even support
non-page based backing stores."

As long as there is no such user in sight (especially how to get the
memfd from even allocating such memory which will require bigger
changes), I prefer to keep it simple here and work on pages/folios. No
need to over-complicate it for now.

-- 
Thanks,

David / dhildenb