Date: Fri, 23 Dec 2022 16:20:20 +0800
From: Chao Peng <chao.p.peng@linux.intel.com>
To: "Huang, Kai"
Cc: tglx@linutronix.de, linux-arch@vger.kernel.org, kvm@vger.kernel.org,
	jmattson@google.com, "Hocko, Michal", pbonzini@redhat.com,
	ak@linux.intel.com, "Lutomirski, Andy", linux-fsdevel@vger.kernel.org,
	tabba@google.com, david@redhat.com, michael.roth@amd.com,
	kirill.shutemov@linux.intel.com, corbet@lwn.net, qemu-devel@nongnu.org,
	dhildenb@redhat.com, bfields@fieldses.org, linux-kernel@vger.kernel.org,
	x86@kernel.org, bp@alien8.de, ddutile@redhat.com, rppt@kernel.org,
	shuah@kernel.org, vkuznets@redhat.com, vbabka@suse.cz,
	mail@maciej.szmigiero.name, naoya.horiguchi@nec.com, qperret@google.com,
	arnd@arndb.de, linux-api@vger.kernel.org, yu.c.zhang@linux.intel.com,
	"Christopherson,, Sean", wanpengli@tencent.com, vannapurve@google.com,
	hughd@google.com, aarcange@redhat.com, mingo@redhat.com, hpa@zytor.com,
	"Nakajima, Jun", jlayton@kernel.org, joro@8bytes.org, linux-mm@kvack.org,
	"Wang, Wei W", steven.price@arm.com, linux-doc@vger.kernel.org,
	"Hansen, Dave", akpm@linux-foundation.org, linmiaohe@huawei.com
Subject: Re: [PATCH v10 1/9] mm: Introduce memfd_restricted system call to
	create restricted user memory
Message-ID: <20221223082020.GA1829090@chaop.bj.intel.com>
Reply-To: Chao Peng <chao.p.peng@linux.intel.com>
References: <20221202061347.1070246-1-chao.p.peng@linux.intel.com>
	<20221202061347.1070246-2-chao.p.peng@linux.intel.com>
	<5c6e2e516f19b0a030eae9bf073d555c57ca1f21.camel@intel.com>
	<20221219075313.GB1691829@chaop.bj.intel.com>
	<20221220072228.GA1724933@chaop.bj.intel.com>
	<126046ce506df070d57e6fe5ab9c92cdaf4cf9b7.camel@intel.com>
	<20221221133905.GA1766136@chaop.bj.intel.com>
On Thu, Dec 22, 2022 at 12:37:19AM +0000, Huang, Kai wrote:
> On Wed, 2022-12-21 at 21:39 +0800, Chao Peng wrote:
> > On Tue, Dec 20, 2022 at 08:33:05AM +0000, Huang, Kai wrote:
> > > On Tue, 2022-12-20 at 15:22 +0800, Chao Peng wrote:
> > > > On Mon, Dec 19, 2022 at 08:48:10AM +0000, Huang, Kai wrote:
> > > > > On Mon, 2022-12-19 at 15:53 +0800, Chao Peng wrote:
> > > > > > [...]
> > > > > > > > +
> > > > > > > > +	/*
> > > > > > > > +	 * These pages are currently unmovable so don't place them into
> > > > > > > > +	 * movable pageblocks (e.g. CMA and ZONE_MOVABLE).
> > > > > > > > +	 */
> > > > > > > > +	mapping = memfd->f_mapping;
> > > > > > > > +	mapping_set_unevictable(mapping);
> > > > > > > > +	mapping_set_gfp_mask(mapping,
> > > > > > > > +			     mapping_gfp_mask(mapping) & ~__GFP_MOVABLE);
> > > > > > >
> > > > > > > But IIUC, removing the __GFP_MOVABLE flag here only makes page
> > > > > > > allocation come from non-movable zones; it doesn't necessarily
> > > > > > > prevent the page from being migrated.  My first glance is that you
> > > > > > > need to implement either a_ops->migrate_folio() or just get_page()
> > > > > > > after faulting in the page to prevent that.
> > > > > >
> > > > > > The current API restrictedmem_get_page() already does this: after
> > > > > > calling it, the caller holds a reference to the page.  The caller
> > > > > > then decides when to call put_page() appropriately.
> > > > >
> > > > > I tried to dig up some history.  Perhaps I am missing something, but
> > > > > it seems Kirill said in v9 that this code doesn't prevent page
> > > > > migration, and that we need to increase the page refcount in
> > > > > restrictedmem_get_page():
> > > > >
> > > > > https://lore.kernel.org/linux-mm/20221129112139.usp6dqhbih47qpjl@box.shutemov.name/
> > > > >
> > > > > But looking at this series, it seems restrictedmem_get_page() in this
> > > > > v10 is identical to the one in v9 (except v10 uses 'folio' instead of
> > > > > 'page')?
> > > >
> > > > restrictedmem_get_page() has been increasing the page refcount since
> > > > several versions ago, so no change is needed in v10.  You probably
> > > > missed my reply:
> > > >
> > > > https://lore.kernel.org/linux-mm/20221129135844.GA902164@chaop.bj.intel.com/
> > >
> > > But for the non-restricted-mem case, it is correct for KVM to decrease
> > > the page's refcount after setting up the mapping in the secondary MMU;
> > > otherwise the page would be pinned by KVM for a normal VM (since KVM
> > > uses GUP to get the page).
> >
> > That's true.  Actually it is even true for the restrictedmem case; most
> > likely we will still need the kvm_release_pfn_clean() in the KVM generic
> > code.  On one side, other restrictedmem users like pKVM may not require
> > page pinning at all.  On the other side, see below.
>
> OK. Agreed.
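(To make the refcount part concrete for other readers, the usage pattern
we are talking about is roughly the sketch below.  Illustrative only, not
code from this series: example_map_gfn() is a made-up caller, and the
restrictedmem_get_page() signature is quoted from memory of this v10.)

#include <linux/pagemap.h>
#include <linux/restrictedmem.h>

/*
 * Sketch: restrictedmem_get_page() returns with an extra reference held
 * on the page.  As long as the caller keeps that reference, the
 * migration core's folio_migrate_mapping() sees an unexpected refcount
 * and returns -EAGAIN, so the page cannot actually be migrated.
 */
static int example_map_gfn(struct file *memfd, pgoff_t offset)
{
	struct page *page;
	int order, ret;

	ret = restrictedmem_get_page(memfd, offset, &page, &order);
	if (ret)
		return ret;

	/* ... populate the secondary MMU (e.g. S-EPT) mapping here ... */

	/*
	 * For TDX, the reference would be kept until the secure EPT entry
	 * is dropped, which is what "pins" the page; a user like pKVM that
	 * needs no pinning can release it right away.
	 */
	put_page(page);
	return 0;
}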
> > > So what we are expecting is: for KVM, if the page comes from restricted
> > > mem, then KVM cannot decrease the refcount; otherwise, for a normal page
> > > obtained via GUP, KVM should.
> >
> > I argue that this page pinning (or page migration prevention) is not
> > tied to where the page comes from, but rather to how the page will be
> > used.  Whether the page is restrictedmem backed or GUP() backed, once it
> > is used by the current version of TDX, the page pinning is needed.  So
> > such page migration prevention is really a TDX thing, not even a KVM
> > generic thing (that's why I think we don't need to change the existing
> > logic of kvm_release_pfn_clean()).
>
> This essentially boils down to who "owns" the page migration handling,
> and sadly page migration is kinda "owned" by the core-kernel, i.e. KVM
> cannot handle page migration by itself -- it's just a passive receiver.

No, I'm not talking about the page migration handling itself; I know page
migration requires coordination from both core-mm and KVM.  My concern
here is the page migration *prevention*.  That is something we need to
address for TDX before page migration is supported.

> For normal pages, page migration is totally done by the core-kernel
> (i.e. it unmaps the page from the VMA, allocates a new page, and uses
> migrate_page() or a_ops->migrate_page() to actually migrate the page).
>
> In the sense of TDX, conceptually it should be done in the same way.
> The more important thing is: yes, KVM can use get_page() to prevent page
> migration, but when KVM wants to support migration, it cannot just
> remove the get_page(), as the core-kernel will still just do
> migrate_page(), which won't work for TDX (given that restricted_memfd
> doesn't have a_ops->migrate_page() implemented).
>
> So I think the restricted_memfd filesystem should own the page migration
> handling (i.e. by implementing a_ops->migrate_page() to either just
> reject page migration or somehow support it).
>
> To support page migration, it may require KVM's help in the case of TDX
> (the TDH.MEM.PAGE.RELOCATE SEAMCALL requires the "GPA" and "level" of
> the EPT mapping, which are only available in KVM), but that doesn't make
> KVM own the handling of page migration.
>
> > Wouldn't it be better to let the TDX code (or whoever requires this)
> > increase/decrease the refcount when it populates/drops the secure EPT
> > entries?  This is exactly what the current TDX code does:
> >
> > get_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1217
> >
> > put_page():
> > https://github.com/intel/tdx/blob/kvm-upstream/arch/x86/kvm/vmx/tdx.c#L1334
>
> As explained above, I think doing so in KVM is wrong: KVM can prevent
> migration by using get_page(), but you cannot simply remove it to
> support page migration.

Removing get_page() is definitely not enough for page migration support.
But the key thing is: for page migration *prevention*, do we really have
an alternative other than get_page()?

Thanks,
Chao

> Sean also said a similar thing when reviewing the v8 KVM TDX series, and
> I also agree:
>
> https://lore.kernel.org/lkml/Yvu5PsAndEbWKTHc@google.com/
> https://lore.kernel.org/lkml/31fec1b4438a6d9bb7ff719f96caa8b23ed764d6.camel@intel.com/
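P.S. For completeness, the "just reject" flavour of the filesystem-owned
approach would look roughly like the sketch below.  Again illustrative
only, not a proposal patch: restrictedmem_migrate_folio() and the aops
wiring are made up, and the hook is called migrate_folio() in current
kernels (migrate_page() above is the older name).

#include <linux/fs.h>
#include <linux/migrate.h>

/*
 * Sketch: let the backing mapping refuse migration outright in its
 * address_space_operations instead of relying on elevated refcounts.
 * The migration core calls this hook from move_to_new_folio(); an error
 * return makes the migration of this folio fail.
 */
static int restrictedmem_migrate_folio(struct address_space *mapping,
				       struct folio *dst, struct folio *src,
				       enum migrate_mode mode)
{
	/*
	 * Relocating a guest-private page would need TDH.MEM.PAGE.RELOCATE
	 * coordinated with KVM (which alone knows the GPA and EPT level),
	 * so until that exists, fail the migration unconditionally.
	 */
	return -EBUSY;
}

static const struct address_space_operations restrictedmem_aops = {
	.migrate_folio	= restrictedmem_migrate_folio,
};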