From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.2 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4B047C433FE for ; Thu, 16 Sep 2021 07:36:40 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id D3065611C6 for ; Thu, 16 Sep 2021 07:36:39 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org D3065611C6 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id DDC4A6B0071; Thu, 16 Sep 2021 03:36:38 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id D8C5D6B0072; Thu, 16 Sep 2021 03:36:38 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id C7AE76B0073; Thu, 16 Sep 2021 03:36:38 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0187.hostedemail.com [216.40.44.187]) by kanga.kvack.org (Postfix) with ESMTP id BA2B16B0071 for ; Thu, 16 Sep 2021 03:36:38 -0400 (EDT) Received: from smtpin35.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id 5D3502C6B9 for ; Thu, 16 Sep 2021 07:36:38 +0000 (UTC) X-FDA: 78592629276.35.BDFADA7 Received: from mga09.intel.com (mga09.intel.com [134.134.136.24]) by imf08.hostedemail.com (Postfix) with ESMTP id 6309830000B8 for ; Thu, 16 Sep 2021 07:36:37 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10108"; a="222546992" X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="222546992" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Sep 2021 00:36:35 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="472680027" Received: from chaop.bj.intel.com (HELO localhost) ([10.240.192.135]) by orsmga007.jf.intel.com with ESMTP; 16 Sep 2021 00:36:28 -0700 Date: Thu, 16 Sep 2021 15:36:27 +0800 From: Chao Peng To: "Kirill A. Shutemov" Cc: "Kirill A. Shutemov" , Andy Lutomirski , Sean Christopherson , Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Borislav Petkov , Andrew Morton , Joerg Roedel , Andi Kleen , David Rientjes , Vlastimil Babka , Tom Lendacky , Thomas Gleixner , Peter Zijlstra , Ingo Molnar , Varad Gautam , Dario Faggioli , x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev, Kuppuswamy Sathyanarayanan , David Hildenbrand , Dave Hansen , Yu Zhang Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Message-ID: <20210916073627.GA18399@chaop.bj.intel.com> Reply-To: Chao Peng References: <20210824005248.200037-1-seanjc@google.com> <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name> <20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name> <02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org> <20210910171811.xl3lms6xoj3kx223@box.shutemov.name> <20210915195857.GA52522@chaop.bj.intel.com> <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=gb2312 Content-Disposition: inline In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> User-Agent: Mutt/1.9.4 (2018-02-28) X-Rspamd-Server: rspam05 X-Rspamd-Queue-Id: 6309830000B8 X-Stat-Signature: z1wqiy76ct1zhtpastgmsdq1z6iuezs9 Authentication-Results: imf08.hostedemail.com; dkim=none; spf=none (imf08.hostedemail.com: domain of chao.p.peng@linux.intel.com has no SPF policy when checking 134.134.136.24) smtp.mailfrom=chao.p.peng@linux.intel.com; dmarc=fail reason="No valid SPF, No valid DKIM" header.from=intel.com (policy=none) X-HE-Tag: 1631777797-743783 Content-Transfer-Encoding: quoted-printable X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote: > On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote: > > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson w= rote: > > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve = that problem? > > > > >=20 > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I = will look > > > > > closer into locking next. > > > >=20 > > > > We can decisively eliminate this sort of failure by making the sw= itch > > > > happen at open time instead of after. For a memfd-like API, this= would > > > > be straightforward. For a filesystem, it would take a bit more t= hought. > > >=20 > > > I think it should work fine as long as we check seals after i_size = in the > > > read path. See the comment in shmem_file_read_iter(). > > >=20 > > > Below is updated version. I think it should be good enough to start > > > integrate with KVM. > > >=20 > > > I also attach a test-case that consists of kernel patch and userspa= ce > > > program. It demonstrates how it can be integrated into KVM code. > > >=20 > > > One caveat I noticed is that guest_ops::invalidate_page_range() can= be > > > called after the owner (struct kvm) has being freed. It happens bec= ause > > > memfd can outlive KVM. So the callback has to check if such owner e= xists, > > > than check that there's a memslot with such inode. > >=20 > > Would introducing memfd_unregister_guest() fix this? >=20 > I considered this, but it get complex quickly. >=20 > At what point it gets called? On KVM memslot destroy? I meant when the VM gets destroyed. >=20 > What if multiple KVM slot share the same memfd? Add refcount into memfd= on > how many times the owner registered the memfd? >=20 > It would leave us in strange state: memfd refcount owners (struct KVM) = and > KVM memslot pins the struct file. Weird refcount exchnage program. >=20 > I hate it. But yes agree things will get much complex in practice. >=20 > > > I guess it should be okay: we have vm_list we can check owner again= st. > > > We may consider replace vm_list with something more scalable if num= ber of > > > VMs will get too high. > > >=20 > > > Any comments? > > >=20 > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > > index 4f1600413f91..3005e233140a 100644 > > > --- a/include/linux/memfd.h > > > +++ b/include/linux/memfd.h > > > @@ -4,13 +4,34 @@ > > > =20 > > > #include > > > =20 > > > +struct guest_ops { > > > + void (*invalidate_page_range)(struct inode *inode, void *owner, > > > + pgoff_t start, pgoff_t end); > > > +}; > >=20 > > I can see there are two scenarios to invalidate page(s), when punchin= g a > > hole or ftruncating to 0, in either cases KVM should already been cal= led > > with necessary information from usersapce with memory slot punch hole > > syscall or memory slot delete syscall, so wondering this callback is > > really needed. >=20 > So what you propose? Forbid truncate/punch from userspace and make KVM > handle punch hole/truncate from within kernel? I think it's layering > violation. As far as I understand the flow for punch hole/truncate in this design, there will be two steps for userspace: 1. punch hole/delete kvm memory slot, and then 2. puncn hole/truncate on the memory backing store fd. In concept we can do whatever needed for invalidation in either steps. If we can do the invalidation in step 1 then we don=A1=AFt need bothering this callback. This is what I mean but agree the current callback can also work. >=20 > > > + > > > +struct guest_mem_ops { > > > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset= ); > > > + void (*put_unlock_pfn)(unsigned long pfn); > >=20 > > Same as above, I=A1=AFm not clear on which time put_unlock_pfn() woul= d be > > called, I=A1=AFm thinking the page can be put_and_unlock when userspa= ce > > punching a hole or ftruncating to 0 on the fd. >=20 > No. put_unlock_pfn() has to be called after the pfn is in SEPT. This wa= y > we close race between SEPT population and truncate/punch. get_lock_pfn(= ) > would stop truncate untile put_unlock_pfn() called. Okay, makes sense. Thanks, Chao