Date: Wed, 15 Sep 2021 17:11:47 +0300
From: "Kirill A. Shutemov"
To: Chao Peng
Cc: "Kirill A. Shutemov", Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Borislav Petkov, Andrew Morton, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86@kernel.org,
	linux-mm@kvack.org, linux-coco@lists.linux.dev,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen, Yu Zhang
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Message-ID: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
	<20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name>
	<20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name>
	<02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org>
	<20210910171811.xl3lms6xoj3kx223@box.shutemov.name>
	<20210915195857.GA52522@chaop.bj.intel.com>
In-Reply-To: <20210915195857.GA52522@chaop.bj.intel.com>
On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > >
> > > > I guess. Maybe we would need a WRITE_ONCE() on set. I don't know. I will
> > > > look closer into locking next.
> > >
> > > We can decisively eliminate this sort of failure by making the switch
> > > happen at open time instead of after. For a memfd-like API, this would
> > > be straightforward. For a filesystem, it would take a bit more thought.
> >
> > I think it should work fine as long as we check seals after i_size in the
> > read path. See the comment in shmem_file_read_iter().
> >
> > Below is the updated version. I think it should be good enough to start
> > integrating with KVM.
> >
> > I also attach a test case that consists of a kernel patch and a userspace
> > program. It demonstrates how it can be integrated into KVM code.
> >
> > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > called after the owner (struct kvm) has been freed. It happens because
> > the memfd can outlive KVM. So the callback has to check whether such an
> > owner exists, then check that there's a memslot with such an inode.
>
> Would introducing memfd_unregister_guest() fix this?

I considered this, but it gets complex quickly.
At what point does it get called? On KVM memslot destroy? What if multiple
KVM slots share the same memfd? Add a refcount into the memfd for how many
times the owner registered it?

It would leave us in a strange state: the memfd refcounts its owners
(struct kvm) and the KVM memslot pins the struct file. A weird refcount
exchange program. I hate it.

> > I guess it should be okay: we have vm_list we can check the owner against.
> > We may consider replacing vm_list with something more scalable if the
> > number of VMs gets too high.
> >
> > Any comments?
> >
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index 4f1600413f91..3005e233140a 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -4,13 +4,34 @@
> >
> >  #include
> >
> > +struct guest_ops {
> > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > +				      pgoff_t start, pgoff_t end);
> > +};
>
> I can see there are two scenarios to invalidate page(s): when punching a
> hole or ftruncating to 0. In either case KVM should already have been
> called with the necessary information from userspace, via the memory slot
> punch hole syscall or memory slot delete syscall, so I'm wondering whether
> this callback is really needed.

So what do you propose? Forbid truncate/punch from userspace and make KVM
handle punch hole/truncate from within the kernel?

I think it's a layering violation.

> > +
> > +struct guest_mem_ops {
> > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > +	void (*put_unlock_pfn)(unsigned long pfn);
>
> Same as above, I'm not clear on at which point put_unlock_pfn() would be
> called. I'm thinking the page can be put-and-unlocked when userspace
> punches a hole or ftruncates to 0 on the fd.

No. put_unlock_pfn() has to be called after the pfn is in the SEPT. This
way we close the race between SEPT population and truncate/punch.
get_lock_pfn() would stop truncate until put_unlock_pfn() is called.
> We did miss a pfn_mapping_level() callback, which is needed for KVM to
> query the page size level (e.g. 4K or 2M) that the backing store can
> support.

Okay, makes sense. We can return the information as part of the
get_lock_pfn() call.

> Are we sticking our design to the memfd interface (i.e. other memory
> backing stores like tmpfs and hugetlbfs will all rely on this memfd
> interface to interact with KVM), or is this just the initial
> implementation for a PoC?
>
> If we really want to expose multiple memory backing stores directly to
> KVM (as opposed to exposing backing stores to memfd and then having memfd
> expose a single interface to KVM), I feel we need a third layer between
> KVM and the backing stores to eliminate direct calls like this. Backing
> stores could register "memory fd providers" and KVM should be able to
> connect to the right backing store provider with the fd provided by
> userspace, with the help of this third layer.

memfd can provide shmem and hugetlbfs. That should be enough for now.

-- 
 Kirill A. Shutemov