From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mga05.intel.com (mga05.intel.com [192.55.52.43]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 82E643FC3 for ; Thu, 16 Sep 2021 07:37:00 +0000 (UTC) X-IronPort-AV: E=McAfee;i="6200,9189,10108"; a="308051093" X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="308051093" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by fmsmga105.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Sep 2021 00:36:35 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="472680027" Received: from chaop.bj.intel.com (HELO localhost) ([10.240.192.135]) by orsmga007.jf.intel.com with ESMTP; 16 Sep 2021 00:36:28 -0700 Date: Thu, 16 Sep 2021 15:36:27 +0800 From: Chao Peng To: "Kirill A. Shutemov" Cc: "Kirill A. Shutemov" , Andy Lutomirski , Sean Christopherson , Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Borislav Petkov , Andrew Morton , Joerg Roedel , Andi Kleen , David Rientjes , Vlastimil Babka , Tom Lendacky , Thomas Gleixner , Peter Zijlstra , Ingo Molnar , Varad Gautam , Dario Faggioli , x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev, Kuppuswamy Sathyanarayanan , David Hildenbrand , Dave Hansen , Yu Zhang Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Message-ID: <20210916073627.GA18399@chaop.bj.intel.com> Reply-To: Chao Peng References: <20210824005248.200037-1-seanjc@google.com> <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name> <20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name> <02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org> <20210910171811.xl3lms6xoj3kx223@box.shutemov.name> <20210915195857.GA52522@chaop.bj.intel.com> <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> Precedence: bulk X-Mailing-List: linux-coco@lists.linux.dev List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=gb2312 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> User-Agent: Mutt/1.9.4 (2018-02-28) On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote: > On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote: > > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > > > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > > > > closer into locking next. > > > > > > > > We can decisively eliminate this sort of failure by making the switch > > > > happen at open time instead of after. For a memfd-like API, this would > > > > be straightforward. For a filesystem, it would take a bit more thought. > > > > > > I think it should work fine as long as we check seals after i_size in the > > > read path. See the comment in shmem_file_read_iter(). > > > > > > Below is updated version. I think it should be good enough to start > > > integrate with KVM. > > > > > > I also attach a test-case that consists of kernel patch and userspace > > > program. It demonstrates how it can be integrated into KVM code. > > > > > > One caveat I noticed is that guest_ops::invalidate_page_range() can be > > > called after the owner (struct kvm) has being freed. It happens because > > > memfd can outlive KVM. So the callback has to check if such owner exists, > > > than check that there's a memslot with such inode. > > > > Would introducing memfd_unregister_guest() fix this? > > I considered this, but it get complex quickly. > > At what point it gets called? On KVM memslot destroy? I meant when the VM gets destroyed. > > What if multiple KVM slot share the same memfd? Add refcount into memfd on > how many times the owner registered the memfd? > > It would leave us in strange state: memfd refcount owners (struct KVM) and > KVM memslot pins the struct file. Weird refcount exchnage program. > > I hate it. But yes agree things will get much complex in practice. > > > > I guess it should be okay: we have vm_list we can check owner against. > > > We may consider replace vm_list with something more scalable if number of > > > VMs will get too high. > > > > > > Any comments? > > > > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > > index 4f1600413f91..3005e233140a 100644 > > > --- a/include/linux/memfd.h > > > +++ b/include/linux/memfd.h > > > @@ -4,13 +4,34 @@ > > > > > > #include > > > > > > +struct guest_ops { > > > + void (*invalidate_page_range)(struct inode *inode, void *owner, > > > + pgoff_t start, pgoff_t end); > > > +}; > > > > I can see there are two scenarios to invalidate page(s), when punching a > > hole or ftruncating to 0, in either cases KVM should already been called > > with necessary information from usersapce with memory slot punch hole > > syscall or memory slot delete syscall, so wondering this callback is > > really needed. > > So what you propose? Forbid truncate/punch from userspace and make KVM > handle punch hole/truncate from within kernel? I think it's layering > violation. As far as I understand the flow for punch hole/truncate in this design, there will be two steps for userspace: 1. punch hole/delete kvm memory slot, and then 2. puncn hole/truncate on the memory backing store fd. In concept we can do whatever needed for invalidation in either steps. If we can do the invalidation in step 1 then we don¡¯t need bothering this callback. This is what I mean but agree the current callback can also work. > > > > + > > > +struct guest_mem_ops { > > > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); > > > + void (*put_unlock_pfn)(unsigned long pfn); > > > > Same as above, I¡¯m not clear on which time put_unlock_pfn() would be > > called, I¡¯m thinking the page can be put_and_unlock when userspace > > punching a hole or ftruncating to 0 on the fd. > > No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way > we close race between SEPT population and truncate/punch. get_lock_pfn() > would stop truncate untile put_unlock_pfn() called. Okay, makes sense. Thanks, Chao From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-10.2 required=3.0 tests=BAYES_00, HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,USER_AGENT_SANE_1 autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3A84EC433F5 for ; Thu, 16 Sep 2021 07:36:39 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 1B50A60F46 for ; Thu, 16 Sep 2021 07:36:39 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234792AbhIPHh5 (ORCPT ); Thu, 16 Sep 2021 03:37:57 -0400 Received: from mga09.intel.com ([134.134.136.24]:55143 "EHLO mga09.intel.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234296AbhIPHh4 (ORCPT ); Thu, 16 Sep 2021 03:37:56 -0400 X-IronPort-AV: E=McAfee;i="6200,9189,10108"; a="222546991" X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="222546991" Received: from orsmga007.jf.intel.com ([10.7.209.58]) by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Sep 2021 00:36:35 -0700 X-ExtLoop1: 1 X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; d="scan'208";a="472680027" Received: from chaop.bj.intel.com (HELO localhost) ([10.240.192.135]) by orsmga007.jf.intel.com with ESMTP; 16 Sep 2021 00:36:28 -0700 Date: Thu, 16 Sep 2021 15:36:27 +0800 From: Chao Peng To: "Kirill A. Shutemov" Cc: "Kirill A. Shutemov" , Andy Lutomirski , Sean Christopherson , Paolo Bonzini , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Borislav Petkov , Andrew Morton , Joerg Roedel , Andi Kleen , David Rientjes , Vlastimil Babka , Tom Lendacky , Thomas Gleixner , Peter Zijlstra , Ingo Molnar , Varad Gautam , Dario Faggioli , x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev, Kuppuswamy Sathyanarayanan , David Hildenbrand , Dave Hansen , Yu Zhang Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Message-ID: <20210916073627.GA18399@chaop.bj.intel.com> Reply-To: Chao Peng References: <20210824005248.200037-1-seanjc@google.com> <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name> <20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name> <02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org> <20210910171811.xl3lms6xoj3kx223@box.shutemov.name> <20210915195857.GA52522@chaop.bj.intel.com> <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> MIME-Version: 1.0 Content-Type: text/plain; charset=gb2312 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com> User-Agent: Mutt/1.9.4 (2018-02-28) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote: > On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote: > > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote: > > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote: > > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote: > > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote: > > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem? > > > > > > > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look > > > > > closer into locking next. > > > > > > > > We can decisively eliminate this sort of failure by making the switch > > > > happen at open time instead of after. For a memfd-like API, this would > > > > be straightforward. For a filesystem, it would take a bit more thought. > > > > > > I think it should work fine as long as we check seals after i_size in the > > > read path. See the comment in shmem_file_read_iter(). > > > > > > Below is updated version. I think it should be good enough to start > > > integrate with KVM. > > > > > > I also attach a test-case that consists of kernel patch and userspace > > > program. It demonstrates how it can be integrated into KVM code. > > > > > > One caveat I noticed is that guest_ops::invalidate_page_range() can be > > > called after the owner (struct kvm) has being freed. It happens because > > > memfd can outlive KVM. So the callback has to check if such owner exists, > > > than check that there's a memslot with such inode. > > > > Would introducing memfd_unregister_guest() fix this? > > I considered this, but it get complex quickly. > > At what point it gets called? On KVM memslot destroy? I meant when the VM gets destroyed. > > What if multiple KVM slot share the same memfd? Add refcount into memfd on > how many times the owner registered the memfd? > > It would leave us in strange state: memfd refcount owners (struct KVM) and > KVM memslot pins the struct file. Weird refcount exchnage program. > > I hate it. But yes agree things will get much complex in practice. > > > > I guess it should be okay: we have vm_list we can check owner against. > > > We may consider replace vm_list with something more scalable if number of > > > VMs will get too high. > > > > > > Any comments? > > > > > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h > > > index 4f1600413f91..3005e233140a 100644 > > > --- a/include/linux/memfd.h > > > +++ b/include/linux/memfd.h > > > @@ -4,13 +4,34 @@ > > > > > > #include > > > > > > +struct guest_ops { > > > + void (*invalidate_page_range)(struct inode *inode, void *owner, > > > + pgoff_t start, pgoff_t end); > > > +}; > > > > I can see there are two scenarios to invalidate page(s), when punching a > > hole or ftruncating to 0, in either cases KVM should already been called > > with necessary information from usersapce with memory slot punch hole > > syscall or memory slot delete syscall, so wondering this callback is > > really needed. > > So what you propose? Forbid truncate/punch from userspace and make KVM > handle punch hole/truncate from within kernel? I think it's layering > violation. As far as I understand the flow for punch hole/truncate in this design, there will be two steps for userspace: 1. punch hole/delete kvm memory slot, and then 2. puncn hole/truncate on the memory backing store fd. In concept we can do whatever needed for invalidation in either steps. If we can do the invalidation in step 1 then we don¡¯t need bothering this callback. This is what I mean but agree the current callback can also work. > > > > + > > > +struct guest_mem_ops { > > > + unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset); > > > + void (*put_unlock_pfn)(unsigned long pfn); > > > > Same as above, I¡¯m not clear on which time put_unlock_pfn() would be > > called, I¡¯m thinking the page can be put_and_unlock when userspace > > punching a hole or ftruncating to 0 on the fd. > > No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way > we close race between SEPT population and truncate/punch. get_lock_pfn() > would stop truncate untile put_unlock_pfn() called. Okay, makes sense. Thanks, Chao