From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <SRS0=V4uO=OG=kvack.org=owner-linux-mm@kernel.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-10.2 required=3.0 tests=BAYES_00,
	HEADER_FROM_DIFFERENT_DOMAINS,INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,
	SPF_PASS,USER_AGENT_SANE_1 autolearn=ham autolearn_force=no version=3.4.0
Received: from mail.kernel.org (mail.kernel.org [198.145.29.99])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 4B047C433FE
	for <linux-mm@archiver.kernel.org>; Thu, 16 Sep 2021 07:36:40 +0000 (UTC)
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.kernel.org (Postfix) with ESMTP id D3065611C6
	for <linux-mm@archiver.kernel.org>; Thu, 16 Sep 2021 07:36:39 +0000 (UTC)
DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org D3065611C6
Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=linux.intel.com
Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org
Received: by kanga.kvack.org (Postfix)
	id DDC4A6B0071; Thu, 16 Sep 2021 03:36:38 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id D8C5D6B0072; Thu, 16 Sep 2021 03:36:38 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id C7AE76B0073; Thu, 16 Sep 2021 03:36:38 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from forelay.hostedemail.com (smtprelay0187.hostedemail.com [216.40.44.187])
	by kanga.kvack.org (Postfix) with ESMTP id BA2B16B0071
	for <linux-mm@kvack.org>; Thu, 16 Sep 2021 03:36:38 -0400 (EDT)
Received: from smtpin35.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251])
	by forelay02.hostedemail.com (Postfix) with ESMTP id 5D3502C6B9
	for <linux-mm@kvack.org>; Thu, 16 Sep 2021 07:36:38 +0000 (UTC)
X-FDA: 78592629276.35.BDFADA7
Received: from mga09.intel.com (mga09.intel.com [134.134.136.24])
	by imf08.hostedemail.com (Postfix) with ESMTP id 6309830000B8
	for <linux-mm@kvack.org>; Thu, 16 Sep 2021 07:36:37 +0000 (UTC)
X-IronPort-AV: E=McAfee;i="6200,9189,10108"; a="222546992"
X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; 
   d="scan'208";a="222546992"
Received: from orsmga007.jf.intel.com ([10.7.209.58])
  by orsmga102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 16 Sep 2021 00:36:35 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.85,297,1624345200"; 
   d="scan'208";a="472680027"
Received: from chaop.bj.intel.com (HELO localhost) ([10.240.192.135])
  by orsmga007.jf.intel.com with ESMTP; 16 Sep 2021 00:36:28 -0700
Date: Thu, 16 Sep 2021 15:36:27 +0800
From: Chao Peng <chao.p.peng@linux.intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Andy Lutomirski <luto@kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Borislav Petkov <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Joerg Roedel <jroedel@suse.de>, Andi Kleen <ak@linux.intel.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Varad Gautam <varad.gautam@suse.com>,
	Dario Faggioli <dfaggioli@suse.com>, x86@kernel.org,
	linux-mm@kvack.org, linux-coco@lists.linux.dev,
	Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest
 private memory
Message-ID: <20210916073627.GA18399@chaop.bj.intel.com>
Reply-To: Chao Peng <chao.p.peng@linux.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
 <20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name>
 <YTE1GzPimvUB1FOF@google.com>
 <20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name>
 <02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org>
 <20210910171811.xl3lms6xoj3kx223@box.shutemov.name>
 <20210915195857.GA52522@chaop.bj.intel.com>
 <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=gb2312
Content-Disposition: inline
In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>
User-Agent: Mutt/1.9.4 (2018-02-28)
X-Rspamd-Server: rspam05
X-Rspamd-Queue-Id: 6309830000B8
X-Stat-Signature: z1wqiy76ct1zhtpastgmsdq1z6iuezs9
Authentication-Results: imf08.hostedemail.com;
	dkim=none;
	spf=none (imf08.hostedemail.com: domain of chao.p.peng@linux.intel.com has no SPF policy when checking 134.134.136.24) smtp.mailfrom=chao.p.peng@linux.intel.com;
	dmarc=fail reason="No valid SPF, No valid DKIM" header.from=intel.com (policy=none)
X-HE-Tag: 1631777797-743783
Content-Transfer-Encoding: quoted-printable
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson w=
rote:
> > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve =
that problem?
> > > > >=20
> > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I =
will look
> > > > > closer into locking next.
> > > >=20
> > > > We can decisively eliminate this sort of failure by making the sw=
itch
> > > > happen at open time instead of after.  For a memfd-like API, this=
 would
> > > > be straightforward.  For a filesystem, it would take a bit more t=
hought.
> > >=20
> > > I think it should work fine as long as we check seals after i_size =
in the
> > > read path. See the comment in shmem_file_read_iter().
> > >=20
> > > Below is updated version. I think it should be good enough to start
> > > integrate with KVM.
> > >=20
> > > I also attach a test-case that consists of kernel patch and userspa=
ce
> > > program. It demonstrates how it can be integrated into KVM code.
> > >=20
> > > One caveat I noticed is that guest_ops::invalidate_page_range() can=
 be
> > > called after the owner (struct kvm) has being freed. It happens bec=
ause
> > > memfd can outlive KVM. So the callback has to check if such owner e=
xists,
> > > than check that there's a memslot with such inode.
> >=20
> > Would introducing memfd_unregister_guest() fix this?
>=20
> I considered this, but it get complex quickly.
>=20
> At what point it gets called? On KVM memslot destroy?

I meant when the VM gets destroyed.

>=20
> What if multiple KVM slot share the same memfd? Add refcount into memfd=
 on
> how many times the owner registered the memfd?
>=20
> It would leave us in strange state: memfd refcount owners (struct KVM) =
and
> KVM memslot pins the struct file. Weird refcount exchnage program.
>=20
> I hate it.

But yes agree things will get much complex in practice.

>=20
> > > I guess it should be okay: we have vm_list we can check owner again=
st.
> > > We may consider replace vm_list with something more scalable if num=
ber of
> > > VMs will get too high.
> > >=20
> > > Any comments?
> > >=20
> > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > > index 4f1600413f91..3005e233140a 100644
> > > --- a/include/linux/memfd.h
> > > +++ b/include/linux/memfd.h
> > > @@ -4,13 +4,34 @@
> > > =20
> > >  #include <linux/file.h>
> > > =20
> > > +struct guest_ops {
> > > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > > +				      pgoff_t start, pgoff_t end);
> > > +};
> >=20
> > I can see there are two scenarios to invalidate page(s), when punchin=
g a
> > hole or ftruncating to 0, in either cases KVM should already been cal=
led
> > with necessary information from usersapce with memory slot punch hole
> > syscall or memory slot delete syscall, so wondering this callback is
> > really needed.
>=20
> So what you propose? Forbid truncate/punch from userspace and make KVM
> handle punch hole/truncate from within kernel? I think it's layering
> violation.

As far as I understand the flow for punch hole/truncate in this design,
there will be two steps for userspace:
  1. punch hole/delete kvm memory slot, and then
  2. puncn hole/truncate on the memory backing store fd.

In concept we can do whatever needed for invalidation in either steps.
If we can do the invalidation in step 1 then we don=A1=AFt need bothering
this callback. This is what I mean but agree the current callback can
also work.

>=20
> > > +
> > > +struct guest_mem_ops {
> > > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset=
);
> > > +	void (*put_unlock_pfn)(unsigned long pfn);
> >=20
> > Same as above, I=A1=AFm not clear on which time put_unlock_pfn() woul=
d be
> > called, I=A1=AFm thinking the page can be put_and_unlock when userspa=
ce
> > punching a hole or ftruncating to 0 on the fd.
>=20
> No. put_unlock_pfn() has to be called after the pfn is in SEPT. This wa=
y
> we close race between SEPT population and truncate/punch. get_lock_pfn(=
)
> would stop truncate untile put_unlock_pfn() called.

Okay, makes sense.

Thanks,
Chao