Date: Wed, 15 Sep 2021 17:11:47 +0300
From: "Kirill A. Shutemov"
To: Chao Peng
Cc: "Kirill A. Shutemov", Andy Lutomirski, Sean Christopherson, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, Borislav Petkov, Andrew Morton, Andi Kleen,
	David Rientjes, Vlastimil Babka, Tom Lendacky, Thomas Gleixner,
	Peter Zijlstra, Ingo Molnar, Varad Gautam, Dario Faggioli, x86@kernel.org,
	linux-mm@kvack.org, linux-coco@lists.linux.dev,
	Kuppuswamy Sathyanarayanan, David Hildenbrand, Dave Hansen, Yu Zhang
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Message-ID: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>
References: <20210824005248.200037-1-seanjc@google.com>
	<20210902184711.7v65p5lwhpr2pvk7@box.shutemov.name>
	<20210903191414.g7tfzsbzc7tpkx37@box.shutemov.name>
	<02806f62-8820-d5f9-779c-15c0e9cd0e85@kernel.org>
	<20210910171811.xl3lms6xoj3kx223@box.shutemov.name>
	<20210915195857.GA52522@chaop.bj.intel.com>
In-Reply-To: <20210915195857.GA52522@chaop.bj.intel.com>
On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > >
> > > > I guess. Maybe we would need a WRITE_ONCE() on set. I don't know. I will
> > > > look closer into locking next.
> > >
> > > We can decisively eliminate this sort of failure by making the switch
> > > happen at open time instead of after. For a memfd-like API, this would
> > > be straightforward. For a filesystem, it would take a bit more thought.
> >
> > I think it should work fine as long as we check seals after i_size in the
> > read path. See the comment in shmem_file_read_iter().
> >
> > Below is the updated version. I think it should be good enough to start
> > integrating with KVM.
> >
> > I also attach a test case that consists of a kernel patch and a userspace
> > program. It demonstrates how it can be integrated into KVM code.
> >
> > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > called after the owner (struct kvm) has been freed. It happens because
> > the memfd can outlive KVM. So the callback has to check whether such an
> > owner exists, then check that there's a memslot with such an inode.
>
> Would introducing memfd_unregister_guest() fix this?

I considered this, but it gets complex quickly.
At what point does it get called? On KVM memslot destroy? What if multiple
KVM slots share the same memfd? Add a refcount into the memfd for how many
times the owner registered it?

It would leave us in a strange state: the memfd refcounts its owners
(struct kvm) and the KVM memslot pins the struct file. A weird refcount
exchange program. I hate it.

> > I guess it should be okay: we have vm_list we can check the owner against.
> > We may consider replacing vm_list with something more scalable if the
> > number of VMs gets too high.
> >
> > Any comments?
> >
> > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > index 4f1600413f91..3005e233140a 100644
> > --- a/include/linux/memfd.h
> > +++ b/include/linux/memfd.h
> > @@ -4,13 +4,34 @@
> >
> >  #include
> >
> > +struct guest_ops {
> > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > +				      pgoff_t start, pgoff_t end);
> > +};
>
> I can see there are two scenarios to invalidate page(s): when punching a
> hole or ftruncating to 0. In either case KVM should already have been
> called with the necessary information from userspace, via the memory slot
> punch hole syscall or memory slot delete syscall, so I'm wondering whether
> this callback is really needed.

So what do you propose? Forbid truncate/punch from userspace and make KVM
handle punch hole/truncate from within the kernel?

I think it's a layering violation.

> > +
> > +struct guest_mem_ops {
> > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > +	void (*put_unlock_pfn)(unsigned long pfn);
>
> Same as above, I'm not clear on at which point put_unlock_pfn() would be
> called. I'm thinking the page can be put-and-unlocked when userspace
> punches a hole or ftruncates to 0 on the fd.

No. put_unlock_pfn() has to be called after the pfn is in the SEPT. This
way we close the race between SEPT population and truncate/punch.
get_lock_pfn() would stop truncate until put_unlock_pfn() is called.
> We did miss a pfn_mapping_level() callback, which is needed for KVM to
> query the page size level (e.g. 4K or 2M) that the backing store can
> support.

Okay, makes sense. We can return the information as part of the
get_lock_pfn() call.

> Are we sticking our design to the memfd interface (i.e. other memory
> backing stores like tmpfs and hugetlbfs will all rely on this memfd
> interface to interact with KVM), or is this just the initial
> implementation for a PoC?
>
> If we really want to expose multiple memory backing stores directly to
> KVM (as opposed to exposing backing stores to memfd and then having memfd
> expose a single interface to KVM), I feel we need a third layer between
> KVM and the backing stores to eliminate direct calls like this. Backing
> stores could register "memory fd providers" and KVM should be able to
> connect to the right backing store provider with the fd provided by
> userspace, with the help of this third layer.

memfd can provide shmem and hugetlbfs. That should be enough for now.

-- 
 Kirill A. Shutemov