All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chao Peng <chao.p.peng@linux.intel.com>
To: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>,
	Andy Lutomirski <luto@kernel.org>,
	Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Borislav Petkov <bp@alien8.de>,
	Andrew Morton <akpm@linux-foundation.org>,
	Joerg Roedel <jroedel@suse.de>, Andi Kleen <ak@linux.intel.com>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	Tom Lendacky <thomas.lendacky@amd.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Varad Gautam <varad.gautam@suse.com>,
	Dario Faggioli <dfaggioli@suse.com>,
	x86@kernel.org, linux-mm@kvack.org, linux-coco@lists.linux.dev,
	Kuppuswamy Sathyanarayanan
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	Dave Hansen <dave.hansen@intel.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Thu, 16 Sep 2021 15:36:27 +0800	[thread overview]
Message-ID: <20210916073627.GA18399@chaop.bj.intel.com> (raw)
In-Reply-To: <20210915141147.s4mgtcfv3ber5fnt@black.fi.intel.com>

On Wed, Sep 15, 2021 at 05:11:47PM +0300, Kirill A. Shutemov wrote:
> On Wed, Sep 15, 2021 at 07:58:57PM +0000, Chao Peng wrote:
> > On Fri, Sep 10, 2021 at 08:18:11PM +0300, Kirill A. Shutemov wrote:
> > > On Fri, Sep 03, 2021 at 12:15:51PM -0700, Andy Lutomirski wrote:
> > > > On 9/3/21 12:14 PM, Kirill A. Shutemov wrote:
> > > > > On Thu, Sep 02, 2021 at 08:33:31PM +0000, Sean Christopherson wrote:
> > > > >> Would requiring the size to be '0' at F_SEAL_GUEST time solve that problem?
> > > > > 
> > > > > I guess. Maybe we would need a WRITE_ONCE() on set. I donno. I will look
> > > > > closer into locking next.
> > > > 
> > > > We can decisively eliminate this sort of failure by making the switch
> > > > happen at open time instead of after.  For a memfd-like API, this would
> > > > be straightforward.  For a filesystem, it would take a bit more thought.
> > > 
> > > I think it should work fine as long as we check seals after i_size in the
> > > read path. See the comment in shmem_file_read_iter().
> > > 
> > > Below is updated version. I think it should be good enough to start
> > > integrate with KVM.
> > > 
> > > I also attach a test-case that consists of kernel patch and userspace
> > > program. It demonstrates how it can be integrated into KVM code.
> > > 
> > > One caveat I noticed is that guest_ops::invalidate_page_range() can be
> > > called after the owner (struct kvm) has being freed. It happens because
> > > memfd can outlive KVM. So the callback has to check if such owner exists,
> > > than check that there's a memslot with such inode.
> > 
> > Would introducing memfd_unregister_guest() fix this?
> 
> I considered this, but it get complex quickly.
> 
> At what point it gets called? On KVM memslot destroy?

I meant when the VM gets destroyed.

> 
> What if multiple KVM slot share the same memfd? Add refcount into memfd on
> how many times the owner registered the memfd?
> 
> It would leave us in strange state: memfd refcount owners (struct KVM) and
> KVM memslot pins the struct file. Weird refcount exchnage program.
> 
> I hate it.

But yes agree things will get much complex in practice.

> 
> > > I guess it should be okay: we have vm_list we can check owner against.
> > > We may consider replace vm_list with something more scalable if number of
> > > VMs will get too high.
> > > 
> > > Any comments?
> > > 
> > > diff --git a/include/linux/memfd.h b/include/linux/memfd.h
> > > index 4f1600413f91..3005e233140a 100644
> > > --- a/include/linux/memfd.h
> > > +++ b/include/linux/memfd.h
> > > @@ -4,13 +4,34 @@
> > >  
> > >  #include <linux/file.h>
> > >  
> > > +struct guest_ops {
> > > +	void (*invalidate_page_range)(struct inode *inode, void *owner,
> > > +				      pgoff_t start, pgoff_t end);
> > > +};
> > 
> > I can see there are two scenarios to invalidate page(s), when punching a
> > hole or ftruncating to 0, in either cases KVM should already been called
> > with necessary information from usersapce with memory slot punch hole
> > syscall or memory slot delete syscall, so wondering this callback is
> > really needed.
> 
> So what you propose? Forbid truncate/punch from userspace and make KVM
> handle punch hole/truncate from within kernel? I think it's layering
> violation.

As far as I understand the flow for punch hole/truncate in this design,
there will be two steps for userspace:
  1. punch hole/delete kvm memory slot, and then
  2. puncn hole/truncate on the memory backing store fd.

In concept we can do whatever needed for invalidation in either steps.
If we can do the invalidation in step 1 then we don’t need bothering
this callback. This is what I mean but agree the current callback can
also work.

> 
> > > +
> > > +struct guest_mem_ops {
> > > +	unsigned long (*get_lock_pfn)(struct inode *inode, pgoff_t offset);
> > > +	void (*put_unlock_pfn)(unsigned long pfn);
> > 
> > Same as above, I’m not clear on which time put_unlock_pfn() would be
> > called, I’m thinking the page can be put_and_unlock when userspace
> > punching a hole or ftruncating to 0 on the fd.
> 
> No. put_unlock_pfn() has to be called after the pfn is in SEPT. This way
> we close race between SEPT population and truncate/punch. get_lock_pfn()
> would stop truncate untile put_unlock_pfn() called.

Okay, makes sense.

Thanks,
Chao

  reply	other threads:[~2021-09-16  7:37 UTC|newest]

Thread overview: 142+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-24  0:52 [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory Sean Christopherson
2021-08-24  0:52 ` Sean Christopherson
2021-08-24 10:48 ` Yu Zhang
2021-08-24 10:48   ` Yu Zhang
2021-08-26  0:35   ` Sean Christopherson
2021-08-26  0:35     ` Sean Christopherson
2021-08-26 13:23     ` Yu Zhang
2021-08-26 13:23       ` Yu Zhang
2021-08-26 10:15 ` David Hildenbrand
2021-08-26 10:15   ` David Hildenbrand
2021-08-26 17:05   ` Andy Lutomirski
2021-08-26 17:05     ` Andy Lutomirski
2021-08-26 21:26     ` David Hildenbrand
2021-08-26 21:26       ` David Hildenbrand
2021-08-27 18:24       ` Andy Lutomirski
2021-08-27 18:24         ` Andy Lutomirski
2021-08-27 22:28         ` Sean Christopherson
2021-08-27 22:28           ` Sean Christopherson
2021-08-31 19:12           ` David Hildenbrand
2021-08-31 19:12             ` David Hildenbrand
2021-08-31 20:45             ` Sean Christopherson
2021-08-31 20:45               ` Sean Christopherson
2021-09-01  7:51               ` David Hildenbrand
2021-09-01  7:51                 ` David Hildenbrand
2021-08-27  2:31   ` Yu Zhang
2021-08-27  2:31     ` Yu Zhang
2021-08-31 19:08     ` David Hildenbrand
2021-08-31 19:08       ` David Hildenbrand
2021-08-31 20:01       ` Andi Kleen
2021-08-31 20:01         ` Andi Kleen
2021-08-31 20:15         ` David Hildenbrand
2021-08-31 20:15           ` David Hildenbrand
2021-08-31 20:39           ` Andi Kleen
2021-08-31 20:39             ` Andi Kleen
2021-09-01  3:34             ` Yu Zhang
2021-09-01  3:34               ` Yu Zhang
2021-09-01  4:53     ` Andy Lutomirski
2021-09-01  4:53       ` Andy Lutomirski
2021-09-01  7:12       ` Tian, Kevin
2021-09-01  7:12         ` Tian, Kevin
2021-09-01 10:24       ` Yu Zhang
2021-09-01 10:24         ` Yu Zhang
2021-09-01 16:07         ` Andy Lutomirski
2021-09-01 16:07           ` Andy Lutomirski
2021-09-01 16:27           ` David Hildenbrand
2021-09-01 16:27             ` David Hildenbrand
2021-09-02  8:34             ` Yu Zhang
2021-09-02  8:34               ` Yu Zhang
2021-09-02  8:44               ` David Hildenbrand
2021-09-02  8:44                 ` David Hildenbrand
2021-09-02 11:02                 ` Yu Zhang
2021-09-02 11:02                   ` Yu Zhang
2021-09-02  8:19           ` Yu Zhang
2021-09-02  8:19             ` Yu Zhang
2021-09-02 18:41             ` Andy Lutomirski
2021-09-02 18:41               ` Andy Lutomirski
2021-09-07  1:33             ` Yan Zhao
2021-09-07  1:33               ` Yan Zhao
2021-09-02  9:27           ` Joerg Roedel
2021-09-02  9:27             ` Joerg Roedel
2021-09-02 18:41             ` Andy Lutomirski
2021-09-02 18:41               ` Andy Lutomirski
2021-09-02 18:57               ` Sean Christopherson
2021-09-02 18:57                 ` Sean Christopherson
2021-09-02 19:07                 ` Dave Hansen
2021-09-02 19:07                   ` Dave Hansen
2021-09-02 20:42                   ` Andy Lutomirski
2021-09-02 20:42                     ` Andy Lutomirski
2021-08-27 22:18   ` Sean Christopherson
2021-08-27 22:18     ` Sean Christopherson
2021-08-31 19:07     ` David Hildenbrand
2021-08-31 19:07       ` David Hildenbrand
2021-08-31 21:54       ` Sean Christopherson
2021-08-31 21:54         ` Sean Christopherson
2021-09-01  8:09         ` David Hildenbrand
2021-09-01  8:09           ` David Hildenbrand
2021-09-01 15:54           ` Andy Lutomirski
2021-09-01 15:54             ` Andy Lutomirski
2021-09-01 16:16             ` David Hildenbrand
2021-09-01 16:16               ` David Hildenbrand
2021-09-01 17:09               ` Andy Lutomirski
2021-09-01 17:09                 ` Andy Lutomirski
2021-09-01 16:18             ` James Bottomley
2021-09-01 16:18               ` James Bottomley
2021-09-01 16:22               ` David Hildenbrand
2021-09-01 16:22                 ` David Hildenbrand
2021-09-01 16:31                 ` James Bottomley
2021-09-01 16:31                   ` James Bottomley
2021-09-01 16:37                   ` David Hildenbrand
2021-09-01 16:37                     ` David Hildenbrand
2021-09-01 16:45                     ` James Bottomley
2021-09-01 16:45                       ` James Bottomley
2021-09-01 17:08                       ` David Hildenbrand
2021-09-01 17:08                         ` David Hildenbrand
2021-09-01 17:50                         ` Sean Christopherson
2021-09-01 17:50                           ` Sean Christopherson
2021-09-01 17:53                           ` David Hildenbrand
2021-09-01 17:53                             ` David Hildenbrand
2021-09-01 17:08               ` Andy Lutomirski
2021-09-01 17:08                 ` Andy Lutomirski
2021-09-01 17:13                 ` James Bottomley
2021-09-01 17:13                   ` James Bottomley
2021-09-02 10:18                 ` Joerg Roedel
2021-09-02 10:18                   ` Joerg Roedel
2021-09-01 18:24               ` Andy Lutomirski
2021-09-01 18:24                 ` Andy Lutomirski
2021-09-01 19:26               ` Dave Hansen
2021-09-01 19:26                 ` Dave Hansen
2021-09-07 15:00               ` Tom Lendacky
2021-09-07 15:00                 ` Tom Lendacky
2021-09-01  4:58       ` Andy Lutomirski
2021-09-01  4:58         ` Andy Lutomirski
2021-09-01  7:49         ` David Hildenbrand
2021-09-01  7:49           ` David Hildenbrand
2021-09-02 18:47 ` Kirill A. Shutemov
2021-09-02 18:47   ` Kirill A. Shutemov
2021-09-02 20:33   ` Sean Christopherson
2021-09-02 20:33     ` Sean Christopherson
2021-09-03 19:14     ` Kirill A. Shutemov
2021-09-03 19:14       ` Kirill A. Shutemov
2021-09-03 19:15       ` Andy Lutomirski
2021-09-03 19:15         ` Andy Lutomirski
2021-09-10 17:18         ` Kirill A. Shutemov
2021-09-10 17:18           ` Kirill A. Shutemov
2021-09-15 19:58           ` Chao Peng
2021-09-15 19:58             ` Chao Peng
2021-09-15 13:51             ` David Hildenbrand
2021-09-15 13:51               ` David Hildenbrand
2021-09-15 14:29               ` Kirill A. Shutemov
2021-09-15 14:29                 ` Kirill A. Shutemov
2021-09-15 14:59                 ` David Hildenbrand
2021-09-15 14:59                   ` David Hildenbrand
2021-09-15 15:35                   ` David Hildenbrand
2021-09-15 15:35                     ` David Hildenbrand
2021-09-15 20:04                   ` Kirill A. Shutemov
2021-09-15 20:04                     ` Kirill A. Shutemov
2021-09-15 14:11             ` Kirill A. Shutemov
2021-09-15 14:11               ` Kirill A. Shutemov
2021-09-16  7:36               ` Chao Peng [this message]
2021-09-16  7:36                 ` Chao Peng
2021-09-16  9:24               ` Paolo Bonzini
2021-09-16  9:24                 ` Paolo Bonzini

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210916073627.GA18399@chaop.bj.intel.com \
    --to=chao.p.peng@linux.intel.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=bp@alien8.de \
    --cc=dave.hansen@intel.com \
    --cc=david@redhat.com \
    --cc=dfaggioli@suse.com \
    --cc=jmattson@google.com \
    --cc=joro@8bytes.org \
    --cc=jroedel@suse.de \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kirill@shutemov.name \
    --cc=kvm@vger.kernel.org \
    --cc=linux-coco@lists.linux.dev \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=luto@kernel.org \
    --cc=mingo@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rientjes@google.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=seanjc@google.com \
    --cc=tglx@linutronix.de \
    --cc=thomas.lendacky@amd.com \
    --cc=varad.gautam@suse.com \
    --cc=vbabka@suse.cz \
    --cc=vkuznets@redhat.com \
    --cc=wanpengli@tencent.com \
    --cc=x86@kernel.org \
    --cc=yu.c.zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.