From: Sean Christopherson <seanjc@google.com>
To: Fuad Tabba <tabba@google.com>
Cc: Chao Peng <chao.p.peng@linux.intel.com>,
	David Hildenbrand <david@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
	qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
	Jonathan Corbet <corbet@lwn.net>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
	Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>,
	"J . Bruce Fields" <bfields@fieldses.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Shuah Khan <shuah@kernel.org>, Mike Rapoport <rppt@kernel.org>,
	Steven Price <steven.price@arm.com>,
	"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
	Vlastimil Babka <vbabka@suse.cz>,
	Vishal Annapurve <vannapurve@google.com>,
	Yu Zhang <yu.c.zhang@linux.intel.com>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
	ak@linux.intel.com, aarcange@redhat.com, ddutile@redhat.com,
	dhildenb@redhat.com, Quentin Perret <qperret@google.com>,
	Michael Roth <michael.roth@amd.com>,
	mhocko@suse.com, Muchun Song <songmuchun@bytedance.com>,
	wei.w.wang@intel.com, Will Deacon <will@kernel.org>,
	Marc Zyngier <maz@kernel.org>
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
Date: Tue, 18 Oct 2022 00:33:48 +0000
Message-ID: <Y030bGhh0mvGS6E1@google.com>
In-Reply-To: <CA+EHjTw3din891hMUeRW-cn46ktyMWSdoB31pL+zWpXo_=3UVg@mail.gmail.com>

On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > pKVM would also need a way to make an fd accessible again
> > > > > when shared back, which I think isn't possible with this patch.
> > > >
> > > > But does pKVM really want to mmap/munmap a new region at the page-level,
> > > > that can cause VMA fragmentation if the conversion is frequent as I see.
> > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > be the same issue.
> > >
> > > pKVM doesn't really need to unmap the memory. What is really important
> > > is that the memory is not GUP'able.
> >
> > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> >
> > The problem is that gup() and "mapped" are tied together.  So yes, pKVM doesn't
> > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > the end result is the same.
> >
> > Emphasis above because pKVM still needs to unmap the memory _somewhere_.  IIUC, the
> > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > context of the hypervisor.  Which is also the source of the gup() problems; the
> > untrusted kernel is blissfully unaware that the memory is inaccessible.
> >
> > Any approach that moves some of that information into the untrusted kernel so that
> > the kernel can protect itself will incur fragmentation in the VMAs.  Well, unless
> > all of guest memory becomes unguppable, but that's likely not a viable option.
> 
> Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> all if we use the new inaccessible_get_pfn().

Ya, I was referring to pKVM without UPM / inaccessible memory.

Jumping back to blocking gup(), what about using the same tricks as secretmem to
block gup()?  E.g. compare vm_ops to block regular gup() and a_ops to block fast
gup() on struct page?  With a Kconfig that's selected by pKVM (which would also
need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
controversial.
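
To make the shape of that concrete, here is a rough userspace model, not kernel
code.  It mirrors what secretmem does with vma_is_secretmem() and
page_is_secretmem(), i.e. pointer comparisons against the backing store's ops
tables.  Every name below (the model_* types, inaccessible_vm_ops,
inaccessible_aops, vma_is_inaccessible(), page_is_inaccessible()) is made up
for illustration, and the #define only stands in for the TBD Kconfig.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the TBD Kconfig; a real kernel build would get it from .config. */
#define CONFIG_INACCESSIBLE_MAPPABLE_MEM 1

/* Heavily simplified stand-ins for the kernel structures (hypothetical). */
struct model_vm_ops { int unused; };
struct model_aops { int unused; };
struct model_address_space { const struct model_aops *a_ops; };
struct model_vma { const struct model_vm_ops *vm_ops; };
struct model_page { struct model_address_space *mapping; };

#ifdef CONFIG_INACCESSIBLE_MAPPABLE_MEM
/* The ops tables of the (hypothetical) inaccessible-but-mappable backing store. */
static const struct model_vm_ops inaccessible_vm_ops = { 0 };
static const struct model_aops inaccessible_aops = { 0 };

/* Regular gup() walks VMAs, so it can reject based on vm_ops. */
static bool vma_is_inaccessible(const struct model_vma *vma)
{
	return vma->vm_ops == &inaccessible_vm_ops;
}

/* Fast gup() only has the page, so it has to go through page->mapping->a_ops. */
static bool page_is_inaccessible(const struct model_page *page)
{
	return page->mapping && page->mapping->a_ops == &inaccessible_aops;
}
#else
/* With the Kconfig off the checks compile away to a constant. */
static bool vma_is_inaccessible(const struct model_vma *vma)
{
	(void)vma;
	return false;
}

static bool page_is_inaccessible(const struct model_page *page)
{
	(void)page;
	return false;
}
#endif
```

The #else stubs are the "zero overhead" part: a kernel without the Kconfig
never sees anything but "return false".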

I suspect the fast gup() path could even be optimized to avoid the page_mapping()
lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
selected.  I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
anytime soon, so there should be plenty of page flags to go around.
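
A sketch of the flag variant, again as a userspace model under made-up names
(flag_page, MODEL_PG_*, flag_page_inaccessible(), gup_fast_allowed()); the
neighbouring flag bits are placeholders and the #define stands in for the TBD
Kconfig.  The only point is that the check becomes a bit test on the page's
own flags word, with no pointer chase through the mapping.

```c
#include <stdbool.h>

/* Stand-in for the TBD Kconfig. */
#define CONFIG_INACCESSIBLE_MAPPABLE_MEM 1

/* Toy page-flag enumeration; the extra bit only exists when the Kconfig is set. */
enum model_pageflags {
	MODEL_PG_locked,
	MODEL_PG_dirty,
#ifdef CONFIG_INACCESSIBLE_MAPPABLE_MEM
	MODEL_PG_inaccessible,
#endif
	MODEL_NR_PAGEFLAGS,
};

/* Hypothetical stand-in for struct page, reduced to its flags word. */
struct flag_page {
	unsigned long flags;
};

/* Test the flag directly instead of dereferencing page->mapping->a_ops. */
static inline bool flag_page_inaccessible(const struct flag_page *page)
{
#ifdef CONFIG_INACCESSIBLE_MAPPABLE_MEM
	return (page->flags & (1UL << MODEL_PG_inaccessible)) != 0;
#else
	(void)page;
	return false;
#endif
}

/* What the fast gup() path would ask before taking a reference. */
static inline bool gup_fast_allowed(const struct flag_page *page)
{
	return !flag_page_inaccessible(page);
}
```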

Blocking gup() instead of trying to play refcount games when converting back to
private would eliminate the need to put heavy restrictions on mapping, as the goal
of those restrictions was purely to simplify the KVM implementation, e.g. the "one
mapping per memslot" thing would go away entirely.

> This of course goes back to what I'd mentioned before in v7; it seems that
> representing the memslot memory as a file descriptor should be orthogonal to
> whether the memory is shared or private, rather than a private_fd for private
> memory and the userspace_addr for shared memory.

I also explored the idea of backing any guest memory with an fd, but came to
the conclusion that private memory needs a separate handle[1], at least on x86.

For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
TDX steal GPA bits to differentiate private vs. shared), the two types need to be
treated as separate mappings[2].  Post-boot, converting is lossy in both directions,
so even conceptually they are two distinct pages that just happen to share (some)
GPA bits.

To allow conversions, i.e. changing which mapping to use, without memslot updates,
KVM needs to let userspace provide both mappings in a single memslot.  So while
fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
KVM would still need a dedicated private handle.
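
As a toy model of "both mappings in a single memslot", something like the
below.  This is a sketch with illustrative names only: userspace_addr and
private_fd are the handles discussed in this thread, while model_memslot,
model_target, model_resolve(), the private_offset field and the 4KiB page size
are assumptions made for the example, not the layout in the series.

```c
#include <stdbool.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096ULL	/* assumed page size for the example */

/* One slot carries both handles, so a conversion never touches the slot. */
struct model_memslot {
	uint64_t base_gfn;
	uint64_t npages;
	uint64_t userspace_addr;	/* shared mapping: host virtual address */
	int private_fd;			/* private mapping: fd-based handle */
	uint64_t private_offset;	/* offset of the slot within private_fd */
};

enum model_backing { MODEL_SHARED, MODEL_PRIVATE };

struct model_target {
	enum model_backing kind;
	uint64_t addr;		/* valid when kind == MODEL_SHARED */
	int fd;			/* valid when kind == MODEL_PRIVATE */
	uint64_t offset;	/* valid when kind == MODEL_PRIVATE */
};

/*
 * Pick which of the two mappings backs @gfn.  @is_private is per-page state
 * tracked outside the slot; flipping it is the "conversion".
 */
static bool model_resolve(const struct model_memslot *slot, uint64_t gfn,
			  bool is_private, struct model_target *out)
{
	uint64_t delta;

	if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
		return false;

	delta = (gfn - slot->base_gfn) * MODEL_PAGE_SIZE;
	if (is_private) {
		out->kind = MODEL_PRIVATE;
		out->fd = slot->private_fd;
		out->offset = slot->private_offset + delta;
	} else {
		out->kind = MODEL_SHARED;
		out->addr = slot->userspace_addr + delta;
	}
	return true;
}
```

The same gfn resolves to either backing depending only on the per-page
private/shared state, which is why the slot itself needs to carry a dedicated
private handle alongside userspace_addr.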

For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
userspace_addr, but since the private_fd is going to be added for x86, I think it
makes sense to use that instead of adding generic fd-based memory for pKVM's use
case (which is arguably still "private" memory but with special semantics).

[1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
[2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

>  The host can then map or unmap the shared/private memory using the fd, which
>  allows it more freedom in even choosing to unmap shared memory when not
>  needed, for example.

