From: Andy Lutomirski <luto@kernel.org>
To: Chao Peng <chao.p.peng@linux.intel.com>,
kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-mm@kvack.org, linux-fsdevel@vger.kernel.org,
linux-api@vger.kernel.org, linux-doc@vger.kernel.org,
qemu-devel@nongnu.org
Cc: Paolo Bonzini <pbonzini@redhat.com>,
Jonathan Corbet <corbet@lwn.net>,
Sean Christopherson <seanjc@google.com>,
Vitaly Kuznetsov <vkuznets@redhat.com>,
Wanpeng Li <wanpengli@tencent.com>,
Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
x86@kernel.org, "H . Peter Anvin" <hpa@zytor.com>,
Hugh Dickins <hughd@google.com>, Jeff Layton <jlayton@kernel.org>,
"J . Bruce Fields" <bfields@fieldses.org>,
Andrew Morton <akpm@linux-foundation.org>,
Mike Rapoport <rppt@kernel.org>,
Steven Price <steven.price@arm.com>,
"Maciej S . Szmigiero" <mail@maciej.szmigiero.name>,
Vlastimil Babka <vbabka@suse.cz>,
Vishal Annapurve <vannapurve@google.com>,
Yu Zhang <yu.c.zhang@linux.intel.com>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
jun.nakajima@intel.com, dave.hansen@intel.com,
ak@linux.intel.com, david@redhat.com, aarcange@redhat.com,
ddutile@redhat.com, dhildenb@redhat.com,
Quentin Perret <qperret@google.com>,
Michael Roth <michael.roth@amd.com>,
mhocko@suse.com
Subject: Re: [PATCH v6 4/8] KVM: Extend the memslot to support fd-based private memory
Date: Fri, 20 May 2022 10:57:41 -0700
Message-ID: <8840b360-cdb2-244c-bfb6-9a0e7306c188@kernel.org>
In-Reply-To: <20220519153713.819591-5-chao.p.peng@linux.intel.com>
On 5/19/22 08:37, Chao Peng wrote:
> Extend the memslot definition to provide guest private memory through a
> file descriptor(fd) instead of userspace_addr(hva). Such guest private
> memory(fd) may never be mapped into userspace so no userspace_addr(hva)
> can be used. Instead add another two new fields
> (private_fd/private_offset), plus the existing memory_size to represent
> the private memory range. Such memslot can still have the existing
> userspace_addr(hva). When use, a single memslot can maintain both
> private memory through private fd(private_fd/private_offset) and shared
> memory through hva(userspace_addr). A GPA is considered private by KVM
> if the memslot has private fd and that corresponding page in the private
> fd is populated, otherwise, it's shared.
>
So this is a strange API and, IMO, a layering violation. I want to make
sure that we're all actually on board with making this a permanent part
of the Linux API. Specifically, we end up with a multiplexing situation
as you have described. For a given GPA, there are *two* possible host
backings: an fd-backed one (from the fd, which is private for now but
might end up potentially shared depending on future extensions) and a
VMA-backed one. The selection of which one backs the address is made
internally by whatever backs the fd.
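
To make the multiplexing concrete, here is a minimal sketch of the
selection rule as the quoted changelog states it: a GPA is private if the
memslot has a private fd and the corresponding page in that fd is
populated, otherwise it is shared. Everything named toy_* is hypothetical
and exists only for illustration; it is not the patchset's code and not
the real struct kvm_memory_slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one memslot, for illustration only. */
struct toy_memslot {
    uint64_t base_gfn;         /* first guest frame covered by the slot */
    uint64_t npages;           /* slot length in pages */
    bool has_private_fd;       /* slot was registered with a private fd */
    const bool *fd_populated;  /* per page: does the fd hold a page here? */
};

enum toy_backing {
    TOY_BACKING_NONE,          /* gfn is outside the slot */
    TOY_BACKING_HVA,           /* shared: resolved through userspace_addr */
    TOY_BACKING_PRIVATE_FD,    /* private: resolved through the fd */
};

/* Pick the host backing for one guest frame under the described rule. */
enum toy_backing toy_resolve(const struct toy_memslot *slot, uint64_t gfn)
{
    assert(slot != NULL);

    if (gfn < slot->base_gfn || gfn - slot->base_gfn >= slot->npages)
        return TOY_BACKING_NONE;

    /* The fd's population state, not the memslot, decides the outcome. */
    if (slot->has_private_fd && slot->fd_populated[gfn - slot->base_gfn])
        return TOY_BACKING_PRIVATE_FD;

    return TOY_BACKING_HVA;
}
```

Note that the only input that varies at runtime is fd_populated, which is
state owned by the fd's backing store rather than by KVM.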
This is, IMO, a clear layering violation. Normally, an fd has an
associated address space, and pages in that address space can have
contents, can be holes that appear to contain all zeros, or can be
holes that are inaccessible. If you try to access a hole, you get
whatever is in the hole.
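
For reference, the ordinary hole behaviour can be checked from userspace
on an unmodified memfd. The following is a rough sketch, assuming Linux
with a glibc that provides memfd_create(); the function name is made up
for this example. It writes to a page, punches it out, and checks that the
hole reads back as zero and that a later write allocates a fresh page.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a punched hole reads as zero and can be refilled by a
 * plain write, 0 if either check fails, and -1 on a setup failure.
 */
int hole_reads_zero_then_refills(void)
{
    long page = sysconf(_SC_PAGESIZE);
    int result = -1;
    int fd;

    assert(page > 0);

    fd = memfd_create("hole-demo", 0);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, page) == 0) {
        volatile unsigned char *p = mmap(NULL, (size_t)page,
                                         PROT_READ | PROT_WRITE,
                                         MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            p[0] = 0xaa;    /* first write allocates the page */

            if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                          0, page) == 0) {
                int zero_after_punch = (p[0] == 0);  /* read the hole */

                p[0] = 0x55;    /* write re-allocates the page */
                result = zero_after_punch && p[0] == 0x55;
            }
            munmap((void *)p, (size_t)page);
        }
    }

    close(fd);
    return result;
}
```

The point of the check is that every access lands inside the fd's own
address space: the hole never redirects to some other object.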
But now, with this patchset, the fd is more of an overlay and you get
*something else* if you try to access through the hole.
This results in operations on the fd bubbling up to the KVM mapping in
what is, IMO, a strange way. If the user punches a hole, KVM has to
modify its mappings such that the GPA goes to whatever VMA may be there.
(And update the RMP, the hypervisor's tables, or whatever else might
actually control privateness.) Conversely, if the user does fallocate
to fill a hole, the guest mapping *to an unrelated page* has to be
zapped so that the fd's page shows up. And the RMP needs updating, etc.
I am lukewarm on this for a few reasons.
1. This is weird. AFAIK nothing else works like this. Obviously this
is subjective, but "weird" and "layering violation" sometimes translate
to "problematic locking".
2. fd-backed private memory can't have normal holes. If I make a memfd,
punch a hole in it, and mmap(MAP_SHARED) it, I end up with a page that
reads as zero. If I write to it, the page gets allocated. But with
this new mechanism, if I punch a hole and put it in a memslot, reads and
writes go somewhere else. So what if I actually wanted lazily allocated
private zeros?
2b. For a hypothetical future extension in which an fd can also have
shared pages (for conversion, for example, or simply because the fd
backing might actually be more efficient than indirecting through VMAs
and therefore get used for shared memory or entirely-non-confidential
VMs), lazy fd-backed zeros sound genuinely useful.
3. TDX hardware capability is not fully exposed. TDX can have a private
page and a shared page at GPAs that differ only by the private bit.
Sure, no one plans to use this today, but baking this into the user ABI
throws away half the potential address space.
3b. Any software solution that works like TDX (which IMO seems like an
eminently reasonable design) has the same issue.
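
Point 3 is mostly address arithmetic, so a small sketch may help. It
assumes a single GPA bit distinguishes the private and shared views; the
bit position is a parameter chosen by the caller, and the toy_* helpers are
hypothetical names for this example, not KVM or TDX interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Return the other view of a GPA: same address, private/shared bit flipped. */
uint64_t toy_gpa_alias(uint64_t gpa, unsigned int bit)
{
    assert(bit < 64);
    return gpa ^ (UINT64_C(1) << bit);
}

/* Report whether the private/shared bit is set in a GPA. */
bool toy_gpa_bit_set(uint64_t gpa, unsigned int bit)
{
    assert(bit < 64);
    return (gpa >> bit) & 1;
}

/*
 * Key seen by a lookup that ignores the bit. Both aliases collapse to the
 * same key, so such a lookup can describe at most one backing per pair.
 */
uint64_t toy_slot_key(uint64_t gpa, unsigned int bit)
{
    assert(bit < 64);
    return gpa & ~(UINT64_C(1) << bit);
}
```

With an example bit position of 51, 0x1000 and toy_gpa_alias(0x1000, 51)
are distinct addresses that yield the same toy_slot_key(), which is the
"half the potential address space" being given up.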
The alternative would be to have some kind of separate table or bitmap
(part of the memslot?) that tells KVM whether a GPA should map to the fd.
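
As a rough illustration of that alternative, the sketch below keeps one
bit per page next to the memslot and consults only that bit when choosing
a backing. The layout and all toy_* names are assumptions made up for this
example; nothing here is proposed ABI.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-memslot attribute table, for illustration only. */
struct toy_slot_attrs {
    uint64_t base_gfn;      /* first guest frame covered by the slot */
    uint64_t npages;        /* slot length in pages */
    unsigned long *use_fd;  /* one bit per page: 1 = map the GPA to the fd */
};

enum toy_target { TOY_TARGET_HVA, TOY_TARGET_FD };

/* Userspace (or KVM) flips the bit explicitly; the fd is not consulted. */
void toy_set_use_fd(struct toy_slot_attrs *a, uint64_t gfn, bool on)
{
    uint64_t idx = gfn - a->base_gfn;

    assert(gfn >= a->base_gfn && idx < a->npages);
    if (on)
        a->use_fd[idx / TOY_BITS_PER_LONG] |= 1UL << (idx % TOY_BITS_PER_LONG);
    else
        a->use_fd[idx / TOY_BITS_PER_LONG] &= ~(1UL << (idx % TOY_BITS_PER_LONG));
}

/* The backing depends only on the table, never on fd population. */
enum toy_target toy_lookup(const struct toy_slot_attrs *a, uint64_t gfn)
{
    uint64_t idx = gfn - a->base_gfn;

    assert(gfn >= a->base_gfn && idx < a->npages);
    if (a->use_fd[idx / TOY_BITS_PER_LONG] & (1UL << (idx % TOY_BITS_PER_LONG)))
        return TOY_TARGET_FD;
    return TOY_TARGET_HVA;
}
```

Under this model a hole in the fd at a page whose bit is set stays an
ordinary hole, so lazily allocated zeros remain possible, and punching or
filling holes does not by itself change which object backs the GPA.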
What do you all think?